Google Embedding 2 refers to Gemini Embedding 2, a significant advancement in Google's embedding models. Released in public preview around March 2026, it is Google's first natively multimodal embedding model: it maps text, images, videos, audio, and documents into a single, unified embedding space. This unified space captures relationships across different media directly, unlike earlier models that specialized in text or required a separate processing pipeline for each modality. Gemini Embedding 2 is built on the Gemini architecture and inherits its multimodal understanding.
A key feature of Gemini Embedding 2 is its multimodal input support:

- Text: up to 8,192 input tokens
- Images: up to six per request (PNG and JPEG formats)
- Video: up to 120 seconds (MP4 and MOV formats)
- Audio: processed natively, with no transcription step required
- Documents: PDF files up to six pages

Crucially, it supports interleaved multimodal inputs, meaning developers can combine different media types, such as an image with its text caption, in a single request. This allows the model to capture intricate relationships between modalities within the combined input, generating a single embedding that represents the joint meaning. The model also incorporates Matryoshka Representation Learning (MRL), which allows embedding vectors to be truncated to smaller dimensions (e.g., from a default of 3,072 down to 1,536 or 768) while retaining most of their usefulness, letting developers trade retrieval quality against storage and compute costs.
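The practical effect of MRL is that you can slice off the leading dimensions of a full embedding and re-normalize, and nearby items remain nearby. The sketch below illustrates this mechanic with synthetic NumPy vectors standing in for real 3,072-dimensional model output (the vectors, the `truncate_embedding` helper, and the noise level are all illustrative assumptions, not part of the Gemini API):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` dimensions of an MRL-style embedding and
    re-normalize to unit length so cosine similarity stays meaningful."""
    v = np.asarray(vec, dtype=np.float64)[:dim]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for two 3,072-dim embeddings of near-duplicate inputs.
rng = np.random.default_rng(0)
full_a = rng.normal(size=3072)
full_b = full_a + 0.1 * rng.normal(size=3072)  # small perturbation

# Truncate both from 3,072 to 768 dimensions.
short_a = truncate_embedding(full_a, 768)
short_b = truncate_embedding(full_b, 768)

# The pair stays highly similar at the reduced dimensionality,
# while storage per vector drops by 4x.
print(f"full:      {cos(full_a, full_b):.3f}")
print(f"truncated: {cos(short_a, short_b):.3f}")
```

With real MRL-trained embeddings the quality loss from truncation is small by design, because training concentrates the most informative components in the leading dimensions.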
Gemini Embedding 2 is designed to enhance a wide range of AI tasks, including Retrieval-Augmented Generation (RAG), semantic search, sentiment analysis, and data clustering. Because one model produces unified embeddings across modalities, it simplifies complex AI pipelines and improves the accuracy and contextual richness of results. Developers can access Gemini Embedding 2 through the Gemini API and Vertex AI and integrate it with various frameworks and vector database tools. When paired with a vector database such as Milvus, these multimodal embeddings can power similarity search and retrieval across diverse content types, enabling low-latency retrieval even as datasets grow.
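The retrieval pattern described above reduces to storing unit-normalized embeddings and ranking by cosine similarity. As a minimal sketch, the tiny in-memory index below stands in for a real vector database like Milvus (the `VectorIndex` class, the 4-dimensional toy vectors, and the document IDs are illustrative assumptions; a production system would use a dedicated engine with approximate-nearest-neighbor indexing):

```python
import numpy as np

class VectorIndex:
    """Toy in-memory vector store: unit-normalized embeddings,
    ranked by cosine similarity (a dot product after normalization)."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, doc_id, embedding):
        v = np.asarray(embedding, dtype=np.float64)
        self.ids.append(doc_id)
        self.vectors.append(v / np.linalg.norm(v))

    def search(self, query, top_k=3):
        q = np.asarray(query, dtype=np.float64)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q          # cosine scores
        order = np.argsort(scores)[::-1][:top_k]     # best first
        return [(self.ids[i], float(scores[i])) for i in order]

# Synthetic 4-dim "embeddings" stand in for real multimodal model output;
# in practice each entry could come from an image, a video, or text.
index = VectorIndex()
index.add("cat_photo",   [1.0, 0.1, 0.0, 0.0])
index.add("dog_video",   [0.0, 1.0, 0.2, 0.0])
index.add("cat_caption", [0.9, 0.2, 0.0, 0.1])

# A text query embedded into the same space retrieves the closest
# items regardless of their original modality.
for doc_id, score in index.search([1.0, 0.0, 0.0, 0.0], top_k=2):
    print(f"{doc_id}: {score:.3f}")
```

Because all modalities share one embedding space, a single query vector can retrieve the nearest neighbors across images, video, audio, and text without per-modality indexes.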