Google’s “Embedding 2” primarily refers to Gemini Embedding 2, a significant advancement in its embedding technology. It is Google’s first natively multimodal embedding model, designed to transform diverse data types—including text, images, videos, audio, and PDF documents—into a single, unified vector space. This allows direct comparison of semantic relationships across different forms of media, moving beyond traditional text-only embedding models. The core function of Gemini Embedding 2 is to represent the meaning of various inputs as high-dimensional numerical vectors, where the proximity of these vectors in the embedding space indicates semantic similarity, regardless of the original data type.
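The idea that vector proximity encodes similarity can be made concrete with cosine similarity. The sketch below uses tiny hand-made 4-dimensional vectors as stand-ins for real embeddings (which have thousands of dimensions); the vector values are illustrative, not model outputs.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means near-identical direction,
    close to 0.0 means semantically unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"; real models emit thousands of dimensions.
photo_of_dog = np.array([0.9, 0.1, 0.0, 0.2])   # hypothetical image embedding
caption_dog  = np.array([0.8, 0.2, 0.1, 0.3])   # hypothetical text embedding
caption_tax  = np.array([0.0, 0.9, 0.8, 0.1])   # unrelated text embedding

# In a shared multimodal space, the dog photo sits closer to the dog
# caption than to unrelated text, even across modalities.
assert cosine_similarity(photo_of_dog, caption_dog) > \
       cosine_similarity(photo_of_dog, caption_tax)
```

Because both image and text inputs land in the same space, the same distance function works for cross-modal comparisons.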
Technically, Gemini Embedding 2 is built on the foundational Gemini architecture, leveraging its multimodal understanding capabilities. A key technical feature is its ability to process multiple modalities in a single request, producing a unified embedding that captures meaning across all input types. For instance, it can process text up to 8,192 tokens, up to six images (PNG/JPEG) per request, videos up to 120 seconds, and PDFs up to six pages. Notably, it handles audio natively without requiring a transcription step, which often causes information loss in other models. The model also employs Matryoshka Representation Learning (MRL), a technique that allows developers to scale down the output dimensions from a default of 3,072 to smaller sizes such as 1,536 or 768. This flexibility enables a trade-off between embedding quality and computational and storage costs, making it adaptable to a range of use cases. The resulting dense vector representations capture nuanced semantic and contextual information, which is crucial for modern AI applications.
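Mechanically, using an MRL embedding at a smaller size amounts to keeping a prefix of the vector and re-normalizing it so cosine similarity remains meaningful. A minimal sketch, assuming the model was trained so that prefixes of the full vector are themselves usable embeddings (the random vector here is only a stand-in for a real model output):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of an MRL-trained embedding and
    re-normalize, preserving cosine-similarity semantics."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)
full = rng.normal(size=3072)           # stand-in for a 3,072-dim embedding
full /= np.linalg.norm(full)

small = truncate_embedding(full, 768)  # one quarter of the storage cost
assert small.shape == (768,)
assert abs(np.linalg.norm(small) - 1.0) < 1e-6
```

The practical payoff is that one model serves both high-accuracy retrieval (full 3,072 dimensions) and storage-constrained deployments (768), without re-embedding the corpus at multiple sizes.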
These multimodal embeddings from Gemini Embedding 2 are vital for developing sophisticated AI applications. They enable functionalities such as multimodal semantic search, where a text query can retrieve relevant images or videos, or vice versa. This also extends to Retrieval-Augmented Generation (RAG) workflows, clustering of diverse content, and classification across different media types, enhancing the accuracy and richness of AI systems. For developers working with large datasets of multimodal information, these embeddings can be stored and indexed efficiently in vector databases, such as Milvus. Using a vector database allows for rapid similarity searches, enabling applications to quickly find semantically related content across text, images, audio, and video, thereby powering more intelligent and context-aware user experiences.
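The retrieval pattern a vector database provides can be sketched with a brute-force in-memory index. This is a conceptual stand-in, not the Milvus API: production systems replace the exhaustive scan with approximate nearest-neighbor indexes, but the insert-then-search contract is the same, and the document names below are purely hypothetical.

```python
import numpy as np

class BruteForceIndex:
    """Minimal in-memory stand-in for a vector database such as Milvus:
    stores unit-normalized vectors and returns the top-k matches by
    cosine similarity (inner product on normalized vectors)."""

    def __init__(self, dim: int):
        self.dim = dim
        self.ids: list[str] = []
        self.vectors: list[np.ndarray] = []

    def insert(self, doc_id: str, vec: np.ndarray) -> None:
        assert vec.shape == (self.dim,)
        self.ids.append(doc_id)
        self.vectors.append(vec / np.linalg.norm(vec))

    def search(self, query: np.ndarray, k: int = 3) -> list[tuple[str, float]]:
        matrix = np.stack(self.vectors)            # (n, dim)
        scores = matrix @ (query / np.linalg.norm(query))
        order = np.argsort(-scores)[:k]            # highest similarity first
        return [(self.ids[i], float(scores[i])) for i in order]

rng = np.random.default_rng(1)
index = BruteForceIndex(dim=8)
# Hypothetical mixed-media documents embedded into one shared space.
for name in ["clip.mp4", "report.pdf", "photo.png"]:
    index.insert(name, rng.normal(size=8))

hits = index.search(rng.normal(size=8), k=2)
assert len(hits) == 2
assert hits[0][1] >= hits[1][1]   # results sorted by similarity
```

Because every modality shares one embedding space, a single index like this can answer a text query with a video or PDF hit; the database's job is to make that scan fast at scale.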