Gemini Embedding 2 (Google's second-generation embedding model) outputs dense numerical vectors, known as embeddings, that represent the semantic meaning of various input modalities. Unlike its predecessors, which were often text-only, Gemini Embedding 2 is a multimodal model capable of processing and unifying text, images, videos, audio, and documents into a shared embedding space. Regardless of the input type — a paragraph of text, a picture, a short video clip, an audio recording, or a PDF document of up to six pages — the model converts it into a high-dimensional vector. These vectors capture the underlying meaning and relationships of the input data, allowing comparisons and computations that go beyond simple keyword matching. The default output dimension for these vectors is 3,072, providing a rich representation of the input's content.
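In practice, "comparing meaning" with these vectors usually means cosine similarity. A minimal sketch, using short mock vectors in place of the model's 3,072-dimensional output (in a real pipeline the vectors would come from the embedding API):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense embedding vectors:
    the dot product divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Mock 4-dimensional vectors stand in for real 3,072-dim embeddings.
query_vec = [0.1, 0.9, 0.2, 0.4]
doc_vec = [0.2, 0.8, 0.1, 0.5]
print(round(cosine_similarity(query_vec, doc_vec), 3))
```

A score near 1.0 indicates semantically similar inputs; unrelated inputs score much lower, which is what enables retrieval beyond keyword overlap.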
A key output feature of Gemini Embedding 2 is its support for Matryoshka Representation Learning (MRL), which lets developers adjust the dimensionality of the output embeddings. While the default is 3,072 dimensions, the model can generate smaller vectors, such as 1,536 or 768 dimensions, without significant loss of quality. This flexibility matters for managing storage and computational costs in large-scale applications: reducing the embedding size yields smaller vector database indexes and faster similarity searches, which is especially valuable when working with platforms like Milvus for efficient vector retrieval and comparison. The output dimension can thus be tuned to the use case and its resource constraints.
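The point of MRL is that the leading components of the vector already carry most of the signal, so a smaller embedding can be obtained by keeping a prefix of the full vector and re-normalizing it to unit length. A minimal sketch of that truncation step, with a synthetic stand-in for a real 3,072-dimensional embedding:

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the leading `dim` components of an MRL-style embedding and
    re-normalize to unit length so cosine similarity stays meaningful."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# A synthetic full-size vector stands in for a real 3,072-dim embedding.
full = [math.sin(i) for i in range(3072)]
small = truncate_embedding(full, 768)
print(len(small))  # 768
```

Storing the 768-dimensional version cuts index size to a quarter of the full vector's footprint, at the cost of some retrieval quality; the right trade-off depends on the application.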
These embeddings serve as the foundation for a wide range of AI tasks, including Retrieval-Augmented Generation (RAG), semantic search, sentiment analysis, and data clustering. By mapping different data types into a unified semantic space, Gemini Embedding 2 facilitates cross-modal understanding and retrieval: a query in one modality (e.g., text) can retrieve relevant information from another (e.g., an image or video), because all data is represented by comparable numerical vectors. The model also supports interleaved multimodal inputs, meaning it can process combinations such as an image and associated text in a single request, producing a single embedding that encapsulates the context across these media types. This capability simplifies complex data processing pipelines and improves the accuracy of multimodal applications.
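Cross-modal retrieval then reduces to nearest-neighbor search in the shared space. A minimal brute-force sketch with mock vectors — in a real system each vector would come from embedding a text, image, or video with the model, and the search would typically run inside a vector database rather than in memory:

```python
import math

def nearest(query: list[float], index: dict[str, list[float]]) -> str:
    """Return the key whose vector is most cosine-similar to the query."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    return max(index, key=lambda k: cos(query, index[k]))

# Mock embeddings for items of different modalities, all in one space.
index = {
    "photo_of_cat.jpg": [0.9, 0.1, 0.0],
    "clip_of_dog.mp4": [0.1, 0.9, 0.1],
    "quarterly_report.pdf": [0.0, 0.1, 0.9],
}
text_query = [0.8, 0.2, 0.1]  # mock embedding of a text query about cats
print(nearest(text_query, index))  # photo_of_cat.jpg
```

Because every item, whatever its modality, lives in the same space, the same similarity computation works for text-to-image, text-to-video, or any other cross-modal pairing.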