
Does Google embedding 2 support multimodal data?

Yes, Google’s Gemini Embedding 2 model explicitly supports multimodal data, marking a significant advance over earlier text-only embedding systems. The model is engineered to map diverse data types (text, images, videos, audio, and PDF documents) into a single, unified embedding space, which allows semantic intent to be understood and compared across modalities; it also supports over 100 languages. By consolidating different data types into one embedding space, Gemini Embedding 2 simplifies complex AI pipelines and improves multimodal downstream tasks such as Retrieval-Augmented Generation (RAG), semantic search, sentiment analysis, and data clustering.
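The practical consequence of a unified embedding space is that vectors from different modalities can be compared directly with an ordinary similarity measure. The following sketch uses tiny, hypothetical hand-picked vectors (stand-ins for real model output) to show the idea with cosine similarity; the values and the three-dimensional size are illustrative only.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical, pre-computed embeddings: in a unified space, a photo of a dog
# and the caption "a dog playing fetch" land near each other even though they
# come from different modalities.
image_embedding = [0.8, 0.1, 0.05]    # stand-in for a real high-dimensional vector
caption_embedding = [0.75, 0.15, 0.1]  # matching text caption
audio_embedding = [0.1, 0.9, 0.2]      # unrelated clip, e.g. traffic noise

print(cosine_similarity(image_embedding, caption_embedding))  # high: same concept
print(cosine_similarity(image_embedding, audio_embedding))    # lower: different concept
```

Because every modality lands in the same space, the same comparison works for text-to-image, audio-to-text, or any other cross-modal pair without modality-specific alignment code.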

Technically, Gemini Embedding 2 leverages the multimodal architecture of the Gemini foundation model, inheriting its ability to natively understand and process information from multiple sources. Unlike traditional approaches that process each modality separately and then attempt to align the results, Gemini Embedding 2 generates deeply integrated cross-modal connections from the outset. This allows for interleaved multimodal inputs: developers can combine data types, such as an image with its text caption or a video clip with its spoken narration, in a single request. The model has specific input limits: up to 8,192 input tokens of text, up to six images (PNG or JPEG) per request, videos up to 120 seconds (MP4 or MOV), audio ingested directly without transcription, and PDFs up to six pages long.
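The input limits above are easy to enforce client-side before sending a request. This is a hypothetical pre-flight check, not part of any Google SDK: the limit values come from the text, while the function name and request shape are invented for illustration.

```python
# Limits as described for Gemini Embedding 2 (illustrative constants).
LIMITS = {
    "max_text_tokens": 8192,      # text input
    "max_images": 6,              # PNG / JPEG per request
    "max_video_seconds": 120,     # MP4 / MOV
    "max_pdf_pages": 6,           # PDF documents
}

def validate_request(text_tokens=0, num_images=0, video_seconds=0, pdf_pages=0):
    """Return a list of limit violations for an interleaved multimodal request.

    Hypothetical helper: the real API performs its own validation server-side.
    """
    errors = []
    if text_tokens > LIMITS["max_text_tokens"]:
        errors.append("text exceeds 8,192 tokens")
    if num_images > LIMITS["max_images"]:
        errors.append("more than six images")
    if video_seconds > LIMITS["max_video_seconds"]:
        errors.append("video longer than 120 seconds")
    if pdf_pages > LIMITS["max_pdf_pages"]:
        errors.append("PDF longer than six pages")
    return errors

# An image plus a caption plus a short narration clip fits comfortably:
print(validate_request(text_tokens=200, num_images=1, video_seconds=30))  # []
```

Checking limits locally avoids a round trip for requests that would be rejected anyway, which matters in batch embedding pipelines.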

The unified embedding space generated by Gemini Embedding 2 allows different media types to be compared and analyzed directly, which is crucial for applications requiring a holistic understanding of data. For instance, an image and its textual description, or a video and its audio track, can be represented by a single vector, enabling more nuanced and context-aware retrieval. The generated embeddings are 3072-dimensional by default, though adjustable dimensions are available for managing storage and performance. Such multimodal embeddings can be stored and indexed efficiently in a vector database such as Milvus, which is optimized for similarity search over high-dimensional vectors, enabling rapid retrieval of relevant multimodal content by semantic similarity.
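Retrieval over such a store reduces to finding the nearest vectors to a query embedding. The brute-force sketch below illustrates the principle with tiny hypothetical vectors; a vector database like Milvus performs the same kind of search at scale using approximate nearest-neighbor indexes rather than an exhaustive scan, and the real embeddings would be 3072-dimensional.

```python
import math

def top_k(query, stored, k=2):
    """Brute-force nearest-neighbor search by cosine similarity.

    Illustrative only: a vector database replaces this linear scan with
    ANN indexes so that search stays fast over millions of vectors.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    ranked = sorted(stored.items(), key=lambda kv: cos(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical multimodal store: items from different modalities share one space.
store = {
    "dog_photo.jpg":   [0.9, 0.1, 0.0],
    "dog_caption.txt": [0.85, 0.2, 0.05],
    "city_audio.mp3":  [0.05, 0.1, 0.95],
}

# A dog-related query embedding ranks both dog items above the unrelated audio.
print(top_k([0.88, 0.15, 0.02], store, k=2))
```

In production the query vector would come from embedding the user's text, image, or audio input with the same model, so a text query can retrieve images and audio clips directly.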
