What are the basic inputs for Google embedding 2?

Google Embedding 2, officially known as Gemini Embedding 2, is a natively multimodal embedding model designed to process and represent diverse types of data in a unified vector space. The basic inputs it accepts include text, images, video, audio, and documents. This allows for a broad range of applications, from semantic search to retrieval-augmented generation (RAG) systems. Unlike previous models that often required separate processing pipelines for different data types, Gemini Embedding 2 simplifies the workflow by handling multiple modalities within a single framework.

Specifically, the model accommodates text inputs with a context window of up to 8,192 tokens per request, significantly expanding the capacity compared to earlier text-only embedding models. For visual content, it can process up to six images per request, supporting common formats like PNG and JPEG. Video inputs are accepted up to 120 seconds in length, with support for MP4 and MOV formats, and audio can be ingested natively for up to 80 seconds (MP3, WAV, etc.) without requiring a separate transcription step. Furthermore, Gemini Embedding 2 can directly embed PDF documents, handling up to six pages per file.
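The per-request limits above can be checked before sending a request. The sketch below is purely illustrative: the limit values come from this article, but the `EmbeddingRequest` class and `validate` function are hypothetical helpers, not part of any official SDK.

```python
from dataclasses import dataclass

# Per-request input limits as described above (illustrative constants,
# not an official API surface).
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_AUDIO_SECONDS = 80
MAX_PDF_PAGES = 6

@dataclass
class EmbeddingRequest:
    text_tokens: int = 0
    images: int = 0
    video_seconds: int = 0
    audio_seconds: int = 0
    pdf_pages: int = 0

def validate(req: EmbeddingRequest) -> list[str]:
    """Return a list of limit violations (empty list means the request fits)."""
    errors = []
    if req.text_tokens > MAX_TEXT_TOKENS:
        errors.append(f"text exceeds {MAX_TEXT_TOKENS} tokens")
    if req.images > MAX_IMAGES:
        errors.append(f"more than {MAX_IMAGES} images")
    if req.video_seconds > MAX_VIDEO_SECONDS:
        errors.append(f"video longer than {MAX_VIDEO_SECONDS} s")
    if req.audio_seconds > MAX_AUDIO_SECONDS:
        errors.append(f"audio longer than {MAX_AUDIO_SECONDS} s")
    if req.pdf_pages > MAX_PDF_PAGES:
        errors.append(f"PDF longer than {MAX_PDF_PAGES} pages")
    return errors

# A mixed text-plus-image request within the limits passes:
print(validate(EmbeddingRequest(text_tokens=4000, images=3)))   # → []

# An oversized video is flagged:
print(validate(EmbeddingRequest(video_seconds=300)))
# → ['video longer than 120 s']
```

Checking limits client-side like this avoids a round trip for requests the service would reject anyway.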

A key capability of Gemini Embedding 2 is its support for interleaved inputs: developers can combine different modalities in a single request, such as an image alongside a text description. This allows the model to capture semantic relationships across media types, producing a single, cohesive embedding that represents the combined input. The resulting high-dimensional vectors can then be stored and queried in a vector database such as Milvus to enable efficient similarity search, clustering, and classification across multimodal datasets.
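Similarity search over such embeddings usually ranks stored vectors by cosine similarity to a query vector. As a minimal, self-contained illustration of what a vector database like Milvus does at query time (in production Milvus uses approximate indexes to scale this to millions of vectors), here is a brute-force nearest-neighbor sketch; the item names and four-dimensional vectors are toy stand-ins for real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(index, query, k=2):
    """Return the ids of the k stored items most similar to the query."""
    ranked = sorted(index,
                    key=lambda item: cosine_similarity(item["vector"], query),
                    reverse=True)
    return [item["id"] for item in ranked[:k]]

# Toy "multimodal" embeddings: each entry could have come from an image,
# a video, or an interleaved image-plus-text request.
index = [
    {"id": "cat photo",  "vector": [0.9, 0.1, 0.0, 0.1]},
    {"id": "dog video",  "vector": [0.1, 0.9, 0.1, 0.0]},
    {"id": "cat + text", "vector": [0.8, 0.2, 0.1, 0.1]},
]

# A query embedding close to the cat-related items retrieves them first:
print(search(index, [0.85, 0.15, 0.05, 0.1], k=2))
# → ['cat photo', 'cat + text']
```

Because interleaved inputs yield one vector per request, a single index like this can serve searches across all modalities at once.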