
What are the latency characteristics of Google embedding 2?

Gemini Embedding 2 demonstrates significantly improved latency primarily because of its natively multimodal architecture. Traditional pipelines require separate models and intermediate steps for each data type (for example, transcribing audio to text before embedding it); Gemini Embedding 2 instead processes text, images, video, audio, and PDFs directly within a single unified framework. This eliminates the “loss in translation,” added latency, and higher operational costs associated with fragmented AI pipelines. Early-access partners such as Sparkonomy reported a 70% reduction in latency after removing the intermediate Large Language Model (LLM) inference steps previously needed to handle multimodal data. Because the model captures complex, nuanced relationships between different media types in a single API call, embedding generation is more streamlined, and semantic understanding and retrieval in downstream applications become faster.

A key feature influencing performance and indirectly affecting latency in downstream applications is Matryoshka Representation Learning (MRL). Gemini Embedding 2 generates high-dimensional embeddings (defaulting to 3072 dimensions) but also allows developers to dynamically scale down these output dimensions to 1536 or 768 without retraining the model. While the embedding generation latency itself is not explicitly detailed with specific numbers, smaller embedding dimensions generally translate to lower storage requirements and faster retrieval times in vector databases like Milvus. This flexibility enables developers to balance embedding quality with infrastructure costs and retrieval latency, tailoring the solution to their specific performance needs. For example, smaller vectors can be used for a fast initial pass in a two-stage retrieval pattern, followed by re-ranking with full-dimension vectors for higher accuracy.
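The MRL truncation and two-stage pattern described above can be sketched in a few lines. The snippet below is a minimal, self-contained illustration: the tiny 8-dimensional vectors stand in for real 3072-dimensional embeddings, and the truncation-then-renormalize step mirrors how MRL lets you shrink a vector without retraining. All names and values here are illustrative assumptions, not a real SDK.

```python
import math

def truncate(vec, dims):
    """Keep the first `dims` components of an MRL-style embedding and
    L2-renormalize so cosine similarity stays well defined."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(a, b):
    # Both inputs are unit vectors, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dim "full" embeddings standing in for 3072-dim vectors.
query = truncate([0.9, 0.1, 0.3, 0.2, 0.05, 0.0, 0.1, 0.0], 8)
docs = {
    "doc_a": truncate([0.88, 0.12, 0.28, 0.2, 0.06, 0.01, 0.1, 0.0], 8),
    "doc_b": truncate([0.1, 0.9, 0.2, 0.1, 0.3, 0.2, 0.0, 0.1], 8),
}

# Stage 1: cheap first pass over truncated (here 4-dim) vectors.
q_small = truncate(query, 4)
coarse = sorted(docs, key=lambda d: cosine(q_small, truncate(docs[d], 4)),
                reverse=True)

# Stage 2: re-rank the shortlist with full-dimension vectors.
best = max(coarse[:2], key=lambda d: cosine(query, docs[d]))
print(best)
```

In a real deployment, the stage-1 vectors would live in a vector database such as Milvus at the reduced dimensionality, and only the shortlisted candidates would be re-scored with full-dimension embeddings.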

For workloads that prioritize throughput and are more tolerant of latency, Google supports Gemini Embedding models through its Batch API. This option processes embeddings at a reduced cost, making it suitable where real-time inference is not the primary concern or where aggregating requests yields overall system efficiency gains. In short, the latency improvements stem largely from the model’s unified approach to multimodal data, which removes the complex preprocessing and multi-stage inference that previously introduced bottlenecks in multimodal AI systems.
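Aggregating requests for batch processing typically amounts to chunking a document set into fixed-size groups before submission. The sketch below shows that chunking step only; the batch size of 100 and the commented-out `submit_batch()` call are illustrative assumptions, not the actual Batch API surface.

```python
def make_batches(items, batch_size=100):
    """Split `items` into consecutive chunks of at most `batch_size`."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

documents = [f"doc {i}" for i in range(250)]
batches = make_batches(documents)

print(len(batches))       # 250 documents -> batches of 100, 100, and 50
print(len(batches[-1]))

# for batch in batches:
#     submit_batch(batch)  # hypothetical call to a batch embedding endpoint
```

Grouping requests this way trades per-request latency for throughput and cost: each document waits for its batch, but the service amortizes overhead across the whole group.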
