Google's Gemini Embedding 2 (sometimes referred to simply as Google Embedding 2) is built directly on the Gemini foundation model. Its training data is therefore derived from, and reflective of, the extensive and diverse datasets used to train the larger multimodal Gemini model. As a result, Gemini Embedding 2 inherits robust multimodal understanding from the ground up and can process a wide array of data types within a single unified framework.
The training data for Gemini Embedding 2 spans a broad spectrum of modalities, allowing the model to capture semantic intent across different forms of information. It covers text inputs of up to 8,192 tokens in more than 100 languages; images, up to six per request in PNG or JPEG format; video clips of up to 120 seconds in MP4 or MOV format; audio, which is handled natively without intermediate transcription; and documents, specifically PDFs of up to six pages, with OCR to extract their text. Because the model is trained multimodally from the start, it can form deep cross-modal connections, rather than aligning separately trained per-modality encoders after the fact.
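The documented input limits above lend themselves to simple client-side checks before a request is sent. The sketch below is purely illustrative: the dataclass, function names, and error messages are assumptions, not part of any official Google SDK; only the numeric limits come from the text.

```python
# Illustrative client-side validation of Gemini Embedding 2's documented
# input limits: 8,192 text tokens, up to six PNG/JPEG images per request,
# video up to 120 seconds, PDFs up to six pages.
# All names here are hypothetical, not an official API.
from dataclasses import dataclass, field

MAX_TEXT_TOKENS = 8_192
MAX_IMAGES_PER_REQUEST = 6
MAX_VIDEO_SECONDS = 120
MAX_PDF_PAGES = 6

@dataclass
class EmbeddingRequest:
    text_tokens: int = 0
    image_formats: list = field(default_factory=list)  # e.g. ["png", "jpeg"]
    video_seconds: float = 0.0
    pdf_pages: int = 0

def validate(req: EmbeddingRequest) -> list:
    """Return a list of human-readable constraint violations (empty if valid)."""
    errors = []
    if req.text_tokens > MAX_TEXT_TOKENS:
        errors.append(f"text exceeds {MAX_TEXT_TOKENS} tokens")
    if len(req.image_formats) > MAX_IMAGES_PER_REQUEST:
        errors.append(f"more than {MAX_IMAGES_PER_REQUEST} images per request")
    if any(fmt not in ("png", "jpeg") for fmt in req.image_formats):
        errors.append("images must be PNG or JPEG")
    if req.video_seconds > MAX_VIDEO_SECONDS:
        errors.append(f"video longer than {MAX_VIDEO_SECONDS} s")
    if req.pdf_pages > MAX_PDF_PAGES:
        errors.append(f"PDF longer than {MAX_PDF_PAGES} pages")
    return errors
```

In practice such checks would sit in front of whatever SDK call actually produces the embeddings, failing fast rather than waiting for a server-side rejection.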
While the specific proprietary datasets Google used for the Gemini foundation model are not publicly detailed, the principles and types of data involved are consistent with other high-performance multimodal models. For instance, EmbeddingGemma, an open embedding model from Google that shares research and technology with the Gemini models, explicitly lists “Web Documents,” “Code and Technical Documents,” and “Synthetic and Task-Specific Data” across more than 100 languages in its training data. This points to a comprehensive collection strategy that exposes the models to a vast range of linguistic styles, topics, programming structures, and specialized content. The unified embedding space produced by Gemini Embedding 2 lets different data types be compared directly, streamlining applications such as semantic search, Retrieval-Augmented Generation (RAG), and data clustering. For efficient retrieval at scale, these embeddings can be stored in a vector database such as Milvus.
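The practical payoff of a unified embedding space is that a single similarity function works across modalities: a text query vector can be ranked directly against vectors derived from images or audio. A minimal sketch, using toy 4-dimensional vectors as stand-ins for real model output (production embeddings have hundreds of dimensions):

```python
# Cross-modal comparison in a unified embedding space: text, image, and
# audio inputs all map to vectors of the same dimensionality, so one
# cosine-similarity function serves every pairing. Vectors are toy values.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for a query and two candidates of different modalities.
text_vec  = np.array([0.9, 0.1, 0.0, 0.2])  # e.g. the query "a photo of a cat"
image_vec = np.array([0.8, 0.2, 0.1, 0.3])  # e.g. an actual cat photo
audio_vec = np.array([0.0, 0.9, 0.8, 0.1])  # e.g. a traffic recording

# Rank candidates against the text query, regardless of their modality.
candidates = {"image": image_vec, "audio": audio_vec}
ranked = sorted(candidates,
                key=lambda k: cosine_similarity(text_vec, candidates[k]),
                reverse=True)
```

This is exactly the operation a vector database performs at scale: Milvus, for example, indexes stored vectors and answers nearest-neighbor queries under a chosen metric such as cosine or inner product, so the ranking loop above is replaced by an indexed search.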