Will Gemini Embedding 2 Kill Multi-Vector Search in Vector Databases?
Google released Gemini Embedding 2 â the first multimodal embedding model that maps text, images, video, audio, and documents into a single vector space.
You embed a video clip, a product photo, and a paragraph of text with one API call, and they all land in the same semantic neighborhood.
Before models like this, you had no choice but to run each modality through its own specialist model and store each output in a separate vector column. Multi-vector columns in vector databases like Milvus were built precisely for that world.
So the question now is: how much of that complexity can Gemini Embedding 2 replace â and where does it fall short? This post walks through where each approach fits and how they work together.
Whatâs Different About Gemini Embedding 2 When Compared to CLIP/CLAP
Embedding models convert unstructured data into dense vectors so that semantically similar items cluster together in vector space. What makes Gemini Embedding 2 different is that it does this natively across modalities, with no separate models and no stitching pipelines.
Until now, multimodal embeddings meant dual-encoder models trained with contrastive learning: CLIP for image-text, CLAP for audio-text, each handling exactly two modalities. If you needed all three, you ran multiple models and coordinated their embedding spaces yourself.
For example, indexing a podcast with cover art meant running CLIP for the image, CLAP for the audio, and a text encoder for the transcript â three models, three vector spaces, and custom fusion logic to make their scores comparable at query time.
In contrast, according to Googleâs official announcement, here is what Gemini Embedding 2 supports:
- Text up to 8,192 tokens per request
- Images up to 6 per request (PNG, JPEG)
- Video up to 120 seconds (MP4, MOV)
- Audio up to 80 seconds, embedded natively without ASR transcription
- Documents PDF input, up to 6 pages
Mixed input image + text together in a single embedding call
Gemini Embedding 2 vs. CLIP/CLAP One Model vs. Many for Multimodal Embeddings
| Dual-encoder (CLIP, CLAP) | Gemini Embedding 2 | |
|---|---|---|
| Modalities per model | 2 (e.g., image + text) | 5 (text, image, video, audio, PDF) |
| Adding a new modality | You bring another model and align spaces manually | Already included â one API call |
| Cross-modal input | Separate encoders, separate calls | Interleaved input (e.g., image + text in one request) |
| Architecture | Separate vision and text encoders aligned via contrastive loss | Single model inheriting multimodal understanding from Gemini |
Gemini Embedding 2âs Advantage: Pipeline Simplification
Take a common scenario: building a semantic search engine over a short video library. Each clip has visual frames, spoken audio, and subtitle text â all describing the same content.
Before Gemini Embedding 2, youâd need three separate embedding models (image, audio, text), three vector columns, and a retrieval pipeline that does multi-way recall, result fusion, and deduplication. Thatâs a lot of moving parts to build and maintain.
Now, you feed the videoâs frames, audio, and subtitles into a single API call and get one unified vector that captures the full semantic picture.
Naturally, itâs tempting to conclude that multi-vector columns are dead. But that conclusion confuses âmultimodal unified representationâ with âmulti-dimensional vector retrieval.â They solve different problems, and understanding the difference matters for picking the right approach.
What is Multi-Vector Search in Milvus?
In Milvus, multi-vector search means searching the same item through multiple vector fields at once and then combining those results with reranking.
The core idea: a single object often carries more than one kind of meaning. A product has a title and a description. A social media post has a caption and an image. Each angle tells you something different, so each one gets its own vector field.
Milvus searches every vector field independently, then merges the candidate sets using a reranker. In the API, each request maps to a different field and search configuration, and hybrid_search() returns the combined result.
Two common patterns depend on this:
- Sparse+Dense Vector Search. You have a product catalog where users type queries like âred Nike Air Max size 10.â Dense vectors catch the semantic intent (ârunning shoes, red, Nikeâ), but miss the exact size. Sparse vectors via BM25 or models like BGE-M3 nail the keyword match. You need both running in parallel, then reranked â because neither alone returns good results for queries that mix natural language with specific identifiers like SKUs, file names, or error codes.
- Multimodal Vector Search. A user uploads a photo of a dress and types âsomething like this but in blue.â You search the image embedding column for visual similarity and the text embedding column for the color constraint simultaneously. Each column has its own index and model â CLIP for the image, a text encoder for the description â and results get merged.
Milvus runs both patterns as parallel ANN searches with native reranking via RRFRanker. Schema definition, multi-index configuration, built-in BM25 â all handled in one system.
To make this concrete, consider a product catalog where each item includes a text description and an image. You can run three searches against that data in parallel:
- Semantic text search. Query the text description with dense vectors generated by models like BERT, Transformers, or the OpenAI embeddings API.
- Full-text search. Query the text description with sparse vectors using BM25 or sparse embedding models like BGE-M3 or SPLADE.
- Cross-modal image search. Query over product images using a text query, with dense vectors from a model like CLIP.
Now, with Gemini Embedding 2, Does Multi-Vector Search Still Matter?
Gemini Embedding 2 handles more modalities in one call, which simplifies pipelines considerably. But a unified multimodal embedding is not the same thing as multi-vector retrieval. In other words, multi-vector search will not become obsolete.
Gemini Embedding 2 maps text, images, video, audio, and documents into one shared vector space. Google positions it for multimodal semantic search, document retrieval, and recommendation â scenarios where all modalities describe the same content and high cross-modal overlap makes a single vector viable.
Milvus multi-vector search solves a different problem. It is a way to search the same object through multiple vector fieldsâ for instance, a title plus description, or text plus imageâand then combine those signals during retrieval. In other words, it is about preserving and querying multiple semantic views of the same item, not just compressing everything into one representation.
But real-world data rarely fits into a single embedding. Biometric systems, agentic tool retrieval, and mixed-intent e-commerce all depend on vectors that live in completely different semantic spaces. Thatâs exactly where a unified embedding stops working.
Why One Embedding Isnât Enough: Multi-Vector Retrieval in Practice
Gemini Embedding 2 handles the case where all your modalities describe the same thing. Multi-vector search handles everything else â and âeverything elseâ covers most production retrieval systems.
Biometrics. A single user has face, voiceprint, fingerprint, and iris vectors. These describe completely independent biological features with zero semantic overlap. You cannot collapse them into one vector â each needs its own column, index, and similarity metric.
Agentic tools. A coding assistant like OpenClaw stores dense semantic vectors for conversation history (âthat deployment issue from last weekâ) alongside sparse BM25 vectors for exact matching on file names, CLI commands, and config parameters. Different retrieval goals, different vector types, independent search paths, then reranked.
E-commerce with mixed intent. A productâs promo video and detail images work well as a unified Gemini embedding. But when a user wants âdresses that look like thisâ and âsame fabric, size M,â you need a visual similarity column and a structured attribute column with separate indexes and a hybrid retrieval layer.
When to Use Gemini Embedding 2 vs. Multi-vector Columns
| Scenario | What to use | Why |
|---|---|---|
| All modalities describe the same content (video frames + audio + subtitles) | Gemini Embedding 2 unified vector | High semantic overlap means one vector captures the full picture â no fusion needed |
| You need keyword precision alongside semantic recall (BM25 + dense) | Multi-vector columns with hybrid_search() | Sparse and dense vectors serve different retrieval goals that canât collapse into one embedding |
| Cross-modal search is the primary use case (text query â image results) | Gemini Embedding 2 unified vector | Single shared space makes cross-modal similarity native |
| Vectors live in fundamentally different semantic spaces (biometrics, structured attributes) | Multi-vector columns with per-field indexes | Independent similarity metrics and index types per vector field |
| You want pipeline simplicity and fine-grained retrieval | Both â unified Gemini vector + additional sparse or attribute columns in the same collection | Gemini handles the multimodal column; Milvus handles the hybrid retrieval layer around it |
These two approaches are not mutually exclusive. You can use Gemini Embedding 2 for the unified multimodal column and still store additional sparse or attribute-specific vectors in separate columns within the same Milvus collection.
Quick Start: Set Up Gemini Embedding 2 + Milvus
Hereâs a working demo. You need a running Milvus or Zilliz Cloud instance and a GOOGLE_API_KEY.
Setup
pip install google-genai pymilvus
export GOOGLE_API_KEY="your-api-key"
Full Example
"""
Prerequisites:
pip install google-genai pymilvus
Set environment variable:
export GOOGLE_API_KEY="your-api-key"
"""
import os
import struct
import numpy as np
from google import genai
from google.genai import types
from pymilvus import MilvusClient, DataType
# ââ Config âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
COLLECTION_NAME = "gemini_multimodal_demo"
MILVUS_URI = "http://localhost:19530" # Change to your Milvus address
DIM = 3072 # gemini-embedding-2-preview output dimension
GEMINI_MODEL = "gemini-embedding-2-preview"
# ââ Initialize clients ââââââââââââââââââââââââââââââââââââââââââââââââââ
gemini_client = genai.Client() # Uses GOOGLE_API_KEY env var
milvus_client = MilvusClient(MILVUS_URI)
# ââ Helper: generate embedding ââââââââââââââââââââââââââââââââââââââââââ
def embed_texts(texts: list[str], task_type: str = "SEMANTIC_SIMILARITY") -> list[list[float]]:
"""Embed a list of text strings."""
result = gemini_client.models.embed_content(
model=GEMINI_MODEL,
contents=texts,
config=types.EmbedContentConfig(task_type=task_type),
)
return [e.values for e in result.embeddings]
def embed_image(image_path: str) -> list[float]:
"""Embed an image file."""
with open(image_path, "rb") as f:
image_bytes = f.read()
mime = "image/png" if image_path.endswith(".png") else "image/jpeg"
result = gemini_client.models.embed_content(
model=GEMINI_MODEL,
contents=types.Part.from_bytes(data=image_bytes, mime_type=mime),
)
return result.embeddings[0].values
def embed_audio(audio_path: str) -> list[float]:
"""Embed an audio file."""
with open(audio_path, "rb") as f:
audio_bytes = f.read()
mime_map = {".mp3": "audio/mpeg", ".wav": "audio/wav", ".flac": "audio/flac"}
ext = os.path.splitext(audio_path)[1].lower()
mime = mime_map.get(ext, "audio/mpeg")
result = gemini_client.models.embed_content(
model=GEMINI_MODEL,
contents=types.Part.from_bytes(data=audio_bytes, mime_type=mime),
)
return result.embeddings[0].values
# ââ 1. Create Milvus collection âââââââââââââââââââââââââââââââââââââââââ
print("=== Creating collection ===")
if milvus_client.has_collection(COLLECTION_NAME):
milvus_client.drop_collection(COLLECTION_NAME)
schema = milvus_client.create_schema()
schema.add_field("id", DataType.INT64, is_primary=True, auto_id=True)
schema.add_field("content", DataType.VARCHAR, max_length=2000) # description of the content
schema.add_field("modality", DataType.VARCHAR, max_length=20) # "text", "image", "audio"
schema.add_field("vector", DataType.FLOAT_VECTOR, dim=DIM)
index_params = milvus_client.prepare_index_params()
index_params.add_index(
field_name="vector",
index_type="AUTOINDEX",
metric_type="COSINE",
)
milvus_client.create_collection(
COLLECTION_NAME,
schema=schema,
index_params=index_params,
consistency_level="Strong",
)
print(f"Collection '{COLLECTION_NAME}' created (dim={DIM}, metric=COSINE)")
# ââ 2. Insert text embeddings âââââââââââââââââââââââââââââââââââââââââââ
print("\n=== Inserting text embeddings ===")
documents = [
"Artificial intelligence was founded as an academic discipline in 1956.",
"The Mona Lisa is a half-length portrait painting by Leonardo da Vinci.",
"Beethoven's Symphony No. 9 premiered in Vienna on May 7, 1824.",
"The Great Wall of China stretches over 13,000 miles across northern China.",
"Jazz music originated in the African-American communities of New Orleans.",
"The Hubble Space Telescope was launched into orbit on April 24, 1990.",
"Vincent van Gogh painted The Starry Night while in an asylum in Saint-RĂŠmy.",
"Machine learning is a subset of AI focused on learning from data.",
]
text_vectors = embed_texts(documents)
text_rows = [
{"content": doc, "modality": "text", "vector": vec}
for doc, vec in zip(documents, text_vectors)
]
milvus_client.insert(COLLECTION_NAME, text_rows)
print(f"Inserted {len(text_rows)} text documents")
# ââ 3. (Optional) Insert image embeddings âââââââââââââââââââââââââââââââ
# Uncomment and provide real image paths to test multimodal search
#
# image_files = [
# ("photo of the Mona Lisa painting", "mona_lisa.jpg"),
# ("satellite photo of the Great Wall of China", "great_wall.png"),
# ]
# for desc, path in image_files:
# if os.path.exists(path):
# vec = embed_image(path)
# milvus_client.insert(COLLECTION_NAME, [
# {"content": desc, "modality": "image", "vector": vec}
# ])
# print(f"Inserted image: {desc}")
# ââ 4. (Optional) Insert audio embeddings âââââââââââââââââââââââââââââââ
# Uncomment and provide real audio paths to test multimodal search
#
# audio_files = [
# ("Beethoven Symphony No.9 excerpt", "beethoven_9.mp3"),
# ("jazz piano improvisation", "jazz_piano.mp3"),
# ]
# for desc, path in audio_files:
# if os.path.exists(path):
# vec = embed_audio(path)
# milvus_client.insert(COLLECTION_NAME, [
# {"content": desc, "modality": "audio", "vector": vec}
# ])
# print(f"Inserted audio: {desc}")
# ââ 5. Search ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
print("\n=== Searching ===")
queries = [
"history of artificial intelligence",
"famous Renaissance paintings",
"classical music concerts",
]
query_vectors = embed_texts(queries, task_type="SEMANTIC_SIMILARITY")
for query_text, query_vec in zip(queries, query_vectors):
results = milvus_client.search(
COLLECTION_NAME,
data=[query_vec],
limit=3,
output_fields=["content", "modality"],
search_params={"metric_type": "COSINE"},
)
print(f"\nQuery: '{query_text}'")
for hits in results:
for rank, hit in enumerate(hits, 1):
print(f" [{rank}] (score={hit['distance']:.4f}, modality={hit['entity']['modality']}) "
f"{hit['entity']['content'][:80]}")
# ââ 6. Cross-modal search example (image query -> text results) âââââââââ
# Uncomment to search text collection using an image as query
#
# print("\n=== Cross-modal search: image -> text ===")
# query_image_vec = embed_image("query_image.jpg")
# results = milvus_client.search(
# COLLECTION_NAME,
# data=[query_image_vec],
# limit=3,
# output_fields=["content", "modality"],
# search_params={"metric_type": "COSINE"},
# )
# for hits in results:
# for rank, hit in enumerate(hits, 1):
# print(f" [{rank}] (score={hit['distance']:.4f}) {hit['entity']['content'][:80]}")
# ââ Cleanup ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
# milvus_client.drop_collection(COLLECTION_NAME)
# print(f"\nCollection '{COLLECTION_NAME}' dropped")
print("\nDone!")
For image and audio embeddings, use embed_image() and embed_audio() the same way â the vectors land in the same collection and same vector space, enabling true cross-modal search.
Gemini Embedding 2 Will be Available in Milvus/Zilliz Cloud Soon
Milvus is shipping deep integration with Gemini Embedding 2 through its Embedding Function feature. Once live, you wonât need to call embedding APIs manually. Milvus will auto-invoke the model (supporting OpenAI, AWS Bedrock, Google Vertex AI, and more) to vectorize raw data on insert and queries on search.
That means you get unified multimodal embedding from Gemini where it fits, and Milvusâs full multi-vector toolkit â sparse-dense hybrid search, multi-index schemas, reranking â where you need fine-grained control.
Want to try it? Start with the Milvus quickstart and run the demo above, or check out the hybrid search guide for the full multi-vector setup with BGE-M3. Bring your questions to Discord or Milvus Office Hours.
Keep Reading
- Whatâs Different About Gemini Embedding 2 When Compared to CLIP/CLAP
- Gemini Embedding 2 vs. CLIP/CLAP One Model vs. Many for Multimodal Embeddings
- Gemini Embedding 2âs Advantage: Pipeline Simplification
- What is Multi-Vector Search in Milvus?
- Now, with Gemini Embedding 2, Does Multi-Vector Search Still Matter?
- Why One Embedding Isn't Enough: Multi-Vector Retrieval in Practice
- When to Use Gemini Embedding 2 vs. Multi-vector Columns
- Quick Start: Set Up Gemini Embedding 2 + Milvus
- Setup
- Full Example
- Gemini Embedding 2 Will be Available in Milvus/Zilliz Cloud Soon
- Keep Reading
On This Page
Try Managed Milvus for Free
Zilliz Cloud is hassle-free, powered by Milvus and 10x faster.
Get StartedLike the article? Spread the word


