
How to Choose the Best Embedding Model for RAG in 2026: 10 Models Benchmarked

  • Engineering
March 26, 2026
Cheney Zhang

TL;DR: We tested 10 embedding models across four production scenarios that public benchmarks miss: cross-modal retrieval, cross-lingual retrieval, key information retrieval, and dimension compression. No single model wins everything. Gemini Embedding 2 is the best all-rounder. Open-source Qwen3-VL-2B beats closed-source APIs on cross-modal tasks. If you need to compress dimensions to save storage, go with Voyage Multimodal 3.5 or Jina Embeddings v4.

Why MTEB Isn’t Enough for Choosing an Embedding Model

Most RAG prototypes start with OpenAI’s text-embedding-3-small. It’s cheap, easy to integrate, and for English text retrieval it works well enough. But production RAG outgrows it fast. Your pipeline picks up images, PDFs, multilingual documents — and a text-only embedding model stops being enough.

The MTEB leaderboard tells you there are better options. The problem? MTEB only tests single-language text retrieval. It doesn’t cover cross-modal retrieval (text queries against image collections), cross-lingual search (a Chinese query finding an English document), long-document accuracy, or how much quality you lose when you truncate embedding dimensions to save storage in your vector database.

So which embedding model should you use? It depends on your data types, your languages, your document lengths, and whether you need dimension compression. We built a benchmark called CCKM and tested 10 models released between 2025 and 2026 across exactly those dimensions.

What Is the CCKM Benchmark?

CCKM (Cross-modal, Cross-lingual, Key information, MRL) tests four capabilities that standard benchmarks miss:

| Dimension | What It Tests | Why It Matters |
|---|---|---|
| Cross-modal retrieval | Match text descriptions to the correct image when near-identical distractors are present | Multimodal RAG pipelines need text and image embeddings in the same vector space |
| Cross-lingual retrieval | Find the correct English document from a Chinese query, and vice versa | Production knowledge bases are often multilingual |
| Key information retrieval | Locate a specific fact buried in a 4K–32K character document (needle-in-a-haystack) | RAG systems frequently process long documents like contracts and research papers |
| MRL dimension compression | Measure how much quality the model loses when you truncate embeddings to 256 dimensions | Fewer dimensions = lower storage cost in your vector database, but at what quality cost? |

MTEB covers none of these. MMEB adds multimodal but skips hard negatives, so models score high without proving they handle subtle distinctions. CCKM is designed to cover what they miss.

Which Embedding Models Did We Test? Gemini Embedding 2, Jina Embeddings v4, and More

We tested 10 models covering both API services and open-source options, plus CLIP ViT-L-14 as a 2021 baseline.

| Model | Source | Parameters | Dimensions | Modality | Key Trait |
|---|---|---|---|---|---|
| Gemini Embedding 2 | Google | Undisclosed | 3072 | Text / image / video / audio / PDF | All-modality, widest coverage |
| Jina Embeddings v4 | Jina AI | 3.8B | 2048 | Text / image / PDF | MRL + LoRA adapters |
| Voyage Multimodal 3.5 | Voyage AI (MongoDB) | Undisclosed | 1024 | Text / image / video | Balanced across tasks |
| Qwen3-VL-Embedding-2B | Alibaba Qwen | 2B | 2048 | Text / image / video | Open-source, lightweight multimodal |
| Jina CLIP v2 | Jina AI | ~1B | 1024 | Text / image | Modernized CLIP architecture |
| Cohere Embed v4 | Cohere | Undisclosed | Fixed | Text | Enterprise retrieval |
| OpenAI text-embedding-3-large | OpenAI | Undisclosed | 3072 | Text | Most widely used |
| BGE-M3 | BAAI | 568M | 1024 | Text | Open-source, 100+ languages |
| mxbai-embed-large | Mixedbread AI | 335M | 1024 | Text | Lightweight, English-focused |
| nomic-embed-text | Nomic AI | 137M | 768 | Text | Ultra-lightweight |
| CLIP ViT-L-14 | OpenAI (2021) | 428M | 768 | Text / image | Baseline |

Cross-Modal Retrieval: Can a Text Query Find the Right Image?

If your RAG pipeline handles images alongside text, the embedding model needs to place both modalities in the same vector space. Think e-commerce image search, mixed image-text knowledge bases, or any system where a text query needs to find the right image.

Method

We took 200 image-text pairs from COCO val2017. For each image, GPT-4o-mini generated a detailed description. Then we wrote 3 hard negatives per image — descriptions that differ from the correct one by just one or two details. The model has to find the right match in a pool of 200 images and 600 distractors.

An example from the dataset:

Vintage brown leather suitcases with travel stickers including California and Cuba, placed on a metal luggage rack against a blue sky — used as a test image in the cross-modal retrieval benchmark

Correct description: “The image features vintage brown leather suitcases with various travel stickers including ‘California’, ‘Cuba’, and ‘New York’, placed on a metal luggage rack against a clear blue sky.”

Hard negative: Same sentence, but “California” becomes “Florida” and “blue sky” becomes “overcast sky.” The model has to actually understand the image details to tell these apart.

Scoring:

  • Generate embeddings for all images and all text (200 correct descriptions + 600 hard negatives).
  • Text-to-image (t2i): Each description searches 200 images for the closest match. Score a point if the top result is correct.
  • Image-to-text (i2t): Each image searches all 800 texts for the closest match. Score a point only if the top result is the correct description, not a hard negative.
  • Final score: hard_avg_R@1 = (t2i accuracy + i2t accuracy) / 2
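The scoring above can be sketched in a few lines of NumPy. The function name and array layout are illustrative, and the embeddings are assumed to be pre-computed and L2-normalized so that a dot product equals cosine similarity:

```python
import numpy as np

def hard_avg_r_at_1(img_emb, correct_txt_emb, neg_txt_emb):
    """Sketch of the hard_avg_R@1 metric: img_emb is (N, d) image embeddings,
    correct_txt_emb is (N, d) with row i describing image i, and neg_txt_emb
    is (M, d) hard-negative descriptions. All rows are L2-normalized."""
    n = len(img_emb)

    # t2i: each correct description searches all N images; top-1 must be its own image.
    sim_t2i = correct_txt_emb @ img_emb.T            # (N, N) cosine similarities
    t2i_acc = (sim_t2i.argmax(axis=1) == np.arange(n)).mean()

    # i2t: each image searches all N + M texts; top-1 must be the correct
    # description (index i), not one of the hard negatives.
    all_txt = np.vstack([correct_txt_emb, neg_txt_emb])
    sim_i2t = img_emb @ all_txt.T                    # (N, N + M)
    i2t_acc = (sim_i2t.argmax(axis=1) == np.arange(n)).mean()

    return (t2i_acc + i2t_acc) / 2
```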

Results

Horizontal bar chart showing Cross-Modal Retrieval Ranking: Qwen3-VL-2B leads at 0.945, followed by Gemini Embed 2 at 0.928, Voyage MM-3.5 at 0.900, Jina CLIP v2 at 0.873, and CLIP ViT-L-14 at 0.768

Qwen3-VL-2B, an open-source 2B parameter model from Alibaba’s Qwen team, came in first — ahead of every closed-source API.

Modality gap explains most of the difference. Embedding models map text and images into the same vector space, but in practice the two modalities tend to cluster in different regions. The modality gap measures the L2 distance between those two clusters. Smaller gap = easier cross-modal retrieval.

Visualization comparing large modality gap (0.73, text and image embedding clusters far apart) versus small modality gap (0.25, clusters overlapping) — smaller gap makes cross-modal matching easier

| Model | Score (R@1) | Modality Gap | Params |
|---|---|---|---|
| Qwen3-VL-2B | 0.945 | 0.25 | 2B (open-source) |
| Gemini Embedding 2 | 0.928 | 0.73 | Unknown (closed) |
| Voyage Multimodal 3.5 | 0.900 | 0.59 | Unknown (closed) |
| Jina CLIP v2 | 0.873 | 0.87 | ~1B |
| CLIP ViT-L-14 | 0.768 | 0.83 | 428M |

Qwen’s modality gap is 0.25 — roughly a third of Gemini’s 0.73. In a vector database like Milvus, a small modality gap means you can store text and image embeddings in the same collection and search across both directly. A large gap can make cross-modal similarity search less reliable, and you may need a re-ranking step to compensate.
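The modality gap itself is cheap to measure on your own data. A minimal sketch, assuming two arrays of L2-normalized embeddings (the function name is illustrative):

```python
import numpy as np

def modality_gap(text_emb, image_emb):
    """L2 distance between the centroids of the text and image embedding
    clusters. Both inputs are (n, d) arrays of L2-normalized embeddings;
    a smaller value means the two modalities occupy closer regions of the
    shared vector space."""
    return float(np.linalg.norm(text_emb.mean(axis=0) - image_emb.mean(axis=0)))
```

Running this over a sample of your own paired data is a quick sanity check before committing to a model for cross-modal search.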

Cross-Lingual Retrieval: Which Models Align Meaning Across Languages?

Multilingual knowledge bases are common in production. A user asks a question in Chinese, but the answer lives in an English document — or the other way around. The embedding model needs to align meaning across languages, not just within one.

Method

We built 166 parallel sentence pairs in Chinese and English across three difficulty levels:

Cross-lingual difficulty tiers: Easy tier maps literal translations like 我爱你 to I love you; Medium tier maps paraphrased sentences like 这道菜太咸了 to This dish is too salty with hard negatives; Hard tier maps Chinese idioms like 画蛇添足 to gilding the lily with semantically different hard negatives

Each language also gets 152 hard negative distractors.

Scoring:

  • Generate embeddings for all Chinese text (166 correct + 152 distractors) and all English text (166 correct + 152 distractors).
  • Chinese → English: Each Chinese sentence searches 318 English texts for its correct translation.
  • English → Chinese: Same in reverse.
  • Final score: hard_avg_R@1 = (zh→en accuracy + en→zh accuracy) / 2

Results

Horizontal bar chart showing Cross-Lingual Retrieval Ranking: Gemini Embed 2 leads at 0.997, followed by Qwen3-VL-2B at 0.988, Jina v4 at 0.985, Voyage MM-3.5 at 0.982, down to mxbai at 0.120

Gemini Embedding 2 scored 0.997 — the highest of any model tested. It was the only model to score a perfect 1.000 on the Hard tier, where pairs like “画蛇添足” → “gilding the lily” require genuine semantic understanding across languages, not pattern matching.

| Model | Score (R@1) | Easy | Medium | Hard (idioms) |
|---|---|---|---|---|
| Gemini Embedding 2 | 0.997 | 1.000 | 1.000 | 1.000 |
| Qwen3-VL-2B | 0.988 | 1.000 | 1.000 | 0.969 |
| Jina Embeddings v4 | 0.985 | 1.000 | 1.000 | 0.969 |
| Voyage Multimodal 3.5 | 0.982 | 1.000 | 1.000 | 0.938 |
| OpenAI 3-large | 0.967 | 1.000 | 1.000 | 0.906 |
| Cohere Embed v4 | 0.955 | 1.000 | 0.980 | 0.875 |
| BGE-M3 (568M) | 0.940 | 1.000 | 0.960 | 0.844 |
| nomic-embed-text (137M) | 0.154 | 0.300 | 0.120 | 0.031 |
| mxbai-embed-large (335M) | 0.120 | 0.220 | 0.080 | 0.031 |

The top 7 models all clear 0.93 on the overall score — the real differentiation happens on the Hard tier (Chinese idioms). nomic-embed-text and mxbai-embed-large, both English-focused lightweight models, score near zero on cross-lingual tasks.

Key Information Retrieval: Can Models Find a Needle in a 32K-Token Document?

RAG systems often process lengthy documents — legal contracts, research papers, internal reports containing unstructured data. The question is whether an embedding model can still find one specific fact buried in thousands of characters of surrounding text.

Method

We took Wikipedia articles of varying lengths (4K to 32K characters) as the haystack and inserted a single fabricated fact — the needle — at different positions: start, 25%, 50%, 75%, and end. The model has to determine, based on a query embedding, which version of the document contains the needle.

Example:

  • Needle: “The Meridian Corporation reported quarterly revenue of $847.3 million in Q3 2025.”
  • Query: “What was Meridian Corporation’s quarterly revenue?”
  • Haystack: A 32,000-character Wikipedia article about photosynthesis, with the needle hidden somewhere inside.

Scoring:

  • Generate embeddings for the query, the document with the needle, and the document without.
  • If the query is more similar to the document containing the needle, count it as a hit.
  • Average accuracy across all document lengths and needle positions.
  • Final metrics: overall_accuracy and degradation_rate (how much accuracy drops from shortest to longest document).
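The hit test for a single query can be sketched as a cosine comparison (names are illustrative; embeddings are assumed pre-computed):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def needle_hit(query_emb, doc_with_needle_emb, doc_without_emb):
    """Count a hit when the query embedding is closer to the version of the
    document that contains the needle than to the version without it."""
    return cosine(query_emb, doc_with_needle_emb) > cosine(query_emb, doc_without_emb)
```

overall_accuracy is then the mean of this hit test over every (document length, needle position) combination.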

Results

Heatmap showing Needle-in-a-Haystack accuracy by document length: Gemini Embed 2 scores 1.000 across all lengths up to 32K; top 7 models score perfectly within their context windows; mxbai and nomic degrade sharply at 4K+

Gemini Embedding 2 is the only model tested across the full 4K–32K range, and it scored perfectly at every length. No other model in this test has a context window that reaches 32K.

| Model | 1K | 4K | 8K | 16K | 32K | Overall | Degradation |
|---|---|---|---|---|---|---|---|
| Gemini Embedding 2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0% |
| OpenAI 3-large | 1.000 | 1.000 | 1.000 | — | — | 1.000 | 0% |
| Jina Embeddings v4 | 1.000 | 1.000 | 1.000 | — | — | 1.000 | 0% |
| Cohere Embed v4 | 1.000 | 1.000 | 1.000 | — | — | 1.000 | 0% |
| Qwen3-VL-2B | 1.000 | 1.000 | — | — | — | 1.000 | 0% |
| Voyage Multimodal 3.5 | 1.000 | 1.000 | — | — | — | 1.000 | 0% |
| Jina CLIP v2 | 1.000 | 1.000 | 1.000 | — | — | 1.000 | 0% |
| BGE-M3 (568M) | 1.000 | 1.000 | 0.920 | — | — | 0.973 | 8% |
| mxbai-embed-large (335M) | 0.980 | 0.600 | 0.400 | — | — | 0.660 | 58% |
| nomic-embed-text (137M) | 1.000 | 0.460 | 0.440 | — | — | 0.633 | 56% |

“—” means the document length exceeds the model’s context window.

The top 7 models score perfectly within their context windows. BGE-M3 starts to slip at 8K (0.920). The lightweight models (mxbai and nomic) drop to 0.4–0.6 at just 4K characters — roughly 1,000 tokens. For mxbai, this drop partly reflects its 512-token context window truncating most of the document.

MRL Dimension Compression: How Much Quality Do You Lose at 256 Dimensions?

Matryoshka Representation Learning (MRL) is a training technique that makes the first N dimensions of a vector meaningful on their own. Take a 3072-dimension vector, truncate it to 256, and it still holds most of its semantic quality. Fewer dimensions mean lower storage and memory costs in your vector database — going from 3072 to 256 dimensions is a 12x storage reduction.

Illustration showing MRL dimension truncation: 3072 dimensions at full quality, 1024 at 95%, 512 at 90%, 256 at 85% — with 12x storage savings at 256 dimensions

Method

We used 150 sentence pairs from the STS-B benchmark, each with a human-annotated similarity score (0–5). For each model, we generated embeddings at full dimensions, then truncated to 1024, 512, and 256.

STS-B data examples showing sentence pairs with human similarity scores: A girl is styling her hair vs A girl is brushing her hair scores 2.5; A group of men play soccer on the beach vs A group of boys are playing soccer on the beach scores 3.6

Scoring:

  • At each dimension level, compute the cosine similarity between each sentence pair’s embeddings.
  • Compare the model’s similarity ranking to the human ranking using Spearman’s ρ (rank correlation).

What is Spearman’s ρ? It measures how well two rankings agree. If humans rank pair A as most similar, B second, C least — and the model’s cosine similarities produce the same order A > B > C — then ρ approaches 1.0. A ρ of 1.0 means perfect agreement. A ρ of 0 means no correlation.

Final metrics: spearman_rho (higher is better) and min_viable_dim (the smallest dimension where quality stays within 5% of full-dimension performance).
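The truncate-then-score loop can be sketched as follows. Spearman's ρ is implemented directly on ranks (no tie handling, which is enough for a sketch); the key detail is re-normalizing after truncation so cosine similarity stays meaningful at the reduced dimension:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks.
    Assumes no tied values, which keeps the sketch simple."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def mrl_quality(emb_a, emb_b, human_scores, dim):
    """Truncate paired sentence embeddings to the first `dim` dimensions,
    re-normalize, and correlate the resulting cosine similarities with the
    human similarity annotations."""
    a = emb_a[:, :dim]
    b = emb_b[:, :dim]
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return spearman_rho((a * b).sum(axis=1), human_scores)
```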

Results

Dot plot showing MRL Full Dimension vs 256 Dimension Quality: Voyage MM-3.5 leads with +0.6% change, Jina v4 +0.5%, while Gemini Embed 2 shows -0.6% at the bottom

If you’re planning to reduce storage costs in Milvus or another vector database by truncating dimensions, this result matters.

| Model | ρ (full dim) | ρ (256 dim) | Decay |
|---|---|---|---|
| Voyage Multimodal 3.5 | 0.880 | 0.874 | 0.7% |
| Jina Embeddings v4 | 0.833 | 0.828 | 0.6% |
| mxbai-embed-large (335M) | 0.815 | 0.795 | 2.5% |
| nomic-embed-text (137M) | 0.781 | 0.774 | 0.8% |
| OpenAI 3-large | 0.767 | 0.762 | 0.6% |
| Gemini Embedding 2 | 0.683 | 0.689 | -0.8% |

Voyage and Jina v4 lead because both were explicitly trained with MRL as an objective. Dimension compression has little to do with model size — whether the model was trained for it is what matters.

A note on Gemini’s score: the MRL ranking reflects how well a model preserves quality after truncation, not how good its full-dimension retrieval is. Gemini’s full-dimension retrieval is strong — the cross-lingual and key information results already proved that. It just wasn’t optimized for shrinking. If you don’t need dimension compression, this metric doesn’t apply to you.

Which Embedding Model Should You Use?

No single model wins everything. Here’s the full scorecard:

| Model | Params | Cross-Modal | Cross-Lingual | Key Info | MRL ρ |
|---|---|---|---|---|---|
| Gemini Embedding 2 | Undisclosed | 0.928 | 0.997 | 1.000 | 0.668 |
| Voyage Multimodal 3.5 | Undisclosed | 0.900 | 0.982 | 1.000 | 0.880 |
| Jina Embeddings v4 | 3.8B | — | 0.985 | 1.000 | 0.833 |
| Qwen3-VL-2B | 2B | 0.945 | 0.988 | 1.000 | 0.774 |
| OpenAI 3-large | Undisclosed | — | 0.967 | 1.000 | 0.760 |
| Cohere Embed v4 | Undisclosed | — | 0.955 | 1.000 | — |
| Jina CLIP v2 | ~1B | 0.873 | 0.934 | 1.000 | — |
| BGE-M3 | 568M | — | 0.940 | 0.973 | 0.744 |
| mxbai-embed-large | 335M | — | 0.120 | 0.660 | 0.815 |
| nomic-embed-text | 137M | — | 0.154 | 0.633 | 0.780 |
| CLIP ViT-L-14 | 428M | 0.768 | 0.030 | — | — |

“—” means the model doesn’t support that modality or capability. CLIP is a 2021 baseline for reference.

Here’s what stands out:

  • Cross-modal: Qwen3-VL-2B (0.945) first, Gemini (0.928) second, Voyage (0.900) third. An open-source 2B model beat every closed-source API. The deciding factor was modality gap, not parameter count.
  • Cross-lingual: Gemini (0.997) leads — the only model to score perfectly on idiom-level alignment. The top 8 models all clear 0.93. English-only lightweight models score near zero.
  • Key information: API and large open-source models score perfectly up to 8K. Models of 335M parameters and below start degrading at 4K. Gemini is the only model that handles 32K with a perfect score.
  • MRL dimension compression: Voyage (0.880) and Jina v4 (0.833) lead, losing less than 1% at 256 dimensions. Gemini (0.668) comes last — strong at full dimension, not optimized for truncation.

How to Pick: A Decision Flowchart

Embedding model selection flowchart: Start → Need images or video? → Yes: Need to self-host? → Yes: Qwen3-VL-2B, No: Gemini Embedding 2. No images → Need to save storage? → Yes: Jina v4 or Voyage, No: Need multilingual? → Yes: Gemini Embedding 2, No: OpenAI 3-large

The Best All-Rounder: Gemini Embedding 2

On balance, Gemini Embedding 2 is the strongest overall model in this benchmark.

Strengths: First in cross-lingual (0.997) and key information retrieval (1.000 across all lengths up to 32K). Second in cross-modal (0.928). Widest modality coverage — five modalities (text, image, video, audio, PDF) where most models top out at three.

Weaknesses: Last in MRL compression (ρ = 0.668). Beaten on cross-modal by the open-source Qwen3-VL-2B.

If you don’t need dimension compression, Gemini has no real competitor on the combination of cross-lingual + long-document retrieval. But for cross-modal precision or storage optimization, specialized models do better.

Limitations

  • We didn’t include every model worth considering — NVIDIA’s NV-Embed-v2 and Jina’s v5-text were on the list but didn’t make this round.
  • We focused on text and image modalities; video, audio, and PDF embedding (despite some models claiming support) weren’t covered.
  • Code retrieval and other domain-specific scenarios were out of scope.
  • Sample sizes were relatively small, so tight ranking differences between models may fall within statistical noise.

This article’s results will be outdated within a year. New models ship constantly, and the leaderboard reshuffles with every release. The more durable investment is building your own evaluation pipeline — define your data types, your query patterns, your document lengths, and run new models through your own tests when they drop. Public benchmarks like MTEB, MMTEB, and MMEB are worth monitoring, but the final call should always come from your own data.

Our benchmark code is open-source on GitHub — fork it and adapt it to your use case.


Once you’ve picked your embedding model, you need somewhere to store and search those vectors at scale. Milvus, the world’s most widely adopted open-source vector database (43K+ GitHub stars), is built for exactly this: it supports MRL-truncated dimensions, mixed multimodal collections, and hybrid search combining dense and sparse vectors, and it scales from a laptop to billions of vectors.

  • Get started with the Milvus Quickstart guide, or install with pip install pymilvus.
  • Join the Milvus Slack or Milvus Discord to ask questions about embedding model integration, vector indexing strategies, or production scaling.
  • Book a free Milvus Office Hours session to walk through your RAG architecture — we can help with model selection, collection schema design, and performance tuning.
  • If you’d rather skip the infrastructure work, Zilliz Cloud (managed Milvus) offers a free tier to get started.

A few questions that come up when engineers are choosing an embedding model for production RAG:

Q: Should I use a multimodal embedding model even if I only have text data right now?

It depends on your roadmap. If your pipeline will likely add images, PDFs, or other modalities within the next 6–12 months, starting with a multimodal model like Gemini Embedding 2 or Voyage Multimodal 3.5 avoids a painful migration later — you won’t need to re-embed your entire dataset. If you’re confident it’s text-only for the foreseeable future, a text-focused model like OpenAI 3-large or Cohere Embed v4 will give you better price/performance.

Q: How much storage does MRL dimension compression actually save in a vector database?

Going from 3072 dimensions to 256 dimensions is a 12x reduction in storage per vector. For a Milvus collection with 100 million vectors at float32, that’s roughly 1.14 TB → 95 GB. The key is that not all models handle truncation well — Voyage Multimodal 3.5 and Jina Embeddings v4 lose less than 1% quality at 256 dimensions, while others degrade significantly.
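The arithmetic behind those numbers is straightforward for raw float32 vectors (index overhead excluded; the helper name is illustrative):

```python
def vector_storage_gb(num_vectors, dims, bytes_per_value=4):
    """Raw storage for float32 vectors in GiB; index structures add more on top."""
    return num_vectors * dims * bytes_per_value / 1024**3

full = vector_storage_gb(100_000_000, 3072)   # ~1144 GiB, roughly 1.1 TB
small = vector_storage_gb(100_000_000, 256)   # ~95 GiB
```

The ratio is exactly 3072 / 256 = 12, regardless of the vector count.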

Q: Is Qwen3-VL-2B really better than Gemini Embedding 2 for cross-modal search?

On our benchmark, yes — Qwen3-VL-2B scored 0.945 versus Gemini’s 0.928 on hard cross-modal retrieval with near-identical distractors. The main reason is Qwen’s much smaller modality gap (0.25 vs 0.73), which means text and image embeddings cluster closer together in vector space. That said, Gemini covers five modalities while Qwen covers three, so if you need audio or PDF embedding, Gemini is the only option.

Q: Can I use these embedding models with Milvus directly?

Yes. All of these models output standard float vectors, which you can insert into Milvus and search with cosine similarity, L2 distance, or inner product. PyMilvus works with any embedding model — generate your vectors with the model’s SDK, then store and search them in Milvus. For MRL-truncated vectors, just set the collection’s dimension to your target (e.g., 256) when creating the collection.
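The truncation step itself can be sketched with NumPy. The helper name is hypothetical; the important detail is re-normalizing after truncation so cosine and inner-product search behave as expected:

```python
import numpy as np

def truncate_for_storage(vectors, target_dim=256):
    """Keep the first `target_dim` dimensions of MRL embeddings and
    re-normalize each row, producing vectors ready to insert into a
    collection created with dimension=target_dim."""
    v = np.asarray(vectors, dtype=np.float32)[:, :target_dim]
    return v / np.linalg.norm(v, axis=1, keepdims=True)
```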
