How should you evaluate agentic RAG embeddings for Milvus?

Evaluate embeddings for agentic RAG by testing retrieval recall, semantic consistency, and agent loop efficiency on domain-specific benchmarks.

Evaluation framework:

1. Retrieval recall@k:

  • Create ground-truth pairs: (query, relevant_documents)
  • For each query, count how many relevant docs appear in top-k retrieved results
  • Calculate recall = (relevant_docs_found / total_relevant_docs)
  • Target: >80% recall@5 for your domain
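The steps above can be sketched as a small scoring function. The query IDs, retrieved lists, and ground-truth pairs below are hypothetical placeholders; in practice the retrieved list would come from a Milvus vector search over your corpus.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant docs that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

# Toy ground truth: query -> (retrieved doc IDs in rank order, relevant doc IDs)
results = {
    "q1": (["d1", "d4", "d2", "d9", "d7"], ["d1", "d2", "d3"]),
    "q2": (["d5", "d6", "d8", "d3", "d2"], ["d5", "d6"]),
}

scores = [recall_at_k(ret, rel, k=5) for ret, rel in results.values()]
mean_recall = sum(scores) / len(scores)  # average recall@5 across queries
```

Averaging per-query recall (rather than pooling counts) keeps a query with many relevant docs from dominating the metric.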

2. Semantic consistency test:

  • Query variations should retrieve the same document
  • Example: “What happened in Q4?” and “What was the outcome in October–December?” should both retrieve Q4 reports
  • Measure: percentage of query variations retrieving the same top result
  • Target: >90% consistency
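A minimal sketch of the consistency metric, assuming a `retrieve(query)` function that returns doc IDs in rank order (stubbed here with a hypothetical in-memory index):

```python
def consistency_rate(variation_groups, retrieve):
    """Share of query variations whose top-1 result matches the canonical query's."""
    matches = total = 0
    for canonical, variations in variation_groups.items():
        expected = retrieve(canonical)[0]
        for variation in variations:
            total += 1
            if retrieve(variation)[0] == expected:
                matches += 1
    return matches / total if total else 0.0

# Hypothetical retrieval results for illustration only
INDEX = {
    "What happened in Q4?": ["q4_report", "annual_summary"],
    "What was the outcome in October-December?": ["q4_report", "board_minutes"],
    "How did Q4 go?": ["q3_report", "q4_report"],
}

groups = {
    "What happened in Q4?": [
        "What was the outcome in October-December?",
        "How did Q4 go?",
    ]
}
rate = consistency_rate(groups, lambda q: INDEX[q])  # 1 of 2 variations agree -> 0.5
```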

3. Agent loop efficiency:

  • Run agents on test queries
  • Count average loops needed to answer
  • Measure context tokens consumed
  • Target: median of 2–3 loops, <500 tokens per query
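A sketch of aggregating those measurements, assuming your agent harness logs a `(loops_used, context_tokens)` pair per test query (the trace values below are made up):

```python
import statistics

def loop_stats(runs):
    """Median loops and median context tokens across agent test runs."""
    loops = [loops_used for loops_used, _ in runs]
    tokens = [tokens_used for _, tokens_used in runs]
    return statistics.median(loops), statistics.median(tokens)

# Hypothetical traces from a test harness
runs = [(2, 420), (3, 480), (2, 390), (5, 950)]
median_loops, median_tokens = loop_stats(runs)  # -> (2.5, 450)
```

Medians are used deliberately: one pathological query that loops many times should not mask typical behavior, so track the tail (e.g., max loops) separately.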

4. Domain adaptation:

  • Test supply-chain, legal, and customer-support queries as separate slices
  • Some embeddings excel at semantic understanding but fail on domain-specific terminology
  • Choose embeddings with domain-specific fine-tuning if available

5. Latency at scale:

  • Index 1M+ embeddings in Milvus
  • Measure p95 query latency
  • Target: <100ms for a single query, <500ms for a full agent loop (3 queries)
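For the p95 computation itself, a small nearest-rank helper is enough; the latency samples would come from timing your Milvus `search()` calls, while the values below are placeholders:

```python
import math

def p95(latencies_ms):
    """p95 latency via the nearest-rank method on sorted samples."""
    s = sorted(latencies_ms)
    rank = max(1, math.ceil(0.95 * len(s)))  # 1-indexed nearest rank
    return s[rank - 1]

# Hypothetical per-query latencies in milliseconds
samples = [12.0, 15.5, 9.8, 88.0, 14.2, 210.0, 11.1, 13.9, 16.4, 12.7]
tail_latency = p95(samples)
```

Mean latency hides tail behavior; an agent that issues 3 sequential queries per loop compounds the tail, which is why the p95 target matters more than the average.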

Recommended test set: Use MTEB benchmarks + your own domain queries. Include edge cases (misspellings, abbreviations, acronyms).

Poor embeddings are the #1 cause of agent loop failures. Invest time in evaluation.
