Track retrieval latency, relevance scoring, agent loop iterations, and embedding quality to keep agentic RAG performance healthy.
Critical metrics:
1. Retrieval latency (p50, p95, p99): Should be <100ms for a single query and <50ms per query when batched (see the first sketch after this list). Slow retrievals block agents and frustrate users.
2. Relevance recall@k: Of the documents known to be relevant for a query, how many appear in the top-k results? Aim for >80% recall@5 in production (see the second sketch after this list).
3. Agent loop count: How many times does the agent re-query before answering? A median of 2–3 loops is healthy; >5 loops indicates poor embedding quality or irrelevant data (see the third sketch after this list).
4. Failed retrievals: Percentage of queries returning 0 results. Track by agent type. >5% indicates embedding drift or missing data.
5. Embedding freshness: How often are embeddings regenerated? Embeddings more than 30 days old can degrade relevance by roughly 15%.
6. False positive rate: Documents retrieved but marked irrelevant by the agent. >20% indicates query expansion is too broad; reduce k or add filters.
7. Agent success rate: Percentage of agent workflows that complete without fallback or escalation. Target >95%.
8. Context window utilization: Average tokens consumed per agent query. Agentic workflows can hit LLM context limits if retrievals aren’t selective.
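Latency percentiles and the failed-retrieval rate are easy to derive from your own query logs. Here is a minimal sketch in Python; the record schema (`latency_ms`, `num_results`, `agent_type`) is a hypothetical logging format, not a Milvus API:

```python
import statistics
from collections import defaultdict

def latency_percentiles(latencies_ms):
    """p50/p95/p99 from a list of retrieval latencies in milliseconds."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def failed_retrieval_rate(query_log):
    """Fraction of queries returning zero results, grouped by agent type."""
    totals, failures = defaultdict(int), defaultdict(int)
    for rec in query_log:  # rec: {"agent_type": str, "num_results": int, ...}
        totals[rec["agent_type"]] += 1
        if rec["num_results"] == 0:
            failures[rec["agent_type"]] += 1
    return {agent: failures[agent] / totals[agent] for agent in totals}
```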
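Recall@k and the agent-judged false positive rate require relevance labels, typically from a held-out evaluation set or from agent feedback. A sketch under that assumption; `relevant_ids`, `retrieved_ids`, and `verdicts` are hypothetical inputs:

```python
def recall_at_k(relevant_ids, retrieved_ids, k=5):
    """Fraction of known-relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 1.0  # vacuously perfect when nothing is relevant
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

def false_positive_rate(verdicts):
    """verdicts: one boolean per retrieved doc, True if the agent judged it irrelevant."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

If recall@5 sits below the 80% target while the false positive rate exceeds 20%, the retriever is returning plenty of documents but the wrong ones; tightening filters is usually a better first move than raising k.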
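Loop counts, success rate, and context utilization are per-workflow aggregates. A sketch assuming each workflow record logs hypothetical `loops`, `escalated`, and `prompt_tokens` fields:

```python
import statistics

def workflow_health(workflow_log):
    """Aggregate agent-loop and context-window metrics from per-workflow records."""
    loops = [w["loops"] for w in workflow_log]           # assumes a non-empty log
    tokens = [w["prompt_tokens"] for w in workflow_log]
    ok = sum(1 for w in workflow_log if not w["escalated"])
    return {
        "median_loops": statistics.median(loops),             # healthy: 2-3
        "pct_over_5_loops": sum(n > 5 for n in loops) / len(loops),
        "success_rate": ok / len(workflow_log),               # target: >95%
        "avg_prompt_tokens": statistics.fmean(tokens),        # watch LLM context limits
    }
```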
Milvus exports server-side metrics (query latency, throughput, failed requests) via its Prometheus integration; application-level metrics such as recall@k and loop counts come from your own agent logs, as in the sketches above. Set up dashboards early.
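Milvus serves Prometheus-format metrics over HTTP (port 9091 in a default deployment). A minimal scrape config might look like the following, with the target hostname as an assumption about your setup:

```yaml
# prometheus.yml -- scrape Milvus server metrics
scrape_configs:
  - job_name: "milvus"
    scrape_interval: 15s
    static_configs:
      - targets: ["milvus-standalone:9091"]  # assumed host; default metrics port
```

Point Grafana (or your dashboard tool of choice) at Prometheus and alert on the thresholds above before the system is under real load.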