How does agentic RAG scale to millions of documents?

Agentic RAG scales to millions of documents by combining Milvus partitioning, index selection, query-time filtering, quantization, and two-stage retrieval.

Scaling techniques:

1. Partitioning by partition key:

  • Partition by document source, date range, or document type
  • Entities with the same key value land in the same partition, so a filtered search scans only the matching partitions
  • Agents filter on the partition key before querying
  • Example: partitioning on “supplier_id” for supply chain agents
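
How a partition key narrows a search can be sketched in plain Python. This is an illustration only, not Milvus's internal routing: the `PartitionedStore` class, the CRC32 hash, and the partition count of 16 are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 16  # illustrative value, not a Milvus default


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition-key value to a bucket with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedStore:
    """Toy store that groups documents by the hash of their supplier_id."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)

    def insert(self, doc: dict) -> None:
        bucket = partition_for(doc["supplier_id"], self.num_partitions)
        self.partitions[bucket].append(doc)

    def candidates(self, supplier_id: str) -> list:
        """Scan only the bucket that can contain this supplier's documents."""
        bucket = partition_for(supplier_id, self.num_partitions)
        docs = self.partitions.get(bucket, [])
        # Several key values can share a bucket, so filter on the exact key.
        return [d for d in docs if d["supplier_id"] == supplier_id]
```

A query for one supplier touches one bucket instead of the whole collection, which is the effect the partition key is meant to achieve.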

2. Index types: Milvus supports multiple indexes optimized for different scales:

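  • IVF_FLAT: Fast and accurate for collections under roughly 100M documents; stores uncompressed vectors grouped into clusters
  • HNSW: Fast, high-recall graph index for roughly 100M–1B documents; needs more memory than IVF_FLAT because it stores graph links alongside the vectors
  • GPU indexing: For billions of documents when very low latency is required; sub-10ms latency is possible but depends on hardware and index parameters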
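
As a rough sketch, the size bands above can be turned into index parameters. The dictionaries follow the `index_type` / `metric_type` / `params` shape used when creating a Milvus index; the `suggest_index` helper, the thresholds as hard cut-offs, and the specific values (`nlist=1024`, `M=16`, `efConstruction=200`) are assumptions for illustration and should be tuned against real data.

```python
def suggest_index(num_vectors: int, gpu_available: bool = False,
                  metric_type: str = "L2") -> dict:
    """Pick illustrative index parameters from the collection size.

    Hypothetical helper: the cut-offs mirror the rough guidance in the
    list above and are not official Milvus recommendations.
    """
    if num_vectors >= 1_000_000_000 and gpu_available:
        # GPU-resident IVF index for the largest collections.
        return {"index_type": "GPU_IVF_FLAT", "metric_type": metric_type,
                "params": {"nlist": 1024}}
    if num_vectors >= 100_000_000:
        # Graph index: high recall, higher memory use.
        return {"index_type": "HNSW", "metric_type": metric_type,
                "params": {"M": 16, "efConstruction": 200}}
    # Cluster-based index for smaller collections.
    return {"index_type": "IVF_FLAT", "metric_type": metric_type,
            "params": {"nlist": 1024}}
```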

3. Query-time filtering:

  • Agents specify metadata constraints (date range, source)
  • Search space narrows from millions to thousands
  • Retrieval latency can stay under 100ms when filters are selective, even on very large collections
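
A minimal sketch of how an agent might turn its constraints into a Milvus-style boolean filter expression. The `build_filter` helper and the field names `source` and `published_ts` (a Unix timestamp) are hypothetical; substitute the scalar fields defined in your own schema.

```python
import json
from datetime import datetime, timedelta, timezone


def build_filter(source=None, last_days=None, now=None,
                 source_field="source", ts_field="published_ts") -> str:
    """Build a filter string such as: source == "A" and published_ts >= 123."""
    clauses = []
    if source is not None:
        # json.dumps adds double quotes and escapes special characters.
        clauses.append(f"{source_field} == {json.dumps(source)}")
    if last_days is not None:
        now = now or datetime.now(timezone.utc)
        cutoff = int((now - timedelta(days=last_days)).timestamp())
        clauses.append(f"{ts_field} >= {cutoff}")
    return " and ".join(clauses)
```

The resulting string would be passed as the filter argument of a vector search, so the engine only scores vectors whose metadata matches.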

4. Scalar quantization: Store embeddings as 8-bit integers instead of 32-bit floats (for example, with the IVF_SQ8 index). This gives 4x memory savings with minimal accuracy loss (<2%).
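
The 8-bit idea can be sketched in a few lines of plain Python. This is a simplified per-vector min/max scheme chosen for clarity, not the exact algorithm Milvus uses; `quantize` and `dequantize` are illustrative names.

```python
def quantize(vec):
    """Map each float to one byte (0-255) using the vector's min and max."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # avoid division by zero for constant vectors
    codes = bytes(min(255, round((x - lo) / scale)) for x in vec)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats; each value is off by at most scale / 2."""
    return [lo + c * scale for c in codes]


FLOAT32_BYTES = 4  # a 32-bit float takes 4 bytes, a quantized code takes 1
```

For a 768-dimensional embedding this means 768 bytes instead of 3,072 bytes per vector, which is where the 4x saving comes from.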

5. Reranking: Retrieve the top-100 candidates quickly, then rerank them with an LLM to get the top-5. This two-stage retrieval balances speed and accuracy.
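
A minimal sketch of the two-stage pattern, assuming documents are dictionaries with a `vector` field. The first stage uses cosine similarity as the cheap score; `rerank_score` is a hypothetical callable standing in for the LLM or cross-encoder call, which is not shown here.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, docs, rerank_score,
                         k_candidates=100, k_final=5):
    """Stage 1: cheap vector similarity. Stage 2: expensive reranker."""
    candidates = heapq.nlargest(
        k_candidates, docs, key=lambda d: cosine(query_vec, d["vector"])
    )
    # Only the shortlisted candidates reach the costly scoring function.
    return heapq.nlargest(k_final, candidates, key=rerank_score)
```

The expensive scorer runs at most `k_candidates` times per query, regardless of how large the collection is.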

Performance at scale:

  • 1M documents: <50ms, any index
  • 100M documents: <100ms with HNSW, partition filtering
  • 1B documents: <200ms with GPU indexing, 8-bit quantization

Agentic workflow optimization: Agents constrain queries early (“last 30 days of data”, “from supplier A”). This reduces the search space by 10–100x, keeping latencies low.

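Milvus is designed for billion-scale agentic RAG. Plan partition keys, index types, and filterable metadata fields before ingesting data, since the partition key is fixed in the collection schema and changing an index means rebuilding it.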
