Agentic RAG scales to millions of documents through Milvus sharding, index optimization, and query-time filtering.
Scaling techniques:
1. Sharding by partition key:
- Partition by document source, date range, or document type
- Each shard is independently indexed and queryable
- Agents filter by partition before querying
- Example: “supplier_id” sharding for supply chain agents
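A toy model of partition-key routing, assuming documents carry a "supplier_id" field. Milvus performs this routing internally when a field is marked as the partition key; the hash function, the `NUM_PARTITIONS` value, and the helper names below are illustrative assumptions, not the Milvus implementation.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 16  # illustrative value, not a Milvus default


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition-key value to a partition index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def build_partitions(docs, num_partitions: int = NUM_PARTITIONS):
    """Group documents into partitions by their supplier_id."""
    partitions = defaultdict(list)
    for doc in docs:
        partitions[partition_for(doc["supplier_id"], num_partitions)].append(doc)
    return partitions


def candidates_for(partitions, supplier_id: str, num_partitions: int = NUM_PARTITIONS):
    """Scan only the partition that can contain the requested supplier."""
    bucket = partitions.get(partition_for(supplier_id, num_partitions), [])
    # Several suppliers may share a partition, so the exact match is still needed.
    return [d for d in bucket if d["supplier_id"] == supplier_id]
```

The point of the sketch is that a query for one supplier touches one partition rather than the whole collection.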
2. Index types: Milvus supports multiple indexes optimized for different scales:
- IVF_FLAT: Fast for <100M documents, accurate
- HNSW: Fast for 100M–1B documents with high recall; the graph structure adds memory overhead on top of the raw vectors
- GPU indexing: For billions of documents with sub-10ms latency
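As a rough sketch, the size thresholds above can be turned into a helper that returns an index configuration. The `index_type` names and parameter keys (`nlist`, `M`, `efConstruction`) are real Milvus options, but the parameter values, the choice of GPU_IVF_FLAT as the GPU index, and the `suggest_index` function itself are assumptions for illustration; tune them against your own data.

```python
def suggest_index(num_vectors: int, gpu_available: bool = False) -> dict:
    """Pick index parameters from collection size, following the guidance above."""
    if num_vectors >= 1_000_000_000 and gpu_available:
        # Billion-scale collections with GPUs: a GPU-backed IVF index.
        return {
            "index_type": "GPU_IVF_FLAT",
            "metric_type": "L2",
            "params": {"nlist": 4096},  # illustrative value
        }
    if num_vectors >= 100_000_000:
        # Graph index for the 100M to 1B range.
        return {
            "index_type": "HNSW",
            "metric_type": "L2",
            "params": {"M": 16, "efConstruction": 200},  # illustrative values
        }
    # Smaller collections: inverted-file index over uncompressed vectors.
    return {
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 1024},  # illustrative value
    }
```

The returned dictionary has the shape Milvus clients expect for index parameters, so it can be passed along when the index is created.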
3. Query-time filtering:
- Agents specify metadata constraints (date range, source)
- Search space narrows from millions to thousands
- Retrieval latency can stay under 100ms even at billion scale, because each query searches only the filtered subset
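One way an agent might turn its constraints into a Milvus filter expression is sketched below. The expression syntax (`==`, `in [...]`, `and`, `>=`) follows Milvus boolean expressions; the field names `source` and `created_at` (a Unix timestamp) and the `build_filter` helper are hypothetical.

```python
import time


def build_filter(supplier_id=None, sources=None, last_n_days=None, now=None) -> str:
    """Build a Milvus-style boolean filter string from agent constraints.

    Values are interpolated without escaping, so this assumes trusted input.
    """
    clauses = []
    if supplier_id is not None:
        clauses.append(f'supplier_id == "{supplier_id}"')
    if sources:
        quoted = ", ".join(f'"{s}"' for s in sources)
        clauses.append(f"source in [{quoted}]")
    if last_n_days is not None:
        now = time.time() if now is None else now
        cutoff = int(now) - last_n_days * 86_400  # seconds per day
        clauses.append(f"created_at >= {cutoff}")
    return " and ".join(clauses)
```

For example, `build_filter(supplier_id="A", last_n_days=30)` yields an expression that restricts the search to supplier A's documents from the last 30 days; the string would be passed as the filter argument of the vector search.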
4. Scalar quantization: Store embeddings as 8-bit integers instead of 32-bit floats. This cuts vector memory by 4x with minimal accuracy loss (<2%).
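The 4x figure comes from storing one byte per dimension instead of four. A minimal sketch of per-vector min/max quantization to 8-bit codes follows; it illustrates the idea and is not the exact scheme Milvus uses internally.

```python
from array import array


def quantize(vec):
    """Map floats to 8-bit codes (0..255) using the vector's own min/max range."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero step for constant vectors
    codes = array("B", (round((x - lo) / scale) for x in vec))
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats; the error is at most half a quantization step."""
    return [lo + c * scale for c in codes]
```

Each code occupies 1 byte, against 4 bytes for a 32-bit float, at the cost of a reconstruction error bounded by `scale / 2` per dimension.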
5. Reranking: Retrieve the top-100 candidates quickly, then rerank them with an LLM to get the top-5. Two-stage retrieval balances speed and accuracy.
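The two-stage pattern can be sketched in plain Python. Here a brute-force cosine search stands in for the Milvus vector search, and `rerank_score` is a hypothetical callable standing in for the LLM reranker; both are assumptions for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(query_vec, docs, rerank_score, first_k=100, final_k=5):
    """Stage 1: cheap vector similarity. Stage 2: expensive rerank on survivors."""
    candidates = heapq.nlargest(
        first_k, docs, key=lambda d: cosine(query_vec, d["vector"])
    )
    # Only first_k documents reach the costly scorer, whatever the corpus size.
    return heapq.nlargest(final_k, candidates, key=rerank_score)
```

The expensive scorer runs on at most `first_k` documents per query, which is what keeps the overall latency bounded.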
Performance at scale:
- 1M documents: <50ms, any index
- 100M documents: <100ms with HNSW, partition filtering
- 1B documents: <200ms with GPU indexing, scalar quantization
Agentic workflow optimization: Agents constrain queries early ("last 30 days of data", "from supplier A"). This reduces the search space by 10–100x, keeping latencies low.
Milvus is designed for billion-scale agentic RAG. Plan ahead: choose the partition key, index type, and quantization settings before the collection grows.
Related Resources: