Deploy Milvus in Kubernetes with sharding, replication, and autoscaling to handle concurrent agent queries in production.
Deployment strategy:
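1. Distributed deployment on Kubernetes: Run Milvus as separate components (proxies, query nodes, index nodes, data nodes), typically installed with the Milvus Helm chart or Milvus Operator. The worker nodes are stateless and usually run as Deployments, while dependencies such as etcd and the message queue run as StatefulSets. This enables horizontal scaling and fault tolerance.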
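2. Sharding and partitioning: Spread large collections across multiple shards, which Milvus assigns by hashing the primary key to distribute writes. If agents query different domains (supply chain, finance), separate them with partitions, a partition key, or one collection per domain, so each query scans only the relevant data.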
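3. Replication factor ≥2: Agents cannot afford downtime. Load 2–3 in-memory replicas of each collection so queries keep being served when a query node fails. Replicas provide high availability; disaster recovery still requires backups.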
4. Autoscaling: Scale query nodes with query load. Agents can spike to thousands of queries during reasoning loops. If QPS (queries per second) is not available as a scaling metric, a CPU utilization target of about 70% is a reasonable starting threshold.
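5. Persistent storage: Use cloud object storage (S3, GCS) for durability. Milvus persists segment data and index files to object storage, so data survives node restarts and failures. Pair this with regular backups to minimize the risk of data loss.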
6. Resource isolation: Allocate dedicated CPU/memory for agents vs. batch indexing. Agents need predictable latency; indexing can be bursty.
7. Connection pooling: Agent frameworks need to reuse Milvus connections. Prevent connection exhaustion during high-concurrency reasoning loops.
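The connection pooling item above can be sketched as a small fixed-size pool. This is a minimal, generic sketch rather than a Milvus API: `ClientPool` and its `factory` argument are hypothetical names, and the pool simply pre-builds a bounded number of clients and lends them out so concurrent agent loops cannot open unbounded connections.

```python
import queue
from contextlib import contextmanager
from typing import Callable, Generic, Iterator, TypeVar

T = TypeVar("T")


class ClientPool(Generic[T]):
    """Fixed-size pool that lends out pre-built clients and takes them back."""

    def __init__(self, factory: Callable[[], T], size: int = 4) -> None:
        # Build every client up front so the connection count is bounded by `size`.
        self._idle: "queue.Queue[T]" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def client(self, timeout: float = 5.0) -> Iterator[T]:
        # Blocks until a client is free; raises queue.Empty after `timeout` seconds.
        c = self._idle.get(timeout=timeout)
        try:
            yield c
        finally:
            # Always return the client, even if the caller's query raised.
            self._idle.put(c)
```

With pymilvus, the factory could be something like `lambda: MilvusClient(uri="http://milvus:19530")`, where the URI is a placeholder for your own proxy service. An agent would then wrap each search in `with pool.client() as c:` so that a burst of reasoning steps waits for a free client instead of opening a new connection.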
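To make the 70% CPU threshold from the autoscaling item concrete, the sketch below reproduces the scaling rule used by the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The function name and the `min_replicas`/`max_replicas` values are illustrative assumptions, not Milvus settings.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float = 70.0,
                     min_replicas: int = 2, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Estimate the query node count an HPA would request for a CPU target."""
    ratio = current_cpu_pct / target_cpu_pct
    # The HPA leaves the replica count alone when usage is close to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical values here).
    return max(min_replicas, min(max_replicas, wanted))
```

For example, four query nodes averaging 95% CPU against a 70% target give a ratio of about 1.36, so the rule asks for six nodes. Running the numbers this way helps you choose a `max_replicas` bound that covers the largest spike you expect from agent reasoning loops.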
Monitoring: Track disk I/O, query latency by shard, and segment compaction status. Slow compaction leaves many small segments unmerged, which raises query latency for agents.
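As a rough illustration of tracking query latency by shard, the sketch below groups latency samples by shard and flags shards whose 95th percentile exceeds a budget. It assumes you already collect `(shard_id, latency_ms)` pairs yourself; the function names, the shard labels, and the 50 ms budget are hypothetical. In practice, Milvus exports Prometheus metrics, and the same aggregation would normally live in a dashboard or alert rule.

```python
from collections import defaultdict
from statistics import quantiles


def p95_by_shard(samples):
    """Return {shard_id: p95 latency in ms} from (shard_id, latency_ms) pairs."""
    by_shard = defaultdict(list)
    for shard, ms in samples:
        by_shard[shard].append(ms)
    # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
    # Shards with fewer than two samples are skipped (quantiles needs >= 2 points).
    return {s: quantiles(v, n=20, method="inclusive")[-1]
            for s, v in by_shard.items() if len(v) >= 2}


def slow_shards(samples, budget_ms=50.0):
    """List shards whose p95 latency exceeds the latency budget."""
    return sorted(s for s, p in p95_by_shard(samples).items() if p > budget_ms)
```

Comparing percentiles per shard rather than a global average makes it easier to spot a single hot or poorly compacted shard that would otherwise be hidden by healthy ones.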
For self-hosted agentic RAG at scale, follow Milvus best practices. Containerization is mandatory.
Related Resources: