Batch Milvus queries, enable Scout quantization, use GPU-accelerated indexing, and pipeline retrieval with generation for parallel processing.
Latency bottlenecks typically occur at: (1) embedding query → Milvus (100ms–1s), (2) vector search in Milvus (100ms–5s depending on index), (3) Scout generation on 10M-token input (5–15s). Optimize each: (a) cache embeddings for common queries, (b) enable GPU indexing on Milvus (CUDA-accelerated IVF or HNSW), (c) quantize Scout to int8 or int4 (reduces latency 2–3x with minor quality loss), (d) batch requests and process in parallel. For agentic workflows where Scout re-queries Milvus, pipeline: while Scout generates token N, start fetching context for token N+1.
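Tips (a) and (d) can be sketched together in a few lines of Python. This is a minimal illustration, not Scout or Milvus API code: `_embed_uncached`, `retrieve_context`, and `generate` are hypothetical stand-ins you would replace with your real embedding model, Milvus search call, and Scout generation call.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical stand-ins for the real embedding model, Milvus vector
# search, and Scout generation -- swap in your actual clients.
def _embed_uncached(query: str) -> list[float]:
    return [float(ord(c)) for c in query[:4]]  # placeholder vector

@lru_cache(maxsize=10_000)
def embed(query: str) -> tuple[float, ...]:
    # Tip (a): cache embeddings so repeated/common queries skip the
    # embedding model entirely.
    return tuple(_embed_uncached(query))

def retrieve_context(query: str) -> str:
    _ = embed(query)  # the Milvus search would use this vector
    return f"context for {query!r}"

def generate(query: str, context: str) -> str:
    return f"answer({query})"  # placeholder for the Scout call

def pipelined_answers(queries: list[str]) -> list[str]:
    """Tip (d): while the model generates the answer for query N,
    prefetch Milvus context for query N+1 on a background thread."""
    if not queries:
        return []
    answers = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(retrieve_context, queries[0])
        for i, query in enumerate(queries):
            context = future.result()  # wait for prefetched context
            if i + 1 < len(queries):
                # Overlap the next retrieval with this generation.
                future = pool.submit(retrieve_context, queries[i + 1])
            answers.append(generate(query, context))
    return answers
```

The same overlap pattern applies within a single agentic session: submit the next Milvus query as soon as the model emits the tool call, rather than waiting for generation to finish.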
Infrastructure tips: run Milvus and Scout on separate GPUs (one for indexing, one for generation) to avoid contention. Use ONNX Runtime or TVM to compile Scout's routing logic. Monitor token throughput (tokens/second), not just latency: Scout's sparse routing means processing 10M tokens can be faster than a dense model processing 2M tokens. With aggressive optimization, Scout + Milvus achieves under 10 seconds end-to-end latency for multi-document synthesis.
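Tracking tokens/second alongside latency only takes a small helper. The class below is a hypothetical monitoring sketch (not a Scout or vLLM API); you would call `record()` from wherever your serving loop counts generated tokens.

```python
import time

class ThroughputMonitor:
    """Track cumulative tokens/second alongside wall-clock time,
    so sparse-routing models are judged on throughput rather than
    raw per-request latency alone. Hypothetical helper, not an API."""

    def __init__(self) -> None:
        self.tokens = 0
        self._start = time.perf_counter()

    def record(self, n_tokens: int) -> None:
        # Call once per generation step or per completed request.
        self.tokens += n_tokens

    def tokens_per_second(self) -> float:
        elapsed = time.perf_counter() - self._start
        return self.tokens / elapsed if elapsed > 0 else 0.0

# Usage sketch: one monitor per serving process.
monitor = ThroughputMonitor()
monitor.record(512)   # e.g., tokens emitted for one request
monitor.record(1024)  # ... and another
```

A dashboard built on this number makes regressions visible even when individual request latency looks acceptable, e.g. when batching improves throughput while slightly raising p50 latency.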
Related Resources
- Milvus Performance Benchmarks — indexing and search speed
- Enhance RAG Performance — latency optimization
- RAG with vLLM — serving Scout efficiently