To ensure a vector store performs well under load, track metrics like queries per second (QPS), average search time, and recall at specific latency thresholds. These metrics help balance speed, scalability, and accuracy. Additionally, monitoring hardware utilization, error rates, and queue times provides insight into system health and bottlenecks.
First, QPS measures how many queries the system handles each second. This metric reflects throughput and helps determine if the system scales as load increases. For example, if QPS spikes from 100 to 500 but latency remains stable, the system scales well. However, if latency rises sharply, you may need to optimize indexing or add resources. Average search time (latency) is equally critical—track both mean and percentiles (e.g., p95, p99) to identify outliers. A system with 50ms average latency but 500ms p99 indicates sporadic slowdowns, which could stem from uneven resource distribution or inefficient query routing.
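As a rough sketch of how these numbers might be collected, the Python snippet below times a batch of queries and derives QPS along with mean, p95, and p99 latency. The `search_fn` callable and the `queries` iterable are assumptions standing in for whatever client call and query set your vector store actually uses:

```python
import time
import statistics

def run_load_test(search_fn, queries):
    """Time each query and summarize throughput and latency.

    `search_fn` is a hypothetical wrapper around your vector store's
    search call; `queries` is an iterable of query vectors.
    """
    latencies_ms = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)  # placeholder for the actual search request
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start

    latencies_ms.sort()
    idx = lambda p: min(int(p * len(latencies_ms)), len(latencies_ms) - 1)
    return {
        "qps": len(latencies_ms) / elapsed,
        "mean_ms": statistics.mean(latencies_ms),
        "p95_ms": latencies_ms[idx(0.95)],
        "p99_ms": latencies_ms[idx(0.99)],
    }
```

Running the same harness at increasing concurrency or query volumes shows whether latency holds steady as QPS grows, which is exactly the scaling signal described above.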
Second, recall at a given latency measures accuracy under constraints. For example, if a vector store achieves 90% recall at 100ms but drops to 70% at 50ms, you can tune parameters such as the number of probes (nprobe) in IVF indexes or the ef search breadth in HNSW indexes. Use ground-truth datasets to validate recall: compare top-K results against exact nearest neighbors. If recall degrades under load, it might signal that sharding or compression settings are too aggressive. Balancing recall and latency ensures users get relevant results without unacceptable delays.
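One way to compute recall against such a ground truth is a brute-force comparison like the sketch below, which assumes you already have the approximate result IDs per query and uses a NumPy exact search as the reference:

```python
import numpy as np

def exact_top_k(corpus, queries, k):
    """Brute-force L2 nearest neighbors, used as the ground-truth ranking."""
    dists = np.linalg.norm(corpus[None, :, :] - queries[:, None, :], axis=-1)
    return np.argsort(dists, axis=1)[:, :k]

def recall_at_k(approx_ids, exact_ids, k):
    """Average fraction of the true top-k neighbors returned by the ANN search."""
    hits = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids))
    return hits / (k * len(exact_ids))
```

Re-running `recall_at_k` at each latency target (e.g., after lowering nprobe or ef) makes the accuracy-versus-speed trade-off explicit.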
Finally, track hardware metrics (CPU, memory, disk I/O) and error rates. High CPU usage during peak QPS suggests computational bottlenecks, while memory spikes may indicate inefficient caching. Disk I/O bottlenecks often occur when indexes exceed RAM capacity. Error rates (e.g., timeouts or failed queries) reveal stability issues—a 5% error rate under load could mean insufficient nodes or thread limits. Queue times (how long queries wait before processing) highlight concurrency limits; growing queues signal the need for horizontal scaling. For example, if queue times exceed 200ms, adding nodes or load balancers might alleviate backpressure. Combining these metrics provides a comprehensive view of performance and guides targeted optimizations.
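To check whether a latency spike or rising error rate lines up with a hardware bottleneck, a simple sampling loop like the one below can run alongside the load test. It uses the psutil library as an illustration; a production metrics stack such as Prometheus would serve the same purpose:

```python
import time
import psutil

def sample_system_metrics(interval_s=1.0, samples=5):
    """Print CPU, memory, and disk I/O once per interval during a load test."""
    last_io = psutil.disk_io_counters()
    for _ in range(samples):
        time.sleep(interval_s)
        io = psutil.disk_io_counters()
        print({
            "cpu_pct": psutil.cpu_percent(),             # sustained highs: compute bottleneck
            "mem_pct": psutil.virtual_memory().percent,  # spikes: index may exceed RAM
            "read_mb_s": (io.read_bytes - last_io.read_bytes) / interval_s / 1e6,
            "write_mb_s": (io.write_bytes - last_io.write_bytes) / interval_s / 1e6,
        })
        last_io = io
```

Correlating these samples with the QPS, latency, and error-rate numbers from the load test turns "the system is slow" into a specific, fixable bottleneck.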
Zilliz Cloud is a managed vector database built on Milvus, well suited for building GenAI applications.