To prioritize query throughput over recall, you can optimize both the index structure and search parameters to reduce computational overhead and speed up response times. The key is to simplify the search process by minimizing the data processed per query and leveraging efficient query execution strategies. Here’s how to approach this:
1. Simplify the Index Structure
Reducing the size and complexity of the index is the first step. Limit the number of indexed fields to only those critical for search; for example, avoid indexing metadata or fields that are rarely queried. Use simpler analyzers (e.g., standard instead of edge_ngram) to minimize tokenization overhead. Disable features like scoring, term vectors, or positional data if they're unnecessary. For instance, in Elasticsearch, setting index_options: docs skips storing positional information, which speeds up indexing and searching. Increase the index refresh interval (e.g., from 1s to 30s) to reduce segment-creation frequency, which lowers I/O pressure and improves bulk query performance.
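As a rough sketch, the settings above might be combined in an Elasticsearch index definition like the one below, written as a plain Python dict so the shape is easy to inspect. The index name, field names, and the internal_notes field are hypothetical examples, not from any particular system.

```python
# Sketch of a throughput-oriented index definition; field names are
# illustrative placeholders.
index_definition = {
    "settings": {
        "refresh_interval": "30s",  # refresh less often to cut segment churn
    },
    "mappings": {
        "properties": {
            # Keep only fields that are actually searched.
            "name": {
                "type": "text",
                "analyzer": "standard",   # simple analyzer, no edge_ngram
                "index_options": "docs",  # store doc IDs only, no positions
            },
            # Fields used only for exact filtering can skip text analysis.
            "category": {"type": "keyword"},
            # Rarely queried metadata: stored in _source but not indexed.
            "internal_notes": {"type": "text", "index": False},
        }
    },
}
```

A body like this would be passed to an index-creation call (e.g., via the Elasticsearch REST API or a client library).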
2. Optimize Search Parameters
Adjust query execution settings to prioritize speed. Use filters instead of queries where possible, as filters are cacheable and skip scoring. Limit the number of results returned (e.g., size=10) and disable total-hit tracking (e.g., track_total_hits=false) to skip costly count calculations. Choose a search type like query_then_fetch (in distributed systems) to avoid the overhead of global scoring. For text searches, prefer term or match queries over phrase or fuzzy queries, which are computationally heavier. Use constant_score wrappers to bypass relevance scoring when ranking isn't critical.
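The query-side settings above can be sketched as an Elasticsearch query body, again as a plain Python dict. The category and price fields are hypothetical examples used only to illustrate the shape.

```python
# Sketch of a throughput-oriented search body: capped result size, no
# total-hit counting, and all clauses in "filter" so nothing is scored.
query_body = {
    "size": 10,                 # cap the result set
    "track_total_hits": False,  # skip exact total-hit counting
    "query": {
        "bool": {
            # Filter clauses are cacheable and produce no relevance score.
            "filter": [
                {"term": {"category": "laptops"}},
                {"range": {"price": {"lte": 500}}},
            ]
        }
    },
}

# Alternative: constant_score wraps a filter and assigns a fixed score,
# bypassing relevance computation entirely.
constant_score_body = {
    "query": {
        "constant_score": {
            "filter": {"term": {"category": "laptops"}},
            "boost": 1.0,
        }
    }
}
```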
3. Tune Distribution and Caching
Increase the number of replicas to distribute query load across nodes, improving parallelism. For example, in Elasticsearch, setting number_of_replicas=2 gives three copies of the index (one primary plus two replicas) to handle read traffic. Use routing to restrict queries to specific shards, reducing the number of shards scanned per request. Enable request caching (e.g., Elasticsearch's request_cache=true) for repeated queries. If your dataset allows, precompute and cache common results (e.g., top-10 trending products) to serve them without hitting the index at all. These steps reduce redundant computation and network overhead, directly improving throughput.
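A minimal sketch of these distribution and caching knobs, again as plain dicts; the routing key user_123 is a hypothetical example. Note that Elasticsearch's shard request cache by default only caches size=0 (aggregation-style) responses, so request_cache=true is most useful for those requests.

```python
# Replica count update body (e.g., for a settings-update call):
# 1 primary + 2 replicas = 3 copies serving reads.
replica_settings = {"index": {"number_of_replicas": 2}}

# Query-string parameters a search request might carry:
search_params = {
    "routing": "user_123",    # query only the shard this key routes to
    "request_cache": "true",  # cache the response (size=0 requests by default)
}
```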
By streamlining the index, optimizing query execution, and leveraging distribution and caching, you can achieve significant throughput gains. For example, a product search system might index only name and category, use filter-heavy queries, and cache frequent searches like "laptops under $500." This approach balances speed with "good enough" recall for high-volume scenarios.
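The "cache frequent searches" idea can be sketched with a simple in-process memoization layer. This is a hypothetical illustration: run_search is a stub standing in for the real backend call, and in production you would likely use a shared cache (e.g., Redis) with an expiry policy instead of a per-process one.

```python
from functools import lru_cache


def run_search(category: str, max_price: int) -> list:
    # Stub for the real backend call; a real implementation would issue
    # a filter-heavy query like the ones shown earlier.
    return [f"{category}-item-under-{max_price}"]


@lru_cache(maxsize=1024)
def cached_search(category: str, max_price: int) -> tuple:
    # Hot queries such as ("laptops", 500) are served from the cache on
    # repeat calls and never reach the search backend.
    return tuple(run_search(category, max_price))
```

Because lru_cache keys on the arguments, repeated popular queries cost a dictionary lookup instead of a round trip to the index.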