What is the overhead of using a cross-encoder for reranking results compared to just using bi-encoder embeddings, and how can you minimize that extra cost in a system?

The overhead of using a cross-encoder for reranking compared to a bi-encoder stems from computational complexity and latency. Bi-encoders encode queries and documents independently, so document embeddings can be precomputed once, stored in a vector database, and searched quickly at query time with a similarity measure such as cosine similarity. Cross-encoders, by contrast, process the query and document together in a single forward pass, which captures richer token-level interactions but means every candidate pair must be scored at query time. For example, reranking 1,000 results with a cross-encoder requires 1,000 forward passes (one per query-document pair), while a bi-encoder needs only one inference for the query and relies on the precomputed document embeddings. This makes cross-encoders significantly slower and more resource-intensive, especially at scale.
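
The difference is easy to see in code. The sketch below uses the sentence-transformers library; the model names are illustrative choices rather than recommendations, and the short document list stands in for a real corpus:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "how do cross-encoders differ from bi-encoders?"
docs = ["Bi-encoders embed texts independently...",
        "Cross-encoders jointly attend over both texts...",
        "Vector databases store precomputed embeddings..."]  # imagine 1,000 entries

# Bi-encoder: document embeddings are computed once, offline.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model
doc_embeddings = bi_encoder.encode(docs)               # precompute, store in a vector DB

# At query time: ONE model inference for the query, then cheap matrix math.
query_embedding = bi_encoder.encode(query)
similarities = util.cos_sim(query_embedding, doc_embeddings)  # no further model calls

# Cross-encoder: every (query, document) pair needs its own forward pass.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative
pair_scores = cross_encoder.predict([(query, d) for d in docs])  # len(docs) inferences
```

The bi-encoder's query-time cost stays constant as the corpus grows, while the cross-encoder's cost grows linearly with the number of candidates it scores.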

To minimize this overhead, a common strategy is to limit the number of candidates the cross-encoder processes. A bi-encoder first retrieves a larger candidate set (e.g., 1,000 results), and the cross-encoder then reranks only a top subset (e.g., the top 100). This reduces the cross-encoder's workload by 90% while still improving the quality of the results users actually see. Another approach is optimizing the cross-encoder model itself: knowledge distillation can train smaller, faster cross-encoders that mimic a larger model's behavior. For instance, distilling a BERT-based cross-encoder into a TinyBERT variant reduces inference time without a major drop in accuracy. Additionally, hardware and runtime optimizations, such as batched inference on GPUs or exporting the model to ONNX Runtime for accelerated execution, can further cut latency.
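
Here is a minimal sketch of this two-stage retrieve-then-rerank pattern, again using sentence-transformers. The retrieve_k and rerank_k values mirror the numbers above; in production the first stage would typically be a vector database query rather than an in-memory search:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                  # illustrative
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative

def retrieve_then_rerank(query, docs, doc_embeddings, retrieve_k=1000, rerank_k=100):
    # Stage 1: one bi-encoder inference plus a vector similarity search
    # over embeddings that were precomputed offline.
    query_emb = bi_encoder.encode(query)
    hits = util.semantic_search(query_emb, doc_embeddings, top_k=retrieve_k)[0]

    # Stage 2: run the expensive cross-encoder only on the top rerank_k
    # candidates; scoring 100 pairs instead of 1,000 cuts its workload by 90%.
    # batch_size lets a GPU score many pairs per forward pass.
    candidates = hits[:rerank_k]
    pairs = [(query, docs[h["corpus_id"]]) for h in candidates]
    scores = cross_encoder.predict(pairs, batch_size=32)

    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [docs[h["corpus_id"]] for h, _ in ranked]
```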

System design choices also play a key role. Asynchronous processing can decouple the initial retrieval and reranking steps, so the system returns fast bi-encoder results immediately while the cross-encoder refines the ranking in the background instead of blocking user requests. Caching cross-encoder scores for frequent query-document pairs, or pre-reranking popular documents during off-peak hours, avoids redundant computation; a news aggregator, for example, might precompute cross-encoder scores for trending articles overnight. Finally, hybrid systems that apply the cross-encoder selectively based on query complexity (e.g., using a classifier to detect ambiguous queries that benefit most from reranking) ensure the cost is only incurred when necessary. By combining these tactics, developers can balance the improved relevance of cross-encoders with manageable operational costs.
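
The caching and selective-reranking ideas combine naturally. In the sketch below, cross_encoder_score and is_ambiguous are hypothetical stand-ins for a real cross-encoder call and a real query classifier:

```python
from functools import lru_cache

def cross_encoder_score(query: str, doc: str) -> float:
    """Stand-in for an expensive cross-encoder forward pass."""
    return 0.0  # placeholder; a real system would call the model here

def is_ambiguous(query: str) -> bool:
    """Stand-in for a query-complexity classifier; here, a crude heuristic."""
    return len(query.split()) <= 2  # very short queries are often underspecified

@lru_cache(maxsize=100_000)
def cached_score(query: str, doc: str) -> float:
    # Repeated (query, doc) pairs hit the cache and skip the model call;
    # scores for popular pairs could also be warmed into the cache off-peak.
    return cross_encoder_score(query, doc)

def rank(query: str, candidates: list[str]) -> list[str]:
    if not is_ambiguous(query):
        return candidates  # simple query: keep the cheap bi-encoder ordering
    return sorted(candidates, key=lambda d: cached_score(query, d), reverse=True)
```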
