To evaluate vector search results beyond basic recall and precision, metrics like nDCG, MRR, and F1-score are commonly used. Each captures distinct aspects of performance, such as ranking quality, the importance of top results, and the balance between relevance and noise. Here’s how they work and when to use them:
nDCG (Normalized Discounted Cumulative Gain) measures the quality of ranked results by considering both relevance and position. Unlike recall and precision, which treat all relevant items equally, nDCG assigns higher weight to relevant items appearing earlier in the results. For example, if a search returns three documents with relevance scores [3, 1, 2] (on a scale of 0–3), the score is calculated by discounting the gain of each item based on its position. The “normalized” part ensures the metric is scaled against an ideal ranking, making it easier to compare across queries. This metric is useful for applications where the order of results matters, like recommendation systems or search engines, as it penalizes systems that bury highly relevant items in lower positions.
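To make the calculation concrete, here is a minimal sketch of nDCG using the common rel/log2(position+1) discount formulation (other variants, such as the exponential-gain form, exist); the function names and the [3, 1, 2] relevance list are illustrative, taken from the example above:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each item's relevance is discounted
    # by log2 of (1-based position + 1), so later positions count less.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # so a perfect ranking scores 1.0.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The [3, 1, 2] example from the text: the second-most-relevant item
# is ranked last, so the score falls below 1.0.
print(ndcg([3, 1, 2]))  # ~0.97 versus 1.0 for the ideal order [3, 2, 1]
```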
MRR (Mean Reciprocal Rank) focuses on the rank of the first relevant result in a list. For each query, it calculates the reciprocal of the position of the first correct answer (e.g., if the first relevant item is at position 3, the score is 1/3). The mean across all queries gives the MRR. This metric is ideal for tasks where the user expects a single correct answer quickly, such as question-answering systems or voice assistants. For instance, if a user asks, “What’s the capital of France?” and the correct answer (“Paris”) appears in the second position, the MRR for that query is 0.5. MRR doesn’t account for multiple relevant results, making it less suitable for scenarios requiring diverse outputs.
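A short sketch of MRR follows, assuming each query is represented as a ranked result list plus a set of known-relevant answers (the query data and function names are hypothetical):

```python
def reciprocal_rank(results, relevant):
    # Return 1 / rank of the first relevant item (1-based), or 0 if none appears.
    for rank, item in enumerate(results, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    # queries: list of (ranked_results, set_of_relevant_items) pairs
    return sum(reciprocal_rank(results, relevant) for results, relevant in queries) / len(queries)

# "Paris" at position 2 gives a reciprocal rank of 0.5 for that query.
queries = [
    (["Lyon", "Paris", "Marseille"], {"Paris"}),
    (["Berlin", "Munich", "Hamburg"], {"Berlin"}),
]
print(mean_reciprocal_rank(queries))  # (0.5 + 1.0) / 2 = 0.75
```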
F1-score balances precision (fraction of retrieved items that are relevant) and recall (fraction of relevant items retrieved). It’s the harmonic mean of the two, calculated as 2 × (precision × recall) / (precision + recall). For example, if a search returns 8 relevant items out of 10 retrieved (precision=0.8) and misses 2 relevant items (recall=0.8), the F1-score is 0.8. This metric is useful when there’s a need to minimize both false positives (irrelevant results) and false negatives (missed relevant items). It’s widely used in classification tasks but can also apply to search when evaluating binary relevance (e.g., spam detection in emails). However, it doesn’t consider ranking order, so it’s best paired with other metrics for ranking-focused systems.
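The worked numbers above map directly to a few lines of set arithmetic; this sketch uses made-up item IDs chosen so that 8 of 10 retrieved items are relevant and 2 relevant items are missed:

```python
def f1_score(retrieved, relevant):
    # retrieved: set of returned item IDs; relevant: set of ground-truth relevant IDs
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)

retrieved = set(range(10))            # 10 items returned (ids 0-9)
relevant = set(range(8)) | {20, 21}   # 8 of them relevant, 2 relevant items missed
print(f1_score(retrieved, relevant))  # precision=0.8, recall=0.8, F1=0.8
```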
Each metric addresses specific gaps in recall and precision: nDCG evaluates ranking quality, MRR emphasizes the speed of finding a correct answer, and F1-score balances relevance trade-offs. Choosing the right metric depends on the application’s priorities, such as whether order matters, how many results are needed, or how to handle partial relevance. Combining multiple metrics often provides a more complete picture of system performance.
Zilliz Cloud is a managed vector database built on Milvus, well suited for building GenAI applications.