NDCG (Normalized Discounted Cumulative Gain) is a metric for evaluating the quality of rankings produced by search engines or recommendation systems: it measures how well the order of results matches their actual relevance. Unlike simpler metrics such as precision or recall, which treat relevance as binary and ignore where relevant items appear in the list, NDCG accounts for both the degree of relevance of each item and its position in the ranking. This makes it particularly useful for real-world scenarios where top-ranked results have a bigger impact on user satisfaction. For example, a search engine that places the most relevant result in the first position is more valuable than one that buries it on page two, even if both return the same set of relevant items overall.
To calculate NDCG, you start with DCG (Discounted Cumulative Gain), which sums the relevance scores of the results while applying a logarithmic discount to lower positions. The discount reflects the idea that users are less likely to notice or click on items further down the list. For instance, if a search returns results with relevance scores [3, 2, 3, 0, 1] (on a scale where 3 is highly relevant), and the result at position i ≥ 2 is divided by log2(i) while the first result is left undiscounted, the DCG is 3 + (2 / log2(2)) + (3 / log2(3)) + (0 / log2(4)) + (1 / log2(5)) ≈ 7.32. (A widely used variant divides the result at position i by log2(i + 1) instead, so that the second position is also discounted; the resulting values differ, so the convention should be stated when reporting scores.) The result is then normalized by dividing by the ideal DCG (IDCG), the DCG obtained when the same results are sorted in descending order of relevance, which is the maximum possible. For non-negative relevance scores, this normalization scales the score between 0 (worst) and 1 (best), allowing comparisons across queries or datasets with different numbers of results or relevant items. For the scores above, the ideal order is [3, 3, 2, 1, 0], which gives an IDCG of about 7.76, so the NDCG is 7.32 / 7.76 ≈ 0.94.
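The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `dcg` and `ndcg` are chosen here for the example, and the discount follows the convention of the worked example in the text (no discount at position 1, division by log2(i) from position 2 on).

```python
import math

def dcg(relevances):
    """Discounted cumulative gain, using the convention from the text:
    position 1 is undiscounted, position i >= 2 is divided by log2(i)."""
    total = 0.0
    for position, rel in enumerate(relevances, start=1):
        discount = 1.0 if position == 1 else math.log2(position)
        total += rel / discount
    return total

def ndcg(relevances):
    """DCG divided by the DCG of the same scores sorted in descending order."""
    idcg = dcg(sorted(relevances, reverse=True))
    # If no result has any relevance, the ratio is undefined; return 0.0 by assumption.
    return dcg(relevances) / idcg if idcg > 0 else 0.0

scores = [3, 2, 3, 0, 1]
print(round(dcg(scores), 3))                        # 7.323
print(round(dcg(sorted(scores, reverse=True)), 3))  # 7.762
print(round(ndcg(scores), 3))                       # 0.944
```

Returning 0.0 when every relevance score is zero is an assumption made for this sketch; some evaluation setups skip such queries instead.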
NDCG is widely used because it aligns with user behavior: people expect the best results up front, and a single highly relevant item at the top can outweigh several marginally relevant ones lower down. It’s also flexible, supporting graded relevance judgments (e.g., “highly relevant” vs. “somewhat relevant”) instead of treating relevance as binary. This is critical for nuanced tasks like product searches, where a user might prefer a perfect match over a partially correct one. While NDCG requires careful relevance labeling, its balance of simplicity and positional sensitivity makes it a go-to metric for benchmarking search algorithms, optimizing ranking models, and A/B testing improvements in production systems.
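To make the contrast with binary metrics concrete, the sketch below scores two orderings of the same five results. The graded labels (3 = perfect match, 1 = partial match, 0 = irrelevant) and the helper names are hypothetical, chosen only for illustration; the discount is the same one used in the worked DCG example (none at position 1, log2(i) afterwards).

```python
import math

def ndcg(relevances):
    """NDCG with no discount at position 1 and log2(i) from position 2 on."""
    def dcg(rels):
        return sum(r if i == 1 else r / math.log2(i)
                   for i, r in enumerate(rels, start=1))
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

def precision(relevances):
    """Binary precision: fraction of returned items with any relevance at all."""
    return sum(1 for r in relevances if r > 0) / len(relevances)

# Same five results, two different orderings (hypothetical labels).
best_first = [3, 1, 1, 0, 0]
best_last = [0, 0, 1, 1, 3]

print(precision(best_first), precision(best_last))  # 0.6 0.6
print(round(ndcg(best_first), 3))                   # 1.0
print(round(ndcg(best_last), 3))                    # 0.523
```

Precision cannot tell the two orderings apart, while NDCG drops sharply when the perfect match is pushed to the bottom of the list.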