Source: Järvelin and Kekäläinen, "Cumulated gain-based evaluation of IR techniques" (ACM TOIS, 2002); ubiquitous in modern search evaluation
Classification — The dominant offline ranking metric — graded relevance combined with logarithmic position discount.
Score a ranked result list by combining the relevance grades of its results with a position discount that rewards relevant results appearing higher, normalized to enable comparison across queries with different ideal scores.
Search quality involves both "were good results found" and "were they at the top." Simpler metrics handle one or the other: precision-at-K handles "were good results found in the top K" but ignores position within the top K; reciprocal rank handles "where was the first good result" but ignores everything after. NDCG captures both dimensions in a single principled metric with graded relevance support, making it the default offline evaluation metric for modern search systems.
DCG computation. For a ranked result list, DCG (Discounted Cumulative Gain) sums the relevance gain at each position divided by a logarithmic position discount. The standard formula: DCG = sum over positions i of (rel_i / log₂(i + 1)), where rel_i is the relevance grade at position i. Position 1 gets full credit (divisor = log₂(2) = 1); position 2 gets 0.63 credit; position 10 gets 0.29; position 100 gets 0.15. The discount falls off but never reaches zero, so relevant documents anywhere in the list contribute.
Variant with exponential gain. Some implementations use 2^rel - 1 instead of rel directly, exponentially weighting high-grade results. The variant emphasizes finding highly-relevant documents over finding more moderately-relevant ones. The exponential variant is the original Järvelin-Kekäläinen formulation; the linear variant is also widely used. The choice affects scores but not relative rankings of systems on most workloads.
Normalization. Raw DCG isn't comparable across queries because queries with more relevant documents have higher possible DCG. NDCG normalizes by dividing by the ideal DCG (the DCG achieved by ranking all relevant documents in optimal order). NDCG = DCG / ideal-DCG produces scores in [0, 1] that are comparable across queries.
Truncation: NDCG@K. Most production evaluation uses NDCG@K — NDCG computed over only the top K positions. NDCG@10 is the most common variant: it reflects "quality of the first page of results" for typical search UX. NDCG@5 or NDCG@3 emphasize top-of-page even more. The truncation choice should match the use case; e-commerce typically uses NDCG@10 or NDCG@20; question-answering may use NDCG@3 or NDCG@5.
Aggregation across queries. The per-query NDCG scores are typically averaged across the query set to produce a single system score. Macro-averaging (mean of per-query scores) is the standard. Sometimes median or other robust statistics are used to reduce influence of outlier queries.
Limitations. NDCG depends on judgment quality — garbage in, garbage out. NDCG assumes the relevance grades are accurate; biased or noisy judgments produce biased or noisy NDCG. NDCG also doesn't directly measure user outcomes (click-through, conversion, revenue); it measures a proxy for relevance that correlates with outcomes but isn't identical to them. Production teams typically track NDCG alongside business metrics.
Default offline evaluation metric for ranking quality across most search use cases. Especially appropriate for e-commerce, web search, enterprise search, and content discovery where graded relevance and top-K focus both matter.
Alternatives — MRR for known-item search where only the first relevant result matters; MAP for exhaustive retrieval (legal, scientific); P@K for simpler interpretability; ERR for cases where user-stopping models matter. NDCG is the default; specific alternatives apply when their specific dimensions matter more.
- Järvelin and Kekäläinen, "Cumulated gain-based evaluation of IR techniques" (2002)
- Manning et al., Introduction to Information Retrieval, section 8.4
- Grainger, AI-Powered Search, chapters on evaluation
Code
// NDCG@K implementation (Python)
import math
def dcg(grades, k=None):
"""Compute DCG over a list of relevance grades (already in ranked
order)."""
if k is not None:
grades = grades[:k]
return sum(
grade / math.log2(i + 2) # i+2 because i is 0-indexed; position is
i+1
for i, grade in enumerate(grades)
)
def ndcg_at_k(grades, ideal_grades, k=10):
"""NDCG@K: DCG of actual ranking divided by DCG of ideal
ranking."""
dcg_actual = dcg(grades, k=k)
dcg_ideal = dcg(sorted(ideal_grades, reverse=True), k=k)
return dcg_actual / dcg_ideal if dcg_ideal > 0 else 0.0
// Example usage with the judgment list from Section A
// Run the system, get document IDs in ranked order, look up grades:
query = "running shoes"
results = system.search(query, top_k=10) // returns ranked doc IDs
grades_in_results = [judgments[query].get(doc_id, 0) for doc_id in
results]
all_grades_for_query = list(judgments[query].values())
score = ndcg_at_k(grades_in_results, all_grades_for_query, k=10)
print(f"NDCG@10 for \'{query}\': {score:.3f}")
// Aggregate across queries
import statistics
per_query_scores = [
ndcg_at_k(
[judgments[q].get(doc_id, 0) for doc_id in system.search(q,
top_k=10)],
list(judgments[q].values()),
k=10
)
for q in test_queries
]
print(f"Mean NDCG@10: {statistics.mean(per_query_scores):.3f}")
print(f"Median NDCG@10: {statistics.median(per_query_scores):.3f}")