Source: Classical IR literature; Manning et al., Introduction to Information Retrieval; ubiquitous in evaluation tooling (trec_eval, ranx, pytrec_eval)
Classification — Alternative offline metrics that handle specific cases NDCG doesn't cover well.
Apply the right metric for cases where NDCG isn't the best fit: MRR for known-item search, MAP for exhaustive retrieval, P@K for simpler interpretability, ERR for user-stopping models.
NDCG is the default but isn't always the best fit. Known-item searches ("nike air max 270") want a metric that scores high if the right item is at the top and ignores everything else; NDCG's position discount handles this but isn't as crisp as MRR. Exhaustive retrieval (legal documents, scientific papers, e-discovery) wants a metric that rewards finding every relevant document, not just the top few; NDCG's position discount makes it less suited to this than MAP. Communication with non-technical stakeholders favors intuitive metrics; P@K ("8 of the top 10 were good") is easier to communicate than NDCG. The discipline is matching the metric to the use case rather than defaulting to one metric for everything.
MRR — Mean Reciprocal Rank. For each query, find the position of the first relevant result. Compute 1/position (1.0 for position 1, 0.5 for position 2, 0.1 for position 10, 0 if no relevant result in top K). Average across queries. The metric is tailored to known-item search: "did you put the right answer at the top, and if not, how close?" Strong for question-answering and navigational search. Weak for discovery queries where users want to explore multiple results.
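As a rough illustration of the MRR computation described above, here is a minimal Python sketch. The function names and the input layout (`runs` mapping a query ID to a ranked list of document IDs, `qrels` mapping a query ID to a set of relevant IDs) are assumptions made for this example, not the API of trec_eval, ranx, or pytrec_eval.

```python
from typing import Dict, Hashable, List, Optional, Set


def reciprocal_rank(ranked: List[Hashable], relevant: Set[Hashable],
                    k: Optional[int] = None) -> float:
    """Return 1/position of the first relevant result, or 0.0 if none is in the top k."""
    top = ranked if k is None else ranked[:k]
    for position, doc_id in enumerate(top, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Dict[str, List[Hashable]],
                         qrels: Dict[str, Set[Hashable]],
                         k: Optional[int] = None) -> float:
    """Average the reciprocal rank over every query that has judgments."""
    if not qrels:
        return 0.0
    total = sum(reciprocal_rank(runs.get(q, []), rel, k) for q, rel in qrels.items())
    return total / len(qrels)


# Hypothetical data: q1 hits at position 1, q2 at position 2, q3 misses entirely.
runs = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"], "q3": ["m", "n", "o"]}
qrels = {"q1": {"a"}, "q2": {"y"}, "q3": {"p"}}

mrr = mean_reciprocal_rank(runs, qrels, k=10)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Only the first relevant hit counts, so a second or third relevant result changes nothing; that is the property that makes the metric weak for discovery queries.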
MAP — Mean Average Precision. For each query, walk down the ranked results; at each position N where a relevant document appears, compute the precision at N (the fraction of the top N results that are relevant). Sum these precisions and divide by the total number of relevant documents for the query to get its Average Precision; relevant documents that are never retrieved contribute zero, which is how missed documents are penalized. Take the mean across queries. The metric assumes binary relevance and rewards systems that find all relevant documents, not just the top few. Strong for exhaustive retrieval (legal, scientific, e-discovery). Weak when graded relevance matters or when only the top K matters.
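The walk-through above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function names and data layout are assumptions, and it follows the common convention of dividing by the total number of relevant documents for the query, so relevant documents that were never retrieved lower the score.

```python
from typing import Dict, Hashable, List, Set


def average_precision(ranked: List[Hashable], relevant: Set[Hashable]) -> float:
    """Average Precision for one query under binary relevance.

    Assumes `ranked` contains no duplicate document IDs.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this position
    # Dividing by all relevant documents means unretrieved ones count as zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[Hashable]],
                           qrels: Dict[str, Set[Hashable]]) -> float:
    """Mean of the per-query Average Precision values."""
    if not qrels:
        return 0.0
    return sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items()) / len(qrels)


# Hypothetical data: q1 finds two of three relevant documents (positions 1 and 3).
runs = {"q1": ["d1", "d2", "d3", "d4", "d5"], "q2": ["a", "b"]}
qrels = {"q1": {"d1", "d3", "d6"}, "q2": {"b"}}

ap_q1 = average_precision(runs["q1"], qrels["q1"])  # (1/1 + 2/3) / 3 = 5/9
map_score = mean_average_precision(runs, qrels)     # (5/9 + 1/2) / 2 = 19/36
```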
P@K — Precision at K. The fraction of the top K results that are relevant. Binary relevance (a document is relevant or not). Simple and interpretable. P@1, P@5, P@10 are common. Doesn't care about position within the top K; doesn't care about anything past K. Good for high-level communication; less informative than NDCG for detailed evaluation.
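A minimal Python sketch of P@K, with a hypothetical function name and inputs. One choice made here is to divide by K even when fewer than K results are returned, so short result lists are penalized; other conventions are possible.

```python
from typing import Hashable, List, Set


def precision_at_k(ranked: List[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top k results that are relevant (binary relevance)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k  # divides by k even if fewer than k results were returned


# Hypothetical data: 8 of the top 10 results are relevant.
ranked = [f"d{i}" for i in range(1, 11)]
relevant = {f"d{i}" for i in range(1, 9)}

p_at_10 = precision_at_k(ranked, relevant, 10)                    # 8 / 10
p_at_10_reversed = precision_at_k(ranked[::-1], relevant, 10)     # same set, same score
```

Reversing the order inside the top 10 leaves the score unchanged, which illustrates that the metric is blind to position within the cutoff.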
ERR — Expected Reciprocal Rank. Models user stopping behavior: the user scans down the list and, at each position, stops with a probability that grows with that result's relevance grade. The score is the expected value of 1/position at the point where the user stops. A highly relevant result early in the list "absorbs" the user's attention; documents below it have lower expected impact. ERR is more theoretically principled than NDCG as a model of user behavior, but it is less widely adopted because its marginal benefit over NDCG is small for most workloads and rarely justifies the additional complexity.
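The stopping model can be sketched in Python as below. The sketch uses the grade-to-probability mapping from Chapelle et al., where a result with grade g stops the user with probability (2^g - 1) / 2^g_max; the function name, the integer grade scale, and the example data are assumptions made for illustration.

```python
from typing import List, Optional


def expected_reciprocal_rank(grades: List[int], max_grade: int,
                             k: Optional[int] = None) -> float:
    """ERR for one ranked list of integer relevance grades in [0, max_grade]."""
    top = grades if k is None else grades[:k]
    err = 0.0
    p_reach = 1.0  # probability the user reaches the current position
    for position, grade in enumerate(top, start=1):
        p_stop = (2 ** grade - 1) / (2 ** max_grade)  # satisfaction probability
        err += p_reach * p_stop / position
        p_reach *= 1.0 - p_stop  # user continues only if not satisfied here
    return err


# Contribution of a grade-2 document at position 2, depending on what sits above it.
gain_after_strong = (expected_reciprocal_rank([2, 2], max_grade=2)
                     - expected_reciprocal_rank([2], max_grade=2))   # 0.25 * 0.75 / 2
gain_after_weak = (expected_reciprocal_rank([0, 2], max_grade=2)
                   - expected_reciprocal_rank([0], max_grade=2))     # 1.0 * 0.75 / 2
```

The same document adds far less to the score when a highly relevant result sits above it, which is the "absorbing" effect described above.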
Recall@K. The fraction of all relevant documents that appear in the top K. Used in retrieval evaluation (first-stage retrieval in cascade architectures, Volume 1 Section D) where the question is whether relevant documents make it into the candidate set, not their final ranking. Recall@100 or Recall@1000 are common for first-stage retrieval; recall in the candidate set bounds what the reranker can achieve.
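A minimal Python sketch of Recall@K for a first-stage candidate set. The function name and the data are hypothetical; returning 0.0 for a query with no relevant documents is a choice made here, and some tools skip such queries instead.

```python
from typing import Hashable, List, Set


def recall_at_k(ranked: List[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0  # assumption: queries without relevant documents score zero
    found = set(ranked[:k]) & relevant
    return len(found) / len(relevant)


# Hypothetical first-stage output: d9 is relevant but never retrieved.
candidates = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d2", "d4", "d9"}

recall_at_3 = recall_at_k(candidates, relevant, 3)  # only d2 found: 1/3
recall_at_5 = recall_at_k(candidates, relevant, 5)  # d2 and d4 found: 2/3
```

Because d9 never enters the candidate set, no reranker operating on these candidates can surface it, which is the sense in which first-stage recall bounds downstream quality.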
When to use multiple. Production teams typically track several metrics in parallel: NDCG@10 as the headline offline metric; MRR for navigational query subsets; P@5 for stakeholder communication; Recall@100 for first-stage retrieval evaluation. Each metric catches what the others miss; using them together produces a fuller evaluation than any one alone.
- MRR: known-item search, question-answering, navigational queries.
- MAP: exhaustive retrieval, legal/scientific search, e-discovery.
- P@K: communication with non-technical stakeholders, top-of-page evaluation.
- ERR: research-grounded evaluation methodology, cases where the user-stopping model matters.
- Recall@K: first-stage retrieval evaluation in cascade architectures.
Alternatives — NDCG@K as the default for graded-relevance, top-K-focused use cases (most modern production search). The metrics in this entry are alternatives for specific cases; for general search evaluation, NDCG remains the default.
- Manning et al., Introduction to Information Retrieval, ch. 8
- Chapelle and Zhang, "Expected Reciprocal Rank for Graded Relevance" (CIKM 2009)
- trec_eval reference implementation (github.com/usnistgov/trec_eval)
- pytrec_eval / ranx Python implementations