Source: TREC methodology (Voorhees, NIST); Manning et al., Introduction to Information Retrieval; Grainger, AI-Powered Search; OpenSource Connections methodology
Classification — The methodology for constructing the foundational judgment artifact — query selection, document pooling, judgment assignment.
Build a judgment list that supports reliable offline evaluation: representative queries that cover the production query distribution, document pools that capture the candidates any system might surface, and judgments that are calibrated and reproducible across assessors.
A naive judgment list has predictable failure modes. Queries cherry-picked by the team don't reflect actual production traffic, so evaluation scores don't predict production quality. Documents judged are only those returned by the current system, so any candidate system surfacing different documents gets unfair scores (because the unjudged documents are treated as not-relevant by default). Judgments produced by a single assessor without calibration drift from any defensible relevance standard. The discipline of judgment list construction addresses these failure modes systematically.
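The second failure mode can be made concrete with a toy calculation. The sketch below is illustrative only: the document IDs, the binary judgments, and the `precision_at_k` helper are all made up for this example, and it assumes unjudged documents are scored as not relevant.

```python
# Toy illustration of unjudged-document bias. All IDs and judgments are made up.

def precision_at_k(ranked_doc_ids, judgments, k):
    """Fraction of the top-k documents judged relevant (1 = relevant, 0 = not).

    Unjudged documents are counted as not relevant, which is the default
    behavior the paragraph above warns about.
    """
    top_k = ranked_doc_ids[:k]
    relevant = sum(1 for doc_id in top_k if judgments.get(doc_id, 0) == 1)
    return relevant / k

# Judgments collected only from the baseline system's results.
judgments = {"d1": 1, "d2": 1, "d3": 0}

baseline_ranking = ["d1", "d2", "d3"]
# The candidate surfaces d7 and d8, which nobody judged; they may well be relevant.
candidate_ranking = ["d1", "d7", "d8"]

baseline_score = precision_at_k(baseline_ranking, judgments, k=3)
candidate_score = precision_at_k(candidate_ranking, judgments, k=3)

# Documents the candidate returned that have no judgment at all.
unjudged = [d for d in candidate_ranking if d not in judgments]
print(baseline_score, candidate_score, unjudged)
```

The candidate scores lower here purely because two of its results were never judged, not because anyone found them irrelevant. Reporting the unjudged count next to the metric is one simple way to spot this situation.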
Query sampling. The query list should reflect production traffic, not the team's intuitions about important queries. Sample from production query logs, stratified by query characteristics: head queries (high frequency, narrow set), torso queries (medium frequency, broader set), tail queries (low frequency, very diverse). A judgment list with only head queries misses the long-tail quality issues that drive user frustration. A typical production judgment list might have 200–500 queries spanning the frequency distribution.
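A minimal sketch of stratified sampling from a query log follows. The function name, the frequency cutoffs (`head_cutoff`, `torso_cutoff`), and the toy log are assumptions for illustration; real cutoffs depend on the shape of the traffic distribution.

```python
import random
from collections import Counter

def stratified_query_sample(query_log, per_stratum, head_cutoff=100,
                            torso_cutoff=10, seed=0):
    """Sample distinct queries from head, torso, and tail frequency bands.

    query_log: iterable of raw query strings, one entry per search.
    head_cutoff / torso_cutoff: hypothetical frequency thresholds that
    separate the bands; tune them to the actual log.
    """
    # Light normalization so trivially different spellings count together.
    counts = Counter(q.strip().lower() for q in query_log)

    strata = {"head": [], "torso": [], "tail": []}
    for query, freq in counts.items():
        if freq >= head_cutoff:
            strata["head"].append(query)
        elif freq >= torso_cutoff:
            strata["torso"].append(query)
        else:
            strata["tail"].append(query)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = {}
    for name, queries in strata.items():
        queries.sort()  # stable order before sampling
        size = min(per_stratum, len(queries))
        sample[name] = rng.sample(queries, size)
    return sample

# Toy log: one head, one torso, and one tail query.
log = (["running shoes"] * 150
       + ["trail running shoes"] * 20
       + ["waterproof trail shoes size 11"] * 2)
sample = stratified_query_sample(log, per_stratum=5)
print(sample)
```

Sampling a fixed number per stratum deliberately over-represents the tail relative to raw traffic volume; if metrics should reflect traffic share, the per-stratum results can be weighted when they are aggregated.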
Pooling. Each query needs a pool of documents to judge. The naive approach, judging only what the current system returns, produces the bias described above. The TREC-style pooling approach: run multiple candidate systems against the query, take the top K from each, union the results, and judge everything in the union. The pool then includes documents the current system missed but candidates surface, which removes the bias for systems that contributed to the pool and reduces it for systems that did not. Pool depth (K) is typically 20–100; the deeper the pool, the better the evaluation supports systems that differ substantially from the baseline.
Relevance grading scale. The choice of scale shapes the judgment task. Binary (relevant / not relevant) is simple but loses information about the degree of relevance. Graded scales (typically 0–4 or 0–5) capture more nuance but require clearer annotation guidelines. A common scale: 0 (not relevant), 1 (related but not what the user wants), 2 (relevant), 3 (highly relevant), 4 (perfect match). The scale should match the relevance definition the team uses; consistency in interpretation matters more than the specific numeric range.
Annotation guidelines. Written guidelines explain what each grade means with concrete examples and edge cases. Without guidelines, different assessors interpret "relevant" differently and inter-annotator agreement is low. With guidelines, agreement is higher, judgments are reproducible, and disputes about specific judgments can be resolved by reference to the guideline. Annotation guideline development is itself a substantial discipline; production teams iterate on guidelines as edge cases are discovered.
Inter-annotator agreement. Measure how consistently different assessors grade the same query-document pairs. Common metrics: Cohen's kappa for binary judgments, weighted kappa for graded scales. Agreement scores below 0.6 suggest the annotation task or guidelines need work; scores above 0.8 are excellent. Low agreement means the judgment list is noisy; high agreement means it's reliable. Production teams typically maintain agreement targets and adjust guidelines when agreement drifts.
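Agreement can only be measured on pairs that more than one assessor graded, so a useful first step is extracting that overlap. The sketch below is one way to do it, assuming judgments are available as (query, document_id, grade, assessor) tuples; the `paired_grades` helper and the sample rows are hypothetical.

```python
from collections import defaultdict

def paired_grades(rows, assessor_a, assessor_b):
    """Return two aligned grade lists for pairs that both assessors judged.

    rows: iterable of (query, document_id, grade, assessor) tuples.
    Pairs judged by only one of the two assessors are skipped.
    """
    by_pair = defaultdict(dict)
    for query, doc_id, grade, assessor in rows:
        by_pair[(query, doc_id)][assessor] = grade

    grades_a, grades_b = [], []
    for pair in sorted(by_pair):  # deterministic order
        grades = by_pair[pair]
        if assessor_a in grades and assessor_b in grades:
            grades_a.append(grades[assessor_a])
            grades_b.append(grades[assessor_b])
    return grades_a, grades_b

# Made-up rows: two pairs are double-judged, one is not.
rows = [
    ("running shoes", "prod_12345", 3, "assessor_01"),
    ("running shoes", "prod_12345", 2, "assessor_02"),
    ("running shoes", "prod_12346", 2, "assessor_01"),
    ("running shoes", "prod_12346", 2, "assessor_02"),
    ("running shoes", "prod_12347", 0, "assessor_01"),  # single assessor: skipped
]
grades_a, grades_b = paired_grades(rows, "assessor_01", "assessor_02")
print(grades_a, grades_b)
```

The two aligned lists can then be passed to a kappa implementation such as scikit-learn's `cohen_kappa_score`. A meaningful estimate needs far more than two overlapping pairs, so a deliberate double-judging sample is usually planned into the assignment process.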
Maintenance. Judgment lists go stale. The corpus changes (new products, retired documents). User queries evolve (seasonal shifts, new terms). Production system changes alter what gets surfaced. A judgment list that worked well last quarter may not capture current quality concerns. Maintenance patterns: quarterly judgment refresh for active deployments; add new queries based on production trends; re-judge documents whose content changed; retire queries that no longer reflect current traffic.
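Two of the maintenance patterns above, re-judging changed documents and retiring stale queries, lend themselves to a simple check. The following is a sketch under stated assumptions: the `judgments_to_refresh` helper, the dictionary keys, and the example data are invented for illustration, and it assumes a last-modified date is available per document.

```python
from datetime import date

def judgments_to_refresh(judgments, doc_last_modified, active_queries):
    """Split judgments into those to re-judge and those to retire.

    judgments: list of dicts with keys query, document_id, judged_at (date).
    doc_last_modified: dict of document_id -> date of last content change;
        documents missing from it are treated as removed from the corpus.
    active_queries: set of queries still seen in recent traffic.
    """
    rejudge, retire = [], []
    for j in judgments:
        if j["query"] not in active_queries:
            retire.append(j)  # query no longer reflects current traffic
        elif j["document_id"] not in doc_last_modified:
            retire.append(j)  # document was retired from the corpus
        elif doc_last_modified[j["document_id"]] > j["judged_at"]:
            rejudge.append(j)  # content changed after it was judged
    return rejudge, retire

# Made-up example data.
judged = date(2026, 1, 10)
judgments = [
    {"query": "running shoes", "document_id": "prod_1", "judged_at": judged},
    {"query": "running shoes", "document_id": "prod_2", "judged_at": judged},
    {"query": "holiday gift sets", "document_id": "prod_3", "judged_at": judged},
    {"query": "running shoes", "document_id": "prod_4", "judged_at": judged},
]
doc_last_modified = {
    "prod_1": date(2025, 12, 1),   # unchanged since judging
    "prod_2": date(2026, 3, 5),    # edited after judging
    "prod_3": date(2025, 11, 20),
}
active_queries = {"running shoes"}

rejudge, retire = judgments_to_refresh(judgments, doc_last_modified, active_queries)
print([j["document_id"] for j in rejudge], [j["document_id"] for j in retire])
```

Adding new queries from production trends is not covered here; it amounts to re-running query sampling on a recent log window and judging whatever is not yet in the list.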
Every production search team that does offline evaluation needs judgment lists. The investment is substantial but compounds across all subsequent tuning work; the alternative is each evaluation reinventing its own judgments. The pattern is foundational rather than optional.
Alternatives — pure online evaluation (A/B testing) for teams that can't justify judgment list investment but have enough traffic to support fast online experiments. LLM-as-judge (Section C) as a lower-cost approximation, with the caveats covered there. Most teams use judgment lists as the foundation and supplement with online evaluation; pure-online is rare in production search.
- Voorhees, "Variations in relevance judgments and the measurement of retrieval effectiveness" (TREC methodology, 1998)
- Manning, Raghavan, Schütze, Introduction to Information Retrieval, ch. 8 on evaluation
- Trey Grainger, AI-Powered Search, chapters on relevance judgment
- OpenSource Connections Quepid (quepid.com) for judgment list management
Code
// Example judgment list format (CSV)
// Columns: query, document_id, grade, judged_at, assessor
// Grade scale: 0=not relevant, 1=related, 2=relevant, 3=highly relevant
query,document_id,grade,judged_at,assessor
"running shoes",prod_12345,3,2026-04-15,assessor_01
"running shoes",prod_12346,2,2026-04-15,assessor_01
"running shoes",prod_12347,0,2026-04-15,assessor_01
"running shoes",prod_98765,3,2026-04-15,assessor_02
"trail running shoes",prod_12345,2,2026-04-15,assessor_01
"trail running shoes",prod_45678,3,2026-04-15,assessor_01
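# Example pooling script (Python sketch)
from collections import defaultdict

def build_judgment_pool(queries, candidate_systems, pool_depth=20):
    """Union the top-K results of every candidate system, per query."""
    # Assumes each system exposes search(query, top_k=...) and returns
    # an iterable of document IDs; adapt this to the real client API.
    pool = defaultdict(set)
    for query in queries:
        for system in candidate_systems:
            results = system.search(query, top_k=pool_depth)
            pool[query].update(results)
    return pool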
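# Example inter-annotator agreement check (Python)
from sklearn.metrics import cohen_kappa_score

# Grades from two assessors for the same query-document pairs, in the same
# order. The values are illustrative only.
assessor_a_grades = [3, 2, 0, 3, 1, 2]
assessor_b_grades = [3, 2, 1, 3, 1, 3]

agreement = cohen_kappa_score(
    assessor_a_grades,
    assessor_b_grades,
    weights="quadratic",  # weighted kappa, for graded scales
)
print(f"Weighted kappa: {agreement:.3f}")  # target > 0.6, ideally > 0.8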