RelevantSearch.AI
Pattern · Volume 05 · Section C --- Judgment collection methods · Updated May 2026

Explicit expert labeling

Source: Search relevance practitioner methodology; Quepid (OpenSource Connections); enterprise search teams at major e-commerce and content companies

Classification — Judgment collection by in-house specialists or trained domain experts who assign relevance grades according to annotation guidelines.

Intent

Produce high-quality relevance judgments by using assessors who understand the domain, the relevance definition, and the edge cases, accepting higher cost in exchange for higher quality.

Motivating Problem

Crowdsourced labels are cheap but noisy. Implicit signals are scalable but biased. LLM-as-judge is fast but encodes model biases. Some uses require the quality of expert labeling: small gold sets that validate other judgment sources, high-stakes domains where lower-quality judgments are unacceptable, and calibration of annotation guidelines. Expert labeling can't scale to the volume that crowdsourcing handles, but for the cases that need it, nothing else substitutes.

How It Works

Assessor selection. Domain expertise matters: e-commerce search assessors should understand the product domain; legal search assessors should understand legal relevance; healthcare assessors should have appropriate clinical knowledge. The assessors are typically in-house staff (search quality team, product specialists) or contracted SMEs (consultants, retired experts). Throughput is low — 50–200 judgments per hour per assessor depending on domain complexity — and cost is high (assessor time at SME rates).
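The throughput range above translates directly into a planning estimate. The following is a minimal sketch, not a costing model: the function name, the gold-set size (200 queries with 10 documents each), and the optional hourly rate are all illustrative assumptions.

```python
def labeling_effort(num_pairs, judgments_per_hour, assessors_per_pair=1, hourly_rate=None):
    """Estimate assessor hours (and optionally cost) for a labeling batch.

    All parameters are illustrative planning inputs, not measured values.
    """
    if judgments_per_hour <= 0:
        raise ValueError("judgments_per_hour must be positive")
    total_judgments = num_pairs * assessors_per_pair
    hours = total_judgments / judgments_per_hour
    cost = None if hourly_rate is None else hours * hourly_rate
    return hours, cost


# Hypothetical gold set: 200 queries with 10 judged documents each.
pairs = 200 * 10
slow_hours, _ = labeling_effort(pairs, judgments_per_hour=50)    # complex domain
fast_hours, _ = labeling_effort(pairs, judgments_per_hour=200)   # simple domain
print(f"{fast_hours:.0f} to {slow_hours:.0f} assessor hours")    # 10 to 40 assessor hours
```

Setting assessors_per_pair above 1 accounts for the overlap needed to measure inter-annotator agreement, which multiplies the hours accordingly.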

Annotation tooling. Quepid (OpenSource Connections, free and open source) is the leading judgment management tool. It presents query-document pairs to assessors with the document content displayed; assessors assign grades with single-key input; and the tool tracks assessor identity, timestamps, and inter-annotator agreement. Custom tools built on internal infrastructure are common for teams with specific requirements. The tool matters: a well-designed tool can double assessor throughput compared with a poorly designed one.

Calibration sessions. Before independent labeling, assessors work through shared examples together, discussing edge cases and arriving at consistent interpretations of the relevance scale. Assessors who have calibrated together agree more often than assessors who haven't. Monthly or quarterly recalibration sessions maintain agreement over time as new edge cases emerge.

Quality measurement. Inter-annotator agreement (Cohen's kappa or weighted kappa for graded scales) measures how consistently different assessors grade the same items. A subset of items is judged by multiple assessors to enable the measurement. Agreement scores below 0.6 trigger review of guidelines or recalibration; scores above 0.8 indicate the labeling task is well-defined and assessors are aligned.
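As a concrete illustration of the agreement measure, here is a minimal pure-Python sketch of Cohen's kappa for two assessors, with optional linear or quadratic weighting for graded scales. It assumes grades are integers from 0 to num_grades - 1; the function name and sample grades are hypothetical, and a production team might prefer an established statistics library.

```python
from collections import Counter


def cohens_kappa(a, b, weights=None, num_grades=None):
    """Cohen's kappa for two assessors who graded the same items.

    weights: None (unweighted), "linear", or "quadratic".
    Grades are assumed to be integers in 0..num_grades-1.
    """
    if weights not in (None, "linear", "quadratic"):
        raise ValueError("weights must be None, 'linear', or 'quadratic'")
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty grade lists")
    k = num_grades if num_grades is not None else max(max(a), max(b)) + 1
    if any(g < 0 or g >= k for g in list(a) + list(b)):
        raise ValueError("grade outside 0..num_grades-1")
    if k < 2:
        return 1.0  # a one-grade scale leaves no room to disagree
    n = len(a)

    def disagreement(i, j):
        # Penalty for one assessor grading i while the other grades j.
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    pair_counts = Counter(zip(a, b))
    marg_a, marg_b = Counter(a), Counter(b)

    observed = sum(disagreement(i, j) * c for (i, j), c in pair_counts.items()) / n
    expected = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j]
        for i in range(k) for j in range(k)
    ) / (n * n)
    if expected == 0:
        return 1.0  # both assessors used the same single grade throughout
    return 1.0 - observed / expected


# Hypothetical overlap sample on a 0-3 scale: one near-miss disagreement.
assessor_1 = [0, 1, 2, 3]
assessor_2 = [0, 1, 2, 2]
print(f"{cohens_kappa(assessor_1, assessor_2, num_grades=4):.3f}")                       # 0.667
print(f"{cohens_kappa(assessor_1, assessor_2, weights='quadratic', num_grades=4):.3f}")  # 0.875
```

The weighted variants penalize a 2-versus-3 disagreement less than a 0-versus-3 disagreement, which is why they are usually the better fit for graded relevance scales.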

Workload management. Expert assessors are expensive; their time should be used on the highest-value judgments. Common patterns: judge new queries that production logs surface; judge query-document pairs near decision boundaries (where current and candidate systems disagree); judge the gold set used to validate other judgment sources; spot-check crowdsourced or LLM-judge outputs. Production teams typically maintain an explicit priority order across these queues.
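One of these patterns, judging pairs where the current and candidate systems disagree, can be sketched as a simple set difference over top-k results. This is an assumed interpretation of "disagree" (a document appears in one system's top k but not the other's); the function name and data shapes are hypothetical.

```python
def disagreement_candidates(current, candidate, judged, k=10):
    """Return unjudged (query, doc) pairs that only one system ranks in its top k.

    current, candidate: dict mapping query -> ranked list of doc ids.
    judged: set of (query, doc) pairs that already have an expert grade.
    """
    to_judge = []
    for query in sorted(set(current) & set(candidate)):
        top_cur = current[query][:k]
        top_new = candidate[query][:k]
        only_one = set(top_cur) ^ set(top_new)  # symmetric difference
        # Walk the rankings in order so higher-ranked documents come first.
        for doc in top_cur + top_new:
            if doc in only_one and (query, doc) not in judged:
                to_judge.append((query, doc))
    return to_judge


# Hypothetical rankings for a single query.
current = {"running shoes": ["a", "b", "c"]}
candidate = {"running shoes": ["a", "d", "b"]}
already_judged = {("running shoes", "b")}
print(disagreement_candidates(current, candidate, already_judged, k=2))
# [('running shoes', 'd')]
```

Documents that both systems return in the top k are skipped because judging them cannot change which system wins on a set-based metric, although rank-sensitive metrics may still need them.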

Documentation. Maintain written annotation guidelines that capture the relevance definition, scale interpretation, edge cases, and decision rules. The guidelines are versioned; changes to guidelines may require re-judgment of affected items. The guidelines are the institutional knowledge of the search team's relevance discipline; without them, expert labeling doesn't survive assessor turnover.
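Versioned guidelines are easier to act on when each judgment records the guideline version it was made under. The sketch below is one hypothetical way to find items that need re-judgment after a guideline change; the Judgment fields, the tags mechanism for marking edge cases, and the function name are all assumptions rather than the schema of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    query: str
    doc_id: str
    grade: int
    assessor: str
    guideline_version: int
    tags: frozenset = frozenset()  # hypothetical edge-case labels, e.g. {"accessory"}


def needs_rejudgment(judgments, changed_version, affected_tags):
    """Judgments made before `changed_version` that touch an affected edge case."""
    affected = frozenset(affected_tags)
    return [
        j for j in judgments
        if j.guideline_version < changed_version and j.tags & affected
    ]


# Hypothetical history: version 3 of the guidelines changed how accessories are graded.
history = [
    Judgment("iphone", "case-123", 2, "ana", 2, frozenset({"accessory"})),
    Judgment("iphone", "phone-456", 3, "ana", 2),
    Judgment("iphone", "case-789", 1, "ben", 3, frozenset({"accessory"})),
]
stale = needs_rejudgment(history, changed_version=3, affected_tags={"accessory"})
print([j.doc_id for j in stale])  # ['case-123']
```

Only the judgment made under the older version and tagged with the affected edge case is flagged, which keeps re-judgment work proportional to the scope of the guideline change.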

When to Use It

Small gold sets (50–500 queries) used to validate other judgment sources. High-stakes domains (legal, medical, regulated) where lower-quality judgments are unacceptable. Calibration of annotation guidelines that other methods will follow. Edge cases requiring domain expertise to judge correctly. Annual or semi-annual high-quality evaluations that call for gold-standard judgments.

Alternatives — crowdsourced labeling for larger judgment volume at lower cost (next entry). Implicit signals for very large scale. LLM-as-judge for fast iteration. Expert labeling is the highest-quality method; the alternatives substitute for scale or cost reasons.

Sources
  • Quepid (quepid.com / github.com/o19s/quepid) for judgment management
  • OpenSource Connections methodology writings
  • Trey Grainger, AI-Powered Search, chapters on relevance judgment workflow
