Source: Emerging methodology through 2023–2026; Thomas et al. (Microsoft) on LLM relevance assessment; vendor and practitioner reports
Classification — Using large language models to generate relevance judgments for query-document pairs.
Generate relevance judgments at scale using LLMs as automated assessors, accepting model-specific biases in exchange for low cost and high throughput, with explicit validation against expert gold sets.
Explicit labeling is high-quality but expensive and slow. Implicit signals are scalable but biased and require modeling infrastructure. Crowdsourced labels are intermediate on both dimensions but require management overhead. LLM-as-judge fills a gap: it costs less than explicit labeling, is faster than crowdsourcing, and, unlike implicit signals, does not depend on production data. The trade-off is that LLM judgments encode model biases that may not match the team's relevance definition; validation against expert labels is essential to know when LLM judgments are reliable enough to use.
Prompt design. The LLM is given the query, the document content, and the relevance scale with annotation guidelines, and it produces a grade. Prompt quality substantially affects judgment quality: vague prompts produce inconsistent grades, while detailed prompts with examples produce more consistent and accurate grades. Production teams iterate on prompts, using agreement with the gold set to decide whether a revision is an improvement.
Few-shot examples. Include examples in the prompt showing how each grade should be applied. The examples calibrate the model's interpretation of the scale. Without examples, models may interpret "relevant" differently than the team intends; with examples, the interpretation is more aligned. Example selection matters: include examples covering common cases, edge cases, and known difficult cases.
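One way to keep example selection honest is to build the few-shot block from a list of labeled examples and refuse to build it when a grade has no example. The sketch below assumes a 0-3 scale; the function name buildFewShotBlock and the fields query, title, grade, and rationale are hypothetical and not taken from any library.

```javascript
// Hypothetical helper: renders labeled examples as a few-shot block and
// checks that every grade on the scale is illustrated at least once.
function buildFewShotBlock(examples, grades = [0, 1, 2, 3]) {
  const covered = new Set(examples.map((ex) => ex.grade));
  const missing = grades.filter((g) => !covered.has(g));
  if (missing.length > 0) {
    throw new Error(`No few-shot example for grade(s): ${missing.join(', ')}`);
  }
  return examples
    .map((ex) =>
      `Query: "${ex.query}"\nProduct: "${ex.title}"\nGrade: ${ex.grade} (${ex.rationale})`)
    .join('\n\n');
}
```

A coverage check like this catches the common mistake of calibrating only the extremes of the scale and leaving a middle grade without an example.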
Model selection. Larger and more capable models typically produce better judgments. Frontier models such as Claude Opus, GPT-5, and the full-size Gemini models all produce reasonable judgments on standard relevance tasks. Smaller models can work for narrow domains with good prompting. The choice trades cost per judgment against judgment quality; gold set validation tells you whether the quality is sufficient.
Validation against gold sets. Before relying on LLM-as-judge, validate against a small expert-labeled gold set: how often does the LLM's grade match the expert's grade? Agreement metrics (the same kappa metrics used for inter-annotator agreement, with weighted kappa suiting an ordinal scale) show whether LLM-expert agreement approaches the agreement experts reach with one another. Typical reported results as of 2026: well-prompted strong models reach kappa 0.6–0.8 with experts on standard tasks, comparable to crowdsourced labels.
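Before computing kappa, it often helps to look at raw agreement and at where the disagreements fall. The following is a minimal sketch, assuming integer grades 0-3 and two equal-length arrays; agreementReport is a hypothetical name, and the "within one grade" figure is an extra diagnostic added here rather than a metric the text prescribes.

```javascript
// Hypothetical helper: exact agreement, within-one-grade agreement, and a
// confusion matrix indexed as confusion[expertGrade][llmGrade].
function agreementReport(llmGrades, expertGrades, numGrades = 4) {
  const n = llmGrades.length;
  if (n === 0 || n !== expertGrades.length) {
    throw new Error('Grade arrays must be non-empty and of equal length');
  }
  const confusion = Array.from({ length: numGrades }, () => new Array(numGrades).fill(0));
  let exact = 0;
  let withinOne = 0;
  for (let i = 0; i < n; i++) {
    const expert = expertGrades[i];
    const llm = llmGrades[i];
    confusion[expert][llm] += 1;
    if (expert === llm) exact += 1;
    if (Math.abs(expert - llm) <= 1) withinOne += 1;
  }
  return { exactAgreement: exact / n, withinOneAgreement: withinOne / n, confusion };
}
```

Off-diagonal mass concentrated in one cell (for example, experts assigning 1 where the model assigns 2) usually points to a guideline the prompt does not state clearly, which is more actionable than a single summary number.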
Drift over time. Hosted LLMs are updated, so the same prompt may produce different grades after a model update. Production teams should pin to specific model versions when reproducibility matters, rerun gold set validation after model updates, and monitor for drift in judgment patterns.
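A simple drift check compares grades stored from an earlier run with grades from a fresh run over the same pairs. This sketch assumes each run is a plain object mapping a pair ID to a grade; detectDrift and the 5% threshold are illustrative assumptions, not recommended values.

```javascript
// Hypothetical helper: compares two runs of the judge on the same pairs.
// baseline and current map pairId -> grade; only shared IDs are compared.
function detectDrift(baseline, current, maxChangedFraction = 0.05) {
  const ids = Object.keys(baseline).filter((id) =>
    Object.prototype.hasOwnProperty.call(current, id));
  if (ids.length === 0) {
    throw new Error('No overlapping pair IDs between runs');
  }
  const changedIds = ids.filter((id) => baseline[id] !== current[id]);
  const mean = (run) => ids.reduce((sum, id) => sum + run[id], 0) / ids.length;
  const changedFraction = changedIds.length / ids.length;
  return {
    changedFraction,
    meanShift: mean(current) - mean(baseline), // > 0 means the judge became more lenient
    changedIds,
    drifted: changedFraction > maxChangedFraction,
  };
}
```

Running this on the gold set after every model or prompt change, and storing the model version alongside each baseline, gives a cheap signal for when full revalidation is needed.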
Limitations and biases. LLMs encode biases from their training data. They may favor certain types of content, certain writing styles, certain document structures in ways that don't match the team's relevance definition. They may struggle with domain-specific relevance (legal, medical, technical) that requires expertise the model lacks. They may produce confident judgments on cases where uncertainty is appropriate. Production use of LLM-as-judge should always include monitoring for these failure modes.
Rapid iteration during development when fast judgment turnaround matters more than peak quality. Filling gaps in explicitly labeled judgment lists with comparable but cheaper labels. Pre-screening query-document pairs for expert labeling (the LLM identifies likely-relevant candidates; experts confirm). Bootstrapping initial judgment lists for new domains where no labels exist yet.
Alternatives — explicit expert labeling for high-stakes judgments and gold sets. Crowdsourced labeling for cases where human assessor judgment is needed but expert is too expensive. Implicit signals at production scale. LLM-as-judge is best as a complement to these methods, not as a replacement.
- Thomas et al., "Large Language Models can Accurately Predict Searcher Preferences" (Microsoft, 2024)
- Various vendor and practitioner reports on LLM-as-judge methodology
- Anthropic and OpenAI documentation on building evaluation pipelines with LLMs
Code
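// LLM-as-judge for query-product pairs: prompt template, judge call, gold set validation.
// Assumes the official Anthropic SDK (@anthropic-ai/sdk) with ANTHROPIC_API_KEY set
// in the environment.
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Prompt template (production-quality); llmJudge fills the {placeholders}
// and returns a grade 0-3 for a query-document pair.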
const JUDGMENT_PROMPT = `
You are evaluating search result relevance for an e-commerce site selling outdoor gear.

Rate the relevance of the product to the user's query on this scale:
0 - Not relevant. The product does not match what the user is looking for.
1 - Related. The product is in a related category but is not what the user wants.
2 - Relevant. The product matches the user's query and would likely satisfy them.
3 - Highly relevant. The product is an excellent match for the query.

Examples:

Query: "running shoes"
Product: "Nike Pegasus 40 Men's Running Shoes"
Grade: 3 (highly relevant - directly matches the query)

Query: "running shoes"
Product: "Nike Pegasus 40 Women's Running Shoes"
Grade: 3 (highly relevant - the query does not specify gender)

Query: "running shoes"
Product: "Adidas Ultraboost Casual Sneakers"
Grade: 1 (related - shoes but not running shoes specifically)

Query: "running shoes"
Product: "Running Belt for Phone and Keys"
Grade: 0 (not relevant - accessory for running, not shoes)

Now rate the following:
Query: "{query}"
Product: "{document_title}"
Description: "{document_description}"

Respond with ONLY a single digit (0, 1, 2, or 3) and no other text.
`;
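
// Grades one query-document pair. Returns 0-3, or null if the reply is unusable.
async function llmJudge(query, document) {
  // Function replacers insert the text literally; a string replacement would
  // interpret sequences such as "$&" or "$1" inside user-supplied text.
  const prompt = JUDGMENT_PROMPT
    .replace('{query}', () => query)
    .replace('{document_title}', () => document.title)
    .replace('{document_description}', () => document.description);

  const response = await anthropic.messages.create({
    model: 'claude-opus-4-7',
    max_tokens: 5,
    messages: [{ role: 'user', content: prompt }]
  });

  const text = response.content[0].text.trim();
  const grade = Number.parseInt(text, 10);
  if (![0, 1, 2, 3].includes(grade)) {
    console.warn(`Unexpected LLM response: ${text}`);
    return null;
  }
  return grade;
}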
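
// Validate against gold set.
// cohenKappaWeighted is not defined in this listing; supply an implementation
// of weighted Cohen's kappa for ordinal grades.
async function validateAgainstGold(goldJudgments) {
  const llmGrades = [];
  const expertGrades = [];
  let skipped = 0;
  for (const { query, document, expertGrade } of goldJudgments) {
    const llmGrade = await llmJudge(query, document);
    if (llmGrade !== null) {
      llmGrades.push(llmGrade);
      expertGrades.push(expertGrade);
    } else {
      skipped += 1;
    }
  }
  // Skipped pairs are excluded from kappa, so report them rather than hide them.
  if (skipped > 0) {
    console.warn(`${skipped} gold pairs skipped (unparseable LLM response)`);
  }
  return cohenKappaWeighted(llmGrades, expertGrades); // target > 0.6
}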
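
The validation function above calls cohenKappaWeighted, which the listing does not define. A minimal sketch of one possible implementation follows, assuming integer grades 0-3 and quadratic disagreement weights by default; the choice of weighting and the numGrades and weighting parameters are assumptions made here, not something the source specifies.

```javascript
// Weighted Cohen's kappa for two raters on an ordinal scale 0..numGrades-1.
// kappa = 1 - (weighted observed disagreement) / (weighted chance disagreement)
function cohenKappaWeighted(ratingsA, ratingsB, numGrades = 4, weighting = 'quadratic') {
  const n = ratingsA.length;
  if (n === 0 || n !== ratingsB.length) {
    throw new Error('Rating arrays must be non-empty and of equal length');
  }
  // Observed joint proportions and each rater's marginal proportions.
  const observed = Array.from({ length: numGrades }, () => new Array(numGrades).fill(0));
  const marginA = new Array(numGrades).fill(0);
  const marginB = new Array(numGrades).fill(0);
  for (let i = 0; i < n; i++) {
    observed[ratingsA[i]][ratingsB[i]] += 1 / n;
    marginA[ratingsA[i]] += 1 / n;
    marginB[ratingsB[i]] += 1 / n;
  }
  let observedDisagreement = 0;
  let expectedDisagreement = 0;
  for (let i = 0; i < numGrades; i++) {
    for (let j = 0; j < numGrades; j++) {
      const distance = Math.abs(i - j) / (numGrades - 1);
      const weight = weighting === 'quadratic' ? distance * distance : distance;
      observedDisagreement += weight * observed[i][j];
      expectedDisagreement += weight * marginA[i] * marginB[j];
    }
  }
  // If both raters used one and the same grade throughout, kappa is undefined.
  if (expectedDisagreement === 0) return NaN;
  return 1 - observedDisagreement / expectedDisagreement;
}
```

Quadratic weights penalize a 0-versus-3 disagreement far more than a 2-versus-3 disagreement, which usually matches how graded relevance is used; passing 'linear' as the fourth argument gives the gentler linear variant.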