Source: Production methodology; literature on LLM-as-judge (Zheng et al. 2023 MT-Bench, RAGAS, TruLens); current vendor documentation
Classification — Pattern for using a separate LLM to evaluate the quality of LLM-augmented search outputs, at scales that would be impractical for human judgment.
Provide judgment signal at scale by using an LLM to assess relevance of retrieved passages, faithfulness of synthesized answers, and citation correctness — with appropriate calibration against human judgment as ground truth.
LLM-augmented systems need more evaluation than human judgment can supply. Human judgment is high-quality but expensive and slow: roughly 50 judgments per hour per judge. A production system may need thousands of judgments per day for regression detection, A/B testing, and drift detection, which human judges alone cannot deliver.
LLM-as-judge fills the gap: a separate LLM call examines a (query, output) pair and rates quality. Calibrated against human judgment on a sample, LLM-as-judge produces useful evaluation signal at scale. The patterns here document the calibration discipline that makes LLM-as-judge reliable.
Define the judgment task precisely. 'Relevance' is too vague; the LLM judge needs a clear specification. Production patterns: write a judging rubric that defines specific levels (e.g., 3=highly relevant, 2=partially relevant, 1=marginally relevant, 0=irrelevant); include examples of each level; specify edge cases. The rubric is itself a versioned artifact that improves over time.
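One way to treat the rubric as a versioned artifact is to keep it as structured data and render it into the judge prompt. The sketch below assumes Python 3.9+; the class names, the `passage-relevance` rubric, its level descriptions, examples, and edge case are all hypothetical illustrations, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricLevel:
    score: int
    label: str
    description: str
    example: str


@dataclass(frozen=True)
class Rubric:
    name: str
    version: str  # bump whenever wording, levels, or examples change
    levels: tuple[RubricLevel, ...]
    edge_cases: tuple[str, ...] = ()

    def valid_scores(self) -> frozenset[int]:
        """Scores the judge is allowed to return under this rubric."""
        return frozenset(level.score for level in self.levels)

    def render(self) -> str:
        """Render the rubric as plain text for inclusion in a judge prompt."""
        lines = [f"Rubric: {self.name} (version {self.version})"]
        for level in sorted(self.levels, key=lambda lv: lv.score, reverse=True):
            lines.append(f"{level.score} = {level.label}: {level.description}")
            lines.append(f"    Example: {level.example}")
        if self.edge_cases:
            lines.append("Edge cases:")
            lines.extend(f"- {case}" for case in self.edge_cases)
        return "\n".join(lines)


# Hypothetical rubric content, for illustration only.
RELEVANCE_RUBRIC = Rubric(
    name="passage-relevance",
    version="1.0.0",
    levels=(
        RubricLevel(3, "highly relevant", "Directly answers the query.",
                    "Query asks how to rotate an API key; passage gives the steps."),
        RubricLevel(2, "partially relevant", "Addresses the topic but omits key details.",
                    "Passage explains what API keys are but not how to rotate them."),
        RubricLevel(1, "marginally relevant", "Mentions the topic only in passing.",
                    "Passage is a changelog that mentions key rotation once."),
        RubricLevel(0, "irrelevant", "Unrelated to the query.",
                    "Passage describes billing plans."),
    ),
    edge_cases=("Score a passage on its content even if it is outdated; "
                "staleness is judged separately.",),
)

print(RELEVANCE_RUBRIC.render())
```

Storing the version alongside every judgment makes it possible to tell later which rubric revision produced a given score.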
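Choose the judge model. Stronger models generally produce more reliable judgments but cost more. Production patterns: use a judge from a stronger model tier than the generator (if generation uses Sonnet, judge with Opus or above); choose between same-vendor and cross-vendor judging according to how much independence is needed, keeping in mind that a judge from the generator's own model family is exposed to self-preference bias; pin the judge to a specific model version for stability.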
Prompt design. The judge prompt should provide the rubric clearly, show the query and the output, and specify the output format (a single integer score, or structured JSON). Production patterns: ask for the explanation before the score (this tends to improve accuracy); include few-shot examples in the prompt; require exact-format output so the reply can be parsed mechanically.
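A minimal sketch of the prompt-and-parse step, assuming the judge is asked for a JSON object with an `explanation` field followed by a `score` field. The template wording, the function names, and the default 0 to 3 score range are assumptions made for illustration; the actual model call is left out because it depends on the vendor SDK in use.

```python
import json

# Hypothetical template; the rubric text would come from the versioned rubric.
JUDGE_PROMPT_TEMPLATE = """You are grading a search output against a rubric.

{rubric}

Query:
{query}

Output to grade:
{output}

First explain your reasoning, then give the score.
Respond with only a JSON object of the form:
{{"explanation": "<one or two sentences>", "score": <integer>}}
"""


def build_judge_prompt(rubric_text: str, query: str, output: str) -> str:
    """Fill the template; few-shot examples could be appended to rubric_text."""
    return JUDGE_PROMPT_TEMPLATE.format(rubric=rubric_text, query=query, output=output)


def parse_judgment(raw: str, valid_scores=frozenset({0, 1, 2, 3})) -> tuple[int, str]:
    """Strictly parse the judge reply; raise ValueError on any format violation."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or score not in valid_scores:
        raise ValueError(f"score {score!r} is not one of {sorted(valid_scores)}")
    explanation = data.get("explanation", "")
    if not isinstance(explanation, str):
        raise ValueError("explanation must be a string")
    return score, explanation


prompt = build_judge_prompt("3 = highly relevant ... 0 = irrelevant",
                            "how do I rotate an API key?",
                            "Open Settings, choose Keys, click Rotate.")
score, why = parse_judgment('{"explanation": "Gives the exact steps.", "score": 3}')
```

Rejecting malformed replies outright, rather than extracting a number with a lenient regex, keeps parse failures visible so they can be retried or counted.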
Calibrate against human judgment. Run a sample of (query, output) pairs through both human judges and the LLM judge. Measure agreement: percentage agreement, Cohen's kappa, and correlation with human scores. Production patterns: calibrate on a sample of roughly 100 to 500 pairs; pin the judge model version once calibrated; recalibrate periodically and whenever the judge model is updated.
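The two agreement measures named above can be computed directly from paired score lists. This is a small pure-Python sketch; the function names and the eight sample scores are made up for illustration, and a real calibration set would be far larger. Cohen's kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's score distribution.

```python
import math
from collections import Counter


def percent_agreement(human: list[int], judge: list[int]) -> float:
    """Fraction of items on which the two raters gave the same score."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty score lists")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    p_observed = percent_agreement(human, judge)
    n = len(human)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick category c independently.
    p_expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if math.isclose(p_expected, 1.0):
        return float("nan")  # both raters used one identical category; kappa undefined
    return (p_observed - p_expected) / (1 - p_expected)


# Illustrative scores only.
human_scores = [3, 2, 2, 0, 1, 3, 0, 2]
judge_scores = [3, 2, 1, 0, 1, 3, 0, 3]

print(percent_agreement(human_scores, judge_scores))       # 0.75
print(round(cohens_kappa(human_scores, judge_scores), 3))  # 0.673
```

Unweighted kappa treats a 3-versus-0 disagreement the same as a 3-versus-2 disagreement. For an ordinal rubric, a weighted kappa or a rank correlation may be a better fit.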
Run at scale. With the calibrated judge, evaluate the production system continuously: regression suite of (query, expected quality) pairs run on every deployment; A/B test arms judged automatically; drift detection by tracking judge scores over time.
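A rough sketch of the regression gate and the drift check described above. The `generate` and `judge` callables stand in for the system under test and the calibrated judge; both are hypothetical stubs here, as are the function names and the `max_drop` threshold, which would need tuning against real score variance.

```python
from statistics import mean
from typing import Callable, Iterable


def regression_failures(
    cases: Iterable[tuple[str, int]],
    generate: Callable[[str], str],
    judge: Callable[[str, str], int],
) -> list[tuple[str, int, int]]:
    """Run each (query, minimum expected score) case; return the ones that fall short."""
    failures = []
    for query, min_score in cases:
        output = generate(query)
        score = judge(query, output)
        if score < min_score:
            failures.append((query, score, min_score))
    return failures


def score_drift(baseline: list[int], recent: list[int]) -> float:
    """Change in mean judge score; negative values mean quality went down."""
    return mean(recent) - mean(baseline)


def drift_alert(baseline: list[int], recent: list[int], max_drop: float = 0.25) -> bool:
    """Flag drift when the mean score drops by more than max_drop (assumed threshold)."""
    return score_drift(baseline, recent) < -max_drop


# Stubs standing in for the real system and the real judge call.
def stub_generate(query: str) -> str:
    return f"answer to: {query}"


def stub_judge(query: str, output: str) -> int:
    return 3 if "refund" in query else 1


suite = [("refund policy", 2), ("shipping times", 2)]
print(regression_failures(suite, stub_generate, stub_judge))
print(drift_alert(baseline=[3, 3, 2, 3], recent=[2, 2, 2, 3]))
```

A deployment pipeline could block the release when `regression_failures` returns a non-empty list. Comparing means is the simplest possible drift signal; a statistical test over score distributions would be more robust to noise.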
Faithfulness-specific patterns. For RAG outputs, faithfulness judging examines each claim in the output against the source passages. Production patterns: extract claims from the output (with another LLM call if needed); for each claim, judge whether the passages support it; aggregate to a faithfulness score per output. RAGAS and TruLens are open-source frameworks implementing this pattern.
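The extract, judge, and aggregate steps can be sketched as below. This is not the RAGAS or TruLens API; it is a generic outline in which `extract_claims` and `supports` are injected callables that would each wrap an LLM call in production. The toy implementations here (sentence splitting and substring matching) exist only so that the example runs.

```python
from typing import Callable, Optional, Sequence


def faithfulness(
    answer: str,
    passages: Sequence[str],
    extract_claims: Callable[[str], list[str]],
    supports: Callable[[str, Sequence[str]], bool],
) -> tuple[Optional[float], list[str]]:
    """Return (fraction of supported claims, list of unsupported claims).

    The score is None when the answer contains no claims to check.
    """
    claims = extract_claims(answer)
    if not claims:
        return None, []
    unsupported = [claim for claim in claims if not supports(claim, passages)]
    return (len(claims) - len(unsupported)) / len(claims), unsupported


# Toy stand-ins for the two LLM calls (assumptions for illustration only).
def toy_extract_claims(text: str) -> list[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_supports(claim: str, passages: Sequence[str]) -> bool:
    return any(claim.lower() in p.lower() for p in passages)


answer = ("Paris is the capital of France. The Eiffel Tower opened in 1889. "
          "Paris has five million residents.")
sources = ["Paris is the capital of France. The Eiffel Tower opened in 1889."]

score, unsupported = faithfulness(answer, sources, toy_extract_claims, toy_supports)
print(score, unsupported)
```

Returning the unsupported claims alongside the score makes low scores debuggable, since a reviewer can see which statements the judge could not ground.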
Citation-correctness patterns. Examine each citation in the output: does the cited passage actually support the claim it is attached to? Production patterns: parse out the citation markers and the claims they reference; for each (claim, cited passage) pair, judge whether the passage supports the claim; aggregate to a citation correctness score.
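A sketch of the parse, judge, and aggregate steps, under two stated assumptions: citations appear as bracketed numbers such as `[1]` placed before the sentence's closing punctuation, and each marker applies to the sentence that contains it. The function names are hypothetical, and `supports` is a toy substring check standing in for the LLM judge call.

```python
import re
from typing import Callable, Optional

CITATION = re.compile(r"\[(\d+)\]")


def claim_citation_pairs(answer: str) -> list[tuple[str, int]]:
    """Split the answer into sentences and pair each one with the ids it cites."""
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        ids = [int(n) for n in CITATION.findall(sentence)]
        claim = CITATION.sub("", sentence)
        claim = re.sub(r"\s+([.!?,;])", r"\1", claim).strip()  # tidy leftover spaces
        pairs.extend((claim, cid) for cid in ids)
    return pairs


def citation_correctness(
    answer: str,
    passages: dict[int, str],
    supports: Callable[[str, str], bool],
) -> Optional[float]:
    """Fraction of (claim, citation) pairs where the cited passage supports the claim.

    A citation pointing at a passage id that does not exist counts as incorrect.
    Returns None when the answer contains no citations.
    """
    pairs = claim_citation_pairs(answer)
    if not pairs:
        return None
    correct = sum(
        1 for claim, cid in pairs
        if cid in passages and supports(claim, passages[cid])
    )
    return correct / len(pairs)


# Toy stand-in for the LLM judge call (assumption for illustration only).
def toy_supports(claim: str, passage: str) -> bool:
    return claim.rstrip(".").lower() in passage.lower()


answer = "The sky is blue [1]. Grass is green [2][3]."
passages = {1: "The sky is blue on clear days.",
            2: "Grass is green because of chlorophyll."}

print(claim_citation_pairs(answer))
print(citation_correctness(answer, passages, toy_supports))
```

Other citation styles (footnotes, markers after the period, inline links) would need a different parser, but the pairing and aggregation logic stays the same.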
Avoiding judge bias. LLM judges have known biases: position bias (favoring a candidate because of where it appears, often the first slot), length bias (preferring longer outputs), and self-preference bias (preferring outputs from the judge's own model family). Production patterns: randomize position in pairwise comparisons; control for length when judging multiple candidates; use a different model family for the judge than for generation.
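Position randomization in a pairwise comparison can be sketched as follows. The `judge_pair` callable is a hypothetical wrapper around the judge model that returns `"first"` or `"second"`; the function shuffles which candidate occupies which slot and maps the verdict back to the original labels. The two stub judges exist only to show the effect.

```python
import random
from typing import Callable


def randomized_pairwise(
    query: str,
    candidate_a: str,
    candidate_b: str,
    judge_pair: Callable[[str, str, str], str],
    rng: random.Random,
) -> str:
    """Return 'a' or 'b' after presenting the candidates in a random order."""
    swapped = rng.random() < 0.5
    first, second = (candidate_b, candidate_a) if swapped else (candidate_a, candidate_b)
    verdict = judge_pair(query, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    first_label, second_label = ("b", "a") if swapped else ("a", "b")
    return first_label if verdict == "first" else second_label


# Stub judges (assumptions for illustration only).
def position_biased_judge(query: str, first: str, second: str) -> str:
    return "first"  # always prefers whatever is shown first


def content_based_judge(query: str, first: str, second: str) -> str:
    return "first" if len(first) > len(second) else "second"


rng = random.Random(0)
biased_wins = [randomized_pairwise("q", "long detailed answer", "short", position_biased_judge, rng)
               for _ in range(200)]
content_wins = [randomized_pairwise("q", "long detailed answer", "short", content_based_judge, rng)
                for _ in range(200)]
print(biased_wins.count("a"), biased_wins.count("b"))
print(set(content_wins))
```

With randomization, a purely position-driven judge spreads its wins across both candidates instead of systematically favoring one, while a judge that responds to content gives the same winner regardless of slot. Randomization turns position bias into noise rather than removing it; judging each pair in both orders and checking that the verdicts agree is a stricter alternative.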
Good fit: production LLM-augmented systems that need ongoing quality monitoring at scale; regression suites for LLM-augmented features; A/B testing of LLM-augmented variants; drift detection across model updates.
Less good fit: small systems where human judgment is sufficient; very high-stakes domains where automated judgment isn't trusted enough (medical, legal); workloads where the judge has not been calibrated against human judgment.
- Zheng et al. (2023) 'Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena'
- RAGAS framework documentation (github.com/explodinggradients/ragas)
- TruLens documentation (trulens.org)
- Anthropic documentation on LLM-as-judge patterns