RelevantSearch.AI
Pattern · Volume 05 · Section F --- Regression detection and continuous evaluation · Updated May 2026

Golden query sets and continuous evaluation

Source: Production methodology at major search companies; OpenSource Connections methodology writings; Quepid for judgment set management

Classification — Pattern for continuous evaluation against curated query sets with alerting on regression.

Intent

Detect search quality regressions automatically by running curated query sets against the current system frequently (daily or per-deployment) and alerting when metrics fall outside expected ranges.

Motivating Problem

Production search quality degrades silently. Code changes, configuration changes, model updates, index changes, and corpus changes can all cause regressions that don't surface as user complaints for weeks. By the time a regression is visible in business metrics, the team has lost time and trust. Continuous evaluation catches the regression early: a failing daily run alerts the team before users notice.

How It Works

Golden query set composition. A small set (50–500 queries) of carefully chosen queries with known-good expected results. The queries cover: high-value queries (top traffic), edge cases (specific failure modes the team has fixed and wants to keep fixed), and representative queries across query classes. The set is smaller than the full judgment list (Section A) because continuous evaluation runs frequently and must be fast.
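
A golden set and its scoring can be sketched as follows. This is a minimal illustration, not a reference implementation: the `GoldenQuery` type, its category labels, the example queries, and the graded-judgment format (document ID mapped to an integer grade) are assumptions made for the example, and the NDCG variant shown uses the common exponential gain with unjudged documents counted as grade 0.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GoldenQuery:
    """One golden-set entry (hypothetical structure for illustration)."""
    query: str
    category: str  # assumed labels: "high_value", "edge_case", "representative"
    judgments: Dict[str, int] = field(default_factory=dict)  # doc_id -> grade


def dcg(grades: List[int]) -> float:
    """Discounted cumulative gain with exponential gain (2^grade - 1)."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))


def ndcg_at_k(ranked_doc_ids: List[str], judgments: Dict[str, int], k: int = 10) -> float:
    """NDCG@k for one query; unjudged documents are treated as grade 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Illustrative entries only; real sets would hold 50-500 curated queries.
golden_set = [
    GoldenQuery("running shoes", "high_value", {"d1": 3, "d2": 2, "d3": 1}),
    GoldenQuery("usb-c to hdmi 4k", "edge_case", {"d9": 3}),
]
```

Keeping judgments alongside each query lets a run score the whole set with one pass over the search results, which is what makes a daily or per-deployment cadence practical.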

Hard query sets. A subset of the golden set focused specifically on queries known to be difficult — queries where the current system performs adequately but candidate changes have historically regressed. The hard set is the regression-test set: it catches the easy-to-introduce regressions that simpler queries might miss. New hard queries are added when a regression escapes other tests.
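
One way to keep the hard set a true subset of the golden set is to tag entries rather than maintain a second list. The sketch below assumes a hypothetical `QueryEntry` type and a `"hard"` tag; neither comes from a specific tool.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class QueryEntry:
    """Golden-set entry with free-form tags (hypothetical structure)."""
    query: str
    tags: Set[str] = field(default_factory=set)
    notes: List[str] = field(default_factory=list)


def hard_subset(golden: List[QueryEntry]) -> List[QueryEntry]:
    """Return the entries tagged as hard, preserving golden-set order."""
    return [entry for entry in golden if "hard" in entry.tags]


def record_escaped_regression(golden: List[QueryEntry], query: str, note: str) -> QueryEntry:
    """Tag a query as hard after a regression escaped other tests.

    Adds the query to the golden set first if it is not already there.
    """
    for entry in golden:
        if entry.query == query:
            break
    else:
        entry = QueryEntry(query=query)
        golden.append(entry)
    entry.tags.add("hard")
    entry.notes.append(note)
    return entry
```

Recording a note with each promotion preserves why the query is considered hard, which helps when the set is reviewed later.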

Run cadence. Daily for active production search; per-deployment for high-velocity teams; weekly for stable systems. The cadence should match the rate of change in the system; static systems need less frequent evaluation than systems under active development. The cadence shouldn't be so frequent that it overwhelms operations with false-positive alerts.

Per-query alerting. Aggregate metrics (mean NDCG@10) hide per-query regressions: one query dropping from 0.95 to 0.10 may not move the mean noticeably (in a 100-query set, the mean falls by only 0.0085). Per-query alerting catches these cases: if any individual query's score drops by more than a threshold (e.g., 0.2), alert.
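
A per-query check might look like the sketch below. It assumes scores are held as plain dictionaries mapping query text to NDCG@10, uses the 0.2 threshold from the example above as a default, and treats a query missing from the current run as a score of 0.0 so that failed executions also surface; that last choice is an assumption, not part of the pattern.

```python
from typing import Dict, List, Tuple


def per_query_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.2,
) -> List[Tuple[str, float, float]]:
    """Return (query, baseline_score, current_score) for each regressed query.

    A query regresses when its score falls by more than max_drop.
    Results are ordered with the largest drop first.
    """
    alerts = []
    for query, base_score in baseline.items():
        # Assumption: a query absent from the current run counts as 0.0.
        cur_score = current.get(query, 0.0)
        if base_score - cur_score > max_drop:
            alerts.append((query, base_score, cur_score))
    return sorted(alerts, key=lambda a: a[2] - a[1])


# 100 queries at 0.95; a single query collapses to 0.10.
baseline = {f"q{i}": 0.95 for i in range(100)}
current = dict(baseline)
current["q7"] = 0.10

mean_shift = (sum(baseline.values()) - sum(current.values())) / len(baseline)
alerts = per_query_regressions(baseline, current)
```

Here `mean_shift` is about 0.0085, well inside normal day-to-day noise for most aggregate thresholds, while `alerts` contains the single collapsed query.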

Aggregate alerting. Per-query alerting catches specific failures; aggregate alerting catches systemic ones. Track the aggregate NDCG of each run over time; alert when its rolling average falls by more than a threshold relative to its earlier level. The combination of per-query and aggregate alerting catches both kinds of regression.
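
The rolling-average check can be sketched as below. The comparison of the most recent window against the window immediately before it, the 7-run window, and the 0.03 threshold are all assumptions chosen for illustration; the pattern itself only calls for alerting on a drop in the rolling average.

```python
from statistics import mean
from typing import List


def aggregate_alert(daily_means: List[float], window: int = 7, max_drop: float = 0.03) -> bool:
    """Return True when the recent rolling mean has dropped by more than max_drop.

    daily_means is ordered oldest to newest, one aggregate NDCG value per run.
    The most recent `window` runs are compared with the `window` runs before them.
    """
    if len(daily_means) < 2 * window:
        # Not enough history to form both windows; stay quiet.
        return False
    recent = mean(daily_means[-window:])
    previous = mean(daily_means[-2 * window:-window])
    return previous - recent > max_drop
```

Comparing two windows rather than two single runs smooths out one-off noise, at the cost of reacting more slowly; per-query alerting covers the fast, sharp failures.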

Alert routing. Alerts should reach the right people quickly: the search quality team or on-call engineer for immediate investigation, the relevant feature team if the regression correlates with their deployment, and a broader audience if the regression affects business metrics. The routing infrastructure itself is standard operations practice, not specific to search; what matters here is applying it to search quality signals.
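
The three routing rules can be expressed as a small function. The `RegressionAlert` type and the recipient names (`search-quality-oncall`, `team:<name>`, `broad-notification`) are hypothetical placeholders; a real setup would map them onto whatever paging or chat system the organization already uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegressionAlert:
    """Minimal alert record (hypothetical fields for illustration)."""
    summary: str
    suspected_team: Optional[str] = None  # team whose deployment correlates, if any
    affects_business_metrics: bool = False


def route(alert: RegressionAlert) -> List[str]:
    """Return the recipients for an alert, in notification order."""
    # The search quality on-call always gets the alert first.
    recipients = ["search-quality-oncall"]
    if alert.suspected_team:
        recipients.append(f"team:{alert.suspected_team}")
    if alert.affects_business_metrics:
        recipients.append("broad-notification")
    return recipients
```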

Investigation workflows. When an alert fires, the team needs to diagnose the cause: was it a deployment, a model update, a corpus change, or an external dependency? A searchable history of metric values over time supports the diagnosis. Tools like Splainer (OpenSource Connections) let engineers inspect why specific queries returned the results they did, supporting root-cause analysis. The diagnostic workflow is itself a substantial discipline; the future Search Operations Catalog will cover it in depth.

Regression budget. Not every regression must be fixed immediately; some may be acceptable trade-offs for other improvements. Production teams typically maintain a regression budget: how much quality degradation is acceptable in exchange for what kind of capability gain? Explicit budgets enable principled decisions rather than ad-hoc "is this regression OK?" negotiations.
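
An explicit budget can be as simple as a table from the kind of capability gain to the largest aggregate drop the team will accept for it. The gain categories and numeric limits below are invented for illustration; real budgets are a team decision.

```python
from typing import Dict

# Hypothetical budget: capability gain -> maximum acceptable aggregate NDCG drop.
BUDGETS: Dict[str, float] = {
    "none": 0.0,
    "latency_improvement": 0.01,
    "new_query_class": 0.02,
}


def within_budget(gain_kind: str, ndcg_drop: float, budgets: Dict[str, float] = BUDGETS) -> bool:
    """Return True when the observed drop is acceptable for the stated gain.

    Unknown gain kinds get a budget of 0.0, so any drop is rejected.
    A negative drop (an improvement) is always within budget.
    """
    allowed = budgets.get(gain_kind, 0.0)
    return ndcg_drop <= allowed
```

Defaulting unknown gain kinds to a zero budget forces a deliberate conversation before a new kind of trade-off is accepted.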

When to Use It

Production search above toy scale where quality matters. Teams with frequent deployments where each deployment could regress quality. Mature search systems where the cost of undetected regressions is significant. Any team with judgment lists (Section A) substantial enough to support frequent evaluation.

Alternatives — less frequent evaluation (weekly or monthly) for stable systems with low deployment frequency. Pure online monitoring (production click metrics) for teams without judgment list infrastructure. The continuous-evaluation pattern is best suited to teams with both judgment list infrastructure and active development; teams missing either may use lighter patterns.

Sources
  • Production methodology writings from search teams at major companies
  • OpenSource Connections methodology blog posts on continuous relevance testing
  • Quepid documentation on judgment set management and test cases
