Source: General A/B testing methodology adapted for search; Kohavi et al., "Trustworthy Online Controlled Experiments" (2020); production methodology at major search companies

Classification — Online evaluation pattern — split production traffic between systems, compare outcomes statistically.

Intent

Measure whether a candidate search system produces better real-user outcomes than the current system by splitting production traffic and comparing per-user metrics with statistical rigor.

Motivating Problem

Offline metrics correlate with user outcomes but don't guarantee them. A change that improves NDCG might not improve clicks or conversions; conversely, a change that leaves NDCG unchanged might substantially improve user outcomes through subtle effects that offline metrics miss. A/B testing measures actual user outcomes, providing the closest available approximation to ground truth about whether a change helps. The cost: time (experiments need to run long enough to reach adequate statistical power), users (some users see the candidate system, which may be worse than the current one), and operational complexity (running two systems in parallel).

How It Works

Experimental design. Define the metric of interest (CTR, conversion rate, revenue per session, satisfaction proxy). Define the user population to include (all users? specific segments?). Define the split (50/50 is standard; smaller candidate share for high-risk changes). Define the duration (long enough for statistical power; not so long that user experience degradation accumulates).

Statistical power calculation. Before running, calculate how many users (or sessions, or queries) are needed to detect a meaningful effect with sufficient statistical confidence. The calculation depends on baseline metric variance, the smallest effect size worth detecting, the significance level (typically α = 0.05, meaning a result is called significant when p < 0.05), and the desired power (typically 0.80). High-traffic search can detect 1% effects in days; low-traffic search may need weeks or months to detect the same effect.
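A minimal sketch of the power calculation for a binary per-user metric such as conversion, assuming a two-sided two-proportion z-test with equal-sized arms. The function name sample_size_per_arm and its parameters are illustrative and not taken from any particular library; other metric types (revenue, counts) need a variance estimate in place of the binomial one used here.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a relative lift in a rate.

    Assumes a two-sided two-proportion z-test and a 50/50 split.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    # Critical values for the significance level and the desired power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the per-user Bernoulli variances in the two arms.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline conversion, 1% relative lift.
    print(sample_size_per_arm(0.10, 0.01))
```

Under these assumptions, a 10% baseline and a 1% relative lift call for roughly 1.4 million users per arm, which illustrates why the same detection takes days on a high-traffic site and months on a low-traffic one.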

Randomization. Users (or sessions, or other units) are assigned to control or treatment randomly. The randomization must be stable across the experiment (same user sees the same system throughout). Random assignment is what makes statistical inference valid; non-random assignment (e.g., showing the new system to early adopters) confounds the experiment.
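One common way to get stable random assignment is to hash the unit identifier together with an experiment identifier. The sketch below assumes string user IDs and uses SHA-256 as the hash; the function name assign_arm and the ID formats are hypothetical.

```python
import hashlib


def assign_arm(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Salting with experiment_id keeps assignments independent across
    experiments; the same (user, experiment) pair always gets the same arm.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    print(assign_arm("user-123", "ranker-v2-test"))
```

Because the assignment is a pure function of the two IDs, no assignment table needs to be stored, and a smaller candidate share for a high-risk change is just a lower treatment_share.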

Outcome measurement. Track the primary metric and a set of guardrail metrics that could indicate unintended consequences. Primary might be conversion rate; guardrails might be revenue per user, query reformulation rate, bounce rate, latency. Negative guardrail movement may make a positive primary movement unacceptable.
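The interaction between primary and guardrail metrics can be expressed as a simple decision rule. The sketch below assumes each metric has already been summarized as a confidence interval on the treatment-minus-control difference; MetricResult and ship_decision are hypothetical names, and real teams often use explicit non-inferiority margins rather than the zero threshold used here.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    ci_low: float    # lower bound of CI on (treatment - control)
    ci_high: float   # upper bound of CI on (treatment - control)
    higher_is_better: bool = True


def is_significant_gain(m: MetricResult) -> bool:
    # The whole interval lies on the "better" side of zero.
    return m.ci_low > 0 if m.higher_is_better else m.ci_high < 0


def is_significant_harm(m: MetricResult) -> bool:
    # The whole interval lies on the "worse" side of zero.
    return m.ci_high < 0 if m.higher_is_better else m.ci_low > 0


def ship_decision(primary: MetricResult, guardrails: list) -> str:
    harmed = [g.name for g in guardrails if is_significant_harm(g)]
    if harmed:
        # A guardrail regression blocks the launch even if the primary improved.
        return "hold: guardrail regression in " + ", ".join(harmed)
    if is_significant_gain(primary):
        return "ship"
    return "no ship: primary gain not established"
```

For example, a conversion-rate gain paired with a latency interval that lies entirely above zero (latency being a lower-is-better metric) yields a hold rather than a ship.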

Statistical analysis. At experiment end, compute the metric for each arm, compute the difference, and test its statistical significance (typically via t-test, chi-square, or an appropriate non-parametric test, depending on metric type). Report effect size with a confidence interval, not just a p-value. The discipline of statistical interpretation matters: a significant result from an underpowered experiment is more likely to be a false positive or an overstated effect, and acting on it in a rush leads to false-positive deployments.
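As an illustration of reporting effect size with a confidence interval, here is a minimal sketch for a binary per-user metric, assuming each user contributes one outcome and a two-proportion z-test is appropriate. The function name compare_conversion and the counts in the example are made up for illustration.

```python
import math
from statistics import NormalDist


def compare_conversion(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-proportion z-test plus a confidence interval on the difference."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled standard error for the hypothesis test (null: equal rates).
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval on the difference.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "diff": diff,
        "p_value": p_value,
        "ci": (diff - z_crit * se, diff + z_crit * se),
    }


if __name__ == "__main__":
    # Hypothetical counts: 1000/10000 control vs 1100/10000 treatment.
    print(compare_conversion(1000, 10_000, 1100, 10_000))
```

The interval is the more useful output: it shows both whether zero is excluded and how large or small the true effect could plausibly be.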

Sequential testing pitfalls. Extending an experiment after looking at the data, or peeking at results midway and stopping on significance, breaks the error guarantees of a fixed-horizon test. Naive "watch the experiment and stop when significant" inflates the false positive rate well above the nominal level. Sequential testing methods (or Bayesian methods) address this when early stopping is genuinely needed; otherwise, production teams should set the experiment duration upfront based on power analysis and resist the temptation to read results early.
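The inflation from peeking can be seen in a small simulation of A/A tests, where both arms have the same true rate and every "significant" result is a false positive. This is a sketch under simplifying assumptions (binary metric, equal arms, a z-test at each look); simulate_aa and its defaults are illustrative.

```python
import math
import random


def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic for two arms of n users each."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def simulate_aa(n_sims=500, looks=10, users_per_look=100, rate=0.10, seed=0):
    """Return (false-positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    crossed_any_look = 0
    crossed_final_look = 0
    for sim in range(n_sims):
        conv_a = conv_b = n = 0
        crossed = False
        z = 0.0
        for look in range(looks):
            # Both arms draw from the same rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            z = z_stat(conv_a, conv_b, n)
            if abs(z) > 1.96:
                crossed = True  # a peeker would have stopped here
        crossed_any_look += crossed
        crossed_final_look += abs(z) > 1.96
    return crossed_any_look / n_sims, crossed_final_look / n_sims


if __name__ == "__main__":
    peeking, fixed = simulate_aa()
    print(f"stop-when-significant: {peeking:.3f}, fixed horizon: {fixed:.3f}")
```

The fixed-horizon rate should land near the nominal 5%, while the stop-when-significant rate comes out several times higher with ten looks.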

Holdout populations. Some users (1–5%) may be excluded from experiments entirely as a long-term holdout: they see the current production system always. The holdout enables measurement of cumulative impact ("we've shipped 10 experiments; are users in the cumulative-shipped arm doing better than holdout?") that experiment-by-experiment analysis can't capture.
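A long-term holdout can be carved out with the same hashing idea used for per-experiment assignment, but with a salt that never changes between experiments. The sketch below assumes a 2% holdout; the salt string, function names, and ID formats are hypothetical.

```python
import hashlib


def _bucket(key: str) -> float:
    """Map a string key to a stable number in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def in_global_holdout(user_id: str, holdout_share: float = 0.02,
                      salt: str = "global-holdout-v1") -> bool:
    # The salt is fixed across experiments, so membership never changes.
    return _bucket(f"{salt}:{user_id}") < holdout_share


def route(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Holdout users always get production; everyone else is randomized."""
    if in_global_holdout(user_id):
        return "holdout"
    bucket = _bucket(f"{experiment_id}:{user_id}")
    return "treatment" if bucket < treatment_share else "control"
```

Checking the holdout before any experiment assignment is what keeps those users on the current production system across every experiment, so their metrics can serve as the baseline for cumulative impact.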

When to Use It

System-level changes that affect more than ranking (UI changes, presentation features, filter logic, complete architectural shifts). Cases where the change's impact on user outcomes is uncertain. High-stakes deployments where the cost of being wrong about a change is high. Validation before rolling out to 100% of users.

Alternatives — interleaving (next entry) for pure ranking comparisons, where its statistical efficiency reduces required sample size by an order of magnitude. Offline evaluation for changes that don't need online validation (small changes that offline metrics handle well). Multi-armed bandit methods for cases where minimizing exposure to worse variants matters more than rigorous statistical comparison.

Sources

Kohavi, Tang, Xu, Trustworthy Online Controlled Experiments (2020)
Daniel Tunkelang's writing on A/B testing for search
Vendor documentation: Coveo experimentation, Algolia A/B testing

A/B testing for search