RelevantSearch.AI
Pattern · Volume 05 · Section D: Online evaluation · Updated May 2026

Interleaving (TDI and successors)

Source: Joachims, "Evaluating Retrieval Performance using Clickthrough Data" (2003); Radlinski, Kurup, Joachims, "How Does Clickthrough Data Reflect Retrieval Quality?" (CIKM 2008); Schuth et al., "Multi-Leaved Comparisons for Fast Online Evaluation" (CIKM 2014)

Classification — Online evaluation pattern — blend two systems' rankings into a single result set per user, track which system's contributions get clicked.

Intent

Compare two ranking systems with much higher statistical efficiency than A/B testing by having each user effectively serve as their own experiment — seeing results from both systems and clicking those they prefer.

Motivating Problem

A/B testing for ranking changes is statistically inefficient: each user sees only one system, so the comparison is made between users, and differences in click behavior from one user to the next mean that many users are needed before a real difference between systems stands out. Interleaving addresses the inefficiency with a within-user comparison: each user sees results from both systems merged into one list, and clicks on each system's contributions are direct evidence of which ranking that user preferred. The signal per user is much stronger; experiments typically reach significance with roughly an order of magnitude fewer users than equivalent A/B tests, though the exact factor depends on the traffic and on the metric being compared.

How It Works

Team-Draft Interleaving (TDI). The canonical interleaving algorithm. Each ranking is treated as a team, and the merged list is built by drafting in rounds: in each round a coin flip determines which team picks first; that team adds its highest-ranked result not already in the merged list; the other team then does the same; rounds repeat until the result list is complete. A document that both systems would have shown is credited to whichever team drafted it first; a document only one system would have shown is credited to that system's team. Every result in the merged list is therefore attributable to exactly one team.
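The drafting procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the function name `team_draft_interleave`, the team labels "A" and "B", and the fallback that lets the other team pick when one ranking is exhausted are all choices made for this example. The sketch re-flips the coin whenever both teams have contributed the same number of results, which is how the Radlinski, Kurup, Joachims formulation is usually described.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Merge two rankings by team drafting.

    Returns (merged, teams): the merged document ids and, for each
    position, the team ("A" or "B") credited with that document.
    """
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    merged, teams, seen = [], [], set()

    def next_unseen(team):
        # Highest-ranked document of this team not yet in the merged list.
        return next((d for d in rankings[team] if d not in seen), None)

    while len(merged) < length:
        # The team with fewer picks goes next; a coin flip breaks ties,
        # which amounts to a fresh coin flip at the start of every round.
        if picks["A"] < picks["B"]:
            order = ["A", "B"]
        elif picks["B"] < picks["A"]:
            order = ["B", "A"]
        else:
            order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]

        for team in order:
            doc = next_unseen(team)
            if doc is not None:
                merged.append(doc)
                teams.append(team)
                seen.add(doc)
                picks[team] += 1
                break
        else:
            # Assumption: stop early if both rankings are exhausted.
            break

    return merged, teams


if __name__ == "__main__":
    a = ["d1", "d2", "d3", "d4"]
    b = ["d2", "d5", "d1", "d6"]
    merged, teams = team_draft_interleave(a, b, 4, random.Random(7))
    for doc, team in zip(merged, teams):
        print(doc, team)
```

Passing a seeded `random.Random` makes a given impression reproducible, which helps when debugging attribution; in serving, the generator would normally be unseeded or seeded per request.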

Click attribution. When the user clicks a result, the click is credited to the team that contributed that result. Aggregate over many user impressions: if Team A's contributions get more clicks than Team B's, that's evidence A produced better results. Statistical inference uses the per-impression team-comparison as the unit of analysis.
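As a rough sketch of how attribution and inference fit together, the following Python scores each impression as a win for A, a win for B, or a tie, then applies a two-sided sign test to the win counts. The sign test is an assumption chosen here because it matches the per-impression unit of analysis; the text does not prescribe a specific test, and the function names are hypothetical.

```python
from math import comb


def impression_outcome(teams, clicked_positions):
    """Score one impression: which team had more of its results clicked.

    teams[i] is the team credited with position i of the merged list;
    clicked_positions holds the zero-based positions the user clicked.
    """
    clicks_a = sum(1 for p in clicked_positions if teams[p] == "A")
    clicks_b = sum(1 for p in clicked_positions if teams[p] == "B")
    if clicks_a > clicks_b:
        return "A"
    if clicks_b > clicks_a:
        return "B"
    return "tie"


def sign_test_p_value(wins_a, wins_b):
    """Two-sided sign test on per-impression wins (ties already dropped).

    Under the null hypothesis each non-tied impression is equally likely
    to be won by either team, so wins follow Binomial(n, 0.5).
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    impressions = [
        (["A", "B", "A", "B"], [0, 2]),  # two clicks on A's results
        (["B", "A", "B", "A"], [0]),     # one click on B's result
        (["A", "B", "B", "A"], []),      # no clicks: tie
    ]
    outcomes = [impression_outcome(t, c) for t, c in impressions]
    wins_a, wins_b = outcomes.count("A"), outcomes.count("B")
    print(outcomes, sign_test_p_value(wins_a, wins_b))
```

Scoring each impression as a single win or loss, rather than pooling raw clicks, keeps one heavy-clicking user from dominating the comparison.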

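Tie-breaking. Documents that both systems rank highly create ties that need handling. TDI handles this via the draft order: the first team to draft a tied document is credited with it. Probabilistic Interleaving handles it differently: it samples each position from a softmax distribution over each system's ranking and credits clicks in expectation over the team assignments that could have produced the list. The choice affects experiment behavior; canonical TDI handles most cases well.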

Multi-leaved comparisons. The extension to more than two systems: instead of two teams, N teams contribute to the merged list. Each user sees results from N systems blended together; clicks are attributed across all N. The extension allows comparing many candidate rankings simultaneously, multiplying interleaving's efficiency advantage.
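One way to extend the draft to N teams is sketched below: each round visits the teams in a freshly shuffled order, and each team adds its highest-ranked document not yet in the list. This is an illustrative sketch of a team-draft style multileaving, not necessarily the exact procedure in the cited paper; the function name and the dictionary-of-rankings input are assumptions made for this example.

```python
import random


def team_draft_multileave(rankings, length, rng=None):
    """Merge N rankings by drafting in rounds.

    rankings maps a team name to its ranked list of document ids.
    Returns (merged, credit), where credit[i] is the team that
    contributed merged[i].
    """
    rng = rng or random.Random()
    team_order = list(rankings)
    merged, credit, seen = [], [], set()

    while len(merged) < length:
        # A fresh random pick order each round plays the role of the coin flip.
        rng.shuffle(team_order)
        added_this_round = False
        for team in team_order:
            if len(merged) >= length:
                break
            doc = next((d for d in rankings[team] if d not in seen), None)
            if doc is None:
                continue  # this team has nothing new to contribute
            merged.append(doc)
            credit.append(team)
            seen.add(doc)
            added_this_round = True
        if not added_this_round:
            break  # every ranking is exhausted

    return merged, credit


if __name__ == "__main__":
    candidates = {
        "A": ["d1", "d2", "d3", "d4"],
        "B": ["d2", "d1", "d5", "d6"],
        "C": ["d7", "d1", "d2", "d8"],
    }
    merged, credit = team_draft_multileave(candidates, 6, random.Random(3))
    for doc, team in zip(merged, credit):
        print(doc, team)
```

With a fixed list length, each team contributes fewer results per impression as N grows, so the number of candidates that can usefully share one list is bounded by the length of the result page.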

Production engineering. Interleaving requires real-time merging of two (or more) systems' rankings, real-time tracking of which system contributed each result, and click logging that associates clicks with contributing systems. The engineering is non-trivial; production teams that have implemented it consider the investment worthwhile because of the experimentation throughput it enables.
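The bookkeeping side can be illustrated with a small in-memory sketch: record the document-to-team mapping when the merged list is served, then resolve each click against that record. Everything here is hypothetical (the class names, the `impression_id` key, the in-memory dictionary); a real deployment would write these records to whatever logging pipeline the team already runs.

```python
from dataclasses import dataclass, field


@dataclass
class Impression:
    """What was served for one query, plus the clicks it later received."""
    impression_id: str
    query: str
    doc_team: dict  # document id -> contributing team
    clicks: list = field(default_factory=list)  # (document id, team) pairs


class InterleavingLog:
    """In-memory stand-in for an impression and click log."""

    def __init__(self):
        self._impressions = {}

    def log_impression(self, impression_id, query, merged, teams):
        # Store attribution at serve time so clicks can be resolved later
        # without re-running the (randomized) merge.
        self._impressions[impression_id] = Impression(
            impression_id, query, dict(zip(merged, teams))
        )

    def log_click(self, impression_id, doc_id):
        """Attach a click to its impression; return the credited team or None."""
        impression = self._impressions.get(impression_id)
        if impression is None or doc_id not in impression.doc_team:
            return None  # unknown impression or a document that was not served
        team = impression.doc_team[doc_id]
        impression.clicks.append((doc_id, team))
        return team

    def clicks_per_team(self, impression_id):
        counts = {}
        for _, team in self._impressions[impression_id].clicks:
            counts[team] = counts.get(team, 0) + 1
        return counts


if __name__ == "__main__":
    log = InterleavingLog()
    log.log_impression("imp-1", "running shoes", ["d1", "d2", "d5"], ["A", "B", "B"])
    log.log_click("imp-1", "d2")
    log.log_click("imp-1", "d5")
    print(log.clicks_per_team("imp-1"))
```

Persisting the attribution at serve time matters because the merge is randomized: the team assignment cannot be reconstructed afterwards from the two rankings alone.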

Limitations. Interleaving only compares rankings; non-ranking changes (different UX, different filters) can't be tested via interleaving. Interleaving assumes users perceive a single result list; if the UX separates contributions visually, the interleaving signal breaks down. Production teams typically use interleaving for ranking experiments and reserve A/B testing for system-level changes.

When to Use It

  • Comparing two or more candidate ranking algorithms.
  • Comparing different parameter settings within the same algorithm.
  • Quick experiments on smaller-traffic search systems where A/B testing would take too long.
  • Sequential experimentation where running many ranking variants is feasible if each individual experiment is short.

Alternatives — A/B testing (prior entry) for non-ranking comparisons. Offline evaluation for changes that offline metrics handle adequately. The combination of interleaving for ranking and A/B for system-level changes is the working pattern for mature production search teams.

Sources
  • Joachims, "Evaluating Retrieval Performance using Clickthrough Data" (2003)
  • Radlinski, Kurup, Joachims, "How Does Clickthrough Data Reflect Retrieval Quality?" (CIKM 2008)
  • Schuth et al., "Multi-Leaved Comparisons for Fast Online Evaluation" (CIKM 2014)
  • Hofmann, Whiteson, de Rijke writings on interleaving and online learning
