Source: Industry-standard architecture, anchored in learning-to-rank literature (Liu, "Learning to Rank for Information Retrieval," 2009) and modern neural reranking practice
Classification — The dominant production search architecture: a fast retrieval stage produces candidates; a slower ranking stage reorders them.
Apply expensive ranking methods (LTR, cross-encoder rerankers, personalization) to a small candidate set produced by cheap first-stage retrieval, achieving better top-K quality than single-stage scoring at acceptable cost.
Single-stage scoring forces a difficult trade-off: scoring functions cheap enough to apply to the full corpus are too crude for high-precision ranking, while scoring functions rich enough for precision are too expensive to apply to the full corpus. The two-stage pattern resolves this by using each stage for what it is good at: fast first-stage retrieval scales to billions of documents and delivers good candidate recall; slower second-stage ranking applies expensive methods to 100–1000 candidates and delivers good top-K precision. Decoupling the stages lets each one optimize for its own constraint.
Stage 1 — retrieval. Cheap scoring methods (BM25, vector similarity, hybrid via RRF) run over the full corpus and produce a candidate set, typically 100–1000 documents. The stage optimizes for recall: anything genuinely relevant should appear in this set, even if not in its final correct order. Latency budget is typically 30–80ms; cost per query is whatever the cheap method costs.
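The hybrid option mentioned above, reciprocal rank fusion (RRF), can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name, the `top_n` parameter, and the sample doc ids are hypothetical, and `k=60` is the commonly used smoothing constant rather than a value taken from this entry.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=1000):
    """Fuse several ranked lists of doc ids into one candidate list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by doc id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Hypothetical result lists from a lexical retriever and a vector retriever.
bm25_hits = ["d3", "d1", "d7", "d2"]
vector_hits = ["d1", "d9", "d3", "d4"]
candidates = reciprocal_rank_fusion([bm25_hits, vector_hits], top_n=5)
print(candidates)
```

Documents that both retrievers rank highly rise to the top of the fused list, which suits a recall-oriented stage: the fusion needs only ranks, not comparable scores.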
Stage 2 — reranking. Expensive scoring methods (LTR models with hundreds of engineered features, cross-encoder neural rerankers, personalization layers, business-logic boosts) reorder the candidate set. The stage optimizes for top-K precision: positions 1–10 should be the best documents in the candidate set in the most useful order. Latency budget is typically 20–100ms; cost per query is the expensive scoring method's cost times the candidate count.
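The two stages described above can be sketched as a single function. This is a toy illustration under stated assumptions: `two_stage_search`, `term_overlap`, and `phrase_aware` are hypothetical names, the term-overlap scorer stands in for BM25 or vector search, and the phrase-aware scorer stands in for an LTR model or cross-encoder.

```python
import heapq


def two_stage_search(query, corpus, cheap_score, expensive_score, k=100, top_n=10):
    """corpus maps doc id -> text; both scorers take (query, text) and return a float."""
    # Stage 1: cheap scorer over the full corpus, keep the k best (recall-oriented).
    candidates = heapq.nlargest(
        k, corpus, key=lambda doc_id: cheap_score(query, corpus[doc_id])
    )
    # Stage 2: expensive scorer on the k candidates only (precision-oriented).
    reranked = sorted(
        candidates,
        key=lambda doc_id: expensive_score(query, corpus[doc_id]),
        reverse=True,
    )
    return reranked[:top_n]


def term_overlap(query, text):
    """Stand-in for a cheap first-stage scorer: count of shared terms."""
    return float(len(set(query.split()) & set(text.split())))


def phrase_aware(query, text):
    """Stand-in for an expensive reranker: rewards exact phrase and term density."""
    phrase_bonus = 2.0 if query in text else 0.0
    return phrase_bonus + term_overlap(query, text) / len(text.split())


corpus = {
    "a": "neural reranking with cross encoders",
    "b": "cross country skiing guide",
    "c": "cross encoders explained",
    "d": "gardening tips",
}
top = two_stage_search("cross encoders", corpus, term_overlap, phrase_aware, k=3, top_n=2)
print(top)
```

In this sketch the expensive scorer runs on three documents rather than four; at realistic scale the same shape means the reranker sees hundreds of candidates instead of the whole corpus.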
Stage boundary tuning. The number of candidates stage 1 returns (the cutoff K) is a tunable parameter. A smaller K (e.g., 100) makes stage 2 cheaper but gives it fewer candidates to choose from, and a relevant document that stage 1 misses cannot be recovered later. A larger K (e.g., 1000) lets stage 2 see more candidates and potentially produce better top-K results, but stage 2 cost grows proportionally. Production deployments tune K by weighing the marginal quality improvement of larger candidate sets against the marginal cost.
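One way to make the cutoff trade-off concrete is to tabulate, for each step up in K, the recall gained and the extra stage 2 cost. The sketch below assumes stage 2 cost is linear in the candidate count, as stated above; the function name, the recall figures, and the 0.1 ms per-candidate cost are invented for illustration and are not measurements.

```python
def marginal_gains(measurements, ms_per_candidate):
    """Compare consecutive cutoffs.

    measurements: list of (cutoff_k, recall_at_k) pairs sorted by cutoff_k.
    Returns (k_from, k_to, recall_gain, extra_stage2_ms) for each step,
    assuming stage 2 cost grows linearly with the number of candidates.
    """
    rows = []
    for (k_prev, r_prev), (k_next, r_next) in zip(measurements, measurements[1:]):
        extra_ms = (k_next - k_prev) * ms_per_candidate
        rows.append((k_prev, k_next, r_next - r_prev, extra_ms))
    return rows


# Hypothetical offline measurements of stage 1 recall at several cutoffs.
measured = [(100, 0.82), (200, 0.88), (500, 0.92), (1000, 0.93)]
steps = marginal_gains(measured, ms_per_candidate=0.1)
for k_from, k_to, gain, extra_ms in steps:
    print(f"K {k_from} -> {k_to}: recall +{gain:.2f} for about {extra_ms:.0f} ms more")
```

With these made-up numbers, each step buys less recall for more latency, which is the usual signal for where to stop raising K.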
Evaluation across stages. Each stage should be evaluated separately. Stage 1 is evaluated for recall at K (do the relevant documents appear among the top K candidates?); stage 2 is evaluated for top-K ranking metrics (NDCG, MAP) given the candidate set. Stage 1 quality bounds the system's overall quality; stage 2 quality determines how well the system exploits stage 1's output. Confusing the two during evaluation leads to misdiagnosed quality issues, such as tuning the reranker when the relevant document never reached the candidate set.
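The per-stage metrics can be sketched as follows. This is a minimal version under stated assumptions: the function names and sample judgments are hypothetical, DCG uses the linear-gain form (gain divided by log2 of position plus one), and the ideal ordering is taken over the candidate set itself so that the NDCG figure isolates stage 2.

```python
import math


def recall_at_k(candidates, relevant, k):
    """Stage 1 metric: fraction of relevant docs present in the top k candidates."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(candidates[:k]) & relevant) / len(relevant)


def dcg(gains):
    """Discounted cumulative gain with linear gains (an assumed variant)."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))


def ndcg_at_k(reranked, grades, k):
    """Stage 2 metric: NDCG@k relative to the best ordering of the same candidates."""
    gains = [grades.get(doc_id, 0) for doc_id in reranked]
    ideal_dcg = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded judgments for one query (higher is more relevant).
grades = {"d1": 3, "d2": 1, "d5": 2}
stage1_candidates = ["d2", "d7", "d1", "d9"]   # d5 was never retrieved
stage2_output = ["d1", "d2", "d7", "d9"]       # best possible order of those candidates

stage1_recall = recall_at_k(stage1_candidates, grades, k=4)
stage2_ndcg = ndcg_at_k(stage2_output, grades, k=4)
print(stage1_recall, stage2_ndcg)
```

Here the reranker orders its candidates as well as it possibly can, yet one relevant document is missing from the results. Only the stage 1 metric reveals that the loss happened at retrieval.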
Beyond two stages. Three-stage and four-stage cascades exist in production systems: ultra-cheap retrieval (e.g., BM25 alone) to 10,000 candidates, mid-cost reranking (LTR) to 100 candidates, expensive reranking (cross-encoder) to top 10–20. Each stage justifies its computational cost by applying methods that wouldn't scale to the prior stage's candidate count. Marginal benefit of additional stages diminishes; most production systems run two or three stages.
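The cascade generalizes the two-stage shape to a list of (scorer, keep) pairs. The sketch below is illustrative only: `cascade`, `counted`, and the three arithmetic scorers are hypothetical stand-ins, and the call counter exists solely to show how many documents each stage has to score.

```python
def cascade(query, doc_ids, stages):
    """Run candidates through successive (score_fn, keep) stages.

    Each stage scores only the survivors of the previous stage and keeps
    its top `keep` documents, so costlier scorers see fewer candidates.
    """
    candidates = list(doc_ids)
    for score, keep in stages:
        candidates = sorted(
            candidates, key=lambda doc_id: score(query, doc_id), reverse=True
        )[:keep]
    return candidates


calls = {"cheap": 0, "mid": 0, "expensive": 0}


def counted(name, score_fn):
    """Wrap a scorer so each invocation is tallied under `name`."""
    def wrapped(query, doc_id):
        calls[name] += 1
        return score_fn(query, doc_id)
    return wrapped


# Toy scorers over integer doc ids; real stages would be BM25, LTR, cross-encoder.
stages = [
    (counted("cheap", lambda q, d: -abs(d - q)), 8),
    (counted("mid", lambda q, d: -abs(d - q) + (d % 2)), 3),
    (counted("expensive", lambda q, d: -((d - q - 1) ** 2)), 2),
]
result = cascade(10, range(20), stages)
print(result, calls)
```

The tallies shrink from 20 to 8 to 3 scorer calls, mirroring how each stage in a cascade only pays for the candidate count the previous stage hands it.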
Applies to almost every production search system above toy scale. The pattern is the default; single-stage scoring is reserved for narrow cases (very small corpora, very simple matching needs, latency-critical applications where the second stage's cost is unacceptable). Modern RAG pipelines for LLM-based agents use the same architecture.
Alternatives — single-stage scoring for narrow specialized cases. The multi-stage pattern is so universal that justifying single-stage requires specific reasons (latency constraints, corpus size, simplicity needs).
- Liu, "Learning to Rank for Information Retrieval" (2009) for LTR foundations
- Nogueira and Cho, "Passage Re-ranking with BERT" (2019) for the introduction of neural rerankers
- Reimers and Gurevych, sentence-transformers cross-encoder documentation