RelevantSearch.AI
Pattern · Volume 01 · Section D: Multi-stage retrieval and reranking · Updated May 2026

Cross-encoder reranking

Source: Sentence-Transformers cross-encoder models; commercial rerankers (Cohere Rerank, Voyage Rerank); open models (BGE Reranker, mxbai-rerank)

Classification — Reranking pattern using transformer models that jointly process query and document for high-precision scoring.

Intent

Apply joint query-document attention to rank candidate documents with much higher precision than independent embedding similarity, accepting higher per-document cost in exchange for better top-K quality.

Motivating Problem

Dense vector retrieval (Section B) encodes queries and documents independently into embeddings, then computes similarity. The independence is what makes retrieval scalable — documents can be embedded offline and indexed; only the query needs embedding at query time. But the independence is also a limitation: the model never sees query and document together, so it can't capture interactions specific to that pair. Cross-encoders address this by jointly processing query and document, capturing fine-grained interaction signals that bi-encoders (dense retrieval) miss — at the cost of being too expensive to apply at retrieval scale.

How It Works

Model architecture. A transformer (typically BERT-derived, or a larger modern variant) takes the query and a candidate document as a single concatenated input. The model's attention layers process both jointly, computing fine-grained interactions: which query terms match which document terms, how the document's overall context affects each match, and other contextual signals that bi-encoders can't capture. The output is a single relevance score for that query-document pair.
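The concatenated input can be sketched as follows for a BERT-style model. This is a minimal illustration, not any library's tokenizer: the function name build_pair_input is hypothetical, whitespace-split words stand in for real subword tokens, and truncating only the document side is one assumed strategy among several.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"  # BERT-style special tokens


def build_pair_input(
    query_tokens: List[str],
    doc_tokens: List[str],
    max_len: int = 512,
) -> Tuple[List[str], List[int]]:
    """Pack a query and a document into one sequence.

    Layout: [CLS] query [SEP] document [SEP]
    Segment id 0 marks the query side, 1 marks the document side.
    """
    # Three positions are reserved for the special tokens.
    doc_budget = max_len - len(query_tokens) - 3
    if doc_budget < 1:
        raise ValueError("query leaves no room for document tokens")

    # Assumption: keep the query whole and truncate only the document.
    doc_tokens = doc_tokens[:doc_budget]

    tokens = [CLS, *query_tokens, SEP, *doc_tokens, SEP]
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(
    "reset password".split(),
    "how to reset a password".split(),
    max_len=8,
)
```

Because both sides sit in one sequence, every attention layer can relate query tokens to document tokens directly, which is the interaction a bi-encoder never gets to compute.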

The bi-encoder vs cross-encoder trade-off. A bi-encoder encodes query and document independently; the similarity is a single dot-product or cosine computation per pair, and document embeddings can be pre-computed. A cross-encoder requires a full transformer forward pass per pair, which is hundreds to thousands of times more expensive but captures interactions the bi-encoder misses. Bi-encoders scale to billions of documents at retrieval time; cross-encoders are in practice limited to hundreds of candidates per query.
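A rough cost model makes the asymmetry concrete. The sketch below counts query-time work in arbitrary units; the function name and the example unit costs are illustrative assumptions, not measurements, and it simplifies the bi-encoder side to an exhaustive scan (a real system would typically use an approximate nearest-neighbor index and do even less work).

```python
def query_time_cost(
    num_candidates: int,
    forward_pass_cost: float,
    similarity_cost: float,
) -> dict:
    """Rough query-time cost of scoring num_candidates documents.

    Costs are in arbitrary units chosen by the caller.
    """
    # Bi-encoder: one forward pass to embed the query, then one cheap
    # similarity per document (document embeddings were built offline).
    bi = forward_pass_cost + num_candidates * similarity_cost

    # Cross-encoder: one full forward pass per (query, document) pair.
    cross = num_candidates * forward_pass_cost

    return {"bi_encoder": bi, "cross_encoder": cross, "ratio": cross / bi}


# Assumed unit costs: a forward pass is 1000x a single dot product.
costs = query_time_cost(
    num_candidates=1000,
    forward_pass_cost=1000.0,
    similarity_cost=1.0,
)
```

Under these assumed unit costs the cross-encoder does 500 times the query-time work for the same 1000 candidates, and the gap widens as the candidate set grows, since the bi-encoder's single forward pass is amortized over more documents.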

Production deployment. Cross-encoders run as reranking stages after first-stage retrieval (Sections A, B, or C). Typical configuration: retrieve 100–500 candidates with cheap methods, then rerank with a cross-encoder to produce the final top 10–20. Latency per query depends on candidate count and model size; production systems target sub-100ms reranking budgets with batched scoring on accelerated hardware.
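The retrieve-then-rerank flow described above can be sketched as a small function that scores candidates in batches and keeps the top K. The names rerank, PairScorer, and overlap_scorer are hypothetical. The overlap scorer is a deliberately crude stand-in so the example runs without a model; it is not a cross-encoder. In a real deployment that slot would be filled by a model call, for example a thin wrapper around the Sentence-Transformers CrossEncoder.predict method, which accepts a list of (query, document) pairs.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (query, document) pairs to one relevance score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], List[float]]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: PairScorer,
    top_k: int = 10,
    batch_size: int = 32,
) -> List[Tuple[str, float]]:
    """Score first-stage candidates against the query and keep the best top_k."""
    scores: List[float] = []
    # Batching lets an accelerator score many pairs per forward pass.
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        scores.extend(score_pairs([(query, doc) for doc in batch]))

    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def overlap_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (NOT a cross-encoder): share of query terms in the document."""
    scores = []
    for query, doc in pairs:
        q_terms = set(query.lower().split())
        d_terms = set(doc.lower().split())
        scores.append(len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0)
    return scores


first_stage = [
    "how to reset a password",
    "password policy overview",
    "office parking rules",
]
top = rerank("reset password", first_stage, overlap_scorer, top_k=2, batch_size=2)
```

Passing the scorer in as a callable keeps the pipeline independent of where inference happens, so a self-hosted model and a hosted reranking API could be swapped without touching the ranking logic.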

Model options. Open models (BGE Reranker, mxbai-rerank, sentence-transformers cross-encoders) are free and self-hostable but require infrastructure. Commercial APIs (Cohere Rerank, Voyage Rerank, Mixedbread Rerank) provide reranking as a service; latency is the network round-trip plus inference time, and cost is charged per document scored. The trade-offs mirror the broader self-host vs API decision covered in the agentic AI series.

Fine-tuning. Cross-encoders can be fine-tuned on domain-specific labeled data (relevance judgments, click data, hard negatives mined from production). Fine-tuned models typically outperform off-the-shelf models on the domain they were tuned for. Cross-encoder fine-tuning is a specialty in its own right within search engineering; the methodology is left to the planned relevance and ranking volume.
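As one illustration of how such training data might be assembled, the sketch below mines hard negatives from a first-stage ranking: documents that ranked highly but carry no relevance label. All names here are hypothetical, and the approach assumes that unlabeled documents are truly non-relevant, which does not always hold; in practice mined negatives are often filtered or spot-checked to catch unlabeled positives.

```python
from typing import Dict, List, Sequence, Set, Tuple


def mine_hard_negatives(
    ranked_ids: Sequence[str],
    relevant_ids: Set[str],
    num_negatives: int = 4,
) -> List[str]:
    """Pick the highest-ranked first-stage results that are not labeled relevant."""
    negatives = [doc_id for doc_id in ranked_ids if doc_id not in relevant_ids]
    return negatives[:num_negatives]


def build_training_rows(
    query: str,
    ranked_ids: Sequence[str],
    relevant_ids: Set[str],
    docs: Dict[str, str],
    num_negatives: int = 4,
) -> List[Tuple[str, str, float]]:
    """Return (query, document_text, label) rows: 1.0 relevant, 0.0 hard negative."""
    # Sorted for a deterministic row order.
    rows = [(query, docs[doc_id], 1.0) for doc_id in sorted(relevant_ids)]
    for doc_id in mine_hard_negatives(ranked_ids, relevant_ids, num_negatives):
        rows.append((query, docs[doc_id], 0.0))
    return rows


docs = {
    "d1": "password policy overview",
    "d2": "how to reset a password",
    "d3": "account lockout troubleshooting",
    "d4": "office parking rules",
}
rows = build_training_rows(
    query="reset password",
    ranked_ids=["d1", "d2", "d3", "d4"],  # first-stage order, best first
    relevant_ids={"d2"},                  # from relevance judgments or clicks
    docs=docs,
    num_negatives=2,
)
```

Negatives drawn from the top of the first-stage ranking are the ones the retriever already confuses with relevant results, so they tend to teach the reranker more than randomly sampled documents would.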

When to Use It

  • Production search where top-K precision matters and the cost of expensive reranking is justified by quality gains.
  • RAG pipelines for agents where the documents passed to the LLM matter substantially.
  • High-stakes search (legal, medical, customer service) where wrong top results have meaningful costs.
  • Cases where first-stage retrieval quality is good but the ordering needs refinement.

Alternatives — LTR-based reranking with engineered features for cases where labeled training data is rich and feature engineering is feasible. Pure first-stage retrieval for low-stakes applications where rerank cost isn't justified. Multi-stage cascades that put cross-encoders even later (as a third stage after LTR reranking) for very high-quality requirements.

Sources
  • Nogueira and Cho, "Passage Re-ranking with BERT" (2019)
  • Sentence-Transformers cross-encoder documentation (sbert.net)
  • Cohere Rerank documentation (cohere.com/rerank)
  • BGE Reranker (huggingface.co/BAAI/bge-reranker-v2-m3)
