RelevantSearch.AI
Pattern · Volume 04 · Section D: Neural rerankers · Updated May 2026

Cross-encoder reranking in production

Source: Nogueira and Cho, "Passage Re-ranking with BERT" (2019); production rerankers: sentence-transformers, BGE Reranker, monoT5, RankZephyr, Cohere Rerank, Voyage Rerank

Classification — Top-K reranking using transformer models that jointly process query and document for fine-grained interaction scoring.

Intent

Apply transformer-based cross-encoder scoring to a small candidate set to produce substantially higher top-K quality than feature-based LTR can achieve, accepting the higher computational cost in exchange for the quality.

Motivating Problem

Volume 1 Section D introduced the cross-encoder pattern at the architectural level: independent retrieval produces candidates, and a cross-encoder reranks them. This entry covers the methods themselves: which models to use, how to train them, and how to deploy them in production cascade architectures.

How It Works

The model architecture. A transformer (BERT, T5, Llama-based, or other modern architectures) takes the query and document as one concatenated input; for BERT-style encoders this is [CLS] query [SEP] document [SEP]. The attention layers compute joint query-document representations, and a final classification head outputs a single relevance score. (Sequence-to-sequence rerankers such as monoT5 instead take a text prompt and read the score from the probability of a target token.) The model is fine-tuned on labeled relevance data (typically MS MARCO or domain-specific data) to predict relevance scores that correlate with human judgments.
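To make the BERT-style input layout concrete, here is a minimal sketch that assembles the token sequence and segment ids for one query-document pair. The function name `build_pair_input` and the choice to truncate only the document are assumptions for illustration; real tokenizers handle this internally and operate on subword ids rather than whole words.

```python
from typing import List, Tuple


def build_pair_input(
    query_tokens: List[str],
    doc_tokens: List[str],
    max_len: int = 512,
) -> Tuple[List[str], List[int]]:
    """Assemble [CLS] query [SEP] document [SEP], truncating only the document."""
    # Three special tokens are always present: one [CLS] and two [SEP].
    budget = max_len - len(query_tokens) - 3
    if budget < 0:
        raise ValueError("query alone exceeds max_len")
    doc_tokens = doc_tokens[:budget]

    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the query, and the first [SEP];
    # segment 1 covers the document and the final [SEP].
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids


# Hypothetical example: a 10-token limit forces the document to be cut short.
tokens, segment_ids = build_pair_input(
    ["what", "is", "bm25"],
    ["bm25", "is", "a", "ranking", "function"],
    max_len=10,
)
```

Because every document token attends to every query token in this single sequence, the score cannot be precomputed per document; that is the source of both the quality and the cost.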

Model size trade-offs. Smaller models (110M-parameter BERT-base, 220M-parameter T5-base) are faster and cheaper but less capable. Larger models (335M BERT-large, 770M T5-large, 7B+ Llama-based) are slower and more expensive but produce better rankings. Production deployments choose model size based on latency budget and quality requirements; many teams find that small-to-medium models (220M–500M parameters) provide a good cost-quality balance.

Open model options. Sentence-transformers cross-encoders (sbert.net): the canonical open-source option, BERT-based, available in many sizes. BGE Reranker (BAAI): strong quality, multilingual, free and self-hostable. mxbai-rerank: Mixedbread AI's open release. monoT5 (Nogueira, Pradeep, Lin, and colleagues) and RankT5 (Google Research): T5-based rerankers. RankZephyr: instruction-tuned listwise reranker. Each has trade-offs in quality, latency, and language coverage.

Commercial API options. Cohere Rerank (cohere.com/rerank): easy integration, multiple model sizes, multilingual. Voyage Rerank: high quality, especially on domain-specific data. Mixedbread Rerank: API access to the mxbai-rerank models. Trade-offs vs. self-hosting are the standard ones from the agentic AI series Volume 16: API simplifies operations and provides consistent quality; self-hosting reduces per-query cost at scale and provides more control.

Production deployment. Cross-encoder reranking is typically deployed as a final stage in cascade architectures (Volume 1 Section D). First-stage retrieval (BM25, vector, or hybrid) produces 100–1000 candidates; mid-stage LTR may reduce to 50–200 candidates; cross-encoder reranks the survivors. Latency is bounded by candidate count times per-pair scoring time; production teams tune candidate counts for the latency budget. Hardware (GPU, ideally) and batching (8–64 pairs per inference call) substantially affect achievable throughput.
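The latency arithmetic above can be sketched as a small budget calculator. This assumes batches are scored one after another and that each batch takes roughly the same time whether or not it is full; both function names and the timing figures in the example are placeholders to be replaced with measurements from your own hardware.

```python
import math


def rerank_latency_ms(n_candidates: int, batch_size: int, ms_per_batch: float) -> float:
    """Estimate rerank latency as number of sequential batches times per-batch time."""
    n_batches = math.ceil(n_candidates / batch_size)
    return n_batches * ms_per_batch


def max_candidates(budget_ms: float, batch_size: int, ms_per_batch: float) -> int:
    """Largest candidate count whose estimated latency still fits the budget."""
    return int(budget_ms // ms_per_batch) * batch_size


# Placeholder timings (not benchmarks): 20 ms per batch of 32 pairs.
estimate = rerank_latency_ms(n_candidates=100, batch_size=32, ms_per_batch=20.0)
limit = max_candidates(budget_ms=100.0, batch_size=32, ms_per_batch=20.0)
```

Under these placeholder numbers, 100 candidates need four batches, and a 100 ms budget leaves room for five. Measuring `ms_per_batch` at several batch sizes is usually the first tuning step, since larger batches trade per-call latency for throughput.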

Fine-tuning on domain data. Off-the-shelf rerankers are trained on general data (MS MARCO, NQ); fine-tuning on domain-specific labeled data typically produces 2–10% quality improvement. The training data: query-document pairs with relevance grades, hard negatives (documents that look relevant but aren't) mined from production, optionally easy negatives (clearly irrelevant documents). Training frameworks: sentence-transformers for self-hosted training; commercial APIs may offer fine-tuning as a service. The investment in fine-tuning pays off when domain quality matters and the team has training data infrastructure.
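One simple way to assemble such training data is to treat the highest-ranked first-stage results that were not judged relevant as hard negatives. The sketch below assumes that approach; `build_training_pairs` and the document ids are hypothetical, and the output is framework-neutral (query, doc_id, label) tuples rather than any particular library's training format.

```python
from typing import List, Set, Tuple


def build_training_pairs(
    query: str,
    ranked_doc_ids: List[str],
    relevant_ids: Set[str],
    n_hard: int = 2,
) -> List[Tuple[str, str, int]]:
    """Return (query, doc_id, label) tuples: positives plus mined hard negatives."""
    # Positives: every document judged relevant for this query.
    positives = [(query, d, 1) for d in sorted(relevant_ids)]
    # Hard negatives: top-ranked retrieved documents that are not judged relevant.
    # They look plausible to the first stage, which is what makes them "hard".
    hard_negatives = [
        (query, d, 0) for d in ranked_doc_ids if d not in relevant_ids
    ][:n_hard]
    return positives + hard_negatives


# Hypothetical first-stage ranking and a single relevance judgment.
pairs = build_training_pairs(
    query="reset router password",
    ranked_doc_ids=["d7", "d2", "d9", "d4"],
    relevant_ids={"d2"},
    n_hard=2,
)
```

A caveat with this heuristic: unjudged documents near the top of the ranking may in fact be relevant, so mined negatives are worth spot-checking before training on them.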

Calibration concerns. Cross-encoder scores are not directly comparable across queries (the model wasn't trained to produce calibrated scores). This matters for some downstream uses: weighted hybrid scoring (Volume 1 Section C) that combines reranker scores with other signals needs calibration. The calibration methods (Platt scaling, isotonic regression) are standard ML techniques applied to the reranker outputs.
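As an illustration of the first method, the sketch below fits a Platt-style sigmoid, p = sigmoid(a * score + b), to raw reranker scores with plain gradient descent on log loss. It is a dependency-free simplification: it omits the target smoothing of Platt's original method, and the scores and labels are made-up values, not reranker output.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fit_platt(
    scores: List[float],
    labels: List[int],
    lr: float = 0.1,
    steps: int = 2000,
) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Made-up held-out scores with binary relevance labels (not separable on purpose).
scores = [-4.0, -2.5, -1.0, 0.5, 1.5, 3.0]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
```

The fitted mapping is monotonic, so it changes score values but not the ranking within a query. In practice a library implementation (for example scikit-learn's `LogisticRegression` or `IsotonicRegression`) fitted on held-out judgments is the more likely choice.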

When to Use It

Production search where top-K precision matters and the computational cost of cross-encoder reranking is justified. RAG pipelines (agentic AI Volume 10) where reranking improves the documents passed to the LLM. E-commerce, enterprise search, and customer-service search where ranking quality has business value. Cascade architectures where the cross-encoder is the final stage.

Alternatives — LTR-based reranking (Section B) for cases where the cross-encoder cost isn't justified. Late-interaction (next entry) for cost-quality trade-offs between bi-encoder and cross-encoder. Pure first-stage retrieval for low-stakes applications.

Sources
  • Nogueira and Cho, "Passage Re-ranking with BERT" (2019)
  • Pradeep, Nogueira, Lin on monoT5 and successor models
  • Sentence-Transformers documentation (sbert.net)
  • Cohere Rerank documentation (cohere.com/rerank)
  • BGE Reranker (huggingface.co/BAAI/bge-reranker-v2-m3)

Example artifacts

Code

# Production cross-encoder reranking with sentence-transformers
import time

import torch
from sentence_transformers import CrossEncoder

# Load a production-quality reranker.
# Trade-offs: ms-marco-MiniLM-L-6-v2 is small and fast (~80MB);
# BGE-reranker-v2-m3 is higher quality but much larger (~570M parameters).
# Use the GPU when one is available; fall back to CPU otherwise.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device=device)


def rerank_candidates(query, candidates, top_k=10):
    """
    Apply cross-encoder reranking to candidates from first-stage retrieval.
    candidates: list of dicts with 'doc_id' and 'text' fields
    """
    if not candidates:
        return []
    # Build query-document pairs
    pairs = [(query, c['text']) for c in candidates]
    # Score in batches for throughput; GPU batching is essential for
    # production latency
    scores = model.predict(
        pairs,
        batch_size=32,
        show_progress_bar=False,
        convert_to_numpy=True,
    )
    # Attach scores to copies (the caller's dicts are left unchanged) and sort
    scored = [{**c, 'rerank_score': float(s)} for c, s in zip(candidates, scores)]
    ranked = sorted(scored, key=lambda c: c['rerank_score'], reverse=True)
    return ranked[:top_k]


# Production pattern: rerank top 100 from retrieval to top 10 for display.
# first_stage_retrieve is the application's own retrieval function
# (BM25, hybrid, etc.); it is passed in because it is not defined here.
def search(query, first_stage_retrieve):
    retrieved = first_stage_retrieve(query, top_k=100)

    # Latency monitoring (essential for production)
    start = time.perf_counter()
    reranked = rerank_candidates(query, retrieved, top_k=10)
    latency_ms = (time.perf_counter() - start) * 1000
    print(f'Rerank latency: {latency_ms:.1f}ms for {len(retrieved)} candidates')
    # Target: < 100ms for 100 candidates on GPU; expect 5-10x longer on CPU
    return reranked


# Commercial API alternative: Cohere Rerank
import cohere

# With no arguments the client reads the API key from the CO_API_KEY
# environment variable.
co = cohere.Client()


def rerank_via_cohere(query, candidates, top_k=10):
    response = co.rerank(
        model='rerank-english-v3.0',
        query=query,
        documents=[c['text'] for c in candidates],
        top_n=top_k,
    )
    # Results arrive sorted by relevance; r.index points back into candidates
    return [
        {**candidates[r.index], 'rerank_score': r.relevance_score}
        for r in response.results
    ]
