Source: Production RAG methodology; Cohere Rerank documentation; sentence-transformers library; literature on cross-encoders 2019–2025

Classification — Pattern for adding semantic reranking on top of traditional retrieval, using cross-encoder models that score (query, document) pairs.

Intent

Lift retrieval quality by re-scoring the top-N candidates from cheap retrieval using a semantic model that's too expensive to run on every document but cheap enough for the candidate set.

Motivating Problem

Traditional retrieval (BM25, vector search) is fast and recall-oriented. It returns reasonable candidates, but the top-K ordering is often wrong: a relevant document appears at position 15, a less relevant one at position 2. Users see only the top 5–10; relevant documents below that effectively don't exist.

Cross-encoder models score (query, document) pairs jointly, considering the full interaction between query terms and document content. They're substantially more accurate than first-stage retrieval scores but too expensive to apply to millions of documents, because every score requires a model forward pass at query time and nothing can be precomputed per document. The two-stage pattern (cheap retrieval first, expensive reranking on the top-N) captures the best of both.

How It Works

Stage 1: broad retrieval. Run lexical and/or vector retrieval to produce top 50–200 candidates. The recall target at this stage is high ('is the right answer in the top 200?'); ordering within the set matters less because reranking will reorder.
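When Stage 1 combines lexical and vector retrieval, the two ranked lists have to be merged into one candidate set. A minimal sketch of reciprocal rank fusion (RRF), one common way to do that, follows. The function name `rrf_merge`, the document IDs, and the constant `k=60` (a conventional default) are illustrative assumptions, not part of any particular backend.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_merge(rankings: List[List[str]], k: int = 60, top_n: int = 200) -> List[str]:
    """Merge ranked ID lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document (rank starts at 1);
    documents are returned by descending summed score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical outputs of a BM25 query and a vector query.
bm25_ids = ["doc_42", "doc_7", "doc_88"]
vector_ids = ["doc_88", "doc_42", "doc_13"]
merged = rrf_merge([bm25_ids, vector_ids])
```

Documents that appear in both lists (here `doc_42` and `doc_88`) accumulate score from each and rise above documents found by only one retriever, which suits a recall-oriented first stage.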

Stage 2: cross-encoder rerank. Run the cross-encoder model on each (query, candidate) pair. The model outputs a relevance score; sort candidates by score; return the top K (typically 5–10) as the final ranked list.
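The score-sort-truncate step can be sketched independently of any particular model. In the sketch below, `rerank_candidates` and the `PairScorer` type are hypothetical names, and `toy_overlap_scorer` is only a stand-in so the example runs without a model. In practice the scorer would wrap a real cross-encoder, for example the `predict` method of a sentence-transformers `CrossEncoder`, which accepts a list of (query, text) pairs.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A scorer takes (query, text) pairs and returns one relevance score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def rerank_candidates(
    query: str,
    candidates: List[Dict],
    score_pairs: PairScorer,
    top_k: int = 10,
) -> List[Dict]:
    """Score every (query, candidate) pair, sort by score, keep the top K."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_pairs(pairs)  # one batched call rather than one call per pair
    scored = [{**c, "rerank_score": float(s)} for c, s in zip(candidates, scores)]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_k]


def toy_overlap_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (NOT a cross-encoder): fraction of query terms in the text."""
    scores = []
    for query, text in pairs:
        q_terms = set(query.lower().split())
        t_terms = set(text.lower().split())
        scores.append(len(q_terms & t_terms) / max(len(q_terms), 1))
    return scores


candidates = [
    {"id": "a", "text": "shipping rates for europe"},
    {"id": "b", "text": "how to reset your password"},
    {"id": "c", "text": "password policy"},
]
top = rerank_candidates("reset password", candidates, toy_overlap_scorer, top_k=2)
```

Passing the scorer as a parameter keeps the sorting logic identical whether the scores come from a self-hosted model, a hosted API, or an LLM prompt.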

Cross-encoder model options. Three production paths: open-source pre-trained models (the ms-marco-MiniLM cross-encoder family used with sentence-transformers, BAAI/bge-reranker, mixedbread.ai's mxbai-rerank); hosted reranker APIs (Cohere Rerank, Voyage AI rerank, Jina rerank); and LLM-as-judge, with an off-the-shelf LLM (Claude, GPT) prompted to score relevance. Each option has different latency, cost, and quality characteristics.

Latency budget. Reranking a candidate set of 50 takes roughly 500 ms with most cross-encoder models (batched on GPU) or via hosted APIs. This adds directly to overall query latency. Production patterns: running retrieval and reranking as a parallel pipeline; result streaming, where the user sees initial results while reranking completes; and aggressive caching of reranked results keyed by (query, top-50 candidate IDs).
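The caching pattern can be sketched as a small in-memory LRU cache keyed by the query and the candidate IDs. `RerankCache`, `cached_rerank`, and the key normalization are illustrative assumptions. In particular, sorting the candidate IDs assumes the reranker's output does not depend on input order, and a shared store with a TTL would likely replace the in-process dictionary in production.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List, Optional


class RerankCache:
    """Small in-memory LRU cache keyed by (query, candidate IDs)."""

    def __init__(self, max_entries: int = 10_000) -> None:
        self._store: "OrderedDict[str, List[str]]" = OrderedDict()
        self._max_entries = max_entries

    @staticmethod
    def make_key(query: str, candidate_ids: List[str]) -> str:
        # Normalize the query and sort the IDs so equivalent requests share a key.
        payload = query.strip().lower() + "\x00" + "\x00".join(sorted(candidate_ids))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[List[str]]:
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key: str, ranked_ids: List[str]) -> None:
        self._store[key] = ranked_ids
        self._store.move_to_end(key)
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry


def cached_rerank(
    query: str,
    candidate_ids: List[str],
    rerank_fn: Callable[[str, List[str]], List[str]],
    cache: RerankCache,
) -> List[str]:
    """Return cached ranked IDs if present; otherwise call rerank_fn and store."""
    key = cache.make_key(query, candidate_ids)
    hit = cache.get(key)
    if hit is not None:
        return hit
    ranked = rerank_fn(query, candidate_ids)
    cache.put(key, ranked)
    return ranked
```

Including the candidate IDs in the key means an index update that changes the candidate set produces a cache miss, so stale orderings are not served for new candidates.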

Cost considerations. Per-query cost depends on model and provider. Self-hosted cross-encoders are cheapest after infrastructure amortization; hosted APIs charge per search request or per token, depending on the provider; LLM-as-judge is the most expensive per pair. For 1M queries/month against 50-document candidate sets, costs range from roughly $50/month (self-hosted) to $5,000/month (LLM-as-judge with a Sonnet-class model).
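To compare options on a common basis, the monthly figures above can be converted into implied unit costs. The sketch below uses only the numbers quoted in this section (1M queries, 50 candidates, $50 and $5,000 per month); the function names are illustrative, and real provider pricing should be substituted before drawing conclusions.

```python
QUERIES_PER_MONTH = 1_000_000
CANDIDATES_PER_QUERY = 50

# Every query scores each of its candidates once.
PAIRS_PER_MONTH = QUERIES_PER_MONTH * CANDIDATES_PER_QUERY  # 50,000,000 pairs


def cost_per_query(monthly_cost_usd: float) -> float:
    """Implied cost of reranking one query's full candidate set."""
    return monthly_cost_usd / QUERIES_PER_MONTH


def cost_per_1k_pairs(monthly_cost_usd: float) -> float:
    """Implied cost of scoring 1,000 (query, document) pairs."""
    return monthly_cost_usd / PAIRS_PER_MONTH * 1_000


for label, monthly in [("self-hosted", 50.0), ("LLM-as-judge", 5_000.0)]:
    print(
        f"{label}: ${cost_per_query(monthly):.5f} per query, "
        f"${cost_per_1k_pairs(monthly):.3f} per 1,000 pairs"
    )
```

At these figures the two ends of the range differ by a factor of 100: about $0.00005 versus $0.005 per query.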

Quality measurement. Reranking improvement is measured in NDCG@K, MRR, or click-based proxies. Production patterns: A/B test the reranker against the baseline (no reranking); measure both quality and downstream metrics (click-through rate, conversion); maintain a regression suite of (query, expected top result) pairs that catches reranker regressions.
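For offline evaluation, NDCG@K and MRR can be computed directly from ranked ID lists and relevance judgments. The sketch below uses the linear-gain form of DCG (one common variant; an exponential-gain form also exists). The function names and the sample judgments are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Set


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), ranks starting at 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def mrr(ranked_lists: List[List[str]], relevant_sets: List[Set[str]]) -> float:
    """Mean over queries of 1 / rank of the first relevant result (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists) if ranked_lists else 0.0


# Hypothetical graded judgments for one query (higher is more relevant).
judgments = {"doc_42": 3.0, "doc_88": 1.0}
baseline = ["doc_7", "doc_88", "doc_42"]   # first-stage ordering
reranked = ["doc_42", "doc_88", "doc_7"]   # ordering after reranking

baseline_ndcg = ndcg_at_k(baseline, judgments, k=3)
reranked_ndcg = ndcg_at_k(reranked, judgments, k=3)
```

Running the same functions over a fixed set of judged queries before and after a reranker change gives the regression check described above.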

Failure handling. If the reranker fails (timeout, vendor outage, model error), fall back to the post-retrieval ranking unchanged. The fallback degrades quality but preserves availability. Production teams should test the fallback path regularly.
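A fallback that also covers slow responses needs a deadline as well as exception handling. The sketch below wraps an arbitrary reranker callable with a timeout using a thread pool. `rerank_with_fallback`, the pool size, and the 0.5-second default are illustrative assumptions, and the demo functions only simulate a slow and a failing vendor.

```python
import concurrent.futures
import time
from typing import Callable, Dict, List

# Shared worker pool; the size is an arbitrary illustrative choice.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

RerankFn = Callable[[str, List[Dict], int], List[Dict]]


def rerank_with_fallback(
    query: str,
    candidates: List[Dict],
    rerank_fn: RerankFn,
    top_k: int = 10,
    timeout_s: float = 0.5,
) -> List[Dict]:
    """Return rerank_fn's result, or the retrieval ordering on timeout or error."""
    future = _POOL.submit(rerank_fn, query, candidates, top_k)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both the timeout and any error raised inside rerank_fn.
        # cancel() only prevents a call that has not started yet; a call that
        # is already running continues in its worker thread and is ignored.
        future.cancel()
        return candidates[:top_k]


def slow_reranker(query: str, candidates: List[Dict], top_k: int) -> List[Dict]:
    time.sleep(0.2)  # simulate a slow vendor response
    return list(reversed(candidates))[:top_k]


def broken_reranker(query: str, candidates: List[Dict], top_k: int) -> List[Dict]:
    raise RuntimeError("simulated vendor outage")


docs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
on_timeout = rerank_with_fallback("q", docs, slow_reranker, top_k=2, timeout_s=0.05)
on_error = rerank_with_fallback("q", docs, broken_reranker, top_k=2)
```

The simulated failures double as a way to exercise the fallback path in tests, so it does not run for the first time during a real outage.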

When to Use It

Almost any production retrieval system benefits from reranking. The investment is moderate (cross-encoder models are well-documented; hosted APIs are turnkey); the quality lift is substantial. The most common pattern is to start with a hosted reranker (Cohere or Voyage) for fast adoption, then evaluate self-hosting once the value is proven.

Less good fit: latency-critical systems where the additional 500 ms is unacceptable; pure-keyword search where lexical match is the primary signal; and very small candidate sets where the existing ordering is fine (e.g., navigational queries with a single expected result).

Sources

Cohere Rerank API documentation (docs.cohere.com)
sentence-transformers library documentation
Anthropic documentation on LLM-as-judge patterns
MTEB (Massive Text Embedding Benchmark) leaderboards for reranker model selection

Example artifacts

Code

# Two-stage retrieval with hosted reranker (Python)

from typing import List, Dict
import cohere

co = cohere.Client()

def hybrid_retrieve(query: str, k: int = 50) -> List[Dict]:
    """Stage 1: broad retrieval. Returns top-K candidates with text + ID.

    Implementation depends on your backend (Elasticsearch, vector DB, etc).
    This is a placeholder showing the expected return shape.
    """
    # Run lexical (BM25) and vector retrieval in parallel
    # Merge via RRF or weighted score
    # Return top-K
    return [
        {"id": "doc_42", "text": "...", "score": 0.87},
        {"id": "doc_88", "text": "...", "score": 0.81},
        # ... up to k
    ]

def rerank(query: str, candidates: List[Dict], top_k: int = 10) -> List[Dict]:
    """Stage 2: cross-encoder rerank. Re-orders candidates by semantic relevance.

    Falls back to the input ordering on failure.
    """
    if not candidates:
        return []
    try:
        response = co.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=[c["text"] for c in candidates],
            top_n=top_k,
            return_documents=False,
        )
        # response.results is sorted by relevance, with .index pointing to
        # input position
        reranked = []
        for r in response.results:
            original = candidates[r.index]
            reranked.append({
                **original,
                "rerank_score": r.relevance_score,
            })
        return reranked
    except Exception:
        # Fall back to retrieval ordering
        return candidates[:top_k]

def search(query: str, top_k: int = 10) -> List[Dict]:
    """End-to-end search: retrieve broadly, rerank narrowly."""
    candidates = hybrid_retrieve(query, k=50)
    return rerank(query, candidates, top_k=top_k)

Two-stage retrieval with cross-encoder reranking