Source: Production methodology; Cormack et al. (2009) original RRF paper; current platform documentation (Elasticsearch, Vespa, etc.)

Classification — Pattern for combining results from multiple retrieval methods (lexical + vector + LLM-derived signals) into a single ranked list without needing per-method score calibration.

Intent

Combine the recall of lexical matching, the semantic understanding of vector search, and optional LLM-derived signals into a unified retrieval result that is better than any single method alone.

Motivating Problem

Lexical retrieval (BM25) is strong on exact-term matches, weak on synonyms and paraphrasing. Vector retrieval is strong on semantic matches, weak on exact terms (especially proper nouns, product SKUs, identifiers). Each method retrieves documents the other misses. A hybrid approach gets both.

Combining results requires a fusion method. Raw score combination is brittle: BM25 scores are unbounded and depend on the corpus, while cosine similarities fall in [-1, 1], so weighted score combinations need careful calibration that has to be redone whenever a model or index changes. RRF avoids this by using only rank positions, not scores. The result is a robust fusion method that works well with little or no tuning.

How It Works

Run each retrieval method independently. Lexical (BM25 against tokenized index) and vector (cosine similarity against embedding index) execute in parallel; each returns its own top-K ranked list. Production patterns: parallelize the calls; cap K at 50–100 per method; ensure consistent document IDs so the same document is recognized across methods.
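
The parallel step above can be sketched as follows. This is a minimal illustration, not a production client: `search_lexical` and `search_vector` are hypothetical stand-ins that return canned hits, and `TOP_K` is an assumed per-method cap taken from the 50 to 100 range mentioned above. Threads are used on the assumption that real retrieval calls are I/O-bound network requests.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

TOP_K = 50  # assumed per-method cap


def search_lexical(query: str, k: int) -> List[Dict]:
    # Hypothetical stand-in for a BM25 query against a tokenized index.
    hits = [{"id": "doc_42"}, {"id": "doc_88"}, {"id": "doc_15"}]
    return hits[:k]


def search_vector(query: str, k: int) -> List[Dict]:
    # Hypothetical stand-in for a nearest-neighbour query against an embedding index.
    hits = [{"id": "doc_88"}, {"id": "doc_71"}, {"id": "doc_42"}]
    return hits[:k]


def retrieve_all(query: str, k: int = TOP_K) -> Dict[str, List[Dict]]:
    """Run every retrieval method concurrently; return one ranked list per method."""
    methods: Dict[str, Callable[[str, int], List[Dict]]] = {
        "lexical": search_lexical,
        "vector": search_vector,
    }
    with ThreadPoolExecutor(max_workers=len(methods)) as pool:
        futures = {name: pool.submit(fn, query, k) for name, fn in methods.items()}
        # result() blocks until that method finishes and re-raises its exception, if any.
        return {name: future.result() for name, future in futures.items()}


ranked = retrieve_all("wireless headphones")
```

Both stubs key their hits on the same `id` values, which is what lets a later fusion step recognize `doc_42` and `doc_88` as the same documents across methods.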

Compute RRF score per document. For each document, the RRF score is the sum across methods of 1/(k + rank_in_that_method), where rank is the 1-indexed position in that method's ranked list, and k is a constant (typically 60). Documents that rank high in any method get high RRF scores; documents that rank moderately in multiple methods can outscore documents that rank top in only one.
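
Written out, the score described above looks like this. The symbols M (the set of retrieval methods) and rank_m(d) (the 1-indexed position of document d in method m's list) are notation chosen here for illustration; a document missing from a method's list is assumed to contribute nothing for that method. The two worked cases show why moderate agreement can beat a single top placement.

```latex
\mathrm{RRF}(d) = \sum_{m \in M} \frac{1}{k + \operatorname{rank}_m(d)}, \qquad k = 60

% Document A: rank 1 in one method, absent from the other
\mathrm{RRF}(A) = \frac{1}{60 + 1} \approx 0.0164

% Document B: rank 10 in both methods
\mathrm{RRF}(B) = \frac{1}{60 + 10} + \frac{1}{60 + 10} = \frac{2}{70} \approx 0.0286
```

B outscores A even though A holds a first-place rank, because B collects a contribution from each method.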

Sort by RRF score. The unified ranked list is just the documents sorted by descending RRF score. Truncate to your target top-K (typically 20–50 for downstream reranking, 5–10 for direct presentation).

Per-method weighting (optional). If you want one method to count more than another, multiply each method's contribution by a weight. Production practice: start with equal weighting; only deviate when measurement justifies it. Often the equal-weight RRF is good enough that tuning weights doesn't pay off.
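
A small sketch of how a per-method weight shifts the fused order. The function name `weighted_rrf`, the method labels, and the weight of 3.0 are illustrative assumptions, not recommended values; the ranked lists reuse the toy document IDs from the example artifact.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    ranked_ids: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """Return document IDs sorted by weighted RRF score, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for method, ids in ranked_ids.items():
        weight = weights.get(method, 1.0)  # methods without a weight count once
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


ranked_ids = {
    "lexical": ["doc_42", "doc_88", "doc_15"],
    "vector": ["doc_88", "doc_71", "doc_42"],
}

equal = weighted_rrf(ranked_ids, {"lexical": 1.0, "vector": 1.0})
lexical_heavy = weighted_rrf(ranked_ids, {"lexical": 3.0, "vector": 1.0})
```

With equal weights `doc_88` leads. Tripling the lexical weight moves `doc_42` (lexical rank 1) to the top and lifts the lexical-only `doc_15` above the vector-only `doc_71`. Whether such a shift is an improvement is exactly what the measurement step has to decide.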

Adding LLM-derived signals. Two patterns. First, treat LLM query rewriting as a separate retrieval branch: rewrite the query with an LLM, run retrieval on the rewritten query, add the result as another input to RRF. Second, treat LLM reranking (Section C) as a post-fusion stage rather than a fusion input: RRF combines lexical + vector; then the reranker re-orders the unified top-K. Both patterns work; the choice depends on where the LLM cost is most easily absorbed.
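
The two patterns can be sketched in one pipeline. Everything model-facing here is a hypothetical stub: `rewrite_query` stands in for an LLM rewriting call, `search` for index queries (returning canned results), and `rerank` for an LLM or cross-encoder reranker. Only the wiring is the point: the rewritten query feeds RRF as a third branch, and the reranker runs after fusion on the truncated list.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Canned results keyed by (method, query) so the sketch runs without any index.
CANNED: Dict[Tuple[str, str], List[str]] = {
    ("lexical", "anc headphones"): ["doc_42", "doc_88", "doc_15"],
    ("vector", "anc headphones"): ["doc_88", "doc_71", "doc_42"],
    ("vector", "active noise cancelling headphones"): ["doc_71", "doc_88", "doc_93"],
}


def search(method: str, query: str) -> List[str]:
    # Hypothetical retrieval call; returns ranked document IDs.
    return CANNED.get((method, query), [])


def rewrite_query(query: str) -> str:
    # Hypothetical LLM rewrite; here it only expands one abbreviation.
    return query.replace("anc", "active noise cancelling")


def rerank(query: str, doc_ids: List[str]) -> List[str]:
    # Hypothetical reranker; orders by a canned relevance table to stay deterministic.
    relevance = {"doc_42": 0.9, "doc_88": 0.7, "doc_71": 0.4}
    return sorted(doc_ids, key=lambda d: relevance.get(d, 0.0), reverse=True)


def rrf(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    scores: Dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


query = "anc headphones"

# Pattern 1: the rewritten query is one more input list to RRF.
branches = [
    search("lexical", query),
    search("vector", query),
    search("vector", rewrite_query(query)),
]
fused = rrf(branches)

# Pattern 2: the reranker re-orders the fused top-K instead of feeding the fusion.
reranked = rerank(query, fused[:3])
```

In this toy run the rewrite branch adds a third vote for `doc_88` and `doc_71`, and the reranker then reorders only the three candidates that survive truncation.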

Validation. RRF is well studied, but production teams should still validate it against their own workload. Maintain offline measurement (NDCG, MRR on judged sets) comparing RRF against single-method baselines and against weighted combinations, and use online A/B testing for production rollout. The validation effort is modest because RRF has few tuning parameters.
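
A minimal sketch of the offline comparison, assuming a judged set that maps document IDs to graded relevance. The judgments and ranked lists below are toy values for a single query, not measured results, and the NDCG variant shown uses linear gain (some evaluations use 2^rel - 1 instead). A real evaluation would average these metrics over many judged queries.

```python
import math
from typing import Dict, List


def reciprocal_rank(ranked_ids: List[str], gains: Dict[str, float]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if gains.get(doc_id, 0.0) > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: List[str], gains: Dict[str, float], k: int) -> float:
    """NDCG@k with linear gain and a log2 position discount."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(pos + 1)
        for pos, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 1) for pos, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy judgments for one query: higher gain means more relevant.
judgments = {"doc_88": 2.0, "doc_42": 1.0}

systems = {
    "lexical": ["doc_42", "doc_88", "doc_15"],
    "vector": ["doc_88", "doc_71", "doc_42"],
    "rrf": ["doc_88", "doc_42", "doc_71", "doc_15"],
}

report = {
    name: {
        "ndcg@3": ndcg_at_k(ranked, judgments, k=3),
        "rr": reciprocal_rank(ranked, judgments),
    }
    for name, ranked in systems.items()
}
```

Averaging the `rr` column across queries gives MRR. Weighted combinations slot into the same `systems` dictionary as extra entries, so each candidate configuration is scored against identical judgments.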

When to Use It

Almost any modern production retrieval system. Lexical + vector retrieval fused with RRF is the default starting point as of 2024–2026; teams that haven't adopted it are likely leaving recall on the table. The implementation is straightforward, and the quality lift is generally consistent across workloads.

Poorer fit: workloads with extreme latency budgets, where running two parallel retrieval methods is too expensive, and workloads where one method is dramatically better than the other (rare in practice).

Sources

Cormack, Clarke, Buettcher (2009) 'Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods'
Elasticsearch RRF documentation
Vespa documentation on hybrid retrieval
Anthropic documentation on hybrid search for RAG

Example artifacts

Code

# Reciprocal Rank Fusion (RRF) implementation (Python)

from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def rrf_fuse(
    ranked_lists: Iterable[List[Dict]],
    k: int = 60,
    weights: Optional[List[float]] = None,
    id_field: str = "id",
) -> List[Dict]:
    """Fuse multiple ranked lists using Reciprocal Rank Fusion.

    Args:
        ranked_lists: Iterable of ranked lists. Each list is sorted by
            relevance descending. Each item must have an `id` field
            (or whatever id_field specifies).
        k: RRF constant (default 60). Higher k softens the rank influence.
        weights: Optional per-method weights. Default: equal weighting.
        id_field: The field on each item that uniquely identifies a document.

    Returns:
        Unified ranked list sorted by descending RRF score, with
        `rrf_score` added.
    """
    ranked_lists = list(ranked_lists)
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    if len(weights) != len(ranked_lists):
        raise ValueError("weights must contain one entry per ranked list")

    # Accumulate scores per document
    scores = defaultdict(float)
    docs_by_id = {}
    for weight, ranked_list in zip(weights, ranked_lists):
        for rank, doc in enumerate(ranked_list, start=1):
            doc_id = doc[id_field]
            scores[doc_id] += weight * (1.0 / (k + rank))
            # Keep the first occurrence's full doc (preserves text, metadata)
            if doc_id not in docs_by_id:
                docs_by_id[doc_id] = doc

    # Build unified ranked list
    fused = []
    for doc_id, score in sorted(scores.items(), key=lambda x: -x[1]):
        doc = dict(docs_by_id[doc_id])  # copy so the input lists are not mutated
        doc["rrf_score"] = score
        fused.append(doc)
    return fused

# Example usage:
lexical_results = [
    {"id": "doc_42", "text": "...", "bm25": 12.4},
    {"id": "doc_88", "text": "...", "bm25": 11.1},
    {"id": "doc_15", "text": "...", "bm25": 9.8},
]
vector_results = [
    {"id": "doc_88", "text": "...", "cosine": 0.92},
    {"id": "doc_71", "text": "...", "cosine": 0.89},
    {"id": "doc_42", "text": "...", "cosine": 0.84},
]

# Equal-weighted fusion
fused = rrf_fuse([lexical_results, vector_results])
# Resulting order: doc_88, doc_42, doc_71, doc_15.
# doc_88 and doc_42 (both rank high in both methods) win over
# doc_71 (only appears in vector) and doc_15 (only appears in lexical).

Hybrid retrieval with Reciprocal Rank Fusion (RRF)