Source: Robertson, Walker, Jones et al., "Okapi at TREC-3" (1995); the foundational lexical scoring function in Elasticsearch, OpenSearch, Solr, and most lexical search engines

Classification — The dominant lexical retrieval pattern — token-based scoring with term frequency and inverse document frequency adjustments.

Intent

Retrieve documents based on lexical overlap between query and document, with scoring that accounts for term frequency saturation and document length normalization in a way that produces stable, predictable rankings across heterogeneous corpora.

Motivating Problem

Pure term-frequency scoring overweights documents that repeat query terms; pure presence-or-absence scoring loses ranking signal. Documents of different lengths get unfair advantages or disadvantages. The simplest scoring functions (TF-IDF) handle the basic case but produce known artifacts: long documents over-rank because they accumulate more term occurrences; documents that mention a query term many times rank artificially high. BM25 addresses these issues with two parameters — k1 (term frequency saturation) and b (length normalization) — that make the scoring stable and tunable across corpora.

How It Works

The scoring function. For each query term in each candidate document, BM25 computes a score component that combines three factors: term frequency in the document (saturating, so the 10th occurrence adds less score than the 2nd), inverse document frequency (rare terms get higher weight than common ones), and document length normalization (a given term count is worth less in a document longer than the corpus average and more in a shorter one). Length normalization is applied inside the term frequency factor, which is then multiplied by the inverse document frequency; per-term scores sum across all query terms.
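The description above corresponds to the following commonly cited form of the function. This is a sketch in one conventional notation, not the exact expression any particular engine evaluates: the symbols are assumptions introduced here, with f(t, D) the frequency of term t in document D, |D| the document length in tokens, avgdl the average document length in the corpus, N the number of documents, and n_t the number of documents containing t. Implementations differ in the exact IDF form and in constant factors.

```latex
\[
\operatorname{score}(D, Q) = \sum_{t \in Q} \operatorname{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]
\[
\operatorname{IDF}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
\]
```

The fraction is the saturating, length-normalized term frequency factor: as f(t, D) grows it approaches k_1 + 1 and never exceeds it.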

Parameter tuning. The k1 parameter (typical range 1.2–2.0, default 1.2) controls term frequency saturation: lower k1 means term frequency saturates faster (additional occurrences add less), higher k1 means each additional occurrence keeps adding noticeable score for longer; at k1=0 only the presence of the term matters. The b parameter (typical range 0.5–1.0, default 0.75) controls length normalization: b=1 fully normalizes for length, b=0 ignores length entirely. Defaults work well for many corpora; tuning per corpus can improve quality measurably but requires evaluation infrastructure.
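The effect of the two parameters can be seen by isolating the term frequency factor. The following is a minimal sketch, assuming the commonly cited form of that factor; the function name `tf_component` and the example lengths are hypothetical and chosen only for illustration.

```python
def tf_component(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Saturating, length-normalized term frequency factor (illustrative sketch)."""
    # b blends between no length normalization (b=0) and full normalization (b=1).
    norm = 1.0 - b + b * (doc_len / avg_len)
    # The factor grows with tf but is bounded above by k1 + 1.
    return tf * (k1 + 1.0) / (tf + k1 * norm)


if __name__ == "__main__":
    # Compare how quickly the factor flattens for a low and a high k1,
    # using a document of exactly average length.
    for k1 in (0.5, 1.2, 2.0):
        row = [round(tf_component(tf, 100, 100, k1=k1), 3) for tf in (1, 2, 5, 10)]
        print(f"k1={k1}: {row}")

    # Compare a short and a long document with the same term count
    # under no, default, and full length normalization.
    for b in (0.0, 0.75, 1.0):
        short = tf_component(3, 50, 100, b=b)
        long_ = tf_component(3, 400, 100, b=b)
        print(f"b={b}: short={short:.3f} long={long_:.3f}")
```

With b=0 the short and long documents receive the same factor; as b approaches 1 the gap between them widens.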

Implementation. BM25 sits on top of an inverted index: a data structure that maps each term to the list of documents containing it (a postings list). Query processing iterates over the query terms, retrieves the postings lists, and accumulates BM25 scores for each candidate document. Top-K candidates are returned. Modern implementations (Lucene, the underlying library of Elasticsearch, OpenSearch, and Solr) optimize this with skipping, early termination, and SIMD instructions.
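The query-processing loop described above can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: whitespace tokenization, a dictionary-based postings structure, and the function names `build_index` and `search` are all hypothetical. It omits the skipping and early-termination optimizations that production engines rely on.

```python
import heapq
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a toy inverted index from a dict of doc_id -> text."""
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    doc_lens = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # assumption: naive whitespace tokenizer
        doc_lens[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    avg_len = sum(doc_lens.values()) / len(doc_lens)
    return postings, doc_lens, avg_len


def search(query, postings, doc_lens, avg_len, k=10, k1=1.2, b=0.75):
    """Term-at-a-time BM25 scoring; returns up to k (doc_id, score) pairs."""
    n_docs = len(doc_lens)
    scores = defaultdict(float)
    for term in query.lower().split():
        plist = postings.get(term, {})
        if not plist:
            continue  # term absent from the corpus contributes nothing
        idf = math.log(1.0 + (n_docs - len(plist) + 0.5) / (len(plist) + 0.5))
        for doc_id, tf in plist.items():
            norm = 1.0 - b + b * doc_lens[doc_id] / avg_len
            scores[doc_id] += idf * tf * (k1 + 1.0) / (tf + k1 * norm)
    # Keep only the k highest-scoring candidates, best first.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    docs = {
        "d1": "red running shoes",
        "d2": "running shoes for trail running",
        "d3": "leather dress shoes",
        "d4": "wool socks",
    }
    postings, doc_lens, avg_len = build_index(docs)
    for doc_id, score in search("running shoes", postings, doc_lens, avg_len, k=3):
        print(doc_id, round(score, 3))
```

Only documents that appear in at least one postings list are ever scored, which is why the inverted index keeps query cost proportional to the postings touched rather than to corpus size.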

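Variants. BM25F extends BM25 to multi-field documents, allowing per-field weighting (e.g., title weighted higher than body). BM25+ addresses a known weakness with very long documents: it adds a lower bound to the term frequency factor, so a long document that contains a query term is not scored almost the same as one that lacks it. BM25L addresses the same over-penalization of long documents differently, by shifting the length-normalized term frequency before saturation is applied. These variants are used in specific cases; vanilla BM25 with reasonable parameter tuning handles most production needs.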
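The difference between the long-document variants shows up in the term frequency factor alone. The sketch below follows the formulations usually attributed to Lv and Zhai (2011); the function names are hypothetical, and the delta defaults (1.0 for BM25+, 0.5 for BM25L) are commonly cited values that should be treated as assumptions rather than as settings from any particular engine.

```python
def tf_bm25(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Standard BM25 term frequency factor."""
    norm = 1.0 - b + b * doc_len / avg_len
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def tf_bm25_plus(tf, doc_len, avg_len, k1=1.2, b=0.75, delta=1.0):
    """BM25+ sketch: add a constant so any match scores at least delta."""
    if tf == 0:
        return 0.0
    return tf_bm25(tf, doc_len, avg_len, k1, b) + delta


def tf_bm25l(tf, doc_len, avg_len, k1=1.2, b=0.75, delta=0.5):
    """BM25L sketch: shift the length-normalized tf before saturating it."""
    if tf == 0:
        return 0.0
    c = tf / (1.0 - b + b * doc_len / avg_len)
    return (k1 + 1.0) * (c + delta) / (k1 + c + delta)


if __name__ == "__main__":
    # One occurrence of a term in a document 1000x longer than average.
    args = (1, 100_000, 100)
    print("BM25 :", tf_bm25(*args))
    print("BM25+:", tf_bm25_plus(*args))
    print("BM25L:", tf_bm25l(*args))
```

In standard BM25 the factor for a single match in an extremely long document falls toward zero, which makes the document nearly indistinguishable from a non-matching one; both variants keep it bounded away from zero.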

Strengths and limits. BM25 excels at literal matches, rare terms, and identifier-heavy queries. It produces explainable matches (which terms contributed how much). With sharding, it can scale to billions of documents while keeping query latency under 100 ms on commodity hardware, though the actual figure depends on query shape and index design. It fails at semantic matching (synonyms, paraphrases, conceptual queries) without separate term-expansion or synonym infrastructure. The hybrid era (Chapter 3) addresses these limits by combining BM25 with semantic retrieval rather than replacing BM25.

When to Use It

As at least one retrieval path in almost every production search system. Cold-start systems without labeled training data. Systems with identifier-heavy queries (e-commerce product SKUs, technical documentation with error codes, legal documents with citation patterns). One path within a hybrid retrieval architecture.

Alternatives — dense vector retrieval (Section B) when semantic matching matters and lexical patterns are insufficient. Hybrid retrieval (Section C) for the common case where both semantic and lexical signals matter. Pure vector retrieval is rarely the right answer alone for most production systems; BM25 typically remains one path in the architecture.

Sources

Robertson and Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond" (2009)
Manning, Raghavan, Schütze, Introduction to Information Retrieval (free online, ch. 11)
Trey Grainger, AI-Powered Search (2024) chapters on lexical foundations
Elasticsearch / OpenSearch / Solr documentation for production implementations

Example artifacts

Code

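// Elasticsearch / OpenSearch BM25 query (REST API)
// The default "similarity" for text fields is BM25; this is a
// typical multi-field match query
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "running shoes",
      "fields": [
        "title^3",      // Boost title 3x (per-field weighting)
        "description",
        "brand^2"
      ],
      // best_fields scores each field separately and keeps the best one.
      // This is a boost, not true BM25F; in Elasticsearch the separate
      // combined_fields query is the BM25F-based option.
      "type": "best_fields"
    }
  },
  "size": 100
}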

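// To customize BM25 parameters on an index.
// Similarity settings are static: define them at index creation, or close
// the index (POST /products/_close) before this call and reopen it after.
// A field uses the custom similarity only if its mapping sets
// "similarity": "custom_bm25".
PUT /products/_settings
{
  "index": {
    "similarity": {
      "custom_bm25": {
        "type": "BM25",
        "k1": 1.5,  // term frequency saturation (default 1.2)
        "b": 0.7    // length normalization (default 0.75)
      }
    }
  }
}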

// Solr equivalent (in schema.xml or schema-managed config)
// <similarity class="solr.BM25SimilarityFactory">
// <float name="k1">1.5</float>
// <float name="b">0.7</float>
// </similarity>

BM25 retrieval