Source: Information retrieval foundations; supported natively in Lucene-based engines and Coveo with positional index data
Classification — Pattern for matching multi-term queries as phrases or with position constraints, producing higher precision than bag-of-words BM25.
Boost or restrict matches based on the proximity and ordering of query terms within documents, capturing phrase semantics that bag-of-words scoring loses.
BM25 treats queries as bags of words: "red shoes" and "shoes red" score every document identically. Many queries have phrase semantics that should affect ranking: "machine learning" as a phrase is much more specific than "machine" and "learning" separately; "New York" should preferentially match documents about the city rather than documents containing both words in unrelated contexts. Phrase and proximity matching capture these semantics with position-aware scoring.
Phrase queries. Strict phrase matching requires the query terms to appear consecutively in the document. Lucene-based engines support this through positional information in the inverted index: each posting includes the position(s) of the term within the document. The query processor verifies that candidate documents contain the query terms in the specified order at consecutive positions. Phrase queries can be combined with other queries (e.g., must match phrase, should boost on individual terms).
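The verification step described above can be sketched with a toy positional index. This is a minimal illustration, not how Lucene stores or traverses postings: the whitespace tokenizer, the `build_positional_index` and `phrase_match` helpers, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} using naive whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, phrase):
    """Return sorted doc ids where the phrase terms occur consecutively, in order."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # Candidate documents must contain every query term.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    hits = []
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in terms[1:]]
        # A start position p matches if term i+1 appears at p + i + 1 for every i.
        if any(all(p + i + 1 in s for i, s in enumerate(later))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical sample documents.
docs = {
    1: "new york times reports",
    2: "a new bridge opened in york",
    3: "york is new to me",
}
index = build_positional_index(docs)
```

With these sample documents, `phrase_match(index, "new york")` returns `[1]`: documents 2 and 3 contain both terms, but not at consecutive positions in that order.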
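Proximity queries. Relaxed phrase matching allows the query terms to appear within a configurable distance of each other ("slop" in Lucene terminology). A slop of 0 is strict phrase matching; for a two-term phrase, a slop of 5 allows up to 5 intervening positions; a slop of 50 matches the terms almost anywhere within a short field but still excludes occurrences that are far apart in a long document. In Lucene, slop counts the position moves needed to line the terms up, so a large enough slop also matches terms out of order (swapping two adjacent terms costs 2). Proximity scoring typically boosts matches with smaller distances over matches with larger distances.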
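As a rough illustration of slop and distance-based boosting, the sketch below handles only the two-term, in-order case and measures distance as the number of intervening positions. Both the `1 / (1 + gap)` decay and the helper names are assumptions for the example; an engine's actual slop semantics and scoring formula may differ.

```python
def positions(text, term):
    """Token positions of `term` in `text` under naive whitespace tokenization."""
    return [i for i, tok in enumerate(text.lower().split()) if tok == term]

def min_gap(positions_a, positions_b):
    """Smallest number of intervening positions with a before b, or None."""
    gaps = [pb - pa - 1 for pa in positions_a for pb in positions_b if pb > pa]
    return min(gaps) if gaps else None

def proximity_score(positions_a, positions_b, slop):
    """Return 0.0 if no in-order pair fits within `slop`, else 1 / (1 + gap)."""
    gap = min_gap(positions_a, positions_b)
    if gap is None or gap > slop:
        return 0.0
    # Assumed decay: adjacent terms score 1.0, wider gaps score less.
    return 1.0 / (1 + gap)

doc = "machine translation and deep learning"
a, b = positions(doc, "machine"), positions(doc, "learning")
```

Here "machine" and "learning" are separated by three tokens, so `proximity_score(a, b, 0)` is `0.0` (no strict phrase match) while `proximity_score(a, b, 5)` is `0.25`.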
Field-specific behavior. Position information is typically maintained per-field; phrase queries operate within a single field. "New York" as a phrase matches title:"New York Times" but not separate occurrences of "New" in title and "York" in body. Production search typically combines per-field phrase matching with cross-field bag-of-words matching for balanced precision and recall.
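The field boundary can be illustrated with a small sketch that keeps tokens per field. The document shape (a dict of field name to text) and both helper functions are hypothetical, chosen only to contrast a per-field phrase check with a cross-field bag-of-words check.

```python
def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()

def field_has_phrase(doc, field, phrase):
    """True if the phrase occurs consecutively within a single field."""
    tokens = tokenize(doc.get(field, ""))
    terms = tokenize(phrase)
    n = len(terms)
    return n > 0 and any(tokens[i:i + n] == terms
                         for i in range(len(tokens) - n + 1))

def any_field_has_all_terms(doc, phrase):
    """Cross-field bag-of-words check: every term appears in some field."""
    bag = {tok for text in doc.values() for tok in tokenize(text)}
    return all(term in bag for term in tokenize(phrase))

# Hypothetical documents with separate title and body fields.
doc_a = {"title": "New York Times", "body": "daily newspaper"}
doc_b = {"title": "New arrivals", "body": "shipped from York"}
```

`doc_a` matches the phrase in its title. `doc_b` fails the phrase check in both fields but still passes the cross-field bag-of-words check, which is the kind of match a combined query would retrieve at a lower score.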
Span queries. These express position constraints beyond simple phrase or proximity matching, for example "term A within N positions of term B, in either order, both within field C." Span queries support precise positional logic for specific use cases (legal citation matching, technical documentation, structured data extraction). The cost is query complexity and computational overhead, so production use is selective.
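The example constraint above ("A within N positions of B, in either order, within field C") could be checked along these lines. This is a simplified sketch in the spirit of a span-near query, not Lucene's span implementation; `span_near`, its parameters, and the sample document are assumptions.

```python
def span_near(doc, field, term_a, term_b, max_gap, in_order=False):
    """True if term_a and term_b occur in `field` with at most `max_gap`
    tokens between them; with in_order=True, term_a must come first."""
    tokens = doc.get(field, "").lower().split()
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    for pa in pos_a:
        for pb in pos_b:
            if pa == pb:
                continue  # same token cannot satisfy both terms
            if in_order and pb < pa:
                continue
            if abs(pb - pa) - 1 <= max_gap:
                return True
    return False

# Hypothetical document.
doc = {
    "title": "act summary",
    "body": "see section 12 of the act and section 4 of the code",
}
```

For this document, `span_near(doc, "body", "section", "act", 3)` is true, while the same constraint against the `title` field is false because "section" never appears there.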
Boost-not-filter pattern. Phrase and proximity matches are typically used as boost signals rather than as hard filters. A query like "red running shoes" might use phrase matching to boost documents where "running shoes" appears as a phrase while still retrieving documents where the terms appear separately. The pattern preserves recall while improving precision; pure phrase-required queries often produce too few results.
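One way to express this pattern in Elasticsearch / OpenSearch query DSL is a `bool` query whose `must` clause is a plain `match` and whose `should` clause is a `match_phrase`. The sketch below builds that request body as a Python dict; the field name `description`, the boost value, and the `boosted_phrase_query` helper are assumptions for illustration.

```python
def boosted_phrase_query(field, text, phrase, slop=0, boost=2.0):
    """Build a query body: retrieve on individual terms, boost on the phrase."""
    return {
        "query": {
            "bool": {
                # Retrieval: documents only need to match the terms.
                "must": [{"match": {field: {"query": text}}}],
                # Ranking: documents containing the phrase score higher.
                "should": [{
                    "match_phrase": {
                        field: {"query": phrase, "slop": slop, "boost": boost}
                    }
                }],
            }
        }
    }

# Hypothetical field name and query text.
query = boosted_phrase_query("description", "red running shoes", "running shoes")
```

Because the phrase clause sits under `should`, a document without "running shoes" as a phrase is still returned; it simply receives no phrase boost.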
Applies to queries with significant phrase semantics: product names ("iPhone 15 Pro"), proper nouns ("New York Times"), technical terms ("machine learning"), location names ("San Francisco"). It also fits e-commerce search where the difference between bag-of-words and phrase matches noticeably affects relevance, and domain-specific search (legal, medical, technical) where multi-word terms carry specific meaning.
Alternatives — dense vector retrieval (Section B) handles phrase semantics implicitly through embedding similarity; it's often used alongside lexical phrase matching in hybrid architectures. Query rewriting (future Query Understanding Catalog) can convert known multi-word terms into atomic tokens that BM25 then handles correctly.
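The query-rewriting alternative can be sketched as a greedy longest-match pass that joins known multi-word terms into single tokens. The phrase list, the underscore joining convention, and `merge_known_phrases` are hypothetical; the same rewrite would also have to run at index time for the merged tokens to match anything.

```python
def merge_known_phrases(tokens, phrases):
    """Greedily replace known multi-word terms with underscore-joined tokens."""
    known = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in known), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so longer terms win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in known:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical dictionary of known multi-word terms.
known_phrases = ["New York", "New York Times", "machine learning"]
rewritten = merge_known_phrases("cheap flights to new york".split(), known_phrases)
```

Here `rewritten` is `["cheap", "flights", "to", "new_york"]`, so a bag-of-words scorer treats "new_york" as one term.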
- Lucene documentation on PhraseQuery and SpanQuery
- Elasticsearch / OpenSearch "match_phrase" and "match_phrase_prefix" query documentation
- Coveo phrase boost documentation