RelevantSearch.AI
Pattern · Volume 02 · Section F --- Query expansion patterns · Updated May 2026

Synonym management and query expansion strategies

Source: Lucene SynonymGraphFilter; Word2Vec (Mikolov et al., 2013); modern embedding-based expansion; LLM-based expansion patterns; Grainger on synonyms in AI-Powered Search

Classification — Methods for bridging vocabulary gaps between query terms and document terms.

Intent

Expand queries (or documents) with related terms so matches succeed despite vocabulary mismatch between user queries and document content, using a combination of manual, learned, and AI-generated synonym sources.

Motivating Problem

Vocabulary mismatch is the dominant failure mode in lexical search. Users say "sneakers"; documents say "running shoes". Users say "TV"; documents say "television". Users say "pain reliever"; documents say "analgesic". Pure lexical match misses all of these. Vector retrieval handles many cases via semantic similarity in embedding space, but lexical match remains foundational (Vol 1 Section A), and explicit synonyms still matter for cases the embeddings don't handle well — acronyms, brand variations, specific terminology.

How It Works

Manual synonym lists. Curated mappings between terms: "shoes, sneakers, footwear" as a synonym group. The lists are applied in the analyzer chain (Section A) at index time or query time. Index-time expansion (apply synonyms when indexing documents) increases index size but produces no query-time overhead; query-time expansion (apply synonyms when processing queries) keeps the index lean but adds query-time work. Both are valid. The main operational difference is that an index-time synonym change takes effect only after the affected documents are reindexed, while a query-time change takes effect as soon as the updated list is loaded.
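As a rough illustration of query-time expansion with synonym groups, the sketch below builds a lookup from curated groups and expands each query token into a list of alternatives. The group contents, the function names (build_lookup, expand_query), and the whitespace tokenizer are assumptions made for this example; a real deployment would do this inside the search engine's analyzer chain and would also need to handle multi-word synonyms.

```python
from collections import defaultdict

# Hypothetical curated groups; every term in a group is treated as equivalent.
SYNONYM_GROUPS = [
    ["shoes", "sneakers", "footwear"],
    ["tv", "television"],
]


def build_lookup(groups):
    """Map each term to the full set of terms in the group(s) it belongs to."""
    lookup = defaultdict(set)
    for group in groups:
        terms = {t.lower() for t in group}
        for term in terms:
            lookup[term] |= terms
    return lookup


def expand_query(query, lookup):
    """Return one list of alternatives per query token, original token first."""
    expanded = []
    for token in query.lower().split():
        alternatives = sorted(lookup.get(token, {token}) - {token})
        expanded.append([token] + alternatives)
    return expanded


lookup = build_lookup(SYNONYM_GROUPS)
print(expand_query("red sneakers", lookup))
# [['red'], ['sneakers', 'footwear', 'shoes']]
```

Each inner list would typically become an OR clause in the final query, so a document matching any alternative satisfies that token.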

Synonym list construction. Domain experts identify high-value synonym pairs based on knowledge of the domain. The lists range from dozens of entries (small workloads) to thousands (large e-commerce). Production teams typically maintain lists in version control with explicit review processes for changes, because a synonym change can affect every query and document that contains the term.

Bidirectional vs unidirectional synonyms. A bidirectional synonym ("shoes ↔ sneakers") means either term matches the other. A unidirectional synonym ("sneakers → shoes"), applied at query time, means queries for sneakers also retrieve shoe documents, but queries for shoes are not expanded to sneaker-specific terms. Unidirectional rules handle broader-narrower relationships: "sneakers → shoes" works (sneakers are shoes) but "shoes → sneakers" wouldn't (not all shoes are sneakers). The side on which the rule is applied matters: at index time, the same rule adds "shoes" to sneaker documents, so queries for shoes retrieve them while queries for sneakers are unchanged.
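The distinction between the two rule types can be sketched with a small rule parser. The rule syntax here ("a, b" for bidirectional, "a => b" for unidirectional) is loosely modeled on the Solr/Lucene synonym file format, but the parser itself, the name parse_rules, and the choice to keep the source term in its own expansion are assumptions of this example, not the behavior of any particular engine.

```python
def parse_rules(lines):
    """Parse synonym rules into a mapping: term -> set of terms it expands to.

    "a, b, c" : bidirectional, each term expands to all terms in the group.
    "a => b"  : unidirectional, a expands to b but b does not expand to a.
    """
    expansions = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
            for src in sources:
                # Assumption of this sketch: keep the original term as well.
                expansions.setdefault(src, {src}).update(targets)
        else:
            terms = [t.strip() for t in line.split(",") if t.strip()]
            for term in terms:
                expansions.setdefault(term, {term}).update(terms)
    return expansions


rules = parse_rules([
    "tv, television",
    "sneakers => shoes",
])
print(sorted(rules["sneakers"]))  # ['shoes', 'sneakers']
print("shoes" in rules)           # False: no expansion in the reverse direction
```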

Co-click expansion from logs. Mining production click logs for synonym pairs. Algorithm: for each query, find the set of documents that received clicks; for each pair of queries with overlapping click sets, compute the overlap fraction; queries with high overlap are candidate synonyms. The method captures actual user vocabulary at scale, including idiosyncratic terms (slang, regional variations) that manual curation might miss. Confidence thresholds matter: queries with only one or two shared clicks aren't reliable synonyms; the threshold should be tuned per workload.
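The co-click procedure above can be sketched as follows. The text does not fix a definition of "overlap fraction"; this example assumes Jaccard overlap (shared clicks divided by the union of clicks), and the thresholds min_shared and min_overlap are illustrative values that would need tuning per workload. The log format, a list of (query, clicked document id) pairs, is also an assumption.

```python
from collections import defaultdict
from itertools import combinations


def coclick_candidates(click_log, min_shared=3, min_overlap=0.5):
    """Return candidate synonym pairs as (query_a, query_b, overlap) tuples.

    click_log: iterable of (query, clicked_doc_id) pairs.
    """
    # Step 1: collect the set of clicked documents for each query.
    clicks = defaultdict(set)
    for query, doc_id in click_log:
        clicks[query].add(doc_id)

    # Step 2: compare every pair of queries (quadratic; fine for a sketch).
    candidates = []
    for q1, q2 in combinations(sorted(clicks), 2):
        shared = clicks[q1] & clicks[q2]
        if len(shared) < min_shared:
            continue  # too few shared clicks to be reliable
        overlap = len(shared) / len(clicks[q1] | clicks[q2])  # Jaccard overlap
        if overlap >= min_overlap:
            candidates.append((q1, q2, round(overlap, 3)))
    return sorted(candidates, key=lambda c: -c[2])


log = [
    ("sneakers", "d1"), ("sneakers", "d2"), ("sneakers", "d3"), ("sneakers", "d4"),
    ("running shoes", "d1"), ("running shoes", "d2"),
    ("running shoes", "d3"), ("running shoes", "d5"),
    ("socks", "d1"), ("socks", "d9"),
]
print(coclick_candidates(log))
# [('running shoes', 'sneakers', 0.6)]
```

The "socks" query shares only one click with the others, so the min_shared threshold drops it before any overlap is computed. At production scale the all-pairs loop would normally be replaced by an inverted index from documents to queries, so that only queries sharing at least one click are compared.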

Embedding-based expansion. Pre-trained embeddings (Word2Vec, GloVe, sentence-transformers) place semantically similar terms close in vector space. Find nearest neighbors of query terms in embedding space; consider them candidate synonyms. The method captures semantic similarity automatically but produces noise: "shoes" and "socks" are semantically related in many embedding spaces, but expanding a query for shoes to include socks is wrong. Filtering candidates by additional signals (co-occurrence in production data, manual review) improves precision.
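A minimal sketch of the nearest-neighbor step and the follow-up filter is shown below. The three-dimensional vectors are toy values invented for the example, not output from Word2Vec or any real model, and the approved set stands in for whatever secondary signal (co-occurrence data, manual review) a team actually uses. The function names and the min_sim threshold are likewise assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_terms(term, vectors, k=3, min_sim=0.8):
    """Return up to k (term, similarity) pairs closest to `term`."""
    query_vec = vectors[term]
    scored = [
        (other, cosine(query_vec, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scored = [(t, s) for t, s in scored if s >= min_sim]
    return sorted(scored, key=lambda pair: -pair[1])[:k]


def filter_candidates(candidates, approved):
    """Keep only candidates confirmed by a second signal."""
    return [term for term, _ in candidates if term in approved]


# Toy vectors, invented for illustration only.
TOY_VECTORS = {
    "shoes": [0.9, 0.1, 0.0],
    "sneakers": [0.85, 0.2, 0.05],
    "socks": [0.8, 0.3, 0.1],
    "television": [0.0, 0.1, 0.95],
}

neighbors = nearest_terms("shoes", TOY_VECTORS)
print([term for term, _ in neighbors])            # ['sneakers', 'socks']
print(filter_candidates(neighbors, {"sneakers"}))  # ['sneakers']
```

In the toy data "socks" passes the similarity threshold, which mirrors the noise problem described above; only the second filter removes it.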

LLM-generated expansion. Prompt an LLM with the query: "Generate 5 synonyms or closely-related terms for the query 'red running shoes' in an e-commerce context." The LLM produces context-aware expansions that earlier methods miss: multi-word phrases ("athletic footwear"), brand-specific terms, idiomatic equivalents. Production deployments precompute LLM expansions for high-volume queries (cache the results) and use real-time LLM calls only for unusual queries. The pattern handles edge cases that simpler methods struggle with.
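The precompute-then-fall-back pattern might look like the sketch below. The model call is abstracted as a plain callable (complete) so that no specific vendor API is implied; fake_complete is a stand-in that returns canned text. The class name LLMExpander, the one-term-per-line response format, and the parsing rules are all assumptions of this example.

```python
from typing import Callable, Dict, List, Optional

PROMPT_TEMPLATE = (
    "Generate {n} synonyms or closely-related terms for the query "
    "'{query}' in an e-commerce context. Return one term per line."
)


class LLMExpander:
    """Serve expansions from a cache, calling the model only on a miss."""

    def __init__(
        self,
        complete: Callable[[str], str],
        precomputed: Optional[Dict[str, List[str]]] = None,
        n: int = 5,
    ):
        self._complete = complete              # hypothetical model client
        self._cache = dict(precomputed or {})  # high-volume queries, precomputed
        self._n = n
        self.live_calls = 0                    # counts real-time model calls

    def expand(self, query: str) -> List[str]:
        key = query.strip().lower()
        if key in self._cache:
            return self._cache[key]
        self.live_calls += 1
        raw = self._complete(PROMPT_TEMPLATE.format(n=self._n, query=key))
        terms: List[str] = []
        for line in raw.splitlines():
            term = line.strip().lstrip("-* ").strip().lower()
            if term and term not in terms:  # drop blanks and duplicates
                terms.append(term)
        self._cache[key] = terms[: self._n]
        return self._cache[key]


def fake_complete(prompt: str) -> str:
    """Stand-in for a real model call; returns canned output."""
    return "athletic footwear\njogging shoes\n\nathletic footwear\n"


expander = LLMExpander(
    fake_complete,
    precomputed={"red running shoes": ["red athletic footwear", "red jogging shoes"]},
)
print(expander.expand("Red Running Shoes"))  # served from the precomputed cache
print(expander.expand("trail shoes"))        # falls through to the model call
```

Model output is untrusted text, so a production version would likely validate the returned terms (length limits, profanity and brand checks) before they reach the query.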

Combining sources. Production deployments combine: manual synonyms for the high-value queries (the ones that drive substantial business value, where exact control matters); co-click expansion at scale (cheap, automatic, captures real usage); embedding expansion for cold-start cases (when click logs aren't available); LLM expansion for conversational queries where context matters. The combination produces broader coverage than any single method. Production teams typically version manual and learned synonym sets separately, so changes to each source can be tracked and reverted independently.
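One way to combine the sources is to merge their suggestions in a fixed priority order while recording which source contributed each term. The priority order below (manual first, LLM last), the cap on the number of terms, and the data layout are assumptions chosen for illustration; the text does not prescribe a specific merge policy.

```python
# Assumed priority: sources with tighter human control come first.
SOURCE_PRIORITY = ["manual", "coclick", "embedding", "llm"]


def combine_expansions(query, sources, max_terms=5):
    """Merge expansions from several sources into (term, source) pairs.

    sources: dict mapping source name -> dict mapping query -> list of terms.
    A term suggested by several sources is credited to the highest-priority one.
    """
    merged = []
    seen = {query}  # never suggest the query as its own synonym
    for name in SOURCE_PRIORITY:
        for term in sources.get(name, {}).get(query, []):
            if term not in seen:
                seen.add(term)
                merged.append((term, name))
    return merged[:max_terms]


sources = {
    "manual": {"sneakers": ["trainers"]},
    "coclick": {"sneakers": ["running shoes", "trainers"]},
    "llm": {"sneakers": ["athletic footwear", "sneakers"]},
}
print(combine_expansions("sneakers", sources))
# [('trainers', 'manual'), ('running shoes', 'coclick'), ('athletic footwear', 'llm')]
```

Keeping the source label on each term makes it straightforward to disable or roll back one source without touching the others.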

Evaluation. Synonym additions change retrieval behavior; they need evaluation. Methods from Volume 5: maintain golden query sets that exercise synonym-dependent queries; track precision and recall before/after synonym changes; A/B test substantial changes. Production teams without synonym evaluation typically accumulate synonym rules that individually seemed helpful but collectively degrade quality.
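A before/after comparison on a golden query set can be sketched as below. The data shapes (a dict of relevant document ids per query, and a dict of retrieved ids per query for each run) and the function names are assumptions of this example; set-based precision and recall are used here for simplicity, whereas a real evaluation would often use ranked metrics as well.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def compare_runs(golden, before, after):
    """Report per-query precision and recall deltas for a synonym change."""
    report = {}
    for query, relevant in golden.items():
        p0, r0 = precision_recall(before.get(query, []), relevant)
        p1, r1 = precision_recall(after.get(query, []), relevant)
        report[query] = {
            "precision_delta": round(p1 - p0, 3),
            "recall_delta": round(r1 - r0, 3),
        }
    return report


golden = {"sneakers": {"d1", "d2", "d3"}}
before = {"sneakers": ["d1"]}                    # results without the new synonym
after = {"sneakers": ["d1", "d2", "d7", "d8"]}   # results with the new synonym
print(compare_runs(golden, before, after))
# {'sneakers': {'precision_delta': -0.5, 'recall_delta': 0.333}}
```

In this toy case the synonym raises recall but halves precision, which is the kind of trade-off that goes unnoticed when rules are added without measurement.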

When to Use It

Almost every production search system uses some form of query expansion. E-commerce benefits especially from manual synonyms for the high-value queries plus co-click expansion at scale. Domain-specific search (legal, medical, technical) benefits substantially where standard terminology and user vocabulary differ. Multi-lingual search benefits where translations are treated as synonyms to handle cross-lingual matching.

Alternatives — pure vector retrieval that handles semantic similarity implicitly (Volume 1 Section B). Hybrid retrieval that combines lexical with semantic. Most production systems use synonyms even in hybrid setups because synonyms remain valuable for specific cases that pure vector retrieval handles imperfectly.

Sources
  • Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (Word2Vec, 2013)
  • Lucene SynonymGraphFilter documentation
  • Elasticsearch / OpenSearch synonym documentation
  • Grainger, AI-Powered Search, chapters on synonyms and expansion
