RelevantSearch.AI
Pattern · Volume 02 · Section A --- Tokenization and normalization patterns · Updated May 2026

The Lucene-style analyzer chain

Source: Apache Lucene; Elasticsearch, OpenSearch, Solr analyzer documentation; Manning et al., Introduction to Information Retrieval

Classification — The structural pattern for production tokenization: CharFilter → Tokenizer → TokenFilter chain that runs at both index time and query time.

Intent

Process query and document text into matchable tokens using a configurable chain of character-level filters, tokenization, and per-token transformations, with the same chain applied at index time and query time to ensure consistent matching.

Motivating Problem

Default analyzers handle the basic case (split on whitespace and punctuation; lowercase) but miss much of what makes production search work well. Language-specific tokenization, domain-specific token handling (SKUs, identifiers, mixed scripts), stemming choices, stop word handling, and ASCII folding for international content all require explicit configuration. The cost of getting it wrong is silent misses: documents that should match a query don't, because the analyzer chain produced different tokens for the query than for the document. Inconsistent chains (different processing at index time than at query time) cause the same failure mode at larger scale.

How It Works

CharFilter stage. Operates on the raw character stream before tokens exist. Used for: HTML stripping (remove HTML tags from indexed content); pattern replacement (regex-based character substitutions); character mapping (map specific characters to others, like normalizing curly quotes to straight quotes). The stage is rarely the bottleneck but matters when the input has structured markup or character variations the downstream stages can't handle.
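To make the three character-level operations concrete, here is a minimal Python sketch. It is not Lucene's implementation: the function names are illustrative, the regex-based tag stripping is a crude stand-in for a real HTML-stripping char filter, and the SKU-joining rule is a hypothetical example of a pattern replacement.

```python
import html
import re

def html_strip(text: str) -> str:
    """Drop tags, then decode entities such as &amp; (crude stand-in for an HTML strip filter)."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def mapping(text: str, table: dict) -> str:
    """Map specific single characters to others, e.g. curly quotes to straight quotes."""
    return text.translate(str.maketrans(table))

def pattern_replace(text: str, pattern: str, replacement: str) -> str:
    """Regex-based substitution on the raw character stream."""
    return re.sub(pattern, replacement, text)

def apply_char_filters(text: str) -> str:
    # Char filters run in sequence, before any tokens exist.
    text = html_strip(text)
    text = mapping(text, {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
    # Hypothetical rule: join SKU parts like "AB-1234" so a later tokenizer keeps them whole.
    text = pattern_replace(text, r"\b([A-Z]{2})-(\d{4})\b", r"\1\2")
    return text

raw = "<p>The \u201cPro\u201d model, SKU AB-1234 &amp; case</p>"
cleaned = apply_char_filters(raw)
print(cleaned.strip())  # The "Pro" model, SKU AB1234 & case
```

The output is still a single string; splitting it into tokens is the next stage's job.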

Tokenizer stage. Splits the character stream into tokens. Lucene provides many built-in tokenizers: standard (Unicode word-boundary segmentation, which in practice splits on whitespace and most punctuation; good for Western languages); keyword (no splitting; the entire input is one token); n-gram (produces overlapping character n-grams, useful for substring matching); edge n-gram (n-grams anchored to word starts, useful for autocomplete); language-specific tokenizers for CJK (Chinese/Japanese/Korean) text, where whitespace does not reliably delimit words; ICU tokenizer for sophisticated multi-language handling. The choice depends on the language and the matching behavior wanted.
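The differences between these tokenizer families are easiest to see side by side. The following Python sketch uses toy stand-ins, not Lucene's tokenizers: `standard_like` simply extracts runs of word characters (so it splits "don't" where a real standard tokenizer would not), and all function names are illustrative.

```python
import re

def standard_like(text):
    """Rough stand-in for a standard tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)

def keyword(text):
    """No splitting: the entire input is one token."""
    return [text]

def ngrams(text, min_n, max_n):
    """Overlapping character n-grams, grouped by start position."""
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_n, max_n + 1)
        if i + n <= len(text)
    ]

def edge_ngrams(word, min_n, max_n):
    """N-grams anchored to the start of the word, as used for autocomplete."""
    return [word[:n] for n in range(min_n, min(max_n, len(word)) + 1)]

print(standard_like("Running shoes, size 10!"))  # ['Running', 'shoes', 'size', '10']
print(keyword("AB-1234 X"))                      # ['AB-1234 X']
print(ngrams("shoe", 2, 3))                      # ['sh', 'sho', 'ho', 'hoe', 'oe']
print(edge_ngrams("shoe", 1, 3))                 # ['s', 'sh', 'sho']
```

N-gram tokenizers multiply the number of indexed terms, so they are usually reserved for the fields that need substring or prefix matching.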

TokenFilter chain. Token-level transformations applied in sequence. Common filters: lowercase (case normalization); ASCII folding (remove diacritics: café → cafe); stop word removal (filter out "the", "and", etc.); stemming (Porter for English, Snowball for multi-language; reduces morphological variants to a common stem); synonym expansion (add synonyms inline, for example with a SynonymGraphFilter, which Elasticsearch documents as intended for search-time analyzers); n-gram (produce character n-grams for substring matching); shingle (produce token n-grams for phrase-like matching). Order matters: lowercase before stemming (stemmers expect lowercase input); ASCII fold before stemming for non-English languages.
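A small Python sketch can show why filter order matters. Everything here is a simplification: the stop list is a tiny hypothetical set, and `toy_stem` is a one-rule suffix stripper, not Porter or Snowball.

```python
import unicodedata

STOP_WORDS = {"the", "and", "for", "a", "of"}  # tiny illustrative list

def lowercase(tokens):
    return [t.lower() for t in tokens]

def ascii_fold(tokens):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded

def remove_stop_words(tokens):
    # Case-sensitive on purpose: this filter expects lowercased input.
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(tokens):
    """Toy stemmer (NOT Porter/Snowball): strip one trailing 's' from longer words."""
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]

def run_chain(tokens, filters):
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

tokens = ["The", "Cafés", "and", "Shoes"]
good = run_chain(tokens, [lowercase, ascii_fold, remove_stop_words, toy_stem])
bad = run_chain(tokens, [remove_stop_words, lowercase, ascii_fold, toy_stem])
print(good)  # ['cafe', 'shoe']
print(bad)   # ['the', 'cafe', 'shoe']  ("The" slipped past the stop filter)
```

In the second chain the stop filter runs before lowercasing, so the capitalized "The" survives and ends up in the token stream.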

Index-time vs query-time analyzers. The most common deployment uses the same analyzer at both times: the document's tokens are produced by the same chain that processes queries against it. Mismatches cause silent misses: documents indexed with stemming match queries processed without stemming only when the query already contains the stemmed form. Some patterns deliberately use different chains: less aggressive query-time analysis (the user's exact query terms are preserved) with more aggressive index-time analysis (multiple synonym expansions baked into the index). In those cases the asymmetry is intentional and documented; accidental asymmetry is a bug.
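The mismatch failure can be reproduced with a toy inverted index. This Python sketch is an illustration only: the analyzers split on whitespace, the stemmer is a one-rule toy rather than a real stemmer, and the two documents are made up.

```python
from collections import defaultdict

def toy_stem(token):
    """Toy stemmer (not a real algorithm): strip one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def stemming_analyzer(text):
    return [toy_stem(t) for t in text.lower().split()]

def unstemmed_analyzer(text):
    return text.lower().split()

def build_index(docs, analyzer):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index

def search(index, query, analyzer):
    """Return ids of documents that contain every analyzed query token."""
    postings = [index.get(token, set()) for token in analyzer(query)]
    return set.intersection(*postings) if postings else set()

docs = {1: "Trail running shoes", 2: "Leather dress shoe"}
index = build_index(docs, stemming_analyzer)

same_chain = search(index, "shoes", stemming_analyzer)         # {1, 2}
mismatched = search(index, "shoes", unstemmed_analyzer)        # set(): silent miss
already_stemmed = search(index, "shoe", unstemmed_analyzer)    # {1, 2}
print(same_chain, mismatched, already_stemmed)
```

The mismatched query returns nothing and raises no error, which is why this class of bug tends to surface only through relevance testing.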

Per-field analyzers. Different fields may need different analyzers. A product title gets standard tokenization with stemming; a product SKU field gets keyword tokenization (no splitting, exact match); a description field gets stemming and stop word removal; a brand field gets keyword tokenization with case normalization. The per-field approach lets each field's matching behavior be tuned independently; production teams often have 5–10 distinct analyzer configurations in a mature schema.
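As a rough illustration of per-field analysis, the Python sketch below routes each field of a document to its own analyzer function. The analyzers are toy stand-ins (regex tokenization, a one-rule stemmer, a tiny stop list), and the field names and sample document are hypothetical.

```python
import re

STOP_WORDS = {"the", "and", "for", "a", "of"}  # tiny illustrative list

def toy_stem(token):
    """Toy stemmer (not Porter/Snowball): strip one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def standard_stemmed(text):
    """Title-style analysis: word tokens, lowercased, stemmed."""
    return [toy_stem(t) for t in re.findall(r"\w+", text.lower())]

def stemmed_no_stops(text):
    """Description-style analysis: as above, with stop words removed before stemming."""
    return [toy_stem(t) for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]

def keyword_exact(text):
    """SKU-style analysis: one token, unchanged."""
    return [text]

def keyword_lowercase(text):
    """Brand-style analysis: one token, case-normalized."""
    return [text.lower()]

FIELD_ANALYZERS = {
    "title": standard_stemmed,
    "description": stemmed_no_stops,
    "sku": keyword_exact,
    "brand": keyword_lowercase,
}

def analyze_document(doc):
    """Apply each field's own analyzer; unknown fields raise KeyError by design."""
    return {field: FIELD_ANALYZERS[field](value) for field, value in doc.items()}

doc = {
    "title": "Trail Running Shoes",
    "description": "The best shoes for the trail",
    "sku": "AB-1234",
    "brand": "Acme Outdoor",
}
analyzed = analyze_document(doc)
print(analyzed)
```

Keeping the field-to-analyzer mapping in one place makes it easier to review which matching behavior each field is supposed to have.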

Multilingual content. Mixed-language corpora need careful analyzer design. Options: detect language at index time and apply language-specific analyzers per document (works when language is identifiable); use ICU tokenizer with multi-language token filters (works for many cases but loses some language-specific behavior); use the same analyzer for all languages and accept reduced quality on non-dominant ones (simple but limiting). The best choice depends on the language distribution and the importance of each language to the workload.
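The first option, per-document language dispatch, can be sketched in a few lines of Python. This assumes each document already carries a `lang` field set by an upstream detector (detection itself is out of scope here); the stop lists are tiny illustrative samples and the fallback analyzer is a hypothetical default for unrecognized languages.

```python
STOP_WORDS = {
    "en": {"the", "and", "for"},
    "de": {"der", "die", "das", "und"},
}

def make_analyzer(lang):
    """Build a toy language-specific analyzer: lowercase, split, drop that language's stop words."""
    stops = STOP_WORDS[lang]
    def analyze(text):
        return [t for t in text.lower().split() if t not in stops]
    return analyze

ANALYZERS = {lang: make_analyzer(lang) for lang in STOP_WORDS}

def fallback_analyzer(text):
    """Language-neutral default: lowercase and split only."""
    return text.lower().split()

def analyze_document(doc):
    # Assumption: doc["lang"] was filled in by a language detector at index time.
    analyzer = ANALYZERS.get(doc.get("lang"), fallback_analyzer)
    return analyzer(doc["text"])

german = analyze_document({"lang": "de", "text": "Die Schuhe und der Mantel"})
english = analyze_document({"lang": "en", "text": "The shoes and the coat"})
unknown = analyze_document({"lang": "fr", "text": "Les chaussures"})
print(german)   # ['schuhe', 'mantel']
print(english)  # ['shoes', 'coat']
print(unknown)  # ['les', 'chaussures']
```

Queries need the same dispatch decision as documents, otherwise the index-time and query-time chains diverge per language.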

When to Use It

Every production search system has an analyzer chain whether the team configured it deliberately or not. The pattern applies universally; the question is whether the configuration was explicit and validated or accepted as default. Teams that have not validated their analyzer configuration typically carry unmeasured gaps in their search quality.

Alternatives: keyword-only matching (no tokenization or normalization) suits specific fields where exact match is required. Pure vector matching (Volume 1 Section B) bypasses the analyzer chain entirely; some production systems use vector matching as primary retrieval with lexical matching as fallback. The analyzer chain remains foundational for the lexical portion of any hybrid system.

Sources
  • Apache Lucene documentation (lucene.apache.org)
  • Elasticsearch analyzer documentation
  • OpenSearch analyzer documentation
  • Solr analyzer documentation
  • Manning et al., Introduction to Information Retrieval, ch. 2

Example artifacts

Code

// Elasticsearch / OpenSearch custom analyzer for English e-commerce content
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_ecommerce": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "english_stop",
            "english_stemmer",
            "product_synonyms" // custom synonym filter, defined below
          ]
        },
        "keyword_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english" // Porter-like stemmer
        },
        "product_synonyms": {
          // "synonym" rather than "synonym_graph": this analyzer also runs at
          // index time, and synonym_graph is intended for search-time analyzers
          "type": "synonym",
          "synonyms": [
            "sneakers, running shoes, athletic shoes, trainers",
            "tv, television",
            "laptop, notebook computer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english_ecommerce" },
      "description": { "type": "text", "analyzer": "english_ecommerce" },
      "brand": { "type": "text", "analyzer": "keyword_lowercase" },
      "sku": { "type": "keyword" } // no tokenization; exact match only
    }
  }
}

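// Test the analyzer with a sample query
GET /products/_analyze
{
  "analyzer": "english_ecommerce",
  "text": "Running Shoes & Sneakers for Mén"
}
// Expected terms (approximate; positions, order, and duplicates depend on
// how the synonym filter expands the two matching phrases):
// run, shoe, sneaker, athlet, trainer, men
// (lowercased, ASCII-folded, stemmed, stop words removed, synonyms expanded;
// synonym entries pass through the preceding filters, so "athletic" appears
// as its stem "athlet")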
