Source: Elasticsearch / OpenSearch / Solr schema documentation; Grainger, AI-Powered Search; production methodology at major search teams
Classification — The pattern for designing document schemas that support diverse query patterns through deliberate per-field decisions and sub-field patterns.
Design a document schema in which each field's type, analyzer, and storage settings support the specific query behaviors the system must handle, using sub-fields to offer multiple match modes over a single source value.
Default schemas (one analyzer per field, one match mode per field) produce search experiences that can't handle the variety of query patterns production traffic includes. Navigational queries want exact phrase match; informational queries want stemmed match; autocomplete wants prefix match; faceted browsing wants keyword-style filtering. A schema with one analyzer per field forces compromises: aggressive stemming helps informational queries but hurts navigational ones; conservative analysis helps navigational queries but misses informational matches. The sub-field pattern resolves this by indexing the same content multiple ways with different analyzers; queries can target the appropriate sub-field per match mode.
The field as primary unit. Documents are collections of named fields; each field has a type, analyzer, and storage decisions. The choice of field structure determines what queries can do: a single text field with one analyzer supports one match behavior; multiple sub-fields with different analyzers support multiple match behaviors.
Type choices. Text fields are tokenized and matched lexically (they support phrase queries, multi-word matching, and BM25 scoring). Keyword fields store the entire value as one token (they support exact match, filtering, sorting, and faceting). Numeric fields (integer, long, float, double, scaled_float) support range queries and sorting. Date fields support time-based queries and decay functions. Boolean fields support true/false filters. Vector fields support dense retrieval (Volume 1 Section B). Geo fields support spatial queries. Choose the type to fit the queries; mismatched types force expensive workarounds.
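As a rough illustration of how field types map to query clauses, the sketch below builds a search body against the field names used in this section's example schema (description, brand, price, in_stock, category, popularity). The function name and its parameters are hypothetical, and the body is only constructed as a dict, not sent to a cluster.

```python
from typing import Any, Optional


def build_filtered_query(
    text: str,
    brand: Optional[str] = None,
    max_price: Optional[float] = None,
    in_stock_only: bool = True,
) -> dict[str, Any]:
    """Build a bool query: one scored text match plus non-scoring filters."""
    filters: list[dict[str, Any]] = []
    if brand is not None:
        # keyword field: exact, unanalyzed value
        filters.append({"term": {"brand": brand}})
    if max_price is not None:
        # numeric field: range comparison
        filters.append({"range": {"price": {"lte": max_price}}})
    if in_stock_only:
        # boolean field: true/false filter
        filters.append({"term": {"in_stock": True}})
    return {
        "query": {
            "bool": {
                # text field: analyzed and scored
                "must": [{"match": {"description": text}}],
                "filter": filters,
            }
        },
        # relevance first, popularity as a tie-breaker
        "sort": ["_score", {"popularity": "desc"}],
        # keyword field: facet counts
        "aggs": {"by_category": {"terms": {"field": "category"}}},
    }
```

The filter clauses do not contribute to the score, which is why the keyword, numeric, and boolean conditions sit under filter rather than must.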
Sub-field patterns. A sub-field pattern indexes one source value into multiple field variants. The most common: a text field with a language analyzer for general matching; an .exact sub-field with a keyword analyzer for exact-match boosting; an .ngram sub-field with an edge n-gram analyzer for autocomplete. The sub-fields share the source content; Elasticsearch and OpenSearch handle the multi-field indexing automatically when the schema declares the sub-fields. Queries can target individual sub-fields or combine several sub-fields with per-field boosts.
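A minimal sketch of boosting across sub-fields, assuming the title, title.exact, and title.ngram fields from this section's example schema. The boost weights and the function name are hypothetical placeholders; real weights would come from relevance testing.

```python
from typing import Any, Optional

# Hypothetical starting weights: exact match counts most, prefix match least.
DEFAULT_TITLE_BOOSTS = {"title.exact": 5.0, "title": 2.0, "title.ngram": 0.5}


def build_title_query(
    user_query: str, boosts: Optional[dict[str, float]] = None
) -> dict[str, Any]:
    """Build a multi_match query that scores one title across its sub-fields."""
    weights = DEFAULT_TITLE_BOOSTS if boosts is None else boosts
    # "field^boost" is the per-field boost syntax in multi_match.
    fields = [f"{name}^{weight:g}" for name, weight in weights.items()]
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                # most_fields sums the scores of every matching variant,
                # which suits one text indexed several ways.
                "type": "most_fields",
                "fields": fields,
            }
        }
    }
```

With these weights a document whose whole title equals the query outranks one that only matches by stem, which in turn outranks a prefix-only match.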
Analyzer assignment. Per-field analyzer choice depends on the intended match behavior. Text fields for natural language: english_full (stemming + synonyms + ASCII folding) for general body text; english_minimal (stemming only) for titles, where synonyms might over-broaden. Note that the example schema in this section defines english_full with stop-word removal and without a synonym filter, and does not define english_minimal. Text fields for identifiers: keyword analyzer (no tokenization). Text fields for autocomplete: edge_ngram analyzer at index time. Analyzer assignment matters for every text-typed field, but the choice between text and keyword type for a given field is often the most consequential decision.
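To make the autocomplete case concrete, here is a small pure-Python approximation of what an edge n-gram analyzer emits at index time, using the min_gram and max_gram values from this section's example schema. It is a simplification: whitespace splitting stands in for the standard tokenizer, and the function names are hypothetical.

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 15) -> list[str]:
    """Return the prefixes of token with lengths min_gram..max_gram inclusive."""
    upper = min(len(token), max_gram)
    # Tokens shorter than min_gram produce no grams at all.
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_time_tokens(text: str) -> list[str]:
    """Approximate the edge n-gram analyzer: lowercase, split, expand each word."""
    tokens: list[str] = []
    for word in text.lower().split():
        tokens.extend(edge_ngrams(word))
    return tokens
```

The expansion also shows why the query side usually needs a different analyzer. If the query "lap" were expanded the same way, it would become "la" and "lap", and the "la" gram would also match a title such as "Lamp".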
Storage decisions. Indexed fields are matchable (queries can match them). Stored fields are retrievable (results can display them); in Elasticsearch and OpenSearch, display values normally come from _source, so the per-field store flag is off by default. doc_values enable sorting and aggregation. Most fields are matchable, retrievable, and sortable; specific patterns deviate: a vector field is indexed but not returned in results (the vector is used for retrieval but not displayed); a pre-rendered HTML field is retrievable but not indexed (it's for display only). These decisions affect index size and performance; dropping capabilities that a field doesn't need reduces the index footprint.
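The three flags can be read off a field mapping as sketched below. This is a simplified model of Elasticsearch defaults, stated as assumptions: index defaults to true, store defaults to false (display values usually come from _source), and doc_values default to true for every type except text. The capabilities helper and the rendered_html field are hypothetical.

```python
from typing import Any


def capabilities(mapping: dict[str, Any]) -> dict[str, bool]:
    """Report what one field mapping allows, under simplified defaults."""
    field_type = mapping.get("type")
    return {
        # "index": can queries match on this field?
        "searchable": mapping.get("index", True),
        # "store": is the value kept separately from _source?
        "stored": mapping.get("store", False),
        # "doc_values": can the field be sorted and aggregated on?
        "sortable": mapping.get("doc_values", field_type != "text"),
    }


# A facet field that relies on the defaults.
brand = {"type": "keyword"}

# A hypothetical display-only field: kept for rendering, never matched.
rendered_html = {"type": "text", "index": False, "store": True}
```

Turning off index on a display-only field avoids building an inverted index that no query will ever use.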
Nested vs object fields. For structured attributes (color, size, material), the choice is between object fields (each attribute flattened into a dotted path: attributes.color, attributes.size, attributes.material) and nested fields (each object in an array is indexed as its own hidden document, so relationships between its fields are preserved). Nested fields are appropriate when the attributes have meaningful relationships (a size and color combination represents one variant); object fields are appropriate when attributes are independent. The choice affects query expressiveness: nested fields support queries like "red AND size 10 on the same variant", which object fields can't express because an array of objects is flattened into per-field value lists. Nested queries cost more at index and query time, so use them only where the pairing matters.
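The difference can be simulated in a few lines. The sketch below assumes the name/value attribute layout from this section's example schema; the two matches_as_* helpers are hypothetical stand-ins for the engine's behavior, and nested_attribute_query builds the corresponding nested query body without sending it anywhere.

```python
from typing import Any

Attr = dict[str, str]


def matches_as_object(attrs: list[Attr], name: str, value: str) -> bool:
    """Object semantics: arrays are flattened per field, so pairing is lost."""
    names = {a["name"] for a in attrs}
    values = {a["value"] for a in attrs}
    return name in names and value in values


def matches_as_nested(attrs: list[Attr], name: str, value: str) -> bool:
    """Nested semantics: both conditions must hold on the same object."""
    return any(a["name"] == name and a["value"] == value for a in attrs)


def nested_attribute_query(name: str, value: str) -> dict[str, Any]:
    """Build a nested query that keeps name and value on one attribute object."""
    return {
        "nested": {
            "path": "attributes",
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"attributes.name": name}},
                        {"term": {"attributes.value": value}},
                    ]
                }
            },
        }
    }
```

For a product with attributes color=red and size=10, asking for color=10 is a false positive under object semantics and a correct miss under nested semantics.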
Vector field design. Vector fields store dense embeddings for semantic retrieval. Decisions: dimension count (must match the embedding model: 384, 768, 1024, 1536, ...); ANN structure (HNSW is typical; IVF is an alternative for very large indices); similarity metric (cosine is the common default; on unit-normalized vectors it is equal to the dot product, which is cheaper to compute); element type (float32 is standard; int8 quantization reduces storage). Production schemas typically have 2–3 vector fields (title_vec, body_vec, summary_vec) supporting different match modes.
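Two of these decisions can be sketched directly: the relationship between cosine and dot product, and the shape of an approximate nearest-neighbor request. The knn body below follows the Elasticsearch 8.x top-level knn search option, which is an assumption about the target engine (OpenSearch uses a different query shape). All function names are hypothetical, and the default k and num_candidates values are placeholders.

```python
import math
from typing import Any


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(vec: list[float]) -> list[float]:
    """Scale vec to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def build_knn_search(
    field: str, query_vector: list[float], k: int = 10, num_candidates: int = 100
) -> dict[str, Any]:
    """Build a kNN search body; num_candidates trades recall against latency."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        }
    }
```

The query vector must come from the same embedding model as the indexed field, so a 768-dimension title_vec needs a 768-dimension query vector.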
Schema evolution. Schemas change over time as the workload evolves. Adding fields is generally safe (existing documents simply lack the new field until they are updated or reindexed). Changing the type or analyzer of an existing field requires reindexing (Section F covers the patterns). The discipline is designing schemas with foresight (anticipating likely future fields and reserving naming conventions) while accepting that some schema evolution will require operational work.
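One common way to carry out a reindex without changing what clients query is to copy into a new versioned index and then repoint an alias. The sketch below only builds the request bodies for the _reindex and _aliases endpoints; the index names products_v1 and products_v2, the alias name, and the helper functions are hypothetical, and Section F remains the reference for the full procedure.

```python
from typing import Any


def reindex_body(source_index: str, dest_index: str) -> dict[str, Any]:
    """Body for POST /_reindex: copy documents into an index with a new mapping."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict[str, Any]:
    """Body for POST /_aliases: move the alias in one atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```

Because both alias actions travel in a single request, searches against the alias never see a moment where it points at neither index.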
Every production search system has a schema, whether deliberately designed or default. The pattern applies universally; the question is whether the design is deliberate. Teams that have not validated their schema design typically have known unknowns in their search capability.
Alternatives — schema-less or document-store retrieval for specific use cases where full-text search isn't the goal. The discipline of schema design remains essential for any production search system.
- Elasticsearch mapping documentation (elastic.co)
- OpenSearch mapping documentation
- Solr schema documentation
- Grainger, AI-Powered Search, chapters on schema design
Schema / config
// Elasticsearch production schema with sub-fields and vector fields
// english_stop and english_stemmer are custom filter names, so both are defined under "filter"
// title.ngram uses a plain search_analyzer so that queries are not expanded into n-grams
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_full": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "english_stop", "english_stemmer"]
        },
        "keyword_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        },
        "edge_ngram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram_filter"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english_full",
        "fields": {
          "exact": { "type": "text", "analyzer": "keyword_lowercase" },
          "ngram": {
            "type": "text",
            "analyzer": "edge_ngram",
            "search_analyzer": "standard"
          }
        }
      },
      "brand": { "type": "keyword" },
      "category": { "type": "keyword" },
      "description": { "type": "text", "analyzer": "english_full" },
      "price": { "type": "scaled_float", "scaling_factor": 100 },
      "price_tier": { "type": "keyword" },
      "attributes": {
        "type": "nested",
        "properties": {
          "name": { "type": "keyword" },
          "value": { "type": "keyword" }
        }
      },
      "title_vec": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      },
      "body_vec": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "similarity": "cosine"
      },
      "popularity": { "type": "float" },
      "freshness": { "type": "date" },
      "sku": { "type": "keyword" },
      "in_stock": { "type": "boolean" }
    }
  }
}