Source: Production methodology at major search teams; Grainger, AI-Powered Search; Lefortier et al. on online evaluation

Classification — The discipline of designing, validating, and managing features for ranking models.

Intent

Build a feature set that contributes meaningfully to ranking quality, validate each feature's value through ablation, and manage the feature pipeline at production scale with consistency between training and serving.

Motivating Problem

Adding features to an LTR model without ablation is one of the most common mistakes in feature engineering. A feature that looks plausible may add noise rather than signal; many features that domain experts expect to help turn out not to. Without ablation, the model accumulates features of marginal or negative value, which increase model size and make the model harder to interpret. Feature engineering therefore involves not just generating candidate features but rigorously validating which ones contribute.

How It Works

Feature generation. Start from the four categories (Chapter 3): query features, document features, query-document features, context features. For each category, brainstorm candidate features based on domain knowledge: what signals would a domain expert use to determine relevance? E-commerce: brand reputation, inventory level, price competitiveness, sales velocity. Enterprise search: document recency, author authority, document type. The brainstorming produces a candidate list; the validation determines which candidates earn places in the production model.

Feature computation. Each feature is implemented as a function: given a query and a document (and possibly context), return a numeric value. The function runs at training time and at query time; the implementations must produce identical values in both contexts. Training-serving skew (the feature value differs between training and serving) is a major production failure mode that requires careful engineering to prevent. The implementations are typically version-controlled code, deployed as part of the search service, with explicit testing for training-serving consistency.
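
A minimal sketch of this arrangement, assuming an e-commerce setting: the Doc fields, the two feature functions, and the find_skew helper are hypothetical names chosen for illustration. The point is that the training job and the serving path both import one registry of feature functions, and that logged serving values can be recomputed offline to detect skew.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical document fields for an e-commerce catalog.
    title: str
    price: float
    median_category_price: float


def title_term_overlap(query: str, doc: Doc) -> float:
    """Fraction of distinct query terms that appear in the document title."""
    q_terms = set(query.lower().split())
    if not q_terms:
        return 0.0
    t_terms = set(doc.title.lower().split())
    return len(q_terms & t_terms) / len(q_terms)


def price_competitiveness(query: str, doc: Doc) -> float:
    """Log ratio of category median price to document price (positive = cheaper)."""
    if doc.price <= 0 or doc.median_category_price <= 0:
        return 0.0
    return math.log(doc.median_category_price / doc.price)


# One registry, imported by BOTH the training job and the serving path.
FEATURES = {
    "title_term_overlap": title_term_overlap,
    "price_competitiveness": price_competitiveness,
}


def compute_features(query: str, doc: Doc) -> dict:
    return {name: fn(query, doc) for name, fn in FEATURES.items()}


def find_skew(logged_rows, tolerance=1e-9):
    """Recompute features for logged (query, doc, served_features) rows.

    Returns one (query, feature_name, served_value, recomputed_value) tuple
    per mismatch; a missing served value counts as a mismatch.
    """
    mismatches = []
    for query, doc, served in logged_rows:
        for name, value in compute_features(query, doc).items():
            served_value = served.get(name, float("nan"))
            if not math.isclose(value, served_value, abs_tol=tolerance):
                mismatches.append((query, name, served.get(name), value))
    return mismatches
```

Running find_skew over a sample of serving logs as part of the test suite is one way to make the consistency check explicit; an empty result means the recomputed and served values agree within the tolerance.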

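Single-feature validation. Before including a feature in the model, validate it independently: does it correlate with relevance? Compute the feature for all (query, document) pairs in the judgment list and check whether feature values track relevance grades. Spearman rank correlation is a common metric. As rough rules of thumb, an absolute correlation above 0.1 is a weak signal worth considering, above 0.3 is moderate, and above 0.5 is strong; a strongly negative correlation is as informative as a positive one. Features with near-zero correlation are unlikely to add value on their own, although a tree-based model can still exploit them through interactions with other features, so treat this check as a screen before ablation rather than a final verdict.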
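
The correlation check can be sketched in a few lines. The version below is a plain-Python illustration that computes Spearman correlation as the Pearson correlation of average ranks (so ties in relevance grades are handled); the function names and the signal_strength buckets are illustrative, with the thresholds taken from the rule of thumb in the text. In practice a statistics library would normally be used instead.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant input: correlation undefined, treated as no signal
    return cov / (vx * vy) ** 0.5


def spearman(feature_values, relevance_grades):
    """Spearman rank correlation between a feature and relevance grades."""
    return pearson(average_ranks(feature_values), average_ranks(relevance_grades))


def signal_strength(rho):
    """Bucket |rho| using the rule-of-thumb thresholds from the text."""
    magnitude = abs(rho)
    if magnitude > 0.5:
        return "strong"
    if magnitude > 0.3:
        return "moderate"
    if magnitude > 0.1:
        return "weak"
    return "negligible"
```

A call such as signal_strength(spearman(values, grades)) over the pooled judgment list gives a quick screen; computing the correlation per query and averaging is a reasonable variant when feature scales differ strongly between queries.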

Ablation studies. The gold-standard validation: train the model with and without each feature and measure the NDCG change on held-out judgments. Features that improve NDCG meaningfully when added (or decrease it meaningfully when removed) are contributors; features whose removal leaves NDCG unchanged within the run-to-run variation of training are non-contributors and should be removed. Ablation is expensive (each feature requires a full model training) but gives the most direct evidence of a feature's contribution; production teams typically ablate features in batches rather than individually.
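
The drop-one loop can be sketched as follows. This is an illustration under stated assumptions, not a production harness: train_toy_model is a deliberately trivial stand-in (each feature is weighted by its covariance with the grade) so the example runs without an LTR library, and in real use it would be replaced by the team's actual trainer. The data layout, one list of (feature_dict, grade) pairs per query, is also an assumption.

```python
import math


def dcg(grades, k):
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg(ranked_grades, k=10):
    ideal = dcg(sorted(ranked_grades, reverse=True), k)
    return dcg(ranked_grades, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(score_fn, queries, k=10):
    """queries: list of per-query lists of (feature_dict, grade) pairs."""
    total = 0.0
    for docs in queries:
        ranked = sorted(docs, key=lambda d: score_fn(d[0]), reverse=True)
        total += ndcg([grade for _, grade in ranked], k)
    return total / len(queries)


def train_toy_model(train_queries, feature_names):
    """Stand-in for a real LTR trainer: weight = covariance of feature with grade."""
    rows = [d for docs in train_queries for d in docs]
    n = len(rows)
    mean_g = sum(g for _, g in rows) / n
    weights = {}
    for name in feature_names:
        mean_f = sum(f[name] for f, _ in rows) / n
        weights[name] = sum((f[name] - mean_f) * (g - mean_g) for f, g in rows) / n
    return lambda feats: sum(w * feats[name] for name, w in weights.items())


def ablate(train_queries, test_queries, feature_names, train_fn=train_toy_model, k=10):
    """Drop-one ablation on held-out queries.

    Returns (baseline_ndcg, {feature: ndcg_without_feature - baseline_ndcg}).
    A clearly negative delta means removing the feature hurts ranking quality.
    """
    baseline = mean_ndcg(train_fn(train_queries, feature_names), test_queries, k)
    deltas = {}
    for name in feature_names:
        kept = [f for f in feature_names if f != name]
        without = mean_ndcg(train_fn(train_queries, kept), test_queries, k)
        deltas[name] = without - baseline
    return baseline, deltas
```

Batch ablation follows the same shape: pass a group of feature names to drop instead of a single name, and only split the group further if removing it changes NDCG.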

Feature importance from trained models. GBDT models report feature importance (gain or split count). Features with low importance are candidates for removal; features with high importance are core to the model. Importance alone isn't sufficient for removal decisions (high-importance features could still be redundant with each other; low-importance features could still be necessary for specific query types), but it's a starting point for ablation prioritization.
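
One way to turn importance scores into an ablation plan is sketched below. It assumes the importances have already been exported from the trained model as a plain mapping from feature name to gain; the function name, the batch size, and the example feature names are hypothetical.

```python
def ablation_batches(importance, batch_size=3):
    """Order features from least to most important and chunk them into batches.

    importance: dict mapping feature name -> non-negative importance (e.g. gain).
    Returns (batches, share), where share is each feature's fraction of the
    total importance. The least important batch comes first, since those
    features are the likeliest removal candidates.
    """
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("importance values must sum to a positive number")
    share = {name: value / total for name, value in importance.items()}
    # Sort by share, breaking ties by name so the plan is deterministic.
    ordered = sorted(share, key=lambda name: (share[name], name))
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    return batches, share
```

The batches only set the order in which ablations are run. The decision to remove a feature still rests on the ablation result, for the redundancy and query-type reasons given above.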

Feature pipeline infrastructure. At small scale, features are computed inline in training and serving code. At larger scale, feature stores (Feast, Tecton, Hopsworks, or custom platforms) manage feature definitions, computation, and serving. The feature store provides: consistent feature definitions across training and serving; pre-computed features for offline access; online feature serving for query-time lookup; feature versioning and rollback. The infrastructure investment is substantial; production teams typically build it once feature count and complexity justify the engineering.

Feature monitoring in production. Production features can drift: a feature that depends on an upstream signal may produce stale or wrong values if that signal changes. Monitoring means tracking per-feature statistics over time (mean, variance, distribution), alerting when distributions shift significantly, and investigating root causes. Monitoring is an operations discipline; the planned Volume 6 (Search Operations) covers the broader practice.
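
A minimal sketch of the alerting step, assuming feature values are sampled into a baseline window and a current window. It flags a feature when its mean has moved by more than a chosen number of baseline standard deviations; the function names and the 0.5 threshold are illustrative assumptions, and a fuller monitor would also compare variance and the shape of the distribution.

```python
import statistics


def summarize(values):
    """Per-feature summary statistics for one time window."""
    return {"mean": statistics.fmean(values), "stdev": statistics.pstdev(values)}


def drift_alerts(baseline, current, max_mean_shift=0.5):
    """Flag features whose mean moved too far from the baseline window.

    baseline, current: dicts mapping feature name -> list of sampled values.
    The shift is measured in baseline standard deviations; max_mean_shift is
    an assumed threshold that would need tuning per feature.
    Returns a list of (feature_name, shift) tuples.
    """
    alerts = []
    for name, base_values in baseline.items():
        base = summarize(base_values)
        cur = summarize(current[name])
        diff = abs(cur["mean"] - base["mean"])
        if base["stdev"] > 0:
            shift = diff / base["stdev"]
        else:
            # Constant baseline: any change in the mean counts as drift.
            shift = 0.0 if diff == 0 else float("inf")
        if shift > max_mean_shift:
            alerts.append((name, shift))
    return alerts
```

For example, an upstream outage that zeroes out a sales-velocity feature would show up as a large shift in its mean, while ordinary day-to-day variation in a document-age feature would stay under the threshold.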

When to Use It

Every production LTR deployment needs feature engineering discipline. The investment compounds: features built once support many model iterations; the validation methodology applies to every new feature added. Teams without this discipline typically build models with many features that don't contribute, producing bloated models that are hard to maintain.

Alternatives — neural rerankers (Section D) that operate directly on raw text rather than engineered features bypass much of the feature engineering work, at the cost of higher computational requirements and less interpretability. Feature engineering remains essential for the LTR portion of cascade architectures even when neural rerankers handle the top-K stage.

Sources

Grainger, AI-Powered Search, chapters on feature engineering for relevance
Production methodology writings from search teams at Etsy, Spotify, others
Feast feature store documentation (feast.dev)

Feature engineering and ablation methodology