Source: Burges, "From RankNet to LambdaRank to LambdaMART: An Overview" (Microsoft Research, 2010); production implementations in LightGBM ranking, XGBoost ranking, RankLib
Classification — The dominant production LTR algorithm for a decade-plus — gradient-boosted decision trees with metric-aware gradients.
Train a learned ranking model from labeled training data that combines many features (50–500 typical) into per-document scores optimized for ranking metrics (NDCG, MAP) rather than for pointwise regression accuracy.
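To make the target metric concrete, the following is a minimal sketch of NDCG@k for a single query. It assumes the exponential gain 2^grade - 1 and a log2 position discount, which is one common formulation; the function names are illustrative and not taken from any library.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades, given in ranked order."""
    return sum(
        (2 ** g - 1) / math.log2(pos + 2)  # pos is 0-based, so rank = pos + 1
        for pos, g in enumerate(grades[:k])
    )

def ndcg_at_k(grades, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

# Relevance grades of the documents, in the order the ranker returned them.
ranked_grades = [3, 2, 0, 1]
print(round(ndcg_at_k(ranked_grades, k=4), 4))
```

Because the discount shrinks with position, a mistake at rank 1 costs more than the same mistake at rank 10, which is the property pointwise regression losses do not capture.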
Hand-tuned scoring (BM25 plus boosts) scales to a handful of signals. As ranking signal sources multiplied through the 2000s, from dozens to hundreds of signals per query-document pair, manual tuning became intractable. Early LTR methods (RankNet, RankBoost) treated ranking as a classification or regression problem, missing the ranking-specific aspects of the metric, such as the extra weight NDCG places on the top positions. LambdaMART addressed this by combining gradient-boosted decision trees (which scale well to many features) with metric-aware gradients (which let the training process optimize for ranking metrics directly).
The training data. LTR training data consists of (query, document, relevance grade) triples. The relevance grades come from judgment lists (Volume 5 Section A) or from click logs with bias correction (Volume 5 Section E). The training data typically has 100–1000 queries with 20–100 documents per query; the queries are sampled to be representative of production traffic.
The model architecture. A gradient-boosted decision tree ensemble: hundreds of small decision trees combined via boosting. Each tree takes features as input and outputs a contribution to the document's score; the trees' contributions sum to produce the final score. The ensemble has many parameters (typically thousands of tree nodes across the ensemble) but trains efficiently because each tree is small and the boosting framework handles the combination.
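The additive structure described above can be sketched with depth-1 trees (stumps). This is an illustration only: the feature meanings, thresholds, and leaf values are made up, and real libraries learn deeper trees and store them in optimized formats.

```python
# Each "tree" is a stump: (feature_index, threshold, left_value, right_value).
# All numbers below are hypothetical, chosen only to show the mechanics.
STUMPS = [
    (0, 0.5, -0.2, 0.4),   # feature 0, e.g. a text-match score
    (1, 10.0, 0.1, -0.3),  # feature 1, e.g. document age in days
    (0, 0.8, 0.0, 0.5),    # a later tree may split the same feature again
]

def stump_output(stump, features):
    """Return the leaf value the feature vector falls into."""
    feature_index, threshold, left_value, right_value = stump
    return left_value if features[feature_index] <= threshold else right_value

def ensemble_score(stumps, features):
    """Final document score: the sum of every tree's contribution."""
    return sum(stump_output(s, features) for s in stumps)

doc_features = [0.9, 3.0]
print(ensemble_score(STUMPS, doc_features))
```

Scoring a document is therefore one root-to-leaf walk per tree followed by a sum, which is why inference cost grows roughly linearly with the number of trees.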
The lambda gradient. The breakthrough in LambdaRank/LambdaMART. Rather than computing gradients based on per-document loss, the lambda gradient computes per-pair gradients weighted by the change in ranking metric the pair swap would produce. If swapping two documents would change NDCG by a lot, the gradient on that pair is large; if swapping wouldn't affect NDCG much, the gradient is small. The metric-aware gradient lets training optimize for the ranking metric directly, which produces measurably better top-K rankings than pointwise or naive pairwise approaches.
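The pair weighting described above can be sketched for a single query as follows. This follows the form given in the Burges overview (a RankNet-style pairwise term scaled by |ΔNDCG|), but it is a simplified O(n²) illustration, not the code of any particular library; `sigma` and the function name are assumptions made for the example.

```python
import math

def lambda_gradients(scores, grades, sigma=1.0):
    """Gradient of the pairwise cost w.r.t. each document's score, for one query."""
    n = len(scores)
    # 0-based rank position of each document under the current scores.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    position = {doc: pos for pos, doc in enumerate(order)}
    discount = [1.0 / math.log2(position[i] + 2) for i in range(n)]
    gain = [2 ** g - 1 for g in grades]
    ideal_dcg = sum(
        g / math.log2(pos + 2)
        for pos, g in enumerate(sorted(gain, reverse=True))
    )
    lambdas = [0.0] * n
    if ideal_dcg == 0:
        return lambdas  # no relevant documents, nothing to learn from
    for i in range(n):
        for j in range(n):
            if grades[i] <= grades[j]:
                continue  # only pairs where i is more relevant than j
            # How much NDCG would change if i and j traded rank positions.
            delta_ndcg = abs(
                (gain[i] - gain[j]) * (discount[i] - discount[j])
            ) / ideal_dcg
            lam = -sigma * delta_ndcg / (
                1.0 + math.exp(sigma * (scores[i] - scores[j]))
            )
            lambdas[i] += lam  # negative gradient: descent raises i's score
            lambdas[j] -= lam  # positive gradient: descent lowers j's score
    return lambdas

# The most relevant document (grade 3) is currently ranked third.
print(lambda_gradients(scores=[0.2, 0.9, 0.5, 0.1], grades=[3, 0, 1, 0]))
```

In LambdaMART these per-document values serve as the targets each new tree is fit to, so pairs whose swap would move NDCG the most dominate what the next tree learns.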
Pointwise, pairwise, listwise framing. Pointwise treats each (query, document, grade) as an independent regression problem: predict the grade. Pairwise treats each pair as a classification problem: which document of the pair should be ranked higher. Listwise considers the full ranked list at once. LambdaMART is fundamentally pairwise with metric-aware gradients; the lambda terms incorporate listwise information without requiring fully listwise training. The framing matters because different implementations support different framings; choose based on the framing the production training data supports.
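A toy comparison may help show why the framing matters. The sketch below scores two hypothetical models on one query: one whose predictions are close to the grades but misorders a pair, and one whose predictions are far from the grades but perfectly ordered. The function names and numbers are illustrative assumptions.

```python
from itertools import combinations

def pointwise_squared_error(scores, grades):
    """Pointwise view: how far each score is from its grade."""
    return sum((s - g) ** 2 for s, g in zip(scores, grades)) / len(scores)

def pairwise_misordered(scores, grades):
    """Pairwise view: pairs where the less relevant doc scores at least as high."""
    bad = 0
    for i, j in combinations(range(len(scores)), 2):
        if grades[i] == grades[j]:
            continue  # ties carry no ordering preference
        hi, lo = (i, j) if grades[i] > grades[j] else (j, i)
        if scores[hi] <= scores[lo]:
            bad += 1
    return bad

grades = [3, 1, 0]
close_but_misordered = [2.0, 0.9, 1.0]    # near the grades, last two docs swapped
far_but_ordered = [30.0, 20.0, 10.0]      # far from the grades, perfect order

for name, scores in [("close", close_but_misordered), ("far", far_but_ordered)]:
    print(name, pointwise_squared_error(scores, grades),
          pairwise_misordered(scores, grades))
```

The pointwise loss prefers the first model while the pairwise count prefers the second, and only the second produces the correct ranking. A listwise or metric-aware objective additionally accounts for where in the list a misordered pair sits.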
Production implementations. LightGBM ranking (Microsoft's gradient boosting library with LambdaRank objective) is widely used in production. XGBoost ranking (similar capability in XGBoost) is a popular alternative. RankLib provides Java implementations including LambdaMART. Each implementation has tuning parameters: tree count, tree depth, learning rate, leaf count, regularization. Production teams typically tune via cross-validation against held-out judgments.
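When cross-validating a ranking model, folds are usually split by query rather than by row, so that documents from one query never appear on both sides of a split. The sketch below is one simple way to do that with a deterministic round-robin assignment; the function name and the assignment scheme are assumptions for illustration, and a shuffled assignment is equally valid.

```python
def query_grouped_folds(query_ids, n_folds=5):
    """Yield (train_rows, valid_rows) index lists with each query wholly on one side."""
    unique_queries = sorted(set(query_ids))
    for fold in range(n_folds):
        # Round-robin: every n_folds-th query goes into this fold's held-out set.
        held_out = {
            q for pos, q in enumerate(unique_queries) if pos % n_folds == fold
        }
        train_rows = [r for r, q in enumerate(query_ids) if q not in held_out]
        valid_rows = [r for r, q in enumerate(query_ids) if q in held_out]
        yield train_rows, valid_rows

# One entry per (query, document) row, in the same order as the feature matrix.
row_query_ids = ["q1", "q1", "q2", "q2", "q2", "q3"]
for train_rows, valid_rows in query_grouped_folds(row_query_ids, n_folds=3):
    print(train_rows, valid_rows)
```

Each fold's row indices can then be used to slice the feature matrix, labels, and per-query group sizes before training, with the hyperparameter setting chosen by its average validation NDCG across folds.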
Inference. At query time, the trained model is applied to each candidate document's feature vector. For 1000 candidates with 100 features and a 500-tree model, inference takes roughly 5–20ms on CPU; faster with optimized libraries. Production systems integrate LTR inference into the ranking pipeline (often via plugins like Elasticsearch's Learning to Rank plugin or Solr's LTR contrib module).
Production search systems with sufficient training data (judgment lists or bias-corrected click logs) to train a model. The dominant LTR algorithm in production for mid-stage ranking. Often used between first-stage retrieval and neural reranking (Section D) in cascade architectures.
Alternatives — simpler hand-tuned scoring for cold-start without training data. Neural LTR (next entry) for very large training sets where GBDT may underutilize the data. Neural rerankers (Section D) for top-K reranking where their quality wins justify the cost. LambdaMART/GBDT remains the default for production LTR.
- Burges, "From RankNet to LambdaRank to LambdaMART: An Overview" (Microsoft Research, 2010)
- Liu, Learning to Rank for Information Retrieval (Foundations and Trends, 2009)
- LightGBM ranking documentation (lightgbm.readthedocs.io)
- XGBoost ranking documentation (xgboost.readthedocs.io)
- Elasticsearch Learning to Rank plugin (elasticsearch-learning-to-rank.readthedocs.io)
Code
# LightGBM LTR with LambdaRank objective
import lightgbm as lgb
import numpy as np
import pandas as pd
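# Training data format:
# - features: (n_samples, n_features) matrix
# - labels: relevance grades (0-4 typical)
# - group: array of per-query document counts,
#   e.g. group=[20, 30, 25] means query 1 has 20 docs, query 2 has 30 docs, etc.
# features_df and FEATURE_COLS are assumed to be defined by the caller.
# Rows must be contiguous per query and in the same order as the group array,
# so sort by query_id first (groupby returns sizes in sorted key order).
features_df = features_df.sort_values('query_id', kind='stable')
X_train = features_df[FEATURE_COLS].values  # shape: (N, n_features)
y_train = features_df['relevance_grade'].values  # shape: (N,)
groups_train = features_df.groupby('query_id').size().values  # docs per query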
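# Create LightGBM datasets with group info (essential for ranking)
train_dataset = lgb.Dataset(
    X_train,
    label=y_train,
    group=groups_train
)
# Validation set used for early stopping. X_val, y_val, and groups_val are
# assumed to be built from held-out queries the same way as the training arrays.
validation_dataset = lgb.Dataset(
    X_val,
    label=y_val,
    group=groups_val,
    reference=train_dataset
)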
# Train with LambdaRank objective
params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'ndcg_eval_at': [3, 5, 10],  # evaluate NDCG@3, NDCG@5, NDCG@10
    'learning_rate': 0.05,
    'num_leaves': 31,
    'min_data_in_leaf': 50,
    'lambdarank_truncation_level': 10,  # focus on top-10 positions
    'verbose': -1,
}
model = lgb.train(
    params,
    train_dataset,
    num_boost_round=500,
    valid_sets=[validation_dataset],
    callbacks=[lgb.early_stopping(stopping_rounds=20)]
)
# Inference at query time
# extract_features is assumed to return one feature vector in FEATURE_COLS order.
def rank_candidates(query, candidates):
    # Extract features for each (query, candidate) pair
    feature_matrix = np.array([
        extract_features(query, doc) for doc in candidates
    ])
    # Score each candidate
    scores = model.predict(feature_matrix)
    # Sort by score descending
    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [doc for doc, score in ranked]
# Feature importance for debugging
importance = model.feature_importance(importance_type='gain')
feature_importance_df = pd.DataFrame({
    'feature': FEATURE_COLS,
    'importance': importance
}).sort_values('importance', ascending=False)
print(feature_importance_df.head(20))