Source: Broder, "A Taxonomy of Web Search" (2002); Jurafsky and Martin, Speech and Language Processing; production methodology at major e-commerce and consumer search companies

Classification — Methods for classifying query intent into discrete classes (navigational, informational, transactional, conversational, ...) for routing and feature use.

Intent

Classify each query into intent classes with confidence scores, supporting downstream routing decisions and providing features for ranking models.

Motivating Problem

Different intents deserve different retrieval architectures. Navigational queries want exact-match-first retrieval; informational queries want hybrid retrieval with diversity; conversational queries want RAG-style retrieval. Without classification, the system applies one architecture to all queries, compromising results for at least some intents. Classification produces the signal that lets routing handle different intents differently.

How It Works

Rule-based classification. Heuristics based on query characteristics. Examples: queries starting with question words (who, what, when, where, why, how) are informational or conversational; queries containing currency symbols or terms like "buy", "order", "price" are transactional; queries with single product names or SKUs are navigational; queries longer than ~6 tokens with natural-language structure are conversational. Rules are simple, interpretable, and cheap to evaluate; they produce clear classifications but don't generalize well to query distributions the rules weren't designed for.

Rule-based limitations. The rules need maintenance as query patterns evolve. New query types (e.g., emoji queries, voice-input queries with characteristic punctuation patterns) require rule updates. The rules also have correctness limits: "nike air max" doesn't obviously fit one rule but is clearly navigational; "good running shoes" doesn't obviously fit one rule but is clearly informational. Pure rule-based systems leave many queries miscategorized or unclassified.

ML-based classification. Train a classifier on labeled queries: each query labeled with its true intent class. The classifier learns features that correlate with each class. Standard ML methods: logistic regression for fast inference and interpretable feature importance; gradient boosting (LightGBM, XGBoost) for higher accuracy; transformer-based classifiers (BERT fine-tuned for classification) for highest accuracy at higher cost. Features can be hand-engineered (query length, presence of question words, token IDFs) or learned (encoder embeddings). The classifier produces a class label and a confidence score.
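The hand-engineered features mentioned above can be sketched as a small extraction function. This is a minimal illustration, not a reference implementation: the function name `extract_features`, the word lists, the smoothed IDF formula, and the `doc_freq` / `n_docs` inputs are all assumptions chosen for the example. The resulting dictionary would be vectorized and passed to whichever model is chosen (logistic regression, gradient boosting, and so on).

```python
import math
import re

# Illustrative word lists (assumptions, not a vetted vocabulary).
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}
TRANSACTIONAL_TERMS = {"buy", "order", "purchase", "price"}


def extract_features(query: str, doc_freq: dict[str, int], n_docs: int) -> dict[str, float]:
    """Turn a raw query into hand-engineered numeric features.

    doc_freq maps a token to the number of documents containing it;
    n_docs is the corpus size. Both are hypothetical inputs here.
    """
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    # Smoothed IDF: tokens missing from doc_freq get the highest value.
    idfs = [math.log((n_docs + 1) / (doc_freq.get(t, 0) + 1)) + 1.0 for t in tokens]
    return {
        "n_tokens": float(len(tokens)),
        "starts_with_question_word": float(bool(tokens) and tokens[0] in QUESTION_WORDS),
        "has_transactional_term": float(any(t in TRANSACTIONAL_TERMS for t in tokens)),
        "has_digit": float(any(ch.isdigit() for ch in query)),
        "mean_idf": sum(idfs) / len(idfs) if idfs else 0.0,
    }
```

Learned features (encoder embeddings) would replace or be concatenated with this dictionary; the sketch covers only the hand-engineered side.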

Training data. The classifier needs labeled training data: queries with intent labels. Sources: explicit annotation (an analyst labels a representative sample of production queries); pseudo-labeling from query behavior (queries that consistently led to specific clicks or conversions can be auto-labeled with high confidence); LLM-generated labels (prompt an LLM to label each query, with expert validation on a sample). Production deployments typically combine a small expert-labeled gold set for validation with a larger pseudo-labeled or LLM-labeled set for training.
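Pseudo-labeling from behavior can be sketched as a simple aggregation over the search log. The action names, the `ACTION_TO_INTENT` mapping, and the two thresholds below are hypothetical; a real log schema and real cut-offs would come from the system's own data.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from logged user actions to intent classes.
ACTION_TO_INTENT = {
    "add_to_cart": "transactional",
    "purchase": "transactional",
    "product_page_click": "navigational",
    "article_click": "informational",
}


def pseudo_label(events, min_events=20, min_share=0.8):
    """Label a query only when one intent dominates its logged behavior.

    events: iterable of (query, action) pairs.
    Returns {query: (intent, share)} for queries that have at least
    min_events mapped events and whose top intent share is >= min_share.
    """
    counts = defaultdict(Counter)
    for query, action in events:
        intent = ACTION_TO_INTENT.get(action)
        if intent is not None:  # ignore actions with no intent mapping
            counts[query][intent] += 1

    labels = {}
    for query, intent_counts in counts.items():
        total = sum(intent_counts.values())
        intent, top = intent_counts.most_common(1)[0]
        if total >= min_events and top / total >= min_share:
            labels[query] = (intent, top / total)
    return labels
```

Queries that fail either threshold stay unlabeled, which keeps low-volume or mixed-behavior queries out of the training set rather than guessing at them.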

LLM-based classification. Prompt an LLM directly: "Classify the following query as navigational, informational, transactional, or conversational. Query: [...]". The LLM produces the class label and (with appropriate prompting) a confidence score or reasoning. The pattern handles unusual queries well, adapts to taxonomy changes via prompt updates rather than retraining, and integrates context naturally (LLMs can use additional context like user history if provided). The trade-offs: latency (an LLM call in the query path typically adds tens to hundreds of milliseconds, depending on model size and hosting), cost (per-query LLM cost adds up at scale), and consistency (outputs can vary between calls; pinning the model version and setting temperature to zero reduces the variation but does not guarantee identical outputs).

Production deployment patterns. Most mature systems combine: rules for the easiest cases (queries that obviously fit one class, handled cheaply); ML classifier for the bulk of queries (good accuracy at low latency); LLM fallback for unusual queries that the rules and ML classifier are uncertain about. The combination produces high coverage at controlled cost; the routing infrastructure (Volume 1 Section E) consumes the output for retrieval architecture selection.
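The three-tier combination can be sketched as a cascade that stops at the cheapest stage able to answer confidently. The stages are passed in as callables so the sketch stays independent of any particular model; the function name `cascade`, the `Prediction` alias, and the 0.7 threshold are assumptions for illustration.

```python
from typing import Callable, Optional

Prediction = tuple[str, float]  # (intent label, confidence)


def cascade(
    query: str,
    rules: Callable[[str], Optional[Prediction]],
    ml_model: Callable[[str], Prediction],
    llm: Callable[[str], Prediction],
    ml_threshold: float = 0.7,
) -> tuple[str, float, str]:
    """Return (intent, confidence, stage) from the first confident stage."""
    # Stage 1: rules either fire with a prediction or return None.
    hit = rules(query)
    if hit is not None:
        return (*hit, "rules")

    # Stage 2: accept the ML prediction only above the threshold.
    intent, conf = ml_model(query)
    if conf >= ml_threshold:
        return intent, conf, "ml"

    # Stage 3: everything else goes to the (slow, costly) LLM.
    intent, conf = llm(query)
    return intent, conf, "llm"
```

Returning the stage name alongside the prediction makes it possible to monitor what share of traffic reaches the LLM tier, which is the main cost lever in this design. The ML threshold is only meaningful if the classifier's confidence is calibrated.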

Confidence calibration. Classifiers produce confidence scores; the scores need to be calibrated so that "95% confident" actually means the right answer 95% of the time. Calibration methods: Platt scaling, isotonic regression, temperature scaling for transformer outputs. Well-calibrated confidence is essential for confidence-based routing decisions; poorly calibrated confidence produces routing failures that look like classification failures.
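Temperature scaling, one of the methods listed above, divides the classifier's logits by a single scalar fitted on held-out data. The sketch below fits that scalar by grid search over validation negative log-likelihood; the function names and the grid range are assumptions, and production code would more likely use a gradient-based optimizer.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a temperature."""
    losses = [
        -math.log(max(softmax(row, temperature)[y], 1e-12))  # clamp avoids log(0)
        for row, y in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the grid temperature that minimizes validation NLL."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # 0.5 .. 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))
```

A fitted temperature above 1 indicates the raw model was overconfident. Scaling leaves the predicted class unchanged and only adjusts the confidence attached to it, so it affects routing thresholds but not accuracy.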

Multi-label classification. Some queries have multiple intents. "Running shoes" is partly informational (the user wants to know about options) and partly transactional (the user is likely shopping). Multi-label classification handles this by allowing each query to have multiple class assignments with a separate confidence per class. The pattern is more accurate than forcing single-class assignment but produces routing complexity: if a query scores 0.6 for informational and 0.4 for transactional, which architecture should it route to? Production systems handle this with hybrid architectures that serve both intents.
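One way to turn per-class scores into a routing decision is to activate every architecture whose intent clears a threshold. The architecture names, the `select_architectures` function, and the 0.35 threshold below are hypothetical placeholders, not part of any specific system.

```python
# Hypothetical intent -> retrieval architecture mapping.
ARCHITECTURE = {
    "navigational": "exact_match_first",
    "informational": "hybrid_diverse",
    "transactional": "product_ranker",
    "conversational": "rag",
}


def select_architectures(scores, threshold=0.35, default="hybrid_diverse"):
    """Return architectures for every intent scoring at or above threshold.

    scores: {intent: independent confidence in [0, 1]} (need not sum to 1).
    Results are ordered by descending score; if no intent clears the
    threshold, the default architecture is returned alone.
    """
    active = sorted(
        (intent for intent, s in scores.items() if s >= threshold),
        key=scores.get,
        reverse=True,
    )
    return [ARCHITECTURE[intent] for intent in active] or [default]
```

For the 0.6 / 0.4 case described above, both the informational and transactional architectures are selected, and a downstream blender (not shown) would merge their results.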

When to Use It

Production search with heterogeneous query types (most e-commerce, most consumer search, most enterprise search). Systems where different intents would benefit from different retrieval architectures or different ranking models. Cases where query log analysis shows that uniform handling produces worse outcomes for specific intent classes.

Alternatives: single-architecture deployment, for narrow workloads with uniform query types; implicit intent handling, where the ranking model learns intent-correlated features without explicit classification. For diverse query distributions, explicit classification typically outperforms implicit handling.

Sources

Broder, "A Taxonomy of Web Search" (SIGIR Forum, 2002)
Jurafsky and Martin, Speech and Language Processing (3rd ed., free online drafts)
Production methodology from search teams at e-commerce companies
Coveo machine learning intent documentation

Example artifacts

Code

# LLM-based intent classification (production pattern)
import json
from enum import Enum

import anthropic


class Intent(Enum):
    NAVIGATIONAL = "navigational"
    INFORMATIONAL = "informational"
    TRANSACTIONAL = "transactional"
    CONVERSATIONAL = "conversational"


CLASSIFY_PROMPT = """Classify the following search query into one of four intent classes:

- navigational: User wants to find a specific item, brand, page, or known entity.
  Examples: "nike air max 270", "apple support", "SKU-12345"
- informational: User wants to learn or browse without specific purchase intent.
  Examples: "how to clean leather shoes", "benefits of running", "red running shoes"
- transactional: User explicitly wants to buy, book, or complete an action.
  Examples: "buy nike shoes size 10", "book SFO to JFK flight", "download adobe reader"
- conversational: User asks a natural-language question expecting a synthesized answer.
  Examples: "which shoes are good for marathons?", "how does Nike compare to Adidas?"

Query: "{query}"

Respond with JSON: {{"intent": "<class>", "confidence": <0-1>, "reasoning": "<brief explanation>"}}"""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def classify_intent(query: str) -> dict:
    """Classify one query with the LLM; returns intent, confidence, reasoning."""
    response = client.messages.create(
        # Haiku is fast/cheap; sufficient for classification
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        temperature=0,  # zero for consistency
        messages=[{
            "role": "user",
            "content": CLASSIFY_PROMPT.format(query=query),
        }],
    )
    text = response.content[0].text.strip()
    # Strip code-fence markdown if present
    text = text.replace("```json", "").replace("```", "").strip()
    try:
        result = json.loads(text)
        return {
            "intent": Intent(result["intent"]),  # ValueError on unknown class
            "confidence": float(result["confidence"]),
            "reasoning": result["reasoning"],
        }
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as e:
        # Fall back to a default class with low confidence on parse failure
        # (TypeError covers non-object JSON or a null confidence value)
        return {
            "intent": Intent.INFORMATIONAL,
            "confidence": 0.0,
            "reasoning": f"Parse failure: {e}",
        }


# Hybrid pattern: rules first, LLM fallback
QUESTION_WORDS = {'who', 'what', 'when', 'where', 'why', 'how', 'which'}
TRANSACTIONAL_TERMS = {'buy', 'order', 'purchase', 'book', 'download'}


def classify_with_rules(query: str) -> dict | None:
    """Return a rule-based classification, or None if no rule fires."""
    tokens = query.lower().split()
    if not tokens:
        return None
    # Longer queries that open with a question word read as conversational
    if tokens[0] in QUESTION_WORDS and len(tokens) > 4:
        return {"intent": Intent.CONVERSATIONAL, "confidence": 0.85,
                "reasoning": "Question word + length"}
    if any(t in TRANSACTIONAL_TERMS for t in tokens):
        return {"intent": Intent.TRANSACTIONAL, "confidence": 0.85,
                "reasoning": "Transactional term"}
    return None  # rules didn't fire; defer to LLM


def classify(query: str) -> dict:
    """Try the cheap rules first; call the LLM only when they don't fire."""
    result = classify_with_rules(query)
    if result is not None:
        return result
    return classify_intent(query)

Intent classification across rule, ML, and LLM approaches