Source: Production methodology at major RAG products (Perplexity, You.com, ChatGPT browse/search); RAG literature 2023–2025
Classification — Pattern for rewriting user queries to be retrieval-ready, with conversation context preserved for multi-turn search.
Transform raw user queries into queries that produce better retrieval, particularly handling pronoun resolution, context dependencies, and the gap between conversational language and indexable terms.
Raw user queries often retrieve poorly. 'How much does it cost' alone has no useful retrieval target; the context of the prior conversation is needed. 'Best for kids' needs the category context (best running shoes? best laptops?) to retrieve usefully. Traditional query analyzers can't fill these gaps because they don't reason about context.
LLMs are well-suited to query rewriting because the task is exactly what they're trained for: take linguistic input, produce linguistic output, using context. The latency cost is acceptable (single LLM call, typically 100–300ms with a small fast model); the quality lift is substantial; the implementation is straightforward.
Input. The user's current message plus the conversation history (or a summary of it). Production patterns: include last 3–5 turns verbatim; longer history summarized; cap total context at a few thousand tokens to control LLM cost.
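The input-assembly step above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `build_history_context`, the summary header wording, and the four-characters-per-token estimate are all assumptions made for the example; a real system would use the model's own tokenizer.

```python
from typing import Dict, List, Optional


def build_history_context(
    history: List[Dict[str, str]],
    summary: Optional[str] = None,
    max_verbatim_turns: int = 5,
    max_tokens: int = 2000,
) -> str:
    """Format recent turns verbatim, prefixed by an optional summary of older turns."""

    def approx_tokens(text: str) -> int:
        # Rough heuristic (about 4 characters per token); an assumption, not a tokenizer.
        return max(1, len(text) // 4)

    budget = max_tokens
    header = ""
    if summary:
        header = f"Summary of earlier conversation: {summary}"
        budget -= approx_tokens(header)

    lines: List[str] = []
    # Walk backwards so the most recent turns survive when the budget runs out.
    for turn in reversed(history[-max_verbatim_turns:]):
        line = f"{turn['role']}: {turn['content']}"
        cost = approx_tokens(line)
        if cost > budget:
            break
        lines.append(line)
        budget -= cost
    lines.reverse()  # restore chronological order (most recent last)

    if header:
        lines.insert(0, header)
    return "\n".join(lines)
```

Dropping the oldest verbatim turns first is one reasonable choice, on the assumption that the latest turns carry most of the references the rewriter has to resolve.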
Prompt. A clear instruction to produce a self-contained query suitable for retrieval. The prompt should specify: produce a single retrieval query, not multiple; preserve user intent precisely; resolve pronouns and references from context; output the query directly without explanation.
Output handling. The LLM returns the rewritten query. Production patterns: strip any wrapping (explanations, quotes) the LLM may add; validate the output isn't empty; fall back to the original query if rewriting fails or produces obviously wrong output.
Multi-query expansion. A variant where the LLM produces multiple queries from a complex input. Useful when the user's question spans multiple sub-topics; each sub-query goes through retrieval independently and results are merged. Production patterns: cap the number of sub-queries (typically 3–5) to control cost; assign each sub-query equal weight unless intent suggests otherwise.
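A sketch of the two non-LLM halves of multi-query expansion: parsing and capping the sub-queries, then merging the per-sub-query results. The source only says results are "merged"; reciprocal rank fusion is used here as one common merging choice, not as the method any of the named products is known to use. The names `parse_sub_queries` and `merge_results`, the one-query-per-line output format, and the constant `k=60` are assumptions for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Leading list markers such as "- ", "* ", "1. " or "2) " that an LLM may add.
_BULLET = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def parse_sub_queries(llm_output: str, max_queries: int = 5) -> List[str]:
    """Split one-query-per-line output, drop duplicates, and cap the count."""
    seen = set()
    queries: List[str] = []
    for raw in llm_output.splitlines():
        q = _BULLET.sub("", raw).strip()
        key = q.lower()
        if q and key not in seen:
            seen.add(key)
            queries.append(q)
        if len(queries) == max_queries:
            break
    return queries


def merge_results(
    ranked_lists: List[List[str]],
    weights: Optional[List[float]] = None,
    k: int = 60,
) -> List[str]:
    """Merge ranked document-id lists with weighted reciprocal rank fusion."""
    weights = weights or [1.0] * len(ranked_lists)  # equal weight by default
    scores: Dict[str, float] = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Each sub-query's ranked list would come from an independent retrieval call; passing `weights` covers the case where intent suggests one sub-query matters more than the others.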
Caching. When the same (query, context) pair appears again, it should hit the cache. Production patterns: cache the rewritten query keyed on a hash of the user query and a few preceding turns; keep the cache lifetime modest (hours) since users' conversational patterns evolve. Even modest cache hit rates substantially reduce cost, because every hit skips an LLM call.
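The caching scheme can be sketched with an in-process dictionary. The class name `RewriteCache`, the three-hour default lifetime, the three-turn key window, and the lowercase/strip normalization of the query are illustrative assumptions; a production deployment would more likely use a shared store.

```python
import hashlib
import json
import time
from typing import Callable, Dict, List, Optional, Tuple


class RewriteCache:
    """In-memory TTL cache keyed on a hash of the query plus the last few turns."""

    def __init__(
        self,
        ttl_seconds: float = 3 * 3600,
        context_turns: int = 3,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self.context_turns = context_turns
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}

    def _key(self, query: str, history: List[Dict[str, str]]) -> str:
        # history[-0:] would return the whole list, so handle zero explicitly.
        turns = history[-self.context_turns:] if self.context_turns else []
        payload = json.dumps(
            {
                "q": query.strip().lower(),  # assumed normalization
                "h": [[t["role"], t["content"]] for t in turns],
            },
            ensure_ascii=False,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, query: str, history: List[Dict[str, str]]) -> Optional[str]:
        key = self._key(query, history)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, rewritten = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._store[key]  # expired
            return None
        return rewritten

    def put(self, query: str, history: List[Dict[str, str]], rewritten: str) -> None:
        self._store[self._key(query, history)] = (self._clock(), rewritten)
```

Including the preceding turns in the key matters: "What about for kids?" after a question about laptops must not reuse a rewrite produced after a question about running shoes.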
Failure handling. The rewriting LLM may fail (timeout, rate limit, vendor outage). Fallback: use the original query unchanged. The fallback path must be tested regularly; production teams routinely discover their fallback paths have decayed.
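One way to keep the fallback path exercised is to inject the rewriter as a parameter, so that tests can substitute deliberately failing versions. The sketch below assumes a hypothetical wrapper `rewrite_or_fallback`, a module-level thread pool for enforcing the timeout, and a plain dictionary standing in for a metrics counter; the three fault injectors at the end are test doubles, not real client code.

```python
import concurrent.futures
import time
from typing import Callable, Dict, List

History = List[Dict[str, str]]

_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)
fallback_counts: Dict[str, int] = {}  # stand-in for a real metrics counter


def rewrite_or_fallback(
    query: str,
    history: History,
    rewrite_fn: Callable[[str, History], str],
    timeout_s: float = 0.5,
) -> str:
    """Run rewrite_fn with a deadline; return the original query on any failure."""
    future = _POOL.submit(rewrite_fn, query, history)
    try:
        rewritten = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        reason = "timeout"
    except Exception:
        reason = "error"  # rate limit, vendor outage, malformed response, ...
    else:
        if isinstance(rewritten, str) and rewritten.strip():
            return rewritten.strip()
        reason = "empty"
    fallback_counts[reason] = fallback_counts.get(reason, 0) + 1
    return query


# Fault injectors for a regularly run fallback test.
def failing_rewriter(query: str, history: History) -> str:
    raise RuntimeError("simulated vendor outage")


def slow_rewriter(query: str, history: History) -> str:
    time.sleep(0.3)  # longer than the deadline used in the test
    return "arrived too late"


def blank_rewriter(query: str, history: History) -> str:
    return "   "
```

Counting fallbacks by reason gives an early signal when the rewriter is degrading silently, since every fallback still returns a usable (if weaker) query. Note that a timed-out call keeps running in its worker thread; the wrapper only stops waiting for it.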
Good fit: conversational search products where multi-turn queries are common. Workloads where user queries are short or context-dependent ('what about for kids?', 'how much?'). RAG systems where retrieval quality directly determines the quality of the generated answer.
Less good fit: single-shot queries where there's no conversation context to bring in. E-commerce search where queries are explicit and retrieving the exact terms matters. High-query-volume systems where the per-query LLM cost is prohibitive (consider caching aggressively or limiting rewriting to harder queries).
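Limiting rewriting to harder queries could be done with a cheap gate in front of the LLM call. The heuristic below is purely illustrative: the function name `needs_rewrite`, the word list, the follow-up prefixes, and the four-word threshold are all assumptions that would need tuning against real traffic.

```python
import re
from typing import Dict, List

# Words that usually point back at something said earlier (assumed list).
_CONTEXT_WORDS = {
    "it", "its", "they", "them", "their", "this", "that",
    "these", "those", "he", "she", "one", "ones", "there",
}
_FOLLOW_UP = re.compile(r"^(what about|how about|and |or |also )", re.IGNORECASE)


def needs_rewrite(query: str, history: List[Dict[str, str]], min_words: int = 4) -> bool:
    """Cheap gate: send only context-dependent queries to the LLM rewriter."""
    if not history:
        return False  # nothing to resolve against
    words = re.findall(r"[a-z']+", query.lower())
    if len(words) < min_words:
        return True  # very short queries rarely stand alone
    if _FOLLOW_UP.match(query.strip()):
        return True
    return any(word in _CONTEXT_WORDS for word in words)
```

The gate deliberately errs toward rewriting (for example, "that" also fires on relative clauses), on the assumption that an unnecessary rewrite costs less than a missed one.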
- Perplexity AI engineering blog posts on conversational query handling
- OpenAI blog posts on ChatGPT search retrieval architecture
- RAG literature: Lewis et al. (2020) original RAG paper; Gao et al. (2024) RAG survey
Code
# Query rewriting with conversation context (Python + Anthropic SDK)
import anthropic
from typing import List, Dict

client = anthropic.Anthropic()

QUERY_REWRITE_PROMPT = """You are a search query rewriter. Given a
conversation history and a new user message, produce a single
self-contained search query suitable for retrieval.

Rules:
- Produce ONE query, not multiple
- Resolve any pronouns or references from the conversation context
- Preserve the user's intent precisely
- Do not add interpretation beyond what's in the conversation
- Output ONLY the query, no explanation, no quotes

Conversation history (most recent last):
{history}

User's new message: {query}

Rewritten query:"""

def rewrite_query(query: str, history: List[Dict[str, str]]) -> str:
    """
    Rewrite a user query for retrieval, using conversation context.

    Args:
        query: The user's current query
        history: List of {role: 'user'|'assistant', content: str} dicts

    Returns:
        A retrieval-ready query string. Falls back to the original on
        failure.
    """
    if not history:
        return query  # No context, nothing to rewrite

    # Format recent history (last 5 turns)
    recent = history[-5:]
    formatted = "\n".join(
        f"{turn['role']}: {turn['content']}" for turn in recent
    )

    try:
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",  # cheap, fast
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": QUERY_REWRITE_PROMPT.format(
                    history=formatted, query=query
                ),
            }],
        )
        # Strip whitespace and any quotes the LLM added
        rewritten = response.content[0].text.strip().strip("\"'").strip()

        # Validate after cleanup: not empty, not too long
        if not rewritten or len(rewritten) > 500:
            return query
        return rewritten
    except Exception:
        # Any failure → fall back to original query
        return query

# Example usage:
history = [
    {"role": "user", "content": "What are the best running shoes for trail?"},
    {"role": "assistant", "content": "Top trail running shoes include the Salomon Sense Ride..."},
    {"role": "user", "content": "What about for kids?"},
]
# Without rewriting: "What about for kids?" → useless retrieval
# With rewriting: "best trail running shoes for kids" → strong retrieval