Source: Production operational methodology at LLM-augmented search products; vendor operational documentation (Anthropic, OpenAI, Cohere); literature 2024–2026

Classification — Patterns for running LLM-augmented search reliably at scale, including cost control, latency management, monitoring, and graceful failure.

Intent

Extend traditional search operational practice (Vol 6) to handle the new operational concerns LLM augmentation introduces: variable per-query cost, latency tails, drift, vendor dependencies.

Motivating Problem

LLM-augmented search has operational characteristics that traditional search does not. Per-query cost varies with prompt length and output length. Latency p99 can be 5–10x p50. The system depends on a vendor that may have outages, rate limits, or unexpected behavior changes. Cost can spike unexpectedly under load. The operational discipline must handle all of these.

How It Works

Cost monitoring and budgets. Track LLM cost per service, per user, per session, per query class. Set explicit budgets at multiple levels (per-day, per-user, per-feature) with alerting when approaching limits and hard cutoffs when exceeded. Production patterns: tag every LLM call with attribution metadata (service, user, query type); aggregate in real-time; expose dashboards by attribution dimension; alert on anomalies (sudden spend increase).
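A minimal sketch of the tagging-and-budget pattern, assuming each call's cost in USD has already been computed from token counts. The class name BudgetTracker, the tag names, and the 80% alert threshold are illustrative assumptions, not part of any vendor SDK.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class BudgetTracker:
    """Tracks spend per attribution tag against limits for one budget window (USD)."""
    limits: dict                   # e.g. {("feature", "synthesis"): 50.0}
    alert_fraction: float = 0.8    # assumed: warn at 80% of a limit
    spend: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, cost_usd: float, **tags: str) -> list:
        """Attribute one call's cost to every tag; return any alerts raised."""
        alerts = []
        for key in tags.items():   # key is a (dimension, value) pair
            self.spend[key] += cost_usd
            limit = self.limits.get(key)
            if limit is None:
                continue           # dimension is tracked but has no budget
            if self.spend[key] >= limit:
                alerts.append(("hard_cutoff", key))
            elif self.spend[key] >= self.alert_fraction * limit:
                alerts.append(("approaching", key))
        return alerts

    def allowed(self, **tags: str) -> bool:
        """False once any tagged dimension has reached its hard limit."""
        return all(
            self.spend[key] < self.limits[key]
            for key in tags.items()
            if key in self.limits
        )
```

A caller would check allowed(user=..., feature=...) before issuing an LLM call and call record(...) afterwards. A production version would also reset spend at each window boundary and persist it outside process memory.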

Tiered model selection. Match model class to query difficulty. Production patterns: classify queries by difficulty (using a cheap LLM call, a heuristic, or user and context signals); route easy queries to cheap models (Haiku-class) and hard queries to more expensive models (Sonnet- or Opus-class). Even simple tiering can reduce cost 5–10x with modest quality impact.
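The routing idea can be sketched with a heuristic classifier, together with the arithmetic behind the cost reduction. The tier names, hint words, and word-count threshold below are hypothetical placeholders; a real deployment would tune them against labeled queries.

```python
# Placeholder tier names; substitute the cheap and expensive models actually deployed.
MODEL_TIERS = {"easy": "small-model", "hard": "large-model"}

# Hypothetical signal words suggesting a query needs multi-step reasoning.
HARD_HINTS = {"compare", "versus", "why", "explain", "tradeoffs"}


def classify_difficulty(query: str, max_easy_words: int = 12) -> str:
    """Heuristic classifier: long queries or queries with hint words count as hard."""
    words = query.lower().split()
    if len(words) > max_easy_words or HARD_HINTS.intersection(words):
        return "hard"
    return "easy"


def route(query: str) -> str:
    """Return the model tier name for a query."""
    return MODEL_TIERS[classify_difficulty(query)]


def blended_cost_reduction(easy_fraction: float, price_ratio: float) -> float:
    """Cost reduction factor versus sending every query to the expensive model.

    price_ratio is the expensive model's per-query cost divided by the cheap model's.
    """
    blended = easy_fraction / price_ratio + (1.0 - easy_fraction)
    return 1.0 / blended
```

Under the assumption that 90% of queries are classified easy and the expensive model costs 20x as much per query, blended_cost_reduction(0.9, 20.0) comes to roughly 6.9, ignoring the cost of the classifier itself.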

Caching at multiple levels. Cache at three levels: LLM input caching (for prompts that share long prefixes); LLM output caching (for repeated query-and-context pairs); and embedding caching (for repeated embedding generation calls). Anthropic and OpenAI both offer prompt caching with substantial cost savings for prefix reuse. Production patterns: design prompts with stable prefixes (system prompt + few-shot examples) followed by variable user input; cache the prefix; cost savings are often 50–80% on cached calls.
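A sketch of two of these levels: a prompt laid out so the stable prefix comes first, and an application-side output cache keyed on the full input. The prefix text, the OutputCache class, and the one-hour TTL are assumptions for illustration. Vendor-side prefix caching is enabled through each provider's own API and is not shown here.

```python
import hashlib
import json
import time

# Stable content first, variable input last, so a shared prefix can be reused.
STABLE_PREFIX = (
    "You are a search assistant. Answer using only the provided results.\n\n"
    "Example query: wireless headphones\n"
    "Example answer: Three of the results match this query.\n\n"
)


def build_prompt(user_query: str) -> str:
    """Append the variable user input after the unchanging prefix."""
    return STABLE_PREFIX + "Query: " + user_query + "\nAnswer:"


class OutputCache:
    """In-memory output cache keyed on (model, query, context), with a TTL."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expiry_time, value)

    @staticmethod
    def _key(model: str, query: str, context: list) -> str:
        # Include the model name so a model change never serves stale output.
        payload = json.dumps([model, query, context])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, model, query, context, compute):
        """Return the cached value, or call compute() and store its result."""
        key = self._key(model, query, context)
        hit = self._entries.get(key)
        if hit is not None and hit[0] > self.clock():
            return hit[1]
        value = compute()
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
        return value
```

Keying on the retrieved context as well as the query means that a change in the underlying index naturally produces a cache miss.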

Latency tail management. p99 LLM latency can be much worse than p50. Production patterns: streaming responses (user sees first tokens quickly even if full generation is slow); per-call timeouts with fallback paths; circuit breakers that disable LLM stages temporarily under sustained latency pressure; capacity planning around p99, not p50.
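A sketch of a per-call timeout combined with a simple circuit breaker, using only the standard library. The thresholds, the cooldown, and the names CircuitBreaker and call_with_timeout are assumptions; streaming is vendor-specific and is not shown.

```python
import concurrent.futures as cf
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows calls again after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            # Simplification: close fully after the cooldown instead of half-opening.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


_pool = cf.ThreadPoolExecutor(max_workers=8)


def call_with_timeout(fn, timeout_seconds, fallback, breaker):
    """Run fn with a deadline; use fallback on timeout, error, or open breaker."""
    if not breaker.allow():
        return fallback()
    future = _pool.submit(fn)
    try:
        result = future.result(timeout=timeout_seconds)
    except Exception:  # includes the timeout raised by future.result
        # The worker thread keeps running after a timeout; only the wait is abandoned.
        breaker.record(False)
        return fallback()
    breaker.record(True)
    return result
```

Because a timed-out call still occupies a worker thread until it finishes, the pool size bounds how many slow calls can pile up, which is one reason to plan capacity around p99 rather than p50.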

Rate limit handling. Vendor APIs have rate limits (RPM, TPM). Production patterns: track current rate against limits; queue requests when approaching limit; implement backoff and retry with jitter; have alternative providers configured for failover. Production teams routinely discover their primary vendor at full capacity during traffic spikes; multi-vendor configuration provides resilience.
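A sketch of retry with exponential backoff and full jitter, followed by failover to a second provider. RateLimitError is a hypothetical stand-in for whatever exception a vendor client raises on a rate-limit response, and primary and secondary are assumed to be zero-argument callables wrapping the real API calls. Request queueing is not shown.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a vendor client's rate-limit error."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng=random.random) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(primary, secondary=None, max_retries: int = 3, sleep=time.sleep):
    """Try primary up to max_retries + 1 times, then fail over to secondary."""
    for attempt in range(max_retries + 1):
        try:
            return primary()
        except RateLimitError:
            if attempt < max_retries:
                sleep(backoff_delay(attempt))
    if secondary is not None:
        return secondary()
    raise RateLimitError("primary exhausted retries and no secondary is configured")
```

Jitter matters because many clients that hit the limit at the same moment would otherwise retry in lockstep and hit it again together.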

Drift detection. Vendor model updates can change behavior. Production patterns: pin to specific model versions (not 'latest'); run a regression suite on every deployment; monitor key quality metrics over time (sudden changes indicate drift); maintain canary deployments that test new model versions on small fractions of traffic before full migration.
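A sketch of two of these pieces: a simple statistical check for a sudden shift in a quality metric, and deterministic canary assignment. The pinned model string is a made-up example, and the z-score threshold of 3 is an assumed default rather than a recommended value.

```python
import hashlib
from statistics import mean, stdev

# Hypothetical pinned identifier; the point is to avoid floating aliases like "latest".
PINNED_MODEL = "vendor-model-2025-01-01"


def drift_detected(baseline: list, recent: list, z_threshold: float = 3.0) -> bool:
    """Flag drift when the recent mean sits far outside the baseline's variation.

    baseline needs at least two scores; recent needs at least one.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(recent) != mu
    standard_error = sigma / len(recent) ** 0.5
    return abs(mean(recent) - mu) / standard_error > z_threshold


def in_canary(user_id: str, fraction: float) -> bool:
    """Deterministically place a stable fraction of users in the canary group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < fraction * 10_000
```

Hashing the user ID rather than sampling randomly keeps each user on the same model version for the duration of the canary, which makes quality comparisons cleaner.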

Fallback patterns. When LLM stages fail, degrade gracefully. Production patterns: synthesis fallback (show retrieval results as a list); reranker fallback (use post-fusion ranking unchanged); query rewriting fallback (use the raw query); test these fallback paths regularly to ensure they still work. Production teams routinely discover decayed fallback paths only when they're needed.
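The three fallbacks can be sketched as wrappers around each LLM stage. The stage signatures and the search function below are assumptions about pipeline shape, not a prescribed interface.

```python
def with_fallback(stage, fallback, on_failure=None):
    """Wrap an LLM stage so any exception routes to a non-LLM fallback."""
    def run(*args):
        try:
            return stage(*args)
        except Exception as exc:
            if on_failure is not None:
                on_failure(exc)  # e.g. increment a metric so degradation is visible
            return fallback(*args)
    return run


def raw_query(query):
    """Query rewriting fallback: use the query as typed."""
    return query


def unchanged_ranking(query, docs):
    """Reranker fallback: keep the post-fusion order."""
    return docs


def result_list(query, docs):
    """Synthesis fallback: show retrieval results as a plain list."""
    return {"type": "list", "results": docs}


def search(query, retrieve, rewrite, rerank, synthesize, on_failure=None):
    """Run the pipeline; every LLM stage degrades independently."""
    rewritten = with_fallback(rewrite, raw_query, on_failure)(query)
    docs = retrieve(rewritten)
    ranked = with_fallback(rerank, unchanged_ranking, on_failure)(rewritten, docs)
    return with_fallback(synthesize, result_list, on_failure)(rewritten, ranked)
```

Passing deliberately failing stages into search in a scheduled test is one way to exercise the fallback paths regularly instead of discovering decay during an outage.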

Vendor risk management. LLM-augmented systems depend on a small number of vendor APIs. Production patterns: multi-vendor configuration (failover to alternative provider); contractual SLAs for high-availability deployments; cost forecasting that accounts for vendor price changes; periodic re-evaluation of vendor choices as the market evolves.
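Cost forecasting under a possible price change comes down to simple arithmetic, sketched below. All token counts, prices, and multipliers are made-up inputs for illustration, and prices are assumed to be quoted per million tokens.

```python
def monthly_cost_usd(queries_per_month, input_tokens, output_tokens,
                     input_price_per_mtok, output_price_per_mtok,
                     price_multiplier=1.0):
    """Monthly spend given average tokens per query and per-million-token prices."""
    per_query = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return queries_per_month * per_query * price_multiplier


def forecast_scenarios(base_kwargs, multipliers=(1.0, 1.25, 1.5)):
    """Monthly cost under each assumed vendor price multiplier."""
    return {m: monthly_cost_usd(price_multiplier=m, **base_kwargs) for m in multipliers}
```

For example, 1,000,000 queries per month at 2,000 input and 300 output tokens each, priced at an assumed $1 and $5 per million tokens, comes to about $3,500 per month, and a 1.5x price increase would raise that to about $5,250.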

Incident response for LLM-specific failures. These incidents differ from traditional ones and need their own runbooks. Patterns: runbook for hallucination spikes (often indicates prompt or context degradation); runbook for cost anomalies (usually a prompt change or unexpected user behavior); runbook for vendor outages (failover steps, user-facing messaging); post-incident reviews that examine LLM-specific contributing factors (prompt changes, model version changes, context changes).

When to Use It

Any production LLM-augmented search at non-trivial scale. The operational discipline is roughly proportional to scale; small experiments need light operations, production systems serving millions of queries need full operational practice.

Less good fit: internal-only systems where reliability requirements are loose, and prototype systems where the operational investment is not yet justified. The discipline scales with the system's production importance.

Sources

Anthropic documentation on prompt caching and operational patterns
OpenAI documentation on rate limits and operational patterns
LangSmith and LangFuse documentation on LLM observability
Volume 6 of this series for the foundation operational discipline

Operational patterns for production LLM-augmented search