Source: Production methodology for indexing pipeline operations; SRE observability patterns; Volume 3 indexing patterns
Classification — Operational pattern for monitoring the indexing pipeline's health and catching issues before they affect search quality.
Maintain operational visibility into the indexing pipeline — throughput, latency, freshness, completeness, error rates — so that indexing issues are caught and fixed before they degrade search quality for users.
The indexing pipeline runs continuously, often in the background, often with limited direct visibility to the search team. When it fails or degrades, the search system's quality drops without obvious cause: new content doesn't appear; updated content shows stale fields; specific document types fail to index; the index falls behind real-time content updates. Without monitoring, these issues are caught only when users notice and complain. With monitoring, the team catches them proactively.
Throughput monitoring. Track documents indexed per minute (or per hour, for slower pipelines). The metric should be steady; sudden drops suggest pipeline issues; sudden spikes suggest backlog catch-up (which may indicate prior failure). Alert when throughput drops below baseline or when the pipeline is processing zero documents for an extended period. The baseline depends on the workload — e-commerce systems may index millions of documents per day; enterprise systems may index thousands.
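The two alert conditions above (a drop below baseline and an extended stall at zero) can be sketched as a small check over per-minute counts. This is a minimal illustration, not a definitive implementation: the function name, the `drop_ratio` of 0.5, and the five-minute stall window are assumptions chosen for the example and should be tuned to the workload.

```python
from typing import List, Sequence


def throughput_alerts(per_minute: Sequence[int], baseline: float,
                      drop_ratio: float = 0.5, zero_minutes: int = 5) -> List[str]:
    """Return alert names for the most recent window of per-minute document counts.

    `drop_ratio` and `zero_minutes` are illustrative defaults, not recommendations.
    """
    if not per_minute:
        return ["no_data"]
    recent = per_minute[-zero_minutes:]
    # Stalled: a full window of samples in which nothing was indexed.
    if len(recent) == zero_minutes and all(count == 0 for count in recent):
        return ["stalled"]
    # Degraded: the recent average fell below a fraction of the baseline.
    if sum(recent) / len(recent) < baseline * drop_ratio:
        return ["below_baseline"]
    return []
```

A stall is reported on its own rather than together with the below-baseline alert, since a pipeline at zero is trivially below baseline and one page is enough.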
Freshness monitoring. Measure the lag between a content change at the source and the appearance of that change in the index. For real-time indexing pipelines, the lag should be seconds; for batch pipelines, it should stay within the batch window. Alert when the lag exceeds the threshold. The metric is sensitive: in a real-time pipeline, a document updated 30 minutes ago whose update has not yet reached the index already counts as a problem. Production patterns: emit a freshness signal per document (timestamp when indexed minus timestamp of source change); aggregate to p50, p95, p99 freshness; alert on p95 exceeding threshold.
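The per-document freshness signal and its percentile aggregation might look like the sketch below. It assumes each sample is a hypothetical `(changed_at, indexed_at)` pair of timestamps and uses the nearest-rank percentile definition; other percentile definitions would give slightly different values on small samples.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


def nearest_rank(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty, pre-sorted list (pct is an integer 1..100)."""
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceiling of pct/100 * n
    return sorted_values[max(rank, 1) - 1]


def freshness_summary(events: Iterable[Tuple[datetime, datetime]]) -> Dict[str, float]:
    """Aggregate (changed_at, indexed_at) pairs into p50/p95/p99 lag in seconds."""
    lags = sorted((indexed_at - changed_at).total_seconds()
                  for changed_at, indexed_at in events)
    if not lags:
        raise ValueError("no freshness samples in this window")
    return {"p50": nearest_rank(lags, 50),
            "p95": nearest_rank(lags, 95),
            "p99": nearest_rank(lags, 99)}


def freshness_breached(summary: Dict[str, float], threshold: timedelta) -> bool:
    """Alert condition from the text: p95 lag exceeds the threshold."""
    return summary["p95"] > threshold.total_seconds()
```

An empty window raises rather than returning zeros, on the assumption that "no samples" should be handled as a throughput problem and not reported as perfect freshness.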
Completeness monitoring. The index should contain the expected documents. Periodic checks: compare index document count to source-of-truth document count (catalog database, content management system, etc.); flag substantial discrepancies; investigate root causes. Patterns: sample documents from source and verify each is in the index with current content; spot-check critical documents (high-traffic products, important pages) for presence and currency. Without completeness monitoring, the team finds out about missing documents from user complaints.
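A sampled completeness check can be sketched as a comparison of two mappings. The shape is an assumption made for illustration: `source` and `index` each map a document ID to a content version (for example an updated-at value or a content hash), fetched by whatever means the source of truth and the search engine provide. The 1% missing-ratio threshold is likewise a placeholder.

```python
from typing import Any, Dict, Hashable


def completeness_report(source: Dict[Hashable, Any], index: Dict[Hashable, Any],
                        max_missing_ratio: float = 0.01) -> Dict[str, Any]:
    """Compare a sample of source documents against what the index holds.

    Both arguments map doc_id -> content version. The threshold is illustrative.
    """
    missing = sorted(set(source) - set(index))    # in source, absent from index
    orphaned = sorted(set(index) - set(source))   # in index, gone from source
    stale = sorted(doc_id for doc_id in source
                   if doc_id in index and index[doc_id] != source[doc_id])
    missing_ratio = len(missing) / len(source) if source else 0.0
    return {
        "missing": missing,
        "orphaned": orphaned,
        "stale": stale,
        "missing_ratio": missing_ratio,
        "alert": missing_ratio > max_missing_ratio,
    }
```

Comparing IDs and versions, and not only total counts, catches the case where missing and orphaned documents cancel each other out in the count.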
Error rate monitoring. Indexing pipelines have failure modes: malformed source data; missing required fields; LLM API errors during enrichment; embedding API rate limits; storage failures. Track error rates per stage; alert when rates exceed baselines. Investigate spikes promptly, and do not dismiss a low but steady stream of errors: it often indicates a systematic problem that is silently affecting many documents.
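Per-stage error tracking can be sketched with two counters per stage. The stage names, baselines, and the `tolerance` multiplier here are hypothetical; in practice these counts would usually come from the metrics system already in use rather than an in-process class.

```python
from collections import Counter
from typing import Dict, List


class StageErrorTracker:
    """Counts successes and failures per pipeline stage and flags baseline breaches."""

    def __init__(self, baselines: Dict[str, float], tolerance: float = 2.0):
        self.baselines = baselines      # expected error rate per stage (assumed known)
        self.tolerance = tolerance      # alert when rate > baseline * tolerance
        self.total: Counter = Counter()
        self.failed: Counter = Counter()

    def record(self, stage: str, ok: bool) -> None:
        self.total[stage] += 1
        if not ok:
            self.failed[stage] += 1

    def rate(self, stage: str) -> float:
        seen = self.total[stage]
        return self.failed[stage] / seen if seen else 0.0

    def breaches(self) -> List[str]:
        """Stages whose observed error rate exceeds the tolerated multiple of baseline."""
        return sorted(stage for stage, base in self.baselines.items()
                      if self.rate(stage) > base * self.tolerance)
```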
Per-field monitoring. Beyond document-level metrics, monitor field-level health. Field completion rates: what fraction of documents have non-null values for each important field? A drop in completion rate for an enriched field signals upstream issues with the enrichment process. Distribution checks: do field values follow expected distributions? Sudden changes in category distributions or price distributions signal data quality issues.
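Both field-level checks can be sketched over a sample of documents represented as plain dictionaries. The distribution check below uses total variation distance between a baseline sample and a current sample of a categorical field; that choice of distance is an assumption of this example, and other measures would serve equally well.

```python
from collections import Counter
from typing import Any, Dict, Hashable, Iterable, List, Sequence


def completion_rates(docs: Sequence[Dict[str, Any]],
                     fields: Iterable[str]) -> Dict[str, float]:
    """Fraction of documents with a non-null value for each field."""
    total = len(docs)
    return {field: (sum(1 for doc in docs if doc.get(field) is not None) / total
                    if total else 0.0)
            for field in fields}


def category_shift(baseline: List[Hashable], current: List[Hashable]) -> float:
    """Total variation distance between two categorical samples.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    base_counts, cur_counts = Counter(baseline), Counter(current)
    n_base, n_cur = len(baseline), len(current)
    categories = set(base_counts) | set(cur_counts)
    return 0.5 * sum(abs(base_counts[c] / n_base - cur_counts[c] / n_cur)
                     for c in categories)
```

A completion rate can be compared against its own history, and a shift value against a threshold chosen from normal day-to-day variation.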
Vector index health. Vector fields have their own monitoring concerns. Track the embedding rate (documents embedded per unit time), the embedding failure rate, the vector index size, and vector index recall (a sample-based check that nearest-neighbor queries return expected results). Production patterns: maintain a small test set of queries with known nearest neighbors; periodically run them and verify the expected documents are returned; alert if recall drops.
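The recall check against a small test set can be sketched as follows. The `search` callable is a stand-in for whatever nearest-neighbor query the vector store exposes; its signature (query and `k` in, list of document IDs out) is an assumption of this example, as are the names `golden` and `recall_at_k`.

```python
from typing import Callable, Dict, List


def recall_at_k(golden: Dict[str, List[str]],
                search: Callable[[str, int], List[str]],
                k: int = 10) -> float:
    """Share of expected neighbors that appear in the top-k results.

    `golden` maps each test query to the document IDs known to be its nearest
    neighbors. `search(query, k)` is a caller-supplied wrapper around the index.
    """
    found = 0
    expected_total = 0
    for query, expected_ids in golden.items():
        returned = set(search(query, k))
        found += len(returned & set(expected_ids))
        expected_total += len(expected_ids)
    return found / expected_total if expected_total else 0.0
```

Running this on a schedule and alerting when the value falls below its usual level gives the recall alert described above without needing labeled production traffic.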
Reindex monitoring. When blue/green reindexing (Volume 3 Section F) is in progress, monitor it specifically. Track progress (what percentage of documents have been reindexed), ETA (when the reindex will complete), errors during reindex (failures that need investigation before the alias swap), and validation checks (whether the new index passes spot-check tests). Production patterns: dashboard the reindex progress for visibility; alert on stalls or excessive error rates; gate the alias swap on validation passing.
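Progress, ETA, and the swap gate can be sketched from a handful of counters. The `ReindexStatus` fields and the 0.1% error-rate ceiling are hypothetical; the ETA assumes the processing rate so far continues, which is a rough estimate at best.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReindexStatus:
    total_docs: int
    done_docs: int          # successfully written to the new index
    failed_docs: int        # attempted but failed
    elapsed_s: float
    validation_passed: bool  # result of spot-check tests on the new index


def progress_pct(status: ReindexStatus) -> float:
    return 100.0 * status.done_docs / status.total_docs if status.total_docs else 0.0


def eta_seconds(status: ReindexStatus) -> Optional[float]:
    """Remaining time assuming the observed rate holds; None if no rate is known yet."""
    processed = status.done_docs + status.failed_docs
    if processed == 0 or status.elapsed_s <= 0:
        return None
    rate = processed / status.elapsed_s
    return (status.total_docs - processed) / rate


def can_swap_alias(status: ReindexStatus, max_error_rate: float = 0.001) -> bool:
    """Gate the swap on completion, a tolerable error rate, and passing validation."""
    if status.total_docs == 0:
        return False
    finished = status.done_docs + status.failed_docs >= status.total_docs
    error_rate = status.failed_docs / status.total_docs
    return finished and error_rate <= max_error_rate and status.validation_passed
```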
Integration with search-side monitoring. Index health alerts are operationally distinct from search-quality alerts, but the two are related: a regression in search quality may trace back to an index issue (Section E covers the diagnostic methodology). Production patterns: cross-reference index health metrics with search quality metrics in the same dashboards; on-call rotations cover both areas; postmortems trace search quality issues back to indexing causes where appropriate.
Applies to every production search system with active indexing. The investment is modest (the monitoring extends the search engine's built-in observability), and the return is protection against silent search-quality degradation.
Alternatives — manual periodic checks for small, slow-changing indices. Some systems can operate without dedicated indexing monitoring at very small scale; most cannot.
- Elasticsearch / OpenSearch / Solr operational documentation
- Production SRE observability methodology
- Volume 3 of this series for the indexing patterns being monitored