Source: Multiple academic, practitioner, and tool sources
Classification — Sources for staying current on indexing and document engineering practice.
Provide pointers to the active sources of indexing knowledge across IR, NLP, ML, RAG, and production practice.
Indexing intersects multiple disciplines: classical IR for index structures; NLP for analyzer chains; ML for embeddings; RAG for chunking strategies; multi-modal research for cross-modal indexing. Production teams need to engage with each.
Foundational texts. Manning, Raghavan, Schütze, Introduction to Information Retrieval (free online) — chapters 2 (text processing), 4 (index construction), 5 (index compression). Grainger, AI-Powered Search (Manning, 2024) — strong production-focused chapters on indexing for modern search including vector and hybrid patterns. Turnbull and Berryman, Relevant Search (Manning, 2016) — practical lexical-search indexing in depth.
Academic conferences. SIGIR for IR-side methods including index structures and analysis. ACL/EMNLP for NLP methods including NER and classification used in enrichment. NeurIPS/ICML/ICLR for embedding methods and multi-modal models. CIKM and WSDM for IR systems work including indexing infrastructure.
Industry venues. Haystack Conference covers indexing alongside other search topics. Berlin Buzzwords for IR and data engineering. RAG-focused conferences emerging through 2024–2026 (the RAG Summit, NeurIPS RAG workshops).
Practitioner writing. Daniel Tunkelang on indexing strategy. OpenSource Connections on lexical and hybrid indexing. Search team blogs at Etsy, Wayfair, Spotify, and GitHub periodically publish substantial indexing case studies. The agentic AI series' Volume 10 (RAG) covers RAG-specific indexing patterns from a different angle.
Tools and libraries. Apache Lucene (foundational; underlies Elasticsearch/OpenSearch/Solr). Vector databases: Pinecone, Weaviate, Qdrant, Chroma, Milvus. Embedding pipelines: LangChain text splitters, LlamaIndex chunking utilities, sentence-transformers for self-hosted embedding. Multi-modal: OpenCLIP, SigLIP, transformers (Hugging Face) for vision models. RAG frameworks: LangChain, LlamaIndex, Haystack (deepset).
Embedding model registries and benchmarks. Hugging Face Model Hub for pretrained models including embeddings. MTEB leaderboard for embedding model comparison. BEIR benchmark for retrieval-specific evaluation. The leaderboards inform model selection; production evaluation on the actual workload remains essential.
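As an illustration of what evaluation on the actual workload can look like, here is a minimal sketch that scores two candidate models on one judged query using recall@k and nDCG@k (BEIR reports nDCG@10 as its headline metric). The document IDs, relevance grades, and the two rankings are hypothetical placeholders; a real evaluation would average over many judged queries drawn from production traffic.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k):
    """nDCG@k with graded gains; documents missing from `gains` count as 0."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical relevance judgments for one query from the team's own workload.
gains = {"d1": 3, "d4": 1}

# Hypothetical top results returned by two candidate embedding models.
model_a = ["d1", "d2", "d4", "d3"]
model_b = ["d2", "d3", "d1", "d4"]

for name, ranking in [("model_a", model_a), ("model_b", model_b)]:
    r = recall_at_k(ranking, set(gains), k=3)
    n = ndcg_at_k(ranking, gains, k=3)
    print(f"{name}: recall@3={r:.2f} ndcg@3={n:.3f}")
```

A leaderboard rank and a workload-specific score like this can disagree, which is the reason to run both before committing to a model.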
Communities. Hugging Face forums for embedding model discussion. LangChain and LlamaIndex Discord/Slack for RAG indexing patterns. Relevancy Engineering Slack for search-specific indexing discussion. Vector database vendor communities (Pinecone, Weaviate, Qdrant Discord servers).
Emerging areas. Long-context embedding models (handling longer documents per embedding). Multi-vector models (multiple vectors per document for finer-grained matching, in the style of ColBERT). Multi-modal embedding extending beyond image-text to audio, video, and structured data. Sparse embeddings (SPLADE family) blurring the lexical/semantic boundary. The frontier is active; tracking proceedings and major model releases catches most developments.
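To make the multi-vector idea concrete, here is a minimal sketch of ColBERT-style late-interaction scoring (often called MaxSim): each query vector is matched to its best document vector, and those maxima are summed. The tiny two-dimensional vectors are hypothetical stand-ins for per-token embeddings; real systems use learned, normalized vectors and an index that avoids scoring every document exhaustively.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum over query vectors of the best dot product against any document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Hypothetical per-token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_multi = [[1.0, 0.0], [0.0, 1.0]]   # one vector per token
doc_single = [[0.5, 0.5]]              # a single pooled vector

print(maxsim_score(query, doc_multi))   # each query vector finds an exact match
print(maxsim_score(query, doc_single))  # the pooled vector matches both only partially
```

The indexing consequence is that a document contributes many vectors rather than one, which raises storage and index-size requirements in exchange for finer-grained matching.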
Search engineers building or maintaining indexing pipelines. Engineers transitioning into search from data engineering or ML engineering. Continuous education as the field evolves. Reference for specific indexing decisions where current knowledge needs supplementing.
Alternatives — specialized consulting for high-stakes engagements. Internal documentation for teams with mature practice. The combination of external tracking and internal knowledge is the working pattern.
- Manning et al., Introduction to Information Retrieval (free online)
- Grainger, AI-Powered Search (2024)
- Apache Lucene documentation
- SIGIR, Haystack, Berlin Buzzwords proceedings
- LangChain, LlamaIndex, Haystack RAG framework documentation
- MTEB and BEIR benchmarks
- Relevancy Engineering Slack