Source: Multiple academic, practitioner, and vendor sources

Classification — Sources for staying current on search evaluation practice.

Intent

Provide pointers to the active sources of search evaluation knowledge: foundational texts, academic and industry conferences, practitioner blogs, tools, communities.

Motivating Problem

Search evaluation practice spans academic literature, industry methodology, vendor tooling, and emerging techniques (LLM-as-judge, advanced click models). Production teams need ongoing engagement with multiple sources to keep their evaluation discipline current.

How It Works

Foundational texts. Manning, Raghavan, Schütze, Introduction to Information Retrieval (free online at nlp.stanford.edu/IR-book): ch. 8 on evaluation remains the canonical academic reference. Grainger, AI-Powered Search (Manning, 2024): extensive coverage of modern evaluation methodology. Turnbull and Berryman, Relevant Search (Manning, 2016): practical relevance engineering with evaluation throughout. Chuklin, Markov, de Rijke, Click Models for Web Search (2015): the canonical reference on click modeling. Kohavi, Tang, Xu, Trustworthy Online Controlled Experiments (Cambridge, 2020): general A/B testing methodology that applies directly to online search evaluation.

Academic conferences. SIGIR (ACM Special Interest Group on Information Retrieval) is the premier venue for evaluation research. CIKM (Conference on Information and Knowledge Management) covers evaluation methodology adjacent to broader IR. WSDM (Web Search and Data Mining) focuses on web-scale problems including evaluation. ECIR (European Conference on Information Retrieval) is the European counterpart. TREC (Text Retrieval Conference) is both a venue and a long-running evaluation infrastructure.

Industry conferences. Haystack (haystackconf.com) is the premier practitioner conference for relevance engineering, organized by OpenSource Connections; evaluation is heavily represented. The AI-Powered Search Conference is associated with Grainger's book. Berlin Buzzwords covers search and data engineering. Data + AI Summit (formerly Spark + AI Summit), MLOps World, and similar venues cover evaluation infrastructure alongside broader ML evaluation.

Practitioner writing. Daniel Tunkelang (dtunkelang.medium.com) on search evaluation strategy. OpenSource Connections (opensourceconnections.com) on practical relevance engineering. Doug Turnbull's writing across multiple venues. Trey Grainger's ongoing writing on AI-powered search. Search team blogs at major companies (Etsy, Wayfair, Spotify, GitHub) periodically publish detailed evaluation methodology.

Tools. Quepid (quepid.com, open source) for judgment set management and offline evaluation. Splainer (also from OpenSource Connections) for query explanation and debugging. trec_eval (github.com/usnistgov/trec_eval) for standard metric computation. pytrec_eval for Python bindings to trec_eval. ranx (github.com/AmenRa/ranx), a modern Python evaluation library. BEIR benchmark suite (github.com/beir-cellar/beir) for retrieval evaluation. MS MARCO dataset for training and evaluation.
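
To make "standard metric computation" concrete, the following is a minimal standard-library sketch of nDCG@k over a judgment set and a ranked run. It is not the implementation used by trec_eval, pytrec_eval, or ranx; the dictionary layout, the function names (dcg, ndcg_at_k), the linear gain, and the log2 rank discount are all assumptions chosen for illustration. For real evaluation work, prefer one of the tools listed above.

```python
import math


def dcg(gains):
    """Discounted cumulative gain, assuming linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(qrels, run, k=10):
    """Mean nDCG@k over every query that has judgments.

    qrels: {query_id: {doc_id: graded relevance}}  (hypothetical layout)
    run:   {query_id: {doc_id: retrieval score}}   (hypothetical layout)
    """
    per_query = []
    for qid, judged in qrels.items():
        # Rank the system's documents by descending score.
        ranked = sorted(run.get(qid, {}).items(), key=lambda kv: kv[1], reverse=True)
        # Unjudged documents are treated as non-relevant (gain 0).
        gains = [judged.get(doc_id, 0) for doc_id, _ in ranked[:k]]
        # The ideal ranking orders the judged documents by relevance.
        ideal_dcg = dcg(sorted(judged.values(), reverse=True)[:k])
        per_query.append(dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy data: one query, three judged documents, one imperfect ranking.
qrels = {"q1": {"d1": 3, "d2": 0, "d3": 2}}
run = {"q1": {"d1": 0.9, "d2": 0.8, "d3": 0.1}}
score = ndcg_at_k(qrels, run, k=3)
```

In the toy data the system ranks a non-relevant document above a relevant one, so the score falls below 1.0. Swapping the scores of d2 and d3 would produce the ideal ordering.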

Communities. Relevancy Engineering Slack (via OpenSource Connections invitation) is the primary practitioner community. Reddit r/searchengines, r/elasticsearch for casual discussion. LinkedIn groups around search engineering for professional networking. Conference attendance (Haystack especially) produces network effects that async channels can't replicate.

Emerging areas to watch. LLM-as-judge methodology continues to mature; Microsoft and other research groups are publishing on it with increasing frequency. Counterfactual evaluation and unbiased learning-to-rank continue to advance. Evaluation for RAG and agentic systems is an emerging area where search evaluation methodology informs broader practice. Multimodal search evaluation (image, video, audio in addition to text) is a frontier with limited consolidated methodology.

When to Use It

Search engineering teams building or maintaining evaluation infrastructure. Engineers transitioning into relevance engineering from adjacent fields. Anyone keeping up with the discipline as it evolves. Teams whose specific evaluation needs go beyond what their existing knowledge covers.

Alternatives — specialized consulting (RelevantSearch.AI, OpenSource Connections, others) for high-stakes evaluation engagements. Internal documentation for teams with mature practice. The combination of external tracking and internal knowledge is the working pattern.

Sources

Manning et al., Introduction to Information Retrieval (free online)
Trey Grainger, AI-Powered Search (Manning, 2024)
Doug Turnbull and John Berryman, Relevant Search (Manning, 2016)
Chuklin, Markov, de Rijke, Click Models for Web Search (2015)
Kohavi, Tang, Xu, Trustworthy Online Controlled Experiments (2020)
Haystack Conference (haystackconf.com); SIGIR proceedings
Quepid (quepid.com); ranx (github.com/AmenRa/ranx); trec_eval
BEIR (github.com/beir-cellar/beir); MS MARCO; MTEB

Resources for tracking search evaluation discipline