Source: Joachims et al. on click-based relevance learning (KDD 2002 and successors); production practice at major web search and e-commerce companies

Classification — Using production click data, dwell time, and conversion signals as judgment proxy.

Intent

Extract relevance signal from production user behavior at scale, accepting that the signal is biased and requires modeling to interpret correctly, in exchange for judgment volume that explicit labeling can't produce.

Motivating Problem

Production search produces enormous volumes of user behavior data: queries, clicked results, dwell times on landing pages, conversions, abandoned searches. This data could in principle replace explicit judgment lists — it's real user signal at scale. The challenge: the data is heavily biased. Users click what was shown to them in the order it was shown; click data reinforces the current system's biases rather than measuring an external truth. Section E covers click models and counterfactual evaluation — the methodology for extracting unbiased relevance signal. This entry covers the patterns for collecting and using the raw signal.

How It Works

Signals to log. Beyond clicks, production search should log: query text, time of query, user identifier (if available), what was shown (results, their positions, presentation features like rich snippets), what was clicked (which results, in what order), dwell time on each click (how long before the user returned to the results page), and subsequent actions (the next query if reformulating; abandonment if leaving; conversion if purchasing or completing the task). The richer the logging, the more analyses become possible.
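A minimal sketch of one logged impression, covering the signals listed above. The class and field names (SearchImpression, ShownResult, Click, and so on) are illustrative assumptions, not a schema from any particular system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ShownResult:
    doc_id: str
    position: int                # 1-based rank on the results page
    rich_snippet: bool = False   # presentation feature that can attract clicks


@dataclass
class Click:
    doc_id: str
    position: int
    click_order: int                        # 1 = first click in this impression
    dwell_seconds: Optional[float] = None   # None if the user never returned


@dataclass
class SearchImpression:
    query: str
    timestamp: float                  # Unix seconds
    user_id: Optional[str]            # None when no identifier is available
    shown: List[ShownResult] = field(default_factory=list)
    clicks: List[Click] = field(default_factory=list)
    next_query: Optional[str] = None  # set when the user reformulates
    abandoned: bool = False
    converted: bool = False


# Hypothetical example record: one query, three results shown, one click.
impression = SearchImpression(
    query="running shoes",
    timestamp=1_700_000_000.0,
    user_id=None,
    shown=[
        ShownResult("doc-a", 1),
        ShownResult("doc-b", 2, rich_snippet=True),
        ShownResult("doc-c", 3),
    ],
    clicks=[Click("doc-b", 2, click_order=1, dwell_seconds=48.0)],
)
```

Recording what was shown alongside what was clicked is the important design choice here: without the shown list and positions, later bias analysis has nothing to condition on.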

Click as relevance proxy. The simplest interpretation: clicked results are more relevant than non-clicked results. This works partially — users do tend to click what they find relevant — but with heavy bias: users click position 1 more than position 10 regardless of relevance; users click results with rich snippets more than plain ones; users click results from known brands more than unknown ones. Naive click-as-relevance produces self-reinforcing failure modes.
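One way the Joachims papers listed under Sources soften position bias is to read a click as a relative preference rather than an absolute label: a clicked result is preferred over unclicked results ranked above it, since the user presumably saw and skipped those. The sketch below illustrates that heuristic; the function name and input shapes are assumptions made for this example.

```python
def skip_above_preferences(shown, clicked):
    """Return (preferred, over) pairs from one impression.

    shown:   doc ids in the order they were ranked
    clicked: set of doc ids the user clicked
    A clicked doc is preferred over every unclicked doc ranked above it.
    """
    prefs = []
    for i, doc in enumerate(shown):
        if doc in clicked:
            for above in shown[:i]:
                if above not in clicked:
                    prefs.append((doc, above))
    return prefs


# Hypothetical impression: four results shown, only the third clicked.
pairs = skip_above_preferences(["a", "b", "c", "d"], {"c"})
```

Note that nothing is inferred about "d": it sits below the click, so there is no evidence the user examined it.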

Dwell time. Users who click a result and stay on the landing page typically found something useful; users who click and immediately return (bounce) often didn't. Dwell time partially compensates for the click signal's noise: a click followed by long dwell is stronger evidence of relevance than a click followed by quick return. The pattern is sometimes called "satisfied click" vs. "unsatisfied click."
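A rough sketch of the satisfied/unsatisfied split. The 30-second threshold is an assumed value chosen for illustration; a suitable cutoff depends on the product and content type and should be calibrated against explicit judgments.

```python
from typing import Optional

# Assumed cutoff for illustration only; calibrate per product.
SATISFIED_DWELL_SECONDS = 30.0


def classify_click(dwell_seconds: Optional[float],
                   threshold: float = SATISFIED_DWELL_SECONDS) -> str:
    """Label a click by how long the user stayed before returning."""
    if dwell_seconds is None:
        # The user never came back to the results page. That could mean
        # success or a closed tab, so it is left unlabeled here.
        return "unknown"
    return "satisfied" if dwell_seconds >= threshold else "unsatisfied"


labels = [classify_click(d) for d in (45.0, 3.0, None)]
```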

Conversion as ground truth. For e-commerce, the conversion (purchase) is the strongest relevance signal available from behavior: a user who converted very likely found what they wanted. For enterprise search, the analogous signal might be "task completed without further searches" or "result shared / forwarded." Conversion-based judgments are sparse (only a fraction of searches result in conversion) but strong; they're typically used alongside denser click-based judgments.
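One possible way to use sparse conversions alongside denser clicks is a single graded label per (query, result) observation. The grade values and the function name below are assumptions for illustration, not a standard scale.

```python
def graded_label(clicked: bool, satisfied: bool, converted: bool) -> int:
    """Map behavior on one shown result to an illustrative relevance grade.

    3 = converted (sparse, strongest evidence)
    2 = clicked with long dwell ("satisfied click")
    1 = clicked, short dwell or unknown
    0 = shown but not clicked
    """
    if converted:
        return 3
    if clicked and satisfied:
        return 2
    if clicked:
        return 1
    return 0


grades = [
    graded_label(clicked=True, satisfied=True, converted=True),
    graded_label(clicked=True, satisfied=True, converted=False),
    graded_label(clicked=True, satisfied=False, converted=False),
    graded_label(clicked=False, satisfied=False, converted=False),
]
```

A grade of 0 here means only "no observed evidence", not "irrelevant"; unclicked results low on the page may never have been seen.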

Query reformulation as negative signal. Users who reformulate their query after seeing results typically didn't find what they wanted. Sequential queries ("running shoes" → "men's running shoes" → "nike running shoes") signal that earlier results weren't satisfactory. Reformulation patterns are a rich source of implicit negative judgment; analyzing them surfaces queries where the system is underperforming.
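A simple sketch of reformulation detection within one user's session. It flags a query when the next query follows shortly after, differs from it, and shares at least one token with it. The 120-second gap and the token-overlap rule are assumptions chosen for illustration; production systems often use richer similarity measures.

```python
def reformulated_queries(session, max_gap_seconds=120.0):
    """Return queries that were followed by a likely reformulation.

    session: list of (timestamp_seconds, query) tuples for one user,
             sorted by time.
    """
    flagged = []
    for (t1, q1), (t2, q2) in zip(session, session[1:]):
        tokens1 = set(q1.lower().split())
        tokens2 = set(q2.lower().split())
        close_in_time = (t2 - t1) <= max_gap_seconds
        related_but_different = tokens1 != tokens2 and bool(tokens1 & tokens2)
        if close_in_time and related_but_different:
            flagged.append(q1)
    return flagged


# Hypothetical session using the example sequence from the text.
session = [
    (0.0, "running shoes"),
    (20.0, "men's running shoes"),
    (45.0, "nike running shoes"),
    (900.0, "coffee maker"),
]
flagged = reformulated_queries(session)
```

The last query in the chain is not flagged: the follow-up is unrelated and much later, which is consistent with the user having found something or moved on.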

Aggregation patterns. Raw click data is per-impression; useful evaluation aggregates across many impressions. Patterns: click-through rate (CTR) per query per position; mean reciprocal rank of clicks; satisfied click rate; conversion rate per query class. The aggregation level depends on the analysis: query-level for tuning specific high-volume queries; position-level for identifying ranking biases; segment-level for personalization analysis.
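Two of the aggregates named above, sketched over a simplified log format. The tuple layout is an assumption for this example, as is the convention that impressions without a click contribute 0 to the mean reciprocal rank.

```python
from collections import defaultdict


def ctr_by_query_position(impressions):
    """Click-through rate per (query, position).

    impressions: list of (query, shown_positions, clicked_positions).
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, shown_positions, clicked_positions in impressions:
        for pos in shown_positions:
            shown[(query, pos)] += 1
            if pos in clicked_positions:
                clicked[(query, pos)] += 1
    return {key: clicked[key] / n for key, n in shown.items()}


def mean_reciprocal_rank_of_clicks(impressions):
    """Average of 1 / (highest-ranked clicked position) per impression.

    Impressions with no click count as 0 (an assumed convention).
    """
    total = 0.0
    count = 0
    for _, _, clicked_positions in impressions:
        count += 1
        if clicked_positions:
            total += 1.0 / min(clicked_positions)
    return total / count if count else 0.0


# Hypothetical log: four impressions of the same query.
logs = [
    ("running shoes", [1, 2, 3], {1}),
    ("running shoes", [1, 2, 3], {2}),
    ("running shoes", [1, 2, 3], set()),
    ("running shoes", [1, 2, 3], {1, 3}),
]
ctr = ctr_by_query_position(logs)
mrr = mean_reciprocal_rank_of_clicks(logs)
```

Both numbers describe behavior under the current ranking. A high CTR at position 1 reflects position bias as well as relevance, so these aggregates are better suited to monitoring and comparison than to use as unbiased relevance estimates.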

Privacy and ethics. User behavior data is sensitive. Logging should comply with privacy regulations (GDPR, CCPA, sector-specific rules), retention policies, and user consent frameworks. Anonymization and aggregation matter; raw per-user behavior should be handled with appropriate access controls. The discipline overlaps with the agentic AI series' compliance volume; for search specifically, the ethical handling of user behavior data is foundational.

When to Use It

Production search systems with sufficient query volume to produce meaningful aggregate signals (typically thousands of queries per day or more). Continuous monitoring and evaluation that requires more data than explicit labeling can produce. Identification of underperforming queries that explicit labeling might not cover. Personalization and segment-specific evaluation.

Alternatives — explicit labeling (prior entry) for gold sets and high-stakes evaluation. Implicit signals supplement explicit labeling rather than replacing it; the bias correction (Section E) is essential when implicit signals drive decisions.

Sources

Joachims, "Optimizing Search Engines using Clickthrough Data" (KDD 2002)
Joachims et al., "Accurately Interpreting Clickthrough Data as Implicit Feedback" (SIGIR 2005)
Production methodology writings from web search and e-commerce teams

Implicit signals and click-based judgments