Source: Production methodology at e-commerce and content platforms; LLM provider documentation; Pradeep et al. on LLM-based document processing

Classification — Index-time enrichment using LLMs to extract structured attributes, summaries, and classifications from raw documents.

Intent

Extract structured signals from raw document content using LLM-based processing at index time, producing fields that retrieval can filter on and ranking can use as features.

Motivating Problem

Raw documents carry structured information implicitly, in a form that classical NLP methods can't reliably extract. A product description in natural language may mention the material, target use case, age recommendation, care instructions, and many other attributes, each embedded in the prose rather than exposed as a labeled field. Classical NER catches named entities (brands, locations) but misses domain-specific attributes that require contextual interpretation. LLM-based extraction handles these cases: the model reads the description, interprets what is being communicated, and outputs structured fields.

How It Works

The pattern. For each document at index time: send the document content to an LLM with a structured extraction prompt; parse the LLM's JSON response; store the extracted fields as part of the indexed document. The fields then become available for filtering, faceting, and ranking, just as manually curated structured fields are.
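The parse-and-store step can be sketched as follows. This is a minimal illustration that assumes the LLM response arrives as a JSON string; `parse_extraction`, `enrich`, `ALLOWED_OCCASIONS`, and the field names are hypothetical, chosen to match a product-search example.

```python
import json

# Hypothetical enum for illustration; a real schema would define its own values.
ALLOWED_OCCASIONS = {"casual", "formal", "athletic"}


def parse_extraction(raw_response: str) -> dict:
    """Parse the LLM's JSON output into index-ready fields.

    Unparseable output and out-of-enum values become None, so a bad
    response never writes garbage into the index.
    """
    empty = {"material": None, "occasion": None}
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    material = data.get("material")
    occasion = data.get("occasion")
    return {
        "material": material if isinstance(material, str) else None,
        "occasion": (
            occasion
            if isinstance(occasion, str) and occasion in ALLOWED_OCCASIONS
            else None
        ),
    }


def enrich(doc: dict, raw_response: str) -> dict:
    """Merge extracted fields into the document that will be indexed."""
    return {**doc, **parse_extraction(raw_response)}
```

Treating invalid values as missing, rather than raising, keeps a single bad response from interrupting the indexing run.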

Prompt design. The extraction prompt should specify exactly what fields to extract, with type guidance ("return null if not specified", "extract as ISO date if possible", "one of [casual, formal, athletic]"). Few-shot examples in the prompt substantially improve extraction quality; production prompts typically include 2–5 worked examples. The prompt should produce well-defined JSON output; tool/function-calling APIs (Anthropic's tool use, OpenAI's function calling) provide stronger structure guarantees than freeform JSON generation.
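One way to assemble such a prompt is sketched below. The worked examples, the field list, and the `build_prompt` helper are all hypothetical; a real prompt would draw its examples from the actual catalog.

```python
import json

# Hypothetical worked examples; production prompts typically carry 2-5 of these.
FEW_SHOT_EXAMPLES = [
    {
        "description": "Waterproof leather hiking boot with ankle support.",
        "output": {"material": "leather", "occasion": "outdoor", "release_date": None},
    },
    {
        "description": "Lightweight mesh running shoe, launched 2024-03-01.",
        "output": {"material": "mesh", "occasion": "athletic", "release_date": "2024-03-01"},
    },
]

# Type guidance per field: nullability, enum membership, date format.
FIELD_GUIDANCE = (
    "Fields:\n"
    "- material: string, or null if not specified\n"
    "- occasion: one of [casual, formal, athletic, outdoor], or null if unclear\n"
    "- release_date: ISO 8601 date if stated, otherwise null\n"
)


def build_prompt(description: str) -> str:
    """Assemble field guidance, worked examples, and the target document."""
    parts = ["Extract the following fields as a JSON object.", FIELD_GUIDANCE]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Description: {ex['description']}")
        parts.append(f"Output: {json.dumps(ex['output'])}")
    parts.append(f"Description: {description}")
    parts.append("Output:")
    return "\n".join(parts)
```

Serializing the example outputs with `json.dumps` keeps them in exactly the format the parser expects, including `null` for missing values.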

Model selection. Smaller, cheaper models (Claude Haiku, GPT-4o-mini, Gemini Flash) handle straightforward extraction with low cost. Larger models (Claude Opus, GPT-4) are warranted for complex extraction or when accuracy is critical. The trade-off depends on document complexity and cost sensitivity. Production deployments often use smaller models by default and route specific document types to larger models when justified.
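The default-small, route-when-justified policy might look like the sketch below. The model identifiers are placeholders, and the document types and length threshold are assumptions for illustration rather than recommended values.

```python
# Placeholder identifiers, not real model names.
DEFAULT_MODEL = "small-model"
LARGE_MODEL = "large-model"

# Hypothetical routing rules: document types known to need careful reading.
COMPLEX_DOC_TYPES = {"legal_contract", "technical_spec"}
LONG_DOC_CHARS = 8000  # assumed threshold; tune against measured accuracy


def select_model(doc: dict) -> str:
    """Route to the larger model only when the document justifies the cost."""
    if doc.get("doc_type") in COMPLEX_DOC_TYPES:
        return LARGE_MODEL
    if len(doc.get("description", "")) > LONG_DOC_CHARS:
        return LARGE_MODEL
    return DEFAULT_MODEL
```

In practice the routing rules would come from per-field accuracy measurements showing where the smaller model falls short.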

Batch processing. Extraction at index time can be batched aggressively. For large initial indexing runs, send batches of documents through the LLM API in parallel; production deployments routinely process thousands of documents per minute through batched extraction. Anthropic's Message Batches API and similar batch endpoints offer cost discounts (50% off) for asynchronous, non-real-time processing, making large extraction runs more economical.
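A minimal sketch of parallel extraction with a thread pool follows. The `extract_fn` callable is a stand-in for whatever function wraps the LLM call, and the worker count is an assumption that would need to respect the provider's rate limits. Provider batch endpoints are a separate mechanism and are not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def extract_parallel(
    docs: list[dict],
    extract_fn: Callable[[dict], dict],
    max_workers: int = 8,
) -> list[dict]:
    """Run extract_fn over docs concurrently, preserving input order.

    Threads suit this workload because each call spends most of its time
    waiting on the network. extract_fn is expected to handle its own
    failures; an exception raised inside it propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(extract_fn, docs))
```

Because results come back in input order, they can be zipped with the original documents without tracking identifiers separately.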

Validation. LLM extraction quality varies; production deployments need validation. Hold out a labeled sample (manually verified) and measure extraction accuracy per field. Track extraction rates over time (fraction of documents where each field was successfully extracted); a sudden drop signals upstream issues. Spot-check extracted values against source documents for high-value fields.
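Extraction-rate tracking can be sketched as below. The helper names and the 0.10 alert threshold are assumptions; the idea is to compare each field's fill rate against a stored baseline and flag sharp drops.

```python
def extraction_rates(enriched_docs: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of documents with a non-empty value for each field."""
    total = len(enriched_docs)
    if total == 0:
        return {f: 0.0 for f in fields}
    return {
        f: sum(1 for d in enriched_docs if d.get(f) not in (None, "", [])) / total
        for f in fields
    }


def dropped_fields(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.10,  # assumed alert threshold, in absolute rate
) -> list[str]:
    """Fields whose extraction rate fell more than max_drop below baseline."""
    return [
        f for f, base in baseline.items()
        if base - current.get(f, 0.0) > max_drop
    ]
```

A fill rate says nothing about correctness, so this complements, rather than replaces, accuracy measurement against a labeled sample.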

Failure handling. LLM calls can fail or produce malformed output. Production extraction pipelines handle failures: retry with backoff for transient API errors; fall back to null/missing values for repeated failures; alert on failure rates above thresholds; maintain a queue of failed documents for re-processing. The pipeline should not let LLM failures block the entire indexing process.
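The retry, fallback, and re-processing queue can be sketched as follows. `extract_fn` is a stand-in for the function that wraps the LLM call, and the attempt count and delays are assumed values. For brevity the sketch retries on any exception; a production version would retry only errors known to be transient.

```python
import time
from typing import Callable, Optional


def extract_with_retry(
    doc: dict,
    extract_fn: Callable[[dict], dict],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call extract_fn, retrying with exponential backoff; None on give-up."""
    for attempt in range(max_attempts):
        try:
            return extract_fn(doc)
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return None


def enrich_all(
    docs: list[dict],
    extract_fn: Callable[[dict], dict],
    **retry_kwargs,
) -> tuple[list[dict], list[dict]]:
    """Enrich every document; failures are queued rather than blocking indexing."""
    enriched, failed_queue = [], []
    for doc in docs:
        attrs = extract_with_retry(doc, extract_fn, **retry_kwargs)
        if attrs is None:
            failed_queue.append(doc)  # re-process later
            enriched.append(doc)      # index without the enriched fields
        else:
            enriched.append({**doc, **attrs})
    return enriched, failed_queue
```

The ratio `len(failed_queue) / len(docs)` gives a failure rate that can feed the alerting threshold.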

Cost management. LLM extraction at index time has per-document cost; for large corpora the total cost matters. Strategies: process only high-value content with LLM extraction (use cheaper methods for the long tail); extract only the high-value fields with LLM (use rules or classical NLP for simple fields); cache extraction results aggressively so re-indexing doesn't pay the cost again; use batch APIs when latency permits.
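Caching is the most mechanical of these strategies. The sketch below assumes an in-memory dict as the cache (a persistent key-value store would play this role in production) and a hypothetical `EXTRACTOR_VERSION` tag that is bumped whenever the prompt or model changes.

```python
import hashlib
import json
from typing import Callable

# Hypothetical version tag; bump it when the prompt or model changes so
# stale cache entries are not reused.
EXTRACTOR_VERSION = "v1"


def cache_key(title: str, description: str) -> str:
    """Stable key over the extractor version and the exact input text."""
    payload = json.dumps([EXTRACTOR_VERSION, title, description])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_extract(
    doc: dict,
    extract_fn: Callable[[str, str], dict],
    cache: dict,
) -> dict:
    """Return cached attributes when this exact input was already extracted."""
    key = cache_key(doc["title"], doc["description"])
    if key not in cache:
        cache[key] = extract_fn(doc["title"], doc["description"])
    return cache[key]
```

Keying on content rather than document ID means a full re-index pays only for documents whose text has actually changed.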

Incremental enrichment. New documents arrive continuously, and the pipeline must keep up. Production patterns: process new documents through the LLM extraction pipeline as they arrive, with appropriate parallelism; for updated documents, re-extract only when the changed fields are ones the extraction reads (a price change, for example, does not require re-extracting attributes from an unchanged description). This discipline keeps the index current without over-processing.
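The update rule can be sketched as below. `EXTRACTION_INPUT_FIELDS` and both helpers are hypothetical. The sketch also assumes that an incoming update carries only source fields, so any key present in the old document but absent from the update is treated as a previously enriched field.

```python
from typing import Callable

# Hypothetical: the document fields that the extraction prompt actually reads.
EXTRACTION_INPUT_FIELDS = ("title", "description")


def needs_reextraction(old_doc: dict, new_doc: dict) -> bool:
    """True only when a field that feeds the extraction prompt has changed."""
    return any(
        old_doc.get(f) != new_doc.get(f) for f in EXTRACTION_INPUT_FIELDS
    )


def apply_update(
    old_doc: dict,
    new_doc: dict,
    extract_fn: Callable[[dict], dict],
) -> dict:
    """Re-extract on content changes; otherwise carry enriched fields forward."""
    if needs_reextraction(old_doc, new_doc):
        return {**new_doc, **extract_fn(new_doc)}
    # Only non-input fields changed: keep the previously extracted values.
    carried = {k: v for k, v in old_doc.items() if k not in new_doc}
    return {**new_doc, **carried}
```

Under this rule, a change to a field such as price updates the indexed document without triggering an LLM call.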

When to Use It

E-commerce search where products have unstructured descriptions that contain extractable structured information (material, occasion, age, fit, style). Content platforms where articles need topic extraction, sentiment analysis, or summarization. Enterprise search where documents carry implicit structure (jurisdiction, date, case type, document type) that search would benefit from having as explicit fields. Use cases where classical NLP methods can't reach the extraction quality the downstream stages need.

Alternatives: classical NLP for entities and standard classification (cheaper, faster, more predictable). Rule-based extraction for simple cases. Manual annotation for the highest-value content where neither classical nor LLM methods are sufficient. Production deployments typically combine these methods with LLM extraction: rules for the simplest cases, classical NLP for standard cases, LLMs for cases needing contextual understanding.
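A combined pipeline might be organized as in the sketch below, which shows only the rule tier and the LLM tier (the classical NLP tier is omitted for brevity). The weight regex, the field names, and the `llm_extract` callable are hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical rule: a simple, regular attribute that a regex handles well.
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|g)\b", re.IGNORECASE)


def rule_extract_weight_grams(text: str) -> Optional[float]:
    """Cheap rule-based extraction for a field with a predictable format."""
    m = WEIGHT_RE.search(text)
    if m is None:
        return None
    value, unit = float(m.group(1)), m.group(2).lower()
    return value * 1000 if unit == "kg" else value


def tiered_extract(text: str, llm_extract: Callable[[str], dict]) -> dict:
    """Rules fill the regular fields; the LLM handles the contextual ones."""
    fields = {"weight_grams": rule_extract_weight_grams(text)}
    # 'occasion' needs interpretation of the prose, so it goes to the LLM.
    fields["occasion"] = llm_extract(text).get("occasion")
    return fields
```

Keeping regular fields out of the LLM schema shortens the prompt and the response, which reduces per-document cost.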

Sources

Anthropic Claude documentation on tool use and batch API
OpenAI function calling documentation
Production methodology writings on LLM-based document processing
Pradeep, Nogueira, Lin on LLM-based document processing

Example artifacts

Code

# LLM-based attribute extraction at index time with Claude
import anthropic

client = anthropic.Anthropic()

# Define the extraction tool/schema; tool use gives stronger structural
# guarantees than freeform JSON generation
EXTRACTION_TOOL = {
    "name": "extract_product_attributes",
    "description": "Extract structured product attributes from the description",
    "input_schema": {
        "type": "object",
        "properties": {
            "material": {
                "type": ["string", "null"],
                "description": "Primary material (e.g., 'leather', 'mesh', 'rubber'). Null if not specified."
            },
            "occasion": {
                "type": ["string", "null"],
                "enum": ["casual", "formal", "athletic", "work", "outdoor", None],
                "description": "Primary use occasion. Null if unclear."
            },
            "target_gender": {
                "type": ["string", "null"],
                "enum": ["mens", "womens", "unisex", "kids", None]
            },
            "features": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Notable product features mentioned (e.g., 'waterproof', 'memory foam insole')"
            },
            "summary": {
                "type": "string",
                "description": "One-sentence product summary suitable for semantic embedding"
            }
        },
        "required": ["summary"]
    }
}

def extract_attributes(title: str, description: str) -> dict:
    """Extract structured attributes from a product description."""
    prompt = (
        "Extract structured attributes from this product:\n\n"
        f"Title: {title}\n"
        f"Description: {description}\n\n"
        "Use the extract_product_attributes tool. Be conservative: only "
        "fill fields where the description provides clear evidence; use null "
        "otherwise."
    )
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Haiku for speed and cost
        max_tokens=1024,
        tools=[EXTRACTION_TOOL],
        tool_choice={"type": "tool", "name": "extract_product_attributes"},
        messages=[{"role": "user", "content": prompt}]
    )
    # Forcing the tool choice makes the model answer with a tool_use block;
    # field values should still be validated downstream
    for block in response.content:
        if block.type == "tool_use":
            return block.input
    return {}

# Batch processing pattern for initial indexing
def enrich_documents_batch(docs: list[dict]) -> list[dict]:
    """Enrich a batch of documents. In production, use the Anthropic
    batch API for cost savings."""
    enriched = []
    for doc in docs:
        try:
            attrs = extract_attributes(doc["title"], doc["description"])
            enriched.append({
                **doc,
                "material": attrs.get("material"),
                "occasion": attrs.get("occasion"),
                "target_gender": attrs.get("target_gender"),
                "features": attrs.get("features", []),
                "llm_summary": attrs.get("summary"),  # source for body_vec embedding
            })
        except Exception as e:
            # Log and continue; don't block indexing on enrichment failures
            print(f"Extraction failed for {doc.get('id')}: {e}")
            enriched.append(doc)
    return enriched

# Validation pattern: check extraction quality against held-out labeled sample
def validate_extraction(labeled_sample: list[dict]) -> dict:
    """Compare LLM extraction against manually labeled gold sample."""
    results = {"material": {"correct": 0, "total": 0},
               "occasion": {"correct": 0, "total": 0}}
    for doc in labeled_sample:
        extracted = extract_attributes(doc["title"], doc["description"])
        for field in results:
            # Score only documents that have a gold label for this field
            if doc.get(f"gold_{field}") is not None:
                results[field]["total"] += 1
                if extracted.get(field) == doc[f"gold_{field}"]:
                    results[field]["correct"] += 1
    # Per-field accuracy; None when no labeled examples exist for a field
    return {
        field: r["correct"] / r["total"] if r["total"] > 0 else None
        for field, r in results.items()
    }