RelevantSearch.AI
Pattern · Volume 09 · Section B: LLM-augmented document processing · Updated May 2026

Semantic chunking and indexed summarization for RAG

Source: Production RAG methodology; Anthropic, OpenAI, LangChain documentation on document processing; literature 2023–2025

Classification — Pattern for breaking documents into retrieval-appropriate chunks and generating summaries that improve both retrieval recall and synthesis quality.

Intent

Prepare documents for retrieval-augmented use by chunking them into semantically coherent pieces and generating summaries that capture each chunk's gist, enabling better embedding-based retrieval and clearer LLM synthesis context.

Motivating Problem

Raw documents don't fit neatly into LLM context windows or embedding inputs. A 50-page document can't usefully be embedded as one vector: it often exceeds the embedding model's input limit, and even when it fits, the vector would average too many concepts. But naive fixed-size chunking (every 500 tokens) ignores semantic boundaries: a paragraph might span two chunks, or an argument might be split partway through. Both retrieval recall and synthesis context suffer.

LLM-augmented chunking respects semantic boundaries: paragraph breaks, section transitions, logical units. The result is chunks that each cover one coherent topic, with consistent size suitable for embedding and prompt construction.

How It Works

Step 1: structural parsing. Parse the document's structure (headings, paragraphs, lists) using its native format (Markdown, HTML, PDF outline). The structure provides natural chunk boundaries before LLM involvement.
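
As an illustration of Step 1, here is a minimal sketch for the Markdown case only, using just the standard library. The function name `split_markdown_sections` and the `heading_path` field are assumptions made for this example, not part of any library; HTML and PDF outlines would need their own parsers, and fenced code blocks containing `#` lines are not special-cased.

```python
import re
from typing import Dict, List, Tuple

# ATX-style Markdown heading: 1-6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_sections(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one section per heading (hypothetical helper).

    Each section records its heading path (e.g. "Guide > Install") so the
    chunk keeps its structural context. Text before the first heading gets
    an empty path.
    """
    sections: List[Dict[str, str]] = []
    path: List[Tuple[int, str]] = []  # stack of (level, title)
    body: List[str] = []              # lines of the currently open section

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            heading_path = " > ".join(title for _, title in path)
            sections.append({"heading_path": heading_path, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous section under the old path
            level = len(match.group(1))
            # Pop headings at the same or deeper level, then push this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections
```

Each returned section can then be passed to the sizing step; carrying the heading path along gives later stages some context that the chunk text alone may lack.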

Step 2: chunk sizing. Within each structural unit, apply size limits. Target chunk size depends on the downstream use: 200–500 tokens for fine-grained retrieval; 1000–2000 tokens for synthesis context. Use overlap (50–100 tokens) between adjacent chunks to preserve cross-chunk coherence.
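
The sizing rule in Step 2 can be sketched as a sliding window with overlap. This is a rough sketch, not a definitive implementation: `window_chunks` and `chunk_text` are hypothetical names, and whitespace-separated words stand in for tokens, which is only an approximation of what a real tokenizer would count.

```python
from typing import List


def window_chunks(tokens: List[str], max_tokens: int = 400,
                  overlap: int = 50) -> List[List[str]]:
    """Split a token list into windows of at most max_tokens tokens.

    Adjacent windows share `overlap` tokens. The last window may be shorter.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    chunks: List[List[str]] = []
    step = max_tokens - overlap
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50) -> List[str]:
    """Chunk text, using whitespace-separated words as approximate tokens."""
    return [" ".join(window)
            for window in window_chunks(text.split(), max_tokens, overlap)]
```

Applied within a single structural unit, the window never crosses a heading boundary; a unit shorter than `max_tokens` comes back as one chunk.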

Step 3: semantic chunking when structure is unreliable. For unstructured text (transcripts, OCR output from scans, social posts), use an LLM to identify natural break points. The LLM is shown the text and asked where to split it, and its output drives the chunking. The added latency is acceptable because it is paid once at index time rather than per query, and the resulting boundaries generally follow topic shifts more closely than fixed-size chunking does.
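
One possible shape for Step 3 is sketched below, under several assumptions: the text is first split into sentences with a naive regex, the sentences are numbered, and the model is asked which numbers begin a new topic. The prompt wording, the function names, and the `ask_llm` callable (any function that takes a prompt string and returns the model's reply) are all hypothetical; in practice `ask_llm` would wrap a real API call.

```python
import re
from typing import Callable, List

# Hypothetical prompt; the wording is an assumption, not a tested template.
BREAKPOINT_PROMPT = """The text below is split into numbered sentences.
List the numbers of the sentences that begin a new topic.
Output only comma-separated integers; no preamble.

{numbered}

Numbers:"""


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def semantic_chunks(text: str, ask_llm: Callable[[str], str]) -> List[str]:
    """Chunk text at the sentence indices the model reports as topic starts.

    Falls back to a single chunk if the call fails or the reply contains no
    usable indices.
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences))
    try:
        reply = ask_llm(BREAKPOINT_PROMPT.format(numbered=numbered))
    except Exception:
        reply = ""
    # Sentence 0 always starts a chunk; ignore out-of-range numbers.
    starts = {0}
    for number in re.findall(r"\d+", reply):
        index = int(number)
        if 0 < index < len(sentences):
            starts.add(index)
    bounds = sorted(starts) + [len(sentences)]
    return [" ".join(sentences[a:b]) for a, b in zip(bounds, bounds[1:])]
```

Validating the reply against the sentence count keeps a malformed or hallucinated answer from producing empty or out-of-order chunks. Long inputs would still need to be fed to the model in windows that fit its context.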

Step 4: chunk summarization. For each chunk, generate a 1–2 sentence summary. The summary serves two purposes: it is indexed alongside the chunk text for retrieval, and it is surfaced as part of the synthesis context when the chunk is retrieved. Summaries can markedly improve recall on conceptual queries that don't share vocabulary with the chunk text.

Step 5: question generation. An optional extension: for each chunk, ask the LLM "What questions does this chunk answer?" and generate 3–5 likely user questions. Index those questions alongside the chunk. Queries that match the generated questions retrieve the chunk strongly, even when the user's wording differs from the document's wording.
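
To make the vocabulary-mismatch point in Steps 4 and 5 concrete, the toy sketch below counts how many query terms each indexed field matches. It is an illustration only: the helper names are hypothetical, the scoring is plain term overlap rather than BM25 or embeddings, and the sample chunk in the usage note is invented.

```python
import re
from typing import Any, Dict, Set


def terms(text: str) -> Set[str]:
    """Lowercased alphanumeric terms of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def field_overlap(query: str, doc: Dict[str, Any]) -> Dict[str, int]:
    """Count distinct query terms matched in each indexed field.

    A toy lexical score; a production index would use BM25 and/or
    embedding similarity per field instead.
    """
    query_terms = terms(query)
    questions_text = " ".join(doc.get("questions", []))
    return {
        "text": len(query_terms & terms(doc.get("text", ""))),
        "summary": len(query_terms & terms(doc.get("summary", ""))),
        "questions": len(query_terms & terms(questions_text)),
    }
```

For an invented chunk whose text is "Tokens expire after 3600 seconds; call /refresh to renew.", with summary "Access tokens last one hour and can be renewed." and generated questions "How long does a login session last?" and "How do I stay logged in?", the query "how long does my login last" shares no terms with the chunk text, one with the summary, and five with the questions.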

Step 6: metadata extraction. Extract structured metadata from each chunk — dates, entities, categories, key facts. Store as structured fields for filtering and faceting. This stage transforms unstructured text into hybrid structured-unstructured content that supports much richer queries.
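
A hedged sketch of Step 6 follows. It assumes the model is asked for a JSON object with three list-valued keys and that the reply is parsed defensively, since models sometimes wrap JSON in extra text. The prompt, the key names, and the `ask_llm` callable are assumptions for illustration; a real schema would depend on the collection.

```python
import json
from typing import Callable, Dict, List

# Hypothetical prompt and key set, chosen only for this example.
METADATA_PROMPT = """Extract metadata from the passage as a JSON object with
exactly these keys: "dates" (list of ISO 8601 date strings), "entities"
(list of strings), "categories" (list of strings). Output only the JSON.

Passage:
{text}

JSON:"""

METADATA_KEYS = ("dates", "entities", "categories")


def parse_metadata(reply: str) -> Dict[str, List[str]]:
    """Parse a model reply into the expected shape; never raises.

    Unknown keys are dropped, and non-string or blank items are filtered out.
    """
    result: Dict[str, List[str]] = {key: [] for key in METADATA_KEYS}
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return result
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return result
    if not isinstance(data, dict):
        return result
    for key in METADATA_KEYS:
        value = data.get(key)
        if isinstance(value, list):
            result[key] = [item.strip() for item in value
                           if isinstance(item, str) and item.strip()]
    return result


def extract_metadata(text: str, ask_llm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Ask the model for metadata; return empty fields if the call fails."""
    try:
        return parse_metadata(ask_llm(METADATA_PROMPT.format(text=text)))
    except Exception:
        return parse_metadata("")
```

Returning the same shape on every path, including failures, lets the indexer store the fields without per-chunk special cases.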

Cost considerations. All of this work happens at index time. For 100,000 documents averaging 10 chunks each, that's 1M chunks, and so at least 1M LLM calls for full enrichment (more if summaries, questions, and metadata are each generated by a separate call), which is a substantial cost. Production patterns: incremental processing (only re-process changed documents); batch processing during off-peak hours; tiered processing (cheap models for routine documents, expensive models for high-value ones); selective enrichment (summarize only chunks above a length threshold).
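
The arithmetic above, plus two of the listed patterns (incremental processing and selective enrichment), can be sketched as follows. All names and the 80-token default threshold are assumptions for illustration; change detection here uses a SHA-256 content hash, which is one common choice and not necessarily what a given pipeline uses.

```python
import hashlib
from typing import Dict, List


def estimate_llm_calls(num_docs: int, chunks_per_doc: int,
                       calls_per_chunk: int) -> int:
    """Total index-time LLM calls for full enrichment."""
    return num_docs * chunks_per_doc * calls_per_chunk


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_needing_reprocess(docs: Dict[str, str],
                           seen_hashes: Dict[str, str]) -> List[str]:
    """IDs of documents that are new or whose content hash has changed.

    `docs` maps document ID to current text; `seen_hashes` maps document ID
    to the hash stored at the last indexing run.
    """
    return [doc_id for doc_id, text in docs.items()
            if seen_hashes.get(doc_id) != content_hash(text)]


def should_enrich(chunk_tokens: int, min_tokens: int = 80) -> bool:
    """Selective enrichment: skip chunks below a length threshold."""
    return chunk_tokens >= min_tokens
```

With one call per chunk, `estimate_llm_calls(100_000, 10, 1)` gives the 1,000,000 figure from the text; with separate summary and question calls per chunk it doubles to 2,000,000.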

When to Use It

Any RAG system over substantial document collections (more than a few thousand documents). Knowledge bases, documentation search, technical reference search. Workloads where retrieval recall on conceptual queries is currently weak.

Less suitable: small document collections where the engineering investment isn't justified; workloads with highly structured documents (database records) that don't need chunking; cost-sensitive deployments where the index-time enrichment cost is prohibitive.

Sources
  • LangChain and LlamaIndex documentation on document chunking
  • Anthropic documentation on prompt engineering for document processing
  • RAG literature: chunk-size studies (LlamaIndex 2024 retrieval benchmarks)

Example artifacts

Code

# Semantic chunking with LLM-generated summaries (Python)

from typing import Any, Dict, List
import logging

import anthropic

logger = logging.getLogger(__name__)

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

MODEL = "claude-haiku-4-5-20251001"

CHUNK_SUMMARY_PROMPT = """Summarize this passage in 1-2 sentences
capturing its key claims. Output only the summary; no preamble.

Passage:
{text}

Summary:"""

QUESTION_GEN_PROMPT = """Generate 3-5 likely questions a user
might ask that this passage would answer. Output one question per
line; no numbering.

Passage:
{text}

Questions:"""


def _complete(prompt: str, max_tokens: int) -> str:
    """Send a single-turn prompt and return the reply text."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text.strip()


def enrich_chunk(text: str) -> Dict[str, Any]:
    """Generate summary and synthetic questions for a chunk.

    Returns dict with: text, summary, questions (list).
    Falls back to empty enrichments on failure.
    """
    enriched: Dict[str, Any] = {"text": text, "summary": "", "questions": []}

    # Generate summary
    try:
        enriched["summary"] = _complete(
            CHUNK_SUMMARY_PROMPT.format(text=text), max_tokens=150
        )
    except Exception:
        # Empty summary on failure; log it so failures are not silent
        logger.warning("Summary generation failed", exc_info=True)

    # Generate synthetic questions
    try:
        reply = _complete(QUESTION_GEN_PROMPT.format(text=text), max_tokens=300)
        # Strip list markers first, then drop lines that end up empty
        questions: List[str] = [
            line.strip("-\u2022 \t") for line in reply.split("\n")
        ]
        enriched["questions"] = [q for q in questions if q][:5]  # cap at 5
    except Exception:
        logger.warning("Question generation failed", exc_info=True)

    return enriched


def build_index_document(chunk_text: str) -> Dict[str, Any]:
    """Build a complete indexable document from a chunk.

    The result is indexed with multiple fields:
    - text: the chunk text itself (BM25 + embedding)
    - summary: the generated summary (BM25 + embedding boost)
    - questions: synthetic questions (BM25, treated as queries this
      chunk answers)

    Production retrieval queries the chunk via any of these fields,
    which can improve recall on conceptual queries.
    """
    enriched = enrich_chunk(chunk_text)
    return {
        "text": enriched["text"],
        "summary": enriched["summary"],
        "questions": enriched["questions"],
        # Plus: embedding(text + summary), structured metadata, etc.
    }
