Source: RAG framework documentation (LangChain, LlamaIndex); production methodology in 2023–2026 RAG deployments; semantic chunking literature
Classification — Methods for breaking long documents into retrieval-sized chunks for embedding and vector retrieval.
Choose a chunking strategy appropriate to the document type and retrieval needs, producing chunks that maximize retrieval quality at acceptable index size and indexing cost.
A long document embedded as a single vector dilutes meaning. A query about a specific topic on page 30 of a 50-page document produces a poor match against the document's pooled embedding. Chunking addresses this by producing focused embeddings: each chunk represents a coherent piece of content, embedded separately, indexed as a separate retrievable unit. The strategy is how the chunking is done; the choice affects retrieval quality substantially.
Fixed-size chunking. Split the document into chunks of N tokens with overlap. A typical configuration is 512 tokens per chunk with a 50-token overlap, so that a short passage near a chunk boundary still appears whole in at least one chunk. The strategy is simple and predictable, and N can be set to fit any embedding model's input limit. The downside is that splits ignore natural boundaries and break sentences and paragraphs mid-stream. Despite this limitation, fixed-size chunking remains popular for its simplicity and is the default in many RAG frameworks.
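The sliding-window arithmetic behind fixed-size chunking can be sketched as follows. This is a minimal illustration, not a framework's implementation: the function name fixed_size_chunks is hypothetical, and the input is assumed to be an already-tokenized list (whitespace words stand in for real tokenizer output in the usage lines).

```python
from typing import List


def fixed_size_chunks(
    tokens: List[str], chunk_size: int = 512, overlap: int = 50
) -> List[List[str]]:
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace-style tokens stand in for a real tokenizer.
words = [f"w{i}" for i in range(1200)]
chunks = fixed_size_chunks(words, chunk_size=512, overlap=50)
print([len(c) for c in chunks])
```

The last window is allowed to be shorter than chunk_size; whether to keep, pad, or merge such a tail chunk is a design choice this sketch leaves open.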
Sentence and paragraph chunking. Split on natural language boundaries. Process: identify sentence/paragraph boundaries; group sentences or paragraphs into chunks up to a size limit (e.g., 512 tokens); maintain overlap between chunks. The result is chunks that respect natural units — better embedding quality than fixed-size, but with variable sizes. For documents where paragraphs are well-formed (articles, manuals, books), this strategy often produces better retrieval than fixed-size.
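One way to group paragraphs up to a size limit is a greedy packer, sketched below. The function name paragraph_chunks, the blank-line paragraph delimiter, and the use of whitespace word counts as a stand-in for token counts are all assumptions made for illustration.

```python
import re
from typing import List


def paragraph_chunks(
    text: str, max_tokens: int = 512, overlap_paragraphs: int = 1
) -> List[str]:
    """Greedily pack whole paragraphs into chunks targeting max_tokens words.

    A chunk can exceed max_tokens when a single paragraph is longer than the
    limit, or when the carried-over overlap plus the next paragraph is.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = len(para.split())  # whitespace words approximate tokens
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the trailing paragraph(s) forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
            current_tokens = sum(len(p.split()) for p in current)
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Overlap is expressed here in whole paragraphs rather than tokens so that no chunk starts or ends mid-paragraph; a token-based overlap would trade that property for tighter size control.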
Semantic chunking. Identify topic boundaries within the document and split there. The standard algorithm: embed each sentence individually; compute the cosine similarity between each pair of adjacent sentences; treat sharp similarity drops as topic boundaries; group the sentences between boundaries into chunks. The resulting chunks align with semantic topics, so each embedding represents one topic rather than a mix of several. The cost is additional compute at index time, since every sentence must be embedded just to locate the boundaries. Semantic chunking is the modern production default for complex documents where the additional compute is justified.
Hierarchical chunking. Index multiple chunk levels: small chunks (e.g., individual sentences or short paragraphs) for precise retrieval; larger chunks (e.g., sections) or full documents accessible at retrieval time for context. At retrieval time: search the small chunks; for each match, fetch the parent section or full document as context. The pattern provides precision (small chunks match well) and context (the parent provides surrounding information that helps ranking or downstream LLM consumption). Storage overhead is higher — chunks at multiple levels — but retrieval quality benefits.
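The retrieval-time half of this pattern (search small chunks, then fetch each match's parent) might look like the sketch below. It assumes each hit is a dict carrying the doc_id, section_idx, section_title, and section_text fields produced by hierarchical_chunk in the Code section, and that hits arrive sorted best-first; the function name expand_to_parents and the max_parents parameter are hypothetical.

```python
from typing import Dict, List


def expand_to_parents(hits: List[Dict], max_parents: int = 3) -> List[Dict]:
    """Map ranked small-chunk hits to parent sections, deduplicated in rank order."""
    seen = set()
    parents: List[Dict] = []
    for hit in hits:  # assumed sorted best-first by the vector search
        key = (hit["doc_id"], hit["section_idx"])
        if key in seen:
            continue  # several small chunks can share one parent section
        seen.add(key)
        parents.append({
            "doc_id": hit["doc_id"],
            "section_idx": hit["section_idx"],
            "section_title": hit["section_title"],
            "context": hit["section_text"],
        })
        if len(parents) == max_parents:
            break
    return parents
```

Deduplicating by parent keeps the downstream context from containing the same section several times when multiple small chunks in it match the query.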
Chunk overlap. Most chunking strategies include overlap between adjacent chunks (10–20% of chunk size is typical). The overlap ensures that content near chunk boundaries isn't lost: a query that matches content split across two chunks can match either chunk because the boundary content appears in both. Without overlap, boundary content can be effectively invisible to retrieval. Index size grows roughly in proportion to the overlap; the quality benefit usually justifies it.
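The index-size cost of overlap can be estimated with a little arithmetic, sketched here for fixed-size windows. The function names are hypothetical, and the inflation ratio is an approximation that holds for documents much longer than one chunk.

```python
import math


def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of fixed-size windows needed to cover total_tokens."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new tokens contributed by each extra window
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


def index_inflation(chunk_size: int, overlap: int) -> float:
    """Approximate ratio of indexed tokens to source tokens for long documents."""
    return chunk_size / (chunk_size - overlap)


print(chunk_count(100_000, 512, 0), chunk_count(100_000, 512, 50))
print(round(index_inflation(512, 50), 3))
```

Under these assumptions a 50-token overlap on 512-token chunks adds roughly 11% to the number of indexed vectors, and a 20% overlap adds roughly 25%.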
Chunk metadata. Each chunk should carry metadata that retrieval can use: the source document ID (so retrieval can group chunks by document or fetch the parent document); the chunk position within the document (for ordering and context); the source section/header (for displaying "from section X" in results); the chunk size (for adapting downstream processing). The metadata is part of the indexed representation and consumed at retrieval time.
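A per-chunk record carrying the metadata listed above could be modeled as a small dataclass. The class name ChunkRecord, its field names, and the doc_id:position identifier format are illustrative assumptions, not a schema required by any particular vector store.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ChunkRecord:
    """Hypothetical per-chunk record; field names are illustrative."""
    doc_id: str             # source document ID, for grouping or parent fetch
    position: int           # 0-based order of the chunk within the document
    section: Optional[str]  # source section or header, for display
    token_count: int        # chunk size, for downstream budgeting
    text: str

    @property
    def chunk_id(self) -> str:
        return f"{self.doc_id}:{self.position}"


def to_payload(record: ChunkRecord) -> Dict:
    """Flatten a record into the dict stored alongside the vector."""
    payload = asdict(record)
    payload["chunk_id"] = record.chunk_id
    return payload


def group_by_document(records: List[ChunkRecord]) -> Dict[str, List[ChunkRecord]]:
    """Group retrieved chunks by document and restore reading order."""
    grouped: Dict[str, List[ChunkRecord]] = {}
    for rec in records:
        grouped.setdefault(rec.doc_id, []).append(rec)
    for recs in grouped.values():
        recs.sort(key=lambda r: r.position)
    return grouped
```

Storing the position explicitly is what makes it possible to reassemble neighboring chunks in order after retrieval returns them ranked by similarity.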
Document type considerations. Different document types benefit from different strategies. Code documentation: chunk by function or class. API documentation: chunk by endpoint. Books: chunk by chapter or section. Technical articles: chunk by section, with semantic chunking within sections. Product descriptions: typically short enough not to need chunking. The strategy should fit the structure of the documents being indexed.
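As one concrete instance of structure-aware chunking, the sketch below splits Python source into one chunk per top-level function or class using the standard-library ast module. It assumes the documents being indexed are Python source files; the function name chunk_python_source and the output field names are hypothetical, and other languages or documentation formats would need their own parser.

```python
import ast
from typing import Dict, List


def chunk_python_source(source: str) -> List[Dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks: List[Dict] = []
    for node in tree.body:  # top-level statements only; methods stay in their class
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Module-level statements such as imports and constants are skipped in this sketch; a fuller version might collect them into a separate preamble chunk.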
Chunk size tuning. The optimal chunk size depends on the workload. Small chunks (256 tokens): precise retrieval but may lack context. Large chunks (1024 tokens): more context but may dilute meaning. Most production deployments find optimal sizes between 384 and 768 tokens. The tuning is empirical: test multiple sizes against the workload's queries and pick the size with the best retrieval metrics (NDCG@K and MRR, covered in Volume 5).
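The empirical comparison can be sketched with plain implementations of MRR and binary-relevance NDCG@K. The function names and the shape of the results dictionary are assumptions; the sketch also assumes relevance is labeled with identifiers that are comparable across chunk sizes (for example source section IDs), since chunk IDs themselves change whenever the size changes.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked result IDs, relevant IDs) for one query


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@K; assumes ranked_ids contains no duplicates."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, rid in enumerate(ranked_ids[:k], start=1)
        if rid in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_reciprocal_rank(runs: List[Run]) -> float:
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


def best_chunk_size(runs_by_size: Dict[int, List[Run]]) -> int:
    """Pick the chunk size whose query runs give the highest MRR."""
    return max(runs_by_size, key=lambda size: mean_reciprocal_rank(runs_by_size[size]))
```

MRR is used as the selection criterion here only for brevity; swapping in mean NDCG@K, or requiring a size to win on both, is an equally reasonable choice.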
These strategies apply to any production system doing vector retrieval on long documents: RAG pipelines (agentic AI Volume 10), enterprise search over documentation, and content discovery over articles or research papers. In all of these, the chunking strategy is foundational to retrieval quality.
Alternatives: skip chunking for short documents, where the entire document fits in one embedding; or retrieve at document level and extract the relevant spans at ranking time, for cases where whole-document retrieval is sufficient but generation needs specific passages. The chunking decision depends on document type and retrieval needs.
- LangChain documentation on text splitters
- LlamaIndex documentation on chunking
- Production methodology writings on RAG chunking (2023–2026)
Code
# Semantic chunking implementation - the modern production default
import re
from typing import Dict, List

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')


def semantic_chunk(
    text: str,
    similarity_threshold: float = 0.5,
    min_chunk_tokens: int = 128,
    max_chunk_tokens: int = 768,
) -> List[Dict]:
    """Split text into chunks at semantic boundaries.

    Algorithm:
    1. Split into sentences
    2. Embed each sentence
    3. Compute similarity between adjacent sentences
    4. Split where similarity drops below threshold (topic boundary)
    5. Respect min/max chunk size constraints

    Token counts are approximated by whitespace-separated words. A single
    sentence longer than max_chunk_tokens becomes its own oversized chunk.
    """
    # Split into sentences (production: use spaCy or nltk for robust splitting)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    if len(sentences) < 2:
        return [{"text": text, "start_sent": 0, "end_sent": 0}]

    # Embed all sentences; normalized vectors make dot product == cosine similarity
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # Compute similarity between each pair of adjacent sentences
    similarities = [
        float(np.dot(embeddings[i], embeddings[i + 1]))
        for i in range(len(sentences) - 1)
    ]

    # Walk the sentences, closing a chunk at each boundary
    chunks = []
    current_chunk_start = 0
    current_token_count = len(sentences[0].split())
    for i, sim in enumerate(similarities):
        next_sent = sentences[i + 1]
        next_tokens = len(next_sent.split())

        # Force-split if max size would be exceeded
        if current_token_count + next_tokens > max_chunk_tokens:
            chunks.append({
                "text": " ".join(sentences[current_chunk_start:i + 1]),
                "start_sent": current_chunk_start,
                "end_sent": i,
            })
            current_chunk_start = i + 1
            current_token_count = next_tokens
            continue

        # Topic-boundary split if similarity dips and chunk is big enough
        if sim < similarity_threshold and current_token_count >= min_chunk_tokens:
            chunks.append({
                "text": " ".join(sentences[current_chunk_start:i + 1]),
                "start_sent": current_chunk_start,
                "end_sent": i,
            })
            current_chunk_start = i + 1
            current_token_count = next_tokens
        else:
            current_token_count += next_tokens

    # Final chunk
    if current_chunk_start < len(sentences):
        chunks.append({
            "text": " ".join(sentences[current_chunk_start:]),
            "start_sent": current_chunk_start,
            "end_sent": len(sentences) - 1,
        })
    return chunks


# Hierarchical chunking: small chunks for retrieval, larger context available
def hierarchical_chunk(doc: dict) -> dict:
    """Produce small chunks linked to their parent section."""
    # parse_into_sections is implementation-specific; from its use below it is
    # expected to return a list of dicts with "text" and "title" keys.
    sections = parse_into_sections(doc["text"])
    all_small_chunks = []
    for section_idx, section in enumerate(sections):
        small_chunks = semantic_chunk(section["text"], max_chunk_tokens=256)
        for chunk in small_chunks:
            all_small_chunks.append({
                **chunk,
                "doc_id": doc["id"],
                "section_idx": section_idx,
                # available for context at retrieval time
                "section_text": section["text"],
                "section_title": section["title"],
            })
    return {
        "doc_id": doc["id"],
        "chunks": all_small_chunks,
        "sections": sections,
    }