Source: Malkov and Yashunin, "HNSW" (arXiv 2016, published 2018); Jégou et al. on IVF (2011); production implementations in Pinecone, Weaviate, Qdrant, Elasticsearch / OpenSearch (k-NN plugin), Vespa, Vertex AI Vector Search

Classification — Approximate nearest neighbor retrieval over learned embeddings.

Intent

Retrieve documents based on semantic similarity by encoding queries and documents into a shared embedding space and finding nearest neighbors in that space, with approximate algorithms that scale to billions of vectors at sub-100ms latency.

Motivating Problem

Lexical retrieval misses semantic matches. A query for "pain reliever" doesn't match documents about "analgesic" without explicit synonym configuration. Paraphrases ("how do I install Python" vs "python installation guide") require manual query rewriting. Cross-lingual retrieval requires translation infrastructure. Dense vector retrieval addresses these gaps by working in a learned semantic space where conceptually-related terms produce similar vectors regardless of exact lexical overlap.

How It Works

Embedding generation. A pre-trained or fine-tuned embedding model (sentence-transformers, BGE, E5, OpenAI text-embedding-3, Voyage, Cohere embed) encodes both documents (at index time) and queries (at query time) into dense vectors, typically 384–3072 dimensions. The model is trained so that semantically similar inputs produce vectors close to each other in the embedding space (typically measured by cosine similarity or inner product).
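The similarity measures named above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the three-dimensional vectors are made-up values standing in for model output, and the helper names (dot, norm, cosineSimilarity, normalize) are chosen here for illustration. It also shows why many systems normalize embeddings to unit length: after normalization, inner product and cosine similarity coincide.

```typescript
// Inner product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Euclidean length of a vector.
function norm(a: number[]): number {
  return Math.sqrt(dot(a, a));
}

// Cosine similarity: inner product divided by the product of the lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (norm(a) * norm(b));
}

// Scale a vector to unit length so that inner product equals cosine similarity.
function normalize(a: number[]): number[] {
  const n = norm(a);
  return a.map((x) => x / n);
}

// Toy 3-dimensional "embeddings" (made-up values, not real model output).
const painReliever = [0.9, 0.1, 0.3];
const analgesic = [0.8, 0.2, 0.4];
const pythonInstall = [0.1, 0.9, 0.2];

const simRelated = cosineSimilarity(painReliever, analgesic);
const simUnrelated = cosineSimilarity(painReliever, pythonInstall);
const ipNormalized = dot(normalize(painReliever), normalize(analgesic));

console.log({ simRelated, simUnrelated, ipNormalized });
```

In this toy data the two related terms score far higher than the unrelated pair, and the inner product of the normalized vectors matches the cosine similarity up to floating-point error.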

Index construction. Dense vectors are stored in a vector index designed for approximate nearest neighbor (ANN) search. HNSW (Hierarchical Navigable Small World) is the dominant algorithm in production: it constructs a layered graph in which sparse upper layers provide long-range links and the dense bottom layer connects each vector to its near neighbors. Search descends greedily from the top layer, so query cost grows roughly logarithmically with collection size. IVF (Inverted File Index) is the main alternative: it partitions the vector space into clusters and searches only the clusters closest to the query vector. Different algorithms trade off index size, build time, query latency, and recall.
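The IVF idea can be illustrated with a toy index. This is a simplified sketch under stated assumptions, not the algorithm of any particular library: the centroids are supplied by hand (a real index would learn them, typically with k-means), vectors are stored uncompressed, and the names buildIvf, searchIvf, and nprobe are chosen here for illustration. HNSW is not shown because a faithful version does not fit in a short example.

```typescript
type Vec = number[];

function squaredDistance(a: Vec, b: Vec): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Indices of `points` sorted by increasing distance to `target`, truncated to `count`.
function nearestIndices(points: Vec[], target: Vec, count: number): number[] {
  return points
    .map((p, i) => ({ i, d: squaredDistance(p, target) }))
    .sort((x, y) => x.d - y.d)
    .slice(0, count)
    .map((x) => x.i);
}

interface IvfIndex {
  centroids: Vec[];
  vectors: Vec[];
  lists: number[][]; // lists[c] holds the ids of vectors assigned to centroid c
}

// Build: assign each vector to its single closest centroid.
// Assumption: centroids are given; a real index would learn them from the data.
function buildIvf(vectors: Vec[], centroids: Vec[]): IvfIndex {
  const lists: number[][] = centroids.map(() => []);
  vectors.forEach((v, id) => {
    const c = nearestIndices(centroids, v, 1)[0];
    lists[c].push(id);
  });
  return { centroids, vectors, lists };
}

// Search: scan only the `nprobe` clusters whose centroids are closest to the query.
function searchIvf(ivfIndex: IvfIndex, target: Vec, k: number, nprobe: number): number[] {
  const probed = nearestIndices(ivfIndex.centroids, target, nprobe);
  const candidateIds: number[] = [];
  for (const c of probed) {
    for (const id of ivfIndex.lists[c]) {
      candidateIds.push(id);
    }
  }
  const candidates = candidateIds.map((id) => ivfIndex.vectors[id]);
  return nearestIndices(candidates, target, k).map((pos) => candidateIds[pos]);
}

const toyVectors: Vec[] = [
  [0.0, 0.1], [0.2, 0.0], [0.1, 0.3], // ids 0-2, near centroid 0
  [5.0, 5.1], [5.2, 4.9], [4.8, 5.0], // ids 3-5, near centroid 1
  [2.4, 2.4],                         // id 6, between clusters, slightly closer to centroid 0
];
const toyCentroids: Vec[] = [[0, 0], [5, 5]];
const ivf = buildIvf(toyVectors, toyCentroids);

// The query sits just on the other side of the cluster boundary from vector 6.
const toyQuery: Vec = [2.6, 2.6];
const oneProbe = searchIvf(ivf, toyQuery, 1, 1);
const twoProbes = searchIvf(ivf, toyQuery, 1, 2);

console.log({ lists: ivf.lists, oneProbe, twoProbes });
```

With one probed cluster the search misses the true nearest neighbor (vector 6), because that vector was assigned to the other cluster. Probing both clusters finds it at the cost of scanning more vectors, which is the recall versus latency trade-off in miniature.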

Query-time retrieval. The query is encoded into a vector using the same embedding model. The vector index returns approximate K-nearest-neighbors with their similarity scores. The candidates are typically the top 50–500 documents by vector similarity; downstream stages (ranking, reranking) refine the ordering.
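Because the index returns approximate neighbors, it helps to have an exact baseline to compare against. The sketch below assumes small unit-length toy vectors and a hypothetical ANN result list (the ids in approxIds are invented for illustration, not output from a real index). It computes the exact top-K by brute force and then the fraction of those results the approximate list recovered.

```typescript
interface Scored {
  id: number;
  score: number;
}

function innerProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Exact (brute-force) top-K by inner product: scores every document, O(N * d) per query.
function exactTopK(docs: number[][], q: number[], k: number): Scored[] {
  return docs
    .map((v, id) => ({ id, score: innerProduct(v, q) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Fraction of the exact top-K ids that the approximate result list also contains.
function annRecall(approxIds: number[], exactIds: number[]): number {
  const hits = exactIds.filter((id) => approxIds.indexOf(id) !== -1).length;
  return hits / exactIds.length;
}

// Toy unit-length document vectors (made-up values).
const docVectors = [
  [1, 0],     // id 0
  [0.8, 0.6], // id 1
  [0.6, 0.8], // id 2
  [0, 1],     // id 3
  [-1, 0],    // id 4
];
const queryVector = [0.6, 0.8];

const exact = exactTopK(docVectors, queryVector, 3);
const exactIds = exact.map((s) => s.id);

// Hypothetical output of an ANN index for the same query (assumed, not measured).
const approxIds = [2, 3, 0];
const recall = annRecall(approxIds, exactIds);

console.log({ exactIds, approxIds, recall });
```

Brute force is too slow for large corpora, but running it on a sample of queries gives the ground truth needed to tune index parameters such as the number of candidates explored.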

Embedding model selection. The choice of embedding model substantially affects retrieval quality. General-purpose models (OpenAI text-embedding-3-large, BGE-large, Voyage 3) work for many use cases. Domain-specific models (legal, medical, code) often outperform general models on their domains. Models fine-tuned on domain-specific labeled data typically outperform off-the-shelf models when enough training data is available. Model selection is itself a discipline; public benchmarks (MTEB, BEIR) provide comparison points, but production evaluation on the actual workload is essential.
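Evaluating candidate models on the actual workload usually comes down to a small set of judged queries and a couple of ranking metrics. The following sketch assumes each model's ranked results have already been collected; the document ids, relevance judgments, and the labels modelA and modelB are invented for illustration. It computes mean recall@k and mean reciprocal rank (MRR) for each model.

```typescript
interface JudgedQuery {
  ranked: string[];   // document ids in the order the model returned them
  relevant: string[]; // ids judged relevant for this query
}

// Share of the relevant documents that appear in the top k results.
function recallAt(k: number, q: JudgedQuery): number {
  const topK = q.ranked.slice(0, k);
  const found = q.relevant.filter((id) => topK.indexOf(id) !== -1).length;
  return found / q.relevant.length;
}

// 1 / rank of the first relevant result, or 0 if none was returned.
function reciprocalRank(q: JudgedQuery): number {
  for (let i = 0; i < q.ranked.length; i++) {
    if (q.relevant.indexOf(q.ranked[i]) !== -1) {
      return 1 / (i + 1);
    }
  }
  return 0;
}

function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Hypothetical results for the same two queries from two candidate models.
const modelA: JudgedQuery[] = [
  { ranked: ["d1", "d7", "d3"], relevant: ["d1", "d3"] },
  { ranked: ["d4", "d2", "d9"], relevant: ["d2"] },
];
const modelB: JudgedQuery[] = [
  { ranked: ["d7", "d8", "d1"], relevant: ["d1", "d3"] },
  { ranked: ["d2", "d4", "d9"], relevant: ["d2"] },
];

const report = {
  modelA: {
    recallAt2: mean(modelA.map((q) => recallAt(2, q))),
    mrr: mean(modelA.map(reciprocalRank)),
  },
  modelB: {
    recallAt2: mean(modelB.map((q) => recallAt(2, q))),
    mrr: mean(modelB.map(reciprocalRank)),
  },
};

console.log(report);
```

Two queries are far too few to draw conclusions from; the point is only the shape of the comparison, which stays the same when the judged set grows to hundreds of real queries.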

Production trade-offs. Dense retrieval has a higher per-query cost than BM25 (embedding generation at query time plus a vector index lookup), but the cost is bounded and predictable. Indexes are usually larger than an inverted index over the same corpus: a 384-dimensional float32 vector takes about 1.5 KB before any graph or cluster overhead, though quantization (int8, binary) reduces this. Corpus updates require re-embedding the affected documents, so high-update-rate corpora may need batch re-embedding strategies. The embedding model is itself a dependency that needs versioning and migration management: changing models means re-embedding the corpus, because vectors from different models are not comparable.
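The storage arithmetic is easy to check directly. The sketch below counts raw vector bytes only; it deliberately ignores index overhead such as HNSW neighbor lists, and the function names are chosen here for illustration. The binary quantizer shown (keep the sign of each component, compare by Hamming distance) is one simple scheme assumed for the example, not a description of what any specific engine does.

```typescript
type Encoding = "float32" | "int8" | "binary";

// Raw storage for one vector, excluding any index overhead.
function bytesPerVector(dim: number, enc: Encoding): number {
  if (enc === "float32") return dim * 4;  // 4 bytes per component
  if (enc === "int8") return dim;         // 1 byte per component
  return Math.ceil(dim / 8);              // 1 bit per component
}

function rawVectorBytes(numVectors: number, dim: number, enc: Encoding): number {
  return numVectors * bytesPerVector(dim, enc);
}

// Sign-based binary quantization: 1 for positive components, 0 otherwise.
function toBinary(v: number[]): number[] {
  return v.map((x) => (x > 0 ? 1 : 0));
}

// Number of positions where two bit vectors differ.
function hammingDistance(a: number[], b: number[]): number {
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) diff++;
  }
  return diff;
}

const dim = 384;
const sizes = {
  float32: bytesPerVector(dim, "float32"),
  int8: bytesPerVector(dim, "int8"),
  binary: bytesPerVector(dim, "binary"),
  millionFloat32: rawVectorBytes(1000000, dim, "float32"),
};

const bitsA = toBinary([0.3, -0.2, 0.7, -0.1]);
const bitsB = toBinary([0.1, 0.4, 0.9, -0.5]);
const bitDistance = hammingDistance(bitsA, bitsB);

console.log({ sizes, bitsA, bitsB, bitDistance });
```

For 384 dimensions this gives 1536 bytes per float32 vector, 384 bytes for int8, and 48 bytes for binary, so one million float32 vectors need about 1.5 GB for the vectors alone. Quantization trades some recall for that reduction, which is why quantized candidates are often rescored against full-precision vectors.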

When to Use It

Production search where semantic matching matters beyond what synonym engineering provides. RAG pipelines for LLM-based agents (covered in Volume 10 of the agentic AI series). Conversational search interfaces. Cross-lingual retrieval. Concept-level search in domains where users naturally use varied vocabulary (consumer search, customer support, knowledge management).

Alternatives — lexical retrieval (Section A) alone for narrow use cases dominated by exact and identifier matches. Hybrid retrieval (Section C) for the dominant production pattern combining both. Sparse-learned retrieval (next entry) as an alternative that preserves some lexical-style interpretability.

Sources

Malkov and Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (2018)
Karpukhin et al., "Dense Passage Retrieval for Open-Domain QA" (2020)
BEIR benchmark suite (github.com/beir-cellar/beir)
MTEB leaderboard for embedding model comparison (huggingface.co/spaces/mteb/leaderboard)

Example artifacts

Code

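// OpenSearch k-NN plugin query (using a pre-computed query vector)
// Assumes the index has a knn_vector field 'embedding' with HNSW configuration
GET /products/_search
{
  "size": 100,
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.012, -0.034, 0.156, ...], // 384-dim query embedding
        "k": 100
      }
    }
  }
}

// Elasticsearch equivalent: top-level knn option on a dense_vector field 'embedding'
// (num_candidates is an Elasticsearch parameter, not part of the OpenSearch query above)
GET /products/_search
{
  "size": 100,
  "knn": {
    "field": "embedding",
    "query_vector": [0.012, -0.034, 0.156, ...], // 384-dim query embedding
    "k": 100,
    "num_candidates": 500 // explore more candidates per shard for better recall
  }
}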

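// Pinecone equivalent
// Assumes PINECONE_API_KEY is set in the environment and queryEmbedding
// is a number[] produced by the same model used at index time
import { Pinecone } from '@pinecone-database/pinecone';
const pc = new Pinecone();
const index = pc.index('products');
const results = await index.query({
  vector: queryEmbedding,
  topK: 100,
  includeMetadata: true
});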

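// Weaviate equivalent (using nearVector, weaviate-ts-client v2 API)
// The scheme and host below are placeholder assumptions for a local instance
import weaviate from 'weaviate-ts-client';
const client = weaviate.client({ scheme: 'http', host: 'localhost:8080' });
const result = await client.graphql.get()
  .withClassName('Product')
  .withNearVector({ vector: queryEmbedding })
  .withLimit(100)
  .withFields('title description _additional { distance }')
  .do();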

Dense vector retrieval (HNSW, IVF)