Source: CLIP (Radford et al., 2021); production methodology for multi-modal search; modern multi-modal embedding models

Classification — Pattern for indexing documents with embeddings from multiple modalities, supporting cross-modal retrieval (text-to-image, image-to-text, audio-to-text).

Intent

Index documents that combine text, images, audio, or video by extracting embeddings from each modality and storing them as separate vector fields, supporting retrieval that matches across modalities.

Motivating Problem

E-commerce products have text descriptions and product images; users sometimes search with text ("red running shoes") and sometimes with images ("shoes that look like this"). Content discovery spans text articles, podcast episodes (audio), and video clips; users want to find content across modalities. Pure text indexing misses the visual and audio signals; pure modality-specific indexing fragments the search experience. Multi-modal indexing addresses this by extracting embeddings from each modality and combining them in retrieval.

How It Works

Cross-modal embedding models. The key technology is a model that embeds multiple modalities into a shared vector space. CLIP (OpenAI, 2021) was the breakthrough: it embeds text and images into the same space, so that a text query and a matching image produce similar vectors. Modern alternatives include OpenCLIP (open-source CLIP implementations and checkpoints), SigLIP (Google's CLIP-style model trained with a sigmoid loss), and Cohere's multi-modal embeddings. BGE-M3 is multilingual but text-only, so it suits the text fields rather than cross-modal matching. These models produce embeddings that support cross-modal queries: text against images, images against images, and images against text.
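The shared-space property can be illustrated with a minimal sketch. The vectors below are hand-written toy values, not output from CLIP or any real model; in practice the query vector would come from the model's text encoder and the image vectors from its image encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real embeddings (assumed values).
text_query = [0.9, 0.1, 0.0]  # would come from the text encoder for "red running shoes"
image_vecs = {
    "red_sneaker.jpg": [0.8, 0.2, 0.1],  # would come from the image encoder
    "blue_sofa.jpg": [0.0, 0.3, 0.9],
}

# Rank images by similarity to the text query in the shared space.
ranked = sorted(
    image_vecs,
    key=lambda name: cosine(text_query, image_vecs[name]),
    reverse=True,
)
```

Because both encoders write into the same space, one similarity function serves text-to-image, image-to-image, and image-to-text comparisons.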

Indexing pipeline. For each document with multiple modalities, extract an embedding from each modality using the appropriate model (or a unified multi-modal model). For products: a text embedding from the title and description and an image embedding from the product photo. For articles with images: a text embedding from the body and image embeddings from the inline images. Store each embedding as a separate vector field so that the schema supports retrieval against any modality.
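A minimal sketch of the index-time step for a product follows. The functions embed_text and embed_image are hypothetical placeholders: they derive a deterministic pseudo-vector from a hash so the example runs without a model, and they carry no semantic meaning. A real pipeline would call the text and image encoders of the chosen multi-modal model here.

```python
import hashlib

DIM = 8  # toy dimensionality; real models emit hundreds of dimensions

def _toy_embed(payload):
    """Placeholder encoder: deterministic pseudo-embedding from a hash (not semantic)."""
    digest = hashlib.sha256(payload).digest()
    return [b / 255.0 for b in digest[:DIM]]

def embed_text(text):
    """Hypothetical stand-in for the model's text encoder."""
    return _toy_embed(text.encode("utf-8"))

def embed_image(image_bytes):
    """Hypothetical stand-in for the model's image encoder."""
    return _toy_embed(image_bytes)

def build_product_record(product_id, title, description, image_bytes):
    """Extract one embedding per modality and store each as its own vector field."""
    return {
        "id": product_id,
        "title": title,
        "title_vec": embed_text(title),
        "description_vec": embed_text(description),
        "image_vec": embed_image(image_bytes),
    }

record = build_product_record(
    "sku-1", "Red running shoes", "Lightweight mesh upper.", b"fake-image-bytes"
)
```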

Storage and indexing. Each modality's embedding becomes a vector field in the document schema (Section A pattern). For a product: title_vec (text), description_vec (text), image_vec (visual). Each vector field is indexed independently with an appropriate approximate nearest neighbor (ANN) structure. Retrieval can match text queries against the text fields, image queries against the image fields (after embedding the query image into the same space), or both at once with hybrid queries.
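The one-index-per-field layout can be sketched as follows. VectorFieldIndex is a hypothetical class that does exhaustive cosine search; it stands in for whatever ANN structure the vector store actually provides and is not any particular engine's API.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class VectorFieldIndex:
    """Brute-force stand-in for one ANN index over a single vector field."""

    def __init__(self, field_name):
        self.field_name = field_name
        self._rows = []  # list of (doc_id, vector)

    def add(self, doc_id, vector):
        self._rows.append((doc_id, vector))

    def search(self, query_vec, k=10):
        scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in self._rows]
        scored.sort(reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:k]]

FIELDS = ("title_vec", "description_vec", "image_vec")
indexes = {name: VectorFieldIndex(name) for name in FIELDS}

def index_document(doc):
    """Add each vector field the document has to that field's own index."""
    for name in FIELDS:
        if name in doc:  # a document may lack a modality
            indexes[name].add(doc["id"], doc[name])

index_document({"id": "p1", "title_vec": [1.0, 0.0],
                "description_vec": [0.9, 0.1], "image_vec": [0.8, 0.2]})
index_document({"id": "p2", "title_vec": [0.0, 1.0], "image_vec": [0.1, 0.9]})

hits = indexes["image_vec"].search([1.0, 0.0], k=1)
```

Keeping the indexes separate means a document missing one modality (p2 has no description here) is simply absent from that field's index.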

Modality-specific extraction. Different modalities have different extraction patterns. Images: process through a vision model (CLIP or similar) to produce a single embedding per image; for documents with multiple images, decide whether to store one embedding per image (more storage, finer-grained matching) or pool images into one document-level embedding (less storage, coarser matching). Audio: process through an audio model (CLAP or similar) or first transcribe to text and embed the transcript. Video: process keyframes through a vision model and/or audio track through an audio/text pipeline.
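The per-image versus pooled choice can be sketched with toy vectors. Mean pooling followed by re-normalization is an assumption here, one common way to pool; scoring a document by its best-matching image (max_sim) is likewise an illustrative choice for the per-image option.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors):
    """Average per-image embeddings into one document-level vector, then re-normalize."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return l2_normalize(mean)

def max_sim(query, vectors):
    """Per-image option: score a document by its best-matching image."""
    return max(dot(query, v) for v in vectors)

# Toy unit vectors for two very different photos of the same product.
images = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
query = l2_normalize([1.0, 0.0])  # matches the first photo exactly

pooled_score = dot(query, mean_pool(images))
per_image_score = max_sim(query, images)
```

In this toy case the pooled vector sits between the two photos, so the exact match on the first photo scores lower against it than against the per-image embeddings. That is the coarser matching the pooled option trades for storage.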

Cross-modal retrieval. The retrieval-side pattern: a text query is embedded with the text side of the multi-modal model; the embedding is used in an ANN search against the image_vec field; the results are products whose images semantically match the text query. The pattern works because the model embeds text and images into the same space, so nearest neighbors in that space are semantically related across modalities. The query names which fields to search at retrieval time; multi-field queries (searching both the text fields and image_vec) provide additional flexibility.
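A minimal sketch of a query that names its fields follows. The catalog vectors and the query vector are toy values, exhaustive scoring stands in for ANN search, and combining fields by a weighted sum of cosine scores is an assumption; other fusion schemes are possible.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy catalog; every vector is assumed to live in the same shared space.
CATALOG = [
    {"id": "p1", "description_vec": [0.9, 0.1, 0.0], "image_vec": [0.8, 0.2, 0.1]},
    {"id": "p2", "description_vec": [0.7, 0.3, 0.0], "image_vec": [0.1, 0.2, 0.9]},
    {"id": "p3", "description_vec": [0.0, 0.2, 0.9], "image_vec": [0.0, 0.1, 0.9]},
]

def search(query_vec, field_weights, k=3):
    """Score each document on the named vector fields; sum the weighted scores."""
    scored = []
    for doc in CATALOG:
        score = sum(
            weight * cosine(query_vec, doc[field])
            for field, weight in field_weights.items()
        )
        scored.append((score, doc["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Stand-in for the text encoder's output for "red running shoes".
query_vec = [1.0, 0.0, 0.0]

image_only = search(query_vec, {"image_vec": 1.0})  # text-to-image
both = search(query_vec, {"description_vec": 0.5, "image_vec": 0.5})  # multi-field
```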

Embedding model trade-offs. CLIP and its successors are general-purpose: they work for many use cases but are not optimized for specific domains. Domain-fine-tuned CLIP models (FashionCLIP for clothing, FoodCLIP for food, BioCLIP for biology) typically retrieve better on their own domains than the general-purpose checkpoints. Production deployments often fine-tune CLIP on labeled domain pairs (image, text description) to capture domain-specific semantics.
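Fine-tuning on (image, text) pairs usually reuses a CLIP-style symmetric contrastive objective: within a batch, each image should score highest against its own text and each text against its own image. The sketch below computes that loss in plain Python on toy unit vectors. The fixed temperature of 0.07 is an assumed value (CLIP learns this parameter during training), and a real fine-tuning run would compute the loss on encoder outputs inside a deep learning framework.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]

def symmetric_contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Mean of image-to-text and text-to-image cross-entropy; pair i matches pair i."""
    n = len(image_vecs)
    logits = [[dot(img, txt) / temperature for txt in text_vecs] for img in image_vecs]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy unit vectors: two images and their captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]  # caption i matches image i
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]  # captions paired with the wrong image

aligned_loss = symmetric_contrastive_loss(images, aligned_texts)
swapped_loss = symmetric_contrastive_loss(images, swapped_texts)
```

The aligned batch yields a loss near zero and the swapped batch a large one, which is the signal that pulls matching domain pairs together during fine-tuning.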

Operational considerations. Image processing at index time has computational cost (vision model inference per image); for large catalogs the cost matters. Storage: image embeddings are typically 512–1024 dim, similar to text embeddings, with similar storage characteristics. Update frequency: image embeddings don't need to be recomputed unless the image or model changes; text embeddings should be recomputed when content changes substantively. The pipeline complexity is higher than text-only indexing but well-established patterns exist.
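Two of these considerations lend themselves to a short sketch: a raw storage estimate and a check for when an embedding must be recomputed. The catalog size, the float32 assumption (4 bytes per value), and the model label "clip-vit-b32" are illustrative, not taken from any specific deployment. The fingerprint triggers on any byte-level change, so deciding whether a text edit is substantive would need additional logic.

```python
import hashlib

def index_size_bytes(num_docs, dims_per_field, bytes_per_value=4):
    """Raw vector storage only; excludes ANN structure overhead and metadata."""
    return num_docs * sum(dims_per_field.values()) * bytes_per_value

def fingerprint(content, model_version):
    """Changes when either the content bytes or the embedding model changes."""
    payload = model_version.encode("utf-8") + b"\x00" + content
    return hashlib.sha256(payload).hexdigest()

def needs_reembedding(stored_fingerprint, content, model_version):
    return stored_fingerprint != fingerprint(content, model_version)

# One million products, three 512-dimensional float32 fields each:
# 1,000,000 * (512 + 512 + 512) * 4 bytes = 6,144,000,000 bytes.
raw_bytes = index_size_bytes(
    1_000_000, {"title_vec": 512, "description_vec": 512, "image_vec": 512}
)

# Fingerprint stored alongside the image embedding at index time.
stored = fingerprint(b"image-bytes-v1", "clip-vit-b32")

unchanged = needs_reembedding(stored, b"image-bytes-v1", "clip-vit-b32")
image_changed = needs_reembedding(stored, b"image-bytes-v2", "clip-vit-b32")
model_changed = needs_reembedding(stored, b"image-bytes-v1", "clip-vit-l14")
```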

When to Use It

E-commerce search where product images are central to the user experience (fashion, home decor, art). Content discovery across modalities (Spotify, YouTube). Visual search applications ("find products like this"). Domain-specific multi-modal search (medical imaging plus text, satellite imagery plus text descriptions). Cross-modal search is becoming increasingly common as multi-modal LLMs and embeddings mature.

Alternatives — text-only indexing where images are tangential. Separate single-modality search engines that don't share a vector space (older pattern that doesn't support cross-modal queries). The multi-modal pattern is the modern default where multiple modalities are part of the corpus.

Sources

Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
OpenCLIP repository (github.com/mlfoundations/open_clip)
Zhai et al., "Sigmoid Loss for Language Image Pre-Training" (SigLIP, 2023) and implementations
Cohere multi-modal embedding documentation
Production methodology writings on multi-modal search at major e-commerce companies

Multi-modal embedding for cross-modal search