Also known as: joint embeddings, cross-modal embeddings, CLIP-style embeddings
TL;DR
An embedding space shared across modalities — text, image, audio, video — so a query in one modality retrieves content in another. CLIP-style contrastive training is the dominant recipe. Doing it well is far harder than doing it at all.
Multimodal embeddings map inputs from different modalities (text, images, audio, video) into a single shared vector space. A picture of a dog and the word “dog” land near each other; the same picture and the word “airplane” land far apart. Once the space exists, retrieval is just cosine similarity: embed the query in any modality, then search for nearest neighbors regardless of which modality they came from.
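To make the retrieval step concrete, here is a minimal sketch of cosine-similarity search over a shared space. The vectors are hand-written toy values standing in for encoder outputs, and the helper names (`l2_normalize`, `nearest_neighbors`) are illustrative rather than taken from any library.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def nearest_neighbors(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k index rows most cosine-similar to the query."""
    sims = l2_normalize(index) @ l2_normalize(query)
    return np.argsort(-sims)[:k]

# Toy stand-ins for image-encoder outputs (hypothetical values).
image_index = np.array([
    [0.9, 0.1, 0.0],  # dog photo
    [0.0, 0.2, 0.9],  # airplane photo
    [0.8, 0.3, 0.1],  # another dog photo
])
# Toy stand-in for the text-encoder output of the word "dog".
text_query = np.array([1.0, 0.2, 0.0])

top = nearest_neighbors(text_query, image_index, k=2)
print(top)  # indices of the two dog photos, best match first
```

At production scale the brute-force matrix product would normally be replaced by an approximate nearest-neighbor index, but the scoring rule stays the same.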
This unlocks cross-modal search (text → image, image → image, audio → text), zero-shot classification (compare an image to a list of class names embedded as text), and the encoder side of most vision-language models. CLIP set the recipe at scale in 2021; nearly everything since is a variation on it.
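Zero-shot classification can be sketched as follows: embed each class name as text, compare against the image embedding, and take a softmax over the similarities. The embeddings are toy values, and the function name and the `temperature` value are assumptions made for illustration, not settings from any particular model.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names, temperature=0.01):
    """Pick the class whose text embedding is most cosine-similar to the image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    logits = class_text_embs @ image_emb / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return class_names[int(np.argmax(probs))], probs

# Hypothetical text embeddings of the class names and one image embedding.
class_names = ["dog", "airplane", "cat"]
class_text_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0],
    [0.6, 0.8, 0.0],
])
image_emb = np.array([0.9, 0.2, 0.1])

label, probs = zero_shot_classify(image_emb, class_text_embs, class_names)
print(label, probs.round(3))
```

No classifier head is trained here: adding a class only requires embedding one more class name.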
The CLIP recipe in one paragraph
Two encoders — a vision transformer for images, a text transformer for captions — produce vectors of the same dimension. For a batch of (image, caption) pairs, compute the similarity matrix, apply InfoNCE symmetrically (image-to-text and text-to-image, averaged), and train. The dataset is hundreds of millions of (image, caption) pairs scraped from the web. Batch size is in the tens of thousands — large batches matter even more here than for text-only because cross-modal alignment is harder than within-modality similarity.
That’s the whole architecture. Two encoders, one InfoNCE loss, web-scale paired data.
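The loss described above can be sketched in a few lines of NumPy. This is an illustrative forward pass only (no gradients, no encoders), and it assumes a fixed temperature of 0.07 for simplicity; CLIP learns its temperature during training.

```python
import numpy as np

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE over a batch.

    Row i of image_embs is assumed to be paired with row i of text_embs;
    every other row in the batch acts as a negative.
    """
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = image_embs @ text_embs.T / temperature  # (batch, batch) similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # each image picks its caption
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # each caption picks its image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16))
captions = images + 0.01 * rng.normal(size=(8, 16))  # well-aligned toy pairs

aligned_loss = symmetric_info_nce(images, captions)
shuffled_loss = symmetric_info_nce(images, np.roll(captions, 1, axis=0))
print(aligned_loss, shuffled_loss)  # the aligned batch should score far lower
```

The negatives are simply the other items in the batch, so a larger batch gives each positive more negatives to be distinguished from.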
The modality gap
A surprising and persistent finding: even after contrastive alignment, image embeddings and text embeddings occupy separate cones of the unit sphere. They’re closer to their cross-modal positives than to random negatives, but the two modalities don’t fully interleave. Liang et al. (2022) measured this carefully — every CLIP-style model exhibits it.
Practical consequences:
Modality-gap symptoms
Image-to-image retrieval scores higher in absolute terms than image-to-text retrieval, even on a model trained for both.
Adding noise to embeddings sometimes helps cross-modal retrieval by bridging the gap.
Cosine thresholds for “match” don’t transfer between intra-modal and cross-modal queries — calibrate them separately.
Dimensionality-reduction plots such as t-SNE tend to show the gap as two clearly separated clusters, one per modality.
The gap is partially a residual of initialization (text and image encoders start in different regions and never fully meet), partially a property of the loss (negatives within a modality are still negatives — the loss never explicitly encourages mixing).
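One simple way to put a number on the gap is the distance between the centroids of the normalized image and text embeddings. The sketch below uses synthetic data drawn around two different directions to mimic two cones; the function name and the data are assumptions for illustration, not a reproduction of any published measurement.

```python
import numpy as np

def modality_gap(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Euclidean distance between the centroids of two sets of unit vectors."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.linalg.norm(image_embs.mean(axis=0) - text_embs.mean(axis=0)))

rng = np.random.default_rng(0)
dim = 8
image_center = np.eye(dim)[0]  # hypothetical "image cone" direction
text_center = np.eye(dim)[1]   # hypothetical "text cone" direction

image_embs = image_center + 0.1 * rng.normal(size=(200, dim))
text_embs = text_center + 0.1 * rng.normal(size=(200, dim))
more_image_embs = image_center + 0.1 * rng.normal(size=(200, dim))

gap_cross = modality_gap(image_embs, text_embs)       # two separate cones
gap_same = modality_gap(image_embs, more_image_embs)  # same cone, two samples
print(gap_cross, gap_same)
```

Running the same function on real image and caption embeddings from a held-out set gives a quick check of how far apart the modalities sit, which is useful context before setting separate intra-modal and cross-modal thresholds.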
Why doing it well is hard
Three structural problems on top of the usual contrastive challenges:
Modality imbalance. A “good” caption for an image is much shorter and lower-information than a typical text-retrieval positive. The text encoder learns a degraded signal compared to its text-only siblings — fluency, syntax, and reasoning capacity all decay. This is why CLIP’s text encoder is famously bad at sentence semantics relative to a same-size text-only embedder.
Caption quality is the ceiling. Web-scraped alt-text is noisy, generic (“image”, “photo”, “stock”), or absent. Training with bad captions teaches the model to ignore the caption-specific signal. Recent work (DALL-E 3’s recaptioning, ShareGPT4V) found that re-captioning with a strong VLM before training is one of the highest-leverage data interventions.
Hard negatives are non-trivial. Within text you can mine BM25 confusables; for image-text pairs, hard negatives need to be visually similar but textually different (or vice versa), which requires running a multimodal retriever to find them. The bootstrapping is harder.
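The hard-negative step can be sketched as follows, assuming an existing multimodal model has already produced embeddings for every (image, caption) pair. For each image, the most similar captions that are not its own become the hard negatives. The function name and toy vectors are illustrative.

```python
import numpy as np

def mine_hard_negatives(image_embs: np.ndarray, text_embs: np.ndarray, k: int = 2) -> np.ndarray:
    """For each image i (paired with caption i), return the indices of the
    k most similar captions other than its own."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = image_embs @ text_embs.T
    np.fill_diagonal(sims, -np.inf)  # never return the true pair as a negative
    return np.argsort(-sims, axis=1)[:, :k]

# Toy embeddings: pairs 0 and 1 are near-duplicates, pairs 2 and 3 are distinct.
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.95, 0.3, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_embs = text_embs.copy()  # pretend the image encoder is perfectly aligned

hard = mine_hard_negatives(image_embs, text_embs, k=2)
print(hard[:, 0])  # hardest negative caption for each image
```

In practice the top-ranked "negatives" often include unlabeled true matches, so a similarity ceiling or a manual spot check is usually needed before training on them.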
Where multimodal embeddings show up in production
Search across product photos and SKU descriptions. Reverse image search at scale. Content moderation (compare image to policy text). Zero-shot classification when labels can be named in language. The vision side of vision-language models: Flamingo and LLaVA both put a contrastively pretrained, CLIP-style image encoder upstream of an LLM. GPT-4V is often listed alongside them, but its architecture has not been published.
For text-document retrieval over images-of-documents (scanned PDFs, screenshots), specialized models like ColPali extend the ColBERT late-interaction pattern to image patches, treating each patch as a token. That’s the current frontier — token-level multimodal interaction rather than single-vector pooling.
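The late-interaction scoring rule from ColBERT can be sketched as a "MaxSim" sum: each query token takes its best-matching document patch, and the per-token maxima are added up. This is a toy illustration of the scoring rule only, with hand-written vectors and a hypothetical function name; it is not ColPali's actual implementation.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_patches: np.ndarray) -> float:
    """Late-interaction score: sum over query tokens of the maximum cosine
    similarity to any document patch."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_patches / np.linalg.norm(doc_patches, axis=1, keepdims=True)
    sims = q @ d.T  # (num_query_tokens, num_patches)
    return float(sims.max(axis=1).sum())

# Two query-token embeddings (toy values).
query_tokens = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])
# Page A has a patch matching each query token; page B matches neither.
page_a = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
page_b = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

score_a = maxsim_score(query_tokens, page_a)
score_b = maxsim_score(query_tokens, page_b)
print(score_a, score_b)
```

Unlike single-vector pooling, this keeps one vector per patch, so storage and scoring cost grow with the number of patches per page.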
The honest recommendation
For most production use cases, an off-the-shelf model such as SigLIP 2 or EVA-CLIP gets you to acceptable quality. Fine-tune only if your domain is far from web-scale photos and you have aligned pair data: the bottleneck isn’t training, it’s clean (image, text) pairs at scale.
Go further
Why is CLIP the canonical recipe?
OpenAI's CLIP (2021) showed that contrastive training on 400M (image, caption) pairs produces zero-shot classification performance competitive with supervised ImageNet models. The recipe — separate encoders, joint embedding space, symmetric InfoNCE — generalized so well to other modality pairs that it became the default and never got dethroned.
The modality gap, in brief: text and image embeddings, even when contrastively aligned, occupy different cones of the unit sphere and never fully mingle. This causes retrieval surprises (image-to-image works better than image-to-text on the same model), and it's a measurable property of every CLIP-style model. Liang et al. (2022) named and quantified it; nobody has fully fixed it.
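There are two main patterns for aligning more than two modalities. The first binds everything to a hub modality: train every other modality to align with the hub (text is the most common choice), and the remaining pairs come out approximately aligned without ever being trained together, loosely via the triangle inequality on angular distance. The second trains every pair simultaneously with a shared backbone, which is expensive but stronger. ImageBind (2023) used the hub pattern across six modalities, with images rather than text as the hub.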
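The triangle-inequality argument can be made precise for angular distance, though not for cosine similarity itself, which is not a metric. As a worked sketch with hypothetical symbols, let $a$, $h$, and $b$ be unit-norm embeddings from two non-hub modalities and the hub:

```latex
\theta(u, v) = \arccos(u \cdot v), \qquad
\theta(a, b) \le \theta(a, h) + \theta(h, b)
```

For example, if an audio embedding sits within $20^\circ$ of its hub embedding and an image embedding sits within $25^\circ$ of the same hub embedding, the two are at most $45^\circ$ apart, so their cosine similarity is at least $\cos 45^\circ \approx 0.707$. The bound loosens quickly as the individual angles grow, which is why the "for free" alignment is only approximate.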