Multimodal
When text isn't the only signal — vision, audio, and joint embedding spaces.
Multimodal models map images, audio, and video into the same representational substrate as text. The concepts below cover the architectural patterns (Vision Transformers, CLIP-style joint encoders, vision-language models that insert image tokens into an LLM context), the generation side (diffusion models, text-to-image), and the retrieval consequences (cross-modal search where a text query retrieves images, or vice versa). Multimodal support is now table stakes for any production AI product whose users hand the model a screenshot, a PDF, or a voice note.
- Audio Embeddings
Audio embeddings map a waveform or spectrogram into a fixed-size vector space where similar-sounding clips land near each other. Wav2Vec 2.0, HuBERT, and BGE-Audio set the modern recipe.
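As a rough illustration of "fixed-size vector where similar clips land near each other", the sketch below mean-pools frame-level features into one clip vector and compares clips by cosine similarity. The frame features are made-up numbers standing in for encoder outputs, `mean_pool` and `cosine` are hypothetical helper names, and mean pooling is only one common way to get a clip-level vector.

```python
import math

def mean_pool(frames):
    """Average a variable-length list of frame vectors into one fixed-size vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy frame-level features (assumed stand-ins for encoder outputs).
# The clips have different lengths, but each pools to a 3-dimensional vector.
clip_a = [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1], [0.8, 0.0, 0.1]]
clip_b = [[0.9, 0.2, 0.0], [1.0, 0.1, 0.0]]
clip_c = [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9], [0.0, 0.2, 0.8], [0.1, 0.1, 1.0]]

emb_a, emb_b, emb_c = mean_pool(clip_a), mean_pool(clip_b), mean_pool(clip_c)
sim_ab = cosine(emb_a, emb_b)  # clips with similar features
sim_ac = cosine(emb_a, emb_c)  # clips with dissimilar features
```

In this toy setup `sim_ab` comes out higher than `sim_ac`, which is the property a nearest-neighbour audio search relies on.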
- CLIP
CLIP (Contrastive Language-Image Pretraining, Radford et al. 2021) is a dual-encoder model that embeds images and text into a shared vector space. It is trained contrastively on 400M (image, caption) pairs scraped from the web.
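The contrastive objective can be sketched as a symmetric cross-entropy over the image-text similarity matrix. This is a minimal sketch, not the reference implementation: it assumes L2-normalised embeddings, assumes caption i belongs to image i, and fixes the temperature at 0.07 rather than learning it. `clip_loss` and `log_softmax` are illustrative names.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy; pair i is assumed to be the matching pair."""
    n = len(image_embs)
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in text_embs] for img in image_embs]
    # Image -> text: each row should put its probability mass on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: the same objective applied to the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy 2-D unit vectors (assumed data, not real encoder outputs).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = clip_loss(images, aligned)
loss_swapped = clip_loss(images, swapped)
```

The loss is near zero when each caption sits next to its own image and large when the captions are swapped, which is the pressure that pulls matching pairs together in the shared space.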
- Diffusion Model
A diffusion model generates images by iteratively denoising pure Gaussian noise. The forward process gradually adds noise to a real image; the reverse process is a learned neural network that removes it step by step.
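The forward (noising) half of this process has a closed form, sketched below. The linear beta schedule and its endpoints are an assumption borrowed from the common DDPM setup, and `alpha_bar_schedule` and `add_noise` are illustrative names. The learned reverse network is not shown.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps  # eps is what a noise-prediction network would learn to recover

rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
x0 = [0.5, -0.25, 1.0, 0.0]  # toy "image" of four pixel values
x_early, _ = add_noise(x0, 10, alpha_bars, rng)   # still mostly signal
x_late, _ = add_noise(x0, 999, alpha_bars, rng)   # almost pure Gaussian noise
```

Because the cumulative product shrinks toward zero, the last step retains almost nothing of `x0`; generation runs the chain in the other direction, starting from noise.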
- Flow Matching
Flow matching is a generative-modeling objective that learns a continuous vector field transporting noise to data along straight or curved probability paths. It generalizes and often replaces diffusion, offering simpler training and faster sampling, and it underlies SD3, Flux, and Veo.
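For the straight-path case, the regression target and the sampler are short enough to sketch. This assumes the linear interpolation path between one noise sample and one data sample, and uses an "oracle" that returns the exact target in place of a trained network. All function names are illustrative.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the vector field on the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.3, -1.2]
data = [2.0, 1.0]

def oracle(x, t):
    # Stand-in for a trained network: the exact target for this single pair.
    return target_velocity(noise, data)

sample = euler_sample(noise, oracle)
midpoint = interpolate(noise, data, 0.5)
```

With a constant velocity, a handful of Euler steps already lands on the data point. A trained model only approximates this field, so the step count it needs in practice depends on how straight the learned paths are.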
- Image Encoder
An image encoder maps a raw image into a sequence of patch embeddings or a pooled vector. Modern multimodal stacks use Vision Transformer (ViT) encoders that tokenize the image into patches of 16×16 or 14×14 pixels.
- Multimodal RAG
Multimodal RAG is retrieval-augmented generation where the query, the documents, or both span multiple modalities — PDFs with figures, screenshots, voice queries, or image-grounded answers.
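The retrieval step can be sketched as nearest-neighbour search over one index that mixes modalities. The sketch assumes every item has already been embedded into a shared vector space; the vectors, references, and the `retrieve` helper are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical index: modality tag, a reference to the payload, and a toy vector.
index = [
    {"modality": "text",  "ref": "report.pdf p.3 paragraph", "vec": [0.9, 0.1, 0.1]},
    {"modality": "image", "ref": "report.pdf p.4 figure",    "vec": [0.8, 0.3, 0.0]},
    {"modality": "audio", "ref": "voice-note.wav",           "vec": [0.0, 0.2, 0.9]},
]

def retrieve(query_vec, index, k=2):
    """Return the k items closest to the query, regardless of modality."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]

query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded user query
hits = retrieve(query_vec, index)
# Context handed to the generator; a real system would attach the payloads themselves.
context = [f'[{h["modality"]}] {h["ref"]}' for h in hits]
```

Here a single query pulls back both a paragraph and a figure, so the generator sees evidence from two modalities at once.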
- OCR (Optical Character Recognition)
OCR converts image regions containing text into machine-readable strings. Classical pipelines (Tesseract, Google Cloud Vision, AWS Textract) first detect text regions, then recognize the characters in each region, typically with a CNN+LSTM recognizer.
- SigLIP
SigLIP (Zhai et al. 2023) replaces CLIP's softmax contrastive loss with a per-pair sigmoid loss, decoupling each (image, text) pair from the rest of the batch.
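The per-pair loss can be sketched as follows: every (image, text) combination in the batch becomes an independent binary classification, with no softmax across the batch. The temperature and bias defaults are assumptions matching the initial values reported for SigLIP (they are learned in practice), and `siglip_loss` is an illustrative name.

```python
import math

def siglip_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Per-pair sigmoid loss: label +1 on the diagonal, -1 everywhere else."""
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            sim = sum(a * b for a, b in zip(img, txt))
            label = 1.0 if i == j else -1.0
            z = label * (temperature * sim + bias)
            # -log(sigmoid(z)) written as softplus(-z) for numerical stability.
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n

# Toy 2-D unit vectors (assumed data, not real encoder outputs).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = siglip_loss(images, aligned)
loss_swapped = siglip_loss(images, swapped)
```

Each term depends only on its own pair, so the loss needs no batch-wide normalisation, unlike the softmax in CLIP.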
- Text-to-Image
Text-to-image is the generation capability where a natural-language prompt produces an image. The dominant architecture is a CLIP-conditioned latent diffusion model.
- Vision Transformer (ViT)
The Vision Transformer applies a standard transformer to image patches instead of words. An image is cut into a grid of fixed-size patches (16×16 pixels in the original ViT), each patch is flattened and linearly embedded into a token, and the tokens are fed to a transformer encoder together with positional encodings.
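The patching and embedding steps can be sketched in plain Python. This assumes a 224×224 RGB input, non-overlapping patches, and a toy projection matrix. Positional encodings and the transformer itself are omitted, and `patchify` and `linear_embed` are illustrative names.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping
    patches in row-major order. Assumes H and W are multiples of patch_size."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [channel
                     for row in image[top:top + patch_size]
                     for pixel in row[left:left + patch_size]
                     for channel in pixel]
            patches.append(patch)
    return patches

def linear_embed(patch, weight):
    """Project one flattened patch to the model width; weight is embed_dim x patch_dim."""
    return [sum(w * p for w, p in zip(row, patch)) for row in weight]

# Toy 224 x 224 RGB image; every pixel is [r, g, b].
image = [[[0.5, 0.5, 0.5] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16)
num_tokens = len(patches)     # (224 / 16) ** 2 = 196
patch_dim = len(patches[0])   # 16 * 16 * 3 = 768

embed_dim = 4  # tiny width for illustration only
weight = [[0.01] * patch_dim for _ in range(embed_dim)]
first_token = linear_embed(patches[0], weight)
```

A 224×224 image therefore becomes a sequence of 196 tokens at patch size 16, which is the sequence length the transformer encoder sees.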
- Vision-Language Model (VLM)
A vision-language model is an LLM that can see. Image patch embeddings are projected into the LLM's token space and concatenated with text tokens; the model treats them as a uniform sequence and generates text autoregressively.
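The projection-and-concatenation step can be sketched as below. A single linear projection is an assumption (some systems use a small MLP or a resampler instead), and the dimensions and values are toy numbers rather than anything from a real model.

```python
def project(patch_embs, weight):
    """Map each vision embedding (vision_dim) to the LLM embedding width (llm_dim).
    weight has shape llm_dim x vision_dim."""
    return [[sum(w * v for w, v in zip(row, emb)) for row in weight]
            for emb in patch_embs]

vision_dim, llm_dim = 3, 4

# Two image patch embeddings from a (hypothetical) vision encoder.
patch_embs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
# Three text token embeddings already in the LLM's embedding space.
text_embs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
# Toy projection matrix; learned during training in a real model.
weight = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5], [0.1, 0.1, 0.1]]

image_tokens = project(patch_embs, weight)
sequence = image_tokens + text_embs  # one uniform sequence for the decoder
```

After projection the decoder has no separate image pathway: it attends over `sequence` exactly as it would over a text-only prompt.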
- Visual Question Answering
Visual question answering (VQA) is the task of producing a natural-language answer to a question about an image. It is the canonical benchmark for vision-language models because a correct answer requires grounding the question in the image content rather than relying on language priors alone.
- Whisper ASR
Whisper (OpenAI, 2022) is an encoder-decoder transformer for automatic speech recognition trained on 680,000 hours of weakly supervised multilingual audio.
