Also known as: DDPM, denoising diffusion probabilistic model, score-based model
TL;DR
A diffusion model generates images by iteratively denoising pure Gaussian noise. The forward process gradually adds noise to a real image; the reverse process is a learned neural network that removes it step by step.
A diffusion model generates images by running noise through a learned cleanup function many times. Start with pure Gaussian noise, predict the noise contained in it, subtract a fraction of that prediction, repeat 20-1000 times. What’s left is a coherent image. The model is trained on the inverse problem: given a clean image with noise added at a known timestep, predict the noise.
This is the dominant approach to image generation, displacing GANs decisively around 2022. Stable Diffusion, DALL-E 2/3, Imagen, Flux, Midjourney are all diffusion models with different conditioning, data, and architectural choices. The core algorithm is the same.
Forward and reverse processes
The forward process is fixed and parameter-free: at each timestep $t$, add Gaussian noise on a known schedule. After $T$ steps the image is indistinguishable from random noise. The schedule (linear, cosine, sigmoid) controls how fast noise accumulates; cosine schedules outperform linear because they preserve signal longer in early steps where structure is established.
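As a rough illustration of the forward process, the sketch below computes the cumulative signal fraction under a linear and a cosine schedule, then uses it to jump straight to a noisy sample at a chosen timestep. The function names, the beta range of 1e-4 to 0.02, and the cosine offset `s=0.008` are assumptions picked for illustration (they follow commonly published defaults), not values stated in this article.

```python
import numpy as np

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction alpha_bar_t for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal fraction for a cosine schedule (offset s is assumed)."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t directly from x0: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
x0 = np.ones((4, 4))                      # stand-in for a clean image
x_t, eps = add_noise(x0, 499, cosine_alpha_bar(1000), rng)
```

Under these assumed settings with 1000 steps, the cosine schedule still keeps roughly half of the signal variance at the midpoint, while the linear schedule has already fallen below a tenth.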
The reverse process is the learned part. A neural network takes the noisy image at timestep $t$ and the timestep itself as input, and predicts the noise that was added. Subtract that prediction, advance to timestep $t-1$, repeat. The training objective is mean-squared error between predicted and actual noise, which is embarrassingly simple compared to GAN adversarial training.
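The training objective and the sampling loop can be sketched in a few lines. This is a minimal NumPy illustration of DDPM-style ancestral sampling, not a production implementation: `model` is a hypothetical callable `(x_t, t) -> predicted noise`, and the linear schedule and `T = 1000` are assumed values.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def training_loss(model, x0, rng):
    """One Monte Carlo sample of the noise-prediction MSE objective."""
    t = rng.integers(0, T)                # random timestep
    eps = rng.standard_normal(x0.shape)   # the noise the model must recover
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((model(x_t, t) - eps) ** 2)

def reverse_step(model, x_t, t, rng):
    """Remove a scaled portion of the predicted noise to move from t to t-1."""
    eps_hat = model(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                       # no fresh noise on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def sample(model, shape, rng):
    """Start from pure Gaussian noise and denoise step by step."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        x = reverse_step(model, x, t, rng)
    return x
```

In practice `model` would be a trained U-Net or transformer; any function with the same signature is enough to exercise the loop.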
Latent diffusion: the practical trick
DDPM (Ho et al. 2020) ran in pixel space and was painfully slow. Stable Diffusion’s contribution — latent diffusion — encodes the image with a VAE first, runs diffusion in the compressed latent space (64×64×4 instead of 512×512×3), then decodes back to pixels at the end. ~48× fewer values to denoise; diffusion in latent space is correspondingly cheaper. Every modern open-weight image model — SDXL, SD3, Flux — uses this.
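The compression figure is easy to check by counting values. The helper names below are hypothetical; the 8× per-side downsampling and 4 latent channels are inferred from the 512×512×3 and 64×64×4 shapes quoted above.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Latent tensor shape, assuming an 8x-per-side VAE with 4 channels."""
    return (height // downsample, width // downsample, channels)

def num_values(shape):
    """Total number of scalar values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

pixel_shape = (512, 512, 3)
lat = latent_shape(512, 512)
ratio = num_values(pixel_shape) / num_values(lat)
print(lat, ratio)  # (64, 64, 4) 48.0
```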
Architectures: U-Net to DiT
DDPM used a U-Net: a CNN with skip connections at multiple resolutions, with the timestep injected via an embedding. Stable Diffusion 1/2/XL kept this shape and added text conditioning via cross-attention.
The frontier is DiT (Diffusion Transformer) — replace the U-Net with a vision transformer that processes the latent as a sequence of patch tokens. SD3 and Flux both use DiT. The transformer scales better, conditions on text more cleanly (text and image tokens share the same attention machinery), and benefits from the same architectural improvements as LLMs.
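To make "a sequence of patch tokens" concrete, here is a small sketch of the patchify step on a 64×64×4 latent. The patch size of 2 is an assumption for illustration, and the learned linear projection and positional embeddings that a real DiT applies afterwards are omitted.

```python
import numpy as np

def patchify(latent, patch=2):
    """Split an (H, W, C) latent into (num_patches, patch*patch*C) tokens."""
    h, w, c = latent.shape
    assert h % patch == 0 and w % patch == 0
    x = latent.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)        # group the two patch axes together
    return x.reshape((h // patch) * (w // patch), patch * patch * c)

latent = np.arange(64 * 64 * 4, dtype=np.float32).reshape(64, 64, 4)
tokens = patchify(latent)
print(tokens.shape)  # (1024, 16)
```

Each row is one token; the transformer then treats these rows the way an LLM treats word tokens.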
Diffusion models in production
Flux.1 — Black Forest Labs, DiT-based, currently the top open-weight text-to-image model.
Sora / Veo 3 — extend the same diffusion mechanics to video by adding a temporal dimension.
A GAN samples in one forward pass: the generator maps noise to image directly. A diffusion model needs 20-1000 sequential forward passes; each pass is a denoising step and you can’t parallelize them because step $t-1$ depends on the output of step $t$. Distillation techniques (consistency models, LCM, SDXL Turbo) train a student to mimic the teacher’s trajectory in 1-4 steps. Quality drops slightly but inference becomes near-instant, which is the path to real-time image generation.
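Classifier-free guidance (CFG) is the trick that makes text-conditioned diffusion actually follow the prompt. At inference, the model is run twice per step, once with the text condition $c$ and once with empty conditioning $\varnothing$, and the two predictions are linearly extrapolated: $\hat{\epsilon} = \epsilon_\theta(x_t, \varnothing) + w \, (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))$, where $w$ is the guidance scale. Scales of 5-10 are typical.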
This amplifies the conditioning signal far beyond what training alone produces. Without CFG, diffusion models follow prompts loosely. With the scale set too high (above 15), images oversaturate and lose realism. Tuning CFG is one of the few inference-time knobs that visibly changes outputs.
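The extrapolation between the unconditional and conditional predictions is a single line of arithmetic. The sketch below assumes the common formulation in which a scale of 1 reproduces the plain conditional prediction; the function name and the default of 7.5 are illustrative choices, not something this article specifies.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale=7.5):
    """Push the prediction away from the unconditional one, toward the prompt."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.zeros(3)                  # stand-in for the empty-prompt prediction
eps_cond = np.ones(3)                     # stand-in for the text-conditioned prediction
guided = cfg_combine(eps_uncond, eps_cond, scale=2.0)
print(guided)  # [2. 2. 2.]
```

Because the model runs twice per step, guidance roughly doubles the cost of each denoising step unless the two passes are batched together.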
Conditioning: how the prompt gets in
Text conditioning is injected via cross-attention from image tokens to a text embedding, typically CLIP’s text encoder, often combined with T5-XXL for richer text understanding (Imagen, SD3). The timestep is injected via sinusoidal embedding added to feature maps. Conditioning is what makes text-to-image work; everything else is the engine that consumes it.
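A sinusoidal timestep embedding can be sketched as follows. The `max_period` of 10000 and the cosine-then-sine layout are assumptions borrowed from common transformer-style implementations; real models differ in the details and usually pass the result through a small MLP before adding it to the features.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep to a dim-sized vector of cosines and sines (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(t=500, dim=128)
print(emb.shape)  # (128,)
```

Nearby timesteps get similar vectors and distant ones get distinct vectors, which lets a single network behave differently at high and low noise levels.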
Go further
Why iterative denoising instead of generating in one shot?
Generating a high-resolution image directly from noise is too hard a learning problem — the data distribution is too complex relative to the prior. Splitting it into 50-1000 small denoising steps means each step solves an easier conditional problem. The full distribution emerges from composing the small steps. This is the same trick as autoregressive generation factoring text into next-token predictions, but in a continuous space.
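The analogy can be written out as two factorizations side by side. The notation is assumed for illustration: $x_T$ is pure noise, $x_0$ is the final image, and $w_i$ are text tokens.

```latex
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)
\qquad\text{vs.}\qquad
p_\theta(w_{1:n}) = \prod_{i=1}^{n} p_\theta(w_i \mid w_{<i})
```

In both cases a hard joint distribution is broken into a chain of simpler conditionals, each handled by the same network.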
Latent diffusion (Stable Diffusion's contribution) runs the diffusion process in the compressed latent space of a VAE rather than at pixel resolution. A 512×512 image becomes a 64×64×4 latent: 64× fewer spatial positions and about 48× fewer values to denoise. This makes the whole pipeline fast enough to run on consumer GPUs. Flux, SDXL, and Stable Diffusion 3 all use latent diffusion.
The denoising network takes the noisy image and a conditioning signal (CLIP text embedding, T5 embedding, or both) as input. The text is injected via cross-attention layers — every block attends from image tokens to text tokens. Classifier-free guidance amplifies the conditioning at inference by extrapolating between conditional and unconditional predictions.
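The cross-attention step described above can be sketched with plain matrix operations. This is a single-head illustration with random weights; the token counts, dimensions, and weight names are made-up values for the example, and real blocks add multiple heads, an output projection, and residual connections.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, Wq, Wk, Wv):
    """Queries come from image tokens; keys and values come from text tokens."""
    q = image_tokens @ Wq
    k = text_tokens @ Wk
    v = text_tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)    # one distribution over text tokens per image token
    return weights @ v, weights

rng = np.random.default_rng(0)
image_tokens = rng.standard_normal((16, 32))   # 16 image tokens, width 32 (illustrative)
text_tokens = rng.standard_normal((5, 24))     # 5 text tokens, width 24 (illustrative)
Wq = rng.standard_normal((32, 8))
Wk = rng.standard_normal((24, 8))
Wv = rng.standard_normal((24, 8))
out, weights = cross_attention(image_tokens, text_tokens, Wq, Wk, Wv)
print(out.shape, weights.shape)  # (16, 8) (16, 5)
```

Each image token ends up with a weighted mix of text-token values, which is how different regions of the image can attend to different words in the prompt.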