Text-to-Image

Also known as: T2I, text2image, image generation

TL;DR

Text-to-image is the generation capability where a natural-language prompt produces an image. The dominant architecture is a CLIP-conditioned latent diffusion model.

Text-to-image is the user-facing capability that took the AI field from a research curiosity to a mainstream consumer product in 2022. Type a sentence, get an image. The mechanics underneath are straightforward: a CLIP-conditioned latent diffusion model. The prompt is encoded into a vector; the diffusion model generates an image conditioned on that vector. Stable Diffusion, DALL-E 3, Midjourney, Flux, and Imagen are all variations on this shape.

The interesting parts of text-to-image as a discipline are not in the model — they’re in the prompt, the conditioning, and the failure modes.

How a prompt becomes an image

A prompt — say, “a cat sitting on a windowsill, studio lighting, photographic, 35mm” — is tokenized and run through the model’s text encoder. Modern systems use one or two encoders in parallel: CLIP for visual concepts, T5-XXL for richer language understanding. The text embeddings are injected into every cross-attention layer of the diffusion denoising network.

The diffusion model starts from a latent of pure Gaussian noise (64×64×4 for a 512×512 output in Stable Diffusion 1.x-style models) and runs 20-50 denoising steps. At each step, it attends from image tokens to text tokens, predicts the noise present in the latent, subtracts a portion of it, and advances. After the final step, a VAE decoder converts the cleaned latent into pixels. Total inference time on consumer hardware: typically 1-5 seconds, depending on the GPU, resolution, and step count.
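
The latent size and the shape of the loop can be sketched in a few lines of plain Python. This is a toy illustration, not a real sampler: it assumes an 8× VAE downsampling factor and 4 latent channels (the Stable Diffusion 1.x convention), and `oracle_noise` is a hypothetical stand-in for the trained denoising network, which in a real system would also receive the text embeddings and a timestep and would follow a noise schedule.

```python
import random


def latent_shape(height, width, downsample=8, channels=4):
    """Latent (h, w, c) for a pixel resolution; 8x and 4 channels are assumed."""
    return (height // downsample, width // downsample, channels)


def oracle_noise(latent, clean):
    """Hypothetical stand-in for the denoiser: returns the exact noise present."""
    return [x - c for x, c in zip(latent, clean)]


def toy_denoise(clean, steps=20, seed=0):
    """Start from Gaussian noise and remove a portion of predicted noise per step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in clean]
    for i in range(steps):
        remaining = steps - i
        noise = oracle_noise(latent, clean)
        # Subtract a fraction of the predicted noise; the last step removes the rest.
        latent = [x - n / remaining for x, n in zip(latent, noise)]
    return latent


h, w, c = latent_shape(512, 512)
print(h, w, c)                         # 64 64 4
print((512 * 512 * 3) / (h * w * c))   # 48.0, i.e. 48x fewer values than RGB pixels
```

Working in a latent that is 48 times smaller than the pixel grid is what makes 20-50 network evaluations affordable on consumer hardware.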

What the model actually controls

Knobs that affect output

Prompt content — what’s in the image. Subjects, settings, style words.
Negative prompt — what to avoid. “blurry, distorted, extra fingers, bad anatomy” remains the most common production negative prompt.
Classifier-free guidance scale — how strongly the model adheres to the prompt. Low (3-5) gives variety; high (10-15) gives literal-minded outputs.
Sampler — DPM-Solver++, Euler, Heun, DDIM. Different ODE solvers for the denoising trajectory. Trade speed against quality.
Step count — 20 steps is the modern default. More than 50 yields negligible improvement.
Seed — initial noise pattern. Same seed + same prompt = reproducible image.
Aspect ratio — most models work best at the resolutions in their training distribution; off-aspect generation degrades.
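
Two of these knobs reduce to very little arithmetic. The sketch below shows the standard classifier-free guidance combination and seeded noise, using plain Python lists in place of tensors. The prediction values are made up for illustration. In many implementations the negative prompt simply replaces the empty prompt in the unconditional branch, which is assumed here rather than taken from the text above.

```python
import random


def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the prompt-conditioned one by `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def initial_noise(seed, n):
    """Seeded Gaussian noise: the same seed always yields the same values."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


eps_uncond = [0.10, -0.20, 0.30]   # hypothetical prediction for the empty/negative prompt
eps_cond = [0.30, -0.10, 0.10]     # hypothetical prediction for the real prompt

for scale in (1.0, 4.0, 12.0):
    print(scale, cfg_combine(eps_uncond, eps_cond, scale))

print(initial_noise(42, 3) == initial_noise(42, 3))  # True
print(initial_noise(42, 3) == initial_noise(43, 3))  # False
```

At scale 1.0 the result is just the conditioned prediction; larger scales push further along the direction that separates "with prompt" from "without prompt", which is why high values read as literal-minded.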

Where it breaks

Hands have fine articulated structure with limited training signal — a typical caption mentions “person” but not “five fingers in this configuration”. Diffusion produces locally plausible texture but global structure (correct count, correct articulation) requires reasoning the network is bad at. Flux and Imagen 3 mostly get hands right; older models produce 6-fingered abominations.

Text-in-images has a related but distinct problem. A 512×512 generation has only ~30 pixels of vertical resolution per character; the patch tokenizer averages it into mush. SDXL was nearly illiterate; SD3 and Flux fixed text rendering primarily by using larger text encoders (T5-XXL is much richer than CLIP) and curating training data with clean visible text.

Text-to-image models and vision-language models (VLMs) are inverses with shared infrastructure. A VLM takes an image and produces text. A text-to-image model takes text and produces an image. Both rely on a shared embedding space between the two modalities (typically CLIP’s) to bridge the gap. Some recent systems (GPT-4o, Gemini, Chameleon) unify both directions in a single model that generates either modality token-by-token. This is the path to native multimodal models, but it’s still rarer than the specialized two-model pattern.

The post-training era

The base diffusion models you can download (Flux dev, SD 3.5) are aesthetically generic. Production-quality outputs come from fine-tuning. LoRA adapters are the dominant approach — train a small low-rank update on a few hundred images of a specific person, art style, or product. ControlNet conditions generation on a pose, depth map, or edge sketch for spatial control. IP-Adapter conditions on a reference image to copy style or composition. The image-generation community ships hundreds of these adapters per week; CivitAI exists primarily to host them.
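
To make "small low-rank update" concrete, here is a minimal sketch of how a LoRA adapter modifies one weight matrix, assuming the usual formulation W + (alpha / rank) · B·A. The matrix sizes, `alpha`, and `rank` values are illustrative choices, not taken from any particular model.

```python
def matmul(a, b):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W itself is left untouched."""
    delta = matmul(b, a)
    s = alpha / rank
    return [[wij + s * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


def lora_param_counts(d_out, d_in, rank):
    """(full weight parameters, LoRA adapter parameters) for one layer."""
    return d_out * d_in, rank * (d_out + d_in)


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen 2x3 base weight
b = [[0.5], [-0.5]]                       # 2x1 trainable factor
a = [[1.0, 2.0, 3.0]]                     # 1x3 trainable factor

print(apply_lora(w, b, a, alpha=1.0, rank=1))
# [[1.5, 1.0, 1.5], [-0.5, 0.0, -1.5]]
print(lora_param_counts(4096, 4096, 16))
# (16777216, 131072)
```

For a hypothetical 4096×4096 layer at rank 16, the adapter holds under 1% of the parameters of the full matrix, which is why adapters are cheap to train and small enough to share by the hundreds.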

The honest production reality: prompt engineering alone takes you to ~80% of what’s possible. The last 20% — consistent character, brand-faithful style, layout control — comes from LoRA, ControlNet, and prompt-rewriting pipelines built on top of the base model.

Go further

What does 'CLIP-conditioned' actually mean in practice?

The text prompt is fed through CLIP's text encoder (and often a second encoder like T5-XXL) to produce an embedding. That embedding is injected into every cross-attention layer of the diffusion model's denoising network. The image is then generated to match the conditioning. CLIP determines which prompts the model can understand; the diffusion model determines what they look like.
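
The injection step can be sketched as scaled dot-product cross-attention, where queries come from image tokens and keys and values come from text tokens. This is a simplified illustration under stated assumptions: real networks apply learned projection matrices and multiple heads, both omitted here, and the tiny vectors are invented for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_queries, text_keys, text_values):
    """Each image token attends over all text tokens (scaled dot-product)."""
    d = len(text_keys[0])
    out = []
    for q in image_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in text_keys]
        weights = softmax(scores)
        # Weighted mix of text value vectors, one output vector per image token.
        mixed = [sum(w * v[j] for w, v in zip(weights, text_values))
                 for j in range(len(text_values[0]))]
        out.append(mixed)
    return out


image_queries = [[1.0, 0.0], [0.0, 1.0]]              # 2 image tokens
text_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]      # 3 text tokens
text_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

for row in cross_attention(image_queries, text_keys, text_values):
    print([round(x, 3) for x in row])
```

Each image token ends up with a blend of text values weighted toward the text token whose key it matches best, which is how different regions of the image pick up different words from the prompt.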

Why is prompt engineering for images its own discipline?

Image diffusion models have biases baked into the dataset they were trained on — words like 'cinematic', 'hyperrealistic', '8k', 'studio lighting' weren't equally represented across all images, so they implicitly select for certain aesthetics. Negative prompts, weight syntax (parentheses), and aspect-ratio cues all materially affect output quality. The community-known incantations don't transfer between models.
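
As one illustration of weight syntax, the sketch below parses an explicit `(text:weight)` form into weighted chunks. The exact syntax is an assumption modeled on a convention used by some community web UIs. Other tools parse parentheses differently or not at all, and `parse_weighted_prompt` is a hypothetical helper, not part of any library.

```python
import re

# Matches "(some text:1.3)"; nested parentheses are deliberately not handled.
_WEIGHTED = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")


def parse_weighted_prompt(prompt):
    """Split a prompt into (text, weight) chunks; unmarked text gets weight 1.0."""
    chunks = []
    pos = 0
    for m in _WEIGHTED.finditer(prompt):
        plain = prompt[pos:m.start()].strip(" ,")
        if plain:
            chunks.append((plain, 1.0))
        chunks.append((m.group(1).strip(), float(m.group(2))))
        pos = m.end()
    tail = prompt[pos:].strip(" ,")
    if tail:
        chunks.append((tail, 1.0))
    return chunks


print(parse_weighted_prompt("a cat on a windowsill, (studio lighting:1.3), (blurry:0.5)"))
# [('a cat on a windowsill', 1.0), ('studio lighting', 1.3), ('blurry', 0.5)]
```

A pipeline would then scale each chunk's text embedding by its weight before conditioning. How that scaling is applied varies by tool, which is one reason the same prompt string behaves differently across models and front ends.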

How do controllable image-generation methods like ControlNet and LoRA fit in?

ControlNet adds a parallel network that conditions on extra spatial input (pose, depth, edges) for fine-grained layout control. LoRA fine-tunes a small low-rank adapter on a personal dataset (a face, an art style) without touching the base model. Both compose with text conditioning — text prompt for content, ControlNet for layout, LoRA for style. They are the production tools for getting specific images, not just plausible ones.
