Vision-Language Model (VLM)

Also known as: VLM, multimodal LLM, MLLM, vision LLM

TL;DR

A vision-language model is an LLM that can see. Image patch embeddings are projected into the LLM's token space and concatenated with text tokens; the model treats them as a uniform sequence and generates text autoregressively.

A vision-language model is a large language model with image input. The image is run through a frozen ViT to produce patch token embeddings; those are projected into the LLM’s embedding space by a small MLP and concatenated into the prompt as if they were ordinary tokens. The LLM then attends jointly over image and text tokens and generates a text response autoregressively. There is no separate “vision module” producing answers: the LLM does the reasoning end-to-end.

This is the architecture behind GPT-4V, Claude vision, Gemini, LLaVA, Qwen-VL, Pixtral, and most other multimodal LLMs released from 2023 onward. The unification is striking: a single decoder transformer reasons across modalities through nothing more than a token-space alignment trick.

Vision encoder, projection, LLM tokens

A 336×336 image at 14×14 patch size gives 576 patches. A frozen CLIP-trained or SigLIP-trained ViT produces 576 hidden states of dimension ~1024. A learned 2-layer MLP projects each to the LLM’s embedding dimension (4096 for Llama-3-8B). These 576 tokens are inserted into the prompt at the <image> placeholder.
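
The arithmetic and the placeholder substitution can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's API: the function names, the string stand-ins for embeddings, and the example prompt are all assumptions made for the sketch.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def splice_image_tokens(prompt_tokens, image_tokens, placeholder="<image>"):
    """Replace the first placeholder with the image token sequence."""
    idx = prompt_tokens.index(placeholder)  # raises ValueError if absent
    return prompt_tokens[:idx] + image_tokens + prompt_tokens[idx + 1:]


n = num_patches(336, 14)  # 24 patches per side
# Strings stand in for the projected patch embeddings (hypothetical).
image_tokens = [f"<img_{i}>" for i in range(n)]
prompt = ["USER:", "<image>", "What", "is", "shown", "?"]
sequence = splice_image_tokens(prompt, image_tokens)
print(n, len(sequence))  # 576 581
```

In a real model the splice happens on embedding vectors rather than strings, but the sequence bookkeeping is the same: one placeholder is replaced by 576 positions.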

The projection MLP, usually called the connector, is the entire interface between vision and language. It is small (~20M params), cheap to train, and the only piece that’s randomly initialized. The choice of vision encoder, the projection design, and the resolution of the image all directly determine what the LLM can see.
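
The ~20M figure can be checked with a quick parameter count. The sketch below assumes a layout of Linear(1024, 4096), an activation, then Linear(4096, 4096), with biases; that layout is an assumption for illustration, and other connector designs will give different totals.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of one dense layer."""
    return d_in * d_out + (d_out if bias else 0)


def mlp_connector_params(d_vision: int, d_llm: int) -> int:
    # Assumed layout: Linear(d_vision, d_llm) -> activation -> Linear(d_llm, d_llm)
    return linear_params(d_vision, d_llm) + linear_params(d_llm, d_llm)


total = mlp_connector_params(1024, 4096)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")  # 20,979,712 parameters (~21M)
```

Almost all of the weight sits in the second 4096×4096 layer, which is why the connector's size tracks the LLM's embedding dimension more than the vision encoder's.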

Capabilities

What modern VLMs can do

Image description — generate captions, answer “what’s in this image” questions.
Visual question answering — read a chart, count objects, describe relationships.
OCR-by-default — modern VLMs read printed and handwritten text inside images well enough to obsolete dedicated OCR pipelines for many use cases.
Document understanding — extract structured data from scanned PDFs, invoices, forms.
UI agent perception — vision is the substrate for browser-using and computer-using agents.
Visual grounding — pointing at things in the image (with bounding boxes or coordinates emitted as text).
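
For the grounding case, coordinates that arrive as text have to be parsed back into numbers. The sketch below assumes a hypothetical `<box>x1,y1,x2,y2</box>` output format with integer pixel coordinates; real models each define their own format, so the tag and pattern here are illustrative only.

```python
import re

# Hypothetical output format: <box>x1,y1,x2,y2</box> with integer coordinates.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_boxes(text: str):
    """Return a list of (x1, y1, x2, y2) integer tuples found in the text."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(g) for g in match.groups())
        if x1 < x2 and y1 < y2:  # drop degenerate or inverted boxes
            boxes.append((x1, y1, x2, y2))
    return boxes


reply = "The mug is at <box>40, 60, 120, 180</box> and the pen at <box>200,90,260,110</box>."
boxes = parse_boxes(reply)
print(boxes)  # [(40, 60, 120, 180), (200, 90, 260, 110)]
```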

Limits

Counting. A VLM asked “how many people are in this image” answers correctly up to ~5 and degrades from there. The patch tokenizer doesn’t preserve fine-grained spatial detail well enough.

Spatial reasoning. “Which object is to the left of the cup?” works some of the time. Position embeddings inside the ViT encode patch coordinates, but the LLM never directly sees them as numbers — it has to infer geometry from attention patterns.

Reading dense text. Without tiling, a single 336×336 input carries only about 0.11 megapixels (336 × 336 = 112,896 pixels), far less than a legible scan of a book page requires. Modern VLMs use multi-crop strategies, but quality still degrades on small text.
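
The cost of multi-crop can be estimated with a rough token budget. The sketch below assumes the image is covered by a grid of 336×336 tiles, each encoded to 576 tokens, plus one downscaled global view; the extra global view and the 1344×2016 page size are assumptions for illustration, and actual tiling schemes vary by model.

```python
import math


def tile_grid(width: int, height: int, tile: int = 336):
    """Columns and rows of tile-sized crops needed to cover the image."""
    return math.ceil(width / tile), math.ceil(height / tile)


def image_token_budget(width, height, tile=336, patch=14, add_thumbnail=True):
    """Total image tokens if every tile is encoded separately."""
    cols, rows = tile_grid(width, height, tile)
    tokens_per_tile = (tile // patch) ** 2
    # Assumption: one extra downscaled view of the whole image.
    n_tiles = cols * rows + (1 if add_thumbnail else 0)
    return n_tiles * tokens_per_tile


cols, rows = tile_grid(1344, 2016)        # hypothetical scanned page
budget = image_token_budget(1344, 2016)
print(cols, rows, budget)  # 4 6 14400
```

Under these assumptions a single page costs 14,400 image tokens, which is the trade-off behind multi-crop: more legible text in exchange for a much longer prompt.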

Image generation. Most VLMs are read-only. Generating images from an LLM requires either an external diffusion model in the loop or autoregressive image-token generation (GPT-4o, Chameleon) which is far rarer.

The standard recipe for instruction data, popularized by LLaVA, uses GPT-4 (or now GPT-4V, Claude) to synthesize (image, instruction, response) triples from a smaller pool of human-captioned images. Given an image’s caption and bounding-box annotations, the bigger model is prompted to write a question and an answer about the image. The student VLM is then fine-tuned (SFT) on these triples.

In the original text-only setup, the bigger model never sees the image during synthesis; it sees only the caption and bounding-box annotations. Because those annotations describe the real image, the resulting questions are grounded in real visual content even though the synthesis pipeline is text-only. Roughly 150k-1M synthesized triples produce a competent VLM. The quality of the synthesis prompt is the limiting factor.
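
A text-only synthesis prompt of this kind might be assembled as follows. This is a sketch under stated assumptions: the prompt wording, the function name, and the (label, x1, y1, x2, y2) box layout are hypothetical and are not the prompt used by LLaVA or any other project.

```python
def build_synthesis_prompt(caption, boxes):
    """Build a text-only prompt from a caption and (label, x1, y1, x2, y2) boxes."""
    lines = [
        "You are given annotations describing an image you cannot see.",
        f"Caption: {caption}",
        "Objects (label: x1, y1, x2, y2):",
    ]
    for label, x1, y1, x2, y2 in boxes:
        lines.append(f"- {label}: {x1}, {y1}, {x2}, {y2}")
    lines.append(
        "Write one question about the image and its answer, "
        "using only the annotations."
    )
    return "\n".join(lines)


prompt = build_synthesis_prompt(
    "A dog chasing a ball in a park.",
    [("dog", 10, 20, 200, 220), ("ball", 250, 180, 290, 220)],
)
print(prompt)
```

The string returned here would be sent to the larger model, and its reply paired with the original image to form one (image, instruction, response) training triple.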

The architectural shape that won

The interesting thing about the VLM era is how little architecture was needed. No cross-attention layers, no fancy fusion: just project image features to the LLM’s token space and let attention handle it. This is the same structural lesson as multimodal embeddings: alignment to a shared representation is the load-bearing trick. Once images live in the LLM’s embedding space, all the LLM’s reasoning machinery applies for free.

Go further

How are image tokens actually constructed?

A frozen ViT (often CLIP- or SigLIP-trained) processes the image into N patch tokens: at a 14×14 patch size, typically 256 for 224×224 input or 576 for 336×336. A learned MLP projection (the 'connector' in LLaVA) maps each patch's hidden state to the LLM's embedding dimension. The resulting tokens are inserted into the prompt sequence at the position where the image appeared.
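
The projection step can be sketched in plain Python with toy sizes. This is an illustration only: the GELU activation, the random initialization, and the dimensions (8 and 16 instead of roughly 1024 and 4096) are assumptions, and a real connector would run as batched matrix multiplications in a tensor library.

```python
import math
import random


def gelu(x: float) -> float:
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x, weight, bias):
    """x: d_in floats; weight: d_out rows of d_in floats; bias: d_out floats."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(d_in, d_out, rng):
    """Small random weights and zero bias (the connector starts untrained)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out


def project_patches(patches, layer1, layer2):
    """Apply Linear -> GELU -> Linear to each patch vector independently."""
    projected = []
    for patch in patches:
        hidden = [gelu(h) for h in linear(patch, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


rng = random.Random(0)
d_vision, d_llm, n_patches = 8, 16, 4  # toy sizes for the sketch
layer1 = init_linear(d_vision, d_llm, rng)
layer2 = init_linear(d_llm, d_llm, rng)
patches = [[rng.gauss(0.0, 1.0) for _ in range(d_vision)] for _ in range(n_patches)]
image_tokens = project_patches(patches, layer1, layer2)
print(len(image_tokens), len(image_tokens[0]))  # 4 16
```

Each patch is projected independently, so the number of image tokens equals the number of patches; only the width of each vector changes.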


How does training work for a VLM?

Two stages. First, alignment: freeze the LLM and the vision encoder, train only the projection MLP on (image, caption) pairs so it learns to speak the LLM's token language. Second, instruction tuning: unfreeze the LLM (and sometimes the vision encoder) and train on (image, instruction, response) triples — usually synthesized by GPT-4V over a smaller pool of images.
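
The freeze schedule for the two stages can be written down as a small lookup. The component names and the function are hypothetical, and the sketch assumes the projector keeps training in the second stage, which the description implies but does not state outright.

```python
COMPONENTS = ("vision_encoder", "projector", "llm")


def trainable_components(stage: str, tune_vision_encoder: bool = False) -> set:
    """Which components receive gradient updates in each stage."""
    if stage == "alignment":
        # Stage 1: only the projection MLP learns.
        return {"projector"}
    if stage == "instruction_tuning":
        # Stage 2: the LLM is unfrozen; the projector is assumed to stay trainable.
        trainable = {"projector", "llm"}
        if tune_vision_encoder:
            trainable.add("vision_encoder")
        return trainable
    raise ValueError(f"unknown stage: {stage!r}")


for stage in ("alignment", "instruction_tuning"):
    frozen = sorted(set(COMPONENTS) - trainable_components(stage))
    print(stage, "frozen:", frozen)
```

In a training framework this set would decide which parameters have gradients enabled before the optimizer is built for each stage.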


Why do VLMs hallucinate so much on images?

Because the LLM is trained to generate plausible text and the vision tokens are a relatively weak conditioning signal compared to the text prompt. If the prompt is leading ('describe the cat in this image'), the model will describe a cat even if there isn't one. Visual grounding remains an unsolved problem — the field name for the failure is 'object hallucination'.
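
Object hallucination can be made measurable by comparing the objects a response mentions against the objects annotated in the image. The sketch below is a simplified check, assuming the object names have already been extracted from the text by some other step; the function names and example lists are hypothetical.

```python
def hallucinated_objects(mentioned, present):
    """Objects named in the response that are not annotated in the image."""
    return sorted(set(mentioned) - set(present))


def hallucination_rate(mentioned, present) -> float:
    """Fraction of distinct mentioned objects that are not in the image."""
    mentioned = set(mentioned)
    if not mentioned:
        return 0.0
    return len(mentioned - set(present)) / len(mentioned)


mentioned = ["cat", "sofa", "lamp"]   # extracted from the model's description
present = ["sofa", "lamp", "table"]   # ground-truth annotations
print(hallucinated_objects(mentioned, present))          # ['cat']
print(round(hallucination_rate(mentioned, present), 3))  # 0.333
```

Running the same check with a leading prompt and a neutral prompt on the same image is one way to see how much the text prompt, rather than the image, drives the answer.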

