Visual Question Answering

Also known as: VQA, image question answering, visual QA

TL;DR

Visual question answering (VQA) is the task of producing a natural-language answer to a question about an image. It is the canonical benchmark for vision-language models because it forces grounding.

Visual question answering is the task of producing a free-form answer to a natural-language question about an image. “How many people are in this photo?” “What color is the leftmost car?” “Is this person wearing a watch?” “What does the chart on slide 3 imply about Q3 revenue?” The answer can be a single word, a number, a phrase, or a multi-sentence explanation. VQA is the canonical evaluation surface for vision-language models because, unlike captioning, you can’t solve it by producing generic text that’s loosely consistent with the image — the question demands a specific grounded response.

VQA is also the building block for almost every practical multimodal application: document understanding, chart analysis, GUI navigation, visual reasoning agents, and the visual half of multimodal RAG. When people say a model is “good at vision,” they usually mean that it scores well on a VQA-shaped task.

Why VQA is a strong evaluation harness

The benchmark forces grounding. A captioning model can hide weakness under generic descriptions (“a person standing in a room”) that are technically correct but content-free. A VQA model, asked “how many windows are in this room?”, has to commit to a specific number. There is no plausible-sounding hedge that passes.

VQA also stress-tests against prior bias. Early VQA datasets had a structural flaw: a model could answer “is the sky blue?” with “yes” and be right roughly 90% of the time without looking at the image, because “yes” was heavily over-represented among the answers to such questions. VQAv2 (2017) explicitly addressed this by collecting complementary image pairs (for each question, a second, similar image where the answer is different), forcing models to actually read the image to do well.
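The prior-bias problem can be illustrated with a "blind" baseline that never looks at an image. The sketch below is a toy under stated assumptions: the data is made up, and `question_type` (the first two words of the question) is a hypothetical stand-in for a real question-type taxonomy.

```python
from collections import Counter, defaultdict

def question_type(question):
    """Crude question type: the first two lowercase words (an assumption)."""
    return " ".join(question.lower().split()[:2])

def fit_blind_baseline(train):
    """Learn the most common answer per question type, ignoring images."""
    counts = defaultdict(Counter)
    for question, answer in train:
        counts[question_type(question)][answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}

def blind_accuracy(priors, test):
    """Fraction of questions answered correctly without seeing any image."""
    hits = sum(priors.get(question_type(q)) == a for q, a in test)
    return hits / len(test)

# Toy, made-up data: "is the" questions skew heavily toward "yes".
biased = [("Is the sky blue?", "yes")] * 9 + [("Is the sky blue?", "no")]
# Balanced in the complementary-pair style: each question has both answers.
balanced = [("Is the sky blue?", "yes"), ("Is the sky blue?", "no")] * 5

priors = fit_blind_baseline(biased)
print(blind_accuracy(priors, biased))    # 0.9
print(blind_accuracy(priors, balanced))  # 0.5
```

On the skewed set the blind baseline looks strong; once every question is paired with an image that flips the answer, it falls back to chance.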

The dataset landscape

Canonical VQA benchmarks

VQAv2 (Goyal 2017) — the balanced successor to the original VQA dataset (Antol 2015) and the long-time standard. 200K images, 1.1M questions, free-form answers. Largely solved: modern VLMs hit 85%+ vs human ceiling 92%.
GQA (Hudson 2019) — compositional reasoning over real-world images annotated with scene graphs. Questions chain objects, attributes, and relations, like “is the cup to the left of the laptop the same color as the plate?” Forces multi-step visual reasoning, not single-object recognition.
MM-Vet (Yu 2023) — 218 questions over 200 images, designed to require integration of recognition, OCR, knowledge, language generation, spatial reasoning, and math. The current honest “is this VLM good?” evaluation.
MMMU (2023) — college-level multimodal reasoning across 30 subjects. Charts, diagrams, formulas. The hard exam for frontier VLMs.
OK-VQA / A-OKVQA — questions that require external knowledge (“what’s the name of this monument?”). Tests retrieval-augmented or world-knowledge grounding.
TextVQA / DocVQA / ChartQA — questions that demand reading text inside the image. OCR-heavy.
RealWorldQA (xAI 2024) — 700 photos with questions designed to be unambiguous and grounded in physical reality. Anti-benchmark-gaming.
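The percentages quoted for VQAv2-style benchmarks come from a soft accuracy over multiple human answers per question. A minimal sketch of the commonly cited form, min(1, matches / 3), is shown here. It is a simplification: the official evaluation also normalizes punctuation and number words and averages over subsets of annotators, which this toy omits, and the annotator answers below are made up.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(1, number of matching annotators / 3)."""
    normalized = prediction.strip().lower()
    matches = sum(normalized == a.strip().lower() for a in human_answers)
    return min(1.0, matches / 3)

# Ten hypothetical annotator answers to "how many windows are in this room?"
humans = ["2", "2", "3", "2", "two", "2", "3", "2", "2", "2"]

print(vqa_accuracy("2", humans))  # 1.0, seven annotators agree
print(vqa_accuracy("3", humans))  # two annotators agree, so 2/3
print(vqa_accuracy("4", humans))  # 0.0
```

An answer gets full credit once three annotators agree with it, which is why a model can score partial credit on ambiguous questions.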

How modern VLMs actually solve VQA

Step 1: image encoding. An image encoder such as SigLIP or EVA-CLIP, often kept frozen, ingests the image and produces a sequence of patch token embeddings, typically 256-1024 tokens per image at input resolutions around 384×384, depending on patch size.
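The token counts follow directly from the patch grid. As a rough sketch, assuming a square image, square non-overlapping patches, and no extra class token (the patch sizes 14 and 16 are illustrative choices, not a claim about any specific model):

```python
def num_patch_tokens(image_size, patch_size):
    """Number of patch tokens for a square image split into square patches."""
    per_side = image_size // patch_size  # patches that fit along one side
    return per_side * per_side

print(num_patch_tokens(224, 14))  # 256
print(num_patch_tokens(336, 14))  # 576
print(num_patch_tokens(384, 14))  # 729
print(num_patch_tokens(384, 16))  # 576
```

Doubling the resolution roughly quadruples the token count, which is why high-resolution inputs get expensive quickly.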

Step 2: projection. A learned projector (the “vision projector” or “abstractor”, often a small MLP) maps each patch embedding into the LLM’s hidden dimension. Some models compress further here: the Q-Former in BLIP-2 distills the encoder’s few hundred patch tokens into 32 learned query tokens.
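To make the projection step concrete, here is a dependency-free sketch of a two-layer MLP projector applied independently to each patch embedding. Everything in it is an assumption for illustration: the dimensions are tiny, the weights are random rather than trained, and the GELU activation is one common choice rather than a requirement.

```python
import math
import random

def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, weight, bias):
    """Compute weight @ vec + bias; weight is a list of rows of length len(vec)."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def make_projector(vision_dim, llm_dim, rng):
    """Two-layer MLP: vision_dim -> llm_dim -> llm_dim (toy random weights)."""
    w1, b1 = random_matrix(llm_dim, vision_dim, rng), [0.0] * llm_dim
    w2, b2 = random_matrix(llm_dim, llm_dim, rng), [0.0] * llm_dim

    def project(patch):
        hidden = [gelu(h) for h in linear(patch, w1, b1)]
        return linear(hidden, w2, b2)

    return project

rng = random.Random(0)
vision_dim, llm_dim, num_patches = 8, 16, 4  # tiny illustrative sizes
project = make_projector(vision_dim, llm_dim, rng)
patches = [[rng.gauss(0.0, 1.0) for _ in range(vision_dim)] for _ in range(num_patches)]
projected = [project(p) for p in patches]
print(len(projected), len(projected[0]))  # 4 16
```

The number of tokens is unchanged; only the width of each token changes to match the LLM. A compressing abstractor would reduce the token count as well.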

Step 3: input assembly. The projected patch tokens are inserted into the LLM’s input sequence, typically prepended to the question text. Special tokens delimit the image region. The LLM sees something like [BOS] [IMG] <patch_1> ... <patch_N> [/IMG] question text, and generation continues from there until the model emits [EOS].
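The layout can be sketched symbolically. The token names mirror the illustrative ones in the text and are not the special tokens of any particular model; the helper `assemble_input` is hypothetical, and this sketch leaves the end-of-sequence token to the decoding step.

```python
def assemble_input(question_tokens, num_patches):
    """Build the symbolic input sequence: image span first, then the question."""
    image_span = ["[IMG]"] + [f"<patch_{i}>" for i in range(1, num_patches + 1)] + ["[/IMG]"]
    return ["[BOS]"] + image_span + list(question_tokens)

sequence = assemble_input(["How", "many", "windows", "?"], num_patches=3)
print(sequence)
# ['[BOS]', '[IMG]', '<patch_1>', '<patch_2>', '<patch_3>', '[/IMG]', 'How', 'many', 'windows', '?']
```

In a real model each `<patch_i>` placeholder stands for a projected embedding vector, while the text tokens are looked up in the LLM's own embedding table.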

Step 4: autoregressive decoding. The LLM generates the answer one token at a time, attending to both the patch tokens and the question. No special architecture changes — the LLM treats patches as just more input tokens.
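A greedy version of this loop fits in a few lines. The sketch below assumes a hypothetical `next_token_scores` callable standing in for the LLM forward pass; `toy_model` is a made-up stub that "answers" by counting patch placeholders, purely so the loop can run end to end.

```python
def greedy_decode(next_token_scores, prompt, eos="[EOS]", max_new_tokens=16):
    """Generate tokens one at a time, always picking the highest-scoring one."""
    sequence = list(prompt)
    answer = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(sequence)  # dict: token -> score
        token = max(scores, key=scores.get)
        if token == eos:
            break
        answer.append(token)
        sequence.append(token)  # feed the chosen token back in as context
    return answer

def toy_model(sequence):
    """Hypothetical stand-in for a VLM: counts patch tokens, then stops."""
    if sequence[-1] == "?":
        n = sum(tok.startswith("<patch_") for tok in sequence)
        return {str(n): 1.0, "[EOS]": 0.0}
    return {"[EOS]": 1.0}

prompt = ["[BOS]", "[IMG]", "<patch_1>", "<patch_2>", "[/IMG]", "How", "many", "patches", "?"]
print(greedy_decode(toy_model, prompt))  # ['2']
```

Production systems usually swap the `max` for sampling or beam search, but the feed-the-output-back-in structure is the same.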

The vision encoder and projector are typically pre-aligned on (image, caption) data, then the full stack is instruction-tuned on VQA-shaped datasets (LLaVA-Instruct, ShareGPT4V). The final model is a standard decoder-only LLM with extra “vision-shaped” tokens at the start of context.

Where VQA fails specifically

Counting. Asking “how many X are in the image” remains hard. VLMs can usually distinguish 0/1/2 reliably; beyond ~5, accuracy collapses. There’s no architectural mechanism for explicit counting; the model is reading off learned heuristics.

Spatial reasoning. “Is the cat to the left or right of the dog?” Modern models do okay on simple two-object cases but struggle with three or more objects or with relative depth.

Fine-grained OCR in dense images. Reading specific text in a screenshot or document is error-prone outside the OCR-specialized models. A VLM might read the headline correctly and hallucinate the body text.

Chart and graph reading. Extracting exact values from a bar chart is notoriously unreliable: most VLMs land only within ~30% of the true value on chart-value extraction. ChartQA and PlotQA exist specifically to track this.
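Chart-value extraction is usually scored with a tolerance rather than exact match. The sketch below shows a relative-error check in that spirit; the 5% default reflects the relaxed-accuracy setting commonly reported for ChartQA, but treat the exact threshold and the function names as assumptions.

```python
def relative_error(predicted, target):
    """Absolute error as a fraction of the target's magnitude."""
    if target == 0:
        return 0.0 if predicted == 0 else float("inf")
    return abs(predicted - target) / abs(target)

def relaxed_match(predicted, target, tolerance=0.05):
    """True if the prediction is within the given relative tolerance."""
    return relative_error(predicted, target) <= tolerance

print(relaxed_match(102.0, 100.0))                  # True
print(relaxed_match(125.0, 100.0))                  # False
print(relaxed_match(125.0, 100.0, tolerance=0.30))  # True
```

A reading that is 25% off fails the strict tolerance but passes a 30% one, which is the gap between "roughly right" and "usable for analysis".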

Grounding under question pressure. Asked a leading question (“isn’t there a cat in this image?”), VLMs are noticeably more likely to confabulate one. Instruction tuning tends to teach them to be agreeable.
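One way to probe this is to ask the same thing neutrally and leadingly and check whether the answer flips. The sketch below assumes a hypothetical `ask(image, question)` callable; `agreeable_model` is a made-up stub that mimics the failure mode, and the "image" is a toy dictionary rather than real pixels.

```python
def flips_under_pressure(ask, image, neutral_question, leading_question):
    """Return True if the answer changes when the question is phrased leadingly."""
    neutral = ask(image, neutral_question).strip().lower()
    leading = ask(image, leading_question).strip().lower()
    return neutral != leading

def agreeable_model(image, question):
    """Hypothetical stand-in: agrees with any question starting with "isn't"."""
    if question.lower().startswith("isn't"):
        return "yes"
    return "yes" if "cat" in image["objects"] else "no"

image = {"objects": ["dog", "sofa"]}  # toy image description, not real pixels
flipped = flips_under_pressure(
    agreeable_model,
    image,
    "Is there a cat in this image?",
    "Isn't there a cat in this image?",
)
print(flipped)  # True
```

Running such pairs over a labeled set gives a simple flip rate that can be tracked alongside plain accuracy.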

Counter-factual reasoning. “If the red car weren’t here, what would be in front of the building?” Models tend to just describe the actual scene rather than reason hypothetically.

Where VQA shows up in production

VQA is the backbone of any “ask a question about an image/PDF/screenshot” feature. Customer support copilots that read user-uploaded screenshots. Document-analysis tools that answer questions about contracts, invoices, or research papers. Accessibility tools that describe images to blind users. GUI agents that read what’s on the screen and decide where to click. The retrieval half of multimodal RAG finds the right image; the VQA model then reads it for a specific answer.

The honest take: VQA is now mostly a question of grounding fidelity, not capability. Modern VLMs (GPT-4o, Claude Sonnet, Gemini Pro, Qwen2-VL, InternVL2) can answer most reasonable questions about most reasonable images. The remaining frontier is reliability: hallucinating less, refusing more honestly when the image doesn’t support an answer, and reading dense visual content (charts, dense documents, schematics) without losing detail.

Go further

What's the difference between VQA and image captioning?

Captioning produces an open-ended description of what's in the image; VQA produces a targeted answer to a specific question. Captioning rewards generic plausibility ('a dog in a park'); VQA rewards specificity ('three dogs', 'the leftmost one is brown'). VQA is much harder because it can't be solved by recognizing the image's overall gist — the question dictates which detail matters.

Why are VQAv2 scores so close to ceiling now?

VQAv2 (2017) is largely solved — modern VLMs hit 85%+ vs human ceiling around 92%. The benchmark's question distribution skews toward simple recognition (color, count, object presence) which large pretraining handles. The interesting frontier moved to MM-Vet, MMMU, RealWorldQA, and similar benchmarks that demand multi-step reasoning, OCR, chart understanding, or expert knowledge.

How does a modern VLM actually answer a VQA question mechanically?

Image encoder (SigLIP or similar) produces patch tokens; a learned projector maps them into the LLM's hidden dim; the projected patches are prepended to the question tokens; the LLM autoregressively generates the answer conditioned on both. The image encoder and projector are usually trained on alignment data, then the whole stack is instruction-tuned on VQA datasets.
