Also known as: GQA, MQA, Multi-Query Attention, grouped query attention
TL;DR
Grouped-Query Attention shares a single key/value head across a group of query heads, shrinking the KV cache by the group factor with negligible quality loss.
Grouped-Query Attention is a variant of multi-head attention in which several query heads share a single set of key and value projections. The number of K/V heads (groups) is smaller than the number of query heads — typically 8 K/V heads for 64 query heads in modern open-weight models. The architectural change is small; the deployment impact is enormous, because it shrinks the KV cache by the same factor.
Why this matters
The bottleneck of LLM serving is not compute — it’s memory bandwidth, and specifically KV-cache memory. For a 70B-class model at 8K context, the KV cache for a single sequence is gigabytes. With full multi-head attention, batching dozens of long-context requests would blow past any reasonable VRAM budget. GQA pulls that down by the group factor (8x for Llama-3) so the same hardware can serve far more concurrent requests, or far longer contexts per request.
The quality cost is small enough to be lost in benchmark noise. The throughput gain is the difference between a viable production deployment and a prototype.
The math
Standard multi-head attention with heads has separate sets of projections. Each head produces its own at dimension . The KV cache stores for every head.
GQA partitions the query heads into groups of heads each. Each group shares one but keeps its own per head. The KV cache now stores pairs of instead of — a factor of smaller. At attention time, each query head computes against the K of its group; multiple query heads attend to the same K/V.
When (one group), GQA degenerates to Multi-Query Attention: all queries share one K/V. When (every head its own group), GQA recovers full MHA. The interesting regime is in between — Llama-3 uses for .
The KV cache stores, per layer per token, one K and one V vector per K/V head. With heads it stores vectors per token per layer; with groups it stores . Everything else — sequence length, layer count, dimension per head — is unchanged. So memory drops by the ratio . At Llama-3-70B’s 64 query heads to 8 K/V heads, that’s an 8x reduction. The same factor applies to memory bandwidth during attention computation, which is what matters for decode-step latency.
Why quality holds up
The intuitive worry: aren’t you losing 8x the attention diversity? The empirical answer is no, and the reason is that diversity lives mostly in the queries.
Different attention heads attend to different things — syntactic relations, coreference, positional patterns. But the variation across heads in what gets stored (the K/V projections) is much smaller than the variation in what gets asked for (the Q projection). Compressing K/V across heads is less destructive than it would be to compress Q. Llama 2’s GQA paper showed under 0.5 perplexity loss on most benchmarks at 8x compression.
What models use it
By 2026, every flagship open-weight LLM ships GQA. Llama-2 70B was the first major release to use it; Llama-3, Llama-4, Qwen, Mistral, DeepSeek, Gemma — all GQA. Frontier proprietary models almost certainly use it too (architecture details aren’t published, but the compute math forces something like it). The exception is some specialty research checkpoints that keep full MHA for ablation purposes.
How it composes with the rest of the inference stack
GQA stacks cleanly with flash-attention , which doesn’t care how many K/V heads exist, and with prefix caching (in fact prefix caching is more valuable when KV cache is smaller, since more cache fits per request). It’s also what makes very long contexts tractable: at 128K context, GQA is the difference between 1 GB and 8 GB of KV cache per sequence, which is the difference between batch=8 and batch=1 on an A100.
GQA is purely Pareto-improving over MHA at production scale. It’s the default now for the same reason rotary positional encoding is the default — the alternative is leaving free quality and free throughput on the table.
Go further
What's the exact difference between MHA, MQA, and GQA?
MHA gives every query head its own K and V projections; MQA collapses all query heads onto one shared K/V; GQA splits query heads into G groups, each sharing a K/V. MHA is the most expressive and most expensive; MQA is the cheapest but sometimes degrades quality; GQA picks G to recover most of MHA's quality at most of MQA's savings.
How is a GQA model trained — from scratch or converted?
Both happen. Llama 2 70B was pretrained with GQA from scratch. The GQA paper (Ainslie et al., Google, 2023) showed you can also convert a trained MHA model by mean-pooling its K/V heads into the target group count and fine-tuning briefly — about 5% of original training compute recovers most of the lost quality. (Shazeer 2019 introduced MQA but did not propose this conversion recipe; that is an Ainslie 2023 contribution.)
Why doesn't quality collapse when you share K/V across heads?
Most attention diversity lives in the query projections, not the keys/values. Heads differ mostly in what they look for, not in what they project the sequence into. So compressing K/V down to a few groups loses less than it would seem — empirically, 8 KV heads for 64 query heads costs under one percent on most benchmarks.