Grouped-Query Attention (GQA)

Also known as: GQA, MQA, Multi-Query Attention, grouped query attention

TL;DR

Grouped-Query Attention shares a single key/value head across a group of query heads, shrinking the KV cache by the group factor with negligible quality loss.

Grouped-Query Attention is a variant of multi-head attention in which several query heads share a single set of key and value projections. The number of K/V heads (groups) is smaller than the number of query heads, typically 8 K/V heads for 64 query heads in modern open-weight models. The architectural change is small; the deployment impact is enormous, because it shrinks the KV cache by the same factor.

[Figure: Grouped-Query Attention. The spectrum: many K/V → few K/V → one K/V, shown for h = 8 query heads (Q1 to Q8). MHA: 8 KV heads, one K/V per query head (baseline; full quality, full memory). GQA: 2 KV groups of 4 query heads each (smaller; sweet spot, ~zero quality loss). MQA: 1 shared KV head (smaller; cheapest, noticeable quality loss).]

Why this matters

The bottleneck of LLM serving is not compute. It’s memory bandwidth, and specifically KV-cache memory. For a 70B-class model at 8K context, the KV cache for a single sequence runs to gigabytes. With full multi-head attention, batching dozens of long-context requests would blow past any reasonable VRAM budget. GQA pulls that down by the group factor (8x for Llama-3-70B), so the same hardware can serve far more concurrent requests, or far longer contexts per request.

The quality cost is small enough to be lost in benchmark noise. The throughput gain is the difference between a viable production deployment and a prototype.

The math

Standard multi-head attention with $h$ heads has $h$ separate sets of $Q$, $K$, $V$ projections. Each head produces its own $Q$, $K$, $V$ at dimension $d_k$. The KV cache stores $K$ and $V$ for every head.

GQA partitions the $h$ query heads into $G$ groups of $h/G$ heads each. Each group shares one $(K, V)$ pair but keeps its own $Q$ per head. The KV cache now stores $G$ pairs of $(K, V)$ instead of $h$, a factor of $h/G$ smaller. At attention time, each query head computes against the $K$ and $V$ of its group; multiple query heads attend to the same K/V.
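The grouping described above can be sketched for a single query position in plain Python. This is a minimal illustration, not a production kernel: the function name `gqa_attend`, the list-of-lists layout, and the contiguous assignment of query heads to groups are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_attend(q_heads, k_groups, v_groups):
    """Attention for one query position under grouped-query attention.

    q_heads:  h query vectors, one per query head, each of length d.
    k_groups: G key caches; k_groups[g][t] is the key vector at position t.
    v_groups: G value caches, same layout as k_groups.
    Returns h output vectors, one per query head.
    """
    h, G = len(q_heads), len(k_groups)
    if h % G != 0:
        raise ValueError("query heads must divide evenly into K/V groups")
    heads_per_group = h // G
    d = len(q_heads[0])
    outputs = []
    for i, q in enumerate(q_heads):
        g = i // heads_per_group  # assumed contiguous grouping of heads
        keys, values = k_groups[g], v_groups[g]
        # Scaled dot-product scores against the shared keys of group g.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the shared values of group g.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
        outputs.append(out)
    return outputs


# Hypothetical toy input: h = 4 query heads, G = 2 groups, d = 2, 2 cached tokens.
q_heads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k_groups = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]]
v_groups = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [0.0, 2.0]]]
outputs = gqa_attend(q_heads, k_groups, v_groups)
```

Only `k_groups` and `v_groups` need to be cached between decode steps, which is why the cache scales with the number of groups rather than the number of query heads.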

When $G = 1$ (one group), GQA degenerates to Multi-Query Attention: all queries share one K/V. When $G = h$ (every head its own group), GQA recovers full MHA. The interesting regime is in between: Llama-3-70B uses $G = 8$ for $h = 64$.
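The two limiting cases are easy to see by listing which K/V group each query head reads from. The helper below is a hypothetical sketch that assumes heads are assigned to groups contiguously; real implementations may order heads differently.

```python
def kv_group_for_head(i, h, G):
    """Index of the K/V group that query head i reads from (contiguous grouping)."""
    if h % G != 0:
        raise ValueError("h must be divisible by G")
    return i // (h // G)


def head_to_group_map(h, G):
    """K/V group index for every query head 0..h-1."""
    return [kv_group_for_head(i, h, G) for i in range(h)]


print(head_to_group_map(8, 8))  # MHA: [0, 1, 2, 3, 4, 5, 6, 7]
print(head_to_group_map(8, 2))  # GQA: [0, 0, 0, 0, 1, 1, 1, 1]
print(head_to_group_map(8, 1))  # MQA: [0, 0, 0, 0, 0, 0, 0, 0]
```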

The KV cache stores, per layer per token, one K and one V vector per K/V head. With $h$ heads it stores $2h$ vectors per token per layer; with $G$ groups it stores $2G$. Everything else (sequence length, layer count, dimension per head) is unchanged. So memory drops by the ratio $h/G$. At Llama-3-70B’s 64 query heads to 8 K/V heads, that’s an 8x reduction. The same factor applies to memory bandwidth during attention computation, which is what matters for decode-step latency.
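As a rough worked example of that arithmetic, the sketch below computes the per-sequence cache size. The shape is an assumption chosen to resemble a 70B-class model (80 layers, 128-dimensional heads, 16-bit cache entries, 8192 tokens); the function name is made up for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence.

    The leading 2 counts one K vector plus one V vector per K/V head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shape: 80 layers, head_dim 128, 16-bit values, 8192-token context.
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=8192)

print(mha / 2**30)  # 20.0 (GiB with 64 K/V heads)
print(gqa / 2**30)  # 2.5  (GiB with 8 K/V heads)
print(mha // gqa)   # 8
```

Under these assumptions the ratio is exactly the 64-to-8 head ratio, since every other term in the product is shared.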

Why quality holds up

The intuitive worry: aren’t you losing 8x the attention diversity? The empirical answer is no, and the reason is that diversity lives mostly in the queries.

Different attention heads attend to different things: syntactic relations, coreference, positional patterns. But the variation across heads in what gets stored (the K/V projections) is much smaller than the variation in what gets asked for (the Q projection). Compressing K/V across heads is therefore less destructive than compressing Q would be. The GQA ablation in the Llama 2 paper showed under 0.5 perplexity loss on most benchmarks at 8x compression.

What models use it

By 2026, nearly every flagship open-weight LLM ships GQA or a close relative. Llama-2 70B was the first major release to use it; Llama-3, Llama-4, Qwen, Mistral, and Gemma all use GQA. DeepSeek is a partial exception: its V2 and V3 models use Multi-head Latent Attention, a different way of compressing the KV cache. Frontier proprietary models almost certainly use something similar (architecture details aren’t published, but the memory math forces something like it). The other exception is some specialty research checkpoints that keep full MHA for ablation purposes.

How it composes with the rest of the inference stack

GQA stacks cleanly with , which doesn’t care how many K/V heads exist, and with prefix caching (in fact prefix caching is more valuable when KV cache is smaller, since more cache fits per request). It’s also what makes very long contexts tractable: at 128K context, GQA is the difference between 1 GB and 8 GB of KV cache per sequence, which is the difference between batch=8 and batch=1 on an A100.

GQA is close to a pure Pareto improvement over MHA at production scale: a large throughput gain for a quality cost that is hard to measure. It’s the default now for the same reason rotary positional encoding is the default: the alternative is leaving an essentially free gain on the table.

Go further

What's the exact difference between MHA, MQA, and GQA?

MHA gives every query head its own K and V projections; MQA collapses all query heads onto one shared K/V; GQA splits query heads into G groups, each sharing a K/V. MHA is the most expressive and most expensive; MQA is the cheapest but sometimes degrades quality; GQA picks G to recover most of MHA's quality at most of MQA's savings.

How is a GQA model trained — from scratch or converted?

Both happen. Llama 2 70B was pretrained with GQA from scratch. The GQA paper (Ainslie et al., Google, 2023) showed you can also convert a trained MHA model by mean-pooling its K/V heads into the target group count and fine-tuning briefly — about 5% of original training compute recovers most of the lost quality. (Shazeer 2019 introduced MQA but did not propose this conversion recipe; that is an Ainslie 2023 contribution.)
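The mean-pooling step of that conversion can be sketched as follows. This is an illustrative version only: it assumes each head's K (or V) projection is available as its own small matrix and that heads are grouped contiguously, and it leaves out the fine-tuning that follows. The function name is hypothetical.

```python
def mean_pool_kv_heads(head_weights, G):
    """Pool h per-head projection matrices down to G by averaging each group.

    head_weights: list of h matrices (lists of rows), one per original K or V head.
    Returns G matrices; matrix g is the element-wise mean of the heads in group g.
    """
    h = len(head_weights)
    if h % G != 0:
        raise ValueError("number of heads must be divisible by G")
    size = h // G
    pooled = []
    for g in range(G):
        group = head_weights[g * size:(g + 1) * size]
        rows, cols = len(group[0]), len(group[0][0])
        pooled.append(
            [[sum(w[r][c] for w in group) / size for c in range(cols)]
             for r in range(rows)]
        )
    return pooled


# Toy example: four 1x2 head matrices pooled into two groups.
heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
print(mean_pool_kv_heads(heads, 2))  # [[[2.0, 3.0]], [[6.0, 7.0]]]
```

The same function would be applied once to the K projections and once to the V projections; the query projections are left untouched.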

Why doesn't quality collapse when you share K/V across heads?

Most attention diversity lives in the query projections, not the keys/values. Heads differ mostly in what they look for, not in what they project the sequence into. So compressing K/V down to a few groups loses less than it would seem — empirically, 8 KV heads for 64 query heads costs under one percent on most benchmarks.
