Multi-Head Attention

Also known as: MHA, multi-head self-attention, multihead attention

TL;DR

Multi-head attention splits the attention computation into parallel heads, each with its own learned projections. Heads specialize on different relations — syntactic, semantic, positional — and their outputs are concatenated and projected. The default attention pattern in every modern transformer.

Multi-head attention runs the attention computation $h$ times in parallel, each head with its own learned projections of dimension $d_k = d_{\text{model}} / h$. Outputs of all heads are concatenated and projected by a final $W^O$. Every modern transformer (GPT, Llama, Gemma) ships this pattern; the downstream variants (grouped-query attention, MQA) derive from it.

The formula

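$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

For a sequence of $n$ tokens, each head produces an $n \times d_k$ output; concatenation along the head axis recovers an $n \times d_{\text{model}}$ tensor, and $W^O$ mixes them back into the residual stream. Total compute roughly matches a single $d_{\text{model}}$-dimensional attention: the wins come from what the heads compute, not how much.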
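The concatenate-then-project step can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`matmul`, `attention_head`, `multi_head_attention`, `random_matrix`), the toy sizes, and the random weights are all assumptions made for the example, and nested lists stand in for tensors so the block has no dependencies.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head; returns an n x d_k matrix."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K to d_k x n
    scores = matmul(q, k_t)               # n x n raw scores
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """Run every head, concatenate along the feature axis, then apply w_o."""
    outputs = [attention_head(x, *params) for params in heads]
    concat = [[v for out in outputs for v in out[t]] for t in range(len(x))]
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    """Small random weights, purely for illustration."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes: 3 tokens, d_model = 8, h = 2 heads, so d_k = 4.
rng = random.Random(0)
n, d_model, h = 3, 8, 2
d_k = d_model // h
x = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(h * d_k, d_model, rng)

y = multi_head_attention(x, heads, w_o)
print(len(y), len(y[0]))  # 3 8
```

Real implementations batch all heads into a single reshaped matmul instead of looping; the loop here only makes the per-head structure visible.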

Why this works — heads specialize

Different heads attend to different relations — and that division of labor is more useful than a single bigger head computing one averaged attention pattern.

Mechanistic interpretability work has identified specific roles for individual heads: previous-token heads copying from position $i-1$, induction heads completing repeated patterns (Olsson et al. 2022), subject-verb heads resolving syntactic agreement, name-mover heads surfacing the right entity. Many capabilities live in small circuits spanning a handful of heads across a few layers: one set of attention weights per token, independent views per layer.

The standard numbers

Head counts in the wild

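GPT-3 175B: $d_{\text{model}} = 12288$, $h = 96$, $d_k = 128$.
Llama-3 70B: 64 query heads, $d_{\text{model}} = 8192$, $d_k = 128$. Uses GQA with 8 K/V heads.
Gemma-2 9B: $h = 16$, $d_k = 256$. The larger $d_k$ is unusual; most models stick near 128.
Mistral 7B: $h = 32$, $d_k = 128$. Uses GQA with 8 K/V heads.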

$d_k = 64$ or $d_k = 128$ is the standard, sized to tensor-core matmul tiles. The choice of $h$ falls out of $d_{\text{model}} / d_k$, so it’s not really a free hyperparameter.
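As a quick check of that arithmetic, the head count can be derived from the model width and the per-head dimension. The sketch below is illustrative only: `head_count` is a hypothetical helper, and the widths are the publicly reported configurations of the named models, used here as assumptions. It covers the common case where $h \cdot d_k = d_{\text{model}}$; some models size the two independently.

```python
def head_count(d_model: int, d_k: int) -> int:
    """Number of heads when the model width is split evenly into d_k-sized heads."""
    if d_model % d_k != 0:
        raise ValueError(f"d_model={d_model} is not divisible by d_k={d_k}")
    return d_model // d_k

# (d_model, d_k) pairs; assumed from publicly reported model configurations.
configs = {
    "GPT-3 175B": (12288, 128),
    "Llama-3 70B": (8192, 128),
    "Mistral 7B": (4096, 128),
}

for name, (d_model, d_k) in configs.items():
    print(name, head_count(d_model, d_k))
# GPT-3 175B 96
# Llama-3 70B 64
# Mistral 7B 32
```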

The KV-cache story

Each head carries its own K and V projections, stored at inference in the KV cache: every previous token, every layer, every head. Full MHA on a 70B model at 128K context runs into hundreds of gigabytes per sequence at 16-bit precision (about 320 GiB for 80 layers with $d_{\text{model}} = 8192$), the bulk of LLM-serving memory pressure.

The historical arc: MHA (Vaswani et al. 2017); multi-query attention (Shazeer 2019) collapsed all Q heads onto one shared K/V, an $h\times$ cache reduction with a small quality hit; grouped-query attention (Ainslie et al. 2023) split the difference with $g$ groups sharing K/V. By 2026 most flagship open-weight models ship GQA, and the query side is still genuinely multi-headed.
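The cache sizes behind this progression follow from simple arithmetic. The sketch below assumes a Llama-3-70B-like shape (80 layers, 64 query heads, $d_k = 128$), 16-bit cache values, and 128K taken as 131,072 tokens; `kv_cache_bytes` is a hypothetical helper, and real serving stacks add overhead (paging, padding, quantization) that it ignores.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_k, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence.

    The leading factor of 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * d_k * seq_len * bytes_per_value

GIB = 1024 ** 3
layers, d_k, seq_len = 80, 128, 128 * 1024  # assumed model shape and context

for label, kv_heads in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    size = kv_cache_bytes(layers, kv_heads, d_k, seq_len)
    print(f"{label}: {size / GIB:.0f} GiB")
# MHA: 320 GiB
# GQA: 40 GiB
# MQA: 5 GiB
```

Under these assumptions, going from 64 K/V heads to 8 cuts the cache by 8x, and a single shared K/V head cuts it by 64x.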

How it composes with the rest of the block

Causal masking is applied inside each head independently: every head uses the same future-mask on its own score matrix. FlashAttention fuses the softmax and matmuls into a single kernel and treats the head axis as a parallelizable dimension. The feedforward sublayer sees the post-$W^O$ output as a single $d_{\text{model}}$-dimensional vector; once heads are concatenated and mixed, the rest of the block is head-agnostic.
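Per-head causal masking can be illustrated with a small sketch. The function name `causal_weights` and the score values are made up for the example. Restricting the softmax to positions at or before the query is equivalent to setting future scores to negative infinity before a full-row softmax.

```python
import math

def causal_weights(scores):
    """Softmax each row over positions <= the query position; future weights are 0."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # token i may see tokens 0..i
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out

# Hypothetical raw attention scores for two heads over three tokens.
head_scores = [
    [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]],
    [[0.5, 0.5, 0.5], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]],
]

# The same mask shape is applied to each head's own score matrix.
masked = [causal_weights(scores) for scores in head_scores]
for weights in masked:
    for row in weights:
        print([round(w, 3) for w in row])
```

Each head keeps its own weights below the diagonal, while every entry above the diagonal is zero in all heads.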

For head $i$, define $Q_i = X W_i^Q$, $K_i = X W_i^K$, $V_i = X W_i^V$, each of shape $n \times d_k$. The head computes

$$\text{head}_i = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

which is scaled dot-product attention on the head’s own $d_k$-dimensional subspace. Each head sees a learned linear projection of the input and nothing else; it can’t observe what the other heads attend to until $W^O$ mixes them downstream. That subspace isolation is what lets heads specialize during training: gradient updates to head $i$'s projections flow only through $\text{head}_i$.

Go further

Why split attention into multiple heads instead of one bigger head?

With a single $d_{\text{model}}$-dimensional attention, the softmax has to mediate every relation at once; splitting into $h$ heads of dimension $d_k$ lets each head specialize. Empirically, ablating individual heads removes specific capabilities (syntactic agreement, factual recall), as shown by Voita et al. 2019.


How is MHA different from GQA and MQA?

MHA has $h$ separate (Q, K, V) projections; MQA shares one K and one V across all $h$ Q heads; GQA splits the difference with $g$ groups of Q heads sharing K/V. Each step trades quality for KV-cache size.
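The three schemes differ only in how query heads map onto K/V heads, which a short sketch makes concrete. `kv_head_for` is a hypothetical helper, the head counts are toy values, and contiguous grouping is assumed as the convention; a given implementation may order heads differently.

```python
def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Index of the K/V head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into K/V groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

n_q = 8  # toy number of query heads
for name, n_kv in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    print(name, [kv_head_for(q, n_q, n_kv) for q in range(n_q)])
# MHA [0, 1, 2, 3, 4, 5, 6, 7]
# GQA [0, 0, 0, 0, 1, 1, 1, 1]
# MQA [0, 0, 0, 0, 0, 0, 0, 0]
```

MHA and MQA fall out as the two extremes of the same mapping: one K/V head per query head, or one K/V head for all of them.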


What determines the number of heads?

$h$ is chosen so that $d_k = 64$ or $d_k = 128$, the sweet spot for matmul efficiency on tensor cores. Llama-3 70B picks $h = 64$ with $d_{\text{model}} = 8192$. The ratio matters more than the absolute count.
