Also known as: MHA, multi-head self-attention, multihead attention
TL;DR
Multi-head attention splits the attention computation into parallel heads, each with its own learned projections. Heads specialize on different relations — syntactic, semantic, positional — and their outputs are concatenated and projected. The default attention pattern in every modern transformer.
Multi-head attention runs the attention computation $h$ times in parallel, each head with its own learned projections of dimension $d_k = d_{\text{model}} / h$. The outputs of all heads are concatenated and projected by a final $W^O$. Every modern transformer (GPT, Llama, Gemma) ships this pattern; the downstream variants (grouped-query attention, MQA) derive from it.
The formula
Each head produces an $n \times d_k$ output; concatenation along the head axis recovers an $n \times d_{\text{model}}$ tensor, and $W^O$ mixes them back into the residual stream. Total compute matches a single $d_{\text{model}}$-dimensional attention: the wins come from what the heads compute, not how much.
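The shapes described above can be traced in a minimal NumPy sketch. It is an illustration rather than a reference implementation: the function name `multi_head_attention`, the fused `(d_model, d_model)` weight layout, and the toy sizes are assumptions made for this example, and it omits batching, masking, and biases.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along `axis`.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model); h: number of heads."""
    n, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # Split the feature axis into h heads of width d_k: (n, d_model) -> (h, n, d_k).
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (h, n, n)
    heads = softmax(scores) @ V                             # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # (n, d_model)
    return concat @ W_o                                     # mixed back to (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (5, 16)
```

Slicing one fused projection into `h` column blocks is equivalent to giving each head its own `(d_model, d_k)` matrix, which is why the projection cost does not depend on `h`.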
Why this works — heads specialize
Different heads attend to different relations — and that division of labor is more useful than a single bigger head computing one averaged attention pattern.
Mechanistic interpretability work has identified specific roles for individual heads: previous-token heads copying from position $t-1$, induction heads completing repeated patterns (Olsson et al. 2022), subject-verb heads resolving syntactic agreement, name-mover heads surfacing the right entity. Many capabilities live in small circuits spanning a handful of heads across a few layers. Each head keeps its own set of attention weights per token, giving $h$ independent views per layer.
Gemma-2 9B: $h = 16$, $d_k = 256$. The larger $d_k$ is unusual; most models stick near 128.
Mistral 7B: $h = 32$, $d_k = 128$. Uses GQA with 8 K/V heads.
$d_k = 64$ or $d_k = 128$ is the standard, sized to tensor-core matmul tiles. The choice of $h$ then falls out of $d_{\text{model}} / d_k$; it's not really a free hyperparameter.
The KV-cache story
Each head carries its own K and V projections, and their outputs are stored at inference in the KV cache: every previous token, every layer, every head. Full MHA on a 70B model at 128K context runs into hundreds of gigabytes per sequence at 16-bit precision, the bulk of LLM-serving memory pressure.
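A back-of-the-envelope estimate makes the scale concrete. The helper `kv_cache_bytes` and the model shape below (80 layers, 64 heads, head dimension 128, 16-bit values) are assumptions chosen to resemble a 70B-class model, not figures taken from any specific release.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_k, seq_len, bytes_per_value=2):
    # Factor 2: one K vector and one V vector per token, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * d_k * seq_len * bytes_per_value

# Assumed 70B-class shape; 128K context taken as 128 * 1024 tokens.
seq_len = 128 * 1024
full_mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, d_k=128, seq_len=seq_len)
gqa_8 = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_k=128, seq_len=seq_len)

print(f"full MHA:        {full_mha / 2**30:.0f} GiB")  # full MHA:        320 GiB
print(f"GQA, 8 KV heads: {gqa_8 / 2**30:.0f} GiB")     # GQA, 8 KV heads: 40 GiB
```

Under these assumptions the cache scales linearly with the number of K/V heads, so sharing K/V across query heads is the most direct lever on serving memory.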
The historical arc: MHA (Vaswani et al. 2017); Multi-Query Attention (Shazeer 2019) collapsed all Q heads onto one shared K/V, an $h\times$ cache reduction with a small quality hit; grouped-query attention (Ainslie et al. 2023) split the difference with $g$ groups sharing K/V. By 2026 most flagship open-weight models ship GQA, and the query side is still genuinely multi-headed.
How it composes with the rest of the block
Causal masking is applied inside each head independently: the same future-mask is applied to every head's score matrix. Flash-attention fuses the softmax and matmuls within each head and treats the head axis as a parallelizable dimension. The feedforward sublayer sees the post-$W^O$ output as a single $d_{\text{model}}$-dimensional vector; once heads are concatenated and mixed, the rest of the block is head-agnostic.
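The masking step can be sketched in NumPy as follows. The function name `causal_attention` and the `(h, n, d_k)` layout are assumptions for this illustration; a fused kernel such as flash-attention computes the same result without materializing the full score matrix.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Q, K, V: (h, n, d_k). One (n, n) future-mask is shared by every head."""
    h, n, d_k = Q.shape
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, n, n)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)            # mask broadcasts over the head axis
    scores = scores - scores.max(axis=-1, keepdims=True)  # diagonal is never masked, so max is finite
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(2, 4, 3)) for _ in range(3))
out, weights = causal_attention(Q, K, V)
print(np.round(weights[0], 2))  # lower-triangular: no weight on future positions
```

Because the first position can only attend to itself, its output in every head is exactly that head's first value vector.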
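For head $i$, define $Q_i = X W_i^Q$, $K_i = X W_i^K$, $V_i = X W_i^V$, each of shape $n \times d_k$. The head computes

$$\text{head}_i = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

which is scaled dot-product attention on the head's own $d_k$-dimensional subspace. Each head sees a learned linear projection of the input and nothing else; it can't observe what the other heads attend to until $W^O$ mixes them downstream. That subspace isolation is what lets heads specialize during training: gradient updates to head $i$ flow only through its own projections $W_i^Q$, $W_i^K$, $W_i^V$.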
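For reference, the combine step that follows the per-head computation can be written out as below. The notation ($h$ heads, sequence length $n$, output projection $W^O$) follows the standard convention of Vaswani et al. 2017 and is assumed here rather than taken from a specific implementation.

```latex
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,
\qquad
\mathrm{head}_i \in \mathbb{R}^{n \times d_k},
\quad
W^O \in \mathbb{R}^{h d_k \times d_{\text{model}}}
```

The concatenation has shape $n \times h d_k$, so $W^O$ is the first place where information from different heads is combined.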
Go further
Why split attention into multiple heads instead of one bigger head?
With a single $d_{\text{model}}$-dimensional attention, the softmax has to mediate every relation at once; splitting into $h$ heads of dimension $d_k$ lets each head specialize. Empirically, ablating individual heads removes specific capabilities (syntactic agreement, factual recall), as shown by Voita et al. 2019.
How do MHA, MQA, and GQA differ?
MHA has $h$ separate (Q, K, V) projections; MQA shares one K and one V across all Q heads; GQA splits the difference with $g$ groups of Q heads sharing K/V. Each step trades quality for KV-cache size.
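The three variants can be expressed with one function in which only the number of K/V heads changes. This is a minimal sketch: `grouped_query_attention` is a hypothetical name, the assignment of consecutive query heads to each K/V head is an assumed convention, and masking and projections are left out.

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    """Q: (h, n, d_k); K, V: (g, n, d_k) with g dividing h.
    g == h is MHA, g == 1 is MQA, anything in between is GQA."""
    h, n, d_k = Q.shape
    g = K.shape[0]
    assert h % g == 0, "number of query heads must be a multiple of the K/V groups"
    # Each K/V head serves h // g consecutive query heads.
    K_full = np.repeat(K, h // g, axis=0)  # (h, n, d_k)
    V_full = np.repeat(V, h // g, axis=0)  # (h, n, d_k)
    scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V_full  # (h, n, d_k)

rng = np.random.default_rng(0)
h, n, d_k = 8, 6, 4
Q = rng.normal(size=(h, n, d_k))
for g in (8, 2, 1):  # MHA, GQA, MQA
    K = rng.normal(size=(g, n, d_k))
    V = rng.normal(size=(g, n, d_k))
    out = grouped_query_attention(Q, K, V)
    print(g, out.shape, "cached values per token per layer:", 2 * g * d_k)
```

The output shape is the same in all three cases; only the amount of K/V state that has to be cached shrinks as `g` decreases.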