Multi-Head Attention

Also known as: MHA, multi-head self-attention, multihead attention

TL;DR

Multi-head attention splits the attention computation into parallel heads, each with its own learned projections. Heads specialize on different relations (syntactic, semantic, positional), and their outputs are concatenated and projected. It is the default attention pattern in every modern transformer.

Multi-head attention runs the attention computation $h$ times in parallel, each head with its own learned projections of dimension $d_k = d_{\text{model}} / h$. Outputs of all heads are concatenated and projected by a final $W^O$. Every modern transformer (GPT, Llama, Gemma) ships this pattern; the downstream variants (GQA, MQA) derive from it.

The formula

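$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$

Each head produces an $n \times d_k$ output for a sequence of $n$ tokens; concatenation along the head axis recovers an $n \times d_{\text{model}}$ tensor, and $W^O$ mixes them back into the residual stream. Total compute matches a single $d_{\text{model}}$-dimensional attention: the wins come from what the heads compute, not how much.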
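The concatenate-and-project step can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes square `(d_model, d_model)` projection matrices whose columns are sliced into heads, uses no masking or batching, and the names `multi_head_attention` and `split_heads` are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_k): head i owns columns i*d_k:(i+1)*d_k.
        return M.reshape(n, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (n_heads, n, n)
    weights = softmax(scores, axis=-1)                 # one pattern per head
    heads = weights @ V                                # (n_heads, n, d_k)
    # Concatenate along the head axis, then mix with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (5, 16)
```

The output has the same shape as the input, which is what lets the result be added back to the residual stream.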

Why this works — heads specialize

Different heads attend to different relations — and that division of labor is more useful than a single bigger head computing one averaged attention pattern.

Interpretability work has identified specific roles for individual heads: previous-token heads copying from position $t-1$, induction heads completing repeated patterns (Olsson et al. 2022), subject-verb heads resolving syntactic agreement, name-mover heads surfacing the right entity. Many capabilities live in small circuits spanning a handful of heads across a few layers: each head contributes its own set of attention weights per token, and each layer adds independent views.

The standard numbers

Head counts in the wild
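  • GPT-3 175B: $h = 96$, $d_{\text{model}} = 12288$, $d_k = 128$.
  • Llama-3 70B: 64 query heads, $d_{\text{model}} = 8192$, $d_k = 128$. Uses GQA with 8 K/V heads.
  • Gemma-2 9B: $h = 16$, $d_k = 256$. The larger $d_k$ is unusual; most models stick near 128.
  • Mistral 7B: $h = 32$, $d_k = 128$. Uses GQA with 8 K/V heads.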

$d_k = 64$ or $d_k = 128$ is the standard, sized to tensor-core matmul tiles. The choice of $h$ falls out of $d_{\text{model}} / d_k$, so it's not really a free hyperparameter.
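As a quick sanity check of that relationship, the head count can be derived from the model width and the per-head dimension. The helper name `head_count` is hypothetical, and the widths used here (12288, 8192, 4096) are the commonly published values for these models; treat them as assumptions to verify against each model's config.

```python
def head_count(d_model: int, d_k: int) -> int:
    """Number of heads implied by a model width and a per-head dimension."""
    if d_model % d_k != 0:
        raise ValueError("d_model must be divisible by d_k")
    return d_model // d_k

# Assumed model widths; all three use d_k = 128.
for name, d_model in [("GPT-3 175B", 12288), ("Llama-3 70B", 8192), ("Mistral 7B", 4096)]:
    print(name, head_count(d_model, 128))
# GPT-3 175B 96
# Llama-3 70B 64
# Mistral 7B 32
```

Not every model follows $h \cdot d_k = d_{\text{model}}$ exactly, so this is a rule of thumb rather than a constraint.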

The KV-cache story

Each head carries its own K and V projections, and their outputs are stored at inference in the KV cache: every previous token, every layer, every head. Full MHA on a 70B model at 128K context runs into hundreds of gigabytes per sequence at 16-bit precision, the bulk of LLM-serving memory pressure.
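A back-of-the-envelope estimate makes the cache size concrete. The sketch below assumes a 70B-class shape of 80 layers with 64 heads of dimension 128 and 2 bytes per stored value (fp16/bf16); these numbers and the function name `kv_cache_bytes` are illustrative assumptions, not measurements.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_k, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence."""
    # Factor of 2: one K vector and one V vector per token, layer, and K/V head.
    return 2 * n_layers * n_kv_heads * d_k * seq_len * bytes_per_value

seq_len = 128 * 1024  # 128K tokens

mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, d_k=128, seq_len=seq_len)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_k=128, seq_len=seq_len)

print(f"MHA, 64 K/V heads: {mha / 2**30:.0f} GiB")  # 320 GiB
print(f"GQA,  8 K/V heads: {gqa / 2**30:.0f} GiB")  # 40 GiB
```

Under these assumptions, cutting the number of K/V heads from 64 to 8 shrinks the per-sequence cache by the same factor of 8.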

The historical arc: MHA (Vaswani et al. 2017); Multi-Query Attention (Shazeer 2019) collapsed all Q heads onto one shared K/V, an $h\times$ cache reduction with a small quality hit; Grouped-Query Attention (GQA; Ainslie et al. 2023) split the difference with $g$ groups sharing K/V. By 2026 nearly every flagship open-weight model ships GQA, and the query side is still genuinely multi-headed.

How it composes with the rest of the block

Causal masking is applied inside each head independently; the future-mask is per-head. FlashAttention fuses the softmax and matmuls across heads, treating the head axis as a parallelizable dimension. The MLP sees the post-$W^O$ output as a single $d_{\text{model}}$-dimensional vector; once heads are concatenated and mixed, the rest of the block is head-agnostic.
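The per-head causal mask can be illustrated with plain NumPy. This is a small sketch with random scores standing in for $Q K^\top / \sqrt{d_k}$; it shows one `(n, n)` mask broadcasting across the head axis, and it is not how a fused kernel such as FlashAttention implements masking.

```python
import numpy as np

n_heads, n = 2, 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(n_heads, n, n))  # stand-in for per-head attention logits

# Lower-triangular mask: position t may attend only to positions <= t.
causal = np.tril(np.ones((n, n), dtype=bool))

# The (n, n) mask broadcasts over the leading head axis, so each head
# gets the same future-mask applied to its own score matrix.
masked = np.where(causal, scores, -np.inf)

# Row-wise softmax; every row keeps at least its diagonal entry, so the max is finite.
masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.shape)  # (2, 4, 4)
```

Each head ends up with zero weight on future positions while its remaining weights still sum to one per row.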

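For head $i$, define $Q_i = X W_i^Q$, $K_i = X W_i^K$, $V_i = X W_i^V$, each of shape $n \times d_k$, where $X$ is the $n \times d_{\text{model}}$ input. The head computes

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

This is scaled dot-product attention on the head's own $d_k$-dimensional subspace. Each head sees a learned linear projection of the input and nothing else; it can't observe what the other heads attend to until $W^O$ mixes them downstream. That subspace isolation is what lets heads specialize during training: gradient updates to head $i$ flow only through $W_i^Q$, $W_i^K$, $W_i^V$.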
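A single head's computation can be written out directly. The sketch below assumes per-head projection matrices of shape `(d_model, d_k)` and no masking; `head_output` and the toy sizes are hypothetical names and values chosen for illustration.

```python
import numpy as np

def head_output(X, W_q_i, W_k_i, W_v_i):
    """Scaled dot-product attention for one head.

    X: (n, d_model); each W_*_i: (d_model, d_k). Returns (n, d_k).
    """
    Q, K, V = X @ W_q_i, X @ W_k_i, X @ W_v_i
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # (n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_q_i, W_k_i, W_v_i = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out = head_output(X, W_q_i, W_k_i, W_v_i)
print(out.shape)  # (4, 2)
```

The function takes only this head's three projections as arguments, which mirrors the isolation described above: nothing from any other head enters the computation.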

Go further

Why split attention into multiple heads instead of one bigger head?

With a single $d_{\text{model}}$-dimensional attention, the softmax has to mediate every relation at once; splitting into $h$ heads of dimension $d_k$ lets each head specialize. Empirically, ablating individual heads removes specific capabilities (syntactic agreement, factual recall), as shown by Voita et al. 2019.

How is MHA different from GQA and MQA?

MHA has $h$ separate (Q, K, V) projections; MQA shares one K and one V across all Q heads; GQA splits the difference with $g$ groups of Q heads, each group sharing one K/V. Each step from MHA to GQA to MQA trades some quality for a smaller KV cache.
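The three variants differ only in how many K/V heads exist, which a short NumPy sketch can make concrete. It assumes already-projected per-head tensors and consecutive grouping of query heads; the function name `grouped_query_attention` is illustrative, and real implementations avoid materializing the repeated K/V.

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    """Q: (h, n, d_k); K, V: (g, n, d_k) with h divisible by g.

    g == h reproduces MHA, g == 1 reproduces MQA, anything between is GQA.
    """
    h, g = Q.shape[0], K.shape[0]
    if h % g != 0:
        raise ValueError("number of query heads must be divisible by K/V groups")
    repeat = h // g
    # Query head i reads K/V group i // repeat.
    K = np.repeat(K, repeat, axis=0)
    V = np.repeat(V, repeat, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
h, g, n, d_k = 4, 2, 3, 2
Q = rng.normal(size=(h, n, d_k))
K = rng.normal(size=(g, n, d_k))
V = rng.normal(size=(g, n, d_k))

out = grouped_query_attention(Q, K, V)
print(out.shape)  # (4, 3, 2)
```

Only `K` and `V` need to be cached, so the cache scales with `g` rather than `h` while the output keeps one slice per query head.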

What determines the number of heads?

$h$ is chosen so that $d_k = 64$ or $d_k = 128$, the sweet spot for matmul efficiency on tensor cores. Llama-3 70B picks $h = 64$ with $d_k = 128$. The ratio matters more than the absolute count.
