Attention

Also known as: self-attention, scaled dot-product attention, multi-head attention

TL;DR

Attention is the mechanism that lets a token in a sequence dynamically read from any other token’s representation, weighted by a learned measure of how relevant that other token is. It’s the core operation behind every modern transformer and, by extension, every LLM, embedding model, and reranker.

The intuition

Take “The cat sat on the mat because it was tired.” The pronoun “it” refers to “the cat”, but in the raw token stream “it” is a 2-letter token with no inherent connection to its antecedent. The model needs to look back and decide which earlier token “it” points at.

Attention is how it does that. Each token’s representation is updated by a weighted combination of every other token’s representation, with weights learned from data. Earlier networks (RNNs, LSTMs) passed information sequentially through a single hidden state, giving every token the same blunt look-back. Attention lets the model decide, per query, which positions are relevant — with different weights for different relationships.

Scaled dot-product attention

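Given a sequence of $n$ tokens with hidden dimension $d$, attention takes three inputs:

  • $Q$ (queries): what is each token looking for?
  • $K$ (keys): what does each token offer?
  • $V$ (values): what does each token pass on if attended to?

Each is produced by a learned linear projection of the input $X$:

$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$

The attention output is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimension of each query and key vector. Position by position, the output for token $i$ is a weighted sum over the value vectors of all positions:

$$o_i = \sum_j \alpha_{ij} v_j$$

The $\alpha_{ij}$ are attention weights: non-negative, summing to 1 across $j$ for each $i$. They form an $n \times n$ matrix that’s the most direct window into what the model is “looking at”.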
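A minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)·V, may help make the shapes concrete. The sizes, the random inputs, and the names `softmax` and `scaled_dot_product_attention` are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, m) scaled similarities
    weights = softmax(scores, axis=-1)   # each row is non-negative and sums to 1
    return weights @ V, weights          # output row i is a weighted sum of value rows

# Hypothetical sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

out, alpha = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(alpha.shape, out.shape)  # (4, 4) (4, 8)
```

Row `alpha[i]` holds the weights token `i` places on every position, so inspecting it is the simplest way to see what a token is reading from.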

The scaling factor $1/\sqrt{d_k}$ prevents the softmax from saturating in high dimensions. Without it, the typical magnitude of a dot product grows as $\sqrt{d_k}$, pushing the softmax toward one-hot distributions and killing the gradient. Dividing by $\sqrt{d_k}$ keeps the variance roughly fixed across dimensions.
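To see the effect of the scaling on concrete numbers, the sketch below runs a softmax over four raw scores with and without the division by √d_k. The scores and the choice d_k = 64 are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([1.20, 4.70, 2.10, 0.80])  # illustrative raw q·k scores
d_k = 64                                      # assumed, so sqrt(d_k) = 8

scaled = softmax(scores / np.sqrt(d_k))
unscaled = softmax(scores)

print(np.round(scaled, 2))    # [0.22 0.34 0.24 0.21]
print(np.round(unscaled, 2))  # [0.03 0.89 0.07 0.02]
```

With scaling, the weights stay spread across all four positions; without it, the largest score already takes most of the mass.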

[Figure: Scaled dot-product attention, how a token reads from the rest. Every token ("The", "cat", "sat", "on"; positions 0 to 3) gets a query, key, and value. For the selected query $q_2$, raw scores $q_2 \cdot k_j$ of 1.20, 4.70, 2.10, 0.80 are divided by $\sqrt{d_k} = 8$ and softmaxed into weights $\alpha_{2,j}$ of 0.22, 0.34, 0.24, 0.21, giving the output $o_2 = \sum_j \alpha_{2j} v_j$.]

[Figure: Scaled dot-product attention, the math with actual numbers. Three tokens ("the", "cat", "sat"), each a four-dimensional vector, form $X$ ($3 \times 4$). $X$ is multiplied by $W_Q$, $W_K$, $W_V$ to give $Q$, $K$, $V$; the $3 \times 3$ scores $Q K^\top$ are divided by $\sqrt{d_k} = \sqrt{4} = 2$, softmaxed row by row so each row sums to 1, and multiplied by $V$ to give the output.]

Multi-head attention

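A single attention computation can only attend in one way at a time. Multi-head attention runs $h$ independent attentions in parallel, each with its own projections, then concatenates:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O, \qquad \mathrm{head}_i = \mathrm{Attention}\!\left(X W_Q^{(i)},\, X W_K^{(i)},\, X W_V^{(i)}\right)$$

In practice $h$ typically ranges from 8 to 64, with per-head dimension $d_k = d / h$ so total compute matches a single full-dimensional attention. The concatenation has the same dimension as the input, letting you stack transformer blocks without changing shape.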
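One way to sketch multi-head attention in NumPy is to project once with full-width matrices and then split the result into heads, which is equivalent to giving each head its own slice of columns. The function name, sizes, and random weights below are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d); W_Q, W_K, W_V, W_O: (d, d); h must divide d."""
    n, d = X.shape
    d_k = d // h

    def split(M):
        # (n, d) -> (h, n, d_k): head i gets columns [i*d_k, (i+1)*d_k)
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    weights = softmax(scores, axis=-1)                 # one pattern per head
    heads = weights @ V                                # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # back to (n, d)
    return concat @ W_O, weights

# Hypothetical sizes: 5 tokens, model width 16, 4 heads of width 4.
rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) for _ in range(4))

out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

The output has the same shape as `X`, which is what allows blocks to be stacked, and `weights[i]` is the separate attention pattern produced by head `i`.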

[Figure: Multi-head attention, $h$ attentions in parallel, each head a different pattern. Four heads (local, coreference, anchor, content) each compute $\mathrm{softmax}(Q_i K_i^\top / \sqrt{d_k})$ over the input tokens "The", "cat", "sat", "on", "mat"; their outputs are concatenated and multiplied by $W_O$: $\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) \cdot W_O$.]

Self-attention vs cross-attention

  • Self-attention: $Q$, $K$, and $V$ all come from the same sequence. This is the standard inside encoder and decoder blocks. Each token attends to other tokens of the same sequence.
  • Cross-attention: $Q$ comes from one sequence; $K$ and $V$ come from another. Used in encoder-decoder transformers where the decoder reads from the encoder’s output, and in retrieval-augmented setups.
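The difference is easiest to see in the shape of the weight matrix. In the sketch below, `decoder_states` and `encoder_states` are hypothetical random arrays, and the same projection matrices are reused for both calls only to keep the example short; a real model would learn separate ones.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
d = 8
decoder_states = rng.normal(size=(3, d))   # 3 tokens on the reading side
encoder_states = rng.normal(size=(6, d))   # 6 tokens being read from
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

# Self-attention: queries, keys, and values all from the same sequence.
_, self_w = attention(decoder_states @ W_Q,
                      decoder_states @ W_K,
                      decoder_states @ W_V)

# Cross-attention: queries from one sequence, keys and values from another.
_, cross_w = attention(decoder_states @ W_Q,
                       encoder_states @ W_K,
                       encoder_states @ W_V)

print(self_w.shape, cross_w.shape)  # (3, 3) (3, 6)
```

Self-attention yields a square matrix, while cross-attention yields one row per querying token and one column per token in the other sequence.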

Causal masking

In decoder-only transformers (the LLM workhorse), each token must only attend to itself and the tokens that came before it. This is enforced by a mask: before the softmax, set the score $s_{ij}$ to $-\infty$ for any $j > i$. Since $\exp(-\infty) = 0$, those positions get zero weight after the softmax, so future tokens contribute nothing.

This is what makes generation autoregressive: position $i$’s output depends only on positions $j \le i$. The whole sequence trains in parallel, generation runs token-by-token at inference, and the math stays consistent across both modes.
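A small NumPy sketch of the mask, under the assumption that position i may attend to positions j ≤ i. The function name and random inputs are illustrative.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention where position i ignores all j > i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i
    scores = np.where(future, -np.inf, scores)          # mask before softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                            # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

out, weights = causal_attention(Q, K, V)
print(np.round(weights, 2))  # lower-triangular: zeros above the diagonal

# Outputs for the first 3 positions are unchanged if later tokens are dropped.
prefix_out, _ = causal_attention(Q[:3], K[:3], V[:3])
print(np.allclose(out[:3], prefix_out))  # True
```

The final check is the property that lets training process the whole sequence at once while generation proceeds one token at a time.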

Compute considerations

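Self-attention costs $O(n^2 d)$ in sequence length $n$ and key dimension $d$. The quadratic scaling in $n$ is the wall behind context-length limits. FlashAttention reorders the computation to be I/O-efficient (it never materializes the full $n \times n$ attention matrix), which cuts memory traffic but leaves the amount of arithmetic quadratic. Sparse attention patterns (sliding window, BigBird, Longformer) go further and restrict each token to a subset of positions, trading full connectivity for cost that grows roughly linearly in $n$. For full attention, though, the quadratic scaling is unforgiving.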

Every modern transformer is, at its core, a stack of attention layers, and many of the engineering wins in LLM serving are wins against the quadratic.
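A rough back-of-the-envelope sketch of what the quadratic means in practice. It assumes 2 bytes per score (fp16), a per-head dimension of 128, and counts one multiply-add as two FLOPs; all three are illustrative assumptions, and the helper names are hypothetical.

```python
def score_matrix_bytes(n, bytes_per_element=2):
    """Memory to materialize one head's n x n score matrix."""
    return n * n * bytes_per_element

def attention_flops(n, d_k):
    """Approximate FLOPs for one head: Q @ K.T plus weights @ V,
    each about 2 * n^2 * d_k."""
    return 4 * n * n * d_k

for n in (4096, 32768, 131072):
    mib = score_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {mib:>8.0f} MiB per head, "
          f"{attention_flops(n, 128):.2e} FLOPs per head")
```

Doubling the sequence length quadruples both numbers, which is why avoiding the materialized matrix matters so much at long context.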

Why divide by √d_k?

Take two random query and key vectors with IID zero-mean, unit-variance components. Their dot product is a sum of $d_k$ such products, with mean zero and variance $d_k$, so its standard deviation grows as $\sqrt{d_k}$.

When $d_k = 64$, dot products typically have magnitude around 8. When $d_k = 4096$, around 64. Plug values of that size into the softmax and almost all mass concentrates on the largest element, essentially one-hot. The gradient on the runners-up is vanishingly small, and the model can’t learn from relative ordering.

Dividing by $\sqrt{d_k}$ rescales dot products back to roughly unit variance regardless of dimension. It looks like a minor implementation detail, but without it attention doesn’t train at scale.
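The variance argument can be checked empirically. The sketch below draws standard-normal queries and keys (an assumption matching the IID unit-variance setup), measures the spread of their dot products, and compares the largest softmax weight with and without scaling. Sample counts and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, samples=2000):
    """Empirical std of q·k for random unit-variance q and k."""
    q = rng.standard_normal((samples, d_k))
    k = rng.standard_normal((samples, d_k))
    return float((q * k).sum(axis=1).std())

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

results = {}
for d_k in (64, 1024):
    std = dot_product_std(d_k)
    # Scores of one random query against eight random keys.
    raw = (rng.standard_normal(d_k) * rng.standard_normal((8, d_k))).sum(axis=1)
    top_raw = float(softmax(raw).max())
    top_scaled = float(softmax(raw / np.sqrt(d_k)).max())
    results[d_k] = (std, top_raw, top_scaled)
    print(f"d_k={d_k}: std={std:.1f} (sqrt(d_k)={np.sqrt(d_k):.0f}), "
          f"top weight unscaled={top_raw:.2f}, scaled={top_scaled:.2f}")
```

The measured standard deviation should land close to √d_k, and the unscaled softmax should put noticeably more weight on its top entry than the scaled one.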

Why many heads instead of one big head?

A single attention head with $d_k = d$ has the same total parameters and compute as $h$ heads with $d_k = d / h$. Empirically the multi-head version wins by a substantial margin.
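The parameter and compute match can be checked by counting. The tally below covers only the query, key, and value projections (ignoring biases and the output projection, which is $d \times d$ in both cases) and the multiply-adds needed for the score matrix over $n$ tokens.

```latex
\begin{aligned}
\text{projection parameters, one head } (d_k = d):&\quad 3 \cdot d \cdot d = 3d^2 \\
\text{projection parameters, } h \text{ heads } (d_k = d/h):&\quad h \cdot 3 \cdot d \cdot \tfrac{d}{h} = 3d^2 \\
\text{score multiply-adds, one head}:&\quad n^2 \cdot d \\
\text{score multiply-adds, } h \text{ heads}:&\quad h \cdot n^2 \cdot \tfrac{d}{h} = n^2 d
\end{aligned}
```

The budgets are identical; what changes is that the multi-head version produces $h$ separate $n \times n$ weight matrices instead of one.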

A single head can only produce one set of attention weights per token. Different relationships in language — coreference, syntactic dependency, topical similarity, positional adjacency — have very different attention patterns. A single head has to compromise; multi-head lets each head specialize and the output projection combines them.

Head pruning studies find that a substantial fraction of heads can be removed without quality loss at inference: they’re useful during training but redundant after. This is the tension that grouped-query attention (GQA) and multi-query attention (MQA) exploit: keep head diversity for queries, share keys and values across heads to shrink the KV cache.

Where it shows up beyond LLMs

Cross-encoders used as rerankers are transformer attention applied to (query, document) pairs concatenated as input. Embeddings come from running self-attention over a document and pooling the resulting representations. Vision transformers attend over image patches; speech models over audio frames; protein language models over amino acid sequences. The same primitive does all the work; only the inputs change.

Go further

How is attention different from convolution or recurrence?

Convolutions look at fixed-size local neighborhoods; recurrence passes information sequentially through a hidden state. Attention is dynamic and global — at every step, every position decides for itself which other positions to read, weighted by learned similarity. The 'which positions to read' is a function of the input, not a fixed structural prior.

Why is the dot product the similarity used in attention?

Speed and gradient flow. Dot products are cheap (one fused multiply-add per dimension) and produce smooth gradients. Scaled dot-product attention divides by √d_k to keep the values from growing too large with embedding dimension. The same dot product underlies [cosine similarity](/concepts/cosine-similarity/) in retrieval — different application, same math.
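As a small illustration of the link to cosine similarity: cosine similarity is the dot product of the two vectors after each is divided by its length. The vectors and the helper name below are made up for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b after normalizing each to unit length."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

print(a @ b)                         # 24.0
print(cosine_similarity(a, b))       # 0.96

# Rescaling a vector changes the dot product but not the cosine.
print((10 * a) @ b)                  # 240.0
print(cosine_similarity(10 * a, b))
```

Attention uses the raw dot product, so vector length matters there, whereas cosine similarity discards it.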

What's multi-head attention and why does it help?

Instead of one attention computation, run several in parallel with different learned projections, then concatenate. Different 'heads' end up specializing — some attend to syntactic relationships, some to coreference, some to topic. The diversity of attention patterns is more useful than a single bigger attention head with the same total compute.
