Attention

Also known as: self-attention, scaled dot-product attention, multi-head attention

TL;DR

Attention is the mechanism that lets a token in a sequence dynamically read from any other token’s representation, weighted by a learned measure of how relevant that other token is. It’s the core operation behind every modern transformer and, by extension, every transformer-based LLM, embedding model, and reranker.

The intuition

Take “The cat sat on the mat because it was tired.” The pronoun “it” refers to “the cat”, but in the raw token stream “it” is a 2-letter token with no inherent connection to its antecedent. The model needs to look back and decide which earlier token “it” points at.

Attention is how it does that. Each token’s representation is updated by a weighted combination of every other token’s representation, with weights learned from data. Earlier networks (RNNs, LSTMs) passed information sequentially through a single hidden state, giving every token the same blunt look-back. Attention lets the model decide, per query, which positions are relevant — with different weights for different relationships.

Scaled dot-product attention

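Given a sequence of $n$ tokens with hidden dimension $d$, attention takes three inputs:

$Q$ (queries): what is each token looking for?
$K$ (keys): what does each token offer?
$V$ (values): what does each token pass on if attended to?

Each is produced by a learned linear projection of the input $X \in \mathbb{R}^{n \times d}$:

$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$

The attention output is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Position by position, the output for token $i$ is a weighted sum over the value vectors of all positions:

$$o_i = \sum_{j} \alpha_{ij} v_j, \qquad \alpha_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{l} \exp\left(q_i \cdot k_l / \sqrt{d_k}\right)}$$

The $\alpha_{ij}$ are attention weights: non-negative, summing to 1 across $j$ for each $i$. They form an $n \times n$ matrix that’s the most direct window into what the model is “looking at”.

The scaling factor $1/\sqrt{d_k}$ prevents the softmax from saturating in high dimensions. Without it, dot products grow as $\sqrt{d_k}$, pushing the softmax toward one-hot distributions and killing the gradient. Dividing by $\sqrt{d_k}$ keeps the variance roughly fixed across dimensions.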
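The formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production kernel: the sizes, the random stand-in for the input, and the random (unlearned) projection matrices are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, m) scaled similarity scores
    weights = softmax(scores, axis=-1)   # each row is non-negative and sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8                     # illustrative sizes, not from the text
X = rng.normal(size=(n, d))              # stand-in token representations
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))  # random, not learned

out, alpha = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, alpha.shape)            # (5, 8) (5, 5)
```

Row `i` of `alpha` holds the weights token `i` places on every position, which is the matrix people usually plot when inspecting attention.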

Multi-head attention

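A single attention computation can only attend in one way at a time. Multi-head attention runs $h$ independent attentions in parallel with their own projections, then concatenates:

$$\text{head}_i = \text{Attention}\left(X W_Q^{(i)},\; X W_K^{(i)},\; X W_V^{(i)}\right), \qquad \text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O$$

In practice $h$ ranges from 8 to 64, with per-head dimension $d_k = d / h$ (for hidden dimension $d$) so total compute matches a single full-dimensional attention. The concatenation has the same dimension as the input, letting you stack transformer blocks without changing shape.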
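A rough NumPy sketch of the same idea follows. It assumes the common trick of packing all per-head projections into one matrix per role and slicing the result into heads, which is equivalent to giving each head its own smaller projection. The sizes and random weights are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d); W_Q, W_K, W_V, W_O: (d, d); h must divide d."""
    n, d = X.shape
    d_k = d // h

    def split(M):
        # (n, d) -> (h, n, d_k): each head gets its own slice of the columns.
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n), one matrix per head
    heads = softmax(scores) @ V                         # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)     # heads side by side: (n, d)
    return concat @ W_O                                 # output projection mixes the heads

rng = np.random.default_rng(0)
n, d, h = 6, 32, 4                                      # illustrative sizes
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(Y.shape)                                          # (6, 32)
```

The output has the same shape as the input, which is what lets blocks be stacked.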

Self-attention vs cross-attention

Self-attention: $Q$, $K$, $V$ all come from the same sequence. The standard inside encoder and decoder blocks. Each token attends to other tokens of the same sequence.
Cross-attention: $Q$ comes from one sequence; $K$ and $V$ come from another. Used in encoder-decoder transformers where the decoder reads from the encoder’s output, and in retrieval-augmented setups.
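The practical difference shows up in the shape of the weight matrix. The sketch below skips the learned projections and uses two hypothetical sequences of different lengths purely to make that visible.

```python
import numpy as np

def attention_weights(Q, K):
    # Row-wise softmax of scaled dot products; rows index queries, columns index keys.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 8
decoder_states = rng.normal(size=(3, d_k))   # hypothetical target-side sequence, 3 tokens
encoder_states = rng.normal(size=(7, d_k))   # hypothetical source-side sequence, 7 tokens

self_w = attention_weights(decoder_states, decoder_states)    # queries and keys from one sequence
cross_w = attention_weights(decoder_states, encoder_states)   # queries from one, keys from the other
print(self_w.shape, cross_w.shape)           # (3, 3) (3, 7)
```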

Causal masking

In decoder-only transformers (the LLM workhorse), each token must only attend to itself and the tokens that came before it. This is enforced by a mask: before the softmax, set the score $s_{ij} = q_i \cdot k_j / \sqrt{d_k}$ to $-\infty$ for any $j > i$. The softmax maps a score of $-\infty$ to a weight of 0, so future tokens contribute nothing.

This is what makes generation autoregressive: position $i$’s output depends only on positions $j \le i$. The whole sequence trains in parallel, generation runs token-by-token at inference, and the math stays consistent across both modes.
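A small NumPy sketch of the mask, with made-up sizes and random inputs. The check at the end perturbs the last position and confirms that earlier outputs do not move, which is the property the mask is meant to guarantee.

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal, i.e. where key position j > query position i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)   # diagonal is never masked, so max is finite
    weights = np.exp(scores)                               # exp(-inf) is exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 4, 8                                              # illustrative sizes
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, w = causal_attention(Q, K, V)

# Changing the last token's value must not change any earlier output.
V2 = V.copy()
V2[-1] += 100.0
out2, _ = causal_attention(Q, K, V2)
print(np.allclose(out[:-1], out2[:-1]))                    # True
```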

Compute considerations

Self-attention is $O(n^2 d_k)$ in sequence length $n$ and key dimension $d_k$. The quadratic scaling in $n$ is the wall behind the context window. FlashAttention reorders the computation to be I/O-efficient (it never materializes the full $n \times n$ attention matrix) but still does quadratic work; sparse attention patterns (sliding window, BigBird, Longformer) cut the cost to roughly linear in $n$ by restricting which positions each token can attend to; but for full attention the quadratic scaling is unforgiving.
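To get a feel for the quadratic term, here is a back-of-the-envelope calculation of how large one fully materialized score matrix would be. The 2 bytes per entry (16-bit floats) and the sequence lengths are assumptions picked for round numbers.

```python
def attention_matrix_bytes(n, bytes_per_score=2):
    # One n x n score matrix for a single head of a single layer.
    return n * n * bytes_per_score

for n in (1_024, 8_192, 65_536):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n}: {gib:.3f} GiB per head per layer")
# n=1024: 0.002 GiB per head per layer
# n=8192: 0.125 GiB per head per layer
# n=65536: 8.000 GiB per head per layer
```

Each 8x increase in sequence length multiplies the matrix by 64, which is why avoiding materialization matters so much at long context.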

Every modern transformer is, at its core, a stack of attention layers, and many of the engineering wins in LLM serving are wins against the quadratic.

Why the scaling factor matters

Take two random query and key vectors $q, k \in \mathbb{R}^{d_k}$ with IID zero-mean, unit-variance components. Their dot product is a sum of $d_k$ such products, with mean zero and variance $d_k$, so the standard deviation grows as $\sqrt{d_k}$.

When $d_k = 64$, dot products typically have magnitude around 8. When $d_k = 4096$, around 64. Plug values of that size into the softmax and almost all mass concentrates on the largest element, making it essentially one-hot. The gradient on the runners-up is vanishingly small, and the model can’t learn from relative ordering.

Dividing by $\sqrt{d_k}$ rescales dot products back to roughly unit variance regardless of dimension. It looks like a minor implementation detail, but without it attention doesn’t train at scale.
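The argument can be checked empirically. The sketch below draws random queries and keys (the counts of each are arbitrary choices), then compares the spread of the dot products and the average largest softmax weight with and without the scaling.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_queries, n_keys = 200, 16                  # illustrative sizes
stats = {}

for d_k in (64, 4096):
    Q = rng.normal(size=(n_queries, d_k))    # IID zero-mean, unit-variance components
    K = rng.normal(size=(n_keys, d_k))
    scores = Q @ K.T                          # raw dot products, shape (n_queries, n_keys)
    scaled = scores / np.sqrt(d_k)
    stats[d_k] = {
        "std_raw": scores.std(),                            # expected near sqrt(d_k)
        "std_scaled": scaled.std(),                         # expected near 1
        "peak_raw": softmax(scores).max(axis=-1).mean(),    # mean largest weight per query
        "peak_scaled": softmax(scaled).max(axis=-1).mean(),
    }

for d_k, s in stats.items():
    print(d_k, {name: round(float(v), 3) for name, v in s.items()})
```

Without scaling, the largest weight per query should sit close to 1 (nearly one-hot); with scaling it should stay well below that, leaving the other keys a usable share of the mass.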

Why multiple heads

A single attention head with $d_k = d$ (the full hidden dimension) has the same total parameters and compute as $h$ heads with $d_k = d / h$. Empirically the multi-head version wins by a substantial margin.

A single head can only produce one set of attention weights per token. Different relationships in language — coreference, syntactic dependency, topical similarity, positional adjacency — have very different attention patterns. A single head has to compromise; multi-head lets each head specialize and the output projection combines them.

Head pruning studies find that a substantial fraction of heads can be removed without quality loss at inference: they’re useful during training but redundant after. This redundancy is what grouped-query attention (GQA) and multi-query attention (MQA) exploit: keep head diversity for queries, share keys and values across heads to shrink the KV cache.
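The KV cache saving is simple arithmetic. The model shape below is hypothetical, chosen only so the numbers come out round; the helper name and the 2-bytes-per-value assumption (16-bit floats) are likewise illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor of 2: one cached tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, not taken from any specific model.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)
n_query_heads = 32

for name, n_kv_heads in [("MHA", n_query_heads), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_kv_heads=n_kv_heads, **cfg) / 2**30
    print(f"{name}: {n_kv_heads} KV heads -> {gib:.4f} GiB per sequence")
# MHA: 32 KV heads -> 2.0000 GiB per sequence
# GQA: 8 KV heads -> 0.5000 GiB per sequence
# MQA: 1 KV heads -> 0.0625 GiB per sequence
```

The number of query heads never enters the formula, which is why sharing keys and values shrinks the cache without reducing query-side diversity.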

Where it shows up beyond LLMs

Cross-encoders used as rerankers are transformer-attention applied to (query, document) pairs concatenated as input. Bi-encoder embeddings come from running self-attention over a document and pooling the resulting representations. Vision transformers attend over image patches; speech models over audio frames; protein language models over amino acid sequences. The same primitive does all the work — only the inputs change.

Go further

How is attention different from convolution or recurrence?

Convolutions look at fixed-size local neighborhoods; recurrence passes information sequentially through a hidden state. Attention is dynamic and global — at every step, every position decides for itself which other positions to read, weighted by learned similarity. The 'which positions to read' is a function of the input, not a fixed structural prior.

Why is the dot product the similarity used in attention?

Speed and gradient flow. Dot products are cheap (one fused multiply-add per dimension) and produce smooth gradients. Scaled dot-product attention divides by √d_k to keep the values from growing too large with embedding dimension. The same dot product underlies [cosine similarity](/concepts/cosine-similarity/) in retrieval — different application, same math.
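As a quick illustration of that last point, cosine similarity is just the dot product after each vector is scaled to unit length. The two vectors here are arbitrary examples.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])          # same direction as a, twice the length

print(a @ b)                      # 50.0
print(cosine_similarity(a, b))    # 1.0
```

The raw dot product grows with vector length, while the cosine only sees direction.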

What's multi-head attention and why does it help?

Instead of one attention computation, run several in parallel with different learned projections, then concatenate. Different 'heads' end up specializing — some attend to syntactic relationships, some to coreference, some to topic. The diversity of attention patterns is more useful than a single bigger attention head with the same total compute.
