Causal Masking

Also known as: causal attention, masked attention, autoregressive mask, triangular mask, causal mask

TL;DR

Causal masking is the lower-triangular attention mask that prevents each token from seeing tokens to its right. It is the architectural commitment that makes a transformer autoregressive — the load-bearing difference between encoder and decoder attention.

Causal masking sets the (query, key) score to $-\infty$ whenever the key position is to the right of the query position. After softmax those entries are exactly zero, so each token’s output depends only on itself and the tokens before it. It is one matrix, applied once per attention block, and it is the entire structural reason a decoder-only transformer is autoregressive.

The mask matrix

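For a sequence of length $n$, define the mask $M \in \mathbb{R}^{n \times n}$ over query positions $i$ and key positions $j$:

$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$

Masked attention is then $\mathrm{softmax}\left(QK^\top / \sqrt{d_k} + M\right) V$. Adding $-\infty$ pre-softmax zeroes those entries post-softmax. Lower triangle (with diagonal) survives; upper triangle vanishes. The same mask is broadcast across every head of a multi-head attention block.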
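A minimal NumPy sketch of the mask and of single-head masked attention, to make the definition concrete. The function names (`causal_mask`, `masked_attention`) and the random inputs are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Additive mask: 0 where key j <= query i, -inf where j > i."""
    i = np.arange(n)[:, None]  # query positions (rows)
    j = np.arange(n)[None, :]  # key positions (columns)
    return np.where(j > i, -np.inf, 0.0)

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax, subtracting the row max for stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k) + M) V for a single head."""
    n, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k) + causal_mask(n)
    weights = softmax(scores)
    return weights @ v, weights

# Toy inputs: 4 tokens, head dimension 8 (arbitrary choices).
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, weights = masked_attention(q, k, v)
```

In this sketch `weights` is lower-triangular with rows summing to one, and the first token attends only to itself.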

Why it matters

This is the architectural commitment that turns a bidirectional transformer into an autoregressive one. Without it, each position sees both past and future during training — the BERT regime, masked-token reconstruction over filled-in spans. With it, each position predicts only the next token from its prefix — the GPT regime, next-token prediction over the whole sequence. Same transformer block; different mask; different family of models.

The KV cache connection

Causality is what makes the KV cache correct. Past K and V tensors never change as new tokens arrive: token $t+1$ has no edge into token $t$'s representation, so cached K/V from earlier steps remain valid forever. Drop the mask and every cached row would need recomputation each step. The KV cache is the direct accounting consequence of the mask entry $M_{ij} = -\infty$ for every key position $j$ after query position $i$, not a clever trick layered on.
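The following is a toy check of that claim, assuming a single attention layer with random projection matrices (`wq`, `wk`, `wv` are hypothetical weights, not taken from any model). It decodes one token at a time with a growing K/V cache and compares the result to a full causal forward pass. It also shows that without the mask, earlier outputs change when a token is appended.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, wq, wk, wv, causal=True):
    """Full-sequence single-head self-attention."""
    n = x.shape[0]
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        idx = np.arange(n)
        scores = np.where(idx[None, :] > idx[:, None], -np.inf, scores)
    return softmax(scores) @ v

def decode_step(x_t, cache_k, cache_v, wq, wk, wv):
    """Attend from the newest token over the cached K/V plus its own row."""
    q = x_t @ wq
    cache_k = np.vstack([cache_k, x_t @ wk])  # append, never rewrite old rows
    cache_v = np.vstack([cache_v, x_t @ wv])
    scores = q @ cache_k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache_v, cache_k, cache_v

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

full = attention(x, wq, wk, wv)

cache_k, cache_v = np.empty((0, d)), np.empty((0, d))
steps = []
for x_t in x:
    out_t, cache_k, cache_v = decode_step(x_t, cache_k, cache_v, wq, wk, wv)
    steps.append(out_t)
incremental = np.stack(steps)

# Outputs for the first 4 tokens, computed before token 5 exists.
prefix_causal = attention(x[:4], wq, wk, wv)
prefix_bidir = attention(x[:4], wq, wk, wv, causal=False)
full_bidir = attention(x, wq, wk, wv, causal=False)
```

Under these assumptions `incremental` matches `full`, and `prefix_causal` matches `full[:4]`, while `prefix_bidir` differs from `full_bidir[:4]`. In a multi-layer model those per-token outputs feed the next layer's K and V, which is why the cache stays valid only under the causal mask.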

The training-inference parallelism gain

With causal masking, an entire sequence trains in one forward pass: the next-token losses are independent because each depends only on its prefix, so gradients for all positions are computed at once. At inference the same model runs token-by-token via the KV cache. Parallel during training, sequential during inference, identical weights: this is the reason decoder-only pretraining scales and encoder-decoder training, with its serial cross-attention, does not as cleanly.
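As a rough illustration, the sketch below computes every next-token loss from one forward pass and checks it against losses recomputed one prefix at a time. The model is a deliberately tiny stand-in (an embedding table, one causal attention layer with identity projections, and an unembedding matrix), all of which are assumptions for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(h):
    """Single-head causal self-attention with identity projections (toy)."""
    n, d = h.shape
    idx = np.arange(n)
    scores = h @ h.T / np.sqrt(d)
    scores = np.where(idx[None, :] > idx[:, None], -np.inf, scores)
    return softmax(scores) @ h

def next_token_losses(tokens, embed, unembed):
    """Cross-entropy at every position, from a single forward pass."""
    h = causal_attention(embed[tokens[:-1]])  # inputs: all but the last token
    probs = softmax(h @ unembed)              # one distribution per position
    targets = tokens[1:]                      # each position predicts the next token
    return -np.log(probs[np.arange(len(targets)), targets])

rng = np.random.default_rng(0)
vocab, d = 11, 8
embed = rng.standard_normal((vocab, d))
unembed = rng.standard_normal((d, vocab))
tokens = rng.integers(0, vocab, size=7)

# Parallel: all 6 losses at once.
parallel = next_token_losses(tokens, embed, unembed)

# Sequential reference: recompute each loss from its own prefix only.
sequential = np.array([
    next_token_losses(tokens[: t + 2], embed, unembed)[-1]
    for t in range(len(tokens) - 1)
])
```

The two arrays agree because position $t$ in the full pass never reads anything beyond its prefix.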

Where masking patterns differ
  • Encoder-only (BERT, RoBERTa). No mask. Full bidirectional attention. The default for representation learning.
  • Decoder-only (GPT, Llama, Claude). Strict causal mask. The default for autoregressive LLMs.
  • Encoder-decoder (T5, BART). Encoder is bidirectional; decoder is causal; decoder-to-encoder cross-attention is unmasked.
  • Prefix-LM (T5 v1.1, UL2). Bidirectional on a prefix, causal after — a block-structured mask.
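
The patterns above can be sketched as boolean "may attend" matrices (rows are queries, columns are keys). The helper names and the `prefix_len` parameter are hypothetical, chosen for illustration.

```python
import numpy as np

def bidirectional_mask(n: int) -> np.ndarray:
    """Encoder-style: every token may attend to every token."""
    return np.ones((n, n), dtype=bool)

def causal_mask(n: int) -> np.ndarray:
    """Decoder-style: key j is visible to query i only if j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n: int, prefix_len: int) -> np.ndarray:
    """Bidirectional inside the prefix, causal afterwards."""
    allowed = causal_mask(n)
    allowed[:, :prefix_len] = True  # every position sees the whole prefix
    return allowed

def to_additive(allowed: np.ndarray) -> np.ndarray:
    """Convert a boolean mask to the additive 0 / -inf form."""
    return np.where(allowed, 0.0, -np.inf)

print(prefix_lm_mask(5, 2).astype(int))
# [[1 1 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

Decoder-to-encoder cross-attention in an encoder-decoder would correspond to an all-ones rectangular matrix of shape (decoder length, encoder length).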

Causal masking is the single load-bearing line of code that separates a generative LLM from a bidirectional encoder. The KV cache, the training-inference parallelism asymmetry, the encoder/decoder split: all of it follows from that one mask matrix.

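Softmax is computed via the log-sum-exp identity, $\mathrm{softmax}(s)_j = e^{s_j - m} / \sum_k e^{s_k - m}$ with $m = \max_k s_k$, which keeps the exponentials bounded above by $1$. A masked entry of $-\infty$ produces exactly $e^{-\infty} = 0$: it contributes nothing to the row’s normalizer.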
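A small numeric check of this, using made-up scores for one query row whose last two keys are treated as future positions. It compares masking with $-\infty$ against simply dropping the masked entries.

```python
import numpy as np

def stable_softmax(s: np.ndarray):
    """Softmax with the row max subtracted; also returns the raw exponentials."""
    m = s.max()
    e = np.exp(s - m)  # every exponent is <= 0, so every term is <= 1
    return e / e.sum(), e

scores = np.array([2.0, -1.0, 0.5, 3.0])         # illustrative values
visible = np.array([True, True, False, False])   # last two keys are masked

masked_probs, exps = stable_softmax(np.where(visible, scores, -np.inf))
subset_probs, _ = stable_softmax(scores[visible])
```

Here the masked entries of `masked_probs` are exactly `0.0`, and the visible entries match `subset_probs`, the softmax taken over the visible scores alone.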

In floating point, implementations use a large negative constant like $-10^9$ rather than literal $-\infty$; the corresponding exponential underflows to zero, which is identical to exclusion. FlashAttention goes further: it never materializes the score matrix, skips upper-triangular tiles entirely, and applies the mask inside the tile-level reduction. The mask costs nothing at the FLOP level; it is purely a constraint on the dependency graph.
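The tiling idea can be sketched in NumPy as below. This is a simplified illustration, not FlashAttention's actual kernel: it processes square tiles with a running max and normalizer (an online softmax), skips tiles that lie entirely in the future, and masks only inside the diagonal tiles. The tile size and function names are assumptions.

```python
import numpy as np

def dense_causal_attention(q, k, v):
    """Reference: materialize the full score matrix, then mask."""
    n, d = q.shape
    idx = np.arange(n)
    s = np.where(idx[None, :] > idx[:, None], -np.inf, q @ k.T / np.sqrt(d))
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def tiled_causal_attention(q, k, v, tile=2):
    """Tile-by-tile causal attention with an online softmax."""
    n, d = q.shape
    out = np.zeros_like(v)
    skipped = 0
    for qs in range(0, n, tile):
        qe = min(qs + tile, n)
        m = np.full(qe - qs, -np.inf)          # running row max
        l = np.zeros(qe - qs)                  # running normalizer
        acc = np.zeros((qe - qs, v.shape[1]))  # running weighted sum of V
        for ks in range(0, n, tile):
            if ks > qe - 1:                    # whole tile is in the future
                skipped += 1
                continue
            ke = min(ks + tile, n)
            s = q[qs:qe] @ k[ks:ke].T / np.sqrt(d)
            future = np.arange(ks, ke)[None, :] > np.arange(qs, qe)[:, None]
            s = np.where(future, -np.inf, s)   # only bites on diagonal tiles
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)          # rescale earlier partial sums
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v[ks:ke]
            m = m_new
        out[qs:qe] = acc / l[:, None]
    return out, skipped

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
tiled, skipped = tiled_causal_attention(q, k, v, tile=2)
dense = dense_causal_attention(q, k, v)
```

With 6 tokens and a tile size of 2 there are 9 tiles, of which 3 are skipped outright, and the tiled result matches the dense reference.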

Go further

What does the mask actually do mechanically?

Pre-softmax, set the attention scores at future positions to $-\infty$; after softmax those entries are exactly zero, so future tokens contribute nothing to the current token's output. The operation is cheap (no extra compute, just a constant additive matrix) and exact: there is no probability mass to leak.

Why don't encoder models use causal masking?

Encoder models like BERT want bidirectional context for representation learning, so the mask is a no-op (all zeros in the additive form): every token sees every other token. The asymmetry between encoder and decoder attention is the load-bearing difference between BERT-style and GPT-style architectures, and it is set by one matrix.

How does causal masking interact with the KV cache?

Because of the mask, the K and V tensors for tokens $1, \dots, t$ are fixed once they are computed: adding token $t+1$ cannot change them, since no past token attends forward. The KV cache stores these frozen K/V and is correct precisely because of causality; relax the mask and the cache becomes stale.
