What does the mask actually do mechanically?
Pre-softmax, set the attention scores at future positions to $-\infty$, so the softmax assigns those positions exactly zero weight.
Also known as: causal attention, masked attention, autoregressive mask, triangular mask, causal mask
Causal masking is the lower-triangular attention mask that prevents each token from seeing tokens to its right. It is the architectural commitment that makes a transformer autoregressive — the load-bearing difference between encoder and decoder attention.
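Causal masking sets the (query, key) attention score to $-\infty$ whenever the key position comes after the query position, so the corresponding attention weight is exactly zero after the softmax.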
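For a sequence of length $n$, the mask is an $n \times n$ matrix $M$ with $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ for $j > i$.
Masked attention is then $\mathrm{softmax}\!\left(QK^\top / \sqrt{d_k} + M\right) V$.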
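The masked form can be sketched in plain Python. This is a minimal illustration, assuming standard scaled dot-product attention with an additive mask (0 for visible positions, negative infinity for future ones). The helper names `causal_mask`, `softmax`, and `causal_attention` and the toy inputs are made up for this example and do not come from any library.

```python
import math

def causal_mask(n):
    """Additive mask: 0.0 where key j <= query i, -inf where j > i."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d) + M) V for lists of row vectors."""
    n, d = len(Q), len(Q[0])
    M = causal_mask(n)
    outputs, weights = [], []
    for i in range(n):
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d) + M[i][j]
            for j in range(n)
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        )
    return outputs, weights

# Toy inputs (assumed values, three tokens with two features each).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = causal_attention(Q, K, V)
# weights[0] is [1.0, 0.0, 0.0]: the first token can only attend to itself,
# so outputs[0] is just V[0].
```

Every weight above the diagonal comes out as exactly 0.0, and each row still sums to 1 because the softmax renormalizes over the visible positions only.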
This is the architectural commitment that turns a bidirectional transformer into an autoregressive one. Without it, each position sees both past and future during training — the BERT regime, masked-token reconstruction over filled-in spans. With it, each position predicts only the next token from its prefix — the GPT regime, next-token prediction over the whole sequence. Same transformer block; different mask; different family of models.
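The "same block, different mask" point can be checked with a small experiment. The sketch below assumes a stripped-down self-attention with Q = K = V = X and no learned projections; the function name `attention` and the `causal` flag are made up for this example. It edits only the last token and asks whether the outputs at earlier positions move.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, causal):
    """Self-attention with Q = K = V = X (no projections, for brevity)."""
    n, d = len(X), len(X[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            s = sum(a * b for a, b in zip(X[i], X[j])) / math.sqrt(d)
            if causal and j > i:
                s = -math.inf  # hide future positions from query i
            scores.append(s)
        w = softmax(scores)
        out.append([sum(w[j] * X[j][c] for j in range(n)) for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_edit = [[1.0, 0.0], [0.0, 1.0], [-2.0, 3.0]]  # only the last token differs

# Compare outputs at the first two positions before and after the edit.
causal_same = attention(X, True)[:2] == attention(X_edit, True)[:2]
bidir_same = attention(X, False)[:2] == attention(X_edit, False)[:2]
```

With the causal flag on, `causal_same` is `True`: earlier positions are unaffected by a later token. With it off, `bidir_same` is `False`, because every position attends to the edited token.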
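Causality is what makes the KV cache correct. Past K and V tensors never change as new tokens arrive: token $t$'s key and value depend only on tokens $1, \dots, t$, so they can be computed once and reused at every later decoding step.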
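A small check of the KV-cache claim, as a sketch under simplifying assumptions: a single attention layer whose per-token Q, K, and V rows are given directly, where a real model would compute them from the token and the layers below. The function names `full_causal_attention` and `cached_decode` and the numbers are made up for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def weighted_sum(weights, values):
    """Mix value rows according to the attention weights."""
    return [sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))]

def full_causal_attention(Q, K, V):
    """Recompute attention over the whole sequence with a -inf mask."""
    n = len(Q)
    out = []
    for i in range(n):
        scores = [score(Q[i], K[j]) if j <= i else -math.inf for j in range(n)]
        out.append(weighted_sum(softmax(scores), V))
    return out

def cached_decode(Q, K, V):
    """Feed tokens one at a time; each step appends one K/V row and
    never rewrites the rows already in the cache."""
    k_cache, v_cache, out = [], [], []
    for q, k, v in zip(Q, K, V):
        k_cache.append(k)
        v_cache.append(v)
        # No mask needed here: the cache only ever holds past positions.
        scores = [score(q, kj) for kj in k_cache]
        out.append(weighted_sum(softmax(scores), v_cache))
    return out

# Toy per-token rows (assumed values).
Q = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.9], [0.4, 0.4]]
K = [[1.0, 0.5], [-0.3, 0.8], [0.6, -0.6], [0.1, 1.2]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

full = full_causal_attention(Q, K, V)
step = cached_decode(Q, K, V)
match = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_full, row_step in zip(full, step)
    for a, b in zip(row_full, row_step)
)
```

The step-by-step outputs agree with the full masked computation, which is the sense in which the cache is "correct": appending a token never invalidates anything computed for earlier tokens.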
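With causal masking, an entire sequence trains in one forward pass: the $n$ positions produce $n$ next-token predictions in parallel, each conditioned only on its own prefix. Inference cannot be parallelized the same way, because each new token must be generated before the next one can be predicted.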
Causal masking is the single load-bearing line of code that separates a generative LLM from a bidirectional encoder. The KV cache, the training-inference parallelism asymmetry, the encoder/decoder split: all of it follows from that one mask.
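Softmax is computed via the log-sum-exp identity $\mathrm{softmax}(x)_i = \exp\!\left(x_i - \log \sum_j \exp(x_j)\right)$, so a score of $-\infty$ contributes $\exp(-\infty) = 0$ to the sum and receives exactly zero weight.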
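In floating point, implementations often use a large negative finite constant instead of a literal $-\infty$; the masked weights typically still underflow to zero after the softmax.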
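The floating-point behavior of the two choices can be compared directly. This is an illustrative sketch: the constant `-1e9` is an assumed stand-in chosen for the example, not a value taken from the text or from any particular library. The fully masked row is a corner case that a pure causal mask never produces (each token always sees itself), but it can arise when the causal mask is combined with other masks such as padding.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

NEG_INF = -math.inf
BIG_NEG = -1e9  # assumed stand-in constant, for illustration only

# Ordinary row: one visible position, two masked ones.
normal_inf = softmax([0.5, NEG_INF, NEG_INF])
normal_big = softmax([0.5, BIG_NEG, BIG_NEG])

# Corner case: every position in the row is masked.
all_masked_inf = softmax([NEG_INF, NEG_INF, NEG_INF])
all_masked_big = softmax([BIG_NEG, BIG_NEG, BIG_NEG])
```

On the ordinary row both choices give `[1.0, 0.0, 0.0]`, since `exp(-1e9 - 0.5)` underflows to 0.0 in double precision. On the fully masked row the `-inf` version yields NaN in every slot (the subtraction `-inf - (-inf)` is undefined), while the finite constant yields a uniform distribution. Which behavior is preferable depends on the implementation.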
Encoder models like BERT want bidirectional context for representation learning, so no causal mask is applied: the additive mask is all zeros and every token sees every other token. The asymmetry between encoder and decoder attention is the load-bearing difference between BERT-style and GPT-style architectures, and it is set by one matrix.
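Because of the mask, the K and V tensors for tokens $1, \dots, t$ do not depend on any later token, so they can be computed once, cached, and reused when decoding tokens $t+1$ onward.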