Also known as: skip connection, residual stream, shortcut connection
TL;DR
A residual connection adds a layer's input to its output, so each block computes an update on top of a running 'residual stream' rather than transforming the representation from scratch.
A residual connection is a learnable layer combined with a parallel identity shortcut: instead of $y = F(x)$, the layer computes $y = x + F(x)$. The output is the input plus an update. In a transformer, every attention sublayer and every feed-forward sublayer is wrapped in one. The cumulative effect, adding hundreds of small updates to a single running representation, is the structure that makes deep transformers trainable.
A transformer is best understood as a single residual stream that every sublayer reads from and writes to — not as a stack of distinct transformations.
The math, in one place
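A standard pre-norm transformer block:
$$x' = x + \mathrm{Attn}(\mathrm{LN}(x))$$
$$x_{\text{out}} = x' + \mathrm{FFN}(\mathrm{LN}(x'))$$
The variable $x$, sometimes called the residual stream, passes through the entire stack with each block adding to it. After $L$ layers:
$$x_L = x_0 + \sum_{i=1}^{2L} \Delta_i$$
where each $\Delta_i$ is the update produced by one sublayer (two per layer: attention and feed-forward). The final hidden state is the initial token embedding plus all the increments.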
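The accumulation can be checked numerically. The sketch below is a toy, not a real transformer: `toy_attn` and `toy_ffn` are hypothetical stand-ins for the attention and feed-forward sublayers (real attention also mixes information across positions), and the vector size and layer count are arbitrary. It only illustrates that the final state equals the initial vector plus every recorded update.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one vector to zero mean and (approximately) unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    # Elementwise addition: the residual connection itself.
    return [u + v for u, v in zip(a, b)]

def toy_attn(h):
    # Hypothetical stand-in for an attention sublayer.
    return [0.1 * v for v in h]

def toy_ffn(h):
    # Hypothetical stand-in for a feed-forward sublayer (scaled ReLU).
    return [0.05 * max(v, 0.0) for v in h]

def pre_norm_block(x, updates):
    # x' = x + Attn(LN(x)), then x_out = x' + FFN(LN(x')).
    delta_attn = toy_attn(layer_norm(x))
    x = add(x, delta_attn)
    delta_ffn = toy_ffn(layer_norm(x))
    x = add(x, delta_ffn)
    updates.extend([delta_attn, delta_ffn])  # record both sublayer updates
    return x

x0 = [1.0, -2.0, 0.5, 3.0]   # toy "token embedding"
num_layers = 6
updates = []
x = x0
for _ in range(num_layers):
    x = pre_norm_block(x, updates)

# Rebuild the final state as x0 plus the sum of all recorded updates.
reconstructed = x0
for delta in updates:
    reconstructed = add(reconstructed, delta)
```

With 6 layers the list `updates` holds 12 entries, one per sublayer, and `reconstructed` matches `x` element for element.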
Why deep nets need them
Without skip connections, a stack of $L$ nonlinear layers computes $y = f_L(f_{L-1}(\cdots f_1(x)))$. Two failure modes appear at depth:
Vanishing / exploding gradients. During backpropagation, the gradient reaching layer 1 is a product of roughly $L$ layer Jacobians. Singular values even slightly below or above 1 compound exponentially with depth.
Information loss. Each layer transforms its input. By layer 40, little of the original input may survive, because anything the model wants to preserve has to be carried forward explicitly through every transformation.
Residual connections fix both. The gradient flowing back from layer $L$ to layer 1 is a sum over all paths through the network, including the direct identity-only path, whose Jacobian is the identity matrix. And information from early layers is automatically available to late layers because the residual stream is the same vector throughout: each block adds to it but doesn't replace it.
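A one-dimensional toy makes the gradient argument concrete. In the sketch below each layer is reduced to a single local derivative, and the values in `local_derivs` are made up for illustration (small updates of alternating sign). A plain stack multiplies those derivatives directly, while a residual stack multiplies factors of the form 1 + derivative, so the identity path always contributes.

```python
import math

def plain_chain_gradient(local_derivs):
    # y = f_L(...f_1(x)): the chain rule gives the product of local derivatives.
    return math.prod(local_derivs)

def residual_chain_gradient(local_derivs):
    # x_{i+1} = x_i + f_i(x_i): each factor becomes 1 + f_i'(x_i).
    return math.prod(1.0 + d for d in local_derivs)

depth = 40
# Hypothetical local derivatives, chosen only for illustration.
local_derivs = [0.05, -0.05] * (depth // 2)

plain = plain_chain_gradient(local_derivs)
residual = residual_chain_gradient(local_derivs)

print(f"plain stack gradient:    {plain:.3e}")
print(f"residual stack gradient: {residual:.3f}")
```

Under these assumed values the plain product is vanishingly small while the residual product stays close to 1. The numbers depend entirely on the chosen derivatives; the point is only the structural difference between the two products.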
The residual-stream view
A productive way to think of a transformer: there's a single latent vector per position that evolves through the layers (the residual stream). Each attention head and feed-forward sublayer is a read-then-write operation on this stream. Reads happen via LayerNorm followed by the sublayer's input projection; writes happen via the additive output. Different components specialize for different reads and writes: some heads appear to track coreference, some track syntax, and some FFN neurons appear to store factual associations.
Mechanistic interpretability research (induction heads, indirect object identification circuits, knowledge-storage findings) is built on this view. The whole concept of a circuit — a chain of components whose reads and writes line up — only makes sense because the residual stream is preserved across the depth of the model.
The original transformer (Vaswani et al., 2017) was post-norm: $x_{\text{out}} = \mathrm{LN}(x + F(x))$. The LayerNorm sat outside the residual branch, on the stream itself, normalizing every block's output. This works at moderate depth but becomes hard to train at scale: every path from the output back to the early layers passes through a LayerNorm, so there is no clean identity route for gradients, and warmup schedules become a knife-edge.
Pre-norm, $x_{\text{out}} = x + F(\mathrm{LN}(x))$, moves the normalization inside the residual branch, applied to the input of $F$ rather than the output. Now the residual stream itself is never normalized inside the stack; only the inputs to attention and FFN get their statistics fixed. The result is a stable identity path from layer 1 to layer $L$, which is what lets GPT-style models train at 100+ layers without elaborate stabilization tricks. Most modern LLMs are pre-norm or a close variant.
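The difference between the two orderings is easy to see on a single vector. In this sketch `sublayer` is a hypothetical stand-in for attention or the FFN, and the input values are arbitrary. The post-norm step returns a re-normalized vector, so the stream itself is rewritten; the pre-norm step returns the input plus an update, so the input survives unchanged inside the sum.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one vector to zero mean and (approximately) unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(h):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [0.5 * v + 1.0 for v in h]

def post_norm_step(x):
    # Post-norm: x_out = LN(x + F(x)); the norm sits on the stream.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x):
    # Pre-norm: x_out = x + F(LN(x)); the norm sits inside the branch.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def mean(x):
    return sum(x) / len(x)

def variance(x):
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / len(x)

x = [4.0, -1.0, 2.5, 10.0]
post = post_norm_step(x)
pre = pre_norm_step(x)

# What the pre-norm step added on top of the untouched input.
pre_update = [p - a for p, a in zip(pre, x)]
```

Here `post` has mean near 0 and variance near 1 regardless of the scale of `x`, while `pre_update` is exactly the sublayer output and the original `x` is still recoverable by subtraction.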
The additive structure requires shape compatibility: you can't add two vectors of different dimensions. So $d_{\text{model}}$ is fixed across the entire stack: token embeddings, every block's input and output, and the final unembedding all use the same dimension (4096 for Llama-3-8B, 8192 for Llama-3-70B, 12288 for GPT-3 175B).
This is a real constraint. Architectures that want bottlenecks (encoder-decoder MoE, vision transformers with downsampling) have to either drop residual connections at boundary layers or insert projection layers to bridge dimensions. The cost of breaking the residual stream is the cost of breaking the gradient highway, which is why almost no modern decoder-only architecture does it.
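The projection option can be sketched as $y = W_s x + F(x)$, the form ResNet uses when a block changes dimension. Everything concrete below is an assumption made for illustration: `downsample` is a hypothetical sublayer mapping 4 dimensions to 2, and `W_s` is a hand-picked averaging matrix rather than a learned one.

```python
def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def residual_with_projection(x, f, W_s):
    # y = W_s x + f(x): the shortcut is projected so the shapes match.
    shortcut = matvec(W_s, x)
    update = f(x)
    if len(shortcut) != len(update):
        raise ValueError("shortcut and update must have the same dimension")
    return [s + u for s, u in zip(shortcut, update)]

def downsample(x):
    # Hypothetical sublayer that halves the dimension (4 -> 2).
    return [x[0] + x[1], x[2] + x[3]]

# Hand-picked projection: average adjacent pairs (a learned matrix in practice).
W_s = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

x = [1.0, 3.0, 2.0, 6.0]
y = residual_with_projection(x, downsample, W_s)
print(y)  # [6.0, 12.0]
```

The shortcut is no longer an identity map here, which is the trade-off the paragraph above describes: the dimensions line up, but the path back to earlier layers now runs through `W_s`.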
The cost
One elementwise addition per sublayer. That's the whole cost. The dimension of the residual stream stays fixed (in the thousands for modern LLMs, for example 4096 for Llama-3-8B and 8192 for Llama-3-70B), so every block's input and output have the same shape, which is what lets you stack arbitrarily many of them. Strip the residual connections from a model the size of Llama-3-70B and training would be expected to fail well before reaching usable quality.
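A back-of-the-envelope calculation shows how small that cost is. The figures below are assumptions for illustration: 80 layers for Llama-3-70B, two residual additions per layer, a nominal 70 billion parameters, and the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass.

```python
d_model = 8192               # residual stream width quoted for Llama-3-70B
num_layers = 80              # assumed depth of Llama-3-70B
sublayers_per_layer = 2      # attention + feed-forward
num_params = 70e9            # nominal parameter count (assumption)

# One scalar addition per stream element per sublayer.
residual_adds_per_token = d_model * num_layers * sublayers_per_layer

# Rule-of-thumb forward cost: about 2 FLOPs per parameter per token.
approx_forward_flops_per_token = 2 * num_params

fraction = residual_adds_per_token / approx_forward_flops_per_token

print(f"residual additions per token: {residual_adds_per_token:,}")
print(f"share of forward FLOPs:       {fraction:.2e}")
```

Under these assumptions the residual additions come to about 1.3 million per token, on the order of one hundred-thousandth of the forward-pass arithmetic.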
Go further
Why does the residual stream interpretation matter?
It reframes a transformer as a sequence of incremental edits to a shared latent — like a working memory that each block reads and writes. Mechanistic interpretability research leans heavily on this view: components like attention heads can be analyzed as readers and writers on the residual stream, with circuits formed by sequences of compatible reads and writes.
Without residual connections, two failures appear. Gradients vanish through deep stacks when each layer's Jacobian shrinks the signal. And representations get progressively smeared: by layer 40, much of the original input information may be gone. Both kill deep training. Residual connections fix both at the cost of one elementwise add per sublayer.
Residual connections come from ResNet (He et al., 2015), built for image classification. They let the team train 152-layer networks at a time when plain networks of around 30 layers were already degrading. The transformer adopted them from day one, and they have been nearly universal in deep architectures ever since. They are arguably the most important architectural primitive after attention itself.