Why does no activation mean no expressivity?
Two stacked linear layers
Also known as: nonlinearity, transfer function
An activation function is the elementwise nonlinearity sandwiched between the linear layers of a neural network. Without it, the whole network collapses to a single linear map.
An activation function is an elementwise nonlinearity applied between the linear layers of a neural network. It takes a vector and returns a vector of the same shape, with some scalar function applied to each entry independently. Without it, stacking linear layers buys you nothing — the whole network mathematically collapses to a single linear map. The activation is the entire reason depth produces expressive function classes.
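Two stacked linear layers without a nonlinearity in between:
$$y = W_2 (W_1 x)$$
is just one linear layer with weight $W = W_2 W_1$, since $W_2 (W_1 x) = (W_2 W_1) x$. Biases do not change this: $W_2 (W_1 x + b_1) + b_2 = W x + b$ with $b = W_2 b_1 + b_2$.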
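As a quick numerical check of this collapse, here is a minimal sketch in plain Python. The matrix shapes, the seed, and the helper names (`matvec`, `matmul`, `net`) are illustrative assumptions, not anything prescribed by a particular library. It compares two stacked linear layers against the single merged layer, then shows that putting a ReLU in between breaks linearity.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Elementwise ReLU on a vector."""
    return [max(0.0, t) for t in v]

rng = random.Random(0)  # fixed seed so the run is repeatable
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # maps 3 -> 4
W2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # maps 4 -> 2
x = [rng.uniform(-1, 1) for _ in range(3)]

# Without an activation: W2 (W1 x) equals (W2 W1) x up to rounding.
two_layers = matvec(W2, matvec(W1, x))
one_layer = matvec(matmul(W2, W1), x)
linear_gap = max(abs(a - b) for a, b in zip(two_layers, one_layer))

# With a ReLU in between, the map is no longer linear.
# Any linear map f satisfies f(x) + f(-x) = 0; this network does not.
def net(v):
    return matvec(W2, relu(matvec(W1, v)))

neg_x = [-t for t in x]
symmetry_gap = max(abs(a + b) for a, b in zip(net(x), net(neg_x)))

print("linear gap:", linear_gap)
print("f(x) + f(-x) with ReLU:", symmetry_gap)
```

The first gap should sit at floating-point rounding level, while the second is clearly nonzero: the ReLU network cannot be rewritten as a single matrix.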
The shift from ReLU to GELU/SiLU was empirical: papers that benchmarked them on transformer training generally reported a small but consistent improvement at little extra compute cost.
A transformer block has two main sub-layers: attention and the feedforward network. The activation lives inside the feedforward network, which has exactly one nonlinearity, between the two linear projections:
$$\mathrm{FFN}(x) = W_2 \, \mathrm{GELU}(W_1 x + b_1) + b_2$$
That single GELU does all of the elementwise nonlinear work in each block. Attention contains a softmax, a vector-wise nonlinearity. These two, the per-block GELU and the per-attention softmax, are the only activation-style nonlinearities in the whole transformer. Everything else is matrix multiplications, additions, and normalizations (normalization is not strictly linear either, but it rescales activations rather than acting as an activation function).
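The feedforward sub-layer described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the function names (`linear`, `feedforward`), the tiny dimensions, and the hand-picked weights are all hypothetical, and the exact erf-based GELU is used rather than the tanh approximation some libraries offer.

```python
import math

def gelu(t):
    """Exact GELU: t * Phi(t), where Phi is the standard normal CDF."""
    return 0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0)))

def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def feedforward(x, W1, b1, W2, b2):
    hidden = linear(W1, b1, x)           # expand: d_model -> d_ff
    hidden = [gelu(h) for h in hidden]   # the single elementwise nonlinearity
    return linear(W2, b2, hidden)        # project back: d_ff -> d_model

# Toy sizes (assumed for illustration): d_model = 2, d_ff = 4.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 1.0]]
b2 = [0.0, 0.0]

out = feedforward([1.0, -2.0], W1, b1, W2, b2)
print(out)
```

With these weights the hidden pre-activations are [1, -2, -1, -1], so the first output is GELU(1) and the second is the sum of three small negative GELU values.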
Every few years a paper proposes a new activation: Mish, ELU, learned activations, polynomial activations. These typically show modest improvements on a benchmark and then fade from use. The reason: GELU and SiLU are good enough that any improvement is dominated by other choices (data, model size, optimizer). The marginal value of activation engineering is tiny once you're at the GELU/SiLU plateau. The one exception is the GLU family. Gated linear units like SwiGLU (used in Llama) replace the FFN's first projection with two parallel projections, pass one of them through an activation (SiLU in SwiGLU), and multiply the two elementwise. This has produced consistent improvements at scale and is now standard in newer architectures.
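To make the gating idea concrete, here is a minimal sketch of a SwiGLU-style feedforward in plain Python. The parameter names `W_gate`, `W_up`, and `W_down` are hypothetical labels chosen for this example, biases are omitted as a simplifying assumption, and the weights are toy values.

```python
import math

def silu(t):
    """SiLU (Swish): t * sigmoid(t)."""
    return t / (1.0 + math.exp(-t))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]  # activated branch
    up = matvec(W_up, x)                         # plain linear branch
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise gating
    return matvec(W_down, hidden)                # project back to d_model

# Toy weights, assumed purely for illustration.
W_gate = [[1.0, 0.0], [0.0, 1.0]]
W_up = [[1.0, 1.0], [1.0, -1.0]]
W_down = [[1.0, 0.0], [0.0, 1.0]]

out = swiglu_ffn([1.0, 2.0], W_gate, W_up, W_down)
print(out)
```

Compared with the plain GELU feedforward, the only structural change is the second parallel projection and the elementwise product; the nonlinearity itself is still a single SiLU.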
Pick GELU or SiLU, put it inside the feedforward, and stop thinking about it.
GELU is smooth, differentiable everywhere, and slightly negative for negative inputs (its minimum is about -0.17, near x = -0.75), which empirically produces slightly better optimization in transformers than ReLU's hard cutoff. Most major transformers since BERT have used GELU or its close cousin SiLU (also called Swish), though some have kept ReLU. The differences are real but small; the field converged for stability reasons more than dramatic gains.
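For reference, the three activations can be written out directly. The sketch below uses the standard definitions (ReLU as max(0, x), GELU as x times the normal CDF, SiLU as x times the sigmoid); the sample points are arbitrary choices for illustration.

```python
import math

def relu(t):
    return max(0.0, t)

def gelu(t):
    """Exact GELU: t * Phi(t), where Phi is the standard normal CDF."""
    return 0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0)))

def silu(t):
    """SiLU (Swish): t * sigmoid(t)."""
    return t / (1.0 + math.exp(-t))

# Evaluate each activation on a few sample inputs.
samples = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
table = [(t, relu(t), gelu(t), silu(t)) for t in samples]

for t, r, g, s in table:
    print(f"x={t:+.1f}  relu={r:+.4f}  gelu={g:+.4f}  silu={s:+.4f}")
```

For negative inputs ReLU is exactly zero, while GELU and SiLU dip slightly below zero before returning toward it; for large positive inputs all three approach the identity.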
If a ReLU neuron's pre-activation becomes negative for every training input, its output is zero and its gradient is zero, so its weights stop updating and the neuron is effectively dead. Large fractions of dead neurons can hurt capacity. GELU and SiLU largely avoid this because their gradient is nonzero for almost all inputs, including negative ones, although it shrinks toward zero for very negative inputs.
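The difference in gradients can be checked directly from the closed-form derivatives. This is a small sketch, with the function names (`relu_grad`, `gelu_grad`, `silu_grad`) chosen here for illustration and the ReLU derivative at exactly zero taken to be 0 by convention.

```python
import math

def relu_grad(t):
    """Derivative of max(0, t); taken as 0 at t = 0 by convention."""
    return 1.0 if t > 0 else 0.0

def gelu_grad(t):
    """d/dt [t * Phi(t)] = Phi(t) + t * phi(t)."""
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return cdf + t * pdf

def silu_grad(t):
    """d/dt [t * sigmoid(t)] = s + t * s * (1 - s), with s = sigmoid(t)."""
    s = 1.0 / (1.0 + math.exp(-t))
    return s + t * s * (1.0 - s)

# Gradients at a few negative pre-activations.
for t in [-0.5, -1.0, -2.0, -4.0]:
    print(f"x={t:+.1f}  relu'={relu_grad(t):+.4f}  "
          f"gelu'={gelu_grad(t):+.4f}  silu'={silu_grad(t):+.4f}")
```

ReLU's gradient is exactly zero across the whole negative range, so no learning signal passes through. GELU and SiLU still pass a small signal there, though it becomes very small for strongly negative inputs.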