Is the tanh approximation just for speed?
Mostly yes — older PyTorch and TensorFlow shipped a tanh approximation because computing the standard-normal CDF
Also known as: Gaussian error linear unit
GELU is x · Φ(x), where Φ is the standard-normal CDF. A smooth, differentiable-everywhere relative of ReLU that BERT popularized and that many transformers have used since.
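The definition x · Φ(x) can be evaluated directly with the error function, since Φ(x) = ½ · (1 + erf(x / √2)). The following is a minimal sketch using only the Python standard library; the function names `std_normal_cdf` and `gelu` are illustrative, not taken from any framework.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard-normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Exact GELU: the input scaled by the standard-normal CDF at that input."""
    return x * std_normal_cdf(x)


# Compare against ReLU at a few points: the two agree far from zero
# and differ most in a band around the origin.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={max(0.0, x):+.4f}")
```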
GELU, the Gaussian error linear unit, is the smooth cousin of ReLU that BERT popularized in 2018 (the activation itself was proposed by Hendrycks and Gimpel in 2016) and that many transformer architectures have used since. It looks almost identical to ReLU for
GELU(x) = x · Φ(x), where Φ is the standard-normal CDF.
Reading the formula: each input
In practice, many frameworks ship a tanh-based approximation that avoids the
The approximation matches the exact form within about
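To see how close the two forms are, the sketch below compares the exact erf-based GELU with the commonly cited tanh form, 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³))). That specific formula is not visible in the text above, so treat it as an assumption about which approximation is meant; the grid range and step size are arbitrary choices for illustration.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-based approximation (assumed form, see the lead-in)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Scan a dense grid on [-6, 6] and record the largest absolute gap.
xs = [i / 1000.0 for i in range(-6000, 6001)]
max_abs_diff = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh| on [-6, 6]: {max_abs_diff:.2e}")
```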
ReLU has a discontinuous derivative at the origin. GELU has continuous derivatives of all orders. For first-order optimizers (Adam, SGD with momentum) the kink itself isn’t directly harmful, but two related properties of GELU help in practice:
Inside the feedforward network of each transformer block, exactly one GELU sits between the two linear projections:
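As a rough illustration of that layout, here is a dependency-free sketch of a feedforward block: a linear expansion, one elementwise GELU, and a linear projection back. The names `linear`, `feedforward`, `w_in`, and `w_out`, and the toy dimensions, are hypothetical; real implementations use tensor libraries and learned weights.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute W x + b, with the weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def feedforward(x, w_in, b_in, w_out, b_out):
    hidden = linear(x, w_in, b_in)           # expand: d_model -> d_ff
    activated = [gelu(h) for h in hidden]    # the one elementwise nonlinearity
    return linear(activated, w_out, b_out)   # project: d_ff -> d_model


# Toy sizes for illustration only: d_model = 2, d_ff = 4.
w_in = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_in = [0.0, 0.0, 0.0, 0.0]
w_out = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_out = [0.0, 0.0]

y = feedforward([1.0, -2.0], w_in, b_in, w_out, b_out)
print(y)
```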
That one GELU per block is the only elementwise nonlinearity the network applies. Attention contains a softmax, which is a separate vector-wise nonlinearity. Together those are the two main sources of nonlinearity in the transformer; the rest is matrix multiplications, additions, and normalizations (layer normalization is itself nonlinear, since it divides by a per-token standard deviation).
Llama, Mistral, and PaLM use SiLU (Swish) instead, inside a gated SwiGLU feedforward layer. As an activation, SiLU is practically interchangeable with GELU, and the choice between them mostly depends on which framework or paper a team copied from.
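The similarity in shape is easy to check numerically. The sketch below compares GELU with SiLU, defined as x · sigmoid(x), on an arbitrary grid; it only shows how close the two curves are pointwise and says nothing about training outcomes.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x: float) -> float:
    """SiLU (Swish): the input scaled by the logistic sigmoid."""
    return x / (1.0 + math.exp(-x))


# Largest pointwise gap between the two curves on [-8, 8].
xs = [i / 100.0 for i in range(-800, 801)]
max_gap = max(abs(gelu(x) - silu(x)) for x in xs)
print(f"max |gelu - silu| on [-8, 8]: {max_gap:.3f}")

for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  silu={silu(x):+.4f}")
```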
ReLU produces only non-negative activations. The mean activation across a layer is therefore positive, which biases the downstream linear layer’s pre-activations away from zero, a small but measurable issue for optimization. GELU’s negative region pulls the layer mean toward zero, similar to what batch normalization or layer normalization does explicitly. It’s a small implicit centering effect that compounds across layers and over many training steps.
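The centering effect can be estimated with a quick simulation. The sketch below assumes the pre-activations are roughly standard normal, which is an illustrative assumption rather than a claim about any particular model, and compares the mean output of ReLU and GELU over the same samples.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x: float) -> float:
    return max(0.0, x)


# Assumption for illustration: pre-activations drawn from N(0, 1).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

relu_mean = sum(relu(s) for s in samples) / len(samples)
gelu_mean = sum(gelu(s) for s in samples) / len(samples)
print(f"mean ReLU output: {relu_mean:.4f}")
print(f"mean GELU output: {gelu_mean:.4f}")
```

In this setup both means stay positive; GELU's is simply lower, which is consistent with the "small" centering effect described above.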
Pick GELU when you’re building a transformer and you don’t have a specific reason to prefer SiLU. Pick the exact form for new code and tolerate either when loading old checkpoints.
Empirically, no. SiLU
ReLU has a kink at zero — its derivative is discontinuous there, and the second derivative is a Dirac delta. For first-order methods like Adam this doesn't matter directly, but the smooth GELU empirically produces marginally better optimization curves and avoids the small fraction of dead-ReLU units. Second-order methods and certain calibration tricks also prefer continuously differentiable activations.
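The kink can be made concrete with one-sided finite differences at the origin. In this minimal sketch (the helper name `one_sided_slopes` and the step size `h` are arbitrary choices), ReLU's left and right slopes disagree, while GELU's agree to within the step size.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x: float) -> float:
    return max(0.0, x)


def one_sided_slopes(f, h: float = 1e-6):
    """Finite-difference slopes just left and just right of zero."""
    left = (f(0.0) - f(-h)) / h
    right = (f(h) - f(0.0)) / h
    return left, right


relu_left, relu_right = one_sided_slopes(relu)
gelu_left, gelu_right = one_sided_slopes(gelu)
print(f"ReLU slopes at 0: left={relu_left:.6f}, right={relu_right:.6f}")
print(f"GELU slopes at 0: left={gelu_left:.6f}, right={gelu_right:.6f}")
```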