GELU

Also known as: gaussian error linear unit

TL;DR

GELU is x · Φ(x), where Φ is the standard-normal CDF. A smooth, differentiable-everywhere relative of ReLU that BERT popularized and that many transformers have used since.

GELU, the Gaussian error linear unit, is the smooth cousin of ReLU that Hendrycks and Gimpel proposed in 2016 and that BERT popularized in 2018; many transformer architectures have used it since. It looks almost identical to ReLU for large |x|, but it is smooth across the origin, has a small region of negative output for negative x (bottoming out at about −0.17 near x ≈ −0.75), and crucially has no flat region where the gradient is exactly zero. It remains a de facto default activation in modern language modeling.

The formula

GELU(x) = x · Φ(x)

where Φ is the standard-normal cumulative distribution function:

Φ(x) = ½ · [1 + erf(x / √2)]

Reading the formula: each input x is multiplied by the probability that a standard-normal random variable is less than x. For large positive x, Φ(x) → 1 and the unit passes the input through. For large negative x, Φ(x) → 0 and the unit zeros out. The transition is smooth; that is the entire design.
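As a minimal sketch of the exact form, assuming only Python's standard library: the function names `std_normal_cdf` and `gelu` are illustrative, not taken from any framework.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard-normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Exact GELU: the input scaled by the probability Phi(x)."""
    return x * std_normal_cdf(x)


if __name__ == "__main__":
    # Large positive inputs pass through; large negative inputs go to ~0.
    for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
        print(f"x={x:+.1f}  Phi(x)={std_normal_cdf(x):.6f}  GELU(x)={gelu(x):+.6f}")
```

Running the script prints Φ(x) and GELU(x) side by side, which makes the "gate between 0 and 1" reading of the formula easy to check by eye.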

In practice, many frameworks ship a tanh-based approximation that avoids the erf call:

GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))

The approximation matches the exact form to within about 10⁻³ in absolute terms over the range that matters. Checkpoints trained with one form will load and run with the other; the numerical drift is usually on the order of fp16 rounding error, although outputs will not match bit for bit.
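One way to check how close the tanh approximation is to the exact form is to scan a grid and record the largest absolute gap. The sketch below assumes the standard 0.044715 cubic coefficient; the grid range and the helper name `max_abs_gap` are illustrative choices.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def max_abs_gap(lo: float = -6.0, hi: float = 6.0, steps: int = 12001) -> float:
    """Largest |exact - approx| over an evenly spaced grid on [lo, hi]."""
    gap = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        gap = max(gap, abs(gelu_exact(x) - gelu_tanh(x)))
    return gap


if __name__ == "__main__":
    print(f"max |exact - tanh| on [-6, 6]: {max_abs_gap():.2e}")
```

The same loop can be pointed at a framework's own GELU kernels to see how far two implementations disagree before swapping one for the other on an existing checkpoint.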

Smoothness vs ReLU’s kink

ReLU has a discontinuous derivative at the origin. GELU has continuous derivatives of all orders. For first-order optimizers (Adam, SGD with momentum) the kink itself isn’t directly harmful, but two related properties of GELU help in practice:

  • There is no flat zero region. Even for x < 0, GELU’s gradient is small but nonzero (it crosses zero only at a single point near x ≈ −0.75), so units are far less likely to end up permanently dead.
  • The small negative output region for moderately negative x lets the activation distribute its mass slightly below zero, which empirically helps optimization in the early layers of deep transformers.
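The gradient claim can be checked directly. Differentiating x · Φ(x) gives Φ(x) + x · φ(x), where φ is the standard-normal density. The sketch below compares that with ReLU's derivative; the function names are illustrative.

```python
import math


def std_normal_pdf(x: float) -> float:
    """Standard-normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x: float) -> float:
    """Standard-normal CDF Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    """d/dx [x * Phi(x)] = Phi(x) + x * phi(x)."""
    return std_normal_cdf(x) + x * std_normal_pdf(x)


def relu_grad(x: float) -> float:
    """ReLU's derivative: exactly zero on the whole negative half-line."""
    return 1.0 if x > 0.0 else 0.0


if __name__ == "__main__":
    for x in (-3.0, -1.0, -0.75, -0.1, 0.0, 1.0):
        print(f"x={x:+.2f}  GELU'={gelu_grad(x):+.5f}  ReLU'={relu_grad(x):.1f}")
```

On the negative side GELU's derivative is slightly negative for most inputs and changes sign once between x = −0.8 and x = −0.7, whereas ReLU's derivative is identically zero there.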

Where GELU lives in a transformer

Inside the feed-forward network (FFN) of each transformer block, exactly one GELU sits between the two linear projections:

FFN(x) = W₂ · GELU(W₁ · x + b₁) + b₂

That single per-block nonlinearity is the entire elementwise nonlinear lifting the network does. Attention contains a softmax, which is a separate vector-wise nonlinearity. Together those are the two main sources of nonlinearity in the transformer; everything else is matrix multiplications, additions, and normalizations.
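A feed-forward block of the form W₂ · GELU(W₁ · x + b₁) + b₂ can be sketched in a few lines of plain Python. The weights and sizes below are made-up toy values for illustration (real models use learned matrices and a hidden width several times the model width), and the helper names `linear` and `ffn` are hypothetical.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU applied to a single scalar."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(weights, bias, x):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def ffn(x, w1, b1, w2, b2):
    """Up-projection, elementwise GELU, then down-projection."""
    hidden = [gelu(h) for h in linear(w1, b1, x)]
    return linear(w2, b2, hidden)


if __name__ == "__main__":
    # Toy sizes (assumed for illustration): model width 2, hidden width 4.
    w1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8], [0.7, 0.6]]
    b1 = [0.0, 0.1, -0.1, 0.0]
    w2 = [[0.3, -0.5, 0.2, 0.1], [0.6, 0.4, -0.3, 0.2]]
    b2 = [0.0, 0.0]
    print(ffn([1.0, -1.0], w1, b1, w2, b2))
```

The only nonlinear step is the list comprehension that applies `gelu`; removing it collapses the two projections into a single linear map.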

Transformers using GELU
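  • BERT: adopted GELU early and is the main reason it spread
  • GPT-2 and GPT-3: the original GPT-2 code uses the tanh approximation, while some reimplementations use the exact form (GPT-4’s architecture has not been published)
  • T5: the original release uses ReLU in its FFN; T5 v1.1 switched to a gated GELU variant (GEGLU)
  • ViT and most vision transformers: GELU in the MLP (feed-forward) block of each encoder layer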

Llama, Mistral, and PaLM use SiLU instead, as the gate in a SwiGLU feed-forward layer. As an activation it is practically interchangeable with GELU, and the choice between them mostly depends on which framework or paper a team copied from.

ReLU produces strictly non-negative activations. The mean activation across a layer is therefore positive, which biases the downstream linear layer’s pre-activations away from zero, a small but measurable issue for optimization. GELU’s negative region pulls the layer mean toward zero, similar in spirit to what batch normalization or layer normalization do explicitly. It is a small implicit centering effect that can add up over a long training run.
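The size of this centering effect can be estimated with a quick Monte Carlo experiment. The sketch assumes pre-activations drawn from a standard normal distribution, which is an idealization; the sample count, seed, and the helper name `mean_activation` are arbitrary illustrative choices.

```python
import math
import random


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mean_activation(fn, n: int = 100_000, seed: int = 0) -> float:
    """Average of fn(z) over n draws z ~ N(0, 1), using a fixed seed."""
    rng = random.Random(seed)
    return sum(fn(rng.gauss(0.0, 1.0)) for _ in range(n)) / n


if __name__ == "__main__":
    # Analytic values under the N(0, 1) assumption:
    #   E[ReLU(z)] = 1 / sqrt(2 * pi)  ~ 0.399
    #   E[GELU(z)] = 1 / (2 * sqrt(pi)) ~ 0.282
    print(f"mean ReLU(z): {mean_activation(relu):.3f}")
    print(f"mean GELU(z): {mean_activation(gelu):.3f}")
```

Under this assumption GELU's mean output is still positive, but noticeably closer to zero than ReLU's, which is the partial centering described above.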

Pick GELU when you’re building a transformer and you don’t have a specific reason to prefer SiLU. Pick the exact form for new code; when loading old checkpoints, match the variant they were trained with where you can, and tolerate either otherwise.

Go further

Is the tanh approximation just for speed?

Mostly yes. Early reference implementations, such as the original BERT and GPT-2 TensorFlow code, used a tanh approximation because computing the standard-normal CDF via erf was slower on early GPUs. Modern kernels usually compute the exact form directly. The approximation differs from exact GELU by less than about 10⁻³ across the relevant range; the choice has no measurable effect on training, but it means model outputs can disagree numerically by a tiny amount if you cross frameworks.

GELU or SiLU — does it matter which?

Empirically, no. SiLU and GELU are very similar curves; the two gating functions Φ(x) and σ(x) are both monotonic S-curves that rise from 0 to 1 and pass through 1/2 at x = 0, with Φ somewhat steeper. BERT and GPT-2 use GELU, Llama and Mistral use SiLU, and reported quality differences fall inside noise. SiLU is marginally cheaper to compute.
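To see how close the two activations are, the sketch below evaluates GELU(x) = x · Φ(x) and SiLU(x) = x · σ(x) on a grid and reports the largest pointwise gap. The grid range and the helper name `max_gap` are illustrative assumptions.

```python
import math


def std_normal_cdf(x: float) -> float:
    """GELU's gate: the standard-normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x: float) -> float:
    """SiLU's gate: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    return x * std_normal_cdf(x)


def silu(x: float) -> float:
    return x * sigmoid(x)


def max_gap(lo: float = -6.0, hi: float = 6.0, steps: int = 1201) -> float:
    """Largest |GELU(x) - SiLU(x)| over an evenly spaced grid on [lo, hi]."""
    xs = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return max(abs(gelu(x) - silu(x)) for x in xs)


if __name__ == "__main__":
    print(f"gates at 0: Phi={std_normal_cdf(0.0)}, sigmoid={sigmoid(0.0)}")
    print(f"max |GELU - SiLU| on [-6, 6]: {max_gap():.3f}")
```

Both gates equal 0.5 at x = 0, and the curves differ most at moderate |x|, where the steeper Φ gate saturates sooner than the sigmoid.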

Why is smoothness worth caring about?

ReLU has a kink at zero — its derivative is discontinuous there, and the second derivative is a Dirac delta. For first-order methods like Adam this doesn't matter directly, but the smooth GELU empirically produces marginally better optimization curves and avoids the small fraction of dead-ReLU units. Second-order methods and certain calibration tricks also prefer continuously differentiable activations.
