ReLU

Also known as: rectified linear unit, rectifier

TL;DR

ReLU is max(0, x) — pass positive inputs through, clamp negatives to zero. The cheap, sharp nonlinearity that made training deep networks finally work, and the dominant hidden-layer activation from 2012 until transformers switched to GELU.

ReLU, the rectified linear unit, is the simplest useful activation function anyone has ever found. It passes positive inputs through unchanged and clamps everything else to zero. That triviality is the entire point: it’s cheap to compute, its gradient doesn’t vanish on the active side, and it made training networks with more than three layers finally work.

The formula

ReLU(x) = max(0, x)

That’s the whole thing. No exponentials, no CDFs, no learned parameters. Branch on the sign of x; pass or zero. The derivative is the indicator function 𝟙[x > 0]: a step at the origin, undefined exactly at zero (in practice you pick a subgradient, almost always 0).
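As a minimal sketch in plain Python, the function and its derivative might look like this. The names `relu` and `relu_grad` are illustrative, and returning 0 at exactly x = 0 is the subgradient convention mentioned above, not a mathematical necessity.

```python
def relu(x: float) -> float:
    """Pass positive inputs through, clamp everything else to zero."""
    return x if x > 0.0 else 0.0


def relu_grad(x: float) -> float:
    """Indicator 1[x > 0]; picks the subgradient 0 at exactly x == 0."""
    return 1.0 if x > 0.0 else 0.0


values = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(v) for v in values]      # negatives and zero map to 0.0
grads = [relu_grad(v) for v in values]   # 1.0 only where the unit is active
```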

Why it dominated 2012-2020

Before ReLU, the default activations were sigmoid and tanh. Both saturate: their gradients flatten to zero for large |x|, and the chain rule through ten layers compounds that into a gradient signal indistinguishable from noise. This is the vanishing-gradient problem, and it kept neural networks shallow.

ReLU has gradient exactly 1 wherever the unit is active. Chain-rule through ten active ReLUs and the activations contribute a combined factor of exactly 1; only the weights scale the gradient. That single property, a flat, non-decaying gradient through the active path, is a large part of why AlexNet (2012) trained, why ResNets (2015) scaled to a hundred layers (with help from skip connections), and why nearly every published convnet between then and the rise of transformers used ReLU somewhere in its stack.
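The compounding effect can be checked numerically. The sketch below multiplies only the activation derivatives along a ten-layer path and deliberately ignores the weight matrices, which is a simplifying assumption; the pre-activation value 2.0 and the helper name `activation_gradient_product` are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)  # never larger than 0.25


def relu_grad(x: float) -> float:
    return 1.0 if x > 0.0 else 0.0


def activation_gradient_product(grad_fn, pre_activations):
    """Multiply the activation derivatives along one path through the network."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product


# Ten layers, each with the same (hypothetical) positive pre-activation.
pre_activations = [2.0] * 10
sigmoid_factor = activation_gradient_product(sigmoid_grad, pre_activations)
relu_factor = activation_gradient_product(relu_grad, pre_activations)

print(f"sigmoid path factor: {sigmoid_factor:.3e}")
print(f"relu path factor:    {relu_factor:.1f}")
```

Because the sigmoid derivative is at most 0.25, ten layers shrink the signal by at least a factor of about a million, while the active ReLU path leaves it untouched.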

The dying-ReLU problem

The other side of the hard zero: any neuron whose pre-activation drifts consistently negative produces zero output and zero gradient, forever. There is no signal to push it back into the active region. The unit is dead, and that dead capacity stays dead for the rest of training.
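A toy illustration of why a dead unit stays dead, assuming a hypothetical one-unit model y = relu(w * x + b) trained with plain gradient descent on a squared-error loss. The starting values are chosen so that the pre-activation is negative for every input.

```python
def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def relu_grad(x: float) -> float:
    return 1.0 if x > 0.0 else 0.0


# Hypothetical setup: inputs in [0, 1] and a large negative bias,
# so z = w * x + b is below zero for every input.
w, b = 0.5, -5.0
inputs = [0.1 * i for i in range(11)]
target = 1.0
lr = 0.1

for _ in range(100):
    grad_w = 0.0
    grad_b = 0.0
    for x in inputs:
        z = w * x + b
        upstream = 2.0 * (relu(z) - target)  # dL/dy for squared error
        local = relu_grad(z)                 # dy/dz, zero whenever z <= 0
        grad_w += upstream * local * x
        grad_b += upstream * local
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # parameters are unchanged after 100 steps
```

The loss is nonzero on every step, but the local derivative zeroes out the update, so nothing ever moves the unit back into the active region.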

In practice, large fractions of a layer can die — particularly with high learning rates or poor initialization. The network keeps training fine but with reduced effective width. You typically discover the problem by inspecting activation statistics post-training and finding that many units are zero on every input.
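One way to run that inspection, sketched under the assumption that a layer’s post-ReLU activations have been collected into a list of rows (one row per input, one column per unit). The function name `dead_unit_fraction` and the sample numbers are hypothetical.

```python
def dead_unit_fraction(activations):
    """Fraction of units whose output is exactly zero on every recorded input.

    activations: list of rows, one row per input, one column per unit.
    """
    num_units = len(activations[0])
    dead = sum(
        1
        for j in range(num_units)
        if all(row[j] == 0.0 for row in activations)
    )
    return dead / num_units


# Three inputs, four units: units 0 and 2 never fire.
recorded = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.7, 0.0, 0.0],
]
fraction = dead_unit_fraction(recorded)
```

A unit that is zero on a small sample is not necessarily dead, so in practice the check is only meaningful over a large and representative batch of inputs.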

The mitigations are all variations on “don’t actually clamp to zero”:

Cousins of ReLU that fix the dying problem
  • Leaky ReLU: max(αx, x) with a small fixed α (0.01 is a common choice). A small slope for negative inputs keeps the gradient alive.
  • PReLU: the same shape, but α is a learned per-channel parameter.
  • ELU: x for x > 0, α(exp(x) − 1) for x ≤ 0. Smooth, with mean output closer to zero.
  • GELU: x · Φ(x), where Φ is the standard normal CDF. Smooth everywhere, slight negative region. The transformer default.
  • SiLU / Swish: x · σ(x), where σ is the sigmoid. Very close to GELU. Used in Llama and Mistral.
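For reference, here is a sketch of these variants in plain Python, using their standard textbook definitions. The default α values are common choices assumed here, not values taken from this article, and PReLU is omitted because its α is learned during training, not fixed.

```python
import math


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Small fixed slope alpha on the negative side."""
    return x if x > 0.0 else alpha * x


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential curve on the negative side, saturating at -alpha."""
    return x if x > 0.0 else alpha * (math.exp(x) - 1.0)


def gelu(x: float) -> float:
    """x times the standard normal CDF, written with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x: float) -> float:
    """x times the logistic sigmoid (also called Swish)."""
    return x / (1.0 + math.exp(-x))


# Unlike ReLU, every variant returns a nonzero value for a negative input.
negative_outputs = {
    "leaky_relu": leaky_relu(-1.0),
    "elu": elu(-1.0),
    "gelu": gelu(-1.0),
    "silu": silu(-1.0),
}
```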

When to still use ReLU

ReLU is still the right default for many production systems: convolutional networks for vision, tabular feedforward networks, recommender embeddings, and anywhere inference latency matters more than the last 0.3% of validation loss. A single max is the cheapest nonlinearity that exists; GELU’s CDF or SiLU’s sigmoid are real cycles on a hot path.

In transformers specifically, GELU and SiLU now dominate: most major LLMs since BERT have used one of them. But “use GELU or SiLU” is a transformer-specific result, not a general one. Outside transformers, ReLU still works fine and remains popular.

Roughly half of all inputs to a ReLU are negative (assuming roughly zero-centered pre-activations), and they all map to exactly zero. The output of a ReLU layer is therefore approximately 50% zeros. This natural sparsity acts as a mild regularizer and matches some hand-engineered priors from earlier feature-learning work. Sigmoid and tanh produce dense outputs (no entries are exactly zero), which is part of why their representations were harder to interpret and slower to train.
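The sparsity claim is easy to check empirically. This sketch assumes standard-normal pre-activations as a stand-in for "roughly zero-centered"; the sample size and seed are arbitrary.

```python
import math
import random

random.seed(0)

# Assumed stand-in for zero-centered pre-activations: standard normal samples.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

relu_outputs = [max(0.0, z) for z in pre_activations]
sigmoid_outputs = [1.0 / (1.0 + math.exp(-z)) for z in pre_activations]

relu_zero_fraction = sum(1 for y in relu_outputs if y == 0.0) / len(relu_outputs)
sigmoid_zero_fraction = sum(1 for y in sigmoid_outputs if y == 0.0) / len(sigmoid_outputs)

print(f"ReLU zeros:    {relu_zero_fraction:.1%}")
print(f"sigmoid zeros: {sigmoid_zero_fraction:.1%}")
```

The ReLU fraction should land close to one half, while the sigmoid outputs contain no exact zeros at all.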

If you need the cheapest useful activation, reach for ReLU. If you’re building a transformer, reach for GELU or SiLU. Either way, the lineage runs straight back through ReLU.

Go further

What is the dying-ReLU problem and how is it mitigated?

If a ReLU neuron's pre-activation becomes consistently negative during training, its output is zero, its gradient is zero, and it never updates again: it's permanently dead. Large fractions of dead neurons cut effective capacity. Mitigations include Leaky ReLU (a small slope for x < 0), PReLU (the slope is learned), and switching to GELU or SiLU, whose gradient is nonzero almost everywhere.

Why did transformers switch from ReLU to GELU?

The original Vaswani et al. transformer used ReLU. BERT switched to GELU, and the field followed. GELU is smooth, has a nonzero gradient for negative inputs, and consistently shaves a small amount off training loss at a modest extra compute cost. The gains are small; the field converged on GELU for optimization stability, not for dramatic improvements.

Is ReLU completely obsolete now?

No — it's still the default in convolutional networks, recommender models, and many production systems where the simplicity wins. GELU/SiLU dominate transformers specifically. ReLU is also still the cheapest activation by a wide margin: a single comparison and a conditional move, with no exponential or CDF approximation involved.
