Activation Function

Also known as: nonlinearity, transfer function

TL;DR

An activation function is the elementwise nonlinearity sandwiched between the linear layers of a neural network. Without it, the whole network collapses to a single linear map.

An activation function is an elementwise nonlinearity applied between the linear layers of a neural network. It takes a vector and returns a vector of the same shape, with some scalar function applied to each entry independently. Without it, stacking linear layers buys you nothing — the whole network mathematically collapses to a single linear map. The activation is the entire reason depth produces expressive function classes.

Why no activation means no expressivity

Two stacked linear layers without a nonlinearity in between:

$$y = W_2 (W_1 x) = (W_2 W_1)\,x$$

is just one linear layer with weight $W = W_2 W_1$. Stack ten of them and you still have a linear map: straight lines and hyperplanes only. Inserting a single nonlinearity per layer is what turns the network into a universal approximator.
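The collapse can be checked numerically. The sketch below is only an illustration: the weight matrices `W1` and `W2`, the input `x`, and the helper names are made up for the example, and biases are left out to keep it short.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n), both lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Elementwise ReLU on a vector."""
    return [max(0.0, vi) for vi in v]


# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0]]  # first layer, 3 x 2
W2 = [[2.0, 0.0, -1.0], [1.0, 1.0, 1.0]]     # second layer, 2 x 3
x = [1.0, 2.0]

stacked = matvec(W2, matvec(W1, x))          # two linear layers in sequence
collapsed = matvec(matmul(W2, W1), x)        # one layer with weight W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # same layers, ReLU in between

print(stacked, collapsed, with_relu)
```

The first two results agree, because the composition is itself a single matrix. The third differs, since the ReLU zeroes the negative hidden unit before the second layer sees it.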

The common choices

Activations you'll meet in practice

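ReLU: $\mathrm{ReLU}(x) = \max(0, x)$. Simple, fast, the default from 2012 to ~2018. Hard threshold at zero.
GELU: $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard-normal CDF. Smooth approximation of ReLU. The transformer default.
SiLU / Swish: $\mathrm{SiLU}(x) = x\,\sigma(x)$. Very close to GELU; cheaper to compute. Used in Llama, Mistral.
Tanh: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Pre-2012 default. Outputs in $(-1, 1)$. Lives on in RNN gates.
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$. Outputs in $(0, 1)$. Survives in gates and binary classifier heads.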
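For reference, the five functions in the list can be written in a few lines of plain Python. This is a minimal scalar sketch using only the standard `math` module, not how a deep learning framework implements them; `apply_elementwise` is a hypothetical helper added to show the "same scalar function on every entry" behaviour.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    # Not hardened against overflow for very large negative x.
    return 1.0 / (1.0 + math.exp(-x))


def silu(x):
    # SiLU / Swish: the input scaled by its own sigmoid.
    return x * sigmoid(x)


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard-normal CDF via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def tanh(x):
    return math.tanh(x)


def apply_elementwise(f, v):
    """Apply scalar function f to each entry of vector v independently."""
    return [f(vi) for vi in v]


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
for name, f in [("relu", relu), ("gelu", gelu), ("silu", silu),
                ("tanh", tanh), ("sigmoid", sigmoid)]:
    print(name, [round(y, 4) for y in apply_elementwise(f, inputs)])
```

Running it shows the practical difference: ReLU outputs exactly zero for every negative input, while GELU and SiLU dip slightly below zero there.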

The shift from ReLU to GELU/SiLU was empirical: benchmarks on transformer training generally showed a small but consistent improvement at negligible extra cost.

Where activations sit in a transformer

A transformer block has two main sub-layers: attention and the feedforward network. The activation lives inside the feedforward, exactly one nonlinearity between the two linear projections:

$$\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1 x + b_1) + b_2$$

That single GELU does the entire elementwise nonlinear lifting per block. Attention contains a softmax, a vector-wise nonlinearity. These two (the per-block GELU and the per-attention softmax) are the main sources of nonlinearity in the whole transformer. Everything else is matrix multiplications, additions, and normalizations, although layer normalization is itself mildly nonlinear because it divides by a data-dependent standard deviation.
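As a rough sketch of where the activation sits, here is a feedforward sub-layer in plain Python: expand with one linear projection, apply GELU to every hidden unit, project back. The names `W1`, `b1`, `W2`, `b2`, the tiny dimensions, and the numeric values are assumptions for illustration; real models use much wider hidden layers and batched tensor operations.

```python
import math


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard-normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def feedforward(x, W1, b1, W2, b2):
    """Linear projection, one elementwise GELU, second linear projection."""
    hidden = [gelu(h) for h in linear(W1, b1, x)]  # the only nonlinearity here
    return linear(W2, b2, hidden)


# Hypothetical parameters: model width 2, hidden width 4.
W1 = [[0.5, -1.0], [1.0, 0.25], [-0.75, 0.5], [0.1, 0.9]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, -0.5, 0.25, 0.0], [0.0, 0.5, -1.0, 1.0]]
b2 = [0.1, -0.2]

y = feedforward([1.0, -2.0], W1, b1, W2, b2)
print(y)  # same width as the input vector
```

Swapping `gelu` for any other scalar function changes the activation without touching the rest of the layer, which is why the choice is so easy to experiment with.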

Every few years a paper proposes a new activation — Mish, ELU, GLU variants, learned activations, polynomial activations. They show modest improvements on a benchmark and then disappear. The reason: GELU and SiLU are good enough that any improvement is dominated by other choices (data, model size, optimizer). The marginal value of activation engineering is tiny once you’re at the GELU/SiLU plateau. The one exception is the GLU family — gated linear units like SwiGLU (used in Llama) split the FFN into two parallel projections and gate them, which has produced consistent improvements at scale and is now standard in newer architectures.
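The gated variant can be sketched the same way. The code below assumes the commonly described SwiGLU form, in which one projection passes through SiLU and multiplies a second, parallel projection elementwise before the output projection. The names `W_gate`, `W_up`, and `W_down` are illustrative, biases are omitted, and the sizes are toy values.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated feedforward: W_down (SiLU(W_gate x) * (W_up x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]  # activated branch
    up = matvec(W_up, x)                         # parallel linear branch
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise gating
    return matvec(W_down, hidden)


# Hypothetical parameters: model width 2, hidden width 3.
W_gate = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
W_up = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
W_down = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]

out = swiglu_ffn([1.0, 2.0], W_gate, W_up, W_down)
print(out)
```

Compared with the plain feedforward, this uses three weight matrices instead of two, so the hidden width is usually reduced to keep the parameter count comparable.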

Pick GELU or SiLU, put it inside the feedforward, and stop thinking about it.

Go further

Why does no activation mean no expressivity?

Two stacked linear layers $W_2 (W_1 x)$ are a single linear layer $(W_2 W_1)\,x$. Without nonlinearities, depth gives you nothing: the entire network reduces to one matrix multiplication, which can only represent linear functions of the input. The activation is the entire reason depth helps.

Feedforward network

Why did GELU replace ReLU in transformers?

GELU is smooth, differentiable everywhere, and has a small region of negative output just below zero, which empirically produces slightly better optimization in transformers than ReLU's hard cutoff. Most major transformers since BERT have used GELU or its close cousins SiLU/Swish. The differences are real but small; the field converged for stability reasons more than dramatic gains.

Transformer

What's the dying-ReLU problem?

If a ReLU neuron's input becomes consistently negative during training, its output is zero, its gradient is zero, and it stops updating: it is effectively dead. Large fractions of dead neurons can hurt capacity. GELU and SiLU largely avoid this because their gradient is nonzero for almost all inputs, including negative ones, although it becomes vanishingly small far below zero.
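The gradient difference is easy to see by evaluating the derivatives directly. The sketch below uses the closed-form derivatives of the exact GELU and of SiLU; the function names are illustrative and the sample points are arbitrary negative inputs.

```python
import math


def relu_grad(x):
    # Derivative of max(0, x); taken as 0 at x == 0 by convention here.
    return 1.0 if x > 0 else 0.0


def gelu_grad(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x)
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def silu_grad(x):
    # d/dx [x * s(x)] = s(x) * (1 + x * (1 - s(x))), with s the sigmoid
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 + x * (1.0 - s))


for x in (-3.0, -1.0, -0.1):
    print(x, relu_grad(x), gelu_grad(x), silu_grad(x))
```

For each negative input the ReLU gradient is exactly zero, so no learning signal reaches the weights feeding that unit. GELU and SiLU still pass a small gradient, which shrinks as the input gets more negative.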

Feedforward network
