Why did sigmoid lose to ReLU in hidden layers?
The gradient
Also known as: logistic function, logistic sigmoid
The sigmoid σ(x) = 1/(1 + e⁻ˣ) squashes any real number into the open interval (0, 1). It was the default neural-network nonlinearity for decades and still survives wherever you need a probability or a gate.
The sigmoid (or logistic) function takes any real number and squashes it into the open interval (0, 1).
Three properties make this shape useful. It is bounded: outputs always live in (0, 1).
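As a rough illustration of the squashing, here is a minimal sketch of the function in plain Python. The two-branch form is an assumption about how one might guard against overflow in `math.exp`; it is not taken from this page, and the name `sigmoid` is just a convenient label.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-x), arranged so exp() never overflows."""
    if x >= 0:
        # Here e^-x is at most 1, so the direct formula is safe.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the algebraically identical form e^x / (1 + e^x),
    # where e^x is at most 1.
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(x, sigmoid(x))
```

Mathematically the output never reaches 0 or 1, but in floating point the extreme inputs above round to exactly 0.0 and 1.0, which is worth remembering before taking a logarithm of the result.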
The derivative can be computed straight from the output, with no extra exponentials, which is why sigmoid was beloved in the era when every multiplication cost real wall-clock time.
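The standard identity behind this is σ′(x) = σ(x)(1 − σ(x)). A small sketch, with hypothetical helper names, that reuses the forward output and checks the result against a finite difference:

```python
import math


def sigmoid(x: float) -> float:
    # Direct formula; adequate for the moderate inputs used in this sketch.
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad_from_output(s: float) -> float:
    """Derivative of sigmoid written in terms of its output s = sigmoid(x)."""
    return s * (1.0 - s)


x = 0.7
s = sigmoid(x)                           # forward pass: the only exponential
analytic = sigmoid_grad_from_output(s)   # backward pass: one subtract, one multiply

# Central finite difference as an independent check of the identity.
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(analytic, numeric)
```

The backward pass needs only the cached output `s`, which is the saving the paragraph above describes.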
Read
Paired with binary cross-entropy, the gradient with respect to the logit reduces to the clean σ(x) − y, where y is the 0/1 target: the prediction minus the label.
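The simplification for sigmoid plus binary cross-entropy can be checked numerically. The sketch below assumes a single scalar logit `z` and a 0/1 target `y` (both names are illustrative) and compares the closed form σ(z) − y with a finite-difference estimate.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(z: float, y: float) -> float:
    """Binary cross-entropy of the prediction sigmoid(z) against target y."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_grad_wrt_logit(z: float, y: float) -> float:
    """Closed-form gradient: prediction minus target."""
    return sigmoid(z) - y


h = 1e-5
for z, y in ((1.3, 1.0), (1.3, 0.0), (-0.4, 1.0)):
    numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)
    print(z, y, bce_grad_wrt_logit(z, y), numeric)
```

The σ(z)(1 − σ(z)) factor from the activation cancels against the derivative of the log terms, so the gradient stays informative even when the sigmoid itself is saturated.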
Sigmoid is also why training deep networks was hard for decades. The derivative σ(x)(1 − σ(x)) peaks at 0.25, reached at x = 0, and shrinks toward zero as |x| grows.
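In a deep network, the chain rule multiplies these small gradients together at every layer. Ten stacked sigmoids give you at best a factor of 0.25¹⁰ ≈ 9.5 × 10⁻⁷ (0.25 being the largest slope a sigmoid can have), before the weights are even taken into account.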
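To put numbers on the compounding, the sketch below multiplies the best-case sigmoid slope across layers. It deliberately ignores the weight matrices, which also enter the chain rule, so it bounds only the activation contribution; the function name is hypothetical.

```python
# Largest possible slope of a sigmoid: sigma'(0) = 0.5 * (1 - 0.5).
MAX_SIGMOID_SLOPE = 0.25


def best_case_gradient_scale(depth: int) -> float:
    """Upper bound on the product of `depth` sigmoid derivatives."""
    return MAX_SIGMOID_SLOPE ** depth


for depth in (1, 2, 5, 10, 20):
    print(depth, best_case_gradient_scale(depth))
```

At depth 10 the bound is already below one millionth, and real pre-activations rarely sit exactly at zero, so the practical shrinkage is worse than this best case.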
The pattern is consistent: sigmoid appears wherever the network needs one of “is this thing on?”, “by how much?”, or “what’s the probability?”, with no more than one or two in series so the vanishing-gradient pathology never gets a chance to compound.
The identity
If you find yourself reaching for sigmoid in a hidden layer of a feedforward network, you almost certainly want ReLU, GELU, or SiLU instead. If you’re producing a probability or gating a multiplier, sigmoid is still the right tool.
Anywhere you need a gate (a soft, differentiable on/off): LSTM input/forget/output gates, attention masks in some architectures, learned mixture weights. Also the standard output nonlinearity for a binary-classification head, where the value is read as the probability of the positive class.
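A gate in this sense is a sigmoid output used as a multiplier. The sketch below is a deliberately simplified scalar blend, not an LSTM cell; `gated_update` and its arguments are hypothetical names chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    # Two-branch form so large |x| cannot overflow math.exp.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_update(old: float, candidate: float, gate_logit: float) -> float:
    """Blend two values with a soft switch g in (0, 1)."""
    g = sigmoid(gate_logit)          # "by how much?" as a number between 0 and 1
    return g * candidate + (1.0 - g) * old


print(gated_update(1.0, 5.0, -10.0))  # gate nearly closed: stays near old
print(gated_update(1.0, 5.0, 0.0))    # gate half open: midpoint
print(gated_update(1.0, 5.0, 10.0))   # gate nearly open: moves to candidate
```

Because the gate is smooth rather than a hard 0/1 switch, gradients can flow into `gate_logit`, which is what lets a network learn when to open it.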
Logistic regression is literally a single linear layer followed by a sigmoid, trained with binary cross-entropy. A neural network with a sigmoid output head and one hidden layer is logistic regression on learned features. The continuity from the simplest classifier to deep nets runs straight through this nonlinearity.
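A minimal sketch of that first claim: one weight, one bias, a sigmoid, and gradient descent on binary cross-entropy. The toy data, learning rate, and step count are made up for illustration and are not tuned.

```python
import math


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Made-up one-feature dataset: label 1 when the feature is positive.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

w, b = 0.0, 0.0        # the "single linear layer"
learning_rate = 0.5    # assumed value

for _ in range(200):
    grad_w = 0.0
    grad_b = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(w * x + b) - y   # BCE gradient w.r.t. the logit
        grad_w += err * x
        grad_b += err
    w -= learning_rate * grad_w / len(xs)
    b -= learning_rate * grad_b / len(xs)


def predict(x: float) -> float:
    """Estimated probability of the positive class."""
    return sigmoid(w * x + b)


print(w, b, predict(-2.0), predict(2.0))
```

Replacing the raw feature `x` with the output of a learned hidden layer, while keeping the same sigmoid head and loss, gives the "logistic regression on learned features" view described above.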