Sigmoid

Also known as: logistic function, logistic sigmoid

TL;DR

The sigmoid σ(x) = 1/(1 + e⁻ˣ) squashes any real number into the open interval (0, 1). It was the default neural-network nonlinearity for decades and still survives wherever you need a probability or a gate.

The sigmoid (or logistic) function takes any real number and squashes it into the open interval $(0, 1)$. It was the default activation function in neural networks for roughly thirty years, and although ReLU and GELU have replaced it inside modern hidden layers, sigmoid survives anywhere a model needs to emit a probability or operate a soft gate.

The formula

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Three properties make this shape useful. It is bounded: outputs always live in $(0, 1)$. It is smooth and monotonic: every increase in $x$ produces a strict increase in $\sigma(x)$. And it has a tidy derivative:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$

The derivative can be computed straight from the output, with no extra exponentials, which is why sigmoid was beloved in the era when every multiplication cost real wall-clock time.
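As a minimal sketch of the two formulas above, the following pure-Python snippet evaluates $\sigma(x)$ and then gets the derivative from the output alone. The function names are illustrative, and the branch on the sign of `x` is an added numerical-stability choice rather than something the text prescribes.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(-x) could overflow, so use the equivalent form
    # e^x / (1 + e^x), where e^x is at most 1.
    ex = math.exp(x)
    return ex / (1.0 + ex)

def sigmoid_grad_from_output(s: float) -> float:
    """Derivative sigma'(x) computed from the output s = sigma(x) alone."""
    return s * (1.0 - s)

s = sigmoid(0.0)
print(s, sigmoid_grad_from_output(s))  # 0.5 0.25
```

Because the backward pass only needs `s`, a framework can cache the forward output and skip recomputing any exponential.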

Why it’s natural for binary classification

Read $\sigma(x)$ as the probability of the positive class and the formula becomes the inverse of the canonical (logit) link function of logistic regression. The input $x$ is the model’s confidence in log-odds units (a logit), and $\sigma$ converts that back to a probability:

$$P(y = 1) = \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad x = \log\frac{P(y = 1)}{1 - P(y = 1)}$$

Paired with binary cross-entropy, the gradient with respect to the logit reduces to the clean form $\sigma(x) - y$, where $y \in \{0, 1\}$ is the label. This is the same trick that makes softmax plus categorical cross-entropy so well-behaved. Binary-classification heads in modern models almost all end in a sigmoid for exactly this reason.
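For readers who want to see where the clean gradient comes from, here is a short derivation sketch. It assumes the logit is written $x$, the label is $y \in \{0, 1\}$, and $p = \sigma(x)$; these symbols are chosen here for illustration.

```latex
% Binary cross-entropy for one example, with p = \sigma(x):
L = -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr]

% Derivative of the loss with respect to the probability:
\frac{\partial L}{\partial p} = -\frac{y}{p} + \frac{1 - y}{1 - p}
                              = \frac{p - y}{p\,(1 - p)}

% Derivative of the sigmoid with respect to the logit:
\frac{\partial p}{\partial x} = p\,(1 - p)

% Chain rule: the p(1 - p) factors cancel.
\frac{\partial L}{\partial x} = \frac{p - y}{p\,(1 - p)} \cdot p\,(1 - p)
                              = \sigma(x) - y
```

The cancellation in the last step is the point: the saturating factor $p(1 - p)$ that shrinks sigmoid gradients elsewhere disappears when the loss is cross-entropy.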

The vanishing-gradient problem

Sigmoid is also why training deep networks was hard for decades. The derivative $\sigma'(x)$ has a maximum value of $1/4$ at $x = 0$, and decays toward zero as $|x|$ grows. Once the pre-activation drifts into the saturated region (roughly $|x| > 5$, where the derivative is already below 0.01), the gradient is essentially zero and the neuron stops learning.

In a deep network, the chain rule multiplies these small gradients together at every layer. Ten stacked sigmoids give you at best $(1/4)^{10} \approx 10^{-6}$ from the activation derivatives alone; in practice it’s much worse because the units saturate. The early layers receive a gradient signal indistinguishable from noise and never update. This is the vanishing-gradient problem, and it kept neural networks shallow for years.
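The arithmetic behind the ten-layer bound can be sketched in a few lines. This deliberately ignores the weight matrices and multiplies only the activation derivatives, which is a simplifying assumption; the helper name `activation_gradient_product` is hypothetical.

```python
import math
from typing import Sequence

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)

def activation_gradient_product(pre_activations: Sequence[float]) -> float:
    """Product of sigma'(x) across layers, ignoring the weights."""
    product = 1.0
    for x in pre_activations:
        product *= sigmoid_grad(x)
    return product

# Best case: every unit sits exactly at x = 0, where sigma'(x) = 1/4.
best_case = activation_gradient_product([0.0] * 10)
print(best_case)  # 9.5367431640625e-07

# Mildly saturated units (x = 3 at every layer) are far worse.
saturated = activation_gradient_product([3.0] * 10)
print(saturated < best_case)  # True
```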

Where it still lives

Modern uses of sigmoid

Binary classification heads: a single output neuron whose sigmoid output is read as the probability of the positive class.
LSTM and GRU gates: the LSTM’s input, forget, and output gates (and the GRU’s update and reset gates) each multiply their argument by a sigmoid that decides how much to keep.
Attention gates: some architectures (gated attention units, certain MoE routers) use sigmoid for soft selection.
The SiLU activation: $\mathrm{SiLU}(x) = x \cdot \sigma(x)$ uses sigmoid as a self-gating signal inside the activation itself.
Mixture weights and soft masks: any time a model needs a continuous multiplier learned end-to-end.
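Two of the uses listed above, self-gating and soft blending, can be sketched as follows. The `soft_gate` helper and its argument names are hypothetical and only loosely modelled on a recurrent update gate; this is not any specific library's implementation.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def silu(x: float) -> float:
    """SiLU: the input gated by its own sigmoid, x * sigma(x)."""
    return x * sigmoid(x)

def soft_gate(gate_logit: float, candidate: float, previous: float) -> float:
    """Blend two values with a sigmoid gate.

    gate_logit is a hypothetical learned pre-activation; the gate g in (0, 1)
    decides how much of `candidate` replaces `previous`.
    """
    g = sigmoid(gate_logit)
    return g * candidate + (1.0 - g) * previous

print(silu(0.0))                 # 0.0
print(soft_gate(0.0, 1.0, 3.0))  # 2.0
```

A gate logit of 0 gives an even blend; pushing it strongly positive or negative approaches a hard switch while remaining differentiable.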

The pattern is consistent: sigmoid appears wherever the network needs one of “is this thing on?”, “by how much?”, or “what’s the probability?”, with no more than one or two in series so the vanishing-gradient pathology never gets a chance to compound.

The identity $\tanh(x) = 2\sigma(2x) - 1$ shows that tanh is just a rescaled, recentered sigmoid. Tanh’s output lies in $(-1, 1)$, a zero-centered range, which makes it better-behaved in hidden layers than sigmoid because the activations have a mean closer to zero. Both share the same vanishing-gradient problem in deep stacks; tanh just buys you a constant factor before saturation bites, since its derivative peaks at 1 rather than $1/4$.
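The relationship between the two functions, $\tanh(x) = 2\sigma(2x) - 1$, is easy to check numerically. The sketch below compares it against the standard library's `math.tanh` at a few arbitrary sample points; `tanh_via_sigmoid` is an illustrative name.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x: float) -> float:
    """tanh expressed as a rescaled, recentered sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert math.isclose(tanh_via_sigmoid(x), math.tanh(x), abs_tol=1e-12)

# Peak slopes at x = 0: sigmoid' = 1/4, tanh' = 1 - tanh(0)^2 = 1.
print(sigmoid(0.0) * (1.0 - sigmoid(0.0)), 1.0 - math.tanh(0.0) ** 2)  # 0.25 1.0
```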

If you find yourself reaching for sigmoid in a hidden layer of a feedforward network, you almost certainly want ReLU, GELU, or SiLU instead. If you’re producing a probability or gating a multiplier, sigmoid is still the right tool.

Go further

Why did sigmoid lose to ReLU in hidden layers?

The gradient $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ peaks at $1/4$ and decays toward zero on both ends. Stack ten sigmoid layers and the chain-ruled gradient through them is at best $(1/4)^{10} \approx 10^{-6}$, so the early layers barely move. ReLU's gradient is 1 wherever the unit is active, which is what made training deep networks finally work.

Activation function

Where does sigmoid still earn its keep?

Anywhere you need a gate (a soft, differentiable on/off): LSTM input/forget/output gates, attention masks in some architectures, learned mixture weights. It is also the standard output nonlinearity for a binary-classification head, where the value is read as $P(y = 1)$, the probability of the positive class. The vanishing-gradient problem doesn't bite there because the network only has one or two sigmoids in series, not ten.

Softmax

What's the connection to logistic regression?

Logistic regression is literally a single linear layer followed by a sigmoid, trained with binary cross-entropy. A neural network with a sigmoid output head and one hidden layer is logistic regression on learned features. The continuity from the simplest classifier to deep nets runs straight through this nonlinearity.
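To make "a single linear layer followed by a sigmoid, trained with binary cross-entropy" concrete, here is a minimal one-feature sketch using plain gradient descent. The toy data is made up for illustration, and the learning rate and epoch count are arbitrary assumptions rather than recommended settings.

```python
import math

def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def train_logistic_regression(xs, ys, lr=0.5, epochs=2000):
    """Fit p = sigmoid(w * x + b) to 1-D data by gradient descent on BCE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            # The BCE gradient with respect to the logit is (p - y).
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Toy, made-up data: negatives at low x, positives at high x, with overlap
# near zero so the classes are not perfectly separable.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, -0.2, 0.2]
ys = [0, 0, 0, 1, 1, 1, 1, 0]
w, b = train_logistic_regression(xs, ys)
print(sigmoid(w * 2.0 + b) > 0.5, sigmoid(w * -2.0 + b) < 0.5)  # True True
```

Replacing the raw feature `x` with the output of a hidden layer turns the same loop into the "logistic regression on learned features" described above.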

Logits
