Tanh

Also known as: hyperbolic tangent

TL;DR

Tanh maps any real number into the open interval (−1, 1). It is a zero-centered sibling of sigmoid that ruled hidden layers before ReLU, and it still lives in RNN cells, attention-logit soft-clipping tricks, and GELU's tanh approximation.

Tanh, the hyperbolic tangent, is the zero-centered cousin of sigmoid. It squashes any real input into $(-1, 1)$, has the same S-curve shape, and shares the same saturation behavior at large $|x|$. It was the standard hidden-layer activation from the late 1980s through 2011, then lost the throne to ReLU. It still earns its keep inside recurrent cells, in the GELU approximation, and anywhere a bounded zero-centered nonlinearity matters.

The formula

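$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The function is odd: $\tanh(-x) = -\tanh(x)$. It passes through the origin, asymptotes to $+1$ as $x \to +\infty$ and to $-1$ as $x \to -\infty$, and is steepest at $x = 0$, where its derivative is exactly $1$.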
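A minimal sketch of evaluating tanh from its standard exponential definition, assuming plain Python floats. The helper name `tanh_stable` is hypothetical; it rewrites the quotient in terms of $e^{-2|x|}$ so that large inputs cannot overflow, and compares the result with the standard library's `math.tanh`.

```python
import math

def tanh_stable(x: float) -> float:
    """tanh(x) computed from exp(-2|x|), which cannot overflow."""
    # expm1(-2|x|) = exp(-2|x|) - 1, accurate even when |x| is tiny.
    e = math.expm1(-2.0 * abs(x))
    # tanh(|x|) = (1 - exp(-2|x|)) / (1 + exp(-2|x|)) = -e / (e + 2)
    y = -e / (e + 2.0)
    # tanh is odd, so restore the sign of the input.
    return math.copysign(y, x)

for x in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(x, tanh_stable(x), math.tanh(x))
```

In practice `math.tanh` (or the framework's built-in) is the right call; the sketch only makes the oddness and the saturation at $\pm 1$ easy to inspect.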

The derivative has the same self-referential form as sigmoid’s:

$$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$$

which means once you’ve computed the forward pass, the backward pass costs one multiply and one subtract.
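As a rough illustration of that cost, the sketch below caches the forward output and reuses it in the backward step, then checks the result against a central finite difference. The function names `tanh_forward` and `tanh_backward` are illustrative, not any framework's API.

```python
import math

def tanh_forward(x: float) -> float:
    return math.tanh(x)

def tanh_backward(y: float, grad_out: float) -> float:
    """Gradient w.r.t. the input, using only the cached output y = tanh(x)."""
    local = 1.0 - y * y          # one multiply, one subtract
    return grad_out * local      # chain rule with the upstream gradient

x = 0.7
y = tanh_forward(x)
analytic = tanh_backward(y, 1.0)

# Central finite difference as an independent check.
h = 1e-6
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2.0 * h)
print(analytic, numeric)
```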

Why “zero-centered” matters

The single most cited advantage of tanh over sigmoid is that its output range is symmetric around zero. Sigmoid’s output sits in $(0, 1)$ with mean roughly $0.5$, which means the input to the next linear layer has a systematically positive mean. The gradient updates that follow develop a directional bias: for any given neuron, the gradients on all of its incoming weights share the same sign at each step, which manifests as slow, zig-zagging optimization.

Tanh’s outputs sit in $(-1, 1)$ with mean closer to zero. Successive layers receive zero-centered inputs, gradients point in useful directions, and the optimization runs more cleanly. The effect is real but not enormous; it was a significant enough win in the 1990s and 2000s that tanh became the default, but ReLU’s gradient-flow advantage eventually outweighed it.
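The sign argument can be made concrete with a toy example. The sketch below assumes a single downstream neuron $z = \sum_i w_i a_i$, so that $\partial L / \partial w_i = \delta \cdot a_i$; the pre-activations and the upstream gradient $\delta$ are made-up values chosen only for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activations of four units in the previous layer.
pre = [-2.0, -1.0, 0.5, 3.0]
# Hypothetical upstream gradient dL/dz for one neuron in the next layer.
delta = 0.7

# For z = sum_i w_i * a_i, the weight gradient is dL/dw_i = delta * a_i.
grad_sigmoid = [delta * sigmoid(p) for p in pre]   # inputs a_i all in (0, 1)
grad_tanh = [delta * math.tanh(p) for p in pre]    # inputs a_i in (-1, 1)

print("sigmoid inputs, gradient > 0:", [g > 0 for g in grad_sigmoid])
print("tanh inputs, gradient > 0:   ", [g > 0 for g in grad_tanh])
```

With sigmoid inputs every weight gradient takes the sign of $\delta$, so one step can only move all of that neuron's weights up together or down together. With tanh inputs the signs can differ per weight.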

The relationship to sigmoid

The two are linearly related:

$$\tanh(x) = 2\,\sigma(2x) - 1$$

So tanh is a rescaled, recentered sigmoid. They share the same S-shape, the same exponential cost, and, critically, the same vanishing-gradient pathology. For $|x| > 5$, for example, both derivatives are below $0.01$, and a deep stack of either activation passes almost no gradient signal through to early layers once its units saturate.
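A short numerical check of both claims, written as a sketch under simplifying assumptions: the identity is tested at a handful of points, and the depth experiment multiplies only the activation derivatives (weight matrices are ignored, and every layer is assumed to sit at the same pre-activation `z`).

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z: float) -> float:
    return 1.0 - math.tanh(z) ** 2

# tanh(x) and 2*sigmoid(2x) - 1 should agree up to rounding error.
points = (-3.0, -0.5, 0.0, 0.5, 3.0)
max_gap = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in points)
print("largest identity gap:", max_gap)

# Backprop multiplies one local derivative per layer, so a moderately
# saturated unit repeated over `depth` layers shrinks the gradient fast.
depth = 10
z = 2.0
print("tanh chain:   ", d_tanh(z) ** depth)
print("sigmoid chain:", d_sigmoid(z) ** depth)
```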

Where tanh still lives

Modern places you'll still find tanh
  • LSTM cells: the candidate cell state is $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$. Bounded and zero-centered is exactly what you want for a recurrent state.
  • GRU cells: same role in the candidate-hidden-state computation.
  • GELU approximation: the practical form $\tfrac{1}{2}x\,(1 + \tanh(\cdot))$ is one of the most widely deployed activations in transformers.
  • Soft-clipping attention logits: some training-stability tricks pass attention scores through a scaled tanh, such as $c \cdot \tanh(s / c)$ for a cap $c$, to bound them before the softmax.
  • Output heads with bounded targets: regression tasks where the label is in $[-1, 1]$ (normalized control signals, certain audio features) often end in a tanh head.

The pattern is consistent: tanh appears whenever a model needs a bounded, symmetric, smooth signal — and only in one or two places, never stacked deep enough for vanishing gradients to bite.
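As one illustration of the soft-clipping use from the list above, here is a minimal sketch that assumes the common form $c \cdot \tanh(s / c)$. The function name `soft_cap`, the cap value, and the example logits are all hypothetical, not taken from any particular model.

```python
import math

def soft_cap(scores, cap):
    """Smoothly bound each score to [-cap, cap]; close to identity near zero."""
    return [cap * math.tanh(s / cap) for s in scores]

def softmax(scores):
    m = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.5, -1.0, 120.0, 3.0]             # made-up attention scores
capped = soft_cap(logits, cap=30.0)          # cap value is illustrative only
print(capped)
print(softmax(capped))
```

Small scores pass through almost unchanged because $\tanh(u) \approx u$ near zero, while the outlier is pulled back toward the cap instead of dominating the softmax without limit.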

The tanh form of GELU is $\mathrm{GELU}(x) \approx \tfrac{1}{2}\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^3\right)\right)\right)$. The polynomial inside the tanh was fit to approximate the standard-normal CDF, and tanh is a natural choice for that fit: with the $\sqrt{2/\pi}$ scaling, $\tfrac{1}{2}(1 + \tanh(\cdot))$ matches the CDF’s value and slope at the origin. A sigmoid-based approximation exists too (Swish/SiLU is essentially “the GELU approximation with the polynomial replaced by the identity”), but historically the GELU paper picked tanh and the framework code inherited that choice.
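To see how close the tanh form is, the sketch below compares it with the exact GELU, $x\,\Phi(x)$, computed through `math.erf`. The grid of test points is an arbitrary choice for illustration.

```python
import math

def gelu_exact(x: float) -> float:
    # x * Phi(x), with Phi the standard-normal CDF written via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation with the cubic correction term.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Compare on an evenly spaced grid over [-6, 6].
xs = [i / 100.0 for i in range(-600, 601)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print("largest absolute gap on the grid:", max_err)
```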

If you’re writing a new feedforward network in 2026, you almost certainly want ReLU, GELU, or SiLU instead of tanh. If you’re working on an RNN or any model where a state needs to live in a symmetric bounded interval, tanh is still the natural choice.

Go further

Why is tanh just a rescaled sigmoid?

The identity $\tanh(x) = 2\,\sigma(2x) - 1$ shows that tanh is a sigmoid with the input scaled by 2 and the output shifted and stretched into $(-1, 1)$. The same S-curve, the same saturation problem at large $|x|$, just centered at zero instead of at $0.5$.

Why is zero-centered output an advantage?

Sigmoid's output mean is around $0.5$, which biases the downstream linear layer's pre-activations away from zero. The weight gradients of each downstream neuron then share a common sign, which slows optimization. Tanh's output mean is around zero, so gradient updates point in genuinely useful directions. The effect is small but was enough that tanh was the standard hidden-layer activation for years before ReLU.

Where does tanh still appear in modern models?

LSTM and GRU cells use tanh for the candidate-state nonlinearity. The GELU approximation uses a tanh. Some attention variants apply tanh as a soft clip on attention logits to prevent extreme values. And many older normalization layers and output heads that need bounded activations still reach for it.
