SiLU

Also known as: swish, sigmoid linear unit, swish-1

TL;DR

SiLU is x · σ(x): the input gated by its own sigmoid. First described as the sigmoid linear unit and popularized under the name Swish, it is now standard in Llama, Mistral, and many modern open-weight transformers. Practically indistinguishable from GELU.

SiLU (the sigmoid linear unit, also known as Swish) gates each input by its own sigmoid. The function shape is nearly identical to GELU, the math is simpler, and it has quietly become the default inside many modern open-weight transformers: Llama, Mistral, and Qwen all use it, as did PaLM.

The formula

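$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$

The interpretation is “self-gating”: multiply the input by its own sigmoid value, which acts as a soft, input-dependent gate (nothing in the gate itself is learned). For large positive $x$, $\sigma(x) \to 1$ and $\mathrm{SiLU}(x) \approx x$. For large negative $x$, $\sigma(x) \to 0$ and $\mathrm{SiLU}(x) \to 0$. In between, the multiplication produces a smooth curve that dips slightly below zero, reaching a minimum of about $-0.28$ around $x \approx -1.28$, and rises monotonically from there.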
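The self-gating formula x · σ(x) is short enough to sketch directly. The following is a minimal standard-library illustration, not any framework's implementation; the helper names `sigmoid` and `silu` are chosen here for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written so that exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) stays in (0, 1)
    return z / (1.0 + z)


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x): the input scaled by its own gate."""
    return x * sigmoid(x)


# Show the gate value next to the output for a few inputs.
for x in (-10.0, -1.28, 0.0, 1.0, 10.0):
    print(f"x = {x:6.2f}   gate = {sigmoid(x):.4f}   silu = {silu(x):+.4f}")
```

The printed gate column makes the behaviour visible: close to 0 for very negative inputs, exactly 0.5 at zero, and close to 1 for large positive inputs.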

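The derivative is a clean expression of the same idea:

$$\mathrm{SiLU}'(x) = \sigma(x) + x\,\sigma(x)\,\bigl(1 - \sigma(x)\bigr) = \sigma(x)\,\bigl(1 + x\,(1 - \sigma(x))\bigr)$$

It’s nonzero almost everywhere, vanishing only at the minimum near $x \approx -1.28$, so there is no dying-unit problem of the kind ReLU has, where the gradient is exactly zero for every negative input. It approaches $1$ for $x \to +\infty$ and $0$ for $x \to -\infty$, just like GELU.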
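As a sanity check on the derivative, the product rule gives σ(x)(1 + x(1 − σ(x))), and that closed form can be compared against a finite-difference estimate. This is a small sketch under that assumption; `silu_grad` and `central_difference` are illustrative names, not library functions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written so that exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x: float) -> float:
    return x * sigmoid(x)


def silu_grad(x: float) -> float:
    """Closed-form derivative: sigma(x) * (1 + x * (1 - sigma(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Numerical derivative estimate used only to cross-check silu_grad."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-6.0, -1.278, 0.0, 2.0, 6.0):
    exact = silu_grad(x)
    approx = central_difference(silu, x)
    print(f"x = {x:7.3f}   analytic = {exact:+.6f}   numeric = {approx:+.6f}")
```

The gradient is slightly negative to the left of the minimum and crosses zero there, which is the one point where it vanishes.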

A brief history

The function was first written down by Hendrycks and Gimpel in the 2016 GELU paper as the sigmoid linear unit. Ramachandran, Zoph, and Le rediscovered it independently in 2017 via a neural-architecture search at Google and named it Swish, briefly preferring a parameterized form $x \cdot \sigma(\beta x)$. With $\beta = 1$ the two are identical, and the names converged on SiLU once PyTorch standardized it.

PaLM (2022) used SwiGLU — a gated-linear-unit variant with SiLU as the gate — and reported small but consistent gains over plain GELU-FFN. Llama copied that choice, Mistral followed Llama, and the open-weight ecosystem inherited the lineage.

SiLU vs GELU

The two curves are visually almost indistinguishable. Both pass through the origin, both have a small negative-output dip near $x \approx -1$, both saturate to the identity for large positive $x$ and to zero for large negative $x$. Quantitatively the largest absolute difference is around $0.19$, reached near $|x| \approx 2$, which in practice sits below the run-to-run noise of training.
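The size of the gap between the two curves can be checked with a quick grid scan. The sketch below assumes the exact erf-based GELU, x · Φ(x), and scans an arbitrarily chosen range of [-8, 8]; the function names are illustrative.

```python
import math


def silu(x: float) -> float:
    # Plain form; fine on the moderate range scanned below.
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Scan [-8, 8] in steps of 0.001 and keep the largest absolute gap.
xs = [i / 1000.0 for i in range(-8000, 8001)]
gap, x_at_gap = max((abs(silu(x) - gelu(x)), x) for x in xs)

print(f"largest |SiLU - GELU| = {gap:.4f} at |x| = {abs(x_at_gap):.3f}")
```

On this grid the gap comes out at roughly 0.19 near |x| ≈ 2. The difference is symmetric in x, so the same value appears on both sides of the origin.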

Where SiLU lives in modern architectures

Modern transformers using SiLU
  • Llama 1, 2, 3: SiLU inside SwiGLU in every FFN block.
  • Mistral and Mixtral: same SwiGLU pattern.
  • PaLM and PaLM 2: original SwiGLU users at scale.
  • Qwen: SiLU-based SwiGLU FFNs.
  • EfficientNet and related vision models: used Swish before the LLM era picked it up.

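The standard SwiGLU FFN replaces the single up-projection of a vanilla feedforward network with two parallel projections. One acts as the value path; the other is squashed through SiLU and gates the first:

$$\mathrm{FFN}(x) = W_{\text{down}}\,\bigl(\mathrm{SiLU}(W_{\text{gate}}\,x) \odot W_{\text{up}}\,x\bigr)$$

Here $\odot$ is the elementwise product. The names $W_{\text{gate}}$, $W_{\text{up}}$, and $W_{\text{down}}$ follow common usage rather than a fixed standard.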

The extra projection costs parameters but produces consistent quality gains at scale, and it’s the form most teams now copy.
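The gated feedforward pattern can be sketched in a few lines of plain Python. This is a toy illustration with lists standing in for tensors, not how any particular model implements it; the weight names `w_gate`, `w_up`, `w_down`, the sizes, and the random initialization are all assumptions made for the example, and biases are omitted.

```python
import math
import random


def silu(x: float) -> float:
    # Plain form; adequate for the small activations in this toy example.
    return x / (1.0 + math.exp(-x))


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feedforward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]    # gating path
    value = matvec(w_up, x)                         # value path
    hidden = [g * v for g, v in zip(gate, value)]   # elementwise gating
    return matvec(w_down, hidden)                   # project back to d_model


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden = 4, 8  # toy sizes picked for the demo
w_gate = random_matrix(d_hidden, d_model, rng)
w_up = random_matrix(d_hidden, d_model, rng)
w_down = random_matrix(d_model, d_hidden, rng)

y = swiglu_ffn([1.0, -2.0, 0.5, 0.0], w_gate, w_up, w_down)
print(y)
```

The two input-side matrices have the same shape, which is where the extra parameters come from relative to a single up-projection.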

Does the choice matter?

Inside a transformer FFN, almost certainly not. Pick SiLU if you’re following the Llama lineage, GELU if you’re following the BERT/GPT lineage; either is fine. Outside transformers (convnets, RNNs, recommender systems), the answer is more nuanced: ReLU is still common, and the marginal value of switching is small. The much more important questions are model size, data, optimizer, and learning-rate schedule.

If you’re building a new transformer FFN, SiLU and GELU are both safe defaults. If you’re matching a specific model family, copy whatever it uses.

Go further

Why two names — SiLU and Swish?

Hendrycks and Gimpel proposed $x \cdot \sigma(x)$ in 2016 inside the GELU paper, calling it the sigmoid linear unit. Ramachandran et al. at Google rediscovered it via neural architecture search in 2017 and named it Swish, then later proposed a parameterized version $x \cdot \sigma(\beta x)$. With $\beta = 1$ the two names refer to the same function. SiLU is the name PyTorch and modern papers settled on.
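The relationship between the two names is easy to see in code. The sketch below assumes the parameterized form x · σ(βx); the function name `swish` and the sample β values are chosen for illustration.

```python
import math


def swish(x: float, beta: float = 1.0) -> float:
    """Parameterized Swish: x * sigmoid(beta * x). beta = 1 gives SiLU."""
    return x / (1.0 + math.exp(-beta * x))


# beta = 0 collapses to x / 2, beta = 1 is SiLU, and a large beta
# pushes the gate toward a hard step, so the curve approaches ReLU.
for beta in (0.0, 1.0, 10.0):
    row = "  ".join(f"{swish(x, beta):+.4f}" for x in (-2.0, -0.5, 0.5, 2.0))
    print(f"beta = {beta:4.1f}:  {row}")
```

With the default `beta=1.0` the function is exactly SiLU, which is why the two names are used interchangeably.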

Why did Llama pick SiLU over GELU?

Llama uses SwiGLU — the gated-linear-unit variant with SiLU as the gating nonlinearity. SwiGLU consistently outperforms plain GELU-FFN at scale, and the SiLU choice inside it is mostly inherited from the PaLM lineage. SiLU is also marginally cheaper than the exact-erf GELU: one sigmoid instead of one CDF evaluation.

Is the SiLU self-gating intuition real?

Mathematically yes: $\sigma(x)$ acts as a soft mask on $x$ itself. The unit decides 'how much of myself to let through' based on its own value. For large positive $x$ the gate is open and $\mathrm{SiLU}(x) \approx x$; for large negative $x$ the gate is closed and $\mathrm{SiLU}(x) \approx 0$; near zero the gate is partial and produces the small smooth dip below zero that distinguishes SiLU from ReLU.
