Why two names — SiLU and Swish?
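Hendrycks and Gimpel proposed the function in 2016 as the sigmoid linear unit (SiLU); Ramachandran, Zoph, and Le independently rediscovered it in 2017 and named it Swish. Both names refer to the same function.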
Also known as: swish, sigmoid linear unit, swish-1
SiLU is x · σ(x): the input gated by its own sigmoid. Also known as Swish, it is now standard in Llama, Mistral, and most modern open-weight transformers. In practice it behaves almost identically to GELU.
SiLU (the sigmoid linear unit, also known as Swish) gates each input by its own sigmoid: SiLU(x) = x · σ(x). The function shape is nearly identical to GELU, the math is simpler, and it has quietly become the default activation inside most modern open-weight transformers. Llama, Mistral, and PaLM all use it as the gate inside SwiGLU.
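The interpretation is “self-gating”: multiply the input by its own sigmoid value, which acts as a soft, learned gate. For large positive x, σ(x) ≈ 1 and the input passes through almost unchanged; for large negative x, σ(x) ≈ 0 and the output is pushed toward zero.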
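As a minimal sketch of the self-gating idea, the snippet below implements SiLU with only the Python standard library and prints the gate value next to the output for a few inputs. The helper names `sigmoid` and `silu` are illustrative choices, not taken from any particular framework.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) stays in (0, 1)
    return z / (1.0 + z)


def silu(x: float) -> float:
    """SiLU / Swish-1: the input multiplied by its own sigmoid gate."""
    return x * sigmoid(x)


# The gate is close to 0 for very negative inputs and close to 1 for very positive ones.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  gate={sigmoid(x):.4f}  silu={silu(x):.4f}")
```

Deep-learning frameworks ship their own vectorized versions; this scalar form is only meant to make the gate visible.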
The derivative is a clean expression of the same idea: SiLU′(x) = σ(x) + x · σ(x) · (1 − σ(x)) = σ(x) · (1 + x · (1 − σ(x))).
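It’s nonzero almost everywhere, vanishing only at the function’s minimum near x ≈ −1.28, so there is no dying-unit problem. It approaches 1 for large positive x and 0 for large negative x.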
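One way to sanity-check the derivative is to compare the closed form σ(x) · (1 + x · (1 − σ(x))), which follows from the product rule applied to x · σ(x), against a central finite difference. The sketch below does that with the standard library only; `silu_grad` and `central_diff` are hypothetical helper names introduced for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x: float) -> float:
    return x * sigmoid(x)


def silu_grad(x: float) -> float:
    """Closed-form derivative: sigma(x) * (1 + x * (1 - sigma(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Numerical derivative via a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Print analytic and numerical derivatives side by side.
for x in (-5.0, -1.0, 0.0, 2.0, 5.0):
    print(f"x={x:5.1f}  analytic={silu_grad(x):.6f}  numeric={central_diff(silu, x):.6f}")
```

The two columns should agree to several decimal places; a mismatch would point to an error in the closed form.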
The function was first written down by Hendrycks and Gimpel in the 2016 GELU paper as the sigmoid linear unit. Ramachandran, Zoph, and Le rediscovered it independently in 2017 via an automated search over activation functions at Google and named it Swish, briefly preferring a parameterized form, x · σ(βx) with a constant or trainable β; setting β = 1 recovers SiLU.
PaLM (2022) used SwiGLU, a gated-linear-unit variant with SiLU as the gate, citing earlier results that showed small but consistent gains over plain GELU-FFN. Llama copied that choice, Mistral followed Llama, and the open-weight ecosystem inherited the lineage.
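The two curves are visually almost indistinguishable. Both pass through the origin, both have a small negative-output dip for moderately negative inputs (SiLU bottoms out near x ≈ −1.28, GELU near x ≈ −0.75), and both approach the identity for large positive inputs.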
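To put a number on how close the two curves are, the sketch below evaluates SiLU and the exact erf-based GELU on a grid over [−6, 6] and reports the largest gap and where each function reaches its minimum. The grid range and step are arbitrary choices for illustration.

```python
import math


def silu(x: float) -> float:
    # Safe on this grid; very negative inputs would need an overflow-safe sigmoid.
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF, computed with erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Grid over [-6, 6] in steps of 0.01 (an arbitrary choice for this sketch).
xs = [i / 100.0 for i in range(-600, 601)]

gaps = [abs(silu(x) - gelu(x)) for x in xs]
max_gap = max(gaps)
x_at_max = xs[gaps.index(max_gap)]

silu_min_x = min(xs, key=silu)  # grid point where SiLU is lowest
gelu_min_x = min(xs, key=gelu)  # grid point where GELU is lowest

print(f"largest |SiLU - GELU| on the grid: {max_gap:.4f} at x = {x_at_max:.2f}")
print(f"SiLU minimum near x = {silu_min_x:.2f}, value {silu(silu_min_x):.4f}")
print(f"GELU minimum near x = {gelu_min_x:.2f}, value {gelu(gelu_min_x):.4f}")
```

On this grid the gap stays below 0.2 in absolute terms, which is small relative to typical pre-activation magnitudes.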
The standard SwiGLU FFN replaces the single up-projection of a vanilla feedforward network with two parallel projections — one acts as the value path, the other is squashed through SiLU and gates the first:
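The structure described above can be written as FFN(x) = W_down · (SiLU(W_gate · x) ⊙ (W_up · x)), where ⊙ is elementwise multiplication. The following is a minimal, bias-free sketch in plain Python. The names `W_gate`, `W_up`, and `W_down`, the toy dimensions, and the random initialization are all assumptions made for illustration; real implementations use batched tensor operations.

```python
import math
import random


def silu(x: float) -> float:
    """SiLU with an overflow-safe sigmoid."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    z = math.exp(x)
    return x * z / (1.0 + z)


def matvec(W, x):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """Bias-free SwiGLU feedforward block for a single input vector."""
    gate = [silu(g) for g in matvec(W_gate, x)]    # gating path, squashed through SiLU
    value = matvec(W_up, x)                        # value path, left linear
    hidden = [g * v for g, v in zip(gate, value)]  # elementwise gating
    return matvec(W_down, hidden)                  # project back to model width


def rand_matrix(rows: int, cols: int):
    """Hypothetical toy initializer, not a recommended scheme."""
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, d_hidden = 4, 8  # toy sizes chosen for the example
W_gate = rand_matrix(d_hidden, d_model)
W_up = rand_matrix(d_hidden, d_model)
W_down = rand_matrix(d_model, d_hidden)

x = [1.0, -2.0, 0.5, 3.0]
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(len(y), y)
```

Compared with a vanilla FFN, which has one up-projection and one down-projection, this block carries three weight matrices, which is where the extra parameter cost comes from.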
The extra projection costs parameters but produces consistent quality gains at scale, and it’s the form most teams now copy.
Inside a transformer FFN — almost certainly not. Pick SiLU if you’re following the Llama lineage, GELU if you’re following the BERT/GPT lineage, and either is fine. Outside transformers (convnets, RNNs, recommender systems), the answer is more nuanced: ReLU is still common, and the marginal value of switching is small. The much more important questions are model size, data, optimizer, and learning-rate schedule.
If you’re building a new transformer FFN, SiLU and GELU are both safe defaults. If you’re matching a specific model family, copy whatever it uses.
Llama uses SwiGLU — the gated-linear-unit variant with SiLU as the gating nonlinearity. SwiGLU consistently outperforms plain GELU-FFN at scale, and the SiLU choice inside it is mostly inherited from the PaLM lineage. SiLU is also marginally cheaper than the exact-erf GELU: one sigmoid instead of one CDF evaluation.
Mathematically yes: SiLU and Swish-1 are the same function, x · σ(x).