Feedforward Network

Also known as: FFN, MLP, position-wise feedforward, two-layer MLP

TL;DR

The feedforward network — the MLP — is the per-position sub-layer that sits next to attention in every transformer block. Two linear layers with an activation in between, applied independently to each token's hidden state.

The feedforward network, also called the FFN or MLP, is the second of the two sub-layers in every transformer block. It takes each token’s hidden state, projects it up to a wider dimension, applies an activation function, and projects it back down. It does this independently for every token. It is also the sub-layer where most of the transformer’s parameters live.

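[Figure: Feedforward network, wide in the middle. Input x (dim d) is projected by W₁ from d to 4d, passed through GELU to give the hidden vector h (dim 4d), then projected by W₂ from 4d back to d to give the output y (dim d). Parameter accounting per block: attention 4d², feedforward 8d² (2× bigger), so about ⅔ of all parameters live in the FFN.]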

The shape

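For a hidden dimension $d$ and an inner dimension $d_{ff}$ (typically $d_{ff} = 4d$):

$$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x)$$

with $W_1 \in \mathbb{R}^{d_{ff} \times d}$ and $W_2 \in \mathbb{R}^{d \times d_{ff}}$, where $\sigma$ is the activation function (for example GELU) and bias terms are omitted. Two linear layers, one nonlinearity in between. Applied to every position’s hidden state independently, with no mixing across tokens. That’s what makes it “position-wise.”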
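As a rough illustration of the shape described above, here is a minimal NumPy sketch. It assumes the column-vector convention (the first weight matrix has shape `(d_ff, d)`, the second `(d, d_ff)`), omits biases, and uses the tanh approximation of GELU; the names `ffn`, `gelu`, `W1`, `W2` and the toy sizes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; exact GELU uses the Gaussian CDF)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(X, W1, W2):
    """Two-layer feedforward block without biases.

    X:  (n_tokens, d)   one hidden state per row
    W1: (d_ff, d)       up-projection
    W2: (d, d_ff)       down-projection
    """
    H = gelu(X @ W1.T)   # (n_tokens, d_ff): expand and apply the nonlinearity
    return H @ W2.T      # (n_tokens, d): project back down

rng = np.random.default_rng(0)
d, n_tokens = 8, 5        # toy sizes, chosen only for the example
d_ff = 4 * d              # the usual 4x expansion

W1 = rng.normal(size=(d_ff, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d_ff)) / np.sqrt(d_ff)
X = rng.normal(size=(n_tokens, d))

Y = ffn(X, W1, W2)
print(Y.shape)  # (5, 8): same shape as the input
```

The output has the same shape as the input, which is what lets the block be added back onto the residual stream.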

The 4× expansion is doctrine. Inner dimension 4× the model dimension was set in the original 2017 transformer paper and has remained the default since.

Why the FFN holds most of the parameters

Per transformer block, the parameter accounting is roughly:

  • Attention: $4d^2$ parameters (Q, K, V, output projections).
  • FFN: $8d^2$ parameters (two projections, expand and contract).

So for every transformer block, the FFN has about 2× the parameters of attention. Across the whole model, this means roughly two-thirds of the parameter count is in the FFN. This is the load-bearing fact behind several scaling tricks.
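The arithmetic behind the two-thirds figure can be checked in a few lines. This is a back-of-the-envelope sketch that assumes standard multi-head attention with four d × d projections, a 4× FFN expansion, and no biases, layer norms, or embeddings; the width `d = 4096` and the function names are hypothetical.

```python
def attention_params(d):
    # Q, K, V and output projections, each d x d (biases ignored)
    return 4 * d * d

def ffn_params(d, expansion=4):
    # up-projection d -> expansion*d and down-projection back (biases ignored)
    d_ff = expansion * d
    return 2 * d * d_ff

d = 4096  # hypothetical model width
attn = attention_params(d)
ffn = ffn_params(d)
ffn_share = ffn / (attn + ffn)

print(attn, ffn, round(ffn_share, 3))  # 67108864 134217728 0.667
```

Counting embeddings, or using attention variants with fewer key/value projections, shifts the exact share, so the two-thirds number is best read as a per-block approximation.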

What the FFN actually computes

Mechanistically, the most popular interpretation is that the FFN is a key-value memory. The first projection acts as a set of “keys” — each row is a pattern detector, and the activation function lights up the rows whose pattern matches the current hidden state. The second projection acts as the corresponding “values” — each column is a contribution that gets added to the output proportional to how strongly its key fired.

This perspective comes from work like Geva et al. (2021) showing that individual FFN neurons in trained transformers correspond to recognizable input patterns, and mechanistic-interpretability work building on it. It’s not the only story, but it’s the cleanest one for understanding what the FFN does on top of attention.
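The key-value reading can be made concrete with a small NumPy check: the usual matrix form of the FFN gives the same output as summing "value" vectors weighted by how strongly each "key" fired. This is only a sketch with random weights, assuming the first matrix has shape `(d_ff, d)` so that its rows are keys, and the second has shape `(d, d_ff)` so that its columns are values; all names are illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption made for this sketch)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d, d_ff = 8, 32
W1 = rng.normal(size=(d_ff, d))   # each row acts as a "key" (pattern detector)
W2 = rng.normal(size=(d, d_ff))   # each column acts as a "value" (contribution)
x = rng.normal(size=d)            # one token's hidden state

# How strongly each key fires for this hidden state: one coefficient per neuron.
a = gelu(W1 @ x)                  # shape (d_ff,)

# Standard matrix form of the second projection.
y_matrix = W2 @ a

# Memory form: add up value vectors, weighted by their key's activation.
y_memory = np.zeros(d)
for i in range(d_ff):
    y_memory += a[i] * W2[:, i]

print(np.allclose(y_matrix, y_memory))  # True: the two forms are the same computation

# The neurons contributing most to this token's output.
top_neurons = np.argsort(-np.abs(a))[:3]
```

In a trained model, inspecting which inputs maximize a given coefficient `a[i]` is roughly how per-neuron patterns are studied; with random weights as here, the neurons carry no meaning.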

Where the FFN matters in production
  • Most parameters of every dense LLM
  • The natural target for mixture-of-experts replacement
  • The natural target for sparsity / pruning research
  • Where SwiGLU, GeGLU and other GLU variants apply
  • The primary path for “knowledge” storage in mechanistic-interpretability work

A common modern variant, SwiGLU, splits the first projection in two and gates one with the other:

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = W_2\,\big(\mathrm{Swish}(W_1 x) \odot W_3 x\big)$$

Three matrices instead of two, with elementwise multiplication ($\odot$) gating one stream by the other. Llama, Mistral, and most newer open models use this. To keep the parameter count constant after splitting, the inner dimension shrinks from $4d$ to $\tfrac{8}{3}d$. The empirical gain is small (roughly 0.3% on validation loss at scale) but consistent enough that it became standard. The intuition: gating gives the FFN a multiplicative interaction in addition to the additive one, which expands the function class without much extra compute.
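A minimal NumPy sketch of the gated variant, together with the parameter check behind the shrunken inner dimension. It assumes no biases and weight matrices stored as `(out, in)`; the names `W1` (gate), `W3` (up), and `W2` (down) are illustrative, and `d = 48` is chosen only so that 8d/3 is an integer.

```python
import numpy as np

def swish(x):
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(X, W1, W3, W2):
    """Gated feedforward block without biases.

    X: (n_tokens, d); W1, W3: (d_ff, d); W2: (d, d_ff)
    """
    gate = swish(X @ W1.T)       # (n_tokens, d_ff): gating stream
    up = X @ W3.T                # (n_tokens, d_ff): linear stream
    return (gate * up) @ W2.T    # elementwise product, then project down

d, n_tokens = 48, 5
d_ff_plain = 4 * d               # 192: inner width of the two-matrix FFN
d_ff_glu = 8 * d // 3            # 128: inner width of the three-matrix FFN

plain_params = 2 * d * d_ff_plain   # two matrices
glu_params = 3 * d * d_ff_glu       # three matrices
print(plain_params, glu_params)     # 18432 18432

rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_ff_glu, d)) / np.sqrt(d)
W3 = rng.normal(size=(d_ff_glu, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d_ff_glu)) / np.sqrt(d_ff_glu)
X = rng.normal(size=(n_tokens, d))
Y = swiglu_ffn(X, W1, W3, W2)
print(Y.shape)                      # (5, 48)
```

Real implementations often round the inner dimension to a hardware-friendly multiple, so the parameter match is approximate rather than exact.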

Nearly every scaling and sparsity story in modern LLMs — MoE, pruning, GLU variants, quantization budgets — is really a story about the FFN, because that’s where the parameters are.

Go further

Why is the FFN four times the model dimension?

In practice, the 4× expansion ratio has proven a good trade-off between capacity and compute, and it has been the default since the original transformer paper. Newer architectures with GLU variants (SwiGLU in Llama) use a 2/3 × 4 = 8/3 expansion to keep parameter count constant after splitting the projection, but the doctrine is the same: expand wide, apply a nonlinearity, project back.

Where do most of a transformer's parameters live?

In the feedforward layers. Roughly two-thirds of the parameters in a typical transformer block are in the FFN, not attention. This is why mixture-of-experts replaces the FFN with multiple parallel FFNs and routes each token to a small subset of them: that's where the parameters are, so that's where you can scale capacity without scaling compute per token.
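To make the capacity-versus-compute point concrete, here is a deliberately simplified mixture-of-experts sketch in NumPy. It assumes top-1 routing by a linear router, ReLU experts, no biases, and no gate weighting or load balancing, all of which real systems handle differently; every name here (`moe_ffn`, `router_W`, `experts`) is hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def moe_ffn(X, router_W, experts):
    """Top-1 mixture-of-experts FFN (illustrative only).

    X: (n_tokens, d); router_W: (n_experts, d)
    experts: list of (W1, W2) pairs with W1: (d_ff, d), W2: (d, d_ff)
    """
    logits = X @ router_W.T            # (n_tokens, n_experts): router scores
    choice = logits.argmax(axis=1)     # index of the single expert per token
    Y = np.zeros_like(X)
    for e, (W1, W2) in enumerate(experts):
        mask = choice == e             # tokens routed to expert e
        if mask.any():
            Y[mask] = relu(X[mask] @ W1.T) @ W2.T
    return Y, choice

rng = np.random.default_rng(0)
d, d_ff, n_experts, n_tokens = 8, 32, 4, 6
experts = [
    (rng.normal(size=(d_ff, d)), rng.normal(size=(d, d_ff)))
    for _ in range(n_experts)
]
router_W = rng.normal(size=(n_experts, d))
X = rng.normal(size=(n_tokens, d))

Y, choice = moe_ffn(X, router_W, experts)

params_per_expert = 2 * d * d_ff
total_ffn_params = n_experts * params_per_expert   # capacity stored in the layer
active_ffn_params = params_per_expert              # weights touched per token (router excluded)
print(Y.shape, total_ffn_params, active_ffn_params)  # (6, 8) 2048 512
```

The layer holds four experts' worth of FFN parameters, but each token only passes through one of them, which is the sense in which capacity grows without per-token compute growing.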

Why is the FFN 'position-wise' — what does that mean?

The same FFN weights are applied independently to each token's hidden state. There's no mixing across positions inside the FFN — that's what attention is for. So the FFN is really a bank of N independent two-layer MLPs, sharing weights, processing N tokens in parallel.
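The position-wise property is easy to verify numerically. The sketch below, which assumes a bias-free FFN with a tanh-approximated GELU and illustrative names and sizes, checks that running the whole sequence at once matches running each token alone, and that shuffling the tokens simply shuffles the outputs.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption made for this sketch)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(X, W1, W2):
    # X: (n_tokens, d); W1: (d_ff, d); W2: (d, d_ff); no biases
    return gelu(X @ W1.T) @ W2.T

rng = np.random.default_rng(0)
d, d_ff, n_tokens = 8, 32, 5
W1 = rng.normal(size=(d_ff, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d_ff)) / np.sqrt(d_ff)
X = rng.normal(size=(n_tokens, d))

# All tokens at once.
Y_batch = ffn(X, W1, W2)

# One token at a time, with the same shared weights.
Y_loop = np.stack([ffn(x[None, :], W1, W2)[0] for x in X])
same_as_loop = np.allclose(Y_batch, Y_loop)

# Reordering the tokens reorders the outputs and changes nothing else.
perm = rng.permutation(n_tokens)
permutes_cleanly = np.allclose(ffn(X[perm], W1, W2), Y_batch[perm])

print(same_as_loop, permutes_cleanly)  # True True
```

Attention would fail the first check, since each token's output there depends on the other tokens in the sequence.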
