Mixture of Experts (MoE)

Also known as: MoE, sparse mixture of experts, sparse MoE

TL;DR

An architecture that replaces the dense feed-forward layer in a transformer with a sparse routing layer over many expert subnetworks — each token activates only a few experts.

A mixture-of-experts (MoE) transformer replaces the dense feed-forward (FFN) layer in each transformer block with a sparse mixture: $N$ FFN “experts” plus a small router that, for each token, picks $K$ of them to actually compute.

[Figure: Mixture of experts. An input token x enters a softmax router (N = 8, top-K = 2) that assigns a probability to each expert E0 to E7; the outputs of the selected FFN experts are combined in a weighted sum to produce the output token x'.]

Relative to a single dense FFN, the MoE layer holds $N$× the parameters, while the active parameters per token are only $K$× the size of one expert. With $N = 8$, $K = 2$, you get a model with the parameter capacity of an 8×-bigger dense model, at the compute of a 2×-bigger dense model. This is the entire economic value of the architecture.
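The arithmetic can be checked with a small sketch. The layer sizes below are hypothetical, biases are ignored, and only the FFN and router weights are counted (attention is left out), so treat the result as an illustration of the ratio rather than a model of any real checkpoint.

```python
def ffn_params(d_model: int, d_hidden: int) -> int:
    """Weights of one 2-layer FFN: up-projection plus down-projection."""
    return 2 * d_model * d_hidden


def moe_ffn_params(d_model: int, d_hidden: int, n_experts: int, top_k: int):
    """Return (total, active-per-token) weights for one MoE layer."""
    expert = ffn_params(d_model, d_hidden)
    router = d_model * n_experts          # one score per expert
    total = n_experts * expert + router   # everything that must be stored
    active = top_k * expert + router      # what one token actually touches
    return total, active


# Hypothetical sizes: embedding dim 4096, hidden dim 4x that.
d_model, d_hidden = 4096, 4 * 4096
dense = ffn_params(d_model, d_hidden)
total, active = moe_ffn_params(d_model, d_hidden, n_experts=8, top_k=2)

print(f"total / dense  = {total / dense:.2f}")   # 8.00
print(f"active / dense = {active / dense:.2f}")  # 2.00
```

The router contributes a negligible number of weights at these sizes, which is why the ratios round to exactly 8 and 2.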

How it works mechanically

Each transformer block normally has:

  1. Attention.
  2. Dense FFN (a 2-layer MLP, ~4× embed dim hidden).

MoE replaces step 2 with:

  1. Router. A small linear layer $W_g$. Computes scores over the $N$ experts.
  2. Top-$K$ gating. Pick the $K$ highest-scoring experts. Send the token through only those.
  3. Weighted sum. Combine the expert outputs weighted by their gate scores.

Each expert is just a normal FFN, the same shape as the dense one would be. The router is tiny (~$d \times N$ parameters, where $d$ is the embedding dimension).
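The three steps can be sketched for a single token in plain Python. This is a toy illustration, not any library's implementation: the ReLU activation, the random weights, the tiny dimensions, and the choice to renormalize the gate scores over the selected experts are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ffn(expert, x):
    """One expert: a 2-layer MLP with ReLU (the activation is an assumption)."""
    W1, W2 = expert
    hidden = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, hidden)


def moe_layer(x, W_g, experts, top_k):
    """Route one token through its top_k experts and mix their outputs."""
    probs = softmax(matvec(W_g, x))              # 1. router scores
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]                      # 2. top-K gating
    norm = sum(probs[i] for i in chosen)         # renormalize gates (assumption)
    out = [0.0] * len(x)
    for i in chosen:                             # 3. weighted sum
        gate = probs[i] / norm
        out = [o + gate * y for o, y in zip(out, ffn(experts[i], x))]
    return out, chosen


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, n_experts, top_k = 4, 8, 2                    # toy sizes
W_g = rand_matrix(rng, n_experts, d)             # router: d * N weights
experts = [(rand_matrix(rng, 4 * d, d), rand_matrix(rng, d, 4 * d))
           for _ in range(n_experts)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

y, chosen = moe_layer(x, W_g, experts, top_k)
print("experts used:", chosen)
print("output dim:", len(y))
```

Only the chosen experts' weights are ever multiplied, so compute scales with top_k while the stored weights scale with the number of experts.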

Without intervention, gradient descent finds a degenerate equilibrium: route everything to expert 0, train expert 0 hard, ignore the rest. The standard fix is an auxiliary load-balancing loss added to the training objective. Switch Transformer (Fedus et al., 2022) uses $L_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$, where $f_i$ is the fraction of tokens dispatched to expert $i$, $P_i$ is the mean router probability for that expert, and $\alpha$ is a small weighting coefficient. When utilization is uniform, both terms are flat at $1/N$ and the loss is constant (equal to $\alpha$); imbalance spikes the product. DeepSeek’s “auxiliary-loss-free” variant instead biases router logits at each step toward under-utilized experts, sidestepping the loss-tuning headache. Without one of these mechanisms, MoE training silently degrades within the first thousand steps.
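Both mechanisms fit in a few lines of plain Python. This is a minimal sketch assuming top-1 routing and a single batch. The function names, the default `alpha`, and the `step` size are illustrative choices. The sign-based bias nudge follows the general idea of the loss-free approach and is not a reproduction of any production training loop.

```python
def switch_aux_loss(router_probs, assignments, n_experts, alpha=0.01):
    """alpha * N * sum_i f_i * P_i for one batch with top-1 routing."""
    n_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    # P_i: mean router probability for expert i
    P = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))


def update_router_bias(bias, assignments, n_experts, step=0.001):
    """Sign-based nudge: raise under-used experts, lower over-used ones."""
    target = len(assignments) / n_experts        # ideal tokens per expert
    new_bias = []
    for i in range(n_experts):
        load = assignments.count(i)
        if load < target:
            new_bias.append(bias[i] + step)
        elif load > target:
            new_bias.append(bias[i] - step)
        else:
            new_bias.append(bias[i])
    return new_bias


N = 4
# Balanced batch: uniform probabilities, one token per expert.
balanced = switch_aux_loss([[0.25] * N] * 4, [0, 1, 2, 3], N)
# Collapsed batch: the router is confident in expert 0 for every token.
collapsed = switch_aux_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], N)
print("balanced loss: ", balanced)
print("collapsed loss:", collapsed)
print("bias after collapse:", update_router_bias([0.0] * N, [0, 0, 0, 0], N))
```

In the balanced batch the loss equals `alpha`, while the collapsed batch scores nearly N times higher. The bias update pushes expert 0 down and the idle experts up.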

Why scaling helps

Empirical scaling laws say more parameters means better quality, holding compute fixed. MoE decouples parameters from compute: at the same FLOPs/token, a sparse model has many more parameters to encode knowledge in. Mixtral 8×7B (Mistral, 2023) has 47B total parameters but ~13B active per token, and substantially outperforms dense 13B models. DeepSeek-V3 takes this further: 671B total, 37B active.

[Figure: MoE top-K sparsity. A softmax(W_g · x) router over an expert bank of N = 8 FFN experts with top-K = 2, giving an activation ratio of K/N = 2/8 = 0.25.]

Production MoE models in the wild

  • Mixtral 8x7B / 8x22B — 47B and 141B total, 13B and 39B active
  • DeepSeek-V3 — 671B total, 37B active, 256 experts with 8 active
  • Qwen1.5-MoE-A2.7B — 14.3B total, 2.7B active, fine-grained expert design
  • GPT-4 — widely reported as MoE, exact config not public
  • Grok-1 — 314B total, ~86B active, 8 experts top-2
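To compare these configurations, the share of parameters active per token can be computed directly from the figures listed. The snippet below is a small sketch that uses only the totals quoted in the list, in billions. The helper name is made up for the example.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a model's parameters used for any single token."""
    return active_b / total_b


# (total, active) in billions of parameters, as quoted in the list above.
models = {
    "Mixtral 8x7B": (47, 13),
    "Mixtral 8x22B": (141, 39),
    "DeepSeek-V3": (671, 37),
    "Grok-1": (314, 86),
}

for name, (total_b, active_b) in models.items():
    share = active_fraction(total_b, active_b)
    print(f"{name:14s} {share:.1%} of parameters active per token")
```

The Mixtral and Grok-1 figures land a little above K/N = 25%, which is consistent with attention and embedding weights being used by every token rather than routed. DeepSeek-V3 activates a far smaller share because it spreads its capacity over many more experts.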

The serving problem

MoE is great in training, hard in production. Three issues:

  • VRAM. All experts must be loaded even if only $K$ are computed per token. A 47B-total model needs 94GB at fp16 for its weights alone, the same as a dense 47B model. The compute savings don’t extend to memory.
  • Routing variance. Different tokens go to different experts; batched inference over many queries hits different expert patterns, breaking the contiguous-matrix-multiply pattern that makes dense transformers fast on GPUs.
  • Load imbalance. If the router sends 90% of tokens to expert 0, expert 0 is the bottleneck and the others are idle. Auxiliary load-balancing losses during training mitigate this.
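The VRAM figure in the first bullet is simple arithmetic: parameter count times bytes per parameter. The sketch below covers weights only and deliberately ignores KV cache and activations. The 13B active count is the Mixtral 8x7B figure quoted earlier.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (fp16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9


total_params = 47e9    # every expert must be resident
active_params = 13e9   # what a single token actually uses

print("resident weights:", weight_memory_gb(total_params), "GB")       # 94.0 GB
print("weights used per token:", weight_memory_gb(active_params), "GB")  # 26.0 GB
```

The gap between the two numbers is the memory you pay for without getting per-token compute from it.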

By 2025, serving stacks such as vLLM and SGLang ship specialized MoE kernels (fused gating plus grouped GEMM), but throughput per active parameter is still worse than dense, even when total throughput is competitive. See for how production schedulers cope.
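The dispatch step that precedes a grouped matrix multiply can be illustrated in plain Python: tokens are regrouped by the expert they were routed to, so each expert processes one contiguous batch. This is a conceptual sketch with made-up routing decisions and a hypothetical helper name. It does not reflect how vLLM or SGLang implement their kernels.

```python
from collections import defaultdict


def group_by_expert(topk_ids):
    """Map expert id -> list of token indices routed to it.

    topk_ids[t] holds the experts chosen for token t.
    """
    groups = defaultdict(list)
    for token, chosen in enumerate(topk_ids):
        for expert in chosen:
            groups[expert].append(token)
    return dict(groups)


# Hypothetical routing decisions for a batch of 4 tokens, top-K = 2.
topk_ids = [[1, 5], [1, 2], [0, 1], [1, 5]]
groups = group_by_expert(topk_ids)

for expert in sorted(groups):
    print(f"expert {expert}: tokens {groups[expert]}")
```

Even in this tiny batch the group sizes are uneven: expert 1 receives all four tokens while experts 0 and 2 receive one each. That unevenness is what makes per-expert batches harder to run efficiently than one dense matrix multiply.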

Why MoE matters strategically

The frontier-LLM scaling shape since ~2024: MoE for parameter count, dense for active compute. GPT-4 is reportedly MoE; Claude is reportedly dense; the open-weight frontier (DeepSeek, Mixtral, Qwen MoE variants) is solidly MoE. The bet is that knowledge density per FLOP is the binding constraint, and sparse routing is the architecture that solves it.

For specialized small models such as classifiers, MoE generally isn’t the right call. The whole point of specialization is that the model doesn’t need broad knowledge; a dense FFN is fine. MoE is a frontier-LLM technique.

Go further

Why does MoE need a router and not just averaging?

The point is sparsity: each token should activate only $K$ of $N$ experts (for example, 2 of 8 in Mixtral 8x7B or 8 of 256 in DeepSeek-V3). The router is a tiny learned network that picks the top-$K$ experts per token. Activating all $N$ experts would just be a wider dense FFN; activating only $K$ keeps compute fixed while parameter count balloons.

What's the catch?

Memory and serving complexity. All experts have to be in GPU memory even if only $K$ are active per token, so VRAM bloat is severe. Load balancing (keeping experts roughly equally utilized) requires auxiliary losses during training; without them, the router collapses to using a few experts. And batched inference across queries with different routing patterns is harder than dense.

Are the experts actually specialized?

Less than the name suggests. Routing patterns rarely correspond to clean human-interpretable categories (math, code, etc). Experts tend to specialize at lower levels — token-class, syntactic role, position. The 'experts' are emergent partitions of the FFN computation, not coherent skill modules.
