Why does MoE need a router and not just averaging?
The point is sparsity — each token should only activate
Also known as: MoE, sparse mixture of experts, sparse MoE
An architecture that replaces the dense feed-forward layer in a transformer with a sparse routing layer over many expert subnetworks — each token activates only a few experts.
A mixture-of-experts (MoE) transformer replaces the dense feed-forward (FFN) layer in each transformer block with a sparse mixture:
The total parameter count is
Each transformer block normally has:
MoE replaces step 2 with:
Each expert is just a normal FFN — same shape as the dense one would be. The router is tiny (~
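The routing step described above can be sketched for a single token in plain Python. This is a minimal illustration, not any particular model's implementation: the names `make_expert` and `moe_forward`, the ReLU two-layer expert, and renormalising the gate weights over only the selected experts are all assumptions made for the example.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_expert(d_model, d_ff, rng):
    # Hypothetical expert: an ordinary two-layer FFN with ReLU,
    # the same shape a dense FFN would have.
    W1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    def expert(x):
        h = [max(0.0, v) for v in matvec(W1, x)]
        return matvec(W2, h)
    return expert

def moe_forward(x, router_W, experts, k=2):
    # The router is a single linear map: one logit per expert.
    logits = matvec(router_W, x)
    # Keep only the k highest-scoring experts for this token.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the selected logits only.
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top, gates

rng = random.Random(0)
d_model, d_ff, n_experts = 8, 16, 4
experts = [make_expert(d_model, d_ff, rng) for _ in range(n_experts)]
router_W = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_experts)]
x = [rng.gauss(0, 1) for _ in range(d_model)]
out, top, gates = moe_forward(x, router_W, experts, k=2)
```

In this sketch the router holds `n_experts * d_model` weights, while each expert holds `2 * d_model * d_ff`, so the routing decision is cheap next to the expert computation it selects.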
Without intervention, training tends to collapse into a degenerate solution: the router sends almost every token to one expert, that expert gets trained hard, and the rest are ignored. The standard fix is an auxiliary load-balancing loss added to the training objective. Switch Transformer (Fedus et al., 2022) uses
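A load-balancing term of this kind is commonly written as α · N · Σᵢ fᵢ · Pᵢ, where N is the number of experts, fᵢ is the fraction of tokens dispatched to expert i, and Pᵢ is the mean router probability assigned to expert i. The sketch below computes that quantity under the assumption of top-1 routing; the function name, argument layout, and default `alpha` are illustrative choices rather than anything taken from a specific codebase.

```python
def load_balancing_loss(router_probs, assignments, n_experts, alpha=0.01):
    # router_probs: one probability list (length n_experts) per token.
    # assignments: index of the expert each token was dispatched to (top-1).
    n_tokens = len(assignments)
    # f[i]: fraction of tokens dispatched to expert i.
    f = [0.0] * n_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    # P[i]: mean router probability given to expert i across tokens.
    P = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))

# Perfectly balanced routing over 4 experts.
balanced = load_balancing_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    n_experts=4,
)

# Collapsed routing: every token goes to expert 0 with probability 1.
collapsed = load_balancing_loss(
    router_probs=[[1.0, 0.0, 0.0, 0.0]] * 4,
    assignments=[0, 0, 0, 0],
    n_experts=4,
)
```

With uniform routing the term equals `alpha`, and with full collapse it rises to `alpha * n_experts`, so minimising it pushes the router toward an even spread of tokens.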
Empirical scaling results suggest that, at a fixed compute budget per token, adding parameters tends to improve quality. MoE decouples parameters from compute: at the same FLOPs/token, a sparse model has many more parameters to encode knowledge in. Mixtral 8×7B (Mistral AI, 2023) has about 47B total parameters but only ~13B active per token, and substantially outperforms dense 13B models. DeepSeek-V3 takes this further: 671B total, 37B active.
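The gap between total and active parameters is simple arithmetic over the expert layers. The sketch below counts only FFN and router weights for a hypothetical configuration; the dimensions are made up for illustration and do not describe Mixtral, DeepSeek-V3, or any other real model, and attention, embeddings, and biases are ignored.

```python
def ffn_param_counts(d_model, d_ff, n_layers, n_experts, top_k):
    # Each expert: up-projection plus down-projection, biases ignored.
    per_expert = 2 * d_model * d_ff
    # Router: one linear map from d_model to n_experts logits.
    router = d_model * n_experts
    # Every expert must be stored, but only top_k run for a given token.
    total = n_layers * (n_experts * per_expert + router)
    active = n_layers * (top_k * per_expert + router)
    return total, active

# Hypothetical configuration, chosen only to make the ratio visible.
total, active = ffn_param_counts(
    d_model=1024, d_ff=4096, n_layers=2, n_experts=8, top_k=2
)
ratio = total / active
```

For this configuration the stored FFN parameters are roughly four times the parameters any single token touches, which is the decoupling the paragraph above describes.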
MoE is great in training, hard in production. Three issues:
By 2025 the major serving stacks (vLLM, SGLang) had specialized MoE kernels that fuse gating with a grouped GEMM, but throughput per active parameter is still worse than for dense models, even when total throughput is competitive. See continuous batching for how production schedulers cope.
Since about 2024, frontier LLM scaling has followed a recognizable pattern: grow the total parameter count with MoE while keeping active compute per token close to that of a much smaller dense model. GPT-4 is reportedly MoE; Claude is reportedly dense; the open-weight frontier (DeepSeek, Mixtral, Qwen MoE variants) is solidly MoE. The bet is that knowledge density per FLOP is the binding constraint, and that sparse routing is the architecture that addresses it.
For specialized small models (rerankers, embedders, classifiers), MoE generally isn’t the right call. The whole point of specialization is that the model doesn’t need broad knowledge; a dense FFN is fine. MoE is a frontier-LLM technique.
Memory and serving complexity. All
Less than the name suggests. Routing patterns rarely correspond to clean human-interpretable categories (math, code, etc.). Experts tend to specialize at lower levels: token class, syntactic role, position. The "experts" are emergent partitions of the FFN computation, not coherent skill modules.