Also known as: SAE, sparse autoencoder, dictionary learning for transformers
TL;DR
A wide, sparsely-activated autoencoder trained on transformer activations. The learned dictionary recovers monosemantic features — directions that fire for a single human-understandable concept rather than the polysemantic mush of raw neurons.
A sparse autoencoder (SAE) is a wide, sparsely-activated autoencoder trained on the internal activations of a transformer. Given a residual-stream or MLP activation $x \in \mathbb{R}^{d}$, the SAE learns:
$$f = \mathrm{ReLU}(W_{\text{enc}}\, x + b_{\text{enc}}), \qquad \hat{x} = W_{\text{dec}}^{\top} f + b_{\text{dec}}$$
where $W_{\text{enc}}, W_{\text{dec}} \in \mathbb{R}^{m \times d}$, the feature dimension $m$ of $f$ is much wider than $d$, and a sparsity constraint forces only a small number of features to be non-zero per token. The reconstruction loss pushes $\hat{x} \approx x$; the sparsity constraint pushes the network to use as few features as possible to do it.
The payoff is interpretability. Each row of the decoder matrix $W_{\text{dec}}$ corresponds to a feature direction in the original activation space, and, empirically, when training works, those directions are monosemantic: each one fires for a single, human-understandable concept.
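The encode/decode step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training recipe: the function name `sae_forward`, the hand-picked weights, and the toy sizes (a 2-dimensional activation expanded into 4 features) are all assumptions made for the example, and a real SAE would use an array library and learned weights.

```python
def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activation x into features f, then decode to x_hat.

    W_enc and W_dec are lists of m rows, each of length d.
    Row i of W_dec is the direction feature i writes into activation space.
    """
    # Encoder: f = ReLU(W_enc x + b_enc)
    f = [
        max(0.0, sum(w * a for w, a in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    # Decoder: x_hat = b_dec + sum_i f[i] * W_dec[i]
    x_hat = list(b_dec)
    for f_i, row in zip(f, W_dec):
        for j, w in enumerate(row):
            x_hat[j] += f_i * w
    return f, x_hat


# Hypothetical toy dictionary: 4 features covering +/- each of 2 axes.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x = [1.0, -2.0]
f, x_hat = sae_forward(x, W, [0.0] * 4, W, [0.0, 0.0])

print(f)      # [1.0, 0.0, 0.0, 2.0]  (only 2 of 4 features are active)
print(x_hat)  # [1.0, -2.0]           (exact reconstruction in this toy case)
```

In this toy case the wider basis reconstructs the input exactly while leaving half the features at zero; in a trained SAE the sparsity comes from the training objective rather than from hand-chosen weights.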
The superposition hypothesis
Why are raw transformer neurons such a mess? The current best answer is superposition: the model has more features it wants to represent than it has dimensions to represent them in, so it packs features into non-orthogonal directions and accepts the interference. A single neuron ends up firing for a long list of unrelated concepts — “DNA sequences”, “C++ pointer code”, “Russian text”, and “names of bridges” might all activate the same neuron at different strengths.
SAEs are dictionary learning applied to this problem. By projecting activations into a wider sparse space, the network can give each feature its own direction without having to share. The sparsity penalty is what makes the directions disentangle — without it the SAE just learns an over-parameterized identity.
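A toy calculation makes the interference concrete. The sketch below assumes a hypothetical 2-dimensional activation space holding 3 features, placed 120 degrees apart; the numbers are chosen for illustration and do not come from any real model.

```python
import math


def unit(angle_deg):
    """Unit vector in 2D at the given angle."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Three features forced to share two dimensions.
directions = [unit(0), unit(120), unit(240)]

# No pair can be orthogonal: every pairwise dot product is -0.5.
for i in range(3):
    for j in range(i + 1, 3):
        print(i, j, round(dot(directions[i], directions[j]), 3))

# Activate only feature 1, then read out every feature by projection.
x = directions[1]
readout = [dot(x, d) for d in directions]
# Features 0 and 2 see interference (about -0.5) although they are inactive.
# A ReLU clips that negative interference to zero, which only works
# because few features are active at the same time.
recovered = [max(0.0, r) for r in readout]
print([round(r, 3) for r in recovered])
```

With one feature active, the nonlinearity removes the interference entirely. With several features active at once the interference terms add up, which is why sparsity is the assumption that makes superposition workable.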
The two recipes
Sparsity penalties in practice
L1 penalty. Anthropic’s Towards Monosemanticity (2023) used an L1 penalty $\lambda \lVert f \rVert_1$ on the feature activations $f$, with a tunable coefficient $\lambda$. The L1 penalty pushes activations toward zero with a soft, tunable strength. It is easy to optimize but has a shrinkage bias: even active features get pulled toward zero, which biases the reconstruction.
Top-k. OpenAI’s GPT-4 SAE work (2024) uses a hard top-k constraint: keep only the $k$ largest activations per token, zero the rest. No shrinkage bias, one cleaner hyperparameter (an integer $k$ instead of a continuous coefficient $\lambda$), but harder to optimize because gradients only flow through the top-k features.
Variants. BatchTopK (top-k across the whole batch instead of per token), JumpReLU (a learned threshold per feature), Gated SAEs (separate the magnitude and sparsity decisions). These are all attempts to keep the cleanness of top-k while making optimization more forgiving.
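The three sparsity mechanisms differ mainly in how they turn encoder pre-activations into feature activations. The following is a rough sketch in plain Python; the function names, the example pre-activations, and the threshold values are assumptions for illustration, and none of this is taken from the cited papers' code.

```python
def l1_penalty(f, lam):
    """Soft penalty added to the loss: lam * sum of |f_i|."""
    return lam * sum(abs(a) for a in f)


def top_k(pre, k):
    """Keep the k largest pre-activations (after ReLU), zero the rest."""
    order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    keep = set(order[:max(k, 0)])
    return [max(0.0, p) if i in keep else 0.0 for i, p in enumerate(pre)]


def jump_relu(pre, thresholds):
    """Pass a pre-activation through only if it exceeds its own threshold."""
    return [p if p > t else 0.0 for p, t in zip(pre, thresholds)]


# Hypothetical encoder pre-activations for one token.
pre = [0.1, 2.0, -0.3, 0.7, 1.5]

relu = [max(0.0, p) for p in pre]
print(relu)                         # [0.1, 2.0, 0.0, 0.7, 1.5]
print(top_k(pre, 2))                # [0.0, 2.0, 0.0, 0.0, 1.5]
print(jump_relu(pre, [0.5] * 5))    # [0.0, 2.0, 0.0, 0.7, 1.5]
print(l1_penalty(relu, lam=0.01))   # roughly 0.043
```

Note that the L1 version leaves the small 0.1 activation in place and relies on the loss term to shrink it during training, while top-k and JumpReLU remove it outright and leave the surviving activations unscaled.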
Feature splitting
One of the more conceptually interesting findings: as you make the SAE wider, features split into more specific sub-features. A small SAE might have a single feature for “Golden Gate Bridge”. A wider SAE on the same model decomposes that into “Golden Gate Bridge in fog”, “Golden Gate Bridge in photographs”, “Golden Gate Bridge in tourist contexts”, and so on.
This implies there isn’t a single canonical feature dictionary — there’s a hierarchy of features at different granularities, and the SAE width is the knob that selects how fine-grained the decomposition is. Choosing the width is a research decision, not a configuration detail.
Polysemantic neurons aren’t a bug; they’re how the model fits more features than it has dimensions. SAEs unpack that compression.
The reconstruction-interpretability trade-off
There’s a hard trade-off between how well the SAE reconstructs the original activations and how interpretable its features are. Push sparsity hard (small $k$ for top-k, large $\lambda$ for L1) and you get a small number of very clean features, but the reconstruction loses fidelity: the model running through the SAE-reconstructed activations performs noticeably worse than the original. Loosen sparsity and the reconstruction is faithful, but features start to merge and lose their monosemanticity.
The standard metric is “loss recovered”: take the gap in model loss between zeroing the activation entirely (the floor) and leaving the model untouched, and measure what fraction of that gap is closed when the model runs on the SAE reconstruction instead. Modern SAEs hit 90%+ loss recovered with roughly 50-200 active features per token, at SAE widths of 4096-16384 against base activation widths of 768-4096.
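As a worked example, the metric can be computed from three loss measurements. The function below is a sketch of the usual definition; the name `loss_recovered` and the example loss values are hypothetical.

```python
def loss_recovered(loss_original, loss_with_sae, loss_zero_ablated):
    """Fraction of the zero-ablation loss gap closed by the SAE.

    1.0 means splicing in the SAE costs nothing;
    0.0 means it is no better than zeroing the activation.
    """
    gap = loss_zero_ablated - loss_original
    if gap <= 0:
        raise ValueError("zero-ablation should make the loss worse")
    return (loss_zero_ablated - loss_with_sae) / gap


# Hypothetical next-token losses for one model and one layer.
score = loss_recovered(
    loss_original=3.0,      # unmodified model
    loss_with_sae=3.2,      # activation replaced by the SAE reconstruction
    loss_zero_ablated=5.0,  # activation replaced by zeros
)
print(round(score, 3))  # 0.9
```

Because the score is normalized by the zero-ablation gap, it is comparable across layers and models, but it can look flattering at layers where zero-ablation is catastrophic.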
The residual stream is the running sum of every block’s contribution: it’s the common bus that every attention and MLP layer reads from and writes to. Features in the residual stream are the most directly causal of the model’s downstream behavior, and they’re the natural target for circuit-level analysis (where you want to trace how a feature in one layer influences a feature in a later layer).
MLP-output SAEs are useful too — MLPs are where most of the network’s computation happens, as opposed to the residual stream’s bookkeeping. Anthropic’s recent work uses both: residual-stream SAEs to find the features the model uses, MLP-output SAEs to find the features the model computes. They’re complementary views.
Attention-output SAEs are a third axis. Activation type matters less than the principle: any layer where the model represents information in superposition is a candidate.
Why this matters
For ten years, the standard answer to “what does this neuron do?” was “it fires for a lot of things, none of them especially clean”. SAEs change that. With a trained SAE you can take any token, list its top active features, and read off a human-understandable description of what the model is representing at that position. You can clamp specific features to high values and watch the model’s behavior shift in predictable ways (Anthropic’s Golden Gate Claude demo). You can compare features across layers and start to describe how representations evolve through the network.
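Both operations mentioned above, reading off the top active features and clamping one of them, are simple once the SAE is trained. The sketch below is illustrative only: the helper names, the toy dictionary, and the choice to add the SAE's reconstruction error back after clamping are assumptions, not a description of how any particular demo was built.

```python
def decode(f, W_dec, b_dec):
    """x_hat = b_dec + sum_i f[i] * W_dec[i] (rows are feature directions)."""
    x_hat = list(b_dec)
    for f_i, row in zip(f, W_dec):
        for j, w in enumerate(row):
            x_hat[j] += f_i * w
    return x_hat


def top_features(f, n):
    """Return the n most active features as (index, activation) pairs."""
    active = [(i, a) for i, a in enumerate(f) if a > 0.0]
    return sorted(active, key=lambda pair: pair[1], reverse=True)[:n]


def clamp_feature(x, f, W_dec, b_dec, index, value):
    """Set one feature to a fixed value and return the edited activation.

    The part of x the SAE failed to reconstruct is added back, so only
    the clamped feature changes.
    """
    error = [a - b for a, b in zip(x, decode(f, W_dec, b_dec))]
    f_clamped = list(f)
    f_clamped[index] = value
    return [a + e for a, e in zip(decode(f_clamped, W_dec, b_dec), error)]


# Hypothetical toy dictionary and feature activations for one token.
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_dec = [0.0, 0.0]
x = [1.0, -2.0]
f = [1.0, 0.0, 0.0, 2.0]

print(top_features(f, 2))                          # [(3, 2.0), (0, 1.0)]
print(clamp_feature(x, f, W_dec, b_dec, 1, 5.0))   # [1.0, 3.0]
```

Clamping feature 1 to 5.0 moves the activation by exactly 5 times that feature's decoder direction; the edited activation would then be written back into the model's forward pass.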
It’s the most tractable “open the black box” recipe the field has produced. Not solved — feature splitting, dead features, and reconstruction-interpretability trade-offs all remain open — but a foothold where there used to be a wall.
Go further
Why are raw transformer neurons hard to interpret in the first place?
Transformer activations are polysemantic: a single neuron fires for many unrelated concepts because the network packs more features than it has dimensions. This is the superposition hypothesis: with $n$ neurons and $m > n$ features to represent, the model uses non-orthogonal directions and accepts interference. SAEs unpack superposition by projecting into a wider, sparse basis where each direction can be one feature.
How do L1 and top-k sparsity penalties differ?
L1 penalties (the original recipe in Anthropic's Towards Monosemanticity) push activations toward zero with a soft penalty $\lambda \lVert f \rVert_1$ on the feature activations $f$. Top-k SAEs (the recipe from OpenAI's GPT-4 SAE work) hard-zero all but the $k$ largest activations per token. Top-k is easier to tune (one integer instead of a continuous coefficient), avoids the L1 shrinkage bias that pulls active features toward zero, and tends to give cleaner features at the cost of harder optimization. Most modern work uses top-k or its variants (BatchTopK, JumpReLU).
What does feature splitting mean and why does it happen?
When you increase the SAE width (the dictionary size), features split into more specific sub-features. A single 'Golden Gate Bridge' feature in a small SAE becomes separate features for 'Golden Gate Bridge in fog', 'Golden Gate Bridge from photographs', 'Golden Gate Bridge in tourist contexts' in a wider one. This is evidence that there isn't a single 'true' feature dictionary — there's a hierarchy of features at different granularities, and the SAE width determines which level you see.