Mamba State-Space Model

Also known as: SSM, selective state-space model, Mamba, S6, Mamba-2

TL;DR

Mamba is a linear-time sequence model that replaces attention with a selective state-space recurrence. It runs in O(N) instead of attention's O(N²), processes infinite context in constant memory.

Mamba is a sequence-modeling architecture based on selective state-space dynamics rather than . It runs in linear time and constant memory per token at inference, processes arbitrarily long contexts without a growing , and is the most credible attention-alternative to emerge in years. As of 2026, it has not displaced the at frontier scale — but it’s the architecture every long-context-curious team is watching.

MAMBA · SELECTIVE STATE-SPACE MODELLinear time, constant memory.xINPUThSTATEyOUTPUTt = 1t = 2t = 3t = 4"the"x1"cat"x2"sat"x3"on"x4AAABBBBh1h2h3h4CCCCy1y2y3y4h₀ = 0ht = A · ht−1 + B · xtyt = C · htMEMORY FOOTPRINTattentionN × NO(N²)mambaFIXED NO(N)vs

The core idea

A state-space model maintains a hidden state that evolves over time according to a linear recurrence:

This is structurally a state-space model from control theory, and structurally also an RNN — except the matrices are designed so the recurrence can be reformulated as a parallel scan, making training as fast as a transformer over the same sequence.

Earlier SSMs (S4, S5) kept these matrices fixed across time. That made them efficient but expressively limited — the model couldn’t, for example, dynamically choose to “pay attention to” or “ignore” a particular token, because the dynamics were the same regardless of input.

Mamba’s contribution (2023, Gu and Dao) is to make , and a discretization step size depend on the input . This selective mechanism gives the model the ability to gate which inputs propagate into the state — a much-cheaper analog of what attention’s softmax does. The architecture name “S6” (selective S4) and the algorithm name “selective scan” both refer to this.

Why linear time matters

Attention is in sequence length. Mamba is . For short contexts (under 4K), the constants favor attention — modern GPUs are highly optimized for the dense matrix multiplies attention requires, and FlashAttention narrows the gap further. But beyond ~16K tokens, the asymptotic gap dominates, and beyond ~64K tokens Mamba is dramatically faster per FLOP.

The bigger asymptotic difference is at inference. A transformer maintains a KV cache that grows linearly with context length and dominates GPU memory at long contexts. Mamba’s hidden state is fixed size — a few hundred floats per layer — regardless of how much context has been processed. At 1M tokens of context, the difference is multiple gigabytes per sequence vs. multiple megabytes.

The selective scan computes the recurrence where depend on the input. Naively this is sequential, but it’s an associative scan: the partial states can be combined left-to-right OR in tree-fashion in parallel. Mamba uses a custom CUDA kernel (analogous to FlashAttention’s role for transformers) that fuses the scan, the discretization, and the gating, computed entirely in SRAM without writing the full hidden-state trajectory to HBM. The kernel is the engineering work that made selective SSMs practical.

Where it underperforms transformers

The honest assessment as of 2026: Mamba matches or beats transformers on language modeling perplexity at a given parameter count, but lags on tasks that require associative recall — fetching a specific earlier token by content. The toy benchmark is the multi-query associative recall (MQAR) task: store many key-value pairs in the prefix, then query them. Attention solves this trivially because it can directly look up any past token. Mamba’s fixed-size state can only store so many key-value pairs before losing precision.

In practical terms: long-document question-answering, retrieval over a long context, and exact citation tasks all favor transformers. Pure Mamba has not reached frontier-LLM quality on these.

The hybrid play

The architectures that have shipped at meaningful scale are hybrids. Jamba (AI21, 2024) interleaves Mamba blocks with sparse attention layers and . Zamba and similar models do the same. The hybrid keeps Mamba’s cheap long-range mixing while restoring transformers’ precise lookup, with a few attention layers’ worth of KV cache instead of every layer.

Why it might still matter long-term

The asymptotic argument doesn’t go away. As context windows push past 1M tokens and per-token decode latency becomes the binding constraint, the linear-time / constant-memory profile gets more attractive. The trajectory of selective SSMs since 2023 has been steady improvement; the architecture is not a dead end. If a hybrid Mamba-transformer at frontier scale closes the recall gap while keeping the long-context efficiency, the production calculus changes.

For now, Mamba is the best-credentialed alternative architecture and the field’s hedge against “what if attention isn’t actually optimal.” Worth understanding even if you’re not deploying it.

Go further

What was wrong with earlier state-space models like S4?

S4 and its predecessors used time-invariant dynamics — the transition matrix was the same at every step. That's efficient (you can run it in parallel via FFT) but limits expressiveness; the model can't selectively forget or focus on specific tokens. Mamba's contribution is making the dynamics input-dependent (selective) while preserving a parallel-scan algorithm that recovers most of the efficiency.

Why is constant decode memory a big deal?

Transformers' KV cache grows linearly with context — at 128K tokens, that's gigabytes per sequence. Mamba's hidden state is fixed size regardless of context length: a few megabytes at most. For very long contexts and high batch sizes, this is the difference between fitting on a single GPU and not fitting at all. The catch is whether the fixed-size state can store the same information that an unbounded KV cache stores.

Why have hybrid Mamba-attention architectures emerged?

Pure Mamba is competitive on many tasks but lags transformers on retrieval-heavy benchmarks where you really do need to look up specific past tokens. Hybrids like Jamba and Zamba interleave Mamba blocks (cheap for long-range mixing) with sparse attention layers (precise lookup), getting most of Mamba's efficiency with the transformer's lookup ability.

ZeroEntropy
The best AI teams build with ZeroEntropy models
Follow us on
GitHubTwitterSlackLinkedInDiscord