Speculative Decoding

Also known as: speculative sampling, draft-and-verify, Medusa decoding

TL;DR

Use a small 'draft' model to predict the next several tokens, then have the big 'target' model verify them in a single forward pass. The standard latency-reduction trick for LLM inference — typically 2-4× faster generation at the same output quality.

Speculative decoding (Leviathan et al., 2023; concurrently Chen et al., 2023) is the standard latency-reduction trick for LLM inference. The intuition: most tokens an LLM generates are easy, such as common syntactic glue or predictable continuations of established context. A much smaller, much faster model can predict them correctly most of the time. So have the small “draft” model propose the next several tokens, and have the big “target” model verify them all in a single forward pass, one that would otherwise have produced just one token.

[Figure: Speculative decoding: guess fast, verify in parallel. A small, fast, autoregressive draft model (~3 ms/token) proposes seven tokens (“the cat sat on the mat .”). A large, slow target model (~30 ms, one forward pass) verifies all K positions at once, keeps the accepted prefix (“the cat sat on”), resamples the first rejected position (“a”), and discards the rest: 5 tokens emitted from 1 big forward pass, ~5× speedup vs naive decode. The output distribution is unchanged; only latency moves.]

The mechanic

  1. Draft. A small model (the draft or speculator) produces $K$ candidate tokens autoregressively. This is cheap: it takes $K$ forward passes, but each is fast because the model is small.
  2. Verify. The big model runs a single forward pass over the entire $K$-token candidate. Because transformers process all positions in parallel during prefill-style passes, this costs roughly the same as one normal decode step.
  3. Accept. For each draft token, accept it if a sampled token from the target model would have matched. Stop at the first rejection. Resample the rejected position from the target’s true distribution.

If the draft is right on the first $n$ of the $K$ tokens, you produce $n+1$ output tokens (the $n$ accepted draft tokens plus one resampled token at the rejection point) at the cost of one big-model forward pass. Best case: $K+1$ tokens per big forward pass, when every draft token is accepted and the target's final position supplies one extra token. Worst case: 1 token, when the draft is wrong on the very first prediction.
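The draft, verify, accept cycle above can be sketched with toy models. This is a minimal illustration under stated assumptions, not a real implementation: `make_toy_model`, `speculative_step`, and `generate` are hypothetical names, the "models" simply recite fixed sentences, and verification is greedy. A real transformer would score all positions in one forward pass, whereas this toy calls the target once per position.

```python
from typing import Callable, List, Tuple

# Toy stand-ins for real models: each maps a token prefix to its greedy next token.
NextToken = Callable[[List[str]], str]

TARGET_TEXT = "the cat sat on a mat .".split()
DRAFT_TEXT = "the cat sat on the mat .".split()  # disagrees with the target at position 4


def make_toy_model(text: List[str]) -> NextToken:
    """Hypothetical model that recites `text`, then emits <eos> forever."""
    def next_token(prefix: List[str]) -> str:
        return text[len(prefix)] if len(prefix) < len(text) else "<eos>"
    return next_token


def speculative_step(prefix: List[str], draft: NextToken,
                     target: NextToken, k: int) -> List[str]:
    """One draft/verify/accept cycle with greedy verification."""
    # 1. Draft: k cheap autoregressive steps.
    proposal: List[str] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))
    # 2. Verify: the target's greedy choice at each of the k+1 positions.
    #    (One forward pass in a real transformer; a loop in this toy.)
    target_choice = [target(prefix + proposal[:i]) for i in range(k + 1)]
    # 3. Accept the longest matching prefix, then emit the target's own token
    #    at the first mismatch (or the extra token if everything matched).
    n = 0
    while n < k and proposal[n] == target_choice[n]:
        n += 1
    return proposal[:n] + [target_choice[n]]


def generate(draft: NextToken, target: NextToken, k: int,
             max_tokens: int = 32) -> Tuple[List[str], int]:
    """Decode until <eos>; also count how many target passes were needed."""
    out: List[str] = []
    target_passes = 0
    while len(out) < max_tokens and "<eos>" not in out:
        out += speculative_step(out, draft, target, k)
        target_passes += 1
    if "<eos>" in out:
        out = out[: out.index("<eos>") + 1]  # drop anything after the first <eos>
    return out[:max_tokens], target_passes


draft = make_toy_model(DRAFT_TEXT)
target = make_toy_model(TARGET_TEXT)
print(speculative_step([], draft, target, k=7))  # ['the', 'cat', 'sat', 'on', 'a']
print(generate(draft, target, k=7))
```

In this toy the first cycle accepts four draft tokens and replaces the fifth, so five tokens come out of one target pass, and the full output matches what the target alone would have produced.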

Why output quality is preserved

The verification is statistically exact. Either:

  • Greedy verification. Accept draft token iff it matches the argmax of the target’s distribution. Trivially exact.
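  • Probabilistic verification. Accept draft token $x$ with probability $\min(1, p(x)/q(x))$, where $p$ is the target's distribution and $q$ is the draft's. On rejection, sample from the renormalized residual distribution, proportional to $\max(0, p - q)$. The math (Leviathan et al., 2023) shows the resulting samples are exactly distributed as if you’d sampled directly from the target.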

You’re not approximating the target; you’re running it, just more efficiently when the easy case applies. Speculation never changes the output distribution.
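The exactness claim can be checked numerically for a single position. The sketch below is illustrative only: `p` stands for the target's distribution and `q` for the draft's, both given as plain dicts over the same small vocabulary, and the function names are hypothetical. `speculative_output_distribution` computes, in closed form, the distribution of the token that the accept-or-resample rule emits; `verify_token` shows the same rule as a sampling step.

```python
import random
from typing import Dict


def residual(p: Dict[str, float], q: Dict[str, float]) -> Dict[str, float]:
    """Unnormalized residual max(0, p - q) used when a draft token is rejected."""
    return {x: max(0.0, p[x] - q[x]) for x in p}


def speculative_output_distribution(p: Dict[str, float],
                                    q: Dict[str, float]) -> Dict[str, float]:
    """Exact distribution of the emitted token under accept-or-resample."""
    # P(draft proposes x and it is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = {x: min(p[x], q[x]) for x in p}
    reject_mass = 1.0 - sum(accepted.values())
    res = residual(p, q)
    z = sum(res.values())
    return {x: accepted[x] + (reject_mass * res[x] / z if z > 0 else 0.0) for x in p}


def verify_token(x: str, p: Dict[str, float], q: Dict[str, float],
                 rng: random.Random) -> str:
    """Sampling form of the rule; `x` is assumed to have been drawn from q."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # accept the draft token
    res = residual(p, q)
    if sum(res.values()) == 0:
        return x  # p == q everywhere, so there is nothing to correct
    tokens = list(res)
    return rng.choices(tokens, weights=[res[t] for t in tokens])[0]


p = {"a": 0.6, "b": 0.3, "c": 0.1}  # target
q = {"a": 0.2, "b": 0.5, "c": 0.3}  # draft, deliberately mismatched
print(speculative_output_distribution(p, q))  # matches p up to float rounding
```

Even with a badly mismatched draft, the emitted distribution equals the target's. A worse draft only lowers the acceptance probability, which costs speed rather than quality.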

Speculative decoding buys speed by exploiting the asymmetry between proposing and verifying tokens — like a code reviewer skimming a draft from a junior, only stopping to rewrite the parts that are wrong.

Acceptance rate determines the win

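Speedup $\approx (\bar{n} + 1) / (1 + cK)$, where $\bar{n}$ is the expected number of accepted tokens per cycle and $c$ is the draft cost ratio (the cost of one draft step relative to one target forward pass). Higher acceptance → bigger win. Empirically: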

Acceptance rates by draft method
  • Generic small draft (Llama-1B drafting for Llama-70B) — 40-60% acceptance, 2-2.5× speedup.
  • Distilled draft (small model trained to match target distribution) — 70-80% acceptance, 3-4× speedup.
  • Medusa heads (parallel prediction heads on target itself) — 60-70% acceptance, no draft VRAM cost.
  • EAGLE / EAGLE-2 (shallow draft alongside target) — 70-85% acceptance, minimal extra parameters.
  • N-gram speculative (lookup-based, no model) — 30-50% acceptance on long-context tasks where text repeats.
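To get a feel for how acceptance rate turns into speedup, here is a rough estimator. It rests on simplifying assumptions that are not from the figures above: each draft token is accepted independently with probability `alpha`, a verification pass costs the same as one normal decode step, and `c` is the cost of one draft step relative to one target pass. The function names are hypothetical.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass: accepted draft tokens plus one.

    Assumes i.i.d. per-token acceptance with probability alpha, which gives
    the geometric sum 1 + alpha + alpha**2 + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Tokens per cycle divided by cycle cost, in units of one target pass."""
    cycle_cost = 1.0 + c * k  # one target pass plus k draft steps
    return expected_tokens_per_cycle(alpha, k) / cycle_cost


# Example: 80% acceptance, 4 draft tokens, draft step at 10% of a target pass.
print(round(expected_tokens_per_cycle(0.8, 4), 4))  # 3.3616
print(round(estimated_speedup(0.8, 4, 0.1), 2))     # 2.4
```

Under this model a larger `k` only helps while `alpha ** k` stays meaningfully above zero. Past that point the extra draft steps add cost without adding accepted tokens.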

Why it matters in production

Latency, mostly. Speculative decoding at batch 1 takes a 70B model’s per-token latency from ~30ms to ~10ms — the difference between feeling slow and feeling responsive in an interactive chat. For batched serving, the win is smaller (continuous batching already pushes throughput high) but speculation still reduces tail latency for low-batch traffic regimes.

Modern serving stacks (vLLM, TensorRT-LLM, SGLang) ship speculative decoding as a built-in option. The trend is toward “self-speculative” methods (Medusa, EAGLE) that avoid the VRAM overhead of a separate draft model. These are increasingly the default, though they bring their own complexity: the extra heads have to be trained for each target model.

Reranker analog

Speculation isn’t directly applicable to non-autoregressive models like cross-encoder rerankers — there’s no token-by-token decode loop. But the spirit — use a cheap proxy to filter, expensive verifier on survivors — is the structure of any retrieval pipeline (cheap → expensive on top). Same engineering shape, different mechanism.

Decoding is memory-bandwidth-bound, not compute-bound, on modern GPUs serving large models. At batch size 1, the bottleneck is pulling the model’s weights from HBM into the compute units: a 70B model at FP16 is 140GB of weights, and every decode step has to read all of them to produce a single token.

Verification, in contrast, runs all candidate tokens through one forward pass. The weights still need to be pulled from HBM, but you get $K$ tokens of work out of that single load; the marginal cost of the additional positions is in compute, which is plentiful. For typical settings (a small $K$, FlashAttention-3 enabled), the verification step costs maybe 1.05-1.15× the cost of a single decode step, but yields up to $K+1$ output tokens. That’s the entire engineering win.

This also explains why speculation breaks down at high batch sizes. With batch 32, the target’s forward pass is already pulling 32 sequences’ worth of work per weight load, so it’s no longer memory-bandwidth-bound, and adding extra positions per sequence costs roughly $K$× more time. The asymmetry that makes speculation work has been eaten by batching.
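The bandwidth argument can be made concrete with a back-of-the-envelope roofline model. Everything numeric here is an assumption chosen for illustration: the aggregate memory bandwidth (4 TB/s) and compute rate (1 PFLOP/s) are hypothetical, and the model ignores KV-cache reads, attention cost, and kernel overheads. As a result it shows a verification cost ratio of exactly 1.0 at batch 1 rather than a realistic 1.05-1.15×, and it places the memory-to-compute crossover at a larger batch size than a real system would see.

```python
def forward_pass_ms(batch: int, positions: int,
                    params: float = 70e9,        # 70B parameters
                    bytes_per_param: int = 2,    # FP16
                    mem_bw: float = 4.0e12,      # bytes/s, hypothetical aggregate HBM bandwidth
                    flops: float = 1.0e15) -> float:  # FLOP/s, hypothetical
    """Toy roofline: a pass takes as long as the slower of weight reads and math."""
    weight_read_s = params * bytes_per_param / mem_bw
    # Roughly 2 FLOPs per parameter per token processed.
    compute_s = 2.0 * params * batch * positions / flops
    return 1e3 * max(weight_read_s, compute_s)


def verify_cost_ratio(batch: int, k: int) -> float:
    """Cost of verifying k draft tokens (k+1 positions) relative to one decode step."""
    return forward_pass_ms(batch, k + 1) / forward_pass_ms(batch, 1)


print(forward_pass_ms(1, 1))      # about 35 ms, set entirely by weight reads
print(verify_cost_ratio(1, 4))    # 1.0: extra positions are free while memory-bound
print(verify_cost_ratio(512, 4))  # about 5: compute-bound, cost scales with positions
```

The point of the sketch is the shape of the curve rather than the numbers: while weight reads dominate, extra positions ride along for free, and once compute dominates, each extra position is paid for in full.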

A separate draft model has obvious overhead: extra VRAM (a 1B draft costs 2GB at FP16), an extra training pipeline, and ongoing maintenance to keep the draft aligned with the target. Self-speculative methods eliminate the separate model entirely.

Medusa (Cai et al., 2024) adds extra “Medusa heads” to the target model that predict tokens 2, 3, 4 ahead in parallel from the same residual stream. Training is a quick fine-tune (a few hundred GPU-hours) that doesn’t touch the original weights. Inference: one forward pass produces both the verified token and three speculative continuations.

EAGLE (Li et al., 2024) uses a single tiny autoregressive head, typically one transformer layer, that runs alongside the target. The head predicts the target's next hidden-state feature vector (a regression target) rather than logits; the target's own output head turns that feature into candidate tokens, which the target then verifies in the next step. Higher acceptance than Medusa with smaller parameter overhead.

By 2025 these are increasingly the production default in vLLM, TensorRT-LLM, and SGLang. The maintenance cost of separate draft models has pushed most serious deployments away from the original speculative-decoding architecture, which remains useful as a research baseline and for serving setups where the draft model is already on hand.

Go further

Why is the output quality the same as without speculation?

Acceptance is exact: the verifier samples each token from the target model's true distribution, accepting the draft only when it would have produced the same token (or with calibrated probability). Output is statistically indistinguishable from greedy/sampled decoding from the target alone. Speculation buys speed; it doesn't change distribution.

What's the typical speedup?

2-4× for typical decode workloads with a 7B draft and a 70B target. The win depends on draft acceptance rate, which depends on how aligned the draft and target are. A draft trained via distillation from the target hits 70-80% acceptance; a generic small model hits 40-60%.

What's Medusa / EAGLE / self-speculative decoding?

Variants that avoid the separate-draft-model overhead. Medusa adds extra heads to the target model that predict tokens 2, 3, 4 ahead in parallel; EAGLE uses a single shallow draft that runs alongside the target. Both avoid the VRAM overhead of a separate draft model. Medusa pays a modest acceptance-rate cost relative to a distilled draft, while EAGLE's acceptance is comparable or higher. Increasingly the default in 2025+ serving stacks.
