Decoder-Only Model

Also known as: causal language model, autoregressive transformer, GPT-style model

TL;DR

A decoder-only model is a transformer that generates text autoregressively, one token at a time, with causal self-attention so each position only sees prior tokens.

A decoder-only model is the standard architecture behind every modern frontier LLM: GPT, Claude, Llama, Qwen, Gemini, Mistral. It’s a transformer with one structural choice: every attention layer is causal. Each token can only attend to itself and the tokens that came before it. The model is trained, and operates at inference, as a left-to-right next-token predictor.

[Figure: Decoder-only model, causal and autoregressive. One token at a time, left to right. Four stacked blocks (attention + FFN) over positions 0 to 6; a lower-triangular causal attention mask (α[i,j] for j ≤ i, otherwise −∞) over “the cat sat on the mat.”; a next-token distribution p(x | prefix) from a softmax over the vocabulary (mat 0.66, rug 0.16, floor 0.09, roof 0.05, sofa 0.04); then append and repeat. At inference, a KV cache stores K and V for every prior position, so each step processes one new position instead of all N.]

Every task becomes “continue this prompt.” This unification is the architectural insight that displaced encoder-decoder models for general-purpose LLMs.

The shape of the architecture

A decoder-only model is conceptually the simplest of the three transformer variants: a single stack of transformer blocks, each with causal self-attention followed by a feed-forward layer, residual connections, and layer normalization. Input goes in as token embeddings plus positional information (modern models typically use rotary position embeddings, RoPE); the output of the final block is projected to logits over the vocabulary; sampling from those logits produces the next token.

There is no separate encoder, no cross-attention, no special handling of input vs output. The “prompt” is the prefix the model conditions on; “completion” is the suffix it generates. The architectural elegance is part of why decoder-only displaced the alternatives.
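The prompt/completion split can be sketched as a plain loop. This is a minimal illustration, not a real model: `toy_logits` is a hypothetical stand-in for the forward pass, and the tiny vocabulary and greedy decoding are assumptions made to keep the output easy to follow.

```python
import numpy as np

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_logits(prefix_ids, vocab_size):
    """Hypothetical stand-in for a decoder-only forward pass.

    Returns next-token logits given the prefix. It simply favours
    (last_id + 1) % vocab_size so the behaviour is predictable.
    """
    logits = np.zeros(vocab_size)
    logits[(prefix_ids[-1] + 1) % vocab_size] = 5.0
    return logits

def generate(prompt_ids, max_new_tokens, vocab_size):
    """Autoregressive loop: predict, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_logits(ids, vocab_size)  # conditions on the whole prefix
        next_id = int(np.argmax(logits))      # greedy decoding (an assumption)
        ids.append(next_id)                   # the output becomes part of the prefix
    return ids

prompt = [0, 1]                               # "the cat"
out = generate(prompt, 3, len(VOCAB))
completion = out[len(prompt):]                # the suffix the model produced
print(" ".join(VOCAB[i] for i in out))
```

Nothing in the loop distinguishes input from output: the completion is just whatever was appended after the prompt ran out.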

Causal masking, in one place

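Inside every attention layer, the score matrix is masked to lower-triangular before the softmax: entries with j > i are set to −∞, so they receive zero attention weight.

$$\alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}} + M_{ij}\right), \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$

Position i’s output depends only on positions j ≤ i. This single constraint enables the whole architecture: training runs in parallel across all positions of a sequence (each next-token prediction uses only the prefix), and inference runs sequentially with a KV cache, so each generation step processes only the newest token instead of recomputing the whole prefix.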
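A small NumPy sketch can make both halves of that claim concrete. It assumes a single attention head, random query/key/value matrices in place of learned projections, and hypothetical function names. It compares full-sequence masked attention against a step-by-step version that keeps a KV cache, then checks that perturbing the last token leaves earlier outputs untouched.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q, k, v):
    """Single-head attention over the full sequence with a causal mask."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i
    scores = np.where(future, -np.inf, scores)          # block later positions
    return softmax(scores) @ v

def cached_attention(q, k, v):
    """Same computation, one position at a time, reusing cached K and V."""
    n, d = q.shape
    k_cache, v_cache, outputs = [], [], []
    for i in range(n):
        k_cache.append(k[i])                 # only the new position is added
        v_cache.append(v[i])
        K, V = np.stack(k_cache), np.stack(v_cache)
        weights = softmax(q[i] @ K.T / np.sqrt(d))
        outputs.append(weights @ V)
    return np.stack(outputs)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
full = causal_attention(q, k, v)
step = cached_attention(q, k, v)

# Perturb only the last token's key and value.
k2, v2 = k.copy(), v.copy()
k2[-1] += 1.0
v2[-1] += 1.0
perturbed = causal_attention(q, k2, v2)

print(np.allclose(full, step))                 # parallel and cached paths agree
print(np.allclose(full[:-1], perturbed[:-1]))  # earlier positions are unaffected
```

The cached loop never needs a mask: at step i the cache simply does not contain any later positions yet.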

Why it took over

For a few years, encoder-decoder models (T5, BART) were the default for “input → output” tasks like translation and summarization. Decoder-only won out for several converging reasons:

Why decoder-only displaced encoder-decoder
  • Unification. Every task becomes “continue this prompt”. No dedicated input/output split, no architectural retraining for new tasks.
  • Parameter efficiency at scale. A decoder-only model gets to apply all its parameters to both reading and generating. An encoder-decoder splits capacity.
  • In-context learning. Few-shot prompting works better in decoder-only models because the in-context examples are part of the same generative stream.
  • Simpler infrastructure. One stack to train, one stack to serve, one set of hyperparameters to tune.

The empirical result was that at frontier scale, decoder-only models matched or beat encoder-decoder on translation and summarization while also being good at everything else.

What it’s good and bad at

Good at: generation, reasoning, instruction following, in-context learning, code synthesis. Bad at: producing high-quality fixed-size representations of input. For that, encoder models still dominate, which is why production retrieval stacks remain a mix of encoder embedders and decoder LLMs, each playing to its strengths.

The recent twist is that strong decoder-only base models can be fine-tuned into competitive rerankers and embedders by repurposing the final-token representation or enabling bidirectional attention in a final stage. The encoder/decoder distinction is becoming more about what the model is used for than what it’s structurally built as.
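One way to read “repurposing the final-token representation” is last-token pooling: under a causal mask, the last real token is the only position that has seen the whole input. The sketch below assumes right-padded batches, random hidden states standing in for a model’s final-layer output, and a hypothetical function name; real embedders differ in details such as which token is pooled and how the model is fine-tuned.

```python
import numpy as np

def last_token_embedding(hidden_states, attention_mask):
    """Pool each sequence to the hidden state of its last non-padding token.

    hidden_states: (batch, seq, dim) final-layer outputs.
    attention_mask: (batch, seq) of 1 for real tokens, 0 for right padding.
    """
    last_idx = attention_mask.sum(axis=1) - 1            # index of last real token
    batch_idx = np.arange(hidden_states.shape[0])
    pooled = hidden_states[batch_idx, last_idx]          # (batch, dim)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / norms                                # unit length for cosine similarity

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 6))       # stand-in for model output
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])           # second sequence has two real tokens
emb = last_token_embedding(hidden, mask)
print(emb.shape)
```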

At first glance, causal masking looks like it should serialize training — token i depends on tokens 1 through i-1, so you might expect to compute them in order. The trick is that the next-token prediction loss at position i only depends on the hidden state at position i, which only requires the prefix to be in scope.

In a single forward pass, the model computes hidden states for all positions 1..N simultaneously; the causal mask just zeroes out the upper-triangular attention weights. Each position’s hidden state is a function of its prefix, and during training the true prefix is already present in the sequence, so no position has to wait for a token to be generated at another position. A sequence of length 8K therefore trains as 8K parallel next-token predictions in one step.
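The loss for all positions can be computed in one shot by shifting the targets one step to the left. The sketch below assumes 0-indexed positions, random logits in place of a real forward pass, and a hypothetical helper name.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Cross-entropy of predicting token t+1 from position t, for all t at once.

    logits: (N, vocab) array, one row per position, from one forward pass.
    token_ids: (N,) integer array holding the input sequence.
    Returns (mean loss, per-position losses).
    """
    preds = logits[:-1]                 # positions 0..N-2 each predict a next token
    targets = token_ids[1:]             # the tokens at positions 1..N-1
    preds = preds - preds.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))
    per_position = -log_probs[np.arange(len(targets)), targets]
    return per_position.mean(), per_position

rng = np.random.default_rng(0)
tokens = np.array([3, 1, 4, 1, 5])
logits = rng.normal(size=(len(tokens), 8))   # random stand-in for model output
loss, per_position = next_token_loss(logits, tokens)
print(per_position.shape)                    # one loss term per predicted token
```

A sequence of N tokens yields N-1 training signals here because the last position has no following token to predict.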

This is what makes pretraining on trillions of tokens feasible. An encoder-decoder training step is also parallel across positions under teacher forcing, but it runs two stacks with a dependency between them: the decoder’s cross-attention needs the encoder’s final output. A pure decoder-only step is one uniform stack, parallel across positions, which maps cleanly onto GPU tensor cores.

A base model has been trained only on next-token prediction over a massive corpus. It will fluently complete arbitrary text but doesn’t follow instructions in any helpful sense — give it “Translate this to French: hello”, and it might continue with “and then I asked the next question…” rather than producing a translation.

Instruction tuning is a follow-up training stage. The model is fine-tuned on (instruction, response) pairs, often with explicit chat templates that mark turn boundaries (<|user|>, <|assistant|>). After this stage, the same architecture treats the user’s instruction as a prompt that should be followed, not continued.
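To the model, a chat is still one flat token stream. A minimal sketch of how turns might be flattened into a prompt, using the <|user|> and <|assistant|> markers mentioned above. The exact layout (newlines, no end-of-turn token) and the `render_chat` helper are hypothetical, not any specific model’s template.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten chat turns into a single prompt string.

    The role markers follow the article's example; the rest of the
    layout is an assumption made for illustration.
    """
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}\n")
    if add_generation_prompt:
        # An open assistant turn: the model "continues" by writing the reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)

prompt = render_chat(
    [{"role": "user", "content": "Translate this to French: hello"}]
)
print(prompt)
```

The trailing assistant marker is what turns “follow this instruction” back into “continue this prompt”.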

Modern releases (Llama-3-Instruct, Qwen-Chat, Claude) compose three stages: base pretraining, supervised fine-tuning on instruction data, then preference optimization (RLHF or DPO) to align tone and behavior. The architectural backbone, a causal decoder, is identical at every stage; only the training data and objective change.

Go further

Why did decoder-only beat encoder-decoder for general LLMs?

Two reasons. First, decoder-only is simpler — one stack of blocks instead of two — which scales more cleanly. Second, treating any task (translation, summarization, Q&A) as 'continue this prompt' works surprisingly well at scale, removing the need for a dedicated encoder. Empirically, decoder-only at LLM scale matches or beats encoder-decoder on most generative tasks.

What makes the attention 'causal'?

A lower-triangular mask that zeroes out attention weights from each position to all later positions. This means token i's representation is computed using only tokens j ≤ i, which is what lets the same model train on all positions in parallel and then generate sequentially at inference time.

If decoder-only is so dominant, why use encoders at all?

Representation tasks. An [embedding model](/concepts/embedding/) doesn't generate; it produces a single vector per input. An [encoder](/concepts/encoder-model/) with bidirectional attention does that better than a causal decoder, because both left and right context inform the representation. Production retrieval stacks remain encoder-heavy for this reason.
