Decoder-Only Model

Also known as: causal language model, autoregressive transformer, GPT-style model

TL;DR

A decoder-only model is a transformer that generates text autoregressively, one token at a time, with causal self-attention so each position only sees prior tokens.

A decoder-only model is the standard architecture behind essentially every modern frontier LLM: GPT, Claude, Llama, Qwen, Gemini, Mistral. It’s a transformer with one structural choice: every attention layer is causal. Each token can only attend to itself and the tokens that came before it. The model is trained, and operates at inference, as a left-to-right next-token predictor.

Every task becomes “continue this prompt.” This unification is the architectural insight that displaced encoder-decoder models for general-purpose LLMs.

The shape of the architecture

A decoder-only model is conceptually the simplest of the three transformer variants: a single stack of transformer blocks, each with causal self-attention followed by a feed-forward layer, residual connections, and layer normalization. Input goes in as token embeddings plus positional encoding (modern models use RoPE); the output of the final block is projected to logits over the vocabulary; sampling produces the next token.
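As a rough illustration of the stack described above, the following NumPy sketch wires together embeddings, causal self-attention blocks, and the projection to logits. It is a toy under several assumptions that are not from the text: single-head attention, a ReLU feed-forward layer, pre-norm placement, learned absolute position embeddings standing in for RoPE, tied input/output embeddings, and random untrained weights. All function and parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features (learned scale and shift omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_self_attention(x, p):
    # Single head; position i may attend only to positions j <= i.
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    n = x.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ p["w_o"]

def block(x, p):
    # Attention and feed-forward sublayers, each inside a residual connection.
    x = x + causal_self_attention(layer_norm(x), p)
    hidden = np.maximum(0.0, layer_norm(x) @ p["w_in"])  # ReLU (assumed)
    return x + hidden @ p["w_out"]

def init_params(vocab_size, d_model, n_layers, max_len, seed=0):
    rng = np.random.default_rng(seed)

    def mat(*shape):
        return rng.normal(0.0, 0.02, size=shape)

    blocks = [
        {
            "w_q": mat(d_model, d_model), "w_k": mat(d_model, d_model),
            "w_v": mat(d_model, d_model), "w_o": mat(d_model, d_model),
            "w_in": mat(d_model, 4 * d_model), "w_out": mat(4 * d_model, d_model),
        }
        for _ in range(n_layers)
    ]
    return {"tok_emb": mat(vocab_size, d_model),
            "pos_emb": mat(max_len, d_model),
            "blocks": blocks}

def forward(token_ids, params):
    # Returns logits of shape (sequence_length, vocab_size).
    ids = np.asarray(token_ids)
    x = params["tok_emb"][ids] + params["pos_emb"][: len(ids)]
    for p in params["blocks"]:
        x = block(x, p)
    return layer_norm(x) @ params["tok_emb"].T  # tied output projection (assumed)

params = init_params(vocab_size=50, d_model=16, n_layers=2, max_len=32)
logits = forward([3, 14, 15, 9], params)
next_token = int(np.argmax(logits[-1]))  # greedy choice instead of sampling
```

Only the last row of `logits` is needed to pick the next token; the earlier rows are the predictions the model would be trained on.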

There is no separate encoder, no cross-attention, no special handling of input vs output. The “prompt” is the prefix the model conditions on; “completion” is the suffix it generates. The architectural elegance is part of why decoder-only displaced the alternatives.

Causal masking, in one place

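Inside every attention layer, the scores above the diagonal (each query’s view of later positions) are set to $-\infty$ before the softmax, so only the lower triangle receives any weight:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}
$$

Position $i$’s output depends only on positions $j \le i$. This single constraint enables the whole architecture: training runs in parallel across all positions of a sequence (each next-token prediction uses only the prefix), and inference runs sequentially with a KV cache, which stores earlier keys and values so that each step processes only the newest token.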
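To make the two modes concrete, here is a small NumPy sketch, assuming single-head attention on random inputs. It computes masked attention over a whole sequence at once, then recomputes the same outputs one token at a time with a key/value cache. The helper names (`causal_attention`, `decode_step`) and the dictionary-based cache are illustrative choices, not any library's API.

```python
import numpy as np

def causal_attention(q, k, v):
    # Full-sequence attention with scores for later positions (j > i) set to -inf.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decode_step(q_new, k_new, v_new, cache):
    # One generation step: store this token's key/value, then attend over the cache.
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    keys, values = np.stack(cache["k"]), np.stack(cache["v"])
    scores = keys @ q_new / np.sqrt(q_new.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))

# Parallel (training-style) computation over all five positions.
full = causal_attention(q, k, v)

# Sequential (inference-style) computation, one position per step.
cache = {"k": [], "v": []}
stepwise = np.stack([decode_step(q[i], k[i], v[i], cache) for i in range(5)])

# Editing the last token's value must leave every earlier output untouched.
v_edit = v.copy()
v_edit[4] += 10.0
edited = causal_attention(q, k, v_edit)

print(np.allclose(full, stepwise), np.allclose(full[:4], edited[:4]))
```

The cached path never recomputes keys or values for earlier tokens, which is why each generation step costs one token's worth of projection work plus attention over the stored prefix.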

Why it took over

For a few years, encoder-decoder models (T5, BART) were the default for “input → output” tasks like translation and summarization. Decoder-only won out for several converging reasons:

Why decoder-only displaced encoder-decoder

Unification. Every task becomes “continue this prompt”. No dedicated input/output split, no architecture changes for new tasks.
Parameter efficiency at scale. A decoder-only model gets to apply all its parameters to both reading and generating. An encoder-decoder splits its capacity between two stacks.
In-context learning. Few-shot prompting works better in decoder-only models because the in-context examples are part of the same generative stream.
Simpler infrastructure. One stack to train, one stack to serve, one set of hyperparameters to tune.
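The unification and in-context learning points above can be sketched as plain string formatting: each task is reduced to a text prefix, and few-shot examples simply sit earlier in the same prefix. The templates and function names below are hypothetical, chosen only for illustration; real prompts vary by model.

```python
def as_prompt(task, text):
    # Hypothetical templates: each task is just a prefix for the model to continue.
    templates = {
        "translate": "Translate to French:\n{text}\nFrench:",
        "summarize": "Article:\n{text}\nSummary:",
        "qa": "Question: {text}\nAnswer:",
    }
    return templates[task].format(text=text)

def few_shot_prompt(examples, query):
    # In-context examples share one token stream with the query.
    shots = "".join(f"Input: {x}\nOutput: {y}\n\n" for x, y in examples)
    return f"{shots}Input: {query}\nOutput:"

print(as_prompt("qa", "What is a causal mask?"))
print(few_shot_prompt([("cat", "chat"), ("dog", "chien")], "bird"))
```

In both cases the model's job is the same: predict the tokens that follow the final colon.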

The empirical result was that at frontier scale, decoder-only models matched or beat encoder-decoder on translation and summarization while also being good at everything else.

What it’s good and bad at

Good at: generation, autoregressive reasoning, instruction following, in-context learning, code synthesis. Bad at: producing high-quality fixed-size representations of input. For that, encoder models still dominate — which is why production retrieval stacks remain a mix of encoder embedders and decoder LLMs, each playing to their strengths.

The recent twist is that strong decoder-only base models can be fine-tuned into competitive rerankers and embedders by repurposing the final-token representation or enabling bidirectional attention in a final stage. The encoder/decoder distinction is becoming more about what the model is used for than what it’s structurally built as.
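One way to picture "repurposing the final-token representation" is last-token pooling: under causal attention, only the final real token has seen the whole input, so its hidden state is the natural summary vector. The sketch below contrasts it with mean pooling, a common choice for bidirectional encoders. It assumes right-padded batches and uses random arrays as stand-ins for model outputs; the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def last_token_pool(hidden, mask):
    # hidden: (batch, seq, dim) final-layer states; mask: (batch, seq), 1 = real token.
    # Assumes right padding, so the last real token sits at index sum(mask) - 1.
    last = mask.sum(axis=1) - 1
    return l2_normalize(hidden[np.arange(hidden.shape[0]), last])

def mean_pool(hidden, mask):
    # Average over real tokens only, ignoring padding.
    weights = mask[:, :, None].astype(hidden.dtype)
    return l2_normalize((hidden * weights).sum(axis=1) / weights.sum(axis=1))

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))             # stand-in for model outputs
mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])   # first sequence has one pad token
embeddings = last_token_pool(hidden, mask)
similarity = float(embeddings[0] @ embeddings[1])  # cosine similarity of unit vectors
```

Fine-tuning is still what makes the pooled vector useful for retrieval; the pooling step only decides which hidden state is read out.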

At first glance, causal masking looks like it should serialize training — token i depends on tokens 1 through i-1, so you might expect to compute them in order. The trick is that the next-token prediction loss at position i only depends on the hidden state at position i, which only requires the prefix to be in scope.

In a single forward pass, the model computes hidden states for all positions 1..N simultaneously; the causal mask just zeroes out the attention weights above the diagonal. Each position’s hidden state is a function of its prefix, and because the whole ground-truth sequence is already available during training, no position has to wait for an earlier position’s prediction. So a sequence of length 8K trains as 8K parallel next-token predictions in one step.
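The "many predictions in one step" idea comes down to a one-position shift between logits and targets. The following sketch, with random logits standing in for a real forward pass, computes the cross-entropy for every position at once. The function name and the toy vocabulary size are assumptions made for the example.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    # logits: (n, vocab); row i is the prediction made after reading tokens 0..i.
    # Row i is scored against token i + 1, so the final row has no target.
    preds = logits[:-1]
    targets = np.asarray(token_ids)[1:]
    shifted = preds - preds.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    per_position = -log_probs[np.arange(len(targets)), targets]
    return per_position.mean(), per_position

rng = np.random.default_rng(0)
token_ids = [5, 2, 7, 1, 3]
logits = rng.normal(size=(len(token_ids), 10))  # stand-in for one forward pass
loss, per_position = next_token_loss(logits, token_ids)
```

Every entry of `per_position` comes from the same forward pass, so a sequence of N tokens yields N - 1 training signals without any sequential loop.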

This is what makes pretraining on trillions of tokens feasible. An encoder-decoder model trains with the same teacher-forced parallelism, but it runs two stacks, and the decoder’s cross-attention has to wait for the encoder’s output; a pure decoder-only step is one uniform stack of large matrix multiplications, parallel across positions, which maps cleanly onto GPU tensor cores.

A base model has been trained only on next-token prediction over a massive corpus. It will fluently complete arbitrary text but doesn’t follow instructions in any helpful sense — give it “Translate this to French: hello”, and it might continue with “and then I asked the next question…” rather than producing a translation.

Instruction tuning is a follow-up training stage. The model is fine-tuned on (instruction, response) pairs, often with explicit chat templates that mark turn boundaries (<|user|>, <|assistant|>). After this stage, the same architecture treats the user’s instruction as a prompt that should be followed, not continued.
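A chat template is, at bottom, a deterministic way of flattening a conversation into one string for the causal decoder to continue. The sketch below uses the `<|user|>` and `<|assistant|>` markers mentioned above; the exact layout (newlines, the trailing assistant marker, the `render_chat` name) is an assumption for illustration, since each model family defines its own template.

```python
def render_chat(messages, add_generation_prompt=True):
    # Hypothetical layout: "<|role|>", newline, content, newline for each turn.
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in messages]
    if add_generation_prompt:
        # Ending on the assistant marker cues the model to write the reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)

conversation = [{"role": "user", "content": "Translate this to French: hello"}]
prompt = render_chat(conversation)
print(prompt)
```

The model still only continues text; the template and the fine-tuning data together make "the reply to the instruction" the most likely continuation.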

Modern releases (Llama-3-Instruct, Qwen-Chat, Claude) compose three stages: base pretraining, supervised fine-tuning on instruction data, then preference optimization (RLHF or DPO) to align tone and behavior. The architectural backbone, a causal decoder, is identical at every stage; only the training data and objective change.

Go further

Why did decoder-only beat encoder-decoder for general LLMs?

Two reasons. First, decoder-only is simpler — one stack of blocks instead of two — which scales more cleanly. Second, treating any task (translation, summarization, Q&A) as 'continue this prompt' works surprisingly well at scale, removing the need for a dedicated encoder. Empirically, decoder-only at LLM scale matches or beats encoder-decoder on most generative tasks.

Related: Encoder-decoder model, Transformer

What makes the attention 'causal'?

A triangular mask that zeros out attention weights from each position to all later positions. This means token $i$'s representation is computed using only tokens $1$ through $i$, which is what lets the same model train on all positions in parallel and then generate sequentially at inference time.

Related: Attention, Autoregressive generation

If decoder-only is so dominant, why use encoders at all?

Representation tasks. An [embedding model](/concepts/embedding/) doesn't generate; it produces a single vector per input. An [encoder](/concepts/encoder-model/) with bidirectional attention does that better than a causal decoder, because both left and right context inform the representation. Production retrieval stacks remain encoder-heavy for this reason.

Related: Encoder model, Bi-encoder, Embedding
