Encoder-Decoder Model

Also known as: seq2seq transformer, T5, BART, sequence-to-sequence

TL;DR

An encoder-decoder model is a transformer with two stacks: an encoder reads the input bidirectionally, then a decoder generates the output autoregressively while cross-attending to the encoder's representations.

An encoder-decoder model is the original shape proposed in Attention Is All You Need (Vaswani et al., 2017). Two stacks of transformer blocks operate together: the encoder reads the entire input with bidirectional self-attention, producing a sequence of representations; the decoder generates the output token by token with causal attention, while at every layer also performing cross-attention into the encoder’s output. T5, BART, and mBART follow this template, as do many neural machine translation systems.

[Figure: Encoder-decoder transformer. One stack reads, another writes, joined by cross-attention. The encoder reads the source “Le chat est sur le tapis.” (N = 6 tokens) through bidirectional blocks with 2 sublayers each (attention, feed-forward). The decoder holds the target so far, “the cat is on …” (M = 5 positions, starting from <s>), and runs causal blocks with 3 sublayers each (self-attention, cross-attention, feed-forward). Cross-attention takes queries from the decoder and keys and values from the encoder, so the attention matrix α[t, s] is rectangular (M × N), not square. The next-token distribution p(y₅ | y<₅, x) is a softmax over the vocabulary, conditioned on the source and the target so far: mat 0.71, rug 0.16, floor 0.09, sofa 0.04, giving “the cat is on the mat.”]

How information flows

A translation example clarifies the design. Input: “Le chat est sur le tapis.” Output: “The cat is on the mat.”

  1. Encode. The full source sentence runs through the encoder stack. Bidirectional self-attention lets every source token’s representation incorporate every other source token. Output: a sequence of vectors, one per source token, all available to the decoder.
  2. Decode. Generation starts with a special start-of-sequence token. At each decoder layer:
    • Causal self-attention over what’s been generated so far.
    • Cross-attention where queries come from the decoder, keys and values come from the encoder’s output. This is where the decoder “reads” the source.
    • Feed-forward.
  3. Sample the next token, append it, repeat until end-of-sequence.
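
The three steps above can be sketched in a few lines of NumPy. This is a toy illustration, not a real model: the weights are random, there is a single layer per stack, attention has no learned projections or layer normalization, and greedy argmax stands in for sampling. All names (`encode`, `decoder_logits`, `greedy_decode`, the `BOS`/`EOS` ids) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, VOCAB = 16, 10   # hypothetical width and vocabulary size
BOS, EOS = 0, 1           # hypothetical start/end-of-sequence ids

# Random toy weights; a trained model would learn these.
embed = rng.normal(size=(VOCAB, D_MODEL))
w1 = rng.normal(size=(D_MODEL, 4 * D_MODEL)) * 0.1
w2 = rng.normal(size=(4 * D_MODEL, D_MODEL)) * 0.1
w_out = rng.normal(size=(D_MODEL, VOCAB))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(queries, keys_values, mask=None):
    # Scaled dot-product attention without learned projections (simplification).
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ keys_values

def encode(src_ids):
    # Step 1: bidirectional self-attention, no mask, run once per input.
    x = embed[src_ids]
    return x + attend(x, x)

def decoder_logits(tgt_ids, memory):
    # Step 2: causal self-attention, cross-attention, feed-forward.
    y = embed[tgt_ids]
    causal = np.tril(np.ones((len(tgt_ids), len(tgt_ids)), dtype=bool))
    y = y + attend(y, y, causal)          # sees only earlier target tokens
    y = y + attend(y, memory)             # queries: decoder; keys/values: encoder
    y = y + np.maximum(y @ w1, 0.0) @ w2  # feed-forward
    return y[-1] @ w_out                  # logits for the next token

def greedy_decode(src_ids, max_len=8):
    memory = encode(src_ids)              # encoder output is computed once
    out = [BOS]
    for _ in range(max_len):              # step 3: pick, append, repeat
        next_id = int(np.argmax(decoder_logits(out, memory)))
        out.append(next_id)
        if next_id == EOS:
            break
    return out

source = [2, 3, 4, 5, 6, 7]               # six made-up source token ids
print(greedy_decode(source))
```

Note that `encode` is called once, outside the loop, while `decoder_logits` runs at every step against the same `memory`. A real implementation would also cache decoder keys and values rather than recomputing the whole prefix each step.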

The decoder operates autoregressively, like a decoder-only model, but it has full bidirectional access to the source via cross-attention.

On paper, you can simulate cross-attention by prepending the source sequence to the decoder’s prompt and letting causal self-attention do the rest; that’s how decoder-only models effectively handle translation today. The architectural difference is where the source representation lives. In an encoder-decoder, the encoder produces source vectors once with full bidirectional context, and every decoder step cross-attends to that fixed set of vectors. In decoder-only, the source tokens sit at the front of the same causal stream, so each source token sees only the source tokens before it (no right-side context), and the source is represented only by the keys and values it leaves in the same KV cache the generated tokens use. The encoder-decoder split is cleaner conceptually and slightly more parameter-efficient when input and output are clearly separable; decoder-only wins on simplicity at scale.
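
The visibility difference can be made concrete by writing out the attention masks. The sketch below assumes a 6-token source and a 5-token target prefix (the sizes are illustrative), with `True` meaning "this query position may attend to this key position".

```python
import numpy as np

n_src, n_tgt = 6, 5  # illustrative lengths

# Encoder-decoder: three separate masks.
enc_self = np.ones((n_src, n_src), dtype=bool)           # bidirectional over the source
dec_self = np.tril(np.ones((n_tgt, n_tgt), dtype=bool))  # causal over the target
cross = np.ones((n_tgt, n_src), dtype=bool)              # every target step sees all source

# Decoder-only: one causal mask over [source tokens, target tokens].
total = n_src + n_tgt
dec_only = np.tril(np.ones((total, total), dtype=bool))

# How many source tokens can the FIRST source token attend to?
print(enc_self[0].sum())          # 6: full left and right context
print(dec_only[0, :n_src].sum())  # 1: only itself
```

In both layouts the target positions see the whole source (the bottom-left block of `dec_only` is all `True`, just like `cross`). What changes is how the source tokens see each other.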

Pretraining objectives

T5 (Raffel et al., 2020) framed every task as “text in, text out” with task-specific prefixes (e.g., "translate English to German: ..."). Pretraining used a span-corruption objective: replace randomly chosen spans of the input with sentinel tokens, then train the encoder-decoder to generate the dropped spans. BART used a similar denoising objective with several corruption strategies, such as token masking, token deletion, text infilling, and sentence permutation.
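
A minimal sketch of how a span-corruption training pair might be built. Some choices here are assumptions for illustration: the spans are passed in explicitly instead of being sampled at random, tokens are whitespace-split words, and the `<extra_id_N>` sentinel naming follows the convention of common T5 tokenizers. The helper `span_corrupt` is hypothetical.

```python
def span_corrupt(tokens, spans):
    """Build (encoder input, decoder target) from sorted, non-overlapping
    half-open (start, end) spans. Each span is replaced by one sentinel in
    the input; the target lists each sentinel followed by the tokens it hid."""
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # the span collapses to a sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the model must produce the span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The encoder sees the corrupted sentence with full bidirectional context; the decoder only has to generate the short target, which keeps the output sequence much shorter than the input.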

These models excel at tasks with clear input-output structure: translation, summarization, question answering, structured-data-to-text generation. They were the workhorses of NLP roughly 2019-2022.

Why decoder-only displaced them at scale

By the time models hit ~10B+ parameters, decoder-only had pulled ahead for several reasons:

  • Unified interface. Treat every task as “continue this prompt” and you don’t need separate encoder/decoder weights for each modality of input/output.
  • Parameter efficiency. Decoder-only applies its full parameter budget to one stack. Encoder-decoder splits that budget across two stacks, so each token passes through only part of the model.
  • In-context learning. Few-shot prompting fits naturally into a single causal stream; encoder-decoder needs adaptation.
  • Simpler engineering. One model, one tokenizer, one inference path.

The empirical scaling result: at frontier scale, decoder-only matches or beats encoder-decoder on translation and summarization too — the original strongholds.

Where they still live

For most production AI the choice is between a decoder-only LLM for generation and an encoder-only model for representation. Encoder-decoder is the historical middle ground that lost both ends.

Encoder-decoder lost the scaling race because the split it imposed stopped paying its own bill. Decoder-only is strictly more flexible at LLM scale; encoder-only is strictly cheaper for representation tasks. The middle has no home.

Where encoder-decoder still ships
  • Google Translate’s production MT stack: encoder-decoder models heavily tuned for translation
  • T5 / Flan-T5 small models for structured-output tasks (classification with explanations, table-to-text)
  • RETRO and Atlas — research systems that fuse retrieved documents through cross-attention
  • Whisper (OpenAI’s speech model) — the canonical modern encoder-decoder, where the audio/text split is genuinely unavoidable
  • BART summarization fine-tunes when a full LLM call is overkill

Go further

Why split into encoder and decoder at all?

It cleanly separates 'understand input' from 'generate output'. The encoder gets bidirectional attention, ideal for representation; the decoder gets causal attention, ideal for generation. Cross-attention bridges them. For tasks where input and output are clearly distinct (translate French → English), this separation is intuitive and historically effective.

Are encoder-decoder models obsolete?

Not quite. They remain competitive on classical seq2seq tasks like machine translation, summarization, and structured generation, especially at smaller scales. T5-style models also power many internal tools at Google. But for general-purpose LLMs, decoder-only models won the scaling race because of architectural simplicity and unified prompt-based interfaces.

What's cross-attention and how is it different from self-attention?

In self-attention, queries, keys, and values all come from the same sequence. In cross-attention, queries come from the decoder (the sequence being generated) but keys and values come from the encoder (the input). It's how the decoder reads from the encoder's representation at every layer of generation.
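
The distinction comes down to which sequence feeds each projection. Below is a small NumPy sketch with random weights; the sizes (5 decoder positions, 6 encoder positions, width 8) and the function name `attention` are illustrative assumptions, and the causal mask a decoder would apply in self-attention is left out to keep the focus on shapes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(queries_from, keys_values_from, w_q, w_k, w_v):
    # Queries are projected from one sequence, keys and values from another.
    q = queries_from @ w_q
    k = keys_values_from @ w_k
    v = keys_values_from @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8
enc_out = rng.normal(size=(6, d))     # N = 6 encoder outputs
dec_states = rng.normal(size=(5, d))  # M = 5 decoder states
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Self-attention: the same sequence supplies queries, keys, and values.
_, self_weights = attention(dec_states, dec_states, w_q, w_k, w_v)

# Cross-attention: decoder supplies queries, encoder supplies keys and values.
cross_out, cross_weights = attention(dec_states, enc_out, w_q, w_k, w_v)

print(self_weights.shape)   # (5, 5): square
print(cross_weights.shape)  # (5, 6): rectangular, M x N
print(cross_out.shape)      # (5, 8): one updated vector per decoder position
```

The output always has one row per query, so cross-attention leaves the decoder sequence length unchanged while mixing in information from every encoder position.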
