Encoder Model

Also known as: bidirectional encoder, BERT, encoder-only transformer

TL;DR

An encoder model is a transformer that reads a sequence with bidirectional attention and produces a contextual representation for each token — typically pooled into a single vector.

An encoder model is a transformer configured to read a sequence end-to-end and produce a representation, not to generate new tokens. The defining feature is bidirectional attention: every position attends to every other position with no causal mask, so the representation of each token is informed by both its left and right context. BERT (Devlin et al., 2018) is the canonical example; nearly every production embedding model and reranker descends from this lineage.

[Figure: Encoder model, a bidirectional transformer in which every token reads from every other token. Six input positions ([CLS], the, river, [MASK], is, flooding) pass through four blocks of attention (ATT) and feed-forward (FFN) layers. Encoder, full attention: α[i,j] for all (i,j) pairs, no mask. Decoder, for contrast: α[i,j] only for j ≤ i, with j > i masked to −∞; the mask is the only difference. The [CLS] output is pooled into a sentence embedding. The MLM head predicts [MASK] = argmax softmax(WᵤH[MASK]): water 0.58, level 0.21, bank 0.14, bed 0.07, giving “the river water is flooding”.]

What encoder output looks like

Run a string through an encoder and you get back a sequence of contextual vectors — one per input token, each in the model’s hidden dimension. Two main ways to use them:

  • Pool into a single vector. Average the token vectors, or take the special [CLS] token’s representation, to get one fixed-size embedding for the whole input. This is what every embedding model does.
  • Use the per-token vectors directly. For sequence labeling (NER, POS tagging) or token-level classification.

A task-specific linear head on top turns this into whatever you need: logits for classification, a single similarity score for reranking, an embedding vector for retrieval.
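As a rough illustration of a head on top of a pooled vector, here is a minimal sketch of a linear layer followed by a softmax. The weights, bias, and pooled values are made-up numbers for illustration, not taken from any trained model, and the function names are hypothetical.

```python
import math
from typing import List


def linear_head(pooled: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Compute logits = W @ pooled + b, one logit per row of W."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Turn logits into probabilities; subtract the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pooled embedding (hidden size 3) and an untrained 2-class head.
pooled = [0.5, -1.0, 2.0]
W = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]
b = [0.0, 0.5]

logits = linear_head(pooled, W, b)  # one logit per class
probs = softmax(logits)             # class probabilities
```

With a single output row, the same layer would yield one scalar, which is the shape a reranking score takes.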

During pretraining, BERT prepends a special [CLS] token to every sequence and feeds its final-layer hidden state to a next-sentence-prediction head, trained alongside the masked-language-model objective. That trains [CLS] to aggregate information from every other token through self-attention: it is the only token whose pretraining objective rewards summarizing the whole sequence. After pretraining, fine-tuning a classifier or embedder on top of [CLS] continues that aggregation pattern, and it works well in practice. The alternative, mean-pooling all token vectors, is competitive and often slightly better for retrieval embeddings, because mean-pooling is less brittle when fine-tuning shifts what each token attends to. Many modern embedding models (E5, jina, zembed-1) use mean-pooling, though some, such as BGE, pool from [CLS].
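The two pooling strategies can be sketched in a few lines. This is a toy version over plain Python lists, assuming position 0 holds [CLS] and that an attention mask marks padding with 0. The function names are illustrative, and a real pipeline would do the same arithmetic on tensors.

```python
from typing import List

Vector = List[float]


def cls_pool(token_vectors: List[Vector]) -> Vector:
    """Return the vector at position 0, assumed to be the [CLS] token."""
    return list(token_vectors[0])


def mean_pool(token_vectors: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the vectors of real tokens; positions with mask 0 (padding) are skipped."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


# Toy encoder output: [CLS], two word tokens, one padding position; hidden size 2.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

cls_vec = cls_pool(hidden)          # the [CLS] row only
mean_vec = mean_pool(hidden, mask)  # average of the three unmasked rows
```

Skipping padded positions matters in batched inference, where short inputs are padded to a common length and would otherwise drag the average toward the padding vectors.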

Why bidirectional attention matters here

Decoder models use causal attention because they generate token by token: position i must only see positions j ≤ i. Encoders don’t generate, so the constraint is unnecessary, and removing it is genuinely better for representation quality. The word bank in “bank account” resolves only with right-side context, while in “river bank” the clue sits on the left; a bidirectional model sees both. Causal models can still capture a right-side clue in a later token’s representation, but that route is less direct.
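To make the difference concrete, the sketch below computes attention weights for the same raw scores with and without a causal mask. It is a toy, single-head illustration with no learned projections: the scores are all set equal so that the mask is the only thing that changes the result, and the function name is hypothetical.

```python
import math
from typing import List

NEG_INF = float("-inf")


def attention_weights(scores: List[List[float]], causal: bool) -> List[List[float]]:
    """Row-wise softmax over raw scores; with causal=True, position i ignores every j > i."""
    weights = []
    for i, row in enumerate(scores):
        # Masked positions get -inf, so they contribute exp(-inf) = 0 to the softmax.
        masked = [NEG_INF if causal and j > i else s for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy raw scores for 3 positions, all equal.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]

encoder_alpha = attention_weights(scores, causal=False)  # every row spreads over all 3 positions
decoder_alpha = attention_weights(scores, causal=True)   # row i spreads over positions 0..i only
```

In the encoder case the first position already places weight on the last one; in the causal case it can only attend to itself.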

What encoders are used for

Production retrieval is built on encoders almost end-to-end:

  • Embedding models: encoder + pooling head, trained with contrastive loss to produce semantically meaningful vectors. Used for first-pass retrieval.
  • Cross-encoder rerankers: encoder applied to concatenated [query, document] pairs, with a scoring head on top. The standard reranker architecture.
  • Classifiers: sentiment, topic, intent, faithfulness checking. All are encoder + classification head.
  • Bi-encoders: encoder used twice (once for the query, once for the document), with the pooled vectors compared.
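The "pooled vectors compared" step of the two-pass setup can be sketched as a cosine-similarity ranking. The vectors below are hand-written stand-ins for encoder output, and the document ids and function names are hypothetical. In a real system both the query and the documents would come from the same trained encoder, with document vectors computed ahead of time.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank(query_vec: List[float], doc_vecs: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Score every stored document vector against the query, highest similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in pooled embeddings (hidden size 2); real ones would come from an encoder.
doc_vecs = {"doc_river": [0.9, 0.1], "doc_finance": [0.1, 0.9]}
query_vec = [1.0, 0.0]

ranking = rank(query_vec, doc_vecs)
```

Because document vectors do not depend on the query, they can be indexed once. A model that reads each [query, document] pair jointly has to run once per pair at query time.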

Common encoder backbones

The original BERT-base and BERT-large are mostly historical now. Today’s working encoders include RoBERTa, DeBERTa-v3, ELECTRA, E5, BGE, and the modern decoder-derived encoders (encoder-style adaptations of Qwen, Llama). For retrieval specifically, models pretrained or fine-tuned with contrastive objectives, such as E5, BGE, and jina-embeddings, outperform vanilla BERT-style backbones.

Why production AI hasn’t moved away from encoders

There is a tempting narrative that decoder LLMs subsume everything. They don’t, at least not for representation tasks. An encoder-based reranker costs 10–100× less than asking an LLM to score a (query, document) pair, and is more accurate per dollar. Encoders are smaller, faster, easier to serve, and fit into the hot path of a query.

Decoder LLMs got famous; encoder models stayed load-bearing. Every embedding lookup, every reranker call, every faithfulness check in production AI is an encoder pretending to be invisible.

Encoder-derived models in a typical production stack

  • Embedding model for first-pass retrieval: encoder + mean-pooling, contrastively trained
  • Cross-encoder reranker: encoder over [query, doc] pairs with a regression head
  • Faithfulness / entailment classifier: encoder + 3-way head (entail / contradict / neutral)
  • Intent or topic classifier for routing: encoder + softmax over labels
  • PII / toxicity / jailbreak detector: encoder + binary head
  • Token-level NER for redaction: encoder + per-token classification head

Go further

Why use bidirectional attention instead of causal?

Encoders aren't generating; they're producing representations. To represent a word well, you want both its left and right context. 'Bank' before 'of the river' and 'bank' before 'account' should embed differently — that's only possible if the attention layer can see what's to the right.

Are encoder models still relevant in the LLM era?

Hugely so. Every production embedding model, every reranker, every classifier, every faithfulness checker is a fine-tuned encoder. They're 10–100× cheaper to run than decoder LLMs and beat them on the narrow tasks they're trained for. Encoders are quietly the workhorses of production AI.

How are encoders pretrained without next-token prediction?

Masked language modeling: randomly mask ~15% of input tokens and train the model to predict the masked tokens from surrounding context. Bidirectional attention is essential because the model uses both sides to fill in the blank. Modern encoders also use objectives like contrastive learning for embedding-specific pretraining.
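The masking step can be sketched as follows. This is a simplified illustration, not BERT's exact recipe: it always substitutes [MASK], whereas the published BERT procedure also leaves some selected tokens unchanged or swaps in random ones. The function name, the set of special tokens, and the seed parameter are assumptions made for the example.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # assumed set of tokens that are never masked


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: Optional[int] = None
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace each non-special token with [MASK] with probability mask_prob.

    Returns (inputs, labels). labels[i] holds the original token at masked
    positions and None elsewhere, so a loss would only be computed where a
    token was hidden.
    """
    rng = random.Random(seed)
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


tokens = ["[CLS]", "the", "river", "water", "is", "flooding"]
inputs, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
```

The model would then be trained to predict each non-None label from the full masked sequence, using context on both sides of the gap.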
