Also known as: bidirectional encoder, BERT, encoder-only transformer
TL;DR
An encoder model is a transformer that reads a sequence with bidirectional attention and produces a contextual representation for each token — typically pooled into a single vector.
An encoder model is a transformer configured to read a sequence end-to-end and produce a representation rather than generate new tokens. The defining feature is bidirectional attention: every position attends to every other position with no causal mask, so the representation of each token is informed by both its left and right context. BERT (Devlin et al., 2018) is the canonical example; most production embedding models and rerankers descend from this lineage.
What encoder output looks like
Run a string through an encoder and you get back a sequence of contextual vectors — one per input token, each in the model’s hidden dimension. Two main ways to use them:
Pool into a single vector. Average the token vectors, or take the special [CLS] token’s representation, to get one fixed-size embedding for the whole input. This is what every bi-encoder embedding model does.
Use the per-token vectors directly. For sequence labeling (NER, POS tagging) or token-level classification.
A small task-specific head on top, usually a single linear layer, turns these vectors into whatever you need: logits for classification, a single similarity score for reranking, an embedding vector for retrieval.
During masked-language-model pretraining, BERT prepends a special [CLS] token to every sequence and uses its final-layer hidden state as the input to a next-sentence-prediction head. That trains [CLS] to aggregate information from every other token through self-attention — it’s the only token whose pretraining objective rewards summarizing the whole sequence. After pretraining, fine-tuning a classifier or embedder on top of [CLS] continues that aggregation pattern, and it works well in practice. The alternative — mean-pooling all token vectors — is competitive and often slightly better for retrieval embeddings, because mean-pooling is less brittle when fine-tuning shifts what each token attends to. Modern embedding models (E5, BGE, jina, zembed-1) overwhelmingly use mean-pooling.
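The two pooling strategies discussed above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names `cls_pool` and `mean_pool`, the toy `hidden_states`, and the assumption that [CLS] sits at position 0 and that padding is marked by a 0 in `attention_mask` are all chosen here for the example.

```python
from typing import List

Vector = List[float]


def cls_pool(token_vectors: List[Vector]) -> Vector:
    """Return the first token's vector, assuming [CLS] sits at position 0."""
    return list(token_vectors[0])


def mean_pool(token_vectors: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy encoder output: 4 positions ([CLS], two word tokens, one [PAD]), hidden size 2.
hidden_states = [
    [1.0, 0.0],  # [CLS]
    [0.0, 2.0],  # token 1
    [2.0, 4.0],  # token 2
    [9.0, 9.0],  # [PAD], must not influence the mean
]
mask = [1, 1, 1, 0]

cls_embedding = cls_pool(hidden_states)          # [1.0, 0.0]
mean_embedding = mean_pool(hidden_states, mask)  # [1.0, 2.0]
```

Masking out padding matters for mean-pooling: without it, the padded position would drag the average toward a meaningless vector whenever inputs in a batch have different lengths.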
Why bidirectional attention matters here
Decoder-only models use causal attention because they generate token by token: position $i$ must only see positions $j \le i$. Encoders don’t generate, so the constraint is unnecessary, and removing it improves representation quality. The word “bank” in “bank account” and “bank of the river” only resolves with right-side context. Under causal attention the vector for “bank” itself never sees that context; only the representations of later tokens can capture the disambiguation, which is less direct.
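The difference between the two attention patterns comes down to which entries of the attention matrix are allowed. The sketch below builds both masks for a four-token input and shows what the first token can attend to in each case. The helper names (`attention_mask`, `masked_softmax`) and the uniform scores are assumptions made for illustration; real models compute scores from learned query and key projections.

```python
import math
from typing import List


def attention_mask(seq_len: int, causal: bool) -> List[List[bool]]:
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


def masked_softmax(scores: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over one row of scores, giving zero weight to disallowed positions."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["bank", "of", "the", "river"]
bidirectional_mask = attention_mask(len(tokens), causal=False)
causal_mask = attention_mask(len(tokens), causal=True)

# Row 0 is the query position for "bank". With uniform scores, the weights show
# which positions can contribute to its representation.
uniform_scores = [0.0] * len(tokens)
bank_weights_bidirectional = masked_softmax(uniform_scores, bidirectional_mask[0])
bank_weights_causal = masked_softmax(uniform_scores, causal_mask[0])
```

With the bidirectional mask, "bank" spreads its attention over all four tokens, including "river". With the causal mask, all of its weight falls on itself, so its vector cannot reflect anything to its right.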
What encoders are used for
Production retrieval is built on encoders almost end-to-end:
Bi-encoder rerankers: the encoder runs twice (once for the query, once for the document), and the two pooled vectors are compared.
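The comparison step of a bi-encoder can be sketched as follows. The vectors here are hand-written stand-ins for pooled encoder outputs, and cosine similarity is an assumption for the example; the text above only says the vectors are compared, and some models use a dot product instead. The names `cosine_similarity` and `rank_documents` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: Vector, doc_vecs: List[Vector]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-ins for pooled vectors: one query and three documents.
query_vec = [1.0, 0.0]
doc_vecs = [
    [0.0, 1.0],  # orthogonal to the query
    [2.0, 0.0],  # same direction as the query
    [1.0, 1.0],  # partially aligned
]

ranking = rank_documents(query_vec, doc_vecs)
ranked_ids = [doc_id for doc_id, _ in ranking]  # [1, 2, 0]
```

Because the document vectors do not depend on the query, they can be computed once and stored, which is what lets this pattern sit in the hot path of a query.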
Common encoder backbones
The original BERT-base and BERT-large are mostly historical now. Today’s working encoders include RoBERTa, DeBERTa-v3, ELECTRA, E5, BGE, and the modern decoder-derived encoders (encoder-style adaptations of Qwen, Llama). For retrieval specifically, models pretrained or fine-tuned with contrastive objectives, such as E5, BGE, and jina-embeddings, outperform vanilla BERT-style backbones.
Why production AI hasn’t moved away from encoders
There is a tempting narrative that decoder LLMs subsume everything. They don’t, at least not for representation tasks. An encoder-based reranker costs 10–100× less than asking an LLM to score a (query, document) pair, and is more accurate per dollar. Encoders are smaller, faster, easier to serve, and fit into the hot path of a query.
Decoder LLMs got famous; encoder models stayed load-bearing. Every embedding lookup, every reranker call, every faithfulness check in production AI is an encoder pretending to be invisible.
Encoder-derived models in a typical production stack
Embedding model for first-pass retrieval — encoder + mean-pooling, contrastively trained
Cross-encoder reranker — encoder over [query, doc] pairs with a regression head
Token-level NER for redaction — encoder + per-token classification head
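The components in this list share the same encoder body and differ mainly in what sits on top. A rough sketch of the two heads mentioned, using hand-picked toy weights: the function names, the weights, and the two-label scheme (0 = keep, 1 = redact) are all hypothetical, and a trained model would learn the weights rather than hard-code them.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias, with weight shaped [out_features][in_features]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def regression_head(pooled: Vector, weight: Matrix, bias: Vector) -> float:
    """Cross-encoder style: one pooled vector in, one relevance score out."""
    return linear(pooled, weight, bias)[0]


def token_classification_head(
    token_vectors: List[Vector], weight: Matrix, bias: Vector
) -> List[int]:
    """NER style: apply the same linear layer to every token and take the argmax label."""
    label_ids = []
    for vec in token_vectors:
        logits = linear(vec, weight, bias)
        label_ids.append(max(range(len(logits)), key=logits.__getitem__))
    return label_ids


# Cross-encoder: stand-in for the pooled vector of a [query, doc] pair.
pooled_pair = [0.5, -1.0]
score = regression_head(pooled_pair, weight=[[2.0, 1.0]], bias=[0.5])

# Token-level tagging: stand-ins for three per-token vectors, two labels.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
label_ids = token_classification_head(
    token_vectors,
    weight=[[1.0, 0.0], [0.0, 1.0]],  # row k scores label k
    bias=[0.0, 0.0],
)
```

The embedding model in the list needs no head of this kind at inference time: its pooled vector is the output.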
Go further
Why use bidirectional attention instead of causal?
Encoders aren't generating; they're producing representations. To represent a word well, you want both its left and right context. 'Bank' before 'of the river' and 'bank' before 'account' should embed differently — that's only possible if the attention layer can see what's to the right.
Hugely so. Every production [embedding](/concepts/embedding/) model, every [reranker](/concepts/reranker/), every classifier, every faithfulness checker is a fine-tuned encoder. They're 10–100× cheaper to run than decoder LLMs and beat them on the narrow tasks they're trained for. Encoders are quietly the workhorses of production AI.
How are encoders pretrained without next-token prediction?
Masked language modeling: randomly mask ~15% of input tokens and train the model to predict the masked tokens from surrounding context. Bidirectional attention is essential because the model uses both sides to fill in the blank. Modern encoders also use objectives like contrastive learning for embedding-specific pretraining.
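As a rough sketch of the masking step, assuming whitespace tokens and a simplified rule where every selected token is replaced by [MASK]. The original BERT recipe also swaps some selected tokens for random tokens or leaves them unchanged, which this example omits. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: Optional[int] = None
) -> Tuple[List[str], List[Optional[str]]]:
    """Independently mask each token with probability mask_prob.

    Returns the corrupted inputs and the labels: the original token at masked
    positions, None elsewhere (positions that contribute no loss).
    """
    rng = random.Random(seed)
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


tokens = "the bank of the river was flooded after the storm".split()
masked_inputs, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
```

The model is trained to predict `labels` from `masked_inputs`, and only the masked positions contribute to the loss. Since a masked token can have useful context on either side, this objective only works well with bidirectional attention.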