Topic · 32 concepts

Language Models

The foundational substrate of modern AI.

Large language models are transformer-based neural networks trained on vast text corpora to predict the next token. The concepts below cover the building blocks (transformer, attention, tokenization, context window), the failure modes (hallucination), and the production lens — when to call an LLM, when to specialize a small model instead, and why almost every serious AI stack ends up combining both. Foundational reading for everything else on this site.

Attention

Attention is the mechanism that lets a token in a sequence dynamically read from any other token's representation, weighted by a learned similarity.
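
As a rough illustration of the weighted read described above, here is a minimal pure-Python sketch of scaled dot-product attention. It assumes the queries, keys, and values are already given as plain lists; in a real transformer they come from learned projections, which are omitted here. The function names are illustrative only.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weight-averaged mix of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# A query that matches the first key reads almost entirely from the first value.
result = attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])
```
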
Autoregressive Generation

Autoregressive generation is the token-by-token loop that decoder LLMs use to produce text: predict the next token from everything generated so far, sample, append, repeat.
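
The predict, sample, append, repeat loop can be sketched with a toy stand-in for the model. The bigram table below is hypothetical and conditions only on the last token, whereas a real decoder LLM conditions on the whole prefix; the loop structure is the point.

```python
import random

# Hypothetical toy "model": maps the last token to next-token probabilities.
BIGRAMS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def generate(max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = BIGRAMS[tokens[-1]]                     # predict
        choices, probs = zip(*dist.items())
        nxt = rng.choices(choices, weights=probs)[0]   # sample
        tokens.append(nxt)                             # append
        if nxt == "</s>":                              # stop at end-of-sequence
            break
    return tokens

sequence = generate()
```
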
Causal Masking

Causal masking is the lower-triangular attention mask that prevents each token from seeing tokens to its right. It is the architectural commitment that makes a transformer autoregressive — the load-bearing difference between encoder and decoder attention.
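
A small sketch of the lower-triangular mask, using plain nested lists. The helper names are illustrative; production code would build the mask as a tensor inside the attention kernel.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def apply_mask(scores, mask):
    """Set disallowed scores to -inf so a later softmax gives them zero weight."""
    return [[s if ok else float("-inf") for s, ok in zip(row, mrow)]
            for row, mrow in zip(scores, mask)]

mask = causal_mask(3)
masked = apply_mask([[1.0, 2.0], [3.0, 4.0]], causal_mask(2))
```
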
Context Rot

Context rot is the empirical degradation of an LLM's effective recall and instruction-following as its context window fills. The canonical case is the U-shaped position bias first quantified by Liu et al. (2023) as 'lost in the middle' — facts near the start and end of a long prompt are used, facts buried in the middle are often ignored — but the phenomenon generalizes to attention dilution and instruction drift across long contexts.
Context Window

The context window is the maximum number of tokens an LLM can process at once. Modern LLMs span 8K to 1M+ tokens, but the *effective* window, where attention quality stays high, is often shorter than the advertised one.
Decoder-Only Model

A decoder-only model is a transformer that generates text autoregressively, one token at a time, with causal self-attention so each position only sees prior tokens.
Encoder Model

An encoder model is a transformer that reads a sequence with bidirectional attention and produces a contextual representation for each token — typically pooled into a single vector.
Encoder-Decoder Model

An encoder-decoder model is a transformer with two stacks: an encoder reads the input bidirectionally, then a decoder generates the output autoregressively while cross-attending to the encoder's representations.
FlashAttention

FlashAttention is an I/O-aware attention kernel that tiles the computation in SRAM and fuses the softmax, avoiding the need to materialize the N×N attention matrix in HBM.
Grouped-Query Attention (GQA)

Grouped-Query Attention shares a single key/value head across a group of query heads, shrinking the KV cache by the group factor with negligible quality loss.
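
The cache saving is simple arithmetic. The sketch below assumes a hypothetical model shape (32 layers, 32 query heads, head dimension 128, 16-bit values) that is not taken from any specific model; only the ratio matters.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Two tensors (keys and values) per layer, each seq_len x kv_heads x head_dim.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Standard multi-head attention: one KV head per query head.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192)
# GQA with groups of 4 query heads sharing one KV head: 32 / 4 = 8 KV heads.
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

shrink_factor = mha // gqa  # equals the group size
```
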
Hallucination

Hallucination is when an LLM generates a confident-sounding statement that's factually wrong or unsupported by the input. It's the load-bearing failure mode of LLMs in production.
KV Cache

The KV cache stores the key and value tensors from previous tokens during autoregressive generation, so each new token only computes attention over its own query against cached keys and values — not a full re-computation.
Large Language Model (LLM)

A large language model is a transformer-based neural network trained on vast text corpora to predict the next token. Modern LLMs (GPT, Claude, Gemini) are general-purpose reasoning engines.
Layer Normalization

Layer normalization rescales each layer's activations to zero mean and unit variance per token, then applies a learned affine transform. It stabilizes deep transformer training and, together with residual connections, helps modern LLMs stack a hundred or more layers without diverging.
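
A minimal sketch of the per-token computation, assuming a single activation vector and a small `eps` for numerical stability (a conventional addition, not stated above). `gamma` and `beta` stand in for the learned affine parameters.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed across the feature dimension of one token.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    # Normalize, then apply the learned scale (gamma) and shift (beta).
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

y = layer_norm([1.0, 2.0, 3.0, 4.0], gamma=[1.0] * 4, beta=[0.0] * 4)
```
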
Logits

Logits are the raw, pre-softmax score vector a language model outputs at each position — one real-valued score per vocabulary token. They're the currency of decoding: every sampling strategy and calibration trick operates on them.
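
To make the role of logits in decoding concrete, here is a small sketch that turns a made-up three-token logit vector into probabilities, with temperature scaling and greedy selection as two example strategies. The values are illustrative only.

```python
import math

def softmax(logits, temperature=1.0):
    # Dividing by a temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                      # one score per vocabulary token
probs = softmax(logits)                       # plain softmax
sharp = softmax(logits, temperature=0.5)      # lower temperature, more peaked
greedy_token = max(range(len(logits)), key=lambda i: logits[i])  # argmax decoding
```
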
Mamba State-Space Model

Mamba is a linear-time sequence model that replaces attention with a selective state-space recurrence. It runs in O(N) time instead of attention's O(N²) and carries a fixed-size recurrent state, so inference memory stays constant regardless of context length.
Mechanistic Interpretability

Reverse-engineering neural networks at the level of *circuits* — small subgraphs of attention heads and MLP neurons that implement specific, identifiable computations.
Mixture of Experts (MoE)

An architecture that replaces the dense feed-forward layer in a transformer with a sparse routing layer over many expert subnetworks — each token activates only a few experts.
Multi-Head Attention

Multi-head attention splits the attention computation into $h$ parallel heads, each with its own learned projections. Heads specialize on different relations — syntactic, semantic, positional — and their outputs are concatenated and projected. The default attention pattern in every modern transformer.
Perplexity

Perplexity is the standard intrinsic metric for evaluating language models: the exponentiated average per-token cross-entropy loss on held-out text. Lower is better.
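
The definition translates directly into a few lines. This sketch assumes you already have the probability the model assigned to each held-out token; the function name is illustrative.

```python
import math

def perplexity(token_probs):
    # Per-token cross-entropy is the negative log-probability of the true token.
    nll = [-math.log(p) for p in token_probs]
    # Perplexity is the exponential of the average.
    return math.exp(sum(nll) / len(nll))

uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])
perfect = perplexity([1.0, 1.0])
```

A model that spreads probability uniformly over four options at every step has perplexity 4, which is why perplexity is often read as an effective branching factor.
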
Positional Encoding

Positional encoding gives a transformer a sense of token order — necessary because raw self-attention is permutation-equivariant and would treat 'dog bites man' and 'man bites dog' identically.
Pretraining

Pretraining is the initial massive next-token-prediction phase that trains a language model on trillions of tokens of generic text. It's where an LLM acquires its broad capability — grammar, world knowledge, reasoning, code.
Reasoning Model

A reasoning model is an LLM trained to spend test-time compute on internal chain-of-thought before answering. The post-o1 paradigm: pretraining + SFT + RL on verifier-checkable problems, with hidden 'thinking' tokens as the substrate.
Residual Connection

A residual connection adds a layer's input to its output, so each block computes an *update* on top of a running 'residual stream' rather than transforming the representation from scratch.
RoPE (Rotary Positional Embedding)

RoPE encodes token position by rotating pairs of dimensions in the query and key vectors by an angle proportional to position. The dot product between query and key then becomes a function of their relative position.
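
The relative-position property can be checked numerically. The sketch below assumes the common frequency schedule with base 10000 and an even vector length; the vectors are arbitrary illustrative values.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower frequency for later pairs
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions are 2 apart, so the scores should match.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```
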
Scaling Laws

Scaling laws are empirical power-law relationships between compute, parameter count, training tokens, and language-model loss. Chinchilla's 2022 result is the best-known example: train on roughly 20 tokens per parameter for compute-optimal performance.
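
As a worked example of the 20-tokens-per-parameter rule of thumb, the sketch below sizes a training run for a 70B-parameter model. The FLOPs estimate uses the common approximation of about 6 FLOPs per parameter per training token, which is an assumption added here rather than something stated above.

```python
TOKENS_PER_PARAM = 20  # rough Chinchilla rule of thumb

def compute_optimal_tokens(params):
    return TOKENS_PER_PARAM * params

def training_flops(params, tokens):
    # Approximation: about 6 FLOPs per parameter per training token.
    return 6 * params * tokens

params = 70_000_000_000                   # 70B parameters
tokens = compute_optimal_tokens(params)   # 1.4 trillion tokens
flops = training_flops(params, tokens)    # about 5.88e23 FLOPs
```
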
Sliding-Window Attention

Sliding-window attention restricts each token to attend only to the past $w$ tokens (typically 4K-8K) instead of the full context, trading global receptive field for $O(N \cdot w)$ instead of $O(N^2)$ compute. Used in Mistral, Gemma, Phi, and most long-context efficient designs.
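
A sketch of the corresponding mask, assuming the window of $w$ tokens includes the current token (conventions differ between implementations). Each row allows at most $w$ positions, which is where the $O(N \cdot w)$ cost comes from.

```python
def sliding_window_mask(n, w):
    """mask[i][j] is True when i may attend to j: causal and within the last w tokens."""
    return [[(j <= i) and (i - j < w) for j in range(n)] for i in range(n)]

mask = sliding_window_mask(4, 2)
allowed = sum(sum(row) for row in mask)  # total attended pairs, at most n * w
```
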
Sparse Autoencoders

A wide, sparsely-activated autoencoder trained on transformer activations. The learned dictionary recovers *monosemantic* features — directions that fire for a single human-understandable concept rather than the polysemantic mush of raw neurons.
Subword Tokenization (BPE, WordPiece, SentencePiece)

Subword tokenization is the family of algorithms that learn a vocabulary of subword units from a corpus. BPE (byte-pair encoding) merges the most frequent adjacent pairs; WordPiece and Unigram are variants. tiktoken, SentencePiece, and tokenizers are the standard libraries.
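
One BPE training step, merging the most frequent adjacent pair, can be sketched as follows. The tiny word-frequency corpus is made up for illustration, and the code does not reproduce the API of any of the libraries named above.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Words as tuples of characters, mapped to their corpus frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)   # ("l", "o") occurs 7 times
corpus = merge_pair(corpus, pair)
```

Repeating this step until the vocabulary reaches a target size yields the merge table that a BPE tokenizer applies at inference time.
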
Test-Time Compute

Test-time compute trades inference budget for accuracy by spending more tokens, samples, or search steps per query. Self-consistency, best-of-N, reasoning chains, and tree search are all instances. It's the substrate behind the o1 / R1 reasoning-model paradigm.
Tokenization

Tokenization is how raw text becomes numerical input for a language model — the input is sliced into tokens (sub-word units, typically 3–5 characters each), each token mapped to an integer ID.
Transformer

The transformer is the neural-network architecture behind every modern LLM, embedding model, and reranker. Its defining feature is self-attention.