Topic · 48 concepts

Foundations

The bedrock primitives every other topic builds on.

The math, training mechanics, and statistical phenomena that the rest of the catalog assumes you already know. Vectors and dot products. Gradient descent and backpropagation. Cross-entropy loss and softmax. Dropout, regularization, and the bias-variance tradeoff. Generalization phenomena like double descent and grokking that don't fit the textbook story. Most readers landing on a deeper concept (info-NCE loss, knowledge distillation, layer normalization) will eventually click back here for a primitive they need to look up — which is the whole point of having a foundations layer.

Activation Function

An activation function is the elementwise nonlinearity sandwiched between the linear layers of a neural network. Without it, the whole network collapses to a single linear map.
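The collapse claim can be checked by hand on a tiny example. The sketch below uses made-up 2×2 weights (`W1`, `W2`) and a made-up input; it is an illustration of the argument, not code from any particular library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Elementwise max(0, x) on a matrix."""
    return [[max(0.0, v) for v in row] for row in m]

# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [[1.0], [2.0]]  # column vector

stacked = matmul(W2, matmul(W1, x))          # two linear layers, no activation
collapsed = matmul(matmul(W2, W1), x)        # one linear layer with W = W2 W1
with_relu = matmul(W2, relu(matmul(W1, x)))  # nonlinearity in between
```

Here `stacked` and `collapsed` are the same vector, while `with_relu` differs because the ReLU zeroes the negative hidden unit before the second layer sees it.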
ANOVA

Analysis of variance — the statistical test that asks 'do these groups differ more than within-group noise would predict?' Partitions total variation in the data into a between-group component and a within-group component, then compares them via the F-statistic.
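A minimal sketch of the between/within partition for a one-way design, assuming equal treatment of all groups and toy data invented for the example:

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F-statistic for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of each value around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Toy data (hypothetical): three groups of three observations.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(groups)
```

For this data the between-group mean square is 13 and the within-group mean square is 1, so the F-statistic comes out at 13.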
Backpropagation

Backpropagation is the chain-rule application that computes the gradient of the loss with respect to every parameter in a neural network. A forward pass produces predictions.
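As a hedged illustration, the sketch below backpropagates through a two-parameter scalar "network", `loss = (w2 * tanh(w1 * x) - target)^2`, and checks the result against a finite-difference estimate. The model and all values are assumptions made up for the example.

```python
import math

def forward_backward(w1, w2, x, target):
    """Return (loss, dL/dw1, dL/dw2) for loss = (w2 * tanh(w1 * x) - target)^2."""
    # Forward pass: keep the intermediates needed by the backward pass.
    z = w1 * x
    h = math.tanh(z)
    y = w2 * h
    loss = (y - target) ** 2
    # Backward pass: apply the chain rule from the loss back to each parameter.
    dy = 2.0 * (y - target)
    dw2 = dy * h
    dh = dy * w2
    dz = dh * (1.0 - h * h)  # derivative of tanh(z) is 1 - tanh(z)^2
    dw1 = dz * x
    return loss, dw1, dw2

def numeric_grad(f, v, eps=1e-6):
    """Central finite-difference estimate of df/dv."""
    return (f(v + eps) - f(v - eps)) / (2.0 * eps)

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0  # hypothetical values
loss, dw1, dw2 = forward_backward(w1, w2, x, target)
check_dw1 = numeric_grad(lambda v: forward_backward(v, w2, x, target)[0], w1)
check_dw2 = numeric_grad(lambda v: forward_backward(w1, v, x, target)[0], w2)
```

The analytic gradients `dw1`, `dw2` should agree with `check_dw1`, `check_dw2` to several decimal places; a mismatch is the usual sign of a chain-rule bug.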
Batch Normalization

Batch normalization standardizes each activation across the batch dimension to zero mean and unit variance, then applies a learned affine transform. Introduced by Ioffe and Szegedy in 2015, it dominated vision for years.
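A training-time sketch of the two steps (standardize across the batch, then apply the learned affine transform). The parameter names `gamma`, `beta`, and `eps` are conventional but are assumptions here, and running statistics for inference are left out.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature across the batch, then apply gamma * x_hat + beta.

    batch: list of examples, each a list of feature values.
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in batch]
    for j in range(n_features):
        col = [row[j] for row in batch]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n  # biased variance over the batch
        for i, v in enumerate(col):
            out[i][j] = gamma[j] * (v - mu) / math.sqrt(var + eps) + beta[j]
    return out

# Hypothetical batch: 3 examples, 2 features on very different scales.
batch = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
normed = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma = 1` and `beta = 0`, both output columns end up with mean 0 and variance close to 1 regardless of their original scale.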
Batch Size

Batch size is the number of training examples averaged into a single gradient step. Larger batches give cleaner gradient estimates but can generalize worse unless the learning rate is retuned; smaller batches are noisier, and that noise acts as an implicit regularizer.
Bayes' Rule

Bayes' rule $P(\theta \mid D) = P(D \mid \theta) P(\theta) / P(D)$ is the math of updating beliefs given evidence. Posterior ∝ likelihood × prior.
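A worked example of the update, using a hypothetical diagnostic test (1% base rate, 95% sensitivity, 5% false-positive rate). The numbers are invented for illustration.

```python
def posterior(prior, p_pos_given_h, p_pos_given_not_h):
    """P(H | positive) via Bayes' rule for a binary hypothesis H."""
    # Evidence P(D): total probability of a positive result.
    evidence = p_pos_given_h * prior + p_pos_given_not_h * (1.0 - prior)
    return p_pos_given_h * prior / evidence

# Hypothetical numbers: prior 1%, sensitivity 95%, false-positive rate 5%.
p = posterior(prior=0.01, p_pos_given_h=0.95, p_pos_given_not_h=0.05)
```

The posterior is 0.0095 / 0.059, roughly 16%: a positive result from a fairly accurate test still leaves the hypothesis unlikely when the prior is small.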
Beta Distribution

The Beta distribution $\text{Beta}(\alpha, \beta)$ is a continuous distribution on $[0, 1]$ and the conjugate prior to the Bernoulli/Binomial.
Bias-Variance Tradeoff

The bias-variance tradeoff is the classical decomposition of prediction error into three additive parts: squared bias, variance, and irreducible noise.
Brownian Motion

A continuous-time stochastic process with independent Gaussian increments. The continuous-state, continuous-time limit of a random walk — and the foundational object for stochastic calculus, diffusion models, and the noise terms in modern stochastic processes.
Cohen's d

The standardized mean difference between two groups. Cohen's d expresses the gap between two group means in units of their pooled standard deviation, and is the most-cited effect size for two-sample comparisons.
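A small sketch of the computation, assuming the common pooled-standard-deviation form of Cohen's d and toy samples made up for the example:

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Toy samples (hypothetical): means 4 and 3, both with sample variance 4.
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 3.0, 5.0]
d = cohens_d(group_a, group_b)
```

Here the means differ by 1 and the pooled standard deviation is 2, so d is 0.5.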
Conjugate Prior

A prior distribution is *conjugate* to a likelihood when multiplying them produces a posterior in the same family as the prior — so Bayesian updates reduce to arithmetic on the prior's parameters instead of an integral.
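The "arithmetic on the prior's parameters" is easiest to see in the Beta-Bernoulli pair: each success adds one to the first parameter and each failure adds one to the second. A minimal sketch, with hypothetical coin-flip data:

```python
def beta_update(alpha, beta, observations):
    """Posterior Beta parameters after observing Bernoulli outcomes (1 = success)."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Start from a uniform prior Beta(1, 1) and observe 3 successes, 1 failure.
post_alpha, post_beta = beta_update(1, 1, [1, 1, 0, 1])
post_mean = beta_mean(post_alpha, post_beta)
```

The posterior is Beta(4, 2) with mean 2/3, obtained without evaluating any integral.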
Cross-Entropy Loss

Cross-entropy loss is $-\sum p \log q$: the average number of nats it costs to encode samples from the true distribution $p$ using a code built for the model's predicted distribution $q$. The *extra* cost over the entropy of $p$ is the KL divergence.
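For classification the true distribution is usually one-hot, and the sum reduces to the negative log-probability assigned to the correct class. A sketch with made-up predicted probabilities:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), in nats. Terms with p_i == 0 contribute 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

label = [0.0, 1.0, 0.0]   # one-hot "true" distribution
pred = [0.2, 0.7, 0.1]    # hypothetical model probabilities
loss = cross_entropy(label, pred)  # equals -log(0.7)
```

Only the probability on the correct class matters here: raising `pred[1]` toward 1 drives the loss toward 0.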
Double Descent

Double descent is the empirical phenomenon where test error, plotted against model size, first goes down (classical regime), then up (peaking near the interpolation threshold), then down again (modern regime).
Dropout

Dropout randomly zeroes a fraction of activations during training, forcing the network to spread its representations across many redundant paths instead of co-adapting onto a few. It is disabled at inference, apart from deliberate uses such as Monte Carlo dropout.
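A sketch of the common "inverted dropout" formulation, which rescales surviving activations during training so that nothing needs to change at inference. The function signature is an assumption for illustration.

```python
import random

def dropout(activations, p_drop, training, rng):
    """Inverted dropout: zero each activation with probability p_drop while training."""
    if not training or p_drop == 0.0:
        return list(activations)  # inference: pass activations through unchanged
    keep = 1.0 - p_drop
    # Survivors are scaled by 1 / keep so the expected value matches inference.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded for reproducibility
x = [1.0] * 1000
train_out = dropout(x, p_drop=0.5, training=True, rng=rng)
eval_out = dropout(x, p_drop=0.5, training=False, rng=rng)
```

In training mode every output is either 0 or 2 (the input scaled by 1 / 0.5), and the average stays near 1; in evaluation mode the input is returned as-is.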
Early Stopping

Early stopping halts training when validation loss starts climbing, even though training loss is still falling. It is among the cheapest regularizers available: no extra parameters, almost no extra compute, and a patience window as its only knob.
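In practice the rule is usually implemented with a `patience` window so that a single noisy validation reading does not end the run. The sketch below assumes that variant; `patience` and the loss values are illustrative, not part of the definition above.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation curve: improves until epoch 2, then climbs.
stop_epoch, best_epoch = early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.75, 0.9])
```

Training would stop at epoch 4, and the checkpoint from epoch 2 is the one to keep.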
Effect Size

The complement to p-values. A p-value tells you whether a difference is unlikely under the null; an effect size tells you *how big* it is. For multi-group designs, the F-statistic and η² are the workhorses.
Eigenvalue and Eigendecomposition

Eigenvalues and eigenvectors decompose a square matrix into directions of pure scaling. For a diagonalizable matrix, the resulting spectral decomposition $A = V \Lambda V^{-1}$ underpins SVD, PCA, Markov mixing time, and the low-rank circuit analyses used in mechanistic interpretability.
Entropy

Entropy $H(X) = -\sum p(x) \log p(x)$ is the average number of nats (or bits, with $\log_2$) needed to encode samples of $X$. It is the unit of uncertainty.
Epoch

An epoch is one full pass over the training set. Classic deep learning trains for tens to hundreds of epochs; modern LLM pretraining often runs for about one epoch or less, so most tokens are seen at most once. Fine-tuning typically runs 1-10 epochs.
Feedforward Network

The feedforward network — the MLP — is the per-position sub-layer that sits next to attention in every transformer block. Two linear layers with an activation in between, applied independently to each token's hidden state.
GELU

GELU is x · Φ(x), where Φ is the standard-normal CDF. A smooth, differentiable-everywhere relative of ReLU, introduced by Hendrycks and Gimpel in 2016 and popularized by BERT and GPT-2; many later transformers use it or a gated variant.
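A sketch of the exact form next to the widely used tanh approximation, using only the standard library. The function names are chosen for this example.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi written via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Compare the two on a small grid of inputs.
grid = [i / 2.0 for i in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
max_gap = max(abs(gelu(x) - gelu_tanh(x)) for x in grid)
```

On this grid the two versions stay very close, which is why the approximation is often used interchangeably with the exact form.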
Gradient Clipping

Gradient clipping caps the norm of the gradient before applying the optimizer step, preventing rare but catastrophic large gradients from blowing up training. A common default is global-norm clipping at threshold 1.0.
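A minimal sketch of global-norm clipping: compute one norm over all parameter gradients together, and rescale everything by the same factor if it exceeds the threshold. Gradients are represented as plain lists here, which is an assumption for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly so their combined L2 norm is at most max_norm.

    grads: list of per-parameter gradients, each a flat list of floats.
    Returns (clipped_grads, original_global_norm).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total_norm <= max_norm:
        return [list(t) for t in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in t] for t in grads], total_norm

# Hypothetical gradients for two parameters; their global norm is 5.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Because every gradient is multiplied by the same scale, the update direction is preserved; only its length changes.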
Gradient Descent

Gradient descent is the iterative optimization procedure that powers virtually all of deep learning. Compute the gradient of the loss with respect to parameters, take a small step in the opposite direction, repeat.
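The loop is short enough to write out in full. The sketch below minimizes the toy function f(x) = (x - 3)², whose gradient is 2(x - 3); the function, step size, and step count are assumptions picked for the example.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def grad_f(x):
    """Gradient of f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)

x_min = gradient_descent(grad_f, x0=0.0, lr=0.1, steps=100)
x_diverged = gradient_descent(grad_f, x0=0.0, lr=1.1, steps=50)
```

With `lr=0.1` the distance to the minimum shrinks by a factor of 0.8 per step and `x_min` lands very close to 3. With `lr=1.1` each step overshoots by more than it corrects, so the iterate moves further from 3 every step.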
Grokking

Grokking is the training-dynamics phenomenon where a model first memorizes the training set, then much later — often suddenly — learns to generalize to held-out data.
Hankel Matrix

A matrix whose anti-diagonals are constant — each entry depends only on the sum of its indices, $H_{ij} = h_{i+j}$. The natural data structure for turning a 1D time series into a 2D matrix you can apply SVD to.
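Building the matrix from a series is a one-liner once the number of rows is fixed. A sketch, where `rows` (the window length) is a parameter name chosen for this example:

```python
def hankel(series, rows):
    """Build a Hankel matrix with H[i][j] = series[i + j]."""
    cols = len(series) - rows + 1
    return [[series[i + j] for j in range(cols)] for i in range(rows)]

H = hankel([1, 2, 3, 4, 5], rows=3)
# H is:
# [1, 2, 3]
# [2, 3, 4]
# [3, 4, 5]
```

Each row is the series shifted by one step, so every anti-diagonal holds a single repeated value.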
KL Divergence

KL divergence $\mathrm{KL}(P \,\|\, Q) = \sum p \log(p/q)$ measures how far one distribution is from another, in nats. It is asymmetric, non-negative, and zero only when the two distributions are identical.
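The asymmetry, and the link to entropy and cross-entropy via H(p, q) = H(p) + KL(p ‖ q), can both be checked numerically. A sketch with two made-up distributions over two outcomes:

```python
import math

def entropy(p):
    """H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) in nats. Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # hypothetical "true" distribution
q = [0.9, 0.1]   # hypothetical model distribution

forward = kl(p, q)
reverse = kl(q, p)
identity_gap = cross_entropy(p, q) - (entropy(p) + kl(p, q))
```

`forward` and `reverse` come out different, `kl(p, p)` is zero, and `identity_gap` is zero up to floating-point error.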
Learning Rate

The learning rate is the scalar η in the gradient-descent update: how big a step to take in the direction of the negative gradient. Too high and training diverges, too low and it stalls; it is usually the single most important hyperparameter to tune.
Markov Chain

A stochastic process where the next state depends only on the current state, not the history that led to it. The 'memoryless' property — encoded in a single transition matrix — turns multi-step prediction into matrix multiplication.
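A sketch of multi-step prediction as matrix multiplication, using a hypothetical two-state chain. The transition probabilities are invented for the example.

```python
def step(dist, P):
    """Advance a row-vector distribution one step: dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical transition matrix: P[i][j] = probability of moving from state i to j.
P = [[0.9, 0.1],
     [0.5, 0.5]]
start = [1.0, 0.0]

two_steps = step(step(start, P), P)   # apply P twice
via_square = step(start, matmul(P, P))  # apply P squared once
```

Both routes give the same two-step distribution, approximately [0.86, 0.14].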
Matrix Factorization

Writing a matrix as a product of smaller or more structured matrices. SVD, NMF, QR, LU, Cholesky, eigendecomposition — same general idea under different structural constraints. Underlies essentially every low-rank method in modern machine learning.
Maximum Likelihood Estimation

MLE picks the parameters $\theta$ that maximize the probability of the observed data under the model — equivalently, that minimize negative log-likelihood. Cross-entropy training is MLE under a categorical model.
Mutual Information

Mutual information $I(X; Y) = H(X) - H(X \mid Y)$ is the reduction in uncertainty about $X$ once you observe $Y$. It is the symmetric, information-theoretic measure of how much two variables share.
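For discrete variables, the same quantity can be computed from the joint table as $\sum p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$. A sketch with two hypothetical joint distributions over binary variables:

```python
import math

def mutual_information(joint):
    """I(X; Y) in nats from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]           # marginal of X
    py = [sum(col) for col in zip(*joint)]     # marginal of Y
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells you nothing about Y
copy = [[0.5, 0.0], [0.0, 0.5]]             # Y is always equal to X

mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
```

Independent variables share 0 nats; when one variable is a copy of a fair coin, they share log 2 nats, which is one bit.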
Normal Distribution

The Gaussian $\mathcal{N}(\mu, \sigma^2)$ is the bell-curve density $\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. It shows up everywhere because of the central limit theorem.
Optimizer

An optimizer is the wrapper around vanilla gradient descent that decides how each parameter actually gets updated. Adam, AdamW, and SGD-with-momentum are the workhorses.
Overfitting

Overfitting is the failure mode where a model memorizes its training set instead of learning patterns that generalize. It's the central concern of classical statistical learning.
Pearson Correlation

Pearson's $r = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)$ measures the strength of a *linear* relationship between two variables, on $[-1, 1]$.
Principal Component Analysis (PCA)

PCA rotates a dataset to align with its directions of maximum variance, then projects onto the top $k$ components. Computed via SVD of the centered data matrix.
ReLU

ReLU is max(0, x) — pass positive inputs through, clamp negatives to zero. The cheap, sharp nonlinearity that made training deep networks finally work, and the dominant hidden-layer activation from 2012 until transformers switched to GELU.
Sigmoid

The sigmoid σ(x) = 1/(1 + e⁻ˣ) squashes any real number into the open interval (0, 1). It was the default neural-network nonlinearity for decades and still survives wherever you need a probability or a gate.
SiLU

SiLU is x · σ(x): the input gated by its own sigmoid. Also known as Swish (the β = 1 case), it is the gate activation in the SwiGLU feedforward layers of Llama, Mistral, and most modern open-weight transformers. In practice it performs about the same as GELU.
Singular Spectrum Analysis (SSA)

SSA is the time-series analog of PCA. Embed a 1-D series into a Hankel trajectory matrix, SVD it, group eigentriples into trend / oscillatory / noise components, and reconstruct.
Singular Value Decomposition (SVD)

$A = U \Sigma V^T$ — every real matrix decomposes into rotation, axis-aligned stretch, and rotation. The single most-used matrix factorization in ML: powers PCA, LoRA, low-rank attention, embedding quantization, SSA, and the spectral analysis of any linear map.
Softmax

Softmax maps a vector of real numbers to a probability distribution: each output is exp(xᵢ) divided by the sum of exp(xⱼ). It is the function that turns logits into next-token probabilities and attention scores into weights.
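A direct implementation overflows on large logits, so the usual approach is to subtract the maximum first, which leaves the result unchanged. A minimal sketch:

```python
import math

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.0, 1.0, 2.0])
probs_shifted = softmax([1000.0, 1001.0, 1002.0])  # would overflow without the shift
```

The two calls return the same distribution, because softmax depends only on the differences between logits.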
Spearman Correlation

Spearman's $\rho$ is Pearson correlation computed on *ranks* instead of raw values. It captures any monotone relationship, linear or curved, and is a natural choice for ranking and retrieval evaluation, where what matters is order.
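The difference from Pearson shows up on a monotone but curved relationship. The sketch below assumes no tied values (ties need averaged ranks, which this simple `ranks` helper does not handle) and uses toy data where y = x³.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n by ascending value. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]  # monotone but not linear

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

The ordering of `y` matches the ordering of `x` exactly, so Spearman's coefficient is 1, while Pearson's falls below 1 because the relationship is not a straight line.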
Tanh

Tanh maps any real number into the open interval (−1, 1). A zero-centered sibling of sigmoid that ruled hidden layers before ReLU, and that still lives in RNN cells, attention temperature tricks, and GELU's tanh approximation.
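The "sibling of sigmoid" relationship is exact: tanh(x) = 2σ(2x) − 1, a sigmoid rescaled to be zero-centered. A quick numerical check, using a plain sigmoid that is only intended for moderate inputs:

```python
import math

def sigmoid(x):
    """Logistic sigmoid; fine for moderate |x|, not hardened against overflow."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh expressed as a shifted, rescaled sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

points = [-5.0, -1.0, 0.0, 0.5, 3.0]
max_gap = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in points)
```

`max_gap` is at floating-point noise level, and `tanh_via_sigmoid(0.0)` is exactly 0, which is the zero-centering the sigmoid lacks.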
Tensor

A tensor is a multidimensional array — the rank-N generalization of scalars (rank 0), vectors (rank 1), and matrices (rank 2). In ML, 'tensor' means an n-dimensional array with a `shape`, `dtype`, and `device`.
Type Systems

A type system is the contract that says 'this variable holds an integer, that function returns a User.' Static type systems (Rust, TypeScript, mypy) catch contract violations before the program runs.
Vector

A vector is an ordered list of numbers — the universal data shape in modern AI. Every embedding, every layer activation, every gradient, every prediction is a vector under the hood.
Weight Decay

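Weight decay shrinks every parameter toward zero at each step. Under plain SGD it is equivalent to L2 regularization, adding `λ ||θ||²` to the loss to penalize large weights; AdamW instead applies the shrinkage directly to the weights, decoupled from the gradient. It biases the optimizer toward simpler functions and is the dominant regularizer in modern LLM training.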
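A sketch of one plain-SGD step with the penalty `λ ||θ||²` folded into the loss. The penalty's gradient is 2λθ, so each step pulls every weight toward zero in proportion to its size. The function name and values are assumptions for illustration, and this is the SGD form only, not a decoupled optimizer such as AdamW.

```python
def sgd_step_l2(theta, grad, lr, lam):
    """One SGD step on loss + lam * ||theta||^2.

    The penalty contributes 2 * lam * theta to the gradient.
    """
    return [t - lr * (g + 2.0 * lam * t) for t, g in zip(theta, grad)]

theta = [1.0, -2.0]
# With a zero task gradient, only the decay term acts on the weights.
shrunk = sgd_step_l2(theta, grad=[0.0, 0.0], lr=0.1, lam=0.5)
```

With these values each weight is multiplied by 1 − 2·lr·lam = 0.9, giving approximately [0.9, −1.8].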