Rerankerszerank-2 · zerank-2-small · zerank-2-nano Embeddingszembed-1 Custom Modelscontext compression · query rewriting · fine-tuning Enterpriseon-prem · dedicated · SLA

Legal Manufacturing Healthcare Finance Customer Support E-Commerce

Documentation Slack Community Discord

BlogEngineering posts, releases, and field notes.ConceptsReference catalog of retrieval + LLM primitives.PlaybooksNamed failure modes with diagnostics and fixes.VersusHead-to-head against every major competitor.EvalsHow we benchmark in production conditions.

Latest from the blog

Matryoshka Is Dead: Why MRL Isn't Lossless for zembed-1

Zemail: Semantic Gmail Search on Claude Code & Cowork

AutoOptimize: Why Your Embedding Model Is the Bottleneck in Agentic AI

Reranking Reddit: What Happens When You Sort Comments by Relevance Instead of Karma

harrier-27b: Can 27B Parameters Beat zembed-1?

Pricing

Rerankerszerank-2 · zerank-2-small · zerank-2-nano Embeddingszembed-1 Custom Modelscontext compression · query rewriting · fine-tuning Enterpriseon-prem · dedicated · SLA

Legal Manufacturing Healthcare Finance Customer Support E-Commerce

Documentation Slack Community Discord

Blog

Matryoshka Is Dead: Why MRL Isn't Lossless for zembed-1

Zemail: Semantic Gmail Search on Claude Code & Cowork

AutoOptimize: Why Your Embedding Model Is the Bottleneck in Agentic AI

Reranking Reddit: What Happens When You Sort Comments by Relevance Instead of Karma

harrier-27b: Can 27B Parameters Beat zembed-1?

Smarter Context Compression for LLM Pipelines: zerank-2 as a Calibrated Classifier

Beyond Binary: A New Version of the MTEB

zembed-1 vs voyage-4: Our Embedding Model Wins on Retrieval

"Let's eat, grandma" vs "let's eat grandma": how embedding models encode the world

Introducing zembed-1: The World's Best Text-Embedding Model

How Assembled Powers High-Quality AI Customer Support with ZeroEntropy

Prompting Best Practices For Instruction-Following Rerankers

Open-source alternatives to Cohere Rerank in 2026

Latency Performance Assessment of zerank-2

Introducing zerank-2: The Most Accurate Multilingual Instruction-Following Reranker

The Latency Myth: Why Reranking Is Still the Smartest Optimization You Can Make

Context Engineering Webinar: Everything You Missed

How Vera Health Achieved State-of-the-Art Clinical Accuracy Using ZeroEntropy

Equall Improves Legal Document Structuring and Retrieval Accuracy with ZeroEntropy

Implementing ZeroEntropy Reranking with turbopuffer Retrieval

Paper TLDR: How we trained zerank-1 with the zELO method

Mem0 Improves Memory Retrieval Accuracy with ZeroEntropy

On The Geometric Limit of Dense Single Vector Embeddings

Should You Use LLMs for Reranking? A Deep Dive into Pointwise, Listwise, and Cross-Encoders

My AskAI Improves Support Agent Latency and Accuracy with ZeroEntropy

Announcing ZeroEntropy's First Rerankers: zerank-1 and zerank-1-small

ZeroEntropy Raises $4.2M Seed Round to Make AI Retrieval Truly Intelligent

Improving Retrieval with ELO Scores

What is a reranker and do I need one?

Deep Dive: The Architecture of ZeroEntropy v1

AGI requires better retrieval, not just better LLMs

LlamaChunk: A General and Cost Efficient Approach to Semantic Chunking

LegalBench-RAG, the First Open-Source Retrieval Benchmark for the Legal Domain

Pricing Evals Sign in

Concepts / Performance Engineering

Topic · 25 concepts

Performance Engineering

Squeezing throughput, latency, and memory out of GPUs.

Modern AI is an exercise in moving tensors through GPU memory faster than anyone else. The concepts below cover the hardware abstraction (GPU memory hierarchy — HBM/SRAM/registers, arithmetic intensity, the roofline model), the parallelism strategies that let single jobs span multiple devices (tensor / pipeline / data parallel, FSDP), the inference-time tricks that compound (PagedAttention, kernel fusion, graph compilation, mixed precision), and the metrics that tell you whether you're actually using the silicon (MFU, throughput, latency tail). This is where the engineering depth of an AI team shows.

Arithmetic Intensity

Arithmetic intensity is FLOPs per byte read from memory. Combined with the hardware's compute-to-bandwidth ratio it determines whether a kernel is memory-bound (intensity below the ridge) or compute-bound (above).
AWQ — Activation-Aware Weight Quantization

INT4 weight-only quantization that protects the *salient* weight channels — the ones multiplied by large activations — by absorbing a per-channel scale into the weights *before* rounding.
CUDA Programming

NVIDIA's parallel-computing platform for GPUs. A C++ extension plus a runtime that exposes the GPU as a SIMT machine: thousands of threads grouped into warps, blocks, and grids, with explicit control over a tiered memory hierarchy.
Data Parallelism

Replicate the model on every GPU, shard the batch across replicas, and synchronize gradients with an allreduce after each backward pass. The simplest distributed-training pattern, and the default for any model that fits on a single device.
FP8

Two IEEE-style 8-bit float variants, E4M3 (1 sign, 4 exponent, 3 mantissa) and E5M2 (1 sign, 5 exponent, 2 mantissa). E4M3 has higher precision and narrower range, used for forward activations and weights.
FSDP (Fully Sharded Data Parallel)

FSDP shards parameters, gradients, and optimizer state across data-parallel ranks instead of replicating them. Each rank holds only 1/N of the weights at rest and gathers full layers on the fly during forward and backward.
GGUF and the K-Quant Family

The file format and quantization scheme that powers `llama.cpp` — the de-facto local-inference stack for LLMs on commodity hardware. GGUF embeds tokenizer, chat template, and quantized weights in a single mmap-able artifact.
GPTQ — Hessian-Based Post-Training Quantization

Layer-by-layer 4-bit weight quantization that minimizes layer-output reconstruction error using a Hessian computed from a small calibration set.
GPU Kernel Authoring

Writing custom GPU kernels — in CUDA C++, Triton, CUTLASS, or ThunderKittens — when the off-the-shelf library is leaving performance on the table.
GPU Memory Hierarchy

Modern GPUs have three relevant memory levels: HBM (slow, abundant), SRAM (fast, tiny), and registers (fastest, tiniest). HBM bandwidth is roughly 1-3 TB/s; on-chip SRAM bandwidth is closer to 20 TB/s.
Gradient Accumulation

Run multiple micro-batches sequentially, summing their gradients into a buffer, before applying a single optimizer step. Lets you simulate a large effective batch on memory-constrained hardware. The standard trick for hitting target batch sizes when a single batch won't fit on the GPU.
Gradient Checkpointing

Trade compute for memory by recomputing forward activations during the backward pass instead of storing them. Roughly 5x memory savings on activations at a cost of ~30% slower training.
Inference Graph Compilation

Capture a model's computation as a static graph, optimize it (operator fusion, constant folding, attention specialization, kernel selection), and emit a compiled artifact that runs without Python overhead. torch.compile, TensorRT-LLM, ONNX Runtime.
Kernel Fusion

Combining multiple GPU operations into a single CUDA kernel call so that intermediate tensors live in registers or shared memory instead of round-tripping through HBM.
Mixed-Precision Training

Train with bf16 or fp16 activations and weights instead of fp32, while keeping master weights and optimizer accumulations in fp32 for numerical stability.
Model FLOPs Utilization (MFU)

MFU is achieved FLOPs divided by theoretical peak FLOPs — the headline efficiency metric for whether you're actually using the GPU. Realistic targets in 2026: 40-60 percent during pretraining is good, 50 percent-plus is excellent.
Model Quantization

Compressing the weights of a trained model from fp16 / bf16 down to int8, int4, fp8, or fp4 representations to fit larger models on smaller hardware and increase inference throughput.
MXFP4

The Open Compute Project's microscaling 4-bit float. Blocks of 32 elements share a single 8-bit power-of-2 scale (E8M0); each element is a 4-bit micro-float (E2M1). Effective storage: 4.25 bits per weight.
NF4 — NormalFloat 4-Bit Quantization

A 4-bit weight format with 16 levels placed at the *equiquantiles* of the standard normal distribution rather than uniformly. Trained-network weights are approximately $\mathcal{N}(0, \sigma^2)$, so spending bits where the mass actually lives.
NVFP4

NVIDIA's variant of microscaling FP4, introduced with Blackwell. Blocks of 16 elements (vs MXFP4's 32) with an E4M3 FP8 scale (vs MXFP4's power-of-2 E8M0), plus an optional outer FP32 scale across multiple blocks.
PagedAttention

PagedAttention is the KV-cache memory manager behind vLLM. It treats the KV cache like an OS treats process memory — fixed-size blocks of 16 tokens, mapped through a per-sequence block table.
Pipeline Parallelism

Pipeline parallelism splits a model's layers across GPU stages and feeds micro-batches through them in an assembly-line schedule. GPipe (2018) introduced the basic idea; 1F1B (PipeDream, 2019) reduced its memory footprint.
PyTorch Internals

How PyTorch actually executes a forward pass: a `torch.Tensor` is a thin Python object wrapping a storage, view, dtype, and device; every op routes through a C++ dispatcher that picks the right backend kernel for the (dtype, device, layout) tuple.
Quantization-Aware Training (QAT)

Training a model with quantization simulated in the forward pass so the weights co-adapt to low-precision rounding. Recovers quality that post-training quantization loses, at the cost of a fine-tuning run. The standard recipe for sub-4-bit deployment.
Tensor Parallelism

Tensor parallelism splits each individual matrix multiplication across multiple GPUs — column-split or row-split with an all-reduce — so weights too large for one GPU's memory still produce one combined result.