Topic · 16 concepts

Production

From notebook to live traffic.

The patterns and pitfalls that appear when retrieval moves out of the demo and into production. The concepts below cover the operational discipline that separates RAG that works from RAG that breaks: tail latency under bursty load, context compression before the LLM call, drift monitoring on calibrated relevance scores, and the load-balancing tradeoffs of running specialized models behind tight SLOs. These topics are less glamorous than the model-architecture material, but they tend to be where production ZeroEntropy deployments actually win or lose.

Caching Strategies

Three layers of caching for LLM-driven systems: exact-match (request → response), prompt-prefix (KV cache reuse for shared prefixes), and semantic (similar-query reuse via embeddings). Each helps different production workloads in different ways.
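
The exact-match layer is the simplest of the three to illustrate. The sketch below is a minimal in-memory version, assuming a hypothetical `compute` callable that stands in for the real LLM call; the class and method names are illustrative, not part of any library.

```python
import hashlib
import json


class ExactMatchCache:
    """Exact-match layer: identical (model, prompt, params) returns the stored response."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model, prompt, params):
        # Canonical JSON so that dict key ordering does not change the cache key.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, model, prompt, params, compute):
        """Return (response, was_hit). `compute` is a hypothetical stand-in for the LLM call."""
        key = self._key(model, prompt, params)
        if key in self._store:
            return self._store[key], True
        response = compute(prompt)
        self._store[key] = response
        return response, False
```

Sampling parameters are part of the key here because the same prompt at a different temperature is not the same request.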

Context Compression

Context compression shrinks a retrieval result set or agent trace down to just the spans the LLM actually needs, before sending it to the model. Crucial for long-running agentic systems where context blows past the model's effective attention window.
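
One way to picture this is a score-and-filter pass over retrieved spans. The sketch below assumes each span already carries a relevance score in [0, 1] (for example from a reranker) and uses a crude whitespace token count; `compress_context`, `min_score`, and `token_budget` are hypothetical names chosen for illustration.

```python
def compress_context(spans, min_score=0.5, token_budget=50):
    """Keep the highest-scoring spans that fit the budget, in original order.

    spans: list of (text, relevance_score) tuples.
    """
    # Visit candidates from most to least relevant.
    ranked = sorted(range(len(spans)), key=lambda i: spans[i][1], reverse=True)
    kept, used = [], 0
    for i in ranked:
        text, score = spans[i]
        if score < min_score:
            break  # every remaining span scores lower still
        cost = len(text.split())  # crude whitespace token count (assumption)
        if used + cost > token_budget:
            continue  # skip spans that would overflow the budget
        kept.append(i)
        used += cost
    # Restore document order so the LLM sees spans in their original sequence.
    return [spans[i][0] for i in sorted(kept)]
```

Restoring the original order is a design choice: it keeps the surviving spans readable as a sequence rather than as a ranked list.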

Continuous Batching

The vLLM-style scheduling technique in which requests join and leave a running batch dynamically, at each decoding step. It greatly improves GPU utilization for variable-length generation compared with naive static batching, and is the default in most modern LLM serving stacks.
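
The utilization gain can be seen in a toy simulation. The sketch below counts decoding steps only, assuming every request needs at least one token and each step advances every in-flight request by one token; it is a scheduling illustration, not how any particular serving framework is implemented.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: a finished request frees its slot immediately."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Decode one token per request; drop the ones that just finished.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps
```

With output lengths `[8, 1, 8, 1]` and two slots, the static scheduler needs 16 steps because each short request waits on its long batch-mate, while the continuous scheduler needs 9.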

Cost per Token

The economics primitive of LLM-driven systems. Per-token pricing — input and output, with output usually 3-5× input — is what makes a feature financially viable or not. Production decisions are dominated by this number more than any other.
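
The arithmetic is simple enough to keep in a helper. The prices in the usage note are made-up placeholders (output at 4× input, inside the 3-5× range mentioned above), not any provider's actual rates.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request, given prices in USD per million tokens."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


def monthly_cost_usd(requests_per_month, **per_request):
    """Scale a single-request cost to a monthly volume."""
    return requests_per_month * request_cost_usd(**per_request)
```

For a hypothetical feature sending 2,000 input tokens and receiving 500 output tokens at $1 and $4 per million tokens, one request costs $0.004, so a million requests a month costs about $4,000. Note that the 500 output tokens cost as much as the 2,000 input tokens.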

Drift Detection

Monitoring distributional shift in inputs, outputs, or intermediate signals of a retrieval or LLM pipeline. The discipline that catches 'the metric is silently moving' before users notice.
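
One common way to quantify shift in a scalar signal, such as a relevance score, is the Population Stability Index (PSI). The sketch below is a minimal stdlib version, assuming equal-width bins taken from the baseline sample; PSI is one choice among several (a KS test or KL divergence would also fit), and alert thresholds are workload-specific.

```python
import math


def population_stability_index(baseline, current, bins=10, eps=1e-6):
    """PSI between two samples of a scalar signal; 0.0 means identical histograms."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the log below stays finite.
        return [max(c / len(values), eps) for c in counts]

    p, q = histogram(baseline), histogram(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

In practice the baseline would be a frozen reference window and `current` a rolling window of recent traffic.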

LanceDB

LanceDB is an open-source vector database built on the Lance columnar format: append-only, with a Rust core and columnar on-disk storage. It is designed to handle incremental indexing without full index rebuilds.

Latency Tail (P95, P99)

P50 is the median latency; P95 and P99 are the 95th and 99th percentiles. The tail is what wakes on-call, not the median: a 200 ms median with a 5 s P99 means 1% of requests take 5 seconds or longer, and the users behind them see your system as broken.
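
The percentiles themselves are easy to compute from raw samples. The sketch below uses the nearest-rank definition, which is one of several conventions; monitoring systems often use interpolated or sketch-based estimates instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [200] * 98 + [5000] * 2
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Here P50 and P95 both read 200 ms while P99 reads 5000 ms, which is why a dashboard showing only the median would look healthy.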

LLM Observability

The operational discipline of monitoring LLM-driven systems: tracing per-call inputs and outputs, running eval-in-prod against held-out sets, detecting drift on inputs and outputs, and tracking latency and cost percentiles.

MDX

MDX is Markdown extended with JSX: you write prose that imports components and renders them inline. It is the format behind most modern docs sites, including those built with Next.js, Astro, and Docusaurus.

PII Redaction

Detecting and removing personally-identifiable information from LLM inputs and outputs — names, emails, phone numbers, addresses, IDs. A classic small-model task: high-volume, narrow, latency-sensitive, with structured target output.
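
For comparison with the small-model approach, a regex baseline shows the shape of the task: text in, text with typed placeholders out. The patterns below are deliberately simple assumptions that cover only emails and phone-like digit runs; names and addresses are exactly the cases where patterns fall short and a model is needed.

```python
import re

# Simplified patterns (assumptions): they will miss edge cases and can over-match.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace emails and phone-like numbers with typed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Typed placeholders such as `[EMAIL]` keep the redacted text useful to a downstream LLM, which still learns that an email address was present.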

Pydantic

Pydantic is the runtime type-validation library that has quietly become a hard dependency of the Python ML ecosystem. You declare a `BaseModel`, get validation, JSON-schema export, and a v2 Rust core for free.

Semantic Cache

A semantic cache returns a cached LLM response when an incoming query is similar enough — by embedding cosine similarity — to a previous query, rather than requiring exact-string match.
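
The lookup logic can be sketched in a few lines. The version below assumes a toy bag-of-words `toy_embed` function purely so the example runs without a model; a real deployment would call an embedding model and use an ANN index instead of a linear scan, and the 0.8 threshold is an arbitrary placeholder.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding (assumption): bag-of-words counts, not a real model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self.entries = []  # list of (query_embedding, response)

    def store(self, query, response):
        self.entries.append((self.embed(query), response))

    def lookup(self, query):
        """Return the response of the most similar stored query, or None on a miss."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, set it too high and the cache degenerates into exact match.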

Speculative Decoding

Use a small 'draft' model to predict the next several tokens, then have the big 'target' model verify them in a single forward pass. The standard latency-reduction trick for LLM inference — typically 2-4× faster generation at the same output quality.
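
The draft-then-verify loop can be sketched for the greedy case. The code below assumes both models are plain functions from a token list to the next token, and it simulates the target's single verification pass with per-position calls; sampling-based variants use a probabilistic acceptance rule that is not shown here. All names are hypothetical.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding. Returns (tokens, number_of_target_passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks every proposed position. A real model scores
        #    all of them in one batched forward pass; here we count it as one.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if proposal[i] != expected:
                break  # first mismatch: discard the rest of the proposal
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes


def toy_target(seq):
    """Toy target model (assumption): counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    """Toy draft model: agrees with the target except right after token 2."""
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10
```

Because every emitted token is the target's own greedy choice, the output matches target-only decoding; the saving comes from needing fewer target passes than tokens generated.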

Throughput (Tokens per Second)

Tokens per second per GPU is the production planning metric for LLM serving. Throughput scales with batch size up to a memory-bound ceiling, then plateaus. The key number for capacity planning, autoscaling, and unit-economic analysis.
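
A back-of-the-envelope capacity calculation follows directly from this metric. The sketch below assumes a measured per-GPU throughput at the batch size you actually run, plus a hypothetical `target_utilization` headroom factor; all numbers in the usage note are placeholders.

```python
import math


def gpus_needed(requests_per_sec, tokens_per_request, tokens_per_sec_per_gpu,
                target_utilization=0.7):
    """GPUs required to serve peak token demand with some headroom."""
    demand = requests_per_sec * tokens_per_request         # tokens/sec at peak
    usable = tokens_per_sec_per_gpu * target_utilization   # leave room for bursts
    return math.ceil(demand / usable)
```

For example, 20 requests per second at 300 generated tokens each is 6,000 tokens per second; at a hypothetical 2,500 tokens per second per GPU and 70% target utilization, that rounds up to 4 GPUs.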

Vector Database

A vector database is a database whose primary index is an approximate-nearest-neighbor structure over high-dimensional vectors. The system substrate for production dense retrieval — it wraps an ANN algorithm (HNSW, IVF, PQ) with persistence, replication, metadata filtering, and incremental updates.
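
To make the query contract concrete, the sketch below implements exact (brute-force) top-k search with a metadata filter. This is the result an ANN index such as HNSW approximates at much lower cost; it is not an ANN algorithm itself, and the record layout and `where` parameter are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query, k=2, where=None):
    """Exact top-k by cosine similarity over records matching a metadata filter.

    records: list of dicts with 'id', 'vector', and 'meta' keys (assumed layout).
    where: optional dict of metadata key/value pairs that must all match.
    """
    candidates = [
        r for r in records
        if where is None
        or all(r["meta"].get(key) == val for key, val in where.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query, r["vector"]), reverse=True)
    return [r["id"] for r in ranked[:k]]
```

The linear scan is fine for a few thousand vectors; the ANN structure and the persistence, replication, and update machinery around it are what a vector database adds beyond this.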

vLLM Serving

vLLM is the dominant open-source LLM serving framework. Its core innovations are PagedAttention for KV-cache memory management, continuous batching for throughput, and prefix caching for prompt reuse.