Topic · 18 concepts

Data

The corpora, curation, and quality decisions that make models possible.

Training data is the single largest determinant of model quality, and most of what's interesting about modern AI traces back to choices made at the data layer. The concepts below cover the canonical pretraining corpora (Common Crawl, FineWeb, the closed-source frontier datasets), the curation steps that separate good data from web sludge (deduplication, quality filtering, contamination checks), and the human and programmatic labeling layers (RLHF preferences, weak supervision, synthetic data) that produce post-training signal. If you're building a language model, an embedding model, or a reranker, you'll spend more engineering hours here than anywhere else.

Common Crawl

Common Crawl is the largest open web crawl — roughly 250 billion pages across 100+ monthly snapshots, distributed as WARC files. It's the raw material almost every public LLM corpus is built from.
Data Augmentation

Synthetic perturbation of training examples to expand a dataset's effective size — paraphrase and back-translation for text, rotation and crop for images, in-batch contrastive views for embeddings.
Data Contamination

When evaluation data leaks into training data, inflating benchmark scores without improving real capability. Detected via n-gram match against eval sets, log-probability attacks, or membership inference.
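
The n-gram match mentioned above can be sketched in a few lines. This is a minimal illustration, not a production detector: the whitespace tokenizer, the window size `n=8`, and the helper names `ngrams` and `contamination_rate` are all assumptions made for the example.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in the lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(train_doc: str, eval_examples: list[str], n: int = 8) -> float:
    """Fraction of eval examples that share at least one n-gram with train_doc."""
    if not eval_examples:
        return 0.0
    train_grams = ngrams(train_doc, n)
    hits = sum(1 for ex in eval_examples if ngrams(ex, n) & train_grams)
    return hits / len(eval_examples)


# Toy data (hypothetical): the first eval example overlaps the training text.
train_doc = "the quick brown fox jumps over the lazy dog near the quiet river bank"
eval_examples = [
    "quick brown fox jumps over the lazy dog near the river",
    "an unrelated question about parquet row groups and column statistics",
]
rate = contamination_rate(train_doc, eval_examples, n=8)
```

At corpus scale the training-side n-grams would not fit in a Python set; a Bloom filter or a suffix array is the usual substitute, but the matching logic stays the same.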
Data Curation

The umbrella discipline of preparing a corpus for training: filtering, deduplication, quality scoring, language ID, and classifier-based selection.
Data Engineering at Scale

What changes when your dataset doesn't fit on one machine, doesn't fit in RAM, and takes hours per pass. The thresholds where ad-hoc Python stops working: single-file → sharded, RAM → streaming, single-node → distributed.
Data Formats

Data formats are the contracts between memory and disk: how a structured record turns into bytes that can be read back later. The choice of format determines whether you can scan a terabyte in a second or an hour, and whether schema evolution breaks readers.
Data Labeling

Human-in-the-loop annotation of training data — crowdsourced (Mechanical Turk, Scale, Surge), expert (domain specialists), and gold-standard sets. Distinct from RLHF preferences.
Data Mixing

The ratio decisions in a pretraining corpus — what fraction of web vs code vs math vs books vs scientific papers. Second-most-important choice in pretraining after corpus selection itself.
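
One way to picture a data mix is as sampling weights over sources. The sketch below assumes a hypothetical four-way mixture; the weights in `MIX` are invented for illustration and are not taken from any real training recipe.

```python
import random

# Hypothetical mixture weights (must sum to 1); real ratios are tuned empirically.
MIX = {"web": 0.6, "code": 0.2, "math": 0.1, "books": 0.1}


def sample_sources(mix: dict[str, float], k: int, seed: int = 0) -> list[str]:
    """Draw k source names, each with probability equal to its mixture weight."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=k)


draws = sample_sources(MIX, k=10_000)
# Empirical fraction of draws per source; approaches MIX as k grows.
observed = {name: draws.count(name) / len(draws) for name in MIX}
```

In a real pipeline each draw would pull the next document or token block from that source's shard iterator, so the observed token fractions track the configured ratios.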
Dataset Cards

Structured metadata describing a dataset's provenance, license, size, intended use, limitations, and ethical considerations. The HuggingFace dataset-card schema is the de facto standard, and every shipped dataset should have one.
Deduplication

Removing exact and near-duplicate documents from a training corpus. Exact dedup uses content hashing; near-dedup uses MinHash, SimHash, or embedding-based similarity.
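
Both flavors can be sketched with the standard library alone. This is a toy version under stated assumptions: word 3-gram shingles, 64 salted BLAKE2b hashes standing in for MinHash permutations, and hypothetical helper names. Real pipelines add locality-sensitive hashing so that not every pair of documents has to be compared.

```python
import hashlib


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each byte-identical document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, k: int = 3) -> set[str]:
    """Word-level k-grams; short texts yield a single (possibly partial) shingle."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(a: str, b: str) -> float:
    """Fraction of matching signature slots, an estimate of shingle-set Jaccard similarity."""
    sig_a, sig_b = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc = "the committee approved the annual budget after a long and heated debate"
near = "the committee approved the annual budget after a long and heated discussion"
other = "parquet stores each column separately with compression and per chunk statistics"
```

A near-dedup pass would then drop one document from every pair whose estimated similarity exceeds a chosen threshold; the threshold itself is a tuning decision.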
Distributed Data Processing

Spark, Ray Data, Beam, and Dask are the frameworks that turn N nodes into one logical computer. The map-reduce mental model still rules: per-partition compute is cheap, cross-partition compute requires a shuffle, and the shuffle is the dragon.
FineWeb

FineWeb is a 15-trillion-token English web corpus released by HuggingFace in 2024, distilled from 96 Common Crawl snapshots through an aggressive filtering and deduplication recipe.
JSONL

JSONL — JSON Lines, also called NDJSON — is one JSON object per line. Brutally simple, ubiquitous in ML datasets and log shipping, friendly to streaming and append-only writes.
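
A minimal round trip shows why the format streams well: each record is serialized on its own line, so a reader never needs more than one line in memory. The records and helper names below are illustrative, and an in-memory buffer stands in for a file.

```python
import io
import json

# Sample records; the second contains a newline inside a string value.
records = [{"id": 1, "text": "hello"}, {"id": 2, "text": "line\nbreak"}]


def write_jsonl(rows, fp) -> None:
    """Write one JSON object per line; json.dumps escapes embedded newlines."""
    for row in rows:
        fp.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(fp):
    """Lazily yield one parsed object per non-empty line."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


buf = io.StringIO()  # stands in for open("data.jsonl", "w")
write_jsonl(records, buf)
buf.seek(0)
round_tripped = list(read_jsonl(buf))
```

Appending is just another `write_jsonl` call on a file opened in append mode, which is what makes the format convenient for logs and incremental dataset builds.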
Parquet

Parquet is a columnar on-disk format that has become the default way to store multi-TB datasets. Rows are grouped into row groups; each row group stores its columns separately as compressed column chunks, with per-chunk statistics.
Pretraining Corpus

The trillion-token text mixture that an LLM consumes during pretraining. Composition (web, code, books, math, scientific papers) and mixing ratios are the two highest-leverage choices in pretraining.
Streaming Datasets

When your dataset doesn't fit on local disk, you stream it from object storage as tar or Parquet shards. Sequential reads of large objects are 10-100x faster than random access on S3, so streaming formats pack many samples into each large shard.
Weak Supervision

Programmatic labeling: write rules and heuristics as labeling functions, then aggregate their noisy, conflicting votes with a label model that denoises them into training labels. The Snorkel paradigm.
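
The idea can be sketched with plain functions. The labeling functions below are invented for a hypothetical complaint-detection task, and a simple majority vote stands in for the label model; it is not the Snorkel API, and Snorkel's own label model additionally weights functions by estimated accuracy.

```python
from collections import Counter

ABSTAIN, NEG, POS = -1, 0, 1  # POS = complaint, NEG = not a complaint


# Hypothetical labeling functions: each votes on one example or abstains.
def lf_mentions_refund(text: str) -> int:
    return POS if "refund" in text.lower() else ABSTAIN


def lf_says_thanks(text: str) -> int:
    return NEG if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text: str) -> int:
    return POS if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_refund, lf_says_thanks, lf_many_exclamations]


def majority_vote(text: str) -> int:
    """Aggregate non-abstaining votes; abstain when nothing fires or on a tie."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence, no clear winner
    return ranked[0][0]


labels = [majority_vote(t) for t in [
    "I want a refund now!!",
    "Thanks for the quick help",
    "Where is my order?",
    "Thanks, but I want a refund",
]]
```

Examples that end up as `ABSTAIN` are typically left out of the training set, and the downstream model is trained on the rest.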
Web Scraping

The engineering pipeline for harvesting text data from the public web — crawlers, robots.txt, JS rendering, deduplication-as-you-go, rate limits, and politeness.
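
The robots.txt and politeness pieces can be illustrated with Python's standard-library `urllib.robotparser`. The robots.txt content, the bot name `example-bot`, and the URLs are made up for the example, and the rules are parsed from a string rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a live crawler would use set_url() and read() instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "example-bot"  # hypothetical crawler name

# Check each URL against the parsed rules before fetching it.
allowed = parser.can_fetch(USER_AGENT, "https://example.com/docs/page.html")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data.html")

# Seconds to wait between requests, or None if the site does not specify one.
delay = parser.crawl_delay(USER_AGENT)
```

A polite crawler would skip any URL where `can_fetch` returns `False` and sleep for at least `delay` seconds between requests to the same host.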