Topic · 13 concepts

Multimodal

When text isn't the only signal — vision, audio, and joint embedding spaces.

Multimodal models map images, audio, and video into the same representational substrate as text. The concepts below cover the architectural patterns (Vision Transformers, CLIP-style joint encoders, vision-language models that paste image tokens into an LLM context), the generation side (diffusion models, text-to-image), and the retrieval consequences (cross-modal search where a text query retrieves images, or vice versa). Multimodal support is now table stakes for any production AI product whose users hand the model a screenshot, a PDF, or a voice note.

Audio Embeddings

Audio embeddings map a waveform or spectrogram into a fixed-size vector space where similar-sounding clips land near each other. Wav2Vec 2.0, HuBERT, and BGE-Audio set the modern recipe.

CLIP

CLIP (Contrastive Language-Image Pretraining, Radford et al. 2021) is a dual-encoder model that embeds images and text into a shared vector space. It is trained contrastively on 400M (image, caption) pairs scraped from the web.
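
As a rough illustration of the contrastive objective described above, the sketch below computes a symmetric softmax cross-entropy over a tiny batch of (image, text) embedding pairs. The function names, the toy 2-dimensional vectors, and the fixed temperature of 0.07 are assumptions made for this example; a real CLIP model produces the embeddings with trained encoders and learns the temperature.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; temperature is an assumed fixed value."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))

# Toy batch: matched pairs sit on the diagonal.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
# Same embeddings, but each caption is paired with the wrong image.
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, which is the pressure that pulls matching images and texts together in the shared space.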

Diffusion Model

A diffusion model generates images by iteratively denoising pure Gaussian noise. The forward process gradually adds noise to a real image; the reverse process is a learned neural network that removes it step by step.
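
The forward (noising) half of this process has a closed form, which the sketch below illustrates on a 3-value toy "image". It assumes a DDPM-style linear noise schedule (1000 steps, betas from 1e-4 to 0.02); the helper names and the schedule are choices made for this example, and the learned reverse network is deliberately left out.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed DDPM-style linear schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta_t): the fraction of signal variance kept."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def add_noise(x0, t, abar, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps in one shot."""
    signal = math.sqrt(abar[t])
    noise_scale = math.sqrt(1.0 - abar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + noise_scale * e for x, e in zip(x0, eps)]
    return xt, eps  # eps is what a denoising network would be trained to predict

rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
x0 = [0.5, -0.25, 1.0]                     # hypothetical pixel values
x_early, _ = add_noise(x0, 10, abar, rng)  # still close to x0
x_late, _ = add_noise(x0, 999, abar, rng)  # almost pure Gaussian noise
print(abar[10], abar[999])
```

Early steps keep nearly all of the signal while the final step keeps almost none, which is why sampling can start from pure Gaussian noise and run the learned reverse process back toward an image.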

Flow Matching

Flow matching is a generative-modeling objective that learns a continuous vector field transporting noise to data along straight or curved probability paths. It generalizes and often replaces diffusion, offering simpler training and faster sampling, and it is the substrate behind SD3, Flux, and Veo.
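
A minimal sketch of the straight-path case, assuming the linear interpolation x_t = (1 - t) * x0 + t * x1 between a noise sample x0 and a data sample x1. All function names here are hypothetical, and the "oracle" velocity stands in for a trained network so the example stays self-contained.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Time derivative of the straight path: constant and equal to x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

def euler_sample(predict_velocity, x0, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / num_steps
    for k in range(num_steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy setup: a dataset with one point, so the ideal field is known exactly.
data_point = [2.0, -1.0]

def oracle_velocity(x, t):
    # Stand-in for a trained network; only valid for t < 1.
    return [(d - xi) / (1.0 - t) for d, xi in zip(data_point, x)]

noise = [0.3, -0.7]
loss = flow_matching_loss(oracle_velocity, noise, data_point, t=0.5)
sample = euler_sample(oracle_velocity, noise)
print(loss, sample)
```

Because the target velocity along a straight path is constant, a handful of Euler steps is enough to land on the data point in this toy case, which hints at why straight paths allow fast sampling.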

Image Encoder

An image encoder maps a raw image into a sequence of patch embeddings or a pooled vector. Modern multimodal stacks use Vision Transformer (ViT) encoders that split the image into patches of 16×16 or 14×14 pixels and treat each patch as a token.

Multimodal RAG

Multimodal RAG is retrieval-augmented generation where the query, the documents, or both span multiple modalities — PDFs with figures, screenshots, voice queries, or image-grounded answers.
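
The retrieval step can be pictured as nearest-neighbor search in a joint embedding space. The sketch below assumes every item, whatever its modality, has already been embedded into the same space; the file names and 3-dimensional vectors are made up for illustration and do not come from any real encoder.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_vec, index, top_k=2):
    """Rank every indexed item by cosine similarity to the query vector."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# Hypothetical joint-space embeddings for an image, a text file, and an audio clip.
index = {
    "chart.png": [0.9, 0.1, 0.0],
    "memo.txt": [0.1, 0.9, 0.1],
    "call.wav": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text query such as "quarterly revenue chart".
query = [0.8, 0.2, 0.1]
results = search(query, index)
print(results)
```

The retrieved items, images included, are then handed to the generator as context; how they are serialized (raw pixels for a VLM, OCR text, captions) is a separate design choice this sketch does not cover.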

OCR (Optical Character Recognition)

OCR converts image regions containing text into machine-readable strings. Classical pipelines (Tesseract, Google Cloud Vision, AWS Textract) detect text regions then recognize them via CNN+LSTM.

SigLIP

SigLIP (Zhai et al. 2023) replaces CLIP's softmax contrastive loss with a per-pair sigmoid loss, decoupling each (image, text) pair from the rest of the batch.
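
To make the per-pair structure concrete, here is a small sketch of a sigmoid pairwise loss. It assumes unit-normalized embeddings and fixes the scale and bias at 10 and -10 for illustration; in training these are learned parameters, and the function names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Sum of independent binary losses over all (image, text) pairs,
    divided by the batch size. Inputs are assumed unit-normalized."""
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            sim = sum(a * b for a, b in zip(img, txt))
            label = 1.0 if i == j else -1.0  # +1 for matched pairs, -1 otherwise
            total += -log_sigmoid(label * (scale * sim + bias))
    return total / n

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = siglip_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = siglip_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

Each term depends only on its own pair, with no softmax normalizer shared across the batch, so the loss for one pair does not change when other pairs are added or removed.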

Text-to-Image

Text-to-image is the generation capability where a natural-language prompt produces an image. The dominant architecture is a CLIP-conditioned latent diffusion model.

Vision Transformer (ViT)

The Vision Transformer applies a standard transformer to image patches instead of words. An image is cut into a grid of fixed-size patches (typically 16×16 pixels each); each patch is linearly embedded into a token, and the resulting sequence is fed to a transformer encoder together with positional encodings.
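
The patching step can be sketched in a few lines. The example below works on a single-channel image stored as nested lists and omits the linear embedding and positional encodings; the helper names and the 224-pixel input size used in the token-count check are assumptions for illustration.

```python
def patchify(image, patch_size):
    """Split an H x W grid of pixels into flattened, non-overlapping patches,
    ordered left to right, top to bottom."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

def num_patches(height, width, patch_size):
    """Sequence length the transformer sees (before any class token)."""
    return (height // patch_size) * (width // patch_size)

# 8x8 toy image whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
tokens = patchify(image, patch_size=4)

print(len(tokens), len(tokens[0]))   # 4 patches of 16 values each
print(num_patches(224, 224, 16))     # 196
print(num_patches(224, 224, 14))     # 256
```

Shrinking the patch from 16 to 14 pixels raises the token count for the same image, which increases compute because attention cost grows with sequence length.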

Vision-Language Model (VLM)

A vision-language model is an LLM that can see. Image patch embeddings are projected into the LLM's token space and concatenated with text tokens; the model treats them as a uniform sequence and generates text autoregressively.
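
A toy sketch of how that input sequence might be assembled. The dimensions, the projection weights, and the choice to place image tokens before text tokens are all assumptions for illustration; real models learn the projection and differ in where the image tokens are spliced in.

```python
def project(patch_embedding, weight):
    """Linear projection: one output value per row of `weight`."""
    return [sum(w * x for w, x in zip(row, patch_embedding)) for row in weight]

def build_input_sequence(patch_embeddings, text_token_embeddings, weight):
    """Map vision features into the LLM's embedding width, then concatenate."""
    image_tokens = [project(p, weight) for p in patch_embeddings]
    return image_tokens + text_token_embeddings

# Hypothetical sizes: vision encoder width 3, LLM embedding width 2.
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]
patches = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # two image patch embeddings
text = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # three text token embeddings

sequence = build_input_sequence(patches, text, weight)
print(len(sequence), [len(tok) for tok in sequence])
```

After projection every element has the same width, so the language model can attend over image and text positions without any modality-specific handling.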

Visual Question Answering

Visual question answering (VQA) is the task of producing a natural-language answer to a question about an image. It is the canonical benchmark for vision-language models because it forces grounding.

Whisper ASR

Whisper (OpenAI, 2022) is an encoder-decoder transformer for automatic speech recognition trained on 680K hours of weakly supervised multilingual audio.