- LLM pipelines are bottlenecked by context size — most of the text you feed an LLM is irrelevant to the task at hand.
- zerank-2’s calibrated scores turn the reranker into a binary classifier: score each chunk against a query, threshold at a chosen cutoff, and keep only what’s relevant.
- In a real healthcare pipeline, this approach compressed 150-page clinical documents down to 3–10 relevant pages per criterion — 85%+ compression with 90%+ recall.
- The pattern generalizes beyond RAG: document routing, duplicate detection, content moderation, and multi-label classification — all at 50–100x less cost than LLM classification.
Large language models are expensive to run — and the cost scales directly with how much text you put in the context window. In many real-world pipelines, the bottleneck isn’t the LLM’s reasoning ability; it’s the sheer volume of context you have to provide before the LLM can reason at all.
This is where ZeroEntropy’s zerank-2 reranker opens up a pattern that goes well beyond traditional search: using a reranker as a calibrated binary classifier to decide, page by page or chunk by chunk, what actually belongs in your LLM’s context.
## What Is zerank-2?
zerank-2 is ZeroEntropy’s state-of-the-art multilingual cross-encoder reranker. Cross-encoders differ from embedding models in a fundamental way: rather than independently encoding a query and a document into vectors and comparing them, a cross-encoder reads the query and the document together and outputs a single relevance score. This joint attention makes cross-encoders substantially more accurate — at the cost of being slower for large-scale retrieval, which is why they are typically applied as a second-stage reranker on a shortlist of candidates.
zerank-2 pushes the state of the art on several axes:
- Instruction-following: The model accepts a natural-language instruction alongside the query, letting you inject domain context, terminology, or custom ranking criteria. A healthcare query for “acute kidney injury” can be told to treat “AKI” as a synonym, or to prioritize lab values over clinical notes.
- Calibrated scores: This is the key property for the use case in this post. The model is trained so that a score of 0.8 means approximately 80% relevance — consistently, across query types and domains. The score is not just a relative ranking signal; it carries absolute probabilistic meaning.
- Multilingual: Trained across 100+ languages with near-English performance, including challenging scripts and code-switching queries.
- Fast and cheap: At $0.025 per 1M tokens — half the price of Cohere Rerank 3.5 — and with p50 latency around 130ms, it fits comfortably into production pipelines.
## The Core Insight: A Reranker Score Is a Relevance Probability
Most people use rerankers to sort a list. But zerank-2’s calibrated scores unlock a different usage pattern: thresholding.
For any query-document pair, zerank-2 produces a score in [0, 1]. Because of the calibration guarantee, you can interpret this score as a probability: “how likely is it that this chunk is relevant to this query?”
That turns the reranker into a binary classifier:
```
score >= threshold  →  relevant ✓      (include in context)
score <  threshold  →  not relevant ✗  (discard)
```
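As a minimal sketch of that decision rule (the chunks and scores here are illustrative, not real model output):

```python
def filter_by_relevance(
    chunks: list[str], scores: list[float], threshold: float = 0.4
) -> list[str]:
    """Keep only chunks whose calibrated relevance score clears the cutoff."""
    return [chunk for chunk, score in zip(chunks, scores) if score >= threshold]

# Illustrative calibrated scores for three chunks
chunks = ["dosage history", "billing address", "recent lab results"]
scores = [0.82, 0.07, 0.55]
kept = filter_by_relevance(chunks, scores)  # → ["dosage history", "recent lab results"]
```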
## Use Case: Context Compression for Long Clinical Documents
To make this concrete, consider the problem ZeroEntropy tackled with a leading healthcare company that automates clinical review of prior authorization requests.
The setup: A prior authorization request is a lengthy document — often 100–200 pages of clinical notes, lab results, imaging reports, and physician letters. A clinical reviewer must assess whether the patient meets a set of coverage criteria, each expressed as a structured question:
- “Is there documentation of a failed trial of first-line therapy?”
- “Does the patient have a confirmed diagnosis of moderate-to-severe disease?”
- “Are there contraindications documented for alternative treatments?”
There may be 50–100 such criteria per review, and the relevant evidence for each criterion is scattered across just a few pages of the full document.
The naive approach: Send the entire document to an LLM for each criterion. A 150-page document at ~2,000 characters per page is ~300,000 characters of context — multiplied across 80 criteria, that is 24 million characters fed to the LLM per case. At scale this is both slow and prohibitively expensive.
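The arithmetic, spelled out with the counts above:

```python
pages, chars_per_page, criteria = 150, 2_000, 80

chars_per_criterion = pages * chars_per_page      # 300,000 characters of context
chars_per_case = chars_per_criterion * criteria   # 24,000,000 characters per case
```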
The zerank-2 approach: Score every page of the document against each criterion question. Keep only the pages that score above a threshold (or the top-K pages). Send only those to the LLM.
```python
from zeroentropy import AsyncZeroEntropy

zclient = AsyncZeroEntropy()

async def score_pages_for_criterion(
    pages: list[str],
    criterion_question: str,
    batch_size: int = 10,
) -> list[float]:
    """Score each page's relevance to a clinical criterion."""
    all_scores: list[float] = [0.0] * len(pages)
    for i in range(0, len(pages), batch_size):
        batch = pages[i : i + batch_size]
        response = await zclient.models.rerank(
            model="zerank-2",
            query=criterion_question,
            documents=batch,
        )
        for result in response.results:
            all_scores[i + result.index] = result.relevance_score
    return all_scores

def select_relevant_pages(
    pages: list[str],
    scores: list[float],
    threshold: float = 0.4,
) -> list[str]:
    """Return only pages that exceed the relevance threshold."""
    return [page for page, score in zip(pages, scores) if score >= threshold]
```
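Gluing the two helpers together, a per-criterion compression pass might look like the sketch below. The scorer is injected as a parameter so the flow can be exercised without network calls; in production you would pass `score_pages_for_criterion`. The stub scorer and its scores are illustrative, not real model output.

```python
import asyncio
from typing import Awaitable, Callable

async def compress_for_criterion(
    pages: list[str],
    criterion: str,
    scorer: Callable[[list[str], str], Awaitable[list[float]]],
    threshold: float = 0.4,
) -> list[str]:
    """Score every page against one criterion and keep pages above the cutoff."""
    scores = await scorer(pages, criterion)
    return [page for page, score in zip(pages, scores) if score >= threshold]

async def stub_scorer(pages: list[str], criterion: str) -> list[float]:
    # Stand-in for score_pages_for_criterion; fixed illustrative scores
    return [0.91, 0.12, 0.63]

pages = ["clinical note", "fax cover sheet", "lab report"]
kept = asyncio.run(
    compress_for_criterion(
        pages, "Is there documentation of a failed first-line trial?", stub_scorer
    )
)
# kept == ["clinical note", "lab report"]
```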
The result is a compressed context containing only the pages with evidence relevant to that specific criterion — typically 3–10 pages instead of 150. The LLM then reasons over a manageable, high-signal context.
## The Recall vs. Context Size Tradeoff
The key question is: how much context do you need to preserve before you’ve captured essentially all the relevant evidence?
The curve has a characteristic shape: recall rises steeply at first, then flattens as you include more pages. In practice, the top 10–20 pages by zerank score capture the vast majority of ground-truth relevant pages across all criteria — often 90%+ recall at less than 15% of the total document characters.
You can discard 85%+ of document characters while preserving 90%+ of the relevant content.
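Given a labeled validation set, the recall-at-K curve behind these numbers is straightforward to compute. The scores and ground-truth labels below are illustrative; page numbers are 1-indexed.

```python
def recall_at_k(scores: list[float], gt_pages: list[int], k: int) -> float:
    """Fraction of ground-truth pages captured by the top-k scored pages (1-indexed)."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    top_pages = {i + 1 for i in ranked[:k]}
    gt = set(gt_pages)
    return len(top_pages & gt) / len(gt) if gt else 1.0

scores = [0.91, 0.22, 0.74, 0.08, 0.41]  # one score per page
gt_pages = [1, 3, 5]                     # pages a reviewer marked relevant
curve = [recall_at_k(scores, gt_pages, k) for k in range(1, 6)]
# curve rises steeply, then flattens: [0.33..., 0.66..., 1.0, 1.0, 1.0]
```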
## Setting the Threshold
Choosing a threshold is a one-time calibration step, and zerank-2’s calibrated scores make it interpretable rather than arbitrary.
### Option 1: Fixed threshold based on semantics
Because zerank-2’s scores are calibrated probabilities, you can pick a threshold with direct semantic meaning:
| Threshold | Meaning |
|---|---|
| 0.2 | Include anything with 20%+ chance of relevance (high recall, lower precision) |
| 0.4 | Balanced — good default for most RAG pipelines |
| 0.6 | Include only clearly relevant content (high precision, some recall loss) |
| 0.8 | Very conservative — near-certain relevance only |
### Option 2: Calibrate on a labeled validation set
If you have ground-truth labels, you can directly optimize the threshold against recall:
```python
def find_threshold_for_recall_target(
    scores_per_criterion: list[list[float]],
    ground_truth_pages: list[list[int]],
    target_recall: float = 0.95,
) -> float:
    """
    Find the highest threshold that still achieves target_recall
    on a labeled validation set (ground-truth pages are 1-indexed).
    """
    best_threshold = 0.0
    for threshold in [t / 100 for t in range(100)]:
        recalls = []
        for scores, gt_pages in zip(scores_per_criterion, ground_truth_pages):
            selected = {i + 1 for i, s in enumerate(scores) if s >= threshold}
            gt = set(gt_pages)
            recall = len(selected & gt) / len(gt) if gt else 1.0
            recalls.append(recall)
        avg_recall = sum(recalls) / len(recalls)
        if avg_recall >= target_recall:
            best_threshold = threshold  # recall still holds; keep raising
        else:
            break  # recall only drops as the threshold rises
    return best_threshold
```
This gives you a principled threshold tied to a specific recall guarantee — for example, “include all content needed to answer 95% of criteria correctly.”
### Option 3: Top-K instead of threshold
If your downstream pipeline has a hard context budget (e.g. a 32K token limit), use top-K selection rather than a threshold:
```python
def select_top_k_pages(
    pages: list[str],
    scores: list[float],
    k: int = 10,
) -> list[str]:
    ranked = sorted(enumerate(scores), key=lambda x: -x[1])
    top_indices = {i for i, _ in ranked[:k]}
    # Return in original order to preserve document flow
    return [page for i, page in enumerate(pages) if i in top_indices]
```
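A variant of the same idea when the budget is measured in content size rather than page count: greedily take the highest-scoring pages until a character budget (a rough proxy for tokens) is exhausted. This helper is a sketch, not part of the ZeroEntropy SDK.

```python
def select_within_budget(
    pages: list[str], scores: list[float], max_chars: int = 8_000
) -> list[str]:
    """Take pages in descending score order until the character budget is spent."""
    ranked = sorted(enumerate(scores), key=lambda x: -x[1])
    chosen: set[int] = set()
    used = 0
    for i, _ in ranked:
        if used + len(pages[i]) <= max_chars:
            chosen.add(i)
            used += len(pages[i])
    # Return in original order to preserve document flow
    return [page for i, page in enumerate(pages) if i in chosen]
```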
## Beyond Context Compression: General Binary Classification
The context compression use case is specific to RAG, but the underlying pattern — zerank-2 as a binary classifier — generalizes broadly.
### Document Routing
In a multi-stage pipeline, use zerank to decide which documents warrant expensive downstream processing. Score each document against a query or policy description; route only those above 0.5 to an LLM for detailed analysis.
### Duplicate / Near-Duplicate Detection
Frame it as a relevance query: “Is this document essentially the same as the reference?” A high score flags near-duplicates.
### Content Moderation / Policy Compliance
Score content against a policy description query. “Does this text contain instructions that could cause harm?” with a low threshold catches borderline cases for human review.
### Multi-Label Classification
Run one zerank call per label/class. Each call is cheap; the calibrated scores give you a probability per class that you can threshold independently.
Here’s a multi-label classification example in practice:
```python
async def classify_document(
    document: str,
    class_descriptions: dict[str, str],
    threshold: float = 0.5,
) -> dict[str, bool]:
    """
    Classify a document against multiple categories using zerank-2.
    Each category is described as a natural language query.
    """
    results = {}
    for class_name, description in class_descriptions.items():
        response = await zclient.models.rerank(
            model="zerank-2",
            query=description,
            documents=[document],
        )
        score = response.results[0].relevance_score
        results[class_name] = score >= threshold
    return results

# Example: clinical document triage
categories = {
    "contains_lab_results": "laboratory test results, blood work, or diagnostic measurements",
    "contains_imaging": "radiology report, MRI, CT scan, X-ray, or imaging findings",
    "contains_diagnosis": "diagnosis, clinical assessment, or documented medical condition",
    "contains_medication": "medication, prescription, dosage, or drug therapy",
}

labels = await classify_document(clinical_note, categories, threshold=0.4)
```
## Putting It Together
The pattern zerank-2 enables in these pipelines is simple but powerful:
1. Express your classification task as a relevance query
2. Score your candidates
3. Apply a threshold
4. Send only what matters to the LLM
zerank-2 is not a replacement for LLMs. It is a fast, cheap filter that makes LLMs more effective by ensuring they spend their compute on content that actually matters.
## Get Started
ZeroEntropy offers a free 2-week trial with 1,000 queries.
```python
from zeroentropy import AsyncZeroEntropy

client = AsyncZeroEntropy()  # uses ZEROENTROPY_API_KEY env var

response = await client.models.rerank(
    model="zerank-2",
    query="your query or criterion here",
    documents=["chunk 1 text", "chunk 2 text", "..."],
)

for result in response.results:
    print(f"Document {result.index}: score={result.relevance_score:.3f}")
```

SOC 2 Type II and HIPAA-ready cloud options available for regulated industries.
