Context Compression

Also known as: context filtering, context selection, RAG compression

TL;DR

Context compression shrinks a retrieval result set or agent trace down to just the spans the LLM actually needs before it is sent to the model. It is crucial for long-running agentic systems, where context blows past the model's effective attention window.

LLM context windows are large but not infinite, and even when content fits the stated length, attention quality degrades sharply past the first ~10K tokens. Context compression is the family of techniques that ensure the model only sees the parts that actually matter.

The longer the model’s nominal window, the more compression pays off: cost grows linearly with input length, while the model’s ability to make use of the extra tokens grows only sub-linearly.

The need shows up in two places:

RAG — your retriever returns 50 candidate documents, total 80K tokens. You can’t fit them all (or shouldn’t); compress to the most relevant 10K.
Long-running agents — a coding agent has a 200K-token tool-call trace. Each new step needs context, but the full trace is wasteful and costly. Compress aggressively.
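A minimal sketch of the RAG case above, assuming every candidate already carries a relevance score and approximating token counts by whitespace splitting (a real system would use the model's tokenizer). The names `Passage`, `approx_tokens`, and `pack_to_budget` are hypothetical, chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    text: str
    score: float  # relevance score from any retriever or reranker (assumed given)


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def pack_to_budget(passages: list[Passage], budget: int) -> list[Passage]:
    """Greedily keep the highest-scoring passages that still fit in `budget` tokens."""
    kept, used = [], 0
    for p in sorted(passages, key=lambda p: p.score, reverse=True):
        cost = approx_tokens(p.text)
        if used + cost <= budget:
            kept.append(p)
            used += cost
    return kept


candidates = [
    Passage("a", "alpha " * 40, 0.91),
    Passage("b", "beta " * 70, 0.85),
    Passage("c", "gamma " * 30, 0.40),
    Passage("d", "delta " * 20, 0.77),
]
selected = pack_to_budget(candidates, budget=100)
print([p.doc_id for p in selected])
```

Greedy packing will happily fill leftover budget with a low-scoring passage, which is one reason to combine a budget with a score threshold rather than rely on the budget alone.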

Compression strategies

Reranker-based filtering — score every passage with a reranker, keep only those above a calibrated threshold. Cheap and effective.
Span extraction — instead of keeping whole documents, extract just the relevant span (a paragraph or sentence) per document. Maintains precision; loses surrounding context.
Summarization — replace each document with an LLM-generated summary. Much smaller; risk of hallucination in the summary itself.
Hierarchical — short summary always present, full text fetched on demand if the agent flags it as needed.
Specialized compression model — a small model trained specifically to identify and extract the relevance-bearing spans for a given query. Mostly a research direction; few open-weight models target this objective directly.
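The hierarchical strategy in the list above can be sketched as a small container that always renders the short summary and swaps in the full text only for documents the agent has flagged. This is an illustration under assumptions: `HierarchicalContext`, `expand`, and `render` are hypothetical names, and the summaries are taken as already written.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HierarchicalContext:
    """Short summaries are always in the prompt; full text is fetched on demand."""

    summaries: dict[str, str]
    full_texts: dict[str, str]
    expanded: set[str] = field(default_factory=set)

    def expand(self, doc_id: str) -> None:
        # Called when the agent flags a document as needing its full text.
        if doc_id not in self.full_texts:
            raise KeyError(f"unknown document: {doc_id}")
        self.expanded.add(doc_id)

    def render(self) -> str:
        parts = []
        for doc_id, summary in self.summaries.items():
            body = self.full_texts[doc_id] if doc_id in self.expanded else summary
            parts.append(f"[{doc_id}] {body}")
        return "\n".join(parts)


ctx = HierarchicalContext(
    summaries={"auth": "JWT login flow.", "db": "Postgres schema notes."},
    full_texts={"auth": "Full auth module source ...", "db": "Full schema dump ..."},
)
before = ctx.render()
ctx.expand("auth")
after = ctx.render()
print(after)
```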

Why a calibrated reranker is well-suited

Reranker-based filtering only works if the scores are calibrated — otherwise “drop everything below 0.5” is meaningless because 0.5 means different things on different queries. With a calibrated reranker like zerank-2, you can set a fixed threshold globally and trust it.

The math: if the reranker is well-calibrated, “score > 0.5” means “more likely relevant than not”. For a 50-doc candidate set, this typically keeps 5-15 docs. Assuming documents of similar length, that cuts context cost by roughly 3-10×, at a cutoff whose meaning stays consistent across queries.
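The arithmetic can be made concrete with a short sketch. The scores below are invented for illustration, not output from any real reranker, and the reduction factor assumes documents of similar length. `filter_by_threshold` is a hypothetical helper name.

```python
def filter_by_threshold(scored, threshold=0.5):
    """Keep (doc_id, score) pairs whose calibrated score exceeds the threshold."""
    return [(doc_id, score) for doc_id, score in scored if score > threshold]


# 50 hypothetical candidates: 8 score above 0.5, the remaining 42 score 0.3.
scored = [(f"doc{i}", 0.9 - 0.05 * i) for i in range(8)]
scored += [(f"doc{i}", 0.3) for i in range(8, 50)]

kept = filter_by_threshold(scored, threshold=0.5)
reduction = len(scored) / len(kept)  # context reduction if docs are similar in length
print(len(kept), reduction)
```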

Where compression hurts

Multi-document reasoning — if the LLM needs to synthesize across many sources, compression that drops some sources will mute that signal.
Source attribution — the user wants to see all retrieved documents, not just the ones that fit the model’s context. Show all retrieved + send only the compressed subset to the model.
Aggressive thresholds — pruning too hard means the relevant doc gets dropped along with the noise. Tune thresholds against end-to-end answer quality, not against compression ratio alone.
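One way to act on the last point is to sweep candidate thresholds over a labeled evaluation set and look at answer quality next to the fraction of context kept. The sketch below is a simplification: it uses "all required documents survived the filter" as a stand-in for a real end-to-end answer-quality metric, and the evaluation data and the name `sweep_thresholds` are hypothetical.

```python
def sweep_thresholds(eval_set, thresholds):
    """Return (threshold, answerable fraction, mean kept fraction) per threshold."""
    rows = []
    for t in thresholds:
        answerable, kept_frac = 0, 0.0
        for example in eval_set:
            kept = {d for d, s in example["scores"].items() if s > t}
            # Stand-in for end-to-end quality: did every required doc survive?
            answerable += example["required"] <= kept
            kept_frac += len(kept) / len(example["scores"])
        n = len(eval_set)
        rows.append((t, answerable / n, kept_frac / n))
    return rows


eval_set = [
    {"scores": {"a": 0.9, "b": 0.6, "c": 0.2, "d": 0.1}, "required": {"a", "b"}},
    {"scores": {"a": 0.8, "b": 0.45, "c": 0.3, "d": 0.05}, "required": {"a", "b"}},
]
rows = sweep_thresholds(eval_set, thresholds=[0.3, 0.5, 0.7])
for threshold, quality, kept in rows:
    print(threshold, quality, kept)
```

In this toy data, raising the threshold from 0.3 to 0.5 saves context but already makes one of the two queries unanswerable, which a compression-ratio metric alone would not reveal.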

For long-running agents — coding assistants, research agents, multi-step workflow executors — the trace is the dominant context cost by an order of magnitude. A 50-step run can blow past 200K tokens with naive concatenation; the same run with sensible compression sits at 20-30K.

The right policy is hierarchical and tiered. Recent steps stay verbatim — the agent’s last 3-5 actions usually inform the next decision and shouldn’t be summarized. Mid-distance steps collapse to a structured summary: action taken, key result, any error. Distant steps get further compressed into a goal-oriented narrative — “explored authentication module, ruled out OAuth, settled on JWT” — losing the per-step granularity but preserving the decision shape.

The mistake to avoid is uniform compression. A single setting applied to every step is wrong somewhere: aggressive enough to shrink the distant trace, it strips detail from the recent steps the agent still depends on; gentle enough to protect recent steps, it leaves the distant trace bloated. Tier the policy.
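A rough sketch of the tiered policy, under stated assumptions: the tier sizes (`recent=4`, `mid=5`) are arbitrary, the `Step` fields are hypothetical, and `join_actions` is only a placeholder for the goal-oriented narrative that an LLM would normally write for distant steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    action: str
    result: str
    error: Optional[str] = None
    raw: str = ""  # verbatim tool call and output


def join_actions(steps: List[Step]) -> str:
    # Placeholder for an LLM-written narrative of the distant steps.
    return "; ".join(s.action for s in steps)


def compress_trace(
    steps: List[Step],
    recent: int = 4,
    mid: int = 5,
    narrate: Callable[[List[Step]], str] = join_actions,
) -> List[str]:
    cut_recent = max(0, len(steps) - recent)
    cut_mid = max(0, cut_recent - mid)
    distant = steps[:cut_mid]
    middle = steps[cut_mid:cut_recent]
    latest = steps[cut_recent:]

    out = []
    if distant:
        # Distant tier: one narrative line for the whole block.
        out.append(f"[{len(distant)} earlier steps] {narrate(distant)}")
    for s in middle:
        # Middle tier: structured summary of action, key result, and any error.
        out.append(f"action={s.action} | result={s.result} | error={s.error or 'none'}")
    # Recent tier: kept verbatim.
    out.extend(s.raw for s in latest)
    return out


trace = [Step(f"step{i}", f"ok{i}", raw=f"RAW OUTPUT {i} " * 20) for i in range(12)]
context = compress_trace(trace)
print(len(context), context[0])
```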

Summarization replaces extracted spans with a generated paraphrase. The compression ratio is good — often 5-10× on long docs — but the paraphrase introduces a generation step, which means it can hallucinate.

The failure mode is subtle: the summary preserves the gist of the document but drops or blurs a quantitative detail. The original says “the firm’s Q3 revenue was 4.2B”; the summary says “the firm’s Q3 revenue was solid”. The downstream LLM, reading only the summary, can’t reconstruct the number. For analytical workflows where the LLM is supposed to cite or compute on retrieved facts, summarization is the wrong compression; span extraction, which preserves source tokens verbatim, is the right one.

Summarization is the right call only when the downstream task is genuinely synthesis-shaped, the source documents are too long to extract from cleanly, and the loss of verbatim phrasing is acceptable. Otherwise default to extraction.
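If summarization is used anyway, a cheap guard against the dropped-number failure is to check that every numeric token in the source also appears in the summary. The sketch below is a heuristic, not a complete fidelity check: the regular expression is an assumption about what counts as a number, and `missing_numbers` is a hypothetical helper.

```python
import re

# A digit run with optional decimal/thousands parts and a trailing unit or
# percent sign, not preceded by a letter or digit (so "Q3" is not counted).
NUMBER = re.compile(r"(?<![A-Za-z0-9])\d+(?:[.,]\d+)*[A-Za-z%]*")


def missing_numbers(source: str, summary: str) -> list:
    """Return numeric tokens present in the source but absent from the summary."""
    return sorted(set(NUMBER.findall(source)) - set(NUMBER.findall(summary)))


source = "The firm's Q3 revenue was 4.2B, up 12% year over year."
vague_summary = "The firm's Q3 revenue was solid."
faithful_summary = "Q3 revenue was 4.2B, up 12%."

print(missing_numbers(source, vague_summary))
print(missing_numbers(source, faithful_summary))
```

A non-empty result is a signal to fall back to span extraction for that document or to regenerate the summary.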

Go further

Doesn't a long-context LLM make this obsolete?

No — even when the content fits, attention quality degrades sharply after the first ~10K tokens, and per-token cost scales linearly with whatever you send. Compression is about quality and economics, not just fitting. The longer the model's nominal window, the more compression actually pays off.


Why is calibration the hard prerequisite?

Threshold-based compression (drop everything below 0.5) is only meaningful if 0.5 means the same thing on every query. With an uncalibrated reranker your effective cutoff drifts query-to-query and you silently drop the relevant doc on some, keep noise on others.
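A toy illustration of that drift, using invented scores on two queries whose score scales differ. Nothing here comes from a real reranker, and `keep_above` is a hypothetical helper.

```python
def keep_above(scores, threshold):
    """Return the ids of documents whose score exceeds the threshold."""
    return {doc_id for doc_id, s in scores.items() if s > threshold}


# Hypothetical uncalibrated scores: each query lives on its own score scale.
query_a = {"scores": {"a1": 0.42, "a2": 0.20, "a3": 0.10}, "relevant": {"a1"}}
query_b = {"scores": {"b1": 0.95, "b2": 0.80, "b3": 0.70}, "relevant": {"b1"}}

kept_a = keep_above(query_a["scores"], 0.5)
kept_b = keep_above(query_b["scores"], 0.5)

dropped_relevant = query_a["relevant"] - kept_a  # relevant doc lost on query A
kept_noise = kept_b - query_b["relevant"]        # irrelevant docs kept on query B
print(dropped_relevant, sorted(kept_noise))
```

The same fixed cutoff is too strict on the first query and too loose on the second, which is the behavior calibration is meant to remove.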


What's a span-extraction model and how does it differ from a reranker?

A reranker scores whole passages; a span extractor identifies the specific sentence or paragraph within a passage that bears on the query. The output is shorter and denser, but you lose surrounding context that may matter for grounding. Production systems often use both: rerank to filter, then extract within the survivors.
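The two-stage pattern can be sketched as follows. To stay self-contained, the example uses a toy lexical-overlap score in place of both the reranker and the span extractor; a production system would substitute real models for `overlap_score`. All function names are hypothetical.

```python
import re


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query terms that appear in the text."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q) if q else 0.0


def rerank_then_extract(query: str, passages: list, threshold: float = 0.5) -> list:
    spans = []
    for passage in passages:
        # Stage 1: passage-level filter (stand-in for a reranker).
        if overlap_score(query, passage) <= threshold:
            continue
        # Stage 2: keep only the best-matching sentence from each survivor.
        sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
        spans.append(max(sentences, key=lambda s: overlap_score(query, s)))
    return spans


query = "token expiry setting"
passages = [
    "The service uses JWT. The token expiry setting defaults to 15 minutes. Logs rotate daily.",
    "Lunch is served at noon. The cafeteria is on floor two.",
]
spans = rerank_then_extract(query, passages)
print(spans)
```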

