LLM context windows are large but not infinite, and even when content fits the stated length, attention quality tends to degrade well before the window is full (often noticeably past the first ~10K tokens, though the exact point varies by model and task). Context compression is the family of techniques that ensure the model sees only the parts that actually matter.
The longer the model’s nominal window, the more compression pays off: each doubling of input roughly doubles cost, while the quality gained from the extra tokens grows only sub-linearly.
The need shows up in two places:
- RAG — your retriever returns 50 candidate documents, total 80K tokens. You can’t fit them all (or shouldn’t); compress to the most relevant 10K.
- Long-running agents — a coding agent has a 200K-token tool-call trace. Each new step needs context, but the full trace is wasteful and costly. Compress aggressively.
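The RAG case above can be sketched as greedy budget packing: sort candidates by relevance score and keep adding them until the token budget is spent. This is a minimal sketch, assuming each candidate already carries a token count and a relevance score; the tuple layout, the `pack_to_budget` name, and the synthetic scores are illustrative, not taken from any particular library.

```python
def pack_to_budget(passages, budget_tokens):
    """Keep the highest-scoring passages that fit within budget_tokens.

    passages: list of (text, tokens, score) tuples (hypothetical layout).
    Returns (chosen_passages, tokens_used).
    """
    chosen, used = [], 0
    # Visit candidates from most to least relevant.
    for text, tokens, score in sorted(passages, key=lambda p: p[2], reverse=True):
        # Skip anything that would overflow the budget; a smaller passage
        # further down the ranking may still fit.
        if used + tokens <= budget_tokens:
            chosen.append((text, tokens, score))
            used += tokens
    return chosen, used


# 50 synthetic candidates of 1,600 tokens each (80K total), scores made up.
candidates = [(f"doc {i}", 1600, 1.0 - i / 50) for i in range(50)]
chosen, used = pack_to_budget(candidates, budget_tokens=10_000)
print(len(chosen), used)
```

Budget packing guarantees a size but not a quality floor: if only two candidates are relevant, it will still fill the remaining space with weaker ones.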
Compression strategies
- Reranker-based filtering: score every passage with a reranker, keep only those above a calibrated threshold. Cheap and effective.
- Span extraction — instead of keeping whole documents, extract just the relevant span (a paragraph or sentence) per document. Maintains precision; loses surrounding context.
- Summarization — replace each document with an LLM-generated summary. Much smaller; risk of hallucination in the summary itself.
- Hierarchical — short summary always present, full text fetched on demand if the agent flags it as needed.
- Specialized compression model — a small model trained specifically to identify and extract the relevance-bearing spans for a given query. Mostly a research direction; few open-weight models target this objective directly.
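To make span extraction from the list above concrete, here is a minimal sketch that keeps only the best-matching sentence of a document, optionally with neighbouring sentences for context. The scoring function is a deliberately crude stand-in (query-term overlap); in practice a reranker or embedding similarity would likely take its place. All function names are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(query, sentence):
    # Placeholder relevance: fraction of query terms that appear in the sentence.
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0


def extract_span(query, document, window=0):
    """Return the best-scoring sentence plus `window` neighbours on each side."""
    sentences = split_sentences(document)
    if not sentences:
        return ""
    best = max(range(len(sentences)),
               key=lambda i: overlap_score(query, sentences[i]))
    lo = max(0, best - window)
    hi = min(len(sentences), best + window + 1)
    # The returned text is copied verbatim from the source, never paraphrased.
    return " ".join(sentences[lo:hi])


doc = ("The service was launched in 2019. "
       "Token refresh uses a rotating JWT with a 15 minute expiry. "
       "The team is based in Berlin.")
print(extract_span("how does token refresh work", doc))
```

Raising `window` trades some compression for the surrounding context that pure span extraction loses.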
Why a calibrated reranker is well-suited
Reranker-based filtering only works if the scores are calibrated — otherwise “drop everything below 0.5” is meaningless because 0.5 means different things on different queries. With a calibrated reranker like zerank-2, you can set a fixed threshold globally and trust it.
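The math: if the reranker is well-calibrated, “score > 0.5” means “more likely relevant than not”. For a 50-doc candidate set, this typically keeps 5-15 docs. Assuming documents of similar length, that cuts context cost by roughly 3-10×, and what remains is exactly the set the reranker judges more likely relevant than not. That is a statistical expectation that holds only as well as the calibration does, not a guarantee.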
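The arithmetic can be sketched in a few lines. This assumes the scores are already calibrated probabilities in [0, 1]; the `Passage` type, the synthetic scores, and the token counts are made up for illustration, and no specific reranker API is implied.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    tokens: int
    score: float  # assumed: calibrated probability of relevance in [0, 1]


def filter_by_threshold(passages, threshold=0.5):
    # Keep passages judged more likely relevant than not, best first.
    kept = [p for p in passages if p.score > threshold]
    return sorted(kept, key=lambda p: p.score, reverse=True)


def compression_ratio(before, after):
    # Ratio of token counts before and after filtering.
    total_before = sum(p.tokens for p in before)
    total_after = sum(p.tokens for p in after)
    return total_before / total_after if total_after else float("inf")


# 50 synthetic candidates of 1,600 tokens each (80K total).
# The first 8 get scores above 0.5; the rest get 0.2.
candidates = [
    Passage(f"doc {i}", 1600, 0.9 - 0.05 * i if i < 8 else 0.2)
    for i in range(50)
]
kept = filter_by_threshold(candidates)
print(len(kept), compression_ratio(candidates, kept))
```

With 8 of 50 equal-length documents kept, the ratio is 50 / 8 = 6.25, inside the 3-10× range that 5-15 survivors implies.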
Where compression hurts
- Multi-document reasoning — if the LLM needs to synthesize across many sources, compression that drops some sources will mute that signal.
- Source attribution: users often want to see every retrieved document, not just the ones that fit the model’s context. Show the full retrieved set in the interface and send only the compressed subset to the model.
- Aggressive thresholds — pruning too hard means the relevant doc gets dropped along with the noise. Tune thresholds against end-to-end answer quality, not against compression ratio alone.
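Tuning against end-to-end quality can be sketched as a sweep: for each candidate threshold, measure answer accuracy and average tokens sent, then pick the cheapest threshold whose accuracy stays close to the best. In this sketch `answer_is_correct` is a hypothetical callback that would run the model on the kept passages and grade the answer; the demo uses a trivial proxy (is the gold passage still present?) purely so the example is self-contained.

```python
def sweep_thresholds(examples, thresholds, answer_is_correct):
    """examples: list of (passages, gold); passages: list of (text, tokens, score)."""
    results = []
    for t in thresholds:
        correct, tokens = 0, 0
        for passages, gold in examples:
            kept = [p for p in passages if p[2] > t]
            tokens += sum(p[1] for p in kept)
            correct += bool(answer_is_correct(kept, gold))
        results.append({
            "threshold": t,
            "accuracy": correct / len(examples),
            "avg_tokens": tokens / len(examples),
        })
    return results


def pick_threshold(results, max_accuracy_drop=0.02):
    # Cheapest setting whose accuracy is within max_accuracy_drop of the best.
    best = max(r["accuracy"] for r in results)
    acceptable = [r for r in results if r["accuracy"] >= best - max_accuracy_drop]
    return min(acceptable, key=lambda r: r["avg_tokens"])


# Stand-in grader (assumption): correct if the gold passage survived filtering.
def gold_survived(kept, gold):
    return any(text == gold for text, _, _ in kept)


examples = [
    ([("a", 100, 0.9), ("b", 100, 0.3)], "a"),
    ([("c", 100, 0.6), ("d", 100, 0.55)], "c"),
]
results = sweep_thresholds(examples, [0.2, 0.5, 0.7], gold_survived)
choice = pick_threshold(results)
print(choice)
```

Here 0.7 compresses hardest but drops the relevant passage in the second example, so the sweep settles on 0.5.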
For long-running agents — coding assistants, research agents, multi-step workflow executors — the trace is the dominant context cost by an order of magnitude. A 50-step run can blow past 200K tokens with naive concatenation; the same run with sensible compression sits at 20-30K.
The right policy is hierarchical and tiered. Recent steps stay verbatim — the agent’s last 3-5 actions usually inform the next decision and shouldn’t be summarized. Mid-distance steps collapse to a structured summary: action taken, key result, any error. Distant steps get further compressed into a goal-oriented narrative — “explored authentication module, ruled out OAuth, settled on JWT” — losing the per-step granularity but preserving the decision shape.
The mistake to avoid is uniform compression. Treating every step the same way either loses recent detail (summarizing the newest steps as hard as the oldest) or wastes budget on stale detail (keeping the oldest steps as verbatim as the newest). Tier the policy.
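A tiered policy along these lines might look like the following. It is a minimal sketch: the `Step` fields, the tier sizes, and the `default_narrative` placeholder are assumptions made for illustration. In a real agent the distant-tier narrative would most likely come from an LLM summarizer passed in as `summarize_distant`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str                  # what the agent did
    result: str                  # key outcome, one line
    raw: str                     # full verbatim record of the step
    error: Optional[str] = None  # error message, if any


def default_narrative(steps):
    # Placeholder: just joins action names. Swap in an LLM summary in practice.
    return "; ".join(s.action for s in steps)


def compress_trace(steps, keep_recent=4, mid_window=10,
                   summarize_distant=default_narrative):
    """Recent steps verbatim, mid-distance steps structured, distant steps narrated."""
    recent_start = max(0, len(steps) - keep_recent)
    mid_start = max(0, recent_start - mid_window)
    distant = steps[:mid_start]
    mid = steps[mid_start:recent_start]
    recent = steps[recent_start:]

    parts = []
    if distant:
        parts.append(f"[earlier] {summarize_distant(distant)}")
    for s in mid:
        line = f"[step] action={s.action}; result={s.result}"
        if s.error:
            line += f"; error={s.error}"
        parts.append(line)
    # The most recent steps are passed through untouched.
    parts.extend(s.raw for s in recent)
    return "\n".join(parts)


steps = [Step(f"act{i}", f"res{i}", f"RAW {i}") for i in range(20)]
steps[12].error = "timeout"
print(compress_trace(steps))
```

The tier boundaries (`keep_recent`, `mid_window`) are the knobs to tune; the 3-5 step verbatim window from the text maps onto `keep_recent`.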
Summarization replaces source text with a generated paraphrase. The compression ratio is good (often 5-10× on long docs), but the paraphrase introduces a generation step, which means it can hallucinate.
The failure mode is subtle: the summary preserves the gist of the document but drops or rewrites a quantitative detail. The original says “the firm’s Q3 revenue was 4.2B”; the summary says “the firm’s Q3 revenue was solid”. The downstream LLM, reading only the summary, can’t reconstruct the number. For analytical workflows where the LLM is supposed to cite or compute on retrieved facts, summarization is the wrong compression; span extraction, which preserves source tokens verbatim, is the right one.
Summarization is the right call only when the downstream task is genuinely synthesis-shaped, the source documents are too long to extract from cleanly, and the loss of verbatim phrasing is acceptable. Otherwise default to extraction.
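One cheap guard, sketched below, is to compare the numeric tokens in the source with those in the summary and fall back to the extracted span when any are missing. This is a heuristic under stated assumptions: the regular expression only covers plain digits with optional decimal parts and a `%`, `K`, `M`, or `B` suffix, and the function names are hypothetical. It will not catch a number that was changed into another number already present elsewhere in the summary.

```python
import re

# Digits with optional decimal/thousands groups and an optional %, K, M or B suffix.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*\s?(?:%|[KMB]\b)?")


def numeric_facts(text):
    # Normalize by removing internal whitespace, e.g. "4.2 B" -> "4.2B".
    return {re.sub(r"\s", "", m.group(0)) for m in NUMBER.finditer(text)}


def dropped_numbers(source, summary):
    # Numbers that appear in the source but not in the summary.
    return numeric_facts(source) - numeric_facts(summary)


def choose_compression(source, summary, extracted_span):
    # Prefer the summary only if it kept every number found in the source.
    return extracted_span if dropped_numbers(source, summary) else summary


source = "the firm's Q3 revenue was 4.2B"
summary = "the firm's Q3 revenue was solid"
print(dropped_numbers(source, summary))
```

On the example from the text, the check flags `4.2B` as missing, so the extracted span would be sent instead of the summary.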