Query Expansion

Also known as: HyDE, hypothetical document embeddings, query augmentation, PRF

TL;DR

Augmenting the original query with synonyms, paraphrases, or hypothetical answers before retrieval. The classical IR technique that LLMs reinvented as HyDE. Sometimes a clean win, sometimes drift that hurts more than it helps.

Query expansion augments the original query with additional terms, paraphrases, or even hypothetical answers before passing it to the retrieval system. The motivation is simple: the user’s query is short, ambiguous, or phrased differently from how the relevant documents are written, and the retriever can’t bridge that gap on its own.

The technique predates dense retrieval. Rocchio (1971), pseudo-relevance feedback (PRF), and the BM25-era RM3 algorithm have been adding synonyms and morphological variants since the 1970s. The LLM-era version is more dramatic: ask a generator to write a hypothetical document that would answer the query, then embed and retrieve against the hypothetical. Same idea, but the intervention replaces the whole query instead of appending a few terms.

The flavors

Four patterns of query expansion live in production stacks:

Query expansion patterns

Lexical expansion — add synonyms, plurals, conjugations. Operates on the BM25 side. RM3, Bo1, KL-divergence expansion. Mature, well-tuned, predictable.
Semantic expansion — paraphrase the query into 3-5 alternate phrasings. Operates on the embedding side. Each paraphrase is embedded and retrieved separately, results merged.
Hypothetical answer (HyDE) — generate the answer, then retrieve. The answer is more lexically and structurally similar to documents than questions are. Works unreasonably well on technical queries.
Decomposition expansion — break a multi-hop query into sub-questions, retrieve each separately. Important for agentic RAG loops.
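The list above says paraphrase results are merged but does not say how. One common choice is reciprocal rank fusion (RRF). That choice is an assumption here, not something this page prescribes. A minimal sketch, in which the document IDs, the per-paraphrase rankings, and the constant k=60 are all illustrative:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for three paraphrases of the same query.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d4", "d1"],
    ["d2", "d1", "d5"],
]
fused = reciprocal_rank_fusion(runs)
print(fused)
```

Documents that appear near the top of several lists rise to the top of the fused list, so no score calibration across paraphrases is needed.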

Why HyDE works (and when it stops working)

Bi-encoders trained on web (query, passage) pairs see questions on the query side and declarative passages on the document side. The two distributions are linguistically different — questions are short, interrogative; passages are long, declarative. Even a well-trained encoder leaves alignment on the floor.

HyDE bypasses this by generating a fake passage. The fake answer, even when factually wrong, looks like a document. Embedding it and retrieving against the corpus surfaces real documents that resemble the fake one. The retrieved real documents are usually correct because the LLM’s hallucinated answer at least matches the right topic.
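The flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the reference HyDE implementation: `embed` is a bag-of-words stand-in for a real bi-encoder, `generate_hypothetical_answer` is a hypothetical placeholder for an LLM call that returns a canned passage, and the three-document corpus is invented for the demo.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def generate_hypothetical_answer(query):
    """Placeholder for an LLM call (assumption); ignores the query and
    returns a canned passage so the example runs offline."""
    return ("To rotate an API key, create a new key in the dashboard, "
            "update the client configuration, then revoke the old key.")

def hyde_retrieve(query, corpus, top_k=2):
    hypothetical = generate_hypothetical_answer(query)
    q_vec = embed(hypothetical)  # embed the fake passage, not the question
    scored = [(cosine(q_vec, embed(doc)), doc_id) for doc_id, doc in corpus.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored[:top_k]]

corpus = {
    "keys": "Create a new key in the dashboard, update your configuration, and revoke the old key.",
    "billing": "Invoices are issued monthly and can be downloaded from the billing page.",
    "limits": "Requests are limited per minute; exceeding the limit returns an error.",
}
query = "how do I change my credentials?"
results = hyde_retrieve(query, corpus)
direct_score = cosine(embed(query), embed(corpus["keys"]))
print(results, direct_score)
```

In this toy setup the raw question shares no tokens with the relevant document, while the hypothetical passage shares most of them. A real bi-encoder narrows that gap, but the same mechanism is what HyDE exploits.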

When expansion hurts more than it helps

Query expansion is not a free lunch. Several failure modes recur:

Topic drift. Adding synonyms widens recall but can dilute precision — relevant documents stay in the candidate pool but get outranked by tangentially related ones.
Compute cost. Generating paraphrases or HyDE answers adds an LLM call per query, and that call often dominates per-query latency, particularly in low-volume deployments.
Hallucination capture. LLM-generated expansion can fabricate terminology that does not exist in the corpus, and retrieval then matches it against whatever unrelated documents happen to sit closest.
Reranker conflict. A strong cross-encoder downstream often recovers most of the ranking quality expansion was supposed to add, as long as the relevant document is already in the candidate pool. If you’re already running a reranker, expansion’s marginal contribution shrinks dramatically.

Reranking can only re-order what the first pass returned. If a relevant document falls outside the first-pass candidate set (the bi-encoder’s top-100, say), no reranker recovers it. Query expansion’s true job is to broaden the candidate set so the reranker has the right document to find. Even if expansion itself doesn’t rank well, the ensemble’s final top-K improves.

The diagnostic: measure recall@500 (or whatever K your reranker consumes) with and without expansion. If recall@500 improves, expansion is doing its job; if it doesn’t move, expansion is adding noise. NDCG@10 numbers alone hide this — they reflect what the reranker did, not what was findable.
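The diagnostic above amounts to computing recall@K twice over the same relevance judgments. A minimal sketch, assuming judgments are stored as sets of relevant doc IDs per query and each run is a ranked list of doc IDs. The query IDs, doc IDs, and the small K are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall_at_k(runs, qrels, k):
    """Average recall@k over every query that has relevance judgments."""
    return sum(recall_at_k(runs.get(q, []), qrels[q], k) for q in qrels) / len(qrels)

# Hypothetical judgments and first-pass results, with and without expansion.
qrels = {"q1": {"d1", "d7"}, "q2": {"d3"}}
baseline = {"q1": ["d1", "d2", "d4"], "q2": ["d5", "d6", "d8"]}
expanded = {"q1": ["d1", "d7", "d2"], "q2": ["d5", "d3", "d6"]}

k = 3  # in practice, the number of candidates your reranker consumes
base_recall = mean_recall_at_k(baseline, qrels, k)
expanded_recall = mean_recall_at_k(expanded, qrels, k)
print(f"recall@{k}: baseline={base_recall:.2f} expanded={expanded_recall:.2f}")
```

Run both configurations against the first-pass retriever only, before the reranker, so the comparison reflects what was findable and not what the reranker did with it.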

What the production rule of thumb looks like

For dense retrieval on general-domain corpora: HyDE-style expansion when LLM cost permits, lexical expansion never (the embedder already handles synonymy). For BM25 on technical corpora: lexical expansion helps; HyDE rarely does, because BM25 needs exact tokens and hallucinated paraphrases can drift away from the right vocabulary. For hybrid search: both, with a reranker downstream to filter the noise.

The honest summary: expansion is a recall lever, not a precision lever, and its value depends entirely on whether your stack is recall-bound. Most production pipelines that have added a reranker have already moved out of the recall-bound regime; in those cases, expansion is often a wash. The pipelines that are still BM25-only or bi-encoder-only on hard out-of-domain corpora are where expansion still gives a measurable lift.

Go further

What's HyDE and why did it work?

Hypothetical Document Embeddings (Gao et al., 2022) ask an LLM to generate a fake answer to the query, then embed and retrieve against that. The hypothetical answer matches the document distribution better than a question-shaped query, which closes a query-document mismatch that bi-encoders struggle with on out-of-domain data.


When does query expansion hurt?

When expansion shifts the query's meaning. Adding synonyms can pull in spurious matches; LLM-generated paraphrases may invent facts that don't exist in your corpus and steer retrieval toward hallucinated content. The risk grows on narrow domains where the LLM has no grounding.


Should I expand into BM25 or into the embedder?

Both, for different reasons. BM25 expansion (synonyms, morphological variants) recovers exact-token recall; the classical RM3 / Bo1 algorithms are mature and well-understood. Embedding-side expansion (paraphrases, HyDE) recovers semantic recall on out-of-distribution queries. Hybrid stacks expand on both sides and let downstream reranking sort it out.
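For the BM25 side, the core loop of pseudo-relevance feedback is small enough to sketch. This is a deliberately simplified version that appends the most frequent new terms from the top first-pass documents. It is not RM3 or Bo1, which weight terms with a relevance model and interpolate them with the original query. The stopword list, the feedback documents, and the function name `prf_expand` are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on", "with"}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def prf_expand(query, feedback_docs, num_terms=3):
    """Append the most frequent new terms from the feedback docs to the query."""
    query_terms = tokenize(query)
    seen = set(query_terms) | STOPWORDS
    counts = Counter(
        t for doc in feedback_docs for t in tokenize(doc) if t not in seen
    )
    # Sort by frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return query_terms + ranked[:num_terms]

# Hypothetical top-2 documents from a first BM25 pass.
feedback_docs = [
    "Index compaction merges segments to reclaim disk space.",
    "Background compaction merges small segments after deletes.",
]
expanded_query = prf_expand("index compaction", feedback_docs, num_terms=2)
print(expanded_query)
```

The expanded term list would then be sent back to BM25 for a second pass. Keeping `num_terms` small is the usual guard against the topic drift described earlier.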

