Parent-Document Retrieval

Also known as: parent retriever, small-to-big retrieval, hierarchical chunking

TL;DR

Parent-document retrieval splits the index granularity from the context granularity: embed and retrieve over small chunks for precision, but return the larger parent document to the LLM. This addresses the chunk-boundary problem in RAG.

In practice: index small chunks (~200 tokens), retrieve at the small-chunk level, then return each chunk’s larger parent document (~1500 tokens) to the LLM. Retrieval scores the precise span; generation sees the full surrounding context.

The problem it fixes

Chunking creates a tradeoff:

Small chunks are sharp for retrieval — the embedding represents one focused idea, dense-retrieval recall is high, and the reranker can isolate exactly which span matters.
Large chunks are useful for generation — the LLM gets context, picks up surrounding definitions, and doesn’t have to stitch fragments.

Pick small chunks, and a definition split across two chunks means the LLM may only ever see half of it. Pick large chunks, and the embedding carries too broad a signal: the relevant sentence is buried in 1500 tokens of unrelated content. Parent-document retrieval uses both, small chunks for matching and large ones for context.

Index granularity and context granularity have different jobs. Small chunks win retrieval; large chunks win generation. Parent-document retrieval lets you have both, at the cost of more LLM tokens per result.

The mechanism

The setup is two structures on the same corpus:

Parent docs. The unit you want the LLM to see — a section, a full article, a few paragraphs together. 1000-2000 tokens is typical.
Child chunks. Smaller spans of each parent. 100-300 tokens, ~5-15 children per parent. The embeddings live at this level.
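A minimal sketch of the two structures, assuming whitespace tokens as a stand-in for real tokenizer counts. The names `Child`, `split_into_children`, and `build_index` are hypothetical and chosen for illustration only; a real pipeline would likely split on sentence or section boundaries rather than fixed windows.

```python
from dataclasses import dataclass


@dataclass
class Child:
    child_id: int
    parent_id: int  # pointer back to the parent document
    text: str


def split_into_children(parent_text: str, max_tokens: int = 200) -> list[str]:
    """Split a parent into consecutive spans of at most max_tokens whitespace tokens."""
    tokens = parent_text.split()
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def build_index(parents: dict[int, str], max_tokens: int = 200) -> list[Child]:
    """Return the child records to embed; only children get embeddings."""
    children: list[Child] = []
    for parent_id, text in parents.items():
        for span in split_into_children(text, max_tokens):
            children.append(Child(len(children), parent_id, span))
    return children


# Toy corpus: one 450-token parent and one 120-token parent.
parents = {0: "alpha " * 450, 1: "beta " * 120}
children = build_index(parents, max_tokens=200)
```

The parents stay in a plain ID-to-text map; only the `children` list would be passed to an embedder and a vector index.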

At query time:

Embed the query.
First-pass retrieval returns top-K child chunks.
Look up the parent of each retrieved child.
Deduplicate parents, optionally rerank at the parent level.
Send parents to the LLM.
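The five steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not a reference implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the corpus is made up, and the optional parent-level rerank is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: parents keyed by ID, children as (text, parent_id).
PARENTS = {
    "p1": "Refund policy. Refunds are issued within 30 days. Contact support to start a refund.",
    "p2": "Shipping policy. Orders ship in two days. Tracking is emailed after dispatch.",
}
CHILDREN = [
    ("Refund policy.", "p1"),
    ("Refunds are issued within 30 days.", "p1"),
    ("Contact support to start a refund.", "p1"),
    ("Shipping policy.", "p2"),
    ("Orders ship in two days.", "p2"),
    ("Tracking is emailed after dispatch.", "p2"),
]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_parents(query: str, top_k: int = 3) -> list[str]:
    q = embed(query)  # 1. embed the query
    ranked = sorted(  # 2. first-pass retrieval over child chunks
        CHILDREN, key=lambda c: cosine(q, embed(c[0])), reverse=True
    )[:top_k]
    parent_ids: list[str] = []
    for _, parent_id in ranked:  # 3. look up each child's parent
        if parent_id not in parent_ids:  # 4. deduplicate, keeping first-hit order
            parent_ids.append(parent_id)
    return [PARENTS[pid] for pid in parent_ids]  # 5. parents go to the LLM


context = retrieve_parents("when are refunds issued", top_k=3)
```

Here the single matching sentence pulls in the whole refund section, so the LLM also sees how to start a refund, which the matching child alone does not say.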

Where parent-document retrieval wins

Long-form documentation — small chunks match a precise sentence; the parent gives the LLM the preceding setup and following clarification.
Legal contracts — a clause matches the query, but the LLM needs the section heading and surrounding definitions to interpret it correctly.
Multi-step technical answers — the answer is one paragraph but only makes sense alongside the three paragraphs around it.
Code search — a function definition matches, but the LLM needs the file context (imports, class scope) to use it correctly.

Implementation notes

The pattern is composable with the rest of the stack:

Reranking at the parent level is the common choice — child-level retrieval is wide enough to allow a meaningful rerank on the deduplicated parents.
Reranking at the child level can be useful when you want fine-grained precision on which spans triggered the match, e.g. for highlighting or citation.
Hybrid first-pass. Run BM25 and dense retrieval over the children, fuse, then expand to parents. The fusion stays at the level where the signal is sharpest.
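One way to sketch the hybrid first pass is reciprocal rank fusion (RRF) over the two child rankings, followed by parent expansion. RRF is an assumption here; the text above does not name a fusion method. The child IDs, rankings, and the `child_to_parent` map are made up for illustration, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of child IDs with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, child_id in enumerate(ranking, start=1):
            scores[child_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


def expand_to_parents(child_ids: list[str], child_to_parent: dict[str, str]) -> list[str]:
    """Map fused children to parents, deduplicated in fused order."""
    parents: list[str] = []
    for child_id in child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in parents:
            parents.append(parent_id)
    return parents


# Hypothetical first-pass results over the same child chunks.
bm25_ranking = ["c3", "c1", "c7"]
dense_ranking = ["c1", "c5", "c3"]
child_to_parent = {"c1": "A", "c3": "A", "c5": "B", "c7": "C"}

fused_children = rrf_fuse([bm25_ranking, dense_ranking])
fused_parents = expand_to_parents(fused_children, child_to_parent)
```

Fusion happens entirely over child IDs; parents only appear in the final expansion step.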

Three cases where parent-document retrieval does not pay off:

Self-contained chunks. If your corpus is already a list of FAQ answers or product descriptions, each chunk is the right context unit and the parent expansion is wasted tokens.
Token budget pressure. If the LLM context is already saturated, sending parents instead of children multiplies the cost per query, and the LLM ends up losing important spans to context rot.
Highly structured data. For tabular content, JSON, or strict schemas, the right unit is the row or record, not a hierarchical parent.

For long-form prose, technical documentation, and legal text, parent-document retrieval is almost always a net win. For everything else, measure before adopting.

Cost shape

Parent-document retrieval costs roughly:

Index storage: same as small-chunk indexing — you only embed the children.
Retrieval: same cost as small-chunk retrieval; the parent lookup is one ID-to-document map lookup per result, which is negligible.
LLM tokens: higher than small-chunk RAG by the parent-to-child size ratio. A 1500-token parent vs a 200-token child is ~7.5× the tokens per result. Send fewer results to compensate.

The trade is “more tokens per result, fewer results”. For the same total token budget, this typically improves faithfulness and answer quality, because each result the LLM sees is more complete.
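The arithmetic behind that trade, as a short sketch. The 200- and 1500-token sizes come from the text above; the 6000-token context budget is a hypothetical figure picked only to make the numbers concrete.

```python
def results_within_budget(budget_tokens: int, tokens_per_result: int) -> int:
    """How many whole results fit in a fixed context budget."""
    return budget_tokens // tokens_per_result


CHILD_TOKENS = 200
PARENT_TOKENS = 1500
BUDGET_TOKENS = 6000  # hypothetical budget reserved for retrieved context

ratio = PARENT_TOKENS / CHILD_TOKENS  # tokens per result, parent vs child
children_that_fit = results_within_budget(BUDGET_TOKENS, CHILD_TOKENS)
parents_that_fit = results_within_budget(BUDGET_TOKENS, PARENT_TOKENS)

print(f"ratio: {ratio}x, children: {children_that_fit}, parents: {parents_that_fit}")
```

Under these assumptions the same budget holds 30 child chunks or 4 parents, which is why the number of results sent has to drop when switching to parents.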

Go further

How big should the chunks and parents actually be?

A common shape is 200-token chunks under 1500-2000-token parents: small enough that embeddings stay focused, large enough that the LLM gets the surrounding context. The exact numbers depend on your embedder's training distribution (many are trained on 256-512 token spans) and your downstream LLM's context budget.


What happens when multiple child chunks point to the same parent?

Deduplicate before sending to the LLM. If chunks 3, 7, and 9 from the first pass all belong to parent doc 42, you send parent 42 once, not three times. Many implementations also boost the parent's effective relevance score when multiple children hit: three matches inside one doc is a stronger signal than three matches scattered across three docs.
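A minimal sketch of deduplication with a multi-hit boost. The scoring rule (best child score plus a fixed bonus per additional hit) is an assumption made for illustration; the text does not prescribe a formula, and the scores and the `bonus_per_extra_hit` parameter are hypothetical.

```python
from collections import defaultdict


def aggregate_parent_scores(
    hits: list[tuple[int, int, float]], bonus_per_extra_hit: float = 0.1
) -> list[tuple[int, float]]:
    """Collapse (child_id, parent_id, score) hits into one scored entry per parent."""
    best: dict[int, float] = {}
    count: dict[int, int] = defaultdict(int)
    for _child_id, parent_id, score in hits:
        best[parent_id] = max(best.get(parent_id, float("-inf")), score)
        count[parent_id] += 1
    scored = {
        pid: best[pid] + bonus_per_extra_hit * (count[pid] - 1) for pid in best
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Chunks 3, 7, and 9 all belong to parent 42; chunk 5 belongs to parent 17.
hits = [(3, 42, 0.80), (5, 17, 0.85), (7, 42, 0.78), (9, 42, 0.75)]
ranked_parents = aggregate_parent_scores(hits)
```

Parent 42 appears once in the output and, with this bonus rule, outranks parent 17 even though 17 had the single highest child score.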


Is this the same thing as recursive chunking?

Related but distinct. Recursive chunking is one way to build the parent/child hierarchy — split the doc into sections, then sections into paragraphs, then paragraphs into sentences. Parent-document retrieval is the retrieval pattern of querying the smallest level and returning a larger level. You can do parent-document retrieval over any two-level chunking scheme, recursive or flat.

