Citation Extraction

Also known as: attribution, claim-to-source mapping, span attribution, grounding

TL;DR

Citation extraction maps each claim in an LLM-generated answer back to the supporting span in the source documents. It is a separate task from generation, often handled by a small specialized model, and it is what makes RAG outputs auditable.

Citation extraction is the task of mapping each claim in an LLM’s answer back to the specific span in the retrieved context that supports it. Where faithfulness asks whether a claim is supported, citation extraction asks which words support it.

The output is structured: per claim, a list of (document ID, span offsets) pairs. Done well, this turns a paragraph of generated prose into an auditable trace — each sentence has a fingerprint pointing to its source.

Generation produces fluent prose; citation extraction produces precise pointers. The two skills don’t compose naturally inside a single autoregressive decode.

Why this is its own task

The instinct is to ask the generator to inline citations as it writes. Frontier models will do this, and the result is often reasonable but unreliable in characteristic ways:

The model picks a plausibly related span rather than the actually supporting one.
The model invents a citation token ([1], [doc:42]) that doesn’t correspond to a real span.
The model attributes a claim to a single span when the support is spread across multiple.
Citation tokens drift in long outputs — by paragraph 4, the indexing is off.
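One of these failure modes, the invented citation token, can be caught mechanically. The sketch below is a minimal illustration, assuming citations are emitted as 1-based bracketed integers such as [3]; the helper name find_invalid_citations is hypothetical. It does nothing for the other failure modes, where the index is valid but points at the wrong span.

```python
import re

# Matches bracketed numeric citations such as [3]. Assumption: the generator
# was prompted to cite with 1-based integer indices into the retrieved list.
CITATION_RE = re.compile(r"\[(\d+)\]")


def find_invalid_citations(answer: str, num_documents: int) -> list[int]:
    """Return cited indices outside 1..num_documents, in order of appearance."""
    cited = [int(m.group(1)) for m in CITATION_RE.finditer(answer)]
    return [idx for idx in cited if not 1 <= idx <= num_documents]


answer = "Rates rose in March [1]. The board cited inflation [2][5]."
invalid = find_invalid_citations(answer, num_documents=3)
print(invalid)  # [5]
```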

Even with strong prompting, inline citation accuracy hovers in the 85-95% range for top models: high enough to look right, low enough that roughly 1 in 7 to 1 in 20 citations is misleading. That is not good enough for legal review or clinical decision support.

The failure mode is mechanical. Inline citations are emitted as tokens during the same decode that produces prose — typically as bracketed indices like [3] or [doc:42]. The model has to maintain, in working memory, a mapping from logical position in the retrieved context to the bracketed index in the prompt. As the output grows, that mapping has to compete with everything else the model is attending to.

Empirically the slip rate climbs roughly linearly with output length: by paragraph four of a long answer, the model is often citing the wrong document index entirely, or pointing at indices that don’t appear in the prompt. The phenomenon mirrors context rot — the citation indices live in the prompt, which is exactly the part of context the model attends to least once it’s deep into generation.

Post-hoc extraction sidesteps this by running on bounded (claim, context) pairs. The extractor never has to track index state across paragraphs because each call is independent.

The simplest viable input for the extractor is a JSON object with three fields: claim (a single sentence from the generated answer), documents (a list of (doc_id, text) pairs from the retrieved context, typically the top 3-5 most relevant), and instruction (a prompt such as “return the document ID and character offsets of the span that supports this claim, or null if no document supports it”).
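As a rough sketch, the three-field request described above could be built like this. The class names, the build_request helper, and the sample document are illustrative assumptions, not part of any particular extractor's API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Document:
    doc_id: str
    text: str


@dataclass
class ExtractionRequest:
    claim: str                  # one sentence from the generated answer
    documents: list[Document]   # candidate context, e.g. the top 3-5 documents
    instruction: str


INSTRUCTION = (
    "Return the document ID and character offsets of the span that supports "
    "this claim, or null if no document supports it."
)


def build_request(claim: str, documents: list[Document]) -> str:
    """Serialize one (claim, context) pair as a JSON string."""
    request = ExtractionRequest(claim=claim, documents=documents, instruction=INSTRUCTION)
    return json.dumps(asdict(request), indent=2)


payload = build_request(
    "The warranty lasts two years.",
    [Document("kb-17", "All devices ship with a two-year limited warranty.")],
)
print(payload)
```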

A fine-tuned 7B model can do this at sub-100ms latency per claim. For a typical 4-sentence answer that’s up to ~400ms of overhead on top of generation if the claims run sequentially, and closer to the latency of a single call when they run in parallel. The token cost is dominated by the candidate documents, but you can re-use the retrieval result, so no second retrieval pass is needed.
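Because each claim is handled independently, the calls can be issued concurrently. The sketch below simulates this with asyncio; extract_citation is a hypothetical stand-in that only sleeps for about 100 ms, not a real model call.

```python
import asyncio
import time


async def extract_citation(claim: str) -> dict:
    """Stand-in for one extractor call; sleeps to simulate ~100 ms of latency."""
    await asyncio.sleep(0.1)
    return {"claim": claim, "citation": None}


async def extract_all(claims: list[str]) -> list[dict]:
    # Each (claim, context) call is independent, so they can run concurrently.
    # gather() returns results in the same order as the input claims.
    return await asyncio.gather(*(extract_citation(c) for c in claims))


claims = ["Claim one.", "Claim two.", "Claim three.", "Claim four."]
start = time.perf_counter()
results = asyncio.run(extract_all(claims))
elapsed = time.perf_counter() - start
print(f"{len(results)} claims in {elapsed:.2f}s")
```

With the simulated calls overlapping, the total wall time should land near one call's latency rather than four times it.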

Output structure matters more than model size. Forcing the extractor to emit (doc_id, start_char, end_char) rather than free-text “the part where it says X” prevents the model from rewriting or paraphrasing the source, which is the whole point of attribution.
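A structured output also makes the result checkable: the offsets either resolve to verbatim source text or they do not. The sketch below assumes half-open character offsets (start inclusive, end exclusive) and a dict-shaped citation; resolve_span is a hypothetical helper.

```python
from typing import Optional


def resolve_span(citation: Optional[dict], documents: dict[str, str]) -> Optional[str]:
    """Return the verbatim cited text, or None for a null citation.

    Raises ValueError if the citation names an unknown document or uses
    offsets that fall outside that document.
    """
    if citation is None:
        return None
    doc_id = citation["doc_id"]
    start, end = citation["start_char"], citation["end_char"]
    if doc_id not in documents:
        raise ValueError(f"unknown document: {doc_id}")
    text = documents[doc_id]
    if not 0 <= start < end <= len(text):
        raise ValueError(f"offsets {start}-{end} out of range for {doc_id}")
    return text[start:end]  # sliced from the source, never rewritten


documents = {"kb-17": "All devices ship with a two-year limited warranty."}
citation = {"doc_id": "kb-17", "start_char": 24, "end_char": 49}
print(resolve_span(citation, documents))  # two-year limited warranty
```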

The post-hoc extractor pattern

The reliable pattern: generate the answer first, then run a separate citation extractor over each (claim, candidate context). The extractor is a smaller specialized model (or a structured-output frontier-model call with a tight prompt) that takes a single claim plus a small context window and returns the supporting span — or “no support found”.

Architecturally this looks like a tiny reranker over candidate spans within the context — for each candidate span, score how strongly it entails the claim, take the top-K above some threshold.
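The score-then-threshold shape can be sketched as follows. The scorer here is a toy token-overlap function standing in for a real entailment model, and the k and threshold defaults are arbitrary assumptions; only the control flow is the point.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; 'two-year' becomes {'two', 'year'}."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(claim: str, span: str) -> float:
    """Toy stand-in for an entailment scorer: share of claim tokens found in the span."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(span)) / len(claim_tokens)


def top_supporting_spans(claim: str, spans: list[str], k: int = 2, threshold: float = 0.5):
    """Score every candidate span, keep those at or above threshold, return the best k."""
    scored = [(support_score(claim, s), s) for s in spans]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)[:k]


claim = "Devices ship with a two-year warranty."
spans = [
    "All devices ship with a two-year limited warranty.",
    "Shipping takes five business days.",
    "The warranty covers manufacturing defects.",
]
print(top_supporting_spans(claim, spans))
```

An empty result plays the role of “no support found”.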

Evaluation

Citation extraction is evaluated as a binary span retrieval task:

Precision: of the spans the extractor cited, how many were actually supporting (per gold labels)?
Recall: of the gold supporting spans, how many did the extractor cite?
F1: harmonic mean of the two. See F1 score.
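Under exact-span matching, these three metrics reduce to set arithmetic. A minimal sketch, assuming each span is a (doc_id, start, end) tuple and that an empty prediction or gold set scores 0.0:

```python
def span_prf(predicted: set, gold: set) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of (doc_id, start, end) spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("doc-1", 100, 150), ("doc-2", 10, 40)}
predicted = {("doc-1", 100, 150), ("doc-3", 5, 25)}
print(span_prf(predicted, gold))  # (0.5, 0.5, 0.5)
```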

For real-world contexts where exact-span matching is too strict (gold says tokens 100-150, extractor says 105-148), use IoU (intersection over union) or partial-overlap thresholds — typically IoU > 0.5 counts as a match.
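For the example above, the overlap works out as follows. This sketch assumes both spans are in the same document and treats offsets as half-open intervals (start inclusive, end exclusive).

```python
def span_iou(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Intersection over union of two half-open [start, end) spans."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union else 0.0


gold, predicted = (100, 150), (105, 148)
iou = span_iou(gold, predicted)
print(round(iou, 2), iou > 0.5)  # 0.86 True
```

Here the intersection is 43 units and the union is 50, so the prediction counts as a match at the 0.5 threshold.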

This is one of the rare retrieval-adjacent tasks where the F1 score is the natural metric: the output is unordered (a set of spans), so position-weighted metrics like NDCG don’t apply.

Where citation extraction sits in a RAG pipeline

A RAG pipeline that takes citations seriously has four stages:

1. First-pass retrieval — get candidate documents.
2. Reranker — order them.
3. Generator — produce an answer from top-K context.
4. Citation extractor — map answer claims back to spans.

Stages 1-3 produce the answer; stage 4 produces the audit trail. The two are separable — you can ship stage 4 without changing stages 1-3, which is why citation extraction is often a retrofit on existing RAG systems rather than a from-scratch design choice.
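The retrofit can be pictured as a wrapper around an unchanged pipeline. Everything in this sketch is a placeholder: existing_rag stands in for the first three stages, stub_extractor for a real extractor, and the sentence splitting is deliberately naive.

```python
def existing_rag(query: str) -> tuple[str, list[str]]:
    """Placeholder for retrieval, reranking, and generation, left unchanged."""
    docs = ["All devices ship with a two-year limited warranty."]
    return "Devices have a two-year warranty.", docs


def stub_extractor(claim: str, docs: list[str]):
    """Placeholder extractor: always reports no support."""
    return None


def with_citations(rag, extractor):
    """Wrap an existing pipeline so each answer sentence gets a citation lookup."""
    def run(query: str) -> dict:
        answer, docs = rag(query)
        # Naive split on periods; a real system would use a sentence segmenter.
        claims = [s.strip() + "." for s in answer.split(".") if s.strip()]
        return {"answer": answer, "citations": {c: extractor(c, docs) for c in claims}}
    return run


pipeline = with_citations(existing_rag, stub_extractor)
result = pipeline("How long is the warranty?")
print(result)
```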

For high-stakes deployments — legal, medical, regulatory — citation extraction is non-negotiable. It turns “the model said so” into “page 47, paragraph 3 says so.”

Where citation extraction is load-bearing

Legal research tools that have to point a partner at the controlling case before the answer is trusted.
Clinical decision support where every recommendation cites a guideline section.
Compliance summaries that have to map each finding back to a regulation paragraph.
Customer-support copilots where agent-facing answers expose the underlying KB article and offset.
Financial analyst workflows where every claim about an earnings report points at a 10-K page.

Go further

Why is citation extraction a separate task from generation?

Generation produces fluent prose; citation extraction produces precise span pointers. The two skills don't compose naturally — large generators tend to hallucinate citations as confidently as they hallucinate facts. Running a small specialized model that takes (claim, candidate context) → span gives you cleaner pointers and decouples auditability from the generator's quality.

Can't the generator just include citations inline?

It can, and many do, but inline citations from a frontier model are still wrong on the order of 5-15% of claims — the model picks a plausibly-related span rather than the actually-supporting one. A dedicated post-hoc citation extractor running over the (claim, context) pair pushes that error rate down considerably.

How is citation extraction evaluated?

Span-level precision and recall against gold-labeled supporting spans, then F1. Precision: of the citations you returned, how many were actually supporting? Recall: of the supporting spans, how many did you cite? F1 ties them together. For long documents, exact-span match is too strict — IoU or partial-overlap thresholds work better.
