Also known as: self-critique, self-evaluation, self-refine, reflexion
TL;DR
Reflection is the agent self-evaluation pattern: produce an answer, evaluate it against the goal or known criteria, refine if needed. It catches errors that one-shot generation misses, at the cost of extra tokens and latency.
Reflection (sometimes called self-critique, self-refine, or Reflexion) is the pattern of having a model evaluate its own output and revise it. The basic shape: generate an answer, then prompt the model (or a separate evaluator) to critique it, then optionally regenerate using the critique as feedback. Over multiple rounds, errors that one-shot generation produced get caught and corrected.
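The generate, critique, regenerate shape can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Model` type, the prompt wording, the "reply exactly OK" stopping convention, and the default of two rounds are all assumptions made for the example, and the stub functions stand in for real model calls.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion string.
Model = Callable[[str], str]


def reflect(task: str, generate: Model, critique: Model, max_rounds: int = 2) -> str:
    """Generate an answer, critique it, and revise until the critic approves."""
    answer = generate(task)
    for _ in range(max_rounds):
        feedback = critique(
            f"Task: {task}\nAnswer: {answer}\n"
            "List concrete problems with this answer, or reply exactly OK."
        )
        if feedback.strip() == "OK":
            break  # the critic found nothing to fix
        # Regenerate, feeding the critique back in as feedback.
        answer = generate(
            f"Task: {task}\nPrevious answer: {answer}\n"
            f"Critique: {feedback}\nWrite an improved answer."
        )
    return answer


# Stub models (assumptions for the demo): the generator is wrong on the first
# try and correct once it sees a critique; the critic only accepts "4".
def stub_generate(prompt: str) -> str:
    return "4" if "Critique:" in prompt else "5"


def stub_critique(prompt: str) -> str:
    return "OK" if "Answer: 4" in prompt else "2 + 2 is not 5."


print(reflect("What is 2 + 2?", stub_generate, stub_critique))  # prints 4
```

Passing the generator and the critic as separate callables keeps the loop agnostic about whether the critic is the same model, a different model, or a tool.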
Why it works (when it works)
Recognition is often easier than generation. A model that writes a buggy line of code on the first try will often flag the bug when shown the same code with a “review this” prompt. The same holds for arithmetic mistakes, missed cases in plans, and claims not grounded in retrieved context. Reflection exploits this asymmetry.
It also exploits a different sampling trajectory: the second pass starts from a different prompt (the original output plus a critique request), so the model isn’t anchored on its first chain of thought. Self-consistency aggregates multiple parallel samples; reflection is the sequential cousin.
The limit is that a model and its critic share the same parametric memory and the same blind spots. If the model generates “the capital of Australia is Sydney” because that’s what its weights weakly encode, it will also affirm that claim when asked to verify it: the critic uses the same recall mechanism that produced the error. Self-critique gains traction only when the critic prompt unlocks a different reasoning path. “List all reasons this might be wrong” steers the model toward error-search behavior the original prompt didn’t, similar to chain-of-thought activating different intermediate computations.
Empirically, the lift from same-model critique is measurable, but returns diminish sharply after the first 2-3 rounds. Cross-model critique (different family, different training data, different blind spots) consistently beats same-model critique: error correlations across model families are weaker, so the critic is more likely to flag the generator’s mistake. Tool-grounded critique (running the code, executing the SQL) beats both, because the check doesn’t depend on any model’s judgment.
Common reflection patterns
Generate-then-critique. Two model calls: produce, critique. Stop here, or feed the critique into a third generate call. Simple; ~2× cost; effective on coding, math, structured outputs.
Reflexion-style loops. The agent maintains a running self-feedback log across attempts on a task. Each attempt sees the previous critiques. Used in long-horizon agentic settings where the same task gets retried multiple times with adjustments.
Verifier-guided. A specialized verifier model (often distilled from the LLM’s own reasoning traces) scores candidate outputs from the original LLM, and the highest-scoring candidate is selected. Cheaper than letting the LLM critique itself, and often more reliable.
Tool-grounded reflection. The critique step actually runs the code, executes the SQL, or checks the citation. Best signal-to-noise — there’s no debate about whether the test passed. Closest thing to a free lunch in this space.
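As a rough sketch, the tool-grounded and Reflexion-style patterns combine naturally: the critique is whatever the tests report, and a running log of failures is shown to each new attempt. Everything named here (`run_tests`, `solve_with_tests`, the prompt wording, the three-attempt default) is hypothetical, and the stub generator stands in for a model call. The `exec` call is for illustration only; real systems would run candidate code in a sandbox.

```python
from typing import Callable, List, Optional, Tuple

# A test case is (positional args, expected return value).
Case = Tuple[tuple, object]


def run_tests(source: str, cases: List[Case], func_name: str) -> List[str]:
    """Execute candidate source and return failure messages (empty list = pass)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # illustration only: sandbox this in practice
        func = namespace[func_name]
    except Exception as exc:
        return [f"code did not load: {exc!r}"]
    failures = []
    for args, expected in cases:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return failures


def solve_with_tests(
    task: str,
    generate: Callable[[str], str],
    cases: List[Case],
    func_name: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Retry generation, showing each attempt the failures logged so far."""
    feedback_log: List[str] = []  # running self-feedback log across attempts
    for attempt in range(1, max_attempts + 1):
        prompt = task
        if feedback_log:
            prompt += "\nEarlier attempts failed:\n" + "\n".join(feedback_log)
        source = generate(prompt)
        failures = run_tests(source, cases, func_name)
        if not failures:
            return source, attempt
        feedback_log.extend(f"attempt {attempt}: {msg}" for msg in failures)
    return None, max_attempts


# Stub generator (assumption for the demo): buggy first, fixed after feedback.
def stub_generate(prompt: str) -> str:
    if "Earlier attempts failed" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


cases = [((1, 2), 3), ((0, 0), 0)]
source, attempts = solve_with_tests("Write add(a, b).", stub_generate, cases, "add")
print(attempts)  # prints 2
```

The critic here never calls a model, so there is no shared blind spot between generator and critic; the only judgment involved is in choosing the test cases.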
Where it doesn’t help
Subjective tasks. “Is this poem good?” — the critique is no more reliable than the generation.
Confidently-wrong domains. If the model is wrong in a way it doesn’t recognize as wrong, the critique will affirm the bad answer.
Cost- and latency-sensitive settings. Doubling per-turn cost and latency for a marginal quality lift is a bad trade for chat UX. Reserve reflection for high-stakes turns or batch settings.
The cost-aware version
Naive reflection runs on every output and roughly doubles cost. Selective reflection runs only when a cheap confidence signal flags risk — a calibrated reranker score for retrieval grounding, a verifier head, a distilled-classifier output. This recovers most of the quality lift at a fraction of the cost.
Cheap signal triggers expensive reasoning. The right reflection architecture is selective, not blanket.
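A minimal sketch of that gating logic, assuming a hypothetical `confidence` scorer that returns a calibrated value in [0, 1] and an illustrative threshold of 0.7 (both are assumptions, as are the stub functions; a real threshold would be tuned on held-out data):

```python
from typing import Callable, Tuple


def answer_selectively(
    task: str,
    generate: Callable[[str], str],
    confidence: Callable[[str, str], float],
    reflect: Callable[[str, str], str],
    threshold: float = 0.7,  # assumed cutoff, not a recommended value
) -> Tuple[str, bool]:
    """Return (answer, was_reflected); reflect only on low-confidence drafts."""
    draft = generate(task)
    if confidence(task, draft) >= threshold:
        return draft, False  # cheap path: ship the draft as-is
    return reflect(task, draft), True  # expensive path: critique and revise


# Stubs (assumptions for the demo) standing in for a model, a cheap
# confidence scorer, and a full reflection step.
def stub_generate(task: str) -> str:
    return f"draft: {task}"


def stub_confidence(task: str, draft: str) -> float:
    return 0.2 if "refund" in task else 0.95


def stub_reflect(task: str, draft: str) -> str:
    return draft + " (revised)"


tasks = ["greet the user", "compute the refund amount", "say goodbye"]
results = [
    answer_selectively(t, stub_generate, stub_confidence, stub_reflect)
    for t in tasks
]
reflected = sum(1 for _, was_reflected in results if was_reflected)
print(f"{reflected} of {len(tasks)} turns paid for reflection")
# prints: 1 of 3 turns paid for reflection
```

Returning the `was_reflected` flag alongside the answer makes it easy to log how often the expensive path fires, which is the number that determines the actual cost overhead.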
Go further
Does reflection actually improve quality, or just cost?
It improves quality on tasks where the model can recognize errors it can't avoid producing — coding, math, multi-step planning, factual grounding against retrieved context. It does not help on tasks where the model is equally wrong about the answer and the critique (subjective writing, common-sense judgments). Empirically: useful where there's an external check or a clear criterion.
As the critic, a different model often beats the same model, because error correlations are weaker across model families. The cheapest reflection setup that consistently beats no-reflection is 'big model generates, small specialized model critiques against criteria.' The big-with-itself loop also works but has diminishing returns past 1-2 rounds.
Naive reflection roughly doubles latency and token spend per turn. Selective reflection — only run the critique when a confidence signal flags risk — recovers most of the win at a fraction of the cost. The confidence signal can be a calibrated score from a specialized model rather than the LLM itself.