Reflection and Critique

Also known as: self-critique, self-evaluation, self-refine, reflexion

TL;DR

Reflection is the agent self-evaluation pattern: produce an answer, evaluate it against the goal or known criteria, refine if needed. It catches errors that one-shot generation misses, at the cost of extra tokens and latency.

Reflection (sometimes called self-critique, self-refine, or Reflexion) is the pattern of having a model evaluate its own output and revise it. The basic shape: generate an answer, then prompt the model (or a separate evaluator) to critique it, then optionally regenerate using the critique as feedback. Over multiple rounds, errors that one-shot generation produced get caught and corrected.
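The basic shape above can be sketched as a small loop. This is a minimal illustration, not a reference implementation: `reflect`, `toy_generate`, `toy_critique`, and `toy_revise` are hypothetical names, and the toy functions are hard-coded stand-ins for real model calls.

```python
from typing import Callable, Optional, Tuple


def reflect(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 2,
) -> Tuple[str, int]:
    """Generate, critique, revise. Returns (final answer, revisions made)."""
    answer = generate(task)
    for round_idx in range(max_rounds):
        feedback = critique(task, answer)
        if feedback is None:  # critic found nothing to fix: stop early
            return answer, round_idx
        answer = revise(task, answer, feedback)
    return answer, max_rounds  # budget exhausted


# Toy stand-ins (assumptions): a real system would call a model here.
def toy_generate(task: str) -> str:
    return "The capital of Australia is Sydney."


def toy_critique(task: str, answer: str) -> Optional[str]:
    if "Sydney" in answer:
        return "Wrong city: the capital is Canberra."
    return None


def toy_revise(task: str, answer: str, feedback: str) -> str:
    return answer.replace("Sydney", "Canberra")


final, revisions = reflect(
    "What is the capital of Australia?", toy_generate, toy_critique, toy_revise
)
print(final, revisions)
```

The `max_rounds` cap is the budget; the early return on an empty critique is the "clean" exit.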

[Figure: Reflection / self-critique: draft, critique your own draft, revise. Pass 1 (generate) produces the draft "Capital of Australia is Sydney. It has been the capital since 1788. Parliament sits there year-round." Pass 2 (evaluate; same model, new prompt) finds two issues: wrong city (Canberra is the capital) and wrong year (capital since 1927). Pass 3 (regenerate) produces "Capital of Australia is Canberra. It has held that role since 1927. Parliament sits there year-round.", resolving both issues. The loop runs until the draft is clean or the budget is hit, with diminishing returns past 1-2 rounds with the same model.]

Why it works (when it works)

Recognition is often easier than generation. A model that produces a buggy line of code might fail to write it correctly first try, but happily flag the bug if shown the code with a “review this” prompt. Same for arithmetic mistakes, missed cases in plans, ungrounded claims against retrieved context. Reflection exploits this asymmetry.

It also exploits a different sampling trajectory: the second pass starts from a different prompt (the original output plus a critique request), so the model isn’t anchored on its first chain of thought. Parallel-sampling methods such as self-consistency aggregate multiple parallel samples; reflection is the sequential cousin.

A model and its critic share the same parametric memory and the same blind spots. If the model generates “the capital of Australia is Sydney” because that’s what its weights weakly encode, it will also affirm that fact when asked to verify, because the critic uses the same recall mechanism that produced the error. Self-critique gains traction only when the critic prompt unlocks a different reasoning path: “list all reasons this might be wrong” steers the model toward error-search behavior the original prompt didn’t, much as chain-of-thought prompting activates different intermediate computations.

Empirically, the lift from same-model critique is measurable but shows sharply diminishing returns past 1-2 rounds. Cross-model critique (different family, different training data, different blind spots) often beats same-model critique: error correlations across model families are weaker, so the critic is more likely to flag the generator’s mistake. Tool-grounded critique (running the code, executing the SQL) beats both, because there’s no model bias to share with anyone.
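A minimal sketch of tool-grounded critique, assuming the output under review is a Python function and the "tool" is simply running it against known cases. `run_checks`, `tool_grounded_reflection`, and `toy_propose` are hypothetical names; `toy_propose` stands in for a model that regenerates code when given a failure report.

```python
from typing import Callable, List, Optional, Tuple

Case = Tuple[tuple, object]  # (positional args, expected return value)


def run_checks(source: str, func_name: str, cases: List[Case]) -> Optional[str]:
    """Execute candidate source; return a failure report, or None if all pass."""
    namespace: dict = {}
    try:
        # Unsafe for untrusted code: a real system would run this in a sandbox.
        exec(source, namespace)
        func = namespace[func_name]
        for args, expected in cases:
            got = func(*args)
            if got != expected:
                return f"{func_name}{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def tool_grounded_reflection(
    propose: Callable[[Optional[str]], str],
    func_name: str,
    cases: List[Case],
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Regenerate until the checks pass or the round budget is spent."""
    feedback: Optional[str] = None
    source = ""
    for attempt in range(1, max_rounds + 1):
        source = propose(feedback)  # feedback is None on the first attempt
        feedback = run_checks(source, func_name, cases)
        if feedback is None:
            return source, True, attempt
    return source, False, max_rounds


# Toy generator (assumption): buggy first draft, fixed once it sees feedback.
def toy_propose(feedback: Optional[str]) -> str:
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


source, passed, attempts = tool_grounded_reflection(
    toy_propose, "add", [((1, 2), 3), ((0, 0), 0)]
)
print(passed, attempts)
```

The critique here is the failure report itself, so the signal does not depend on what the generator believes about its own code.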

Common reflection patterns

  • Generate-then-critique. Two model calls: produce, critique. Stop here, or feed the critique into a third generate call. Simple; ~2× cost; effective on coding, math, structured outputs.
  • Reflexion-style loops. The agent maintains a running self-feedback log across attempts on a task. Each attempt sees the previous critiques. Used in long-horizon agentic settings where the same task gets retried multiple times with adjustments.
  • Verifier-guided. A specialized verifier model (often distilled from the LLM’s own reasoning traces) scores candidate outputs, and the highest-scoring candidate is kept. Cheaper than letting the LLM critique itself, often more reliable.
  • Tool-grounded reflection. The critique step actually runs the code, executes the SQL, or checks the citation. Best signal-to-noise — there’s no debate about whether the test passed. Closest thing to a free lunch in this space.
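As one illustration of the patterns above, verifier-guided selection reduces to "score every candidate, keep the best". The sketch below is a hedged example: `select_best` and `toy_verifier` are hypothetical names, and the keyword-matching scorer is only a stand-in for a trained verifier model, reusing the Canberra example from the figure.

```python
from typing import Callable, List, Tuple


def select_best(
    candidates: List[str], verifier: Callable[[str], float]
) -> Tuple[str, float]:
    """Score each candidate with the verifier and keep the highest-scoring one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(verifier(text), text) for text in candidates]
    # max() keeps the first candidate among ties.
    best_score, best_text = max(scored, key=lambda pair: pair[0])
    return best_text, best_score


# Toy verifier (assumption): fraction of expected keywords present.
# A real verifier would be a separate scoring model, not string matching.
def toy_verifier(text: str) -> float:
    expected = ("Canberra", "1927")
    return sum(word in text for word in expected) / len(expected)


candidates = [
    "The capital is Sydney, since 1788.",
    "The capital is Canberra, since 1788.",
    "The capital is Canberra, since 1927.",
]
best, score = select_best(candidates, toy_verifier)
print(best, score)
```

The generator is never asked to judge its own output here; the only extra cost is one verifier call per candidate.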

Where it doesn’t help

  • Subjective tasks. “Is this poem good?” — the critique is no more reliable than the generation.
  • Confidently-wrong domains. If the model is wrong in a way it doesn’t recognize as wrong, the critique will affirm the bad answer.
  • High-cost tight-latency settings. Doubling per-turn cost and latency for a marginal quality lift is a bad trade for chat UX. Reserve reflection for high-stakes turns or batch settings.

The cost-aware version

Naive reflection runs on every output and roughly doubles cost. Selective reflection runs only when a cheap confidence signal flags risk — a calibrated score for retrieval grounding, a verifier head, a distilled-classifier output. This recovers most of the quality lift at a fraction of the cost.

Cheap signal triggers expensive reasoning. The right reflection architecture is selective, not blanket.
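A minimal sketch of the selective gate, assuming a confidence function that returns a score in [0, 1]. The names `selective_reflect`, `toy_confidence`, and `toy_reflect`, the 0.7 threshold, and the "has a citation marker" heuristic are all illustrative assumptions, not values from a real system.

```python
from typing import Callable, Dict, Tuple


def selective_reflect(
    task: str,
    generate: Callable[[str], str],
    confidence: Callable[[str, str], float],
    reflect: Callable[[str, str], str],
    threshold: float = 0.7,
) -> Tuple[str, bool]:
    """Run the expensive reflection step only when the cheap signal flags risk.

    Returns (answer, whether reflection ran).
    """
    draft = generate(task)
    if confidence(task, draft) >= threshold:
        return draft, False  # confident enough: ship the draft as-is
    return reflect(task, draft), True


# Toy stand-ins (assumptions) for the model, the cheap signal, and the critic.
DRAFTS: Dict[str, str] = {
    "q1": "Grounded answer [1].",
    "q2": "Unsupported claim.",
}


def toy_generate(task: str) -> str:
    return DRAFTS[task]


def toy_confidence(task: str, draft: str) -> float:
    # Pretend a citation marker means the draft is grounded in retrieved context.
    return 0.9 if "[1]" in draft else 0.3


def toy_reflect(task: str, draft: str) -> str:
    return "REVISED: " + draft


results = {
    task: selective_reflect(task, toy_generate, toy_confidence, toy_reflect)
    for task in DRAFTS
}
print(results)
```

Only `q2` pays for the critique pass; `q1` goes out at one-shot cost.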

Go further

Does reflection actually improve quality, or just cost?

It improves quality on tasks where the model can recognize errors it can't avoid producing — coding, math, multi-step planning, factual grounding against retrieved context. It does not help on tasks where the model is equally wrong about the answer and the critique (subjective writing, common-sense judgments). Empirically: useful where there's an external check or a clear criterion.

Critique with the same model, or a different one?

Different model often beats same model — error correlations are weaker across model families. The cheapest reflection setup that consistently beats no-reflection is 'big model generates, small specialized model critiques against criteria.' The big-with-itself loop also works but has diminishing returns past 1-2 rounds.

What's the cost overhead in practice?

Naive reflection roughly doubles latency and token spend per turn. Selective reflection — only run the critique when a confidence signal flags risk — recovers most of the win at a fraction of the cost. The confidence signal can be a calibrated score from a specialized model rather than the LLM itself.
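A back-of-the-envelope sketch of that trade-off, under the simplifying assumption that a reflection pass costs about as much as the original generation. The function name and the 20% flag rate and 5% signal overhead below are illustrative numbers, not measurements.

```python
def expected_cost_multiplier(
    flag_rate: float,
    reflection_overhead: float = 1.0,
    signal_overhead: float = 0.0,
) -> float:
    """Expected per-turn cost relative to one-shot generation (1.0 = no reflection).

    flag_rate: fraction of turns on which the confidence signal triggers reflection.
    reflection_overhead: cost of a critique-and-revise pass, in units of one generation.
    signal_overhead: cost of computing the confidence signal, in the same units.
    """
    if not 0.0 <= flag_rate <= 1.0:
        raise ValueError("flag_rate must be between 0 and 1")
    return 1.0 + signal_overhead + flag_rate * reflection_overhead


# Naive reflection: every turn is reflected on, no gating signal.
naive = expected_cost_multiplier(flag_rate=1.0)

# Selective reflection (hypothetical numbers): 20% of turns flagged, 5% signal cost.
selective = expected_cost_multiplier(flag_rate=0.2, signal_overhead=0.05)

print(naive, round(selective, 2))
```

Under these assumed numbers, selective reflection costs roughly a quarter more than one-shot generation, while naive reflection costs twice as much.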
