Also known as: CAI, RLAIF, Reinforcement Learning from AI Feedback
TL;DR
Constitutional AI replaces human pairwise preference labels with a written constitution — a list of natural-language rules — and uses an LLM to critique and revise its own outputs against those rules.
Constitutional AI (CAI) is the alignment technique introduced in Bai et al. (Anthropic, 2022) that replaces human pairwise preferences with AI-generated preferences guided by a written constitution. Instead of asking humans “which response is better?”, you give an LLM a constitution — natural-language rules like “the assistant should be helpful, honest, and avoid providing dangerous information” — and ask the model to compare its own outputs against those rules. The result is RL from AI Feedback (RLAIF).
The two-stage recipe
Supervised stage: critique and revise
Sample responses from a starting model on prompts (often red-team prompts probing harmful behavior); Bai et al. start from a helpful-only RLHF model. For each response, prompt the same model: “Critique your response according to this principle from the constitution. Then revise to address the critique.”
The critique-revise loop runs once or several times per prompt, with a principle drawn from the constitution each round. Output: a (prompt, revised response) dataset where the revisions reflect the constitution. Fine-tune a pretrained model on this synthetic dataset to obtain the SFT model used in the RL stage.
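The supervised stage can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical callable wrapping whatever model you sample from, the prompt templates and the two entries in `PRINCIPLES` are invented for the example, and `toy_generate` is a stand-in so the code runs without a model.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical critique principles, written for illustration only.
PRINCIPLES = [
    "Identify ways the response is harmful, unethical, or dangerous.",
    "Identify ways the response is dishonest or misleading.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
    n_rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Return one (prompt, revised_response) row for the SFT dataset."""
    rng = rng or random.Random(0)
    response = generate(prompt)  # initial draft
    for _ in range(n_rounds):
        principle = rng.choice(principles)  # one principle per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def toy_generate(text: str) -> str:
    """Stand-in for a real model call (assumption: any str -> str function works)."""
    if "Rewrite the response" in text:
        return "REVISED"
    if "Critique the response" in text:
        return "CRITIQUE"
    return "DRAFT"


row = critique_and_revise("How do I pick a lock?", toy_generate, PRINCIPLES)
```

Each prompt costs one draft call plus two calls per round, so `n_rounds` directly controls the generation budget for the synthetic dataset.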
RL stage: AI preferences replace human
Sample two responses per prompt from the SFT model. Show both responses to the model along with a constitutional principle, ask “which response better satisfies this principle?” Aggregate across many principles and prompts to build a pairwise preference dataset.
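A rough sketch of the labeling step follows. The `judge` callable, the prompt wording, and the row schema (`chosen` / `rejected`) are assumptions made for illustration. The sketch records hard A/B labels for simplicity; a real pipeline might instead keep the judge's probabilities for each option as soft targets.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple


def label_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    judge: Callable[[str], str],
) -> Optional[Dict[str, str]]:
    """Ask the judge which response better satisfies one principle."""
    question = (
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        f"Which response better satisfies this principle: {principle}\n"
        "Answer with A or B."
    )
    answer = judge(question).strip().upper()
    if answer.startswith("A"):
        chosen, rejected = response_a, response_b
    elif answer.startswith("B"):
        chosen, rejected = response_b, response_a
    else:
        return None  # unparseable judgment: drop the pair
    return {"prompt": prompt, "chosen": chosen,
            "rejected": rejected, "principle": principle}


def build_preferences(
    samples: List[Tuple[str, str, str]],
    principles: List[str],
    judge: Callable[[str], str],
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Label every (prompt, response_a, response_b) triple with a sampled principle."""
    rng = random.Random(seed)
    rows = []
    for prompt, a, b in samples:
        row = label_pair(prompt, a, b, rng.choice(principles), judge)
        if row is not None:
            rows.append(row)
    return rows


def toy_judge(question: str) -> str:
    """Stand-in judge that prefers whichever option is a refusal."""
    return "A" if "(A) I can't" in question else "B"
```

In practice you would also randomize which response is shown as (A), since LLM judges can favor one position regardless of content.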
Train a reward model on those AI-labeled preferences exactly as in RLHF, then run PPO against that reward model. Newer variants skip the explicit reward model and apply DPO directly to the preference pairs. Output: an aligned policy.
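The text does not name the reward-model objective. A common choice in RLHF-style pipelines, assumed here, is the Bradley-Terry pairwise loss, which pushes the score of the chosen response above the score of the rejected one. A minimal scalar sketch:

```python
import math


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        # -log(sigmoid(m)) = log(1 + exp(-m))
        return math.log1p(math.exp(-margin))
    # Same quantity rewritten to avoid overflow when the margin is very negative.
    return -margin + math.log1p(math.exp(margin))


# Equal scores mean the model is indifferent, so the loss is log(2).
indifferent = pairwise_loss(0.0, 0.0)
# A correctly ordered pair is penalized less than a mis-ordered one.
good, bad = pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0)
```

In a real trainer the two rewards come from the same network scoring (prompt, chosen) and (prompt, rejected), and the loss is averaged over a batch.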
What’s actually in a constitution
A constitution is a list of natural-language principles. Anthropic’s published examples include:
“Choose the response that is more helpful, honest, and harmless.”
“Choose the response that least implies you are a person; not a language model.”
“Choose the response that least gives any harmful, unethical, racist, sexist, toxic, dangerous, or illegal advice.”
Principles are sourced from human-rights frameworks (the UN Universal Declaration of Human Rights), terms-of-service policies, and bespoke rules for the model’s deployment context. The total count is small (dozens of principles, not thousands) because each principle generates many preference labels by being applied to many response pairs.
How it differs from RLHF
Why constitutions over preferences
Three things, in order of importance:
Auditability. A team or regulator can read the 30-line constitution and challenge any rule. With pairwise preferences, the implicit value system lives in tens of thousands of labels — no human can audit them. CAI moves the values into a document.
Editability. Suppose you decide to soften a rule such as “never refuse a question about copyrighted material” into something more nuanced. With preferences, you have to re-label every relevant pair. With CAI, you edit the constitution, regenerate AI preferences against the new rule, and retrain. Iteration cost drops by orders of magnitude.
Generalization. A model trained on preferences memorizes “in scenarios shaped like training, prefer X.” A model trained against principles can apply them to scenarios not in training data (Bai et al. show this on out-of-distribution red-team prompts). The principle “avoid providing instructions for weapons of mass destruction” generalizes; a list of refusals to specific weapon-related prompts does not.
The downside is preference fidelity: AI labels agree with humans roughly 80-90% of the time, depending on rule complexity. CAI's bet is that volume compensates, so that 100x more labels at 80% fidelity can outperform 1x labels at 95%.
RLAIF as a template
The pattern — replace human labels with LLM-as-judge labels — has spread far beyond CAI. Reranker training via zELO does the same thing for retrieval: pairwise LLM judgments of (query, doc_A, doc_B) replace human relevance annotations, with a Thurstone fit recovering scores. The intellectual lineage is direct.
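To make the "Thurstone fit" concrete, here is a generic Thurstone Case V maximum-likelihood sketch in plain Python. It is not zELO's actual procedure, which the text does not describe; the function names, learning rate, and step count are assumptions. The model says item i beats item j with probability Φ((s_i − s_j)/√2), and gradient ascent recovers the scores s from a list of (winner, loser) judgments.

```python
import math
from typing import List, Tuple


def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def fit_thurstone(
    n_items: int,
    wins: List[Tuple[int, int]],
    lr: float = 0.05,
    steps: int = 2000,
) -> List[float]:
    """Fit latent scores from (winner, loser) index pairs; scores are zero-centered."""
    scores = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for winner, loser in wins:
            z = (scores[winner] - scores[loser]) / math.sqrt(2.0)
            p = max(normal_cdf(z), 1e-12)  # guard against division by zero
            g = normal_pdf(z) / (p * math.sqrt(2.0))  # d/ds_winner of log Phi(z)
            grad[winner] += g
            grad[loser] -= g
        scores = [s + lr * g for s, g in zip(scores, grad)]
        mean = sum(scores) / n_items
        scores = [s - mean for s in scores]  # scores are only defined up to a shift
    return scores


# Toy judgments: doc 0 usually beats doc 1, which usually beats doc 2.
wins = ([(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
        + [(0, 2)] * 3 + [(2, 0)])
scores = fit_thurstone(3, wins)
```

If one item wins every comparison, its maximum-likelihood score is unbounded, so real pipelines usually add a prior or regularizer.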
Wherever you have a clear supervision shape (pairwise comparison) and a frontier model that can perform that comparison adequately, you can replace humans with AI feedback and multiply data volume by roughly two orders of magnitude. Many alignment and ranking pipelines after 2023 use this trick somewhere.
Limits
CAI inherits two failure modes from RLHF: reward hacking (the policy exploits artifacts of the AI judge) and over-optimization (the policy drifts to maximize reward at the cost of capability). The standard mitigation, a KL penalty against the SFT reference model, carries over directly from RLHF. CAI also adds a failure mode of its own: if a principle is poorly worded or internally contradictory, the AI judge will produce noisy labels, and the resulting policy will be confused. Constitution drafting is the new bottleneck.
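The KL penalty can be illustrated with a small sketch. It assumes the common formulation in which the reward-model score is reduced by β times a per-sequence KL estimate, taken as the summed log-probability gap between the policy and the SFT reference on the sampled tokens. The function name and the default `beta` are hypothetical.

```python
from typing import Sequence


def kl_shaped_reward(
    reward: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Return reward - beta * sum_t (log pi(y_t) - log pi_ref(y_t))."""
    if len(logp_policy) != len(logp_ref):
        raise ValueError("per-token log-prob sequences must have equal length")
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate


# Policy identical to the reference: no penalty.
unchanged = kl_shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0])
# Policy much more confident than the reference on its own samples: penalized.
drifted = kl_shaped_reward(1.0, [-0.5, -0.5], [-1.5, -1.5])
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward gains, which is the usual lever against over-optimization.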
Go further
How is Constitutional AI different from RLHF?
RLHF uses human-labeled pairwise preferences to train the reward model. CAI uses a written list of principles (the constitution) and asks an LLM to label preferences according to those principles — replacing humans in the labeling loop. The training stage is the same shape (preference model → RL); only the source of preferences changes.
Why use a constitution instead of just preferences?
Constitutions are auditable, editable, and explicit. A team can read the constitution and disagree with rule X; with pairwise preferences, you can only re-label thousands of examples. Constitutions also generalize better to novel scenarios — the model has principles to reason from, not just memorized preferences.
Is AI feedback cheaper than human feedback?
Yes, that's the point. Human preference labels cost $1-10 per pair and rate-limit alignment work to whatever a labeling team can produce. AI feedback costs cents per pair and runs at training-cluster speed, enabling 100x more preference data and finer-grained rule taxonomies. The trade-off is fidelity: AI labels match human preferences roughly 80-90% of the time, depending on the rule.