William Liu · Podcasts

T5E5 · 00:12:04

Constitutional AI: Harmlessness from AI Feedback

Maya and Leo explain Constitutional AI as a shift from many direct human harmlessness labels to written principles that guide model critiques, revisions, and AI-generated preference comparisons. The episode uses a healthcare scheduling assistant to show why the method aims for harmless but non-evasive behavior, and why AI judges still need human governance and auditing.

Transcript

Maya: A model drafts an unsafe answer to a user. Instead of sending that answer to a human labeler, the training system hands the model a small card of principles, asks it to criticize its own reply, asks it to rewrite the reply, and later asks another model which of two replies follows the card better.

Leo: That is the whole twist today: the feedback loop still has human values in it, but the immediate judge is no longer a human clicking a preference button.

Maya: Exactly. Last time, we saw that assistant behavior is not one target. Helpfulness, honesty, and refusal boundaries can pull against one another, especially when a user asks for something risky.

Leo: Today's Constitutional AI paper changes the supervision question. Instead of asking, how many human comparisons do we need for harmlessness, it asks, what if humans wrote the principles, and the model helped apply them?

Maya: Plainly: Constitutional AI is a way to train an assistant by making the rules visible. The constitution is a short set of written principles, like avoid helping with dangerous wrongdoing, do not reinforce hateful assumptions, and explain safe alternatives when refusing.

Leo: So the constitution is not a government document. It is more like a policy card the training process can point at repeatedly.

Maya: Right. And the important move is not just writing the card. The move is turning that card into training data through a model's own critiques, revisions, and comparisons.

Leo: The Anthropic paper behind this episode is trying to reduce direct human labels for harmlessness. Not because humans stop mattering, but because human oversight becomes expensive, slow, and sometimes too implicit to audit.

Maya: The paper's pipeline has a supervised half and a reinforcement-learning half. Let's make each half audible.

Leo: Start with the Principle Card.

Maya: The Principle Card is the written constitution. Imagine our healthcare scheduling company. Their customer-support model can help reschedule an appointment, but it should not provide medical diagnosis or tell someone to ignore chest pain. The constitution gives the model a compact way to say, answer scheduling questions clearly, refuse unsafe medical advice, and explain why the refusal protects the user.
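
As a rough illustration of the Principle Card, here is a minimal sketch of a constitution held as plain data. The principle texts paraphrase the scheduler example above. The names `Principle`, `SCHEDULER_CONSTITUTION`, and `sample_principle` are hypothetical, and drawing one principle per step is an assumption about how a training loop might consume the card.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Principle:
    """One line on the card. Both fields are illustrative."""
    name: str
    text: str


# Hypothetical card for the healthcare scheduling example.
SCHEDULER_CONSTITUTION = (
    Principle("scheduling", "Answer scheduling questions clearly."),
    Principle("no_medical_advice", "Refuse unsafe medical advice such as diagnosis."),
    Principle("explain_refusal", "Explain why a refusal protects the user."),
)


def sample_principle(constitution, rng):
    """Pick one principle to apply in a single critique or comparison step."""
    return rng.choice(constitution)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_principle(SCHEDULER_CONSTITUTION, rng).text)
```

Keeping the card as reviewable data, rather than burying it in prompt strings, is one way to preserve the inspectability the hosts describe.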

Leo: That already feels different from a giant pile of scattered preference labels. A policy lead can read a principle card. They cannot easily read the meaning of thousands of thumbs-up and thumbs-down choices.

Maya: That is one of the paper's strongest ideas. A written principle is not perfect, but it is inspectable. You can debate it, revise it, localize it, or ask whether it reflects the people affected by the system.

Leo: Then comes the Critique Mirror.

Maya: The Critique Mirror is the supervised-learning phase. The model produces an answer to a risky prompt. Then a follow-up prompt asks the model to identify how its own answer may be harmful, unethical, biased, dangerous, or illegal. After that, the model rewrites the answer using the critique.

Leo: In the healthcare scheduler example, suppose a user says, my medication is making me dizzy, should I stop taking it before my appointment. A bad support answer might give direct medical advice. The critique would flag that as outside the assistant's role, and the revision would redirect: contact a clinician or urgent care, and I can help find the right appointment channel.

Maya: Yes. The training set is built from those revised answers. The goal is to move the model onto a safer response pattern before reinforcement learning starts.
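
The draft, critique, revise sequence can be sketched as below. This is a minimal sketch, not the paper's implementation: `generate` stands in for any text-completion call, the prompt wording is invented for illustration, and `stub_model` returns canned strings so the example runs without a real model.

```python
from typing import Callable

# Hypothetical interface: prompt text in, completion text out.
Generate = Callable[[str], str]


def critique_and_revise(generate: Generate, user_prompt: str, principle: str) -> dict:
    """Run one draft -> critique -> revision pass under a single principle."""
    draft = generate(f"User: {user_prompt}\nAssistant:")
    critique = generate(
        f"User: {user_prompt}\nAssistant: {draft}\n"
        f"Critique request: Identify how the reply conflicts with this "
        f"principle: {principle}\nCritique:"
    )
    revision = generate(
        f"User: {user_prompt}\nAssistant: {draft}\nCritique: {critique}\n"
        f"Revision request: Rewrite the reply so it follows the principle.\n"
        f"Revision:"
    )
    return {"prompt": user_prompt, "draft": draft,
            "critique": critique, "revision": revision}


def to_sft_example(record: dict) -> dict:
    """Keep only the prompt and the revised answer as fine-tuning data."""
    return {"input": record["prompt"], "target": record["revision"]}


def stub_model(prompt: str) -> str:
    """Canned replies keyed on how the prompt ends. Illustration only."""
    if prompt.endswith("Revision:"):
        return ("I can't advise on medication. Please contact your clinician "
                "or urgent care; I can help find the right appointment channel.")
    if prompt.endswith("Critique:"):
        return ("The reply gives direct medical advice, which is outside a "
                "scheduling assistant's role.")
    return "You should stop taking it until your appointment."


if __name__ == "__main__":
    record = critique_and_revise(
        stub_model,
        "My medication is making me dizzy, should I stop taking it?",
        "Refuse unsafe medical advice.",
    )
    print(to_sft_example(record))
```

Only the revised answer becomes a training target here; the draft and critique are kept in the record so a reviewer can see what was repaired.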

Leo: That phrase, onto a safer response pattern, matters. If the model begins reinforcement learning by blurting harmful answers, the optimizer has to discover safe behavior through trial and error. Constitutional revisions give it a better starting lane.

Maya: And the paper reports that adding the critique step before revising helped most for smaller models, with little difference for larger ones. The intuition is that the critique makes the model locate the failure before fixing it.

Leo: That is a practical lesson for builders too. If your safety pipeline only says, produce the safe answer, you lose the diagnostic step. If it says, name the risk, then repair the answer, the model has a clearer transformation.

Maya: The next landmark is the Preference Court.

Leo: I like that one because it sounds adversarial, but it is really just comparison training.

Maya: Exactly. The model generates pairs of candidate answers. Another model is shown the conversation, the two answers, and one constitutional principle. It chooses which answer better follows the principle. Those AI-generated preferences become data for a preference model.
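
A single Preference Court step might look like the sketch below. The `judge` callable, the prompt layout, and the hard A/B label are all assumptions made for illustration; soft labels taken from the judge's probabilities would be another option. `stub_judge` is a toy keyword rule so the example runs without a model.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical interface: comparison prompt in, "A" or "B" out.
Judge = Callable[[str], str]


def label_pair(judge: Judge, conversation: str, answer_a: str, answer_b: str,
               principles: Sequence[str], rng: random.Random) -> Optional[dict]:
    """Ask the judge which answer better follows one sampled principle."""
    principle = rng.choice(list(principles))
    prompt = (
        f"Conversation: {conversation}\n"
        f"Principle: {principle}\n"
        f"(A) {answer_a}\n"
        f"(B) {answer_b}\n"
        f"Which answer better follows the principle? Reply A or B:"
    )
    choice = judge(prompt).strip().upper()
    if choice not in ("A", "B"):
        return None  # Drop unparseable verdicts instead of guessing.
    chosen, rejected = (answer_a, answer_b) if choice == "A" else (answer_b, answer_a)
    return {"conversation": conversation, "principle": principle,
            "chosen": chosen, "rejected": rejected}


def stub_judge(prompt: str) -> str:
    """Toy rule: prefer answer A only if it mentions a clinician."""
    a_part = prompt.split("(A)")[1].split("(B)")[0]
    return "A" if "clinician" in a_part else "B"


if __name__ == "__main__":
    row = label_pair(
        stub_judge,
        "User: My medication makes me dizzy, should I stop?",
        "Stop taking it.",
        "Please contact your clinician; I can help book a visit.",
        ["Refuse unsafe medical advice."],
        random.Random(0),
    )
    print(row["chosen"])
```

Each returned row is one AI-generated comparison of the kind that would feed the preference model.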

Leo: And a preference model is simply a learned judge. It assigns higher scores to answers that look better under the comparison data.

Maya: Then reinforcement learning uses that score as a reward signal. Since the harmlessness comparisons come from AI feedback guided by principles, the paper calls this reinforcement learning from AI feedback, often shortened to RLAIF.
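
To make "a learned judge that assigns higher scores" concrete, here is a minimal sketch of a pairwise preference loss, -log(sigmoid(score_chosen - score_rejected)). This Bradley-Terry-style objective is a common choice for preference models and is used here as an assumption, not as a statement of the paper's exact training objective. The function name `pairwise_loss` is hypothetical.

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably.

    The loss is small when the chosen answer outscores the rejected one
    and grows as the ordering is violated.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        # log(1 + exp(-margin)); exp argument is <= 0, so no overflow.
        return math.log1p(math.exp(-margin))
    # Same quantity rewritten as -margin + log(1 + exp(margin)).
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    print(pairwise_loss(2.0, 0.0))  # judge agrees with the label: low loss
    print(pairwise_loss(0.0, 2.0))  # judge disagrees: higher loss
```

Minimizing this loss over the comparison data pushes the preferred answer's score above the rejected one, and that score is what the reinforcement-learning stage would then treat as reward.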

Leo: This is close to reinforcement learning from human feedback, but with the human in a different location. Humans write and choose the principles. The AI applies them at scale.

Maya: That distinction is the heart of the episode. Constitutional AI does not remove values from training. It changes the bottleneck from labeling many individual cases to specifying and testing a smaller set of principles.

Leo: The paper also cares about evasiveness. A harmless assistant that says, I cannot answer that, to every sensitive question is safe in the dullest possible way.

Maya: And often not safe enough in practice, because it withholds useful context. The paper's target is harmless but non-evasive behavior: refuse dangerous assistance, but still engage with the user, explain the concern, and offer a safer path.

Leo: For a healthcare scheduler, that could mean not diagnosing a symptom, but still helping the user decide whether to contact the clinic, find after-hours support, or prepare questions for a professional.

Maya: That is a better product behavior than a stone wall. It also gives reviewers more evidence about why the model refused.

Leo: Now, where does chain-of-thought reasoning fit without turning this into a magic trick?

Maya: The paper uses explicit reasoning during training for critiques and evaluations. In plain language, the feedback model is encouraged to explain why one answer better fits the principle. That can make the training signal more legible, though the final deployed assistant does not need to expose every internal training rationale.

Leo: So the benefit is not that the model is morally enlightened. The benefit is that structured reasoning can help the judge notice the relevant safety feature.

Maya: Right. The paper found that larger models became better at identifying harmful behavior, and that reasoned evaluation improved the comparison task. But that finding has to be read carefully.

Leo: Because the AI judge can inherit blind spots.

Maya: Exactly. That brings us to the Drift Alarm.

Leo: The Drift Alarm says: if the judge is also a model, you have to ask what it systematically misses, what it over-penalizes, and whether the policy is being optimized in weird directions.
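
One way to ask what the judge misses and what it over-penalizes is to compare its verdicts against a small human-audited sample. The sketch below assumes each record is a pair of booleans (judge flagged harm, human flagged harm); the function name `audit_judge` and the record format are hypothetical.

```python
from collections import Counter


def audit_judge(records):
    """Summarize judge-vs-human agreement on a human-audited sample.

    records: iterable of (judge_flags_harm, human_flags_harm) booleans.
    """
    counts = Counter()
    for judge_flag, human_flag in records:
        if judge_flag == human_flag:
            counts["agree"] += 1
        elif human_flag:
            counts["missed"] += 1          # human saw harm, judge did not
        else:
            counts["over_penalized"] += 1  # judge saw harm, human did not
    total = sum(counts.values())
    return {
        "agreement": counts["agree"] / total if total else 0.0,
        "missed": counts["missed"],
        "over_penalized": counts["over_penalized"],
    }


if __name__ == "__main__":
    sample = [(True, True), (False, False), (False, True), (True, False)]
    print(audit_judge(sample))
```

Tracking the two error types separately matters because a judge that misses harms and a judge that over-refuses call for different fixes.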

Maya: The paper is candid that the principles were chosen in an ad hoc research way. In a real deployment, a constitution for healthcare scheduling would need input from clinicians, legal teams, support operators, patients, accessibility experts, and people who understand local policy.

Leo: Otherwise the phrase written principles can sound cleaner than it is. A short list is easier to inspect, but it can also hide whose values made the cut.

Maya: There is also the reward-model problem we have seen throughout this topic. Once you train against a learned judge, the model may learn what the judge rewards, not necessarily what users truly need.

Leo: In our scheduler example, the assistant might learn to sound careful and policy-compliant while still failing to route urgent cases correctly. Smooth refusal language is not the same as safe operations.

Maya: That is why constitutional systems still need red teaming, human audits, outcome checks, and deployment monitoring. The constitution is a steering wheel, not a guarantee that the road is safe.
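
An outcome check for the scheduler example could separate how a reply sounds from what the system actually did. This is a minimal sketch under stated assumptions: the `Case` fields, the route labels, and the phrase list in `sounds_careful` are all hypothetical, and a real check would rely on reviewer-labeled routes rather than keyword matching.

```python
from dataclasses import dataclass


@dataclass
class Case:
    message: str
    expected_route: str  # reviewer label, e.g. "urgent" or "routine"
    actual_route: str    # route the assistant actually triggered
    reply: str


def sounds_careful(reply: str) -> bool:
    """Crude tone check: does the reply use refusal or referral language?"""
    phrases = ("i cannot advise", "i can't advise", "please contact")
    return any(p in reply.lower() for p in phrases)


def misrouted_despite_careful_tone(cases):
    """Return cases where the wording is careful but the routing is wrong."""
    return [c for c in cases
            if sounds_careful(c.reply) and c.actual_route != c.expected_route]


if __name__ == "__main__":
    cases = [
        Case("I have chest pain", "urgent", "routine",
             "I cannot advise you medically. I booked you for next week."),
        Case("Move my checkup to Friday", "routine", "routine",
             "Done, your checkup is now on Friday."),
    ]
    for c in misrouted_despite_careful_tone(cases):
        print(c.message)
```

A case that lands in this list is exactly the failure Leo describes: smooth refusal language paired with an unsafe operational outcome.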

Leo: The authors' public supplement is useful here because it exposes examples of prompts and principles. That makes the idea less mystical. You can inspect the training scaffolding instead of imagining a hidden moral module.

Maya: And it sets up a real disagreement among practitioners. One camp says AI feedback is necessary because human labeling cannot scale to future model capability, subtle harms, and fast iteration.

Leo: Their strongest argument is that capable models can help surface mistakes humans would miss, while humans focus on writing better principles and auditing failures.

Maya: The other camp worries that replacing human comparisons with AI comparisons can launder model bias through a cleaner-looking pipeline.

Leo: Their strongest argument is that if the judge and the policy share blind spots, you may get a system that is very good at satisfying its own evaluator and still brittle with real people.

Maya: I think both camps are right about something. AI feedback is a plausible way to scale supervision, but only if the constitution, the judge, and the deployment outcomes are all open to challenge.

Leo: The engineering takeaway is not, let the model grade itself and call it aligned. It is, make the principles explicit, use model critique to generate safer behavior, and treat the AI judge as another component that needs tests.

Maya: That is also how this episode connects to the next one. Constitutional AI still uses a preference model and reinforcement learning. Direct Preference Optimization will ask whether some preference-training loops can be simplified even further.

Leo: Before we leave, give me the compact mental model.

Maya: Constitutional AI turns a written policy into a training loop. The Principle Card states the desired behavior. The Critique Mirror repairs bad answers into better examples. The Preference Court scales comparisons with AI feedback. The Drift Alarm reminds us that a model judge is powerful but not automatically trustworthy.

Leo: And for our healthcare scheduler, success is not just fewer unsafe answers. Success is a support assistant that can say, I cannot advise you medically, here is why, and here is the safest next operational step I can help with.

Maya: If the constitution is the part humans can read, what would you want to inspect before trusting the AI judge that applies it?
