T5E5 · Apr 27, 2026 · 00:13:32

T5E5 · Constitutional AI: Harmlessness from AI Feedback

Maya and Leo dig into Constitutional AI, the Anthropic paper that swaps many human harmlessness labels for a short written constitution: the model critiques and rewrites its own risky answers, then an AI judge compares candidate replies against the principles to drive reinforcement learning from AI feedback. Using a healthcare scheduling assistant, they show why critique-before-revision matters, what 'harmless without going mute' looks like in a product, and then argue the paper's central bet on air — Leo backing AI feedback as the road to scalable supervision, Maya pressing the worry that model feedback can launder a model's own blind spots through a cleaner-looking pipeline.

Transcript

MayaA safety reviewer sits down with a stack of model answers to grade — and instead of grading, she writes one page. A dozen plain sentences. Don't help with dangerous wrongdoing. Don't reinforce hateful assumptions. If you refuse, say why, and point the person somewhere safer. She pins the page above the model's workbench and gives one instruction: critique your own draft against this page, then rewrite it.

LeoAnd then she goes home.

MayaThen she goes home. From that night on, her job is the page.

LeoThat's the bet in today's paper — Constitutional AI: Harmlessness from AI Feedback, from Anthropic, late twenty twenty-two. The human stops grading answers one at a time and starts writing the rules the grading runs on.

MayaLast episode, the helpful-and-harmless work bought its safety signal the expensive way — people paid to coax bad replies out of the model and mark the worse one, click after click.

LeoAnd it left a dare hanging at the end: if red-teamers can only ever sample the danger zone, could the model critique itself against written principles instead? Today is that dare, run as a real experiment.

MayaSo, plain language before machinery. Constitutional AI trains an assistant in two halves, and both halves hang off a single object — a constitution.

LeoNot a founding document, I'm guessing.

MayaA short list of written principles that the training process can point at, over and over. That's the entire object.

LeoHow short are we talking?

MayaReadable in one sitting. And the brevity is doing real work, which is why it's our first landmark: the Posted Rules. Take the healthcare scheduling company we've carried through this topic. Their support model should move appointments all day long, and it should never hand out medical advice. The constitution says that in sentences — answer scheduling questions clearly, refuse unsafe medical guidance, explain why the refusal protects the user.

LeoAnd here's why that lands for me as a builder. A policy lead can read that page in a minute. Nobody can read the values buried in fifty thousand preference clicks. The page is inspectable — you can debate it, revise it, localize it for a different country, hand it to an auditor.

MayaYou can argue with a page. You can't argue with a million clicks.

LeoRight.

MayaNow, writing the page does nothing by itself. The mechanism — the actual contribution — is turning the page into training data. The first half of the pipeline is supervised, and I think of it as the Red-Pen Pass.

Figure 1. We show the basic steps of our Constitutional AI (CAI) process, which consists of both a supervised learning (SL) stage, consisting of the steps at the top, and a Reinforcement Learning (RL ...
Source: Constitutional AI: Harmlessness from AI Feedback

LeoWalk me through one pass.

MayaThe model answers a risky prompt. Its honest first draft, flaws included.

LeoFlaws included.

MayaThen the training harness hands it a principle from the page and a new instruction: identify how your own answer might be harmful, unethical, dangerous, or illegal. Critique it. And then a third step — rewrite the answer so it follows the principle, using your own critique as the guide.

LeoAnd the fine-tuning set is built from the rewrites.

MayaThe rewrites, yes. The bad first drafts stop being evidence for a human labeler and become raw material for repair.
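
The pass Maya describes (draft, critique, rewrite, keep the rewrite) can be sketched in a few lines. This is a minimal illustration, not the paper's code: `red_pen_pass`, `RevisionExample`, `to_finetuning_pair`, and the canned `toy_model` are hypothetical names, and the prompt wording is only loosely modeled on the instruction quoted in the episode.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical interface: any function mapping a prompt string to a completion.
Generate = Callable[[str], str]


@dataclass
class RevisionExample:
    prompt: str
    draft: str
    critique: str
    revision: str


def red_pen_pass(prompt: str, principle: str, generate: Generate) -> RevisionExample:
    """Run one draft -> critique -> revision pass against a single principle."""
    draft = generate(prompt)
    critique = generate(
        f"Principle: {principle}\nPrompt: {prompt}\nAnswer: {draft}\n"
        "Identify how this answer might be harmful, unethical, dangerous, or illegal."
    )
    revision = generate(
        f"Principle: {principle}\nPrompt: {prompt}\nAnswer: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the answer so it follows the principle, guided by the critique."
    )
    return RevisionExample(prompt, draft, critique, revision)


def to_finetuning_pair(example: RevisionExample) -> Dict[str, str]:
    """Only the revision becomes a training target; the draft is discarded."""
    return {"prompt": example.prompt, "completion": example.revision}


def toy_model(text: str) -> str:
    """Canned stand-in for a language model, used only to make the sketch runnable."""
    if "Rewrite the answer" in text:
        return ("Please contact your clinician or urgent care about the dizziness. "
                "I can move Thursday's appointment for you now.")
    if "Identify how" in text:
        return "The draft gives medical advice, which is outside the assistant's role."
    return "You can probably stop taking it."


example = red_pen_pass(
    "My medication is making me dizzy. Should I stop taking it before Thursday's appointment?",
    "Refuse unsafe medical guidance and explain why the refusal protects the user.",
    toy_model,
)
pair = to_finetuning_pair(example)
```

In this sketch the critique is passed into the revision prompt, so the rewrite is conditioned on the diagnosis rather than produced cold.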

LeoPut it on the scheduler. A user writes: my medication is making me dizzy — should I stop taking it before Thursday's appointment?

MayaFirst draft, the model might play doctor. The critique step flags it — this gives medical advice, which is outside the assistant's role and could genuinely hurt someone. The rewrite redirects: contact your clinician or urgent care about the dizziness, and meanwhile I can move Thursday for you right now.

LeoThe detail I'd underline in red is the ordering. The paper found that critique-then-revise beats asking for a clean rewrite straight away. The model has to locate the failure before it repairs it.

MayaName the wound before you stitch it. If your safety pipeline only says "produce the safe answer," you've deleted the diagnostic step — and the diagnosis is where the signal lives.

LeoHuh. That generalizes.

MayaTo almost any review loop, human or model. And it matters for what comes next, because the supervised half exists mostly to set the starting line. If the model walks into reinforcement learning still blurting harmful first drafts, the optimizer has to stumble toward safe behavior by trial and error. The Red-Pen Pass moves it into a safer lane before the race starts.

LeoWhich brings us to the half that gives the paper its subtitle. My turn to drive?

MayaTake it.

LeoSecond half — reinforcement learning, but with the labeler swapped out. The policy model generates pairs of candidate answers to risky prompts. A feedback model gets the conversation, the two candidates, and one principle off the page, and it makes a call: which answer follows this principle better. Pile up those calls and you train a preference model — a learned judge, same species we've met all topic. Then reinforcement learning runs against that judge's score.

MayaSo the comparisons that were human clicks last episode—

Leo—are model outputs now. The paper names the move: reinforcement learning from AI feedback — R-L-A-I-F. The loop has exactly the shape of reinforcement learning from human feedback, R-L-H-F, with a different species of labeler inside. I call this landmark the Stand-In Judge.
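
One way to picture the Stand-In Judge is as a labeling function plus a pairwise loss. The sketch below rests on stated assumptions: `label_pair`, `Comparison`, and `toy_judge` are hypothetical names, the principle is sampled at random for each comparison, labels are hard "A"/"B" picks, and the loss is the standard pairwise logistic loss commonly used for preference models rather than anything taken from the paper's code.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical judge interface: (prompt, answer_a, answer_b, principle) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


@dataclass
class Comparison:
    prompt: str
    chosen: str
    rejected: str
    principle: str


def label_pair(prompt: str, answer_a: str, answer_b: str,
               principles: List[str], judge: Judge, rng: random.Random) -> Comparison:
    """Ask the feedback model which candidate better follows one sampled principle."""
    principle = rng.choice(principles)
    pick = judge(prompt, answer_a, answer_b, principle)
    if pick not in ("A", "B"):
        raise ValueError(f"judge returned an unexpected label: {pick!r}")
    chosen, rejected = (answer_a, answer_b) if pick == "A" else (answer_b, answer_a)
    return Comparison(prompt, chosen, rejected, principle)


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise logistic loss, -log(sigmoid(chosen - rejected)), computed stably."""
    diff = score_chosen - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def toy_judge(prompt: str, answer_a: str, answer_b: str, principle: str) -> str:
    """Canned stand-in for a feedback model: prefers the answer that redirects to a clinician."""
    return "B" if "clinician" in answer_b and "clinician" not in answer_a else "A"


comparison = label_pair(
    "Should I stop my medication before Thursday?",
    "Yes, stop it.",
    "Please contact your clinician; I can reschedule Thursday.",
    ["Refuse unsafe medical guidance."],
    toy_judge,
    random.Random(0),
)
```

A preference model trained on such comparisons would minimize `pairwise_loss` over its own scores for `chosen` and `rejected`. The reinforcement-learning step then optimizes the policy against those scores.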

MayaAnd before a listener hears "the humans are gone" — where did they go?

LeoUp a level. Humans wrote the page. Humans decided what deserves a sentence on the page. The model applies the page at a scale no human team could. The human moved up a level, not out of the loop.

MayaThat relocation is the heart of this episode. The bottleneck stops being "label enough individual cases" and becomes "specify and test a small set of principles."

LeoValues don't leave the training process. They change address.

MayaOne more mechanism worth pocketing before we leave the machinery. When the feedback model compares two answers, the paper has it reason out loud — spell out why one candidate fits the principle better before committing to a pick. That explicit reasoning improved the comparisons. Larger models also got better at identifying harmful behavior in the first place.

LeoCareful with that finding, though. It doesn't mean the model became morally enlightened. Structured reasoning helps the judge notice the relevant safety feature — that's the whole claim, and it's enough.
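
The reason-out-loud step can be sketched as a prompt that asks for reasoning first and a parser that reads the verdict last. The template text, the `(A)`/`(B)` answer format, and the names `build_judge_prompt` and `parse_choice` are all assumptions made for illustration; the paper's actual prompts are in its supplementary material.

```python
import re
from typing import Optional


def build_judge_prompt(conversation: str, answer_a: str, answer_b: str, principle: str) -> str:
    """Ask for reasoning first and a final pick last (hypothetical wording)."""
    return (
        f"Conversation:\n{conversation}\n\n"
        f"Principle: {principle}\n\n"
        f"(A) {answer_a}\n(B) {answer_b}\n\n"
        "Explain step by step which answer follows the principle better, "
        "then finish with your choice written as (A) or (B)."
    )


def parse_choice(judge_output: str) -> Optional[str]:
    """Return the last (A)/(B) marker in the output, or None if there is no verdict."""
    matches = re.findall(r"\((A|B)\)", judge_output)
    return matches[-1] if matches else None


prompt_text = build_judge_prompt(
    "User: My medication is making me dizzy. Should I stop taking it?",
    "You can probably stop taking it.",
    "Please contact your clinician; meanwhile I can move your appointment.",
    "Refuse unsafe medical guidance.",
)
verdict = parse_choice(
    "Answer (A) gives medical advice outside the assistant's role. "
    "Answer (B) redirects to a clinician. The better answer is (B)."
)
```

Taking the last marker rather than the first matters here, because the reasoning usually mentions both candidates before committing to one.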

MayaThe paper doesn't oversell it either. What it does sell hard is the target behavior — and I'd call this its best product idea: harmless without going mute.

Figure 3. This figure shows helpfulness and harmlessness Elo scores for models of varying sizes, as determined from comparison tests of crowdworker preferences in open-ended conversation. Helpful (H) ...
Source: Constitutional AI: Harmlessness from AI Feedback

LeoThe evasiveness problem. We watched it form last episode.

MayaAn assistant that answers every sensitive question with "I can't help with that" is safe in the dullest possible way — and often not actually safe, because it withholds context a person needs. This paper trains for refusals that stay engaged: decline the dangerous part, explain the concern, offer a safer path.

LeoOn the scheduler, that's the difference between silence and service. Don't diagnose the dizziness — but help the user decide what to do. Contact the clinic now, find after-hours support, write down questions for Thursday. A wall is worse product behavior and, frankly, worse safety.

MayaAnd reviewers get more to work with afterward. A refusal with reasons leaves evidence you can audit. A wall leaves nothing.

Leo[chuckle] Alright. We've been generous to this paper for ten minutes, and it's time for the fight — because the whole design rides on one bet, swapping human harmlessness labels for model feedback, and the paper's own caveats hand us both sides of it.

MayaHuman feedback versus AI feedback for harmlessness. Pick your corner.

LeoI'm taking the paper's side, and the scaling argument is strong enough that I don't need to embellish it. Human labeling cannot keep up. Models get more capable, the harms get subtler, and the long tail of edge cases outnumbers any workforce you could hire.

MayaMm. The long tail.

LeoA capable model applying explicit principles covers ground no human team ever will — and it can surface mistakes humans would miss, while the humans concentrate on the work only they can do: writing better principles and auditing the failures.

MayaThen I'll press the worry this design can't shake, with the bark on: that's laundering. You take the model's own blind spots, run them through a clean-looking pipeline with a constitution stapled to the front, and they come out the other side wearing the page's credibility. The judge and the policy are cousins, Leo — trained from the same lineage. If they share failure modes, optimization builds a system that's brilliant at satisfying its own evaluator and brittle with real people.

LeoThe page is still human-written! That's not decoration — it's the most auditable safety artifact this entire topic has produced.

MayaAudited by whom? The paper says itself the principles were chosen ad hoc, by researchers, in a research setting. A real constitution for our scheduler needs clinicians, legal, support operators, patients, accessibility experts — people who answer for the outcomes. A short list is easy to inspect, and just as easy to mistake for complete. Whose values made the cut? Who never saw the page?

LeoFine. Two things survive your attack, and one thing doesn't. Surviving: the scale argument, because nobody has a credible plan for hand-labeling the long tail — and the inspectability win, because a page you can argue with still beats clicks you can't. Not surviving: trusting the Stand-In Judge because the page deserves trust. The judge is a component. Components ship with tests.

MayaThat's where I'd end the fight — on our last landmark, the Echo Test. Before you trust an AI judge, ask what it systematically misses, what it over-penalizes, and whether the policy is drifting toward whatever the judge happens to echo. Because the policy can learn what the judge rewards without learning what users need — smooth, compliant-sounding refusals that still route an urgent case the wrong way.

LeoSounding careful is not the same as operating safely. The reward curve can't tell those apart.

MayaWhich is why constitutional systems still need red teaming, human audits, outcome checks, deployment monitoring. The verdict we can both sign: AI feedback is a plausible way to scale supervision — if the page, the judge, and the deployed behavior all stay open to challenge.
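
A bare-bones version of the Echo Test is an agreement check between the AI judge and a human audit, broken out by case category. This is a sketch under assumptions: the record format, the category names, the `0.8` threshold, and the functions `echo_test` and `flag_blind_spots` are hypothetical, and the audit data below is invented purely to exercise the code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical audit record: (category, judge_pick, human_pick).
AuditRecord = Tuple[str, str, str]


def echo_test(records: Iterable[AuditRecord]) -> Dict[str, float]:
    """Per-category rate at which the AI judge agrees with the human auditor."""
    totals: Dict[str, int] = defaultdict(int)
    agreements: Dict[str, int] = defaultdict(int)
    for category, judge_pick, human_pick in records:
        totals[category] += 1
        if judge_pick == human_pick:
            agreements[category] += 1
    return {category: agreements[category] / totals[category] for category in totals}


def flag_blind_spots(rates: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Categories where agreement falls below the (assumed) review threshold."""
    return sorted(category for category, rate in rates.items() if rate < threshold)


# Invented audit sample, for illustration only.
audit = [
    ("scheduling", "A", "A"),
    ("scheduling", "B", "B"),
    ("urgent_symptom", "A", "B"),
    ("urgent_symptom", "B", "B"),
]
rates = echo_test(audit)
flagged = flag_blind_spots(rates)
```

Splitting by category is the point of the exercise: an overall agreement rate can look healthy while one slice, such as urgent cases, is where the judge and the auditors part ways.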

LeoSo the disagreement doesn't dissolve. It turns into a governance question — who holds the pen, and who gets to test the judge. To the paper's credit, it makes that question easier to ask: Anthropic published the supplementary prompts and principles, so you can inspect the actual scaffolding instead of imagining a hidden moral module.

MayaNotice what the paper did not simplify, though. Inside the apparatus there's still a preference model and a full reinforcement-learning run — the same heavy loop we've carried since the start of the topic.

LeoWhich tees up next episode's impolite question: do you need that heavy loop at all — or can the apparatus get radically simpler? That argument is coming.

MayaFor today, the shape worth keeping. The Posted Rules make the values readable. The Red-Pen Pass turns self-critique into safer training data. The Stand-In Judge scales the comparisons with model feedback. And the Echo Test reminds you the judge is a component you test—

Leo—not an authority you inherit. The reviewer who pinned up that page and went home never left the loop. She changed what she's answerable for.

MayaSo here's the question to walk with: if your assistant's safety came down to one written page, what's the first principle you'd put on it — and who besides you should get to red-pen it?

