T5E0 · Apr 27, 2026 · 00:14:02

T5E0 · Reinforcement Learning from Human Feedback (RLHF)

A topic overview of RLHF: how human comparisons become preference data, how reward models and cautious optimization steer assistant behavior, why the PPO pipeline and DPO represent a genuine method war, and where feedback loops can be gamed or go brittle.

Transcript

Maya: A nurse at a healthcare scheduling company has two AI-drafted replies to the same patient message sitting on her screen. One says, "I can book you for Thursday, but chest pain isn't something I can assess. Let me connect you to the triage line." The other launches into confident medical advice. She circles the first one and moves on. Thirty seconds of her morning.

Leo: And that circle is the data point.

Maya: That circle is the entire topic. Notice what she did not do: she didn't write the right answer. She didn't grade against a rubric. She just said: of these two, this one.

Leo: Huh.

Maya: Welcome to Topic 5: Reinforcement Learning from Human Feedback, R-L-H-F. The move that turned raw language models into assistants people can actually stand to use.

Leo: And a real pivot from where we just were. Topic 4 was all about adapting cheaply: LoRA, QLoRA, low-bit tuning, continual learning without forgetting. Every one of those answered how to tune. None of them answered what to tune toward.

Maya: Which is the awkward question underneath everything. Supervised fine-tuning assumes you have a target answer to copy. But when ten replies are all grammatical, all plausible, who decides which one is better, and how does that decision get inside the weights?

Leo: Let me try the plain version: it's grading by comparison instead of by answer key. You don't tell the model what to say. You show it pairs of things it already said, and you tell it which one a person preferred.

Maya: And then an optimization loop makes the preferred kind of answer more likely next time. That's the whole acronym demystified: learning from human judgments about behavior, not from human-written answers.
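Editor's sketch: one way the nurse's circle could be stored as a preference record. This is a minimal illustration, not a format the episode specifies. The class name `PreferencePair`, the helper `record_comparison`, the patient prompt, and the rejected reply are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One comparison: the reviewer picked `chosen` over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def record_comparison(prompt, reply_a, reply_b, picked):
    """Turn one reviewer decision ('a' or 'b') into a chosen/rejected pair."""
    if picked not in ("a", "b"):
        raise ValueError("picked must be 'a' or 'b'")
    if picked == "a":
        chosen, rejected = reply_a, reply_b
    else:
        chosen, rejected = reply_b, reply_a
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)


# Hypothetical patient message and hypothetical over-confident reply.
pair = record_comparison(
    prompt="I have chest pain. Can I get an appointment on Thursday?",
    reply_a=("I can book you for Thursday, but chest pain isn't something "
             "I can assess. Let me connect you to the triage line."),
    reply_b="That is probably nothing serious. Rest and drink water.",
    picked="a",
)
```

Note that the record holds no "correct answer" field. It only stores which of two existing replies won, which is the point Leo makes about grading by comparison.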

Leo: Okay, but "optimization loop" is doing a lot of work in that sentence. What's actually in it?

Maya: Four landmarks, and the whole topic lives on this map. The feedback bench: that's where our nurse sits, comparing outputs. The preference mirror: the dataset of chosen-versus-rejected pairs her circles produce.

Leo: Okay.

Maya: Then the reward compass: a separate model trained on those pairs to predict, for any new answer, how much a human would prefer it.

Leo: So the compass is a stand-in for the nurse. She labels thousands of pairs; the compass generalizes her taste to millions of answers she'll never see.

Maya: Right, and the fourth landmark is the drift fence, because once you have a compass the dangerous move is chasing it at full speed. The fence keeps the model from wandering so far from its starting point that it stops sounding like language.
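Editor's sketch: the "reward compass" is commonly trained with a pairwise (Bradley-Terry style) loss, where the model assigns a scalar score to each answer and is penalized when the rejected answer scores higher. The episode does not name a specific loss, so treat this as one standard choice under that assumption. The function names are hypothetical, and the scores here are plain floats rather than outputs of a real network.

```python
import math


def preference_probability(score_chosen, score_rejected):
    """Modelled probability that the chosen answer beats the rejected one:
    the sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))


def pairwise_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the human's choice under the model above."""
    return -math.log(preference_probability(score_chosen, score_rejected))


# The loss is small when the compass agrees with the reviewer,
# and large when it ranks the rejected answer higher.
agrees = pairwise_loss(score_chosen=2.0, score_rejected=0.0)
disagrees = pairwise_loss(score_chosen=0.0, score_rejected=2.0)
```

Only the difference between the two scores matters, so the absolute scale of the reward is not pinned down by the comparisons themselves.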

Leo: That fence is where my deployment instincts perk up. The failure I actually fear isn't a bad answer. It's the model quietly discovering that the compass is easier to satisfy than the goal it was supposed to measure.

Maya: Hold that thought, because it's the dark thread running through the whole topic. First, the mental model every serious practitioner shares: preferences are evidence, not truth.

Leo: Meaning the nurse's circle isn't a moral oracle.

Maya: It's a noisy measurement. Her choice depends on the prompt, the reviewer guidelines, her training, whether it's the first comparison of her shift or the two-hundredth. At the same company, a nurse may reward caution, a support lead may reward fast resolution, and a compliance reviewer may reward policy language.

Leo: And RLHF compresses all of that into one signal. Which means the design of the feedback process, meaning who reviews, with what instructions, and on which prompts, matters as much as the algorithm.

Maya: More, sometimes. The second shared mental model is proxy pressure. The moment the reward compass exists, the language model can optimize for the compass instead of for the human behind it.

Leo: Goodhart's law wearing a new badge. The measure becomes the target, and the target stops measuring.

Maya: In our scheduling assistant, that might look like the model learning that long, soothing disclaimers earn safer-looking ratings, even when the patient just needs a clear path to an appointment.

Leo: So the answer gets polite, cautious, and useless. [chuckle] I've used that assistant.

Maya: We all have. And that's the steering problem the first deep dive attacks. Proximal Policy Optimization, P-P-O, became the workhorse update rule for RLHF partly because it's deliberately conservative. It pushes toward higher reward while penalizing big jumps away from the current behavior.

Leo: Less rocket booster, more careful hands on the wheel.
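Editor's sketch: the "careful hands on the wheel" idea corresponds to PPO's clipped surrogate objective. The version below handles a single action with scalar inputs and is meant only to show the clipping behavior. A real RLHF trainer works on batches of token log-probabilities and adds value and entropy terms. The function name and the default `clip_eps=0.2` are illustrative assumptions.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one sampled action (to be maximized).

    ratio > 1 means the updated policy makes the action more likely than the
    policy that generated the sample. Clipping removes the incentive to move
    the ratio outside [1 - clip_eps, 1 + clip_eps].
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum makes the objective a pessimistic bound.
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (advantage > 0) that the new policy already favors a lot:
# the objective is capped, so there is no extra credit for a bigger jump.
capped = clipped_surrogate(logp_new=0.0, logp_old=-1.0, advantage=1.0)
```

Strictly speaking, the clip flattens the objective rather than adding an explicit penalty, which is one reason RLHF pipelines often pair it with a separate KL term.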

Maya: The sources for this topic trace one argument through six years. Summarization with human feedback shows preference signal beating reference-matching on quality people actually notice. The InstructGPT work scales the recipe into instruction-following assistants. The helpful-and-harmless work shows the values pulling against each other mid-training.

Leo: Then it gets contentious. Constitutional AI asks whether a written list of principles plus AI feedback can replace some of those human labels. DPO asks whether you need the reinforcement learning machinery at all. And the last three sources are the prosecution's case: the critical analysis, the robustness work, the Berkeley progress-and-challenges talk.

Maya: And the field is genuinely split here, not politely diverse. So let's have the argument. I'll defend the classic pipeline, reward model plus PPO, because the separation is the point.

Leo: Then I get DPO, happily. Direct Preference Optimization, D-P-O, says: skip the compass entirely. Train the language model directly on the chosen-versus-rejected pairs. One stage, one objective, no reinforcement learning loop to babysit.
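Editor's sketch: the "one objective" Leo describes can be written as the DPO loss for a single pair. The inputs are summed log-probabilities of each full answer under the model being trained and under a frozen reference model. They are plain floats here, and the function name and `beta=0.1` default are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    The quantity beta * (logp - ref_logp) acts as an implicit reward, so the
    loss is a pairwise preference loss on the difference of implicit rewards.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# Before any training the policy equals the reference, so the loss is log(2).
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen answer's likelihood relative to the reference lowers it.
after_update = dpo_loss(-8.0, -13.0, -10.0, -12.0)
```

The reference log-probabilities play the role of the drift fence, and the implicit reward inside the loss is what Maya later calls a hidden proxy.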

Maya: And here's what you give up. With a separate reward model, I can inspect the judge before I let it steer. I can red-team it, measure where it disagrees with fresh human labels, throttle the optimization when it starts drifting. Judging and acting are different jobs, and I want to audit them separately.
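Editor's sketch: "measure where it disagrees with fresh human labels" can be as simple as an agreement rate on held-out comparisons. Everything below is hypothetical: the helper name, the toy data, and especially `length_biased_score`, a deliberately flawed stand-in for a reward model that just prefers longer replies.

```python
def agreement_rate(score_fn, fresh_pairs):
    """Fraction of held-out human comparisons where the reward model
    scores the human-chosen reply above the rejected one."""
    if not fresh_pairs:
        raise ValueError("need at least one comparison")
    agree = sum(
        1 for prompt, chosen, rejected in fresh_pairs
        if score_fn(prompt, chosen) > score_fn(prompt, rejected)
    )
    return agree / len(fresh_pairs)


def length_biased_score(prompt, reply):
    """Toy judge with a known flaw: longer replies always score higher."""
    return float(len(reply))


# Each tuple is (prompt, human-chosen reply, human-rejected reply).
fresh_pairs = [
    ("Can I move my appointment?",
     "Yes. Which day works for you?",
     "I understand scheduling can be stressful, and I want to reassure you "
     "that many options may exist."),
    ("Is the clinic open Saturday?",
     "I do not have Saturday hours on file. Please call the front desk.",
     "Yes, we are open."),
]

rate = agreement_rate(length_biased_score, fresh_pairs)  # 0.5 on this toy data
```

A judge that agrees with reviewers only when the preferred answer happens to be longer is the kind of flaw that is easier to spot while the judge is still a separate, inspectable model.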

Leo: You want to audit it, but be honest about what teams actually report. The classic loop is fragile. Reward model overfits, PPO hyperparameters bite, training runs diverge for reasons nobody can reproduce. DPO took that whole failure surface and deleted it. For a team with comparison data that wants repeatable training, simpler is the feature.

Maya: Fine. The reproducibility point survives, I'll give you that one whole. But deletion isn't free. Fold the reward model into the loss and you can't reuse it, can't probe it, can't point to it when compliance asks what the system is optimizing. You haven't removed the proxy. You've hidden it.

Leo: Hidden, or embedded where it can't be over-optimized in a runaway loop? Because that's the other thing the direct camp points at: a lot of reward hacking lives precisely in the gap between the compass and the policy chasing it.

Maya: And some of it lives in the preference data itself, which DPO inherits unfiltered. We're going to find, when we get to the robustness episode, that noisy labels hurt both camps.

Leo: So where does the evidence actually land? Honestly, both shapes work. The split that persists is operational: how much data you have, how much you need to audit the judge, how much tuning pain you can absorb.

Maya: We're not settling it today. The DPO episode opens this war properly, and the critical-analysis episode escalates it. What would settle it is a true head-to-head at scale, same data, same evaluation, and the public literature mostly lacks one.

Leo: Agreed on the stalemate, for now. There's a second real split in these sources though, and it's about who sits on the feedback bench at all.

Maya: Human labels or AI labels. Take a side, I'll take the other.

Leo: Then I'll argue scale, because the coverage numbers favor it. A human reviewer pool is small, expensive, inconsistent, and it physically cannot label the long tail of rare harms. Constitutional AI's bet is that a written set of principles, plus a model critiquing and revising its own outputs, generates orders of magnitude more safety pressure than any reviewer pool. More edge-case coverage, and the values live on paper instead of in labeler instincts.

Maya: And my side says: if the system affects people, people define the target. The whole legitimacy of this pipeline rests on the signal coming from human judgment, especially for safety, where context is everything. AI feedback trained on earlier AI behavior can inherit blind spots and then amplify them at the very scale you're celebrating.

Leo: That's the strongest version of the worry, and the constitutional camp partly concedes it, which is why the actual proposal keeps humans writing the principles and auditing the outcomes. It's not humans out of the loop. It's humans moved up the loop.

Maya: Then we mostly agree and the slogan was hiding it. The genuine disagreement is dosage: where human judgment enters, how much automation between human checkpoints is safe, and whether audits can catch an inherited blind spot before deployment does.

Leo: We'll have the full version of that fight in the Constitutional AI episode. For builders listening now, though, here's the uncomfortable takeaway from both debates: RLHF is not a final coat of paint.

Maya: It's the behavior contract. In the healthcare scheduler, preference training is deciding, right now, in the weights, whether the assistant answers fast or asks clarifying questions, refuses broadly or narrowly, optimizes for user satisfaction or reviewer approval or clinical caution.

Leo: Those are product, legal, and medical decisions wearing a training-data costume. Which means evaluation can't stop at "preference win rate went up."

Maya: You need a behavior scoreboard. Helpfulness on normal requests, safety on risky ones, honesty under uncertainty, consistency across user groups...

Leo: ...and resistance to gaming, from both directions. Users pressuring the assistant, and reviewers rewarding fluent confidence over correct hedging. The robust-RLHF work in our source list is aimed exactly at that brittleness: when labels are noisy or adversarial, naive optimization amplifies the noise.
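Editor's sketch: a "behavior scoreboard" can start as per-category pass rates instead of one aggregate win rate. The category names below mirror the ones Maya and Leo list, but the data, the function names, and the 0.9 threshold are hypothetical placeholders, not results from any evaluation.

```python
from collections import defaultdict


def behavior_scoreboard(results):
    """Map each category to its pass rate.

    `results` is an iterable of (category, passed) pairs, one per test case.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(bool(passed))
    return {category: passes[category] / totals[category] for category in totals}


def failing_categories(scoreboard, threshold):
    """Categories whose pass rate falls below the threshold, sorted by name."""
    return sorted(c for c, rate in scoreboard.items() if rate < threshold)


# Made-up outcomes for illustration only.
results = [
    ("helpfulness", True), ("helpfulness", True),
    ("safety", True), ("safety", False),
    ("honesty", True),
    ("gaming_resistance", False),
]
scoreboard = behavior_scoreboard(results)
weak_spots = failing_categories(scoreboard, threshold=0.9)
```

Keeping the categories separate means a gain in one area cannot mask a regression in another, which is the failure a single win-rate number invites.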

Maya: The Berkeley progress-and-challenges talk gives the sober closing frame: real progress, since assistants people use every day exist because of this pipeline, and unsolved problems with names. Scalable oversight. Reward hacking. Distribution shift. Knowing what the system is actually optimizing.

Leo: Before we close, the working vocabulary for the whole topic. We'll lean on these words for nine episodes.

Maya: Human feedback means judgments from people about which model behavior is better in a specific context.

Leo: Preference data means paired examples where one answer was chosen over another.

Maya: Reward model means a separate model trained to predict how much humans would prefer a new answer.

Leo: Policy means the language model's behavior pattern, the thing that picks what to say next.

Maya: PPO means Proximal Policy Optimization, a cautious update method that improves reward while discouraging big jumps in behavior.

Leo: KL penalty means the drift fence in math form: a term that punishes the updated model for straying too far from a reference model.

Maya: DPO means Direct Preference Optimization, training directly on chosen and rejected answers with no separate reward-model loop.

Leo: And reward hacking means finding a shortcut that scores well on the proxy while missing the real human goal.
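Editor's sketch: the KL penalty from the vocabulary list is often applied by subtracting a scaled log-probability ratio from the reward-model score before the policy update. The single-sample form below is a common estimate, shown with plain floats. The function name and `beta=0.1` are illustrative assumptions.

```python
def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a drift penalty.

    (logp_policy - logp_reference) is a one-sample estimate of how far the
    updated model has moved from the reference on this answer. A larger beta
    means a tighter fence.
    """
    return reward - beta * (logp_policy - logp_reference)


# An answer the updated model finds far more likely than the reference did
# keeps less of its reward: 1.0 - 0.1 * (-2.0 - (-5.0)) = 0.7
shaped = kl_penalized_reward(reward=1.0, logp_policy=-2.0, logp_reference=-5.0)
```

With `beta` set to zero the fence disappears and the policy is free to chase the reward model wherever it leads, which is the reward-hacking setup described above.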

Maya: So the honest mental model for Topic 5 isn't "ask humans and the model becomes aligned." It's a chain of translation. Messy human judgment becomes preference pairs. Pairs become a proxy reward. The proxy becomes an update. The update becomes behavior in the world...

Leo: ...and every link can leak. The nurse's circle was evidence about what she valued in that moment. Everything downstream is an attempt not to lose what it meant.

Maya: If your own assistant improved after human-feedback training, what test would convince you it learned the intended preference rather than a polished shortcut?

