William Liu · Podcasts
[Illustration: two reviewers compare blank answer cards at a human feedback bench while a preference signal steers a small language-model workbench for a healthcare scheduling assistant.]

T5E0 · 00:13:05

Reinforcement Learning from Human Feedback (RLHF)

A topic overview of RLHF: how human comparisons become preference data, how reward models and cautious optimization steer assistant behavior, why PPO and DPO represent different engineering philosophies, and where feedback loops can fail.

Transcript

Maya: A reviewer at a healthcare scheduling company opens two draft replies from the same customer-support model. One answer calmly says, “I can help book a visit, but I can’t diagnose chest pain.” The other confidently gives medical advice. The reviewer circles the safer answer, and that small choice becomes training signal.

Leo: So the model is not being handed a perfect rulebook. It is watching which behavior people prefer when the stakes are real.

Maya: Exactly. That is the central move in reinforcement learning from human feedback. Human comparisons are turned into a steering signal, and then the model is nudged toward answers people judge as more useful, honest, and safe.

Leo: This is a big shift from the topic we just left. LoRA, quantization-aware fine-tuning, and continual learning were about adapting a model’s skills efficiently. Here we are shaping behavior after the model already knows a lot.

Maya: Right. Fine-tuning can teach a model the house style or a task format. RLHF asks a harder question: when many answers are plausible, which one should the assistant choose?

Leo: Plain language version: it is like teaching by preference, not by answer key.

Maya: Yes. A normal supervised dataset says, “copy this target answer.” Human feedback says, “given these candidate answers, this one is better for the situation.” The system then learns the pattern behind those preferences.
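
To make the contrast between "copy this target" and "this one is better" concrete, here is a minimal sketch of one preference record. The class name, the field names, the `reviewer_id` field, and the rejected reply are hypothetical illustrations, not a format taken from the episode's sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison: same prompt, two candidate answers, one winner."""
    prompt: str
    chosen: str
    rejected: str
    reviewer_id: str  # hypothetical field, kept so disagreement can be audited later


def to_supervised_example(pair: PreferencePair) -> tuple[str, str]:
    """What a plain supervised dataset would keep: only the target to copy."""
    return pair.prompt, pair.chosen


pair = PreferencePair(
    prompt="I have chest pain. What should I do?",
    chosen="I can help book a visit, but I can't diagnose chest pain.",
    # Illustrative rejected reply, invented for this sketch.
    rejected="That is probably muscle strain, so just rest for a few days.",
    reviewer_id="nurse-01",
)

print(to_supervised_example(pair))
```

The supervised view drops the rejected answer entirely, which is exactly the information a preference dataset is built to keep.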

Leo: And the acronym can make it sound more mysterious than it is. Reinforcement learning from human feedback means learning from human judgments about behavior, then using an optimization loop to make preferred behavior more likely.

Maya: A useful map has four landmarks. The feedback bench is where people compare outputs. The preference mirror is the dataset of chosen and rejected answers. The reward compass is a model trained to score new answers by predicted human preference. The drift fence keeps optimization from pushing the language model into strange territory.
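
One common way to train the reward compass is a pairwise logistic (Bradley-Terry style) loss on chosen-versus-rejected scores. The sketch below assumes the reward model already produced a scalar score for each answer; the function names are hypothetical, and a real reward model would compute these scores with a neural network rather than take them as plain floats.

```python
import math


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style probability that the chosen answer beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the human choice under the current scores."""
    return -math.log(preference_probability(score_chosen, score_rejected))


# Scores that agree with the reviewer give a small loss,
# scores that contradict the reviewer give a large one.
print(pairwise_loss(2.0, -1.0))
print(pairwise_loss(-1.0, 2.0))
```

Only the difference between the two scores matters, so the reward model learns a relative ranking rather than an absolute scale.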

Leo: I like the drift fence. In deployment, the scary failure is not only a bad answer. It is a model that discovers a weird shortcut because the reward signal is easier to satisfy than the real goal.

Maya: That is why experts treat RLHF as a control system, not as magic alignment dust. You have humans, instructions for reviewers, a learned proxy for preference, an optimizer, a reference model, and evaluation after the update.

Leo: Let’s ground it in the healthcare scheduling company, because that is where this topic will keep returning. The company wants the assistant to schedule visits, answer policy questions, refuse unsafe medical advice, and say when it is uncertain.

Maya: And different reviewers may value different things. A nurse may reward caution. A support lead may reward fast resolution. A compliance reviewer may reward policy language. RLHF compresses those judgments into a training signal, so the design of the feedback process matters enormously.
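
A small sketch of that compression step, assuming three reviewers vote on the same pair of answers labeled "A" and "B". The reviewer roles come from the example above; the votes and the majority-vote rule are hypothetical, and real pipelines may weight or adjudicate reviewers differently.

```python
from collections import Counter


def majority_label(votes: list[str]) -> tuple[str, float]:
    """Return the most common vote and the fraction of reviewers who gave it."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)


# Hypothetical votes on one comparison.
votes = {"nurse": "A", "support_lead": "B", "compliance": "A"}
label, agreement = majority_label(list(votes.values()))
print(label, agreement)
```

Keeping the agreement fraction next to the label preserves a trace of the disagreement that a single "chosen" field would erase.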

Leo: The source list for this topic reflects that pipeline. PPO gives us one cautious way to update a policy. Summarization with human feedback shows why human preference can beat simple reference matching. Instruction-following work turns the recipe into assistant behavior.

Maya: Then helpful-and-harmless training shows the values can pull against each other. Constitutional AI explores principle-guided feedback. Direct Preference Optimization asks whether we can skip parts of the reinforcement learning machinery. The critical and robust RLHF work asks where the whole setup breaks.

Leo: The shared expert mental model I hear is: preferences are evidence, not truth.

Maya: Precisely. Human labels are a noisy measurement of what people want under a specific prompt, policy, reviewer guide, and social context. Experts do not treat them as a moral oracle. They treat them as data with bias, variance, and incentives.

Leo: Another mental model is proxy pressure. Once a reward model exists, the language model can optimize for the reward model rather than the real human outcome.

Maya: That is the classic Goodhart problem in a new outfit. When a measure becomes the target, the system may learn the measure’s quirks. In our healthcare example, the model might learn that longer disclaimers earn safer-looking ratings, even when users actually need a clear scheduling path.
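
One simple diagnostic for the disclaimer-length shortcut is to check how strongly reward scores track answer length. This is a sketch under stated assumptions: the word counts and scores are made-up numbers, and a high correlation is only a warning sign to investigate, not proof of reward hacking.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical answers: word count versus reward-model score.
word_counts = [12, 30, 55, 80, 120]
reward_scores = [0.2, 0.4, 0.5, 0.7, 0.9]

length_bias = pearson(word_counts, reward_scores)
print(length_bias)
```

If the score rises almost in lockstep with length, the reward model may be measuring verbosity, and the policy will have an incentive to pad.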

Leo: So the answer can become polite, cautious, and useless.

Maya: Exactly. RLHF teams watch for that. They need preference data, adversarial tests, policy rules, and live monitoring because the reward model only sees a slice of reality.

Leo: The drift fence mental model also matters. If optimization moves too aggressively, the model may stop sounding natural, lose capabilities, or become brittle outside the feedback distribution.

Maya: PPO became influential partly because it offers a controlled update. Instead of letting the model chase reward wherever it goes, it encourages improvement while staying reasonably close to the previous policy. The details are technical, but the intuition is a careful steering wheel, not a rocket booster.
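
The "careful steering wheel" corresponds to PPO's clipped surrogate objective. The sketch below computes it for a single sampled action, assuming the log-probabilities and the advantage estimate are already available; the function name and the default `clip_eps=0.2` are illustrative choices, and a full PPO implementation also includes value-function and batching logic that is omitted here.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped objective for one sample (to be maximized)."""
    # Probability ratio between the updated policy and the previous policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any extra credit for moving past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action whose probability already rose by 50%: credit is capped at 1.2x.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
# A bad action whose probability was already halved: no reward for pushing further.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
```

Once the ratio leaves the clip range in the direction the advantage favors, the objective stops improving, so there is no incentive for a single update to move far from the previous policy.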

Leo: And DPO challenges that machinery.

Maya: It does. Direct Preference Optimization reframes preference learning so the language model can be trained directly on chosen-versus-rejected answers, without a separate reward-model-plus-reinforcement-learning loop in the classic form.
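
A minimal sketch of the DPO loss for one preference pair, assuming the summed log-probabilities of each full answer under the trainable policy and under a frozen reference model are already computed. The argument names and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair; lower is better."""
    # How much more likely the policy makes each answer than the reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # Equivalent to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: loss is log(2), no preference learned yet.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy raised the chosen answer and lowered the rejected one: loss drops.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The reference model appears inside the loss itself, so the drift fence is built into the objective rather than added by a separate optimizer.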

Leo: Here comes a real disagreement. One camp says the full RLHF pipeline is worth the complexity because it separates judging from acting. You can train a reward model, inspect it, optimize against it, and tune the update process.

Maya: Their strongest argument is control. When you have a reward model and an optimizer, you can study each component, add constraints, run red-team prompts, and manage the trade-off between reward improvement and drift.

Leo: The other camp says direct preference methods are simpler, more stable, and easier to reproduce. If the goal is to make preferred answers more likely than rejected answers, why maintain a fragile reinforcement learning loop?

Maya: Their strongest argument is operational reliability. Fewer moving parts can mean fewer tuning headaches, less reward-model overoptimization, and a clearer objective for teams that mostly have comparison data.

Leo: There is another split around who gives feedback. Human feedback is expensive, inconsistent, and slow, but it carries human judgment.

Maya: The human-centered camp says that is the whole point. If the system will affect people, humans should define the preference signal, especially around safety, helpfulness, and context.

Leo: The principle-assisted camp argues that humans cannot label everything. Written constitutions, model critiques, and AI feedback can scale review, expose edge cases, and make values more explicit.

Maya: Their strongest case is coverage. A small reviewer pool may miss rare harms or apply guidelines unevenly. A principle-guided process can generate more training pressure around known safety boundaries, as long as humans still audit the principles and outcomes.

Leo: But that raises a practical worry: if AI feedback is trained from earlier AI behavior, the system may inherit hidden blind spots.

Maya: Exactly. The disagreement is not “humans good, AI bad.” It is about where human judgment should enter the loop, how much automation is safe, and how to audit the resulting behavior.

Leo: For builders, the uncomfortable lesson is that RLHF is not a final coat of paint. It changes the product’s behavior contract.

Maya: Yes. In the healthcare scheduler, RLHF decides what the assistant prioritizes when goals collide. Does it answer quickly or ask clarifying questions? Does it refuse broadly or narrowly? Does it optimize for user satisfaction, reviewer approval, clinical safety, or policy compliance?

Leo: And those are not only machine-learning decisions. They are product, legal, medical, and operational decisions disguised as training data.

Maya: That is why evaluation has to look beyond average preference wins. Teams need a behavior scoreboard: helpfulness for normal requests, safety for risky requests, honesty under uncertainty, consistency across user groups, and resistance to feedback gaming.
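
A behavior scoreboard can start as nothing more than pass rates broken out by category. The sketch below assumes each evaluation case has already been graded pass or fail; the category names and results are hypothetical, chosen to show how a healthy average can hide a weak category.

```python
from collections import defaultdict


def scoreboard(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per category from (category, passed) evaluation results."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def worst_category(board: dict[str, float]) -> str:
    """Category with the lowest pass rate."""
    return min(board, key=board.get)


# Hypothetical graded cases: 9 of 10 pass overall, but risky requests pass half the time.
results = [("normal_request", True)] * 8 + [
    ("risky_request", True),
    ("risky_request", False),
]
board = scoreboard(results)
overall = sum(passed for _, passed in results) / len(results)
print(overall, board, worst_category(board))
```

Reporting the worst category alongside the average keeps a rare but serious failure mode from disappearing into an aggregate number.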

Leo: Feedback gaming is especially relevant. Users may pressure the assistant. Reviewers may prefer fluent confidence. The model may learn the surface pattern that gets applause.

Maya: The robust RLHF work in the topic list goes after that brittleness. If preference labels are noisy, adversarial, or poorly specified, optimization can amplify the noise. Robustness means making the feedback loop less fragile when the labels are imperfect.
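
One simple way to make a pairwise preference loss less fragile is label smoothing: assume each label is wrong with some small probability and average the loss over both possibilities. This is a generic sketch of that idea, not necessarily the method used in the robust RLHF work the episode refers to; the function name and `flip_prob=0.1` are assumptions.

```python
import math


def smoothed_pairwise_loss(margin: float, flip_prob: float = 0.1) -> float:
    """Pairwise logistic loss that assumes the label is flipped with probability flip_prob."""
    loss_if_label_right = math.log1p(math.exp(-margin))  # -log(sigmoid(margin))
    loss_if_label_wrong = math.log1p(math.exp(margin))   # -log(sigmoid(-margin))
    return (1.0 - flip_prob) * loss_if_label_right + flip_prob * loss_if_label_wrong


# With smoothing, an extreme margin costs more than a moderate one,
# so the model is not pushed to be infinitely confident in a possibly wrong label.
print(smoothed_pairwise_loss(2.2))
print(smoothed_pairwise_loss(10.0))
```

Setting `flip_prob=0.0` recovers the ordinary pairwise loss, which keeps rewarding ever larger margins even when the underlying label was a mistake.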

Leo: The Berkeley progress-and-challenges material fits here too. It frames RLHF as powerful but unfinished: progress in usable assistants, challenges in scalable oversight, reward hacking, distribution shift, and knowing what the system is really optimizing.

Maya: Let’s preview the journey. PPO will give us the cautious steering mechanism. Summarization will show why human preference can capture quality that reference answers miss. Instruction following will connect demonstrations, rewards, and assistant behavior.

Leo: Helpful-and-harmless training will expose value tension. Constitutional AI will ask whether written principles can reduce label burden. DPO will simplify the objective. The critical and robust episodes will pressure-test the whole approach.

Maya: Before we close, let’s make the vocabulary concrete.

Leo: Human feedback means judgments from people about which model behavior is better in a specific context.

Maya: Preference data means paired examples where one answer was chosen over another.

Leo: Reward model means a separate model that predicts how much humans would prefer an answer.

Maya: Policy means the language model’s behavior pattern when it chooses what to say next.

Leo: PPO means Proximal Policy Optimization, a cautious update method that tries to improve reward without letting behavior drift too far.

Maya: KL penalty means a drift fence that discourages the updated model from moving too far away from a reference model.

Leo: DPO means Direct Preference Optimization, a way to train on chosen and rejected answers without the classic separate reward-model optimization loop.

Maya: Reward hacking means finding a shortcut that scores well under the reward signal while missing the real human goal.
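
The KL penalty from the vocabulary above is often applied by subtracting a drift term from the reward before the policy update. The sketch below uses a single-sample estimate of the KL divergence for one sampled answer; the function name and `beta=0.1` are assumptions, and real systems typically apply the penalty per token and average over many samples.

```python
def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_reference: float, beta: float = 0.1) -> float:
    """Reward minus a penalty for drifting away from the reference model."""
    # Single-sample KL estimate: positive when the policy makes this answer
    # more likely than the reference model does.
    kl_estimate = logp_policy - logp_reference
    return reward - beta * kl_estimate


# No drift: the reward passes through unchanged.
print(kl_shaped_reward(1.0, logp_policy=-3.0, logp_reference=-3.0))
# The policy made this answer much more likely than the reference: reward is reduced.
print(kl_shaped_reward(1.0, logp_policy=-1.0, logp_reference=-3.0))
```

A larger `beta` makes the fence tighter, trading reward improvement for staying closer to the reference model.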

Leo: So the mental model for this topic is not “ask humans, then the model becomes aligned.” It is a chain of translation: messy human judgment becomes preference data, preference data becomes a proxy reward, proxy reward becomes an update, and the update becomes behavior in the world.

Maya: And every translation can lose something. The art is deciding what must be preserved: caution, usefulness, honesty, agency, fairness, or a clean escalation path to a human.

Leo: In the healthcare scheduler, that means the best RLHF system is not the one with the friendliest tone. It is the one that reliably helps with scheduling while refusing medical advice, explaining uncertainty, and staying useful when reviewer preferences collide.

Maya: If your own assistant improved after human-feedback training, what test would convince you it learned the intended preference rather than a polished shortcut?

