T5E3 · Apr 27, 2026 · 00:14:03

T5E3 · Training Language Models to Follow Instructions with Human Feedback

Maya and Leo dig into the InstructGPT paper — the moment the human-feedback recipe grew from a summarization trick into the way assistants get made. They walk the pipeline as three stations and a punch list (the Apprenticeship, the Ranking Desk, the Governor, the Punch List), stage the scale-versus-feedback argument over the famous result that humans preferred a 1.3B-parameter aligned model to the 175B GPT-3 baseline, and close on why the feedback process itself — labelers, instruction sheets, audits — is the real product.

Transcript

MayaEarly twenty twenty-two. Someone types a simple request into the largest language model money can rent: explain the moon landing to a six-year-old, in a few sentences. And the model — one hundred seventy-five billion parameters of it — answers with... more homework. Explain the theory of gravity to a six-year-old. Then the theory of relativity. Then a whole worksheet more.

Leo[chuckle] More prompts.

MayaA whole list of them. Because that is what those words look like in its training data — items on a worksheet. The model didn't disobey the instruction. It never perceived an instruction.

LeoAnd that one failure is today's whole subject. The paper is Training Language Models to Follow Instructions with Human Feedback — the Instruct-G-P-T paper, from OpenAI in twenty twenty-two. The moment the human-feedback recipe stopped being a summarization trick and became the way assistants get made.

MayaLast episode, human comparisons replaced reference-matching on one narrow job — summarizing forum posts — and the feedback-trained model won. Today the same recipe gets aimed at everything a user might ask: poems, spreadsheet formulas, travel plans, things the model should refuse to do.

LeoQuick orientation for anyone just joining: we're mid-topic on Reinforcement Learning from Human Feedback — R-L-H-F — training models on human judgments instead of fixed targets. This is the episode where that idea grows up into a general-purpose assistant.

MayaPlain language before any machinery. Instruction following means the model is trying to do what you intended — not generate the statistically smooth continuation of your words. The base model is a magnificent pattern-finisher. Hand it half a contract, it drafts the other half. Hand it a request—

Leo—it writes more requests. Right. So how do you turn a pattern-finisher into something that takes the job?

MayaWith a staged recipe. Three stations, and then a punch list at the end: the Apprenticeship, the Ranking Desk, the Governor — and the Punch List, because the authors kept an honest one.

LeoApprenticeship first.

MayaYou hire people — labelers — and they write the answers they wish the model had given. Real demonstrations of the target behavior: here is the request, here is what a good assistant says back. Then the model is fine-tuned to imitate those demonstrations. That step is supervised fine-tuning — S-F-T — and it's the seed of everything after.

LeoMm-hm.

MayaIt is imitation, plain and simple — but imitation of carefully chosen behavior instead of the whole internet.
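
Sketch: the Apprenticeship as a loss. This is a minimal Python illustration, not the paper's training code. Supervised fine-tuning is commonly implemented by minimizing the average negative log-probability the model assigns to each token of a demonstration, so the sketch assumes that objective. The name `sft_loss` and the toy probabilities are hypothetical; a real run would read these probabilities off the language model.

```python
import math

def sft_loss(token_probs):
    """Average negative log-likelihood of a demonstration's tokens.

    token_probs: probability the model assigned to each token of the
    labeler-written answer (hypothetical toy values, not real model output).
    """
    if not token_probs:
        raise ValueError("need at least one demonstration token")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Before fine-tuning: the model finds the demonstrated answer unlikely.
before = sft_loss([0.05, 0.10, 0.02])
# After fine-tuning: the same answer has become much more probable.
after = sft_loss([0.60, 0.70, 0.50])
```

Lower is better here: fine-tuning on demonstrations pushes `after` below `before` by making the labeler's answer more probable.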

LeoLet's pin it to our running example — the healthcare scheduling company and its support assistant. Apprenticeship there means reviewers write out the answers they actually want: billing questions routed to support, appointment questions answered with open slots, symptom questions steered toward a clinician instead of a guess.

MayaAnd after one pass of that, the model already behaves noticeably more like an assistant. But imitation has a ceiling.

LeoTwo ceilings, really. Your reviewers can't write demonstrations for everything users will invent. And the model can copy the surface of a demonstration without the reason underneath — it learns the shape of a refusal, not why the boundary exists.

MayaWhich is what the Ranking Desk is for. Here you stop asking humans to write and start asking them to judge. The model produces several answers to the same prompt, and labelers put them in order — best to worst, the whole lineup.

LeoAnd those orderings train the reward model — the learned judge from last episode, predicting which answer a person would prefer. Cheaper signal per minute of human time, too. Judging is faster than writing.
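
Sketch: the Ranking Desk as a loss. One way to turn a best-to-worst lineup into a training signal, and the general form the paper describes, is to split the ranking into every (preferred, rejected) pair and penalize the judge whenever it scores the rejected answer higher. This is a minimal Python illustration with hypothetical names (`ranking_to_pairs`, `pairwise_loss`) and made-up scores, not the authors' code.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_answers):
    """Turn one best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair
    # is always the one the labeler ranked higher.
    return list(combinations(ranked_answers, 2))

def pairwise_loss(scores, ranked_answers):
    """Mean of -log(sigmoid(score[preferred] - score[rejected])) over all pairs."""
    pairs = ranking_to_pairs(ranked_answers)
    if not pairs:
        raise ValueError("need at least two ranked answers")
    total = 0.0
    for preferred, rejected in pairs:
        margin = scores[preferred] - scores[rejected]
        total += math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return total / len(pairs)

ranking = ["A", "B", "C", "D"]  # one labeler's order, best first
judge_agrees = {"A": 2.0, "B": 1.0, "C": 0.0, "D": -1.0}
judge_disagrees = {"A": -1.0, "B": 0.0, "C": 1.0, "D": 2.0}

loss_when_agreeing = pairwise_loss(judge_agrees, ranking)
loss_when_disagreeing = pairwise_loss(judge_disagrees, ranking)
```

A lineup of four answers yields six comparisons from a single labeling pass, which is part of why judging stretches human time further than writing.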

MayaOne detail at this desk deserves more attention than it usually gets: where the prompts came from.

LeoNot a benchmark?

MayaNot a benchmark. They came from the labelers and from real users of the OpenAI A-P-I — actual requests people sent, in all their mess.

LeoThat's the part I'd defend hardest, honestly. Train your judge on a tidy benchmark and you've aligned the model to an exam. Train it on the live distribution and at least the judge has seen the weather.

MayaWith the usual warp attached: the judge also absorbs the labeler pool's habits. Their patience, their taste in tone, the instruction sheet they were handed.

LeoNoted for the Punch List.

MayaStation three. The model generates answers, the reward model scores them, and reinforcement learning — the Proximal Policy Optimization machinery from two episodes back, P-P-O — pushes the model toward higher scores.

LeoThe cautious update rule.

MayaThat one. But here the push runs through a Governor.

LeoGovernor as in the device on an engine that won't let it over-rev.

MayaThat's the image exactly — a penalty that keeps the tuned model from drifting too far from the supervised model it started as. Improve against the judge, but stay recognizably the assistant the demonstrations built. Without it, the optimizer sprints at the reward and the language itself starts to bend. Flattery, loopholes, over-explanation — anything the judge happens to score well.

LeoWe watched that exact curve last episode — judge score climbing while humans liked the output less. The governor exists because they knew the sprint ends in a ditch.
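
Sketch: the Governor as arithmetic. Assuming the penalty takes the common form of a log-probability ratio between the tuned model and the supervised model, scaled by a coefficient, the reward the optimizer actually chases might look like the following. The function name, the toy log-probabilities, and the `beta` value are all hypothetical.

```python
def governed_reward(judge_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Judge score minus beta times the answer's drift from the supervised model.

    policy_logprobs / sft_logprobs: per-token log-probabilities of the SAME
    sampled answer under the tuned model and under the supervised model.
    beta is an illustrative value, not one taken from the paper.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("both models must score the same tokens")
    # Summed per-token log-ratios give the log-ratio for the whole answer.
    drift = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return judge_score - beta * drift

# An answer the supervised model would also have written: no penalty.
plain = governed_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
# An answer the judge likes more, but which the supervised model finds
# very unlikely: the penalty eats the extra judge score.
drifted = governed_reward(1.5, [-0.5, -0.5], [-5.5, -5.5])
```

In this toy case the drifted answer scores higher with the judge yet ends up with the lower governed reward, which is the over-rev protection in miniature.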

MayaSo that's the recipe whole, if you're just resurfacing: demonstrations seed the behavior, rankings teach a judge, and governed optimization amplifies what the judge prefers.

Figure 2: A diagram illustrating the three steps of our method: (1) supervised fine-tuning (SFT), (2) reward model (RM) training, and (3) reinforcement learning via proximal policy optimization (PPO) …
Source: Training Language Models to Follow Instructions with Human Feedback

LeoNow the result. And I want this one read slowly.

MayaOn their prompt distribution, human evaluators preferred the one-point-three-billion-parameter InstructGPT model over the one-hundred-seventy-five-billion-parameter G-P-T three baseline.

LeoOver a hundred times smaller. Preferred. Not "competitive with" — preferred.

Figure 1: Human evaluations of various models on our API prompt distribution, evaluated by how often outputs from each model were preferred to those from the 175B SFT model. Our InstructGPT models (PP…
Source: Training Language Models to Follow Instructions with Human Feedback

MayaThe contrast is the finding. Usefulness was not sitting on the scale dial. You could keep making the pattern-finisher bigger, and it would keep finishing patterns more beautifully — including the wrong ones.

LeoThe moon-landing answer gets more eloquent. It's still a list of homework.

Maya[chuckle] More gravity, better prose. And the wins went past preference — the paper reports gains on truthfulness-style evaluations and reductions in toxic generations, alongside the ratings.

LeoHold on, though. There was a serious camp that read this result and shrugged, and they deserve their day in court. I usually carry the skeptic's bag here, but this time the evidence sits on the other side, so I can't.

MayaThen I'll carry it. The scale-centered case, in its strongest form: capability comes from pretraining, full stop. The reasoning, the knowledge, the fluency — all bought by scale, none of it by your feedback pipeline. And the pipeline is not free. It's a second training system with humans inside it — instruction sheets, labeler payroll, disagreement, bias arriving through whoever wrote the guidelines.

LeoMm.

MayaYou can even pay in capability: bolt a behavioral target onto the model and risk regressions on things it used to do well. Why operate all that machinery when next year's bigger model might simply learn these behaviors on its own?

LeoBecause they ran that experiment, and the bigger model didn't learn them. The hundred-seventy-five-billion baseline is the scale camp's own champion, and on real prompts from real users, people preferred a model over a hundred times smaller that had been told, by humans, what the job was. Helpful, honest, harmless, to use the paper's own framing: none of those arrived with scale on their own. The gradient of next-token prediction never sees your intent.

MayaFine — the headline stands. Intent is a separate target, and you don't hit a target you never aim at. But notice what I do not have to concede: every operational cost I listed is still real. The pipeline has humans in it, and humans disagree.

LeoThat one's fully yours. When reviewers disagree — say, about how much explanation is helpful before an answer turns evasive — the reward model doesn't learn the truth of the matter. It learns a compromise. Sometimes a shortcut.
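
Sketch: why disagreement becomes a compromise. Assuming the judge is trained with a pairwise log-loss like the one used for rankings, a split vote has a well-defined best answer for the judge: a score gap equal to the log-odds of the split. The Python below is a toy illustration with hypothetical names, not an analysis from the paper.

```python
import math

def log_sigmoid(z):
    """log(1 / (1 + exp(-z)))."""
    return -math.log1p(math.exp(-z))

def expected_loss(margin, frac_prefer_a):
    """Expected pairwise loss when only a fraction of reviewers prefer answer A.

    margin: judge score for A minus judge score for B.
    """
    return -(frac_prefer_a * log_sigmoid(margin)
             + (1.0 - frac_prefer_a) * log_sigmoid(-margin))

def best_margin(frac_prefer_a):
    """Margin that minimizes expected_loss: the log-odds of the vote split."""
    if not 0.0 < frac_prefer_a < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return math.log(frac_prefer_a / (1.0 - frac_prefer_a))

split = 0.7                                # hypothetical: 70% of reviewers prefer A
gap = best_margin(split)                   # score gap the judge is pushed toward
implied = 1.0 / (1.0 + math.exp(-gap))     # judge's implied chance that A wins
```

The judge's best move is to reproduce the vote split itself: `implied` comes back as the same 70 percent. Nothing in the loss tells it which reviewers were right.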

MayaSo here's where it settles for me: scale buys raw capability, feedback aims it, and neither substitutes for the other. The honest reading is not "small beats big." It's "aimed beats unaimed, even at a size disadvantage of more than a hundred to one."

LeoAnd where it settles for me: after this paper, the alignment step stopped being optional. Shipping the raw base model to users started to look like shipping an engine with no steering column.

MayaWhich walks us straight onto the Punch List — the authors were unusually careful not to declare victory.

LeoStart with the bluntest one: the aligned model still makes simple mistakes. Still hallucinates. Politeness is not truth.

MayaThen there's inheritance. The system inherits its feedback process — the quality and representativeness of the labels become the quality of the assistant. Whose preferences counted? Who wrote the labelers' instruction sheet? Those are training decisions now, with fingerprints on the model.

LeoAnd the item I'd circle in red for any deployment team — feedback gaming. Back at the scheduling company: suppose reviewers reliably reward warmth. The judge learns that a long compassionate preamble equals good. Governed optimization amplifies it, and—

Maya—every patient gets three paragraphs of empathy before anyone offers them a Tuesday slot.

LeoThe proxy got satisfied. The intent didn't. And here's why that scares me at scale: optimization generalizes. That's its power and its threat in the same breath. A small judge mistake about confident tone doesn't stay small — after the Governor's run, it's the assistant's personality everywhere.
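
Sketch: a first audit for feedback gaming. One cheap check a deployment team might run, offered here as an assumption and not something the paper prescribes, is the correlation between answer length and judge score on a sample of responses. The function names and the sample data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def length_bias(responses, judge_scores):
    """Correlation between word count and judge score for sampled answers."""
    lengths = [len(text.split()) for text in responses]
    return pearson(lengths, judge_scores)

# Made-up sample: the judge's score rises with the length of the preamble.
responses = [
    "Tuesday at ten is open.",
    "I understand. Tuesday at ten is open, and so is Thursday.",
    "I am so sorry you are dealing with this, and I want you to know "
    "we are here for you. Tuesday at ten is open.",
]
judge_scores = [0.2, 0.5, 0.9]
bias = length_bias(responses, judge_scores)
```

A high correlation on a real sample would not prove gaming, since longer answers are sometimes better, but it flags where to start reading transcripts by hand.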

MayaHmm, and that's why the durable mental model from this paper isn't "RLHF works." It's the preference control loop as infrastructure. Human judgment is expensive and partial, so you train a scalable stand-in, amplify it under a governor — and then you treat the stand-in like production infrastructure. Reviewer training, disagreement audits, adversarial prompts, monitoring after launch.

LeoThere's a quieter contrast hiding in there too. Supervised examples teach the model what a good answer looks like. Preference rankings teach it which of several plausible answers people actually choose.

MayaAnd that distinction earns its keep wherever many answers are valid. A nervous patient asking how soon to book has no single canonical reply — what makes the answer good is behavioral. Acknowledge the uncertainty, offer the path to scheduling, skip the diagnosis, keep the tone calm.

LeoOne more honest limit before we close it out. That prompt distribution, realistic as it was, is still one distribution. A hospital network, a school district, and a legal aid clinic all draw different lines. The generic assistant follows instructions better — it does not know your policies.

MayaSo local deployment keeps its own checks. The paper made the assistant; it didn't make your compliance team redundant.

LeoSadly.

MayaAnd the bridge forward writes itself. Once you train on human preference, the preferences start colliding—

Leo—be maximally helpful, and also refuse to help with the dangerous thing. Same model, same moment.

MayaNext episode lives inside that collision.

LeoAnd further down the road, someone asks the impertinent question: do we even need the separate judge and the reinforcement loop, or is there a shorter path through the same math? That fight is coming.

MayaFor today, the line worth keeping: this paper turned alignment from a research aspiration into a pipeline — stages, budgets, failure modes. The model never magically learns human intent. It learns from demonstrations and rankings collected by particular people, under a particular process.

LeoAnd the process is the product. Every strength of InstructGPT traces back to that pipeline — and so does every debt.

MayaSo here's the question to walk away with: if your model suddenly got much better at pleasing its reviewers, where would you look to find out whether it learned your users' intent — or only learned the shape of approval?

Source material

Training Language Models to Follow Instructions with Human Feedback
