T5E1 · Apr 27, 2026 · 00:14:11

T5E1 · Proximal Policy Optimization Algorithms

Maya and Leo open the Topic 5 deep dives with the paper that made preference optimization practical: Proximal Policy Optimization. Starting from a physical-therapy brace that stops paying out range past a set angle, they unpack why step size is existential when a policy generates its own training data, how the clipped probability ratio and the pessimistic minimum make updates safe to repeat, why batch reuse was the real selling point, and how PPO became RLHF's workhorse — before arguing, in first person, whether its machinery is still worth the engineering pain.

Transcript

Maya: A physical therapist is rebuilding a knee after surgery. Every session she pushes the joint a little past where it bent yesterday — and there's a brace on it, set to stop the stretch at a fixed angle. Past that angle, pushing harder buys nothing. The brace doesn't punish her. It just stops paying out range.

Leo: And next session she resets the brace a few degrees further.

Maya: Session after session. Bold inside the band, immovable past it. Hold that image — it's today's entire paper in one piece of equipment.

Leo: Okay. I'm in.

Maya: Last time we mapped Reinforcement Learning from Human Feedback (RLHF) as a loop: human comparisons become preference data, the data trains a reward signal, and the reward steers the model. Today we climb inside the steering step itself. Proximal Policy Optimization (PPO) is the update rule that decides how hard the model gets pushed on any single round of training.

Leo: And the paper behind it isn't about chatbots at all. Schulman's paper tested simulated robots learning to run, and agents playing Atari. Language models adopted it later, because the underlying problem has the same shape: the model did something, a score says it was better or worse than expected, and an optimizer wants to adjust.

Maya: PPO's question is not which direction to adjust. It's how far you're allowed to move in one step.

Leo: Which sounds like a tuning detail.

Maya: It's the opposite of a detail, and here's why. In this kind of training, the model generates its own training data.

Leo: Say more, because that's the line that separates this from everything in the earlier topics.

Maya: In supervised learning, your dataset sits still. If an update goes badly, the next batch comes from the same well, and you recover. In reinforcement learning, the policy — the model's habit for choosing what to do — produces the experience it learns from. Take one oversized step, break the policy, and now a broken policy is writing your next batch.

Leo: So the data gets worse because the model got worse, which makes the next update worse.

Maya: That's the Doom Loop. One bad step doesn't cost you a step — it can poison everything downstream. Performance collapses, and there's no clean dataset to climb back out with.

Leo: Mm. Grim.

Maya: Which is why this corner of the field was obsessed with step size before PPO arrived. Vanilla policy-gradient methods nudged action probabilities up or down after each batch, and they were notoriously touchy — a learning rate that crawled on one task detonated on another.

Leo: And the established fix was Trust Region Policy Optimization (TRPO). Define a region around the current policy where your measurements can be trusted, then solve a constrained optimization problem to stay inside it. It worked. It was also heavy machinery — second-order math, hard to implement, awkward to combine with the ordinary tricks like parameter sharing.

Maya: So the PPO paper opens with what's almost a dare: get TRPO's stability with nothing but the plain gradient-descent toolkit, in something you could implement in a few lines.

Leo: That's the pitch. What's the mechanism?

Maya: One number, watched closely. For each action the old model took, compare how likely the new model is to take it against how likely the old model was. That's the probability ratio. Ratio above one, this update made the action more likely. Below one, less likely.
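
A minimal sketch of that ratio, assuming the two models expose log-probabilities for the action (the usual convention in practice). The function name and the example probabilities are hypothetical, chosen only for illustration.

```python
import math

def probability_ratio(new_logprob: float, old_logprob: float) -> float:
    """Ratio pi_new(action) / pi_old(action), computed from log-probabilities."""
    return math.exp(new_logprob - old_logprob)

# Hypothetical numbers: the old model took this action with probability 0.20,
# and the updated model now assigns it probability 0.25.
ratio = probability_ratio(math.log(0.25), math.log(0.20))
print(round(ratio, 2))  # 1.25, so the update made the action more likely
```

Working in log space is a design choice rather than a requirement: subtracting log-probabilities tends to be more numerically stable than dividing two very small probabilities.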

Leo: So the ratio is the knee angle. How far this update has bent the policy, action by action.

Maya: Nicely stolen. And now the brace: PPO clips that ratio into a narrow band around one — the paper's default is about twenty percent either way. Inside the band, the optimizer earns full credit for making good actions more likely and bad ones less likely. Once the ratio leaves the band, the objective goes flat.
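
As a rough sketch, the brace itself is just a clamp on the ratio. The name `clip_ratio` and the sample ratios are illustrative assumptions; the 0.2 default stands in for the "about twenty percent" band mentioned above.

```python
def clip_ratio(ratio: float, epsilon: float = 0.2) -> float:
    """Clamp a probability ratio into the band [1 - epsilon, 1 + epsilon]."""
    return max(1.0 - epsilon, min(1.0 + epsilon, ratio))

# Hypothetical ratios: two outside the band, two inside it.
for ratio in (0.5, 0.9, 1.1, 1.5):
    print(ratio, "->", clip_ratio(ratio))
```

Ratios inside the band pass through unchanged, while 0.5 and 1.5 come back as the band edges.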

Figure 1: Plots showing one term (i.e., a single timestep) of the surrogate function as a function of the probability ratio r, for positive advantages (left) and negative advantages (right). The red c…
Source: Proximal Policy Optimization Algorithms

Leo: Hang on — flat, not negative? If I shove a probability way up, the objective doesn't punish me. It just… stops paying me?

Maya: Right, and the distinction matters. There's no wall at the edge of the band. The gradient out there is simply zero, so the optimizer has no reason to keep shoving. The incentive is removed, not penalized.
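
One way to see "removed, not penalized" is to measure the slope of the clipped score numerically. This is a sketch under assumptions: `clipped_score` multiplies the clamped ratio by an advantage (how much better than expected the action was), the advantage of 2.0 is made up, and the slope is a simple finite difference rather than anything a real trainer would use.

```python
def clipped_score(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Clamped ratio times advantage: the part of the objective that goes flat."""
    clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return clipped * advantage

def slope(f, x: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

advantage = 2.0  # hypothetical: the action was better than expected
inside = slope(lambda r: clipped_score(r, advantage), 1.1)   # within the band
outside = slope(lambda r: clipped_score(r, advantage), 1.5)  # past the band
print(round(inside, 3), round(outside, 3))  # 2.0 0.0
```

Inside the band, every extra bit of ratio still pays at the full advantage rate. Past the band, the slope is zero: no reward for pushing further, and no penalty either.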

Leo: The Clip Band. Not a fence — a meter that stops running.

Maya: And there's a second move hiding in the objective, the quietly clever one. PPO takes the minimum of the clipped score and the unclipped score.

Leo: Meaning what, in practice?

Maya: Meaning the clipping is one-sided in exactly the way you'd want. When a change helps your objective, the clip caps how much credit you can claim. But when a change hurts — when you've made a bad action more likely — the minimum keeps the full, unclipped pain in view. Bounded credit, unbounded blame.
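
Written out, the objective being described is commonly given in the form below. The notation is an assumption layered on the conversation: $r_t(\theta)$ is the probability ratio for the action taken at step $t$, $\hat{A}_t$ is the advantage estimate (how much better or worse than expected the action turned out), and $\epsilon$ is the half-width of the band, about 0.2 by the default mentioned earlier.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[
  \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
\right]
```

The first argument of the minimum is the unclipped score and the second is the clipped score, so whichever is smaller is the one the optimizer sees.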

Leo: [chuckle] A pessimistic accountant. Gains recognized only up to the limit, losses recognized in full.

Maya: That's the Pessimistic Floor, and it's the real reason this is stable. The objective is a deliberate underestimate of how much the update helps, so optimizing it hard is safe by construction. Mostly.
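
The accountant's four cases can be checked with a few lines. This is a sketch, not the paper's code: `ppo_clip_term` is a hypothetical name for one sample's contribution, and the ratios and advantages are made-up values chosen to land outside the band.

```python
def ppo_clip_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """One sample's surrogate: the minimum of the unclipped and clipped scores."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# (description, ratio, advantage), all hypothetical.
cases = [
    ("good action pushed up", 1.5, +1.0),    # helps: credit capped
    ("good action pushed down", 0.5, +1.0),  # hurts: reported in full
    ("bad action pushed up", 1.5, -1.0),     # hurts: reported in full
    ("bad action pushed down", 0.5, -1.0),   # helps: credit capped
]
for label, ratio, advantage in cases:
    unclipped = ratio * advantage
    objective = ppo_clip_term(ratio, advantage)
    print(f"{label}: unclipped={unclipped:+.2f} objective={objective:+.2f}")
```

In every case the objective is less than or equal to the unclipped score, which is the sense in which it is a deliberate underestimate.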

Leo: Mostly. Noted.

Maya: Now, the part that made practitioners adopt it. The older methods took one cautious gradient step per batch, and then the data was—

Leo: —stale. Garbage, basically, because the policy that generated it no longer exists.

Maya: Which hurts most when generating data is the expensive part. Run the policy, collect whole trajectories, score them — and you get one timid step out of it? PPO's clipped objective is gentle enough that you can chew on the same batch for several passes, minibatch after minibatch, before the policy drifts far enough that the batch expires.
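
To make the reuse pattern concrete, here is a toy sketch of several minibatch passes over one batch. Everything specific is an assumption for illustration: a two-action policy with a single parameter `theta`, a made-up reward (action 1 pays 1.0, action 0 pays 0.0), a mean-reward baseline as the advantage, and hand-derived gradients in place of an autodiff library. It shows the loop structure only, not a production trainer.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def action_prob(theta: float, action: int) -> float:
    """Toy two-action policy: P(action = 1) = sigmoid(theta)."""
    p1 = sigmoid(theta)
    return p1 if action == 1 else 1.0 - p1

def clipped_term_grad(theta, theta_old, action, advantage, epsilon=0.2):
    """Derivative of one sample's clipped surrogate term with respect to theta."""
    ratio = action_prob(theta, action) / action_prob(theta_old, action)
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    if ratio * advantage > clipped_ratio * advantage:
        return 0.0  # the clipped branch is the minimum, and it is flat
    p1 = sigmoid(theta)
    dlogp = (1.0 - p1) if action == 1 else -p1  # d log pi(action) / d theta
    return advantage * ratio * dlogp

def collect_batch(theta, size, rng):
    """Sample actions from the current policy and attach baseline-subtracted rewards."""
    actions = [1 if rng.random() < sigmoid(theta) else 0 for _ in range(size)]
    rewards = [float(a) for a in actions]  # hypothetical reward: action 1 is better
    baseline = sum(rewards) / size
    return [(a, r - baseline) for a, r in zip(actions, rewards)]

def ppo_update(theta, batch, rng, epochs=4, minibatch_size=8, lr=0.5):
    """Several passes over one batch; ratios are always measured against theta_old."""
    theta_old = theta
    for _ in range(epochs):
        rng.shuffle(batch)
        for start in range(0, len(batch), minibatch_size):
            minibatch = batch[start:start + minibatch_size]
            grads = [clipped_term_grad(theta, theta_old, a, adv) for a, adv in minibatch]
            theta += lr * sum(grads) / len(grads)  # gradient ascent on the surrogate
    return theta

def train(iterations=5, batch_size=32, seed=0):
    rng = random.Random(seed)
    theta = 0.0  # start indifferent between the two actions
    for _ in range(iterations):
        batch = collect_batch(theta, batch_size, rng)  # fresh data from the current policy
        theta = ppo_update(theta, batch, rng)          # reuse it for several passes
    return theta

final_theta = train()
print(f"P(better action) after training: {sigmoid(final_theta):.2f}")
```

Note that `theta_old` is frozen for the whole update while `theta` moves. Once a sample's ratio leaves the band in the helpful direction, its gradient is zero, which is what keeps the extra passes from compounding.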

Leo: The Second Helping. Same groceries, more meals.

Maya: With a real expiry date, though. The ratio is always measured against the policy that produced the data. Drift too far, and that comparison stops describing anything the new model actually does.

Leo: And the practitioner docs are blunt about that edge. The Spinning Up guide describes implementations that track how far the new policy has drifted — KL divergence is the usual ruler — and cut the passes short when drift crosses a line. The clip is the headline; the drift watch is the seatbelt nobody mentions in the abstract.
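
The drift watch might look roughly like the sketch below. The sample-mean KL estimate and the 1.5x threshold follow a pattern common in open-source PPO code; the function names, the log-probability values, and `target_kl=0.01` are illustrative assumptions rather than anything stated in the episode.

```python
def approx_kl(old_logprobs, new_logprobs):
    """Sample estimate of drift: the mean of (log pi_old - log pi_new) over the batch."""
    diffs = [old - new for old, new in zip(old_logprobs, new_logprobs)]
    return sum(diffs) / len(diffs)

def passes_before_drift(old_logprobs, logprobs_after_each_pass, target_kl=0.01):
    """Run passes in order, stopping after the first one whose drift crosses the line."""
    completed = 0
    for new_logprobs in logprobs_after_each_pass:
        completed += 1
        if approx_kl(old_logprobs, new_logprobs) > 1.5 * target_kl:
            break  # the batch has expired; the remaining passes are skipped
    return completed

# Hypothetical log-probabilities of the batch's actions under the data-collecting
# policy, and under the updated policy after each of four planned passes.
old = [-1.0, -1.0, -1.0, -1.0]
after_passes = [[-1.002] * 4, [-1.01] * 4, [-1.05] * 4, [-1.2] * 4]
print(passes_before_drift(old, after_passes))  # 3
```

Here the third pass pushes the estimated drift to about 0.05, past the 0.015 threshold, so the fourth planned pass never runs.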

Maya: The paper itself even offers a sibling method: skip the clip, penalize KL drift directly, and let the penalty strength adapt. The clipped version won the popularity contest — simpler, and it performed at least as well in their tests.
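
A sketch of how an adaptive penalty strength can work, assuming the commonly cited rule: halve the coefficient when measured drift falls well under the target, double it when drift runs well over. The name `adapt_beta`, the 1.5 tolerance factor, and the example numbers are assumptions for illustration.

```python
def adapt_beta(beta: float, measured_kl: float, target_kl: float) -> float:
    """Adjust the KL penalty coefficient after a policy update."""
    if measured_kl < target_kl / 1.5:
        return beta / 2.0  # policy barely moved: loosen the penalty
    if measured_kl > target_kl * 1.5:
        return beta * 2.0  # policy moved too far: tighten the penalty
    return beta            # drift is near the target: leave it alone

# Hypothetical sequence of measured drifts against a target of 0.01.
beta = 1.0
for measured_kl in (0.001, 0.002, 0.05, 0.01):
    beta = adapt_beta(beta, measured_kl, target_kl=0.01)
    print(measured_kl, "->", beta)
```

Compared with the clip, this variant keeps a soft penalty in the objective at all times and moves the dial between updates instead of flattening the objective at a fixed band.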

Leo: Which brings me to the evidence, because this is a method paper — it lives or dies on benchmarks. What did they actually show?

Maya: Strong performance across simulated locomotion tasks and Atari games, for both the clipped and penalized variants. And they're explicit about the scoreboard they care about: a balance of sample efficiency, implementation simplicity, and wall-clock time.

Leo: Not "best ever." Best per unit of engineering pain. That might be the most honest framing in the whole genre.

Maya: And that framing is exactly why it ended up inside language-model training. Think about our healthcare scheduling company from the overview. Their reviewers' judgments have been distilled into a reward model, and now they want the assistant to chase that reward.

Leo: Where every ingredient costs real money. A rollout is a full generated response from a large model. The reward signal cost reviewer hours. You'd better get more than one timid update per batch — that's the Second Helping argument, word for word.

Maya: And the Doom Loop has a language-model flavor too: push the assistant too hard toward the reward and it doesn't just underperform — it stops sounding like language. Degenerate phrasing, strange tics, answers tuned to the reward model rather than to anyone's actual question.

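Leo: So the brace idea does double duty there. The Clip Band keeps the optimizer stable, and in practice it is usually paired with a KL penalty toward the supervised model, which is the leash that keeps the assistant near a model that still talks like a model.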

Maya: InstructGPT-style pipelines made the role explicit: supervised model first, reward model from human rankings, then PPO as the workhorse of the final stage. For a few years, RLHF in practice meant PPO against a learned reward.
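
In pipelines of this kind, as publicly described, the score PPO optimizes is typically the reward-model score minus a KL-style penalty for drifting away from the supervised model. A minimal sketch for a single response, with assumptions flagged: the function name, the `beta` value, and all the numbers are hypothetical, and real systems usually apply the penalty per token rather than per response.

```python
def shaped_reward(reward_model_score: float,
                  policy_logprob: float,
                  reference_logprob: float,
                  beta: float = 0.02) -> float:
    """Reward-model score minus beta * log(pi_policy / pi_reference) for one response."""
    drift = policy_logprob - reference_logprob
    return reward_model_score - beta * drift

# Hypothetical response: the reward model likes it (score 1.0), but the tuned
# policy finds it far more likely than the supervised reference model does.
near = shaped_reward(1.0, policy_logprob=-30.0, reference_logprob=-30.0)
far = shaped_reward(1.0, policy_logprob=-10.0, reference_logprob=-30.0)
print(round(near, 2), round(far, 2))
```

The same reward-model score is worth less when the policy has drifted further from the reference, which discourages answers tuned to the reward model at the expense of fluent language.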

Figure 1: Human evaluations of various models on our API prompt distribution, evaluated by how often outputs from each model were preferred to those from the 175B SFT model. Our InstructGPT models (PP…
Source: Training language models to follow instructions with human feedback

Leo: And here's where I want to push, because that era left scars. Everyone I know who has run this pipeline says the same thing: it's a beast. You're juggling the policy, the old policy, the reward model, usually a value model on top. The runs are fragile, the hyperparameters are folklore, and two teams with the same data can end up with different assistants.

Maya: Fragile compared to what, though? The flexibility is the point. A separate reward model is a component you can inspect — red-team it, retrain it, measure where it disagrees with fresh human labels. When the target is as messy as "helpful, honest about uncertainty, firm on safety, compliant with policy," you want an explicit loop you can steer and monitor mid-flight.

Leo: You can steer it if you can afford the crew! That machinery costs engineer-months. The simpler camp — and this is where Direct Preference Optimization enters the topic, a couple of episodes ahead — says skip the reward model, skip the reinforcement-learning loop, train directly on the chosen-versus-rejected pairs. Fewer moving parts, fewer ways to fail silently.

Maya: Fine — the operational complaint survives. I won't defend PPO's tuning folklore. But look at what the simple methods quietly inherit: the keep-the-new-model-near-the-old-model idea — the proximal part — shows up inside them too. The war is over the machinery, not the principle.

Leo: And I'll concede the principle. Every serious preference method since carries some version of the brace. It's PPO's specific apparatus that stays contested — and we'll fight that one properly when the DPO paper comes up.

Maya: Looking forward to it. Before we close, though, one limitation needs saying plainly, because PPO's reputation sometimes inflates into something the paper never claimed. PPO stabilizes the update. It does not validate the reward.

Leo: Meaning if the reward model pays for vague reassurance, PPO delivers vague reassurance — smoothly, stably, in well-moderated steps.

Maya: A careful walk in the wrong direction is still the wrong direction. Back at the scheduling company: if reviewers preferred short, confident answers because those are easy to grade, PPO will faithfully make the assistant shorter and more confident — even where production users needed caution and an escalation path to a clinician.

Leo: So the brace controls how fast the joint bends. It has no opinion on whether you're rehabilitating the correct knee.

Maya: [chuckle] That's the boundary of the method, put better than I was about to. Optimization stability and value correctness are different problems. PPO buys you the first. The reward model — and every human judgment behind it — still owns the second.

Leo: Let me land the recap. The Doom Loop is the danger: a policy learning from its own behavior can poison its own data with one oversized step. The Clip Band is the answer: full credit inside a narrow ratio band, nothing beyond it. The Pessimistic Floor keeps losses unclipped, so the objective stays a cautious underestimate. And the Second Helping is the payoff — several safe passes over the same expensive batch.

Maya: All in service of a deliberately humble promise. Not the best possible step. A step you can afford to take again tomorrow — like the knee. Nobody rebuilds it in one heroic stretch; you move the brace a few degrees every session and trust the accumulation.

Leo: So here's the question to walk away with. In the system you work on, where would you want a clip band — a place where you'd deliberately cap the payoff of a single update, even one that looks like pure improvement?

