
Transcript
MayaTwo labs publish on the same Friday. Both say the same thing — our coding agent jumped a big chunk on the same hard benchmark. Same headline number. A reporter writes one sentence: "models are getting better at code."
LeoWhich is true and tells you nothing.
MayaRight, because those two results came from completely different places. Lab one swapped in a fresh model: new pretraining run, new architecture. Lab two kept its base model and architecture exactly as they were. They built a fleet of fake-but-runnable repositories, let the agent fail in them thousands of times, and trained on the runs where it clawed its way back.
LeoSame scoreboard, totally different lever pulled.
MayaAnd figuring out which lever actually moved — that's the whole skill this topic is about.
LeoThe trap, then, is the headline. "It went up." Up because of what?
MayaUp because of what. Hold onto that question — it's the one we aim at every study in this topic.
LeoLove it.
LeoBack up one floor. We did foundations, then how to judge agents, then the systems and harnesses around them. Where does this topic sit?
MayaThis is the "so what made it better" topic. The last one taught you to read a number; this one teaches you to read a change in a number. Somebody ran an experiment, something improved, and now you play detective on the cause. The cause is rarely one thing, and the people reporting it have every incentive to credit the most impressive-sounding thing.
LeoWhich is always "the model."
MayaThe model's the celebrity. The environment, the data pipeline, the reward design — that's the crew nobody interviews. This topic is about interviewing the crew.
LeoGive me the tool, then. Reading one of these studies — what do I do?
MayaThere's a reading frame, and I'll give it a name so it travels. Call it the five-link Causal Chain. You walk a result backwards along five links and don't let go until you've got all five. The lever — what they actually changed. The setup — what they held fixed so the change is the only suspect. The result — what moved, on which benchmark.
LeoThat's three. What are the last two?
MayaThe artifact — what reusable data or environment the work left behind. And the implication — what a training team does differently on Monday.
LeoWait — the artifact one. Why's that the link people skip?
MayaBecause a paper that says "we got plus eight points" hands you a number. A paper that says "we built thirty thousand executable tasks with hidden tests, and here they are" hands you a machine. The number's a photograph. The artifact's the camera that took it. One you can frame, the other you can shoot a thousand more with.
LeoHuh. So two studies, identical bump — one's a press release, the other's a gift to everyone who reads it.
MayaThat's the distinction the topic turns on. Keep the camera, not the photograph.
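A reading aid for the five-link Causal Chain: the minimal sketch below treats the chain as a checklist and reports which links a write-up leaves unanswered. The class name, field names, and example strings are illustrative assumptions, not part of any cited study.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class CausalChain:
    """The five links, in the order they are walked. All names are hypothetical."""
    lever: Optional[str] = None        # what was actually changed
    setup: Optional[str] = None        # what was held fixed
    result: Optional[str] = None       # what moved, on which benchmark
    artifact: Optional[str] = None     # reusable data or environment left behind
    implication: Optional[str] = None  # what a training team does differently

    def missing_links(self) -> List[str]:
        # A link counts as missing if it is None or an empty string.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# A headline-only write-up: a number and a vague lever, nothing else.
press_release = CausalChain(
    lever="unspecified training change",
    result="+8 points on a hard benchmark",
)
print(press_release.missing_links())  # ['setup', 'artifact', 'implication']
```

An empty return value means all five links are accounted for; anything else is the list of questions still to ask.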
LeoOkay. Pin down the running example before we walk the chain, because every topic you do has one.
MayaOne team trying to make their coding agent better at a single painful job — fixing a real bug in a real repository. Failing test, a few files, a fix that has to run and pass checks the agent can't peek at. A dropped-bug ticket.
LeoAnd every study is a different way to train for that.
MayaDifferent lever, same target. We run that one bug-fixing agent through four different workshops and watch what each one changes. Let me name them — they're the map for the topic.
LeoGo.
MayaStart in the World-Model Room. The idea that the agent shouldn't just read code, it should learn what code does when it runs — predict the state after a line executes, not how the line looks.
LeoLess "spell-check the code," then — more "simulate it in your head."
MayaExactly the move. And a follow-on study is about debugging that internal picture when it's quietly wrong. Down the hall is the Task Forge — you don't have enough real bugs, so you manufacture them. Break real repositories in controlled ways, wrap each break in a runnable environment with tests, and now you've got near-endless reps.
LeoHold on — I have to flag that one. "Manufacture the training data" is exactly where I get nervous. Synthetic bugs aren't real bugs.
MayaGood — flag it.
MayaWe'll have that fight in a minute. Let me finish the map. Next door, the Reward Range — reinforcement learning. The agent tries, gets a signal, adjusts; the interesting studies are about what the signal is, because pure pass-fail is brutally sparse.
LeoAnd the last one?
MayaOut back, the Terminal Yard — the unglamorous one. Making an agent good at the command line turns out to be mostly a data-and-environment problem, not a make-the-model-smarter one. There's a mobile corner too, and the environment matters even more there.
LeoWorld-Model Room, Task Forge, Reward Range, Terminal Yard. Let me walk the chain on the Task Forge, since that's the one that made me twitch. Lever: they changed the training data, not the base model. Setup: hold the base model fixed, so the only suspect is the new data?
MayaThat's the clean version. Same model, trained on the forged tasks versus not.
LeoResult, score goes up on the real benchmark. Artifact — your favorite part — is the generator itself. The pipeline that makes the tasks.
MayaThat's the whole game. The score's nice. The task generator is the thing a hundred other teams can pick up and run. SWE-smith, in this topic, is exactly that — scaling the production of tasks, not just reporting a win.
LeoMeaning Monday, I stop hand-curating bugs and I build a forge.
MayaYou move your effort from collecting reps to manufacturing them at quality. Which lands us on your twitch.
LeoYeah. A synthetic bug is one you invented — you know the answer because you planted it. Real bugs are weird: a misunderstanding, a race condition, somebody's three-year-old assumption. Train on a million planted bugs and am I making the agent good at fixing bugs, or good at fixing plants?
MayaHere's the strongest case for synthetic, and I'll actually defend it. The bottleneck is volume. There are maybe a few thousand high-quality, human-verified real-bug tasks in the entire public world. You cannot do reinforcement learning at scale on a few thousand of anything.
LeoScarcity forces your hand, then.
MayaAnd — the part that's easy to miss — a synthetic task with a hidden test and a runnable environment is checkable. The agent can't bluff. It makes the test pass in the container or it doesn't. So even if the bug's origin is fake, the verification is real.
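To make "the bug is fake, the verification is real" concrete, here is a minimal sketch under heavy assumptions. A real pipeline would run the patch inside a container against a real repository; this stand-in runs everything in one process, and every function name and test case is invented for illustration.

```python
from typing import Callable, List


def planted_bug_mean(values: List[float]) -> float:
    # The planted break: divides by len - 1 instead of len.
    return sum(values) / (len(values) - 1)


def agent_patch_mean(values: List[float]) -> float:
    # A candidate fix, as an agent might submit it.
    return sum(values) / len(values)


def hidden_test(candidate: Callable[[List[float]], float]) -> bool:
    """Cases the agent never sees; it only learns pass or fail."""
    cases = [([2.0, 4.0], 3.0), ([1.0, 2.0, 3.0], 2.0), ([5.0], 5.0)]
    try:
        return all(abs(candidate(xs) - want) < 1e-9 for xs, want in cases)
    except Exception:
        return False  # a crash counts as a failure


print(hidden_test(planted_bug_mean))  # False
print(hidden_test(agent_patch_mean))  # True
```

The origin of the bug never enters `hidden_test`; only the behavior of the patched code does, which is the property being defended here.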
LeoMm — that part survives, I'll give you that. The execution's real even if the bug's fake. But my worry's distribution. You can verify a forged bug perfectly and still train on a distribution that doesn't look like the wild.
MayaAnd you're right, and the honest answer is that's why quality gates exist and why nobody serious claims synthetic replaces real. The strong position isn't "synthetic instead of real." It's "synthetic for volume, real for validation, and you watch the gap like a hawk."
LeoWhere do we land, then? I don't want to leave it at "both sides have a point."
MayaWe land here. On the training signal, synthetic plus strong verification wins — volume times checkability beats a tiny pile of pristine real tasks. On evaluation, fresh real wins, full stop — you never grade the agent's report card on the same kind of bug it trained on.
LeoTrain synthetic, test real.
MayaTrain synthetic at volume, test on fresh real, and treat the distance between the two scores as a measurement, not a nuisance. If it crushes the forge and flops in the wild — that gap is your finding.
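Treating the distance between the two scores as a measurement can be as simple as the arithmetic below. This is a sketch with made-up counts; the function names are hypothetical, and none of the numbers come from the studies discussed.

```python
def pass_rate(solved: int, total: int) -> float:
    """Fraction of tasks solved; raises if there are no tasks."""
    if total <= 0:
        raise ValueError("total must be positive")
    return solved / total


def forge_to_wild_gap(synthetic_rate: float, real_rate: float) -> float:
    """Positive gap: the agent does better on forged tasks than on fresh real ones."""
    return synthetic_rate - real_rate


# Made-up example counts, for illustration only.
synthetic = pass_rate(150, 200)  # forged tasks the agent trained near
real = pass_rate(25, 100)        # fresh real tasks held out for evaluation
gap = forge_to_wild_gap(synthetic, real)
print(f"synthetic={synthetic:.2f} real={real:.2f} gap={gap:.2f}")
# synthetic=0.75 real=0.25 gap=0.50
```

Tracking this gap across training runs, rather than either score alone, is what turns "it crushes the forge and flops in the wild" into a reportable number.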
LeoOkay. That I'll buy.
MayaAnd notice — we didn't narrate "experts disagree." We had the fight, and the resolution came from splitting the training question from the evaluation one.
LeoPush the World-Model Room into the concrete for me. "Learns what code does when it runs" — what does that buy me on the dropped-bug ticket?
MayaYour bug's in a function that mutates a list while looping over it. A model that only learned how code looks sees nothing wrong — syntax fine, names fine. A model that learned what code does runs it forward in its head and goes: wait, on the third pass this index points at the wrong element.
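The kind of bug described here is easy to reproduce. The sketch below is an illustrative example, not code from any of the studies: a function that removes items from a list while iterating over it, next to a version that builds a new list instead.

```python
from typing import List


def drop_negatives_buggy(values: List[int]) -> List[int]:
    # Looks fine on the page, but removing during iteration shifts the
    # remaining elements left, so the loop skips the item after each removal.
    for v in values:
        if v < 0:
            values.remove(v)
    return values


def drop_negatives_fixed(values: List[int]) -> List[int]:
    # Build a new list rather than mutating the one being iterated.
    return [v for v in values if v >= 0]


print(drop_negatives_buggy([1, -2, -3, 4]))  # [1, -3, 4]
print(drop_negatives_fixed([1, -2, -3, 4]))  # [1, 4]
```

Tracing the buggy version by hand shows the effect: after `-2` is removed on the second pass, the third pass lands on `4`, and `-3` is never examined.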
LeoSo a world model is the gap between proofreading the recipe and tasting the dish.
MayaProofread versus taste — good. And the follow-on work is honest about the catch: these internal world models are themselves buggy. The mental simulation can be confidently wrong. So there's a study just on debugging the world model — finding where its prediction of "what happens" diverges from what actually happens.
LeoA bug in the thing that's supposed to find bugs. When a field starts debugging its own tools, it's getting serious.
MayaThe Reward Range, then. "Reinforcement learning made it better" might be the most over-credited sentence in this whole area. So what's the lever in an RL study?
LeoIt's supposed to be "the agent learned from reward." But the actual lever, half the time, is the reward design, not the learning.
MayaThat's the detective move. The dirty secret of RL on code: the reward is usually "did the hidden test pass," and that's brutally sparse. The agent flails through a long fix and at the very end gets one bit — yes or no.
LeoOne bit for a hundred decisions. Almost nothing to learn from.
MayaWhich means the real innovations aren't "we did RL" — everyone does RL. They're how you fight the sparsity. One study adds guidance: when the agent's stuck, you hint, let it recover, and train on the recovered run. Another builds a learned verifier for partial credit, so the reward isn't one lonely bit at the end.
LeoRight — so when a paper says "RL improved our agent," the question is —
Maya— what was the reward, and where did the signal come from. RL's the verb everyone uses. The reward's the noun doing the work.
LeoNice.
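The difference between a one-bit reward and partial credit can be sketched in a few lines. One loud assumption: the studies mentioned use a learned verifier for partial credit, while this example substitutes the fraction of hidden tests passed as a much simpler stand-in. Both function names are hypothetical.

```python
from typing import List


def sparse_reward(test_results: List[bool]) -> float:
    """One bit at the end: 1.0 only if every hidden test passed."""
    return 1.0 if test_results and all(test_results) else 0.0


def partial_credit_reward(test_results: List[bool]) -> float:
    """Stand-in for a shaped reward: fraction of hidden tests passed."""
    if not test_results:
        return 0.0
    return sum(test_results) / len(test_results)


# An attempt that fixes most of the problem but misses one case.
almost_there = [True, True, True, False]
print(sparse_reward(almost_there))          # 0.0
print(partial_credit_reward(almost_there))  # 0.75
```

Under the sparse reward, this near-miss is indistinguishable from an attempt that did nothing; the shaped version at least tells the learner it was close.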
MayaAnd the artifact here is sneaky-valuable — not the trained agent, the failed trajectories. The runs where it got stuck and recovered. Those teach recovery, and you only get them by letting the agent fail in a real environment.
LeoEveryone wants to throw the failures away.
MayaThe failures are the curriculum. A perfect run teaches almost nothing; a clawed-back one teaches recovery, which is most of real engineering.
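One way to picture "the failures are the curriculum" is a filter that keeps only the runs that failed at some point and still ended with passing tests. The data model below is an assumption made for illustration; the step fields and the recovery rule are not taken from any cited paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    action: str                          # e.g. "edit", "read_file", "run_tests"
    tests_passed: Optional[bool] = None  # set only for "run_tests" steps


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

    def test_outcomes(self) -> List[bool]:
        return [s.tests_passed for s in self.steps
                if s.action == "run_tests" and s.tests_passed is not None]

    def is_recovery(self) -> bool:
        # Recovery: at least one failing test run, and the final run passes.
        outcomes = self.test_outcomes()
        return bool(outcomes) and outcomes[-1] and not all(outcomes)


clean = Trajectory([Step("edit"), Step("run_tests", True)])
clawed_back = Trajectory([Step("edit"), Step("run_tests", False),
                          Step("edit"), Step("run_tests", True)])
gave_up = Trajectory([Step("edit"), Step("run_tests", False)])

keep = [t for t in (clean, clawed_back, gave_up) if t.is_recovery()]
print(len(keep))  # 1
```

Only the clawed-back run survives the filter: the clean run never failed, and the abandoned run never recovered.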
LeoThe Terminal Yard — the boring workshop. Make the case for why I should care.
MayaIt carries the biggest hidden lesson. When teams set out to make an agent good at the terminal (commands, stack traces, recovering from a failed build), the instinct is "we need a smarter model." What the work keeps finding is: no, you need better data and better environments. Ten thousand clean, reproducible terminal scenarios beat a cleverer model trained on messy data.
LeoTerminal capability is a data-engineering problem wearing a machine-learning costume, then.
MayaThat's it. The headline says "the model got better at the terminal." The truth is "somebody built ten thousand environments that don't flake."
LeoWhich loops right back to the opening question. Up because of what.
MayaAlways.
LeoLet me play it back, whole. Every improvement study is a claim about a lever. Walk it backwards — lever, setup, result, artifact, implication — and refuse to credit "the model" until you've checked whether it was the data, the environment, the reward, or the verifier.
MayaThat's it. And the deepest version — the one to leave with — is that the most reusable output of these studies is almost never the model. It's the environment and the data. The camera outlasts the photograph.
LeoThe camera outlasts the photograph. That's the sticker.
MayaQuick vocabulary pass before we close — a few terms got used loosely. A trajectory?
LeoThe full recorded path of an attempt — every command, file read, edit, test run, and retry — not just the final patch.
LeoA verifier?
MayaWhatever decides if the work is good — an executable test, or a model trained to judge — handing back a signal the agent can learn from.
MayaA reward signal?
LeoThe feedback the agent learns from in reinforcement learning — at its crudest, one pass-or-fail bit; at its best, shaped to give partial credit.
LeoA world model, in the coding sense?
MayaThe model's internal sense of what code does when it runs — predicting program state, not just the next token of source.
MayaA synthetic task?
LeoA software problem you generate rather than harvest — usually a controlled break in a real repository, wrapped in a runnable environment with hidden tests.
LeoAnd a hidden test?
MayaA check the agent can't see while it works, so it can't tune its patch to pass it — which is what makes the verification trustworthy.
MayaCarry those, and every paper in this topic gets easier to read.
LeoHere's what I'm sitting with. We agreed the camera outlasts the photograph. But almost every headline frames the photograph.
MayaSo here's the question to leave on. Next time you read that some coding agent got dramatically better — before you believe the model got smarter, what's the one piece of evidence you'd want to see that would tell you it was really the environment and the data?
Source material
- CWM: An Open-Weights LLM for Research on Code Generation with World Models
- Debugging Code World Models
- Training Software Engineering Agents and Verifiers with SWE-Gym
- SWE-smith: Scaling Data for Software Engineering Agents
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
- Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
- R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents
- SWE-Hub: A Unified Production System for Scalable, Executable Software Engineering Tasks
- Nemotron-Terminal: On Data Engineering for Scaling LLM Terminal Capabilities
- SWE-Bench Mobile