T4E1 · Jul 1, 2026 · 00:13:46

CWM — Code Generation with World Models

The first deep dive of Topic 4 opens the World-Model Room. Maya and Leo unpack CWM, Meta's open-weights Code World Model LLM, whose lever is mid-training a 32B model on observation-action traces from a Python interpreter and agentic Docker environments, so it learns what code does when it runs, not just how it looks. They show the payoff on the topic's running bug (a cache that returns stale data on the second call, which a read-the-code model misses and a world model catches by simulating two calls) and stage the real attribution debate: did the trace footage win the SWE-bench number, or did the verifiable-environment RL on top? The throughline: the artifact worth copying is the recipe for turning program runs into world-model training data, and the open checkpoints are an invitation to settle the credit.

Transcript

Maya: Here's a party trick. Give a model a Python function and a starting input, but don't let it run anything. No interpreter, no print statements. Then ask it: walk me through this, line by line, and after each line tell me exactly what every variable now holds.

Leo: Like dictation, but for a program that's executing in its head.

Maya: Right. And most code models faceplant on that. They can write you a gorgeous version of that function. Ask them what it does on line nine, with these specific inputs, and they guess from how the code looks.

Leo: Because they learned to read code, not to run it.

Maya: That's the gap. Today's paper, out of Meta, is C-W-M, the Code World Model, and its whole bet is closing exactly that gap. Teach the model to be the interpreter, not just the author.

Leo: Okay, I'm in.

Maya: Quick step back, since we're walking into the first real study of this topic. Last time we built the map: the four gyms one bug-fixing agent gets sent to, and that habit of refusing to credit "the model" until you've checked what actually moved.

Leo: The "up because of what" reflex. And we said the first gym was the World-Model Room: learn what code does when it runs, not how it looks.

Maya: This is that room, with the door open. CWM is the World-Model Room made concrete.

Leo: So before we tour it, anchor me on the running example, because we promised to drag the same one through every episode.

Maya: Same one. One team, one coding agent, one painful job: a dropped-bug ticket in a real repository. Failing test, a handful of files, a fix that has to actually run and pass checks the agent never gets to see.

Leo: And last episode the lever was "manufacture more tasks." Today the lever is...

Maya: ...change what the model knows before it ever touches your bug. Not more tasks. A different kind of training, done ahead of time. Let me name what that means, plainly. A world model, here, is the model's internal sense of cause and effect inside a program. You change this variable, run that line, and here's the new state of the world.

Leo: World as in "the little universe inside the running program."

Maya: The heap, the variables, the call stack, what's true after each step. Most models have a foggy version of that. CWM's pitch is: make the fog into a picture.

Leo: How do you even teach that? You can't read it off static code. That's the whole point.

Maya: This is the part I find genuinely clever. They mid-train on traces. Take a mountain of Python, actually execute it, and record observation-action sequences: the line about to run, then the real state right after. Over and over, millions of times.
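[Editor's sketch] The recording step Maya describes can be approximated in plain Python with the standard `sys.settrace` hook. This is a minimal illustration, not CWM's actual pipeline or trace format: `record_trace` and `running_total` are hypothetical names, line numbers are counted from the `def` line, and the sketch assumes a non-recursive function.

```python
import sys

def record_trace(func, *args):
    """Run func(*args) and record (line_offset, locals_after_that_line) pairs.

    Hypothetical helper for illustration; assumes func is not recursive.
    """
    steps = []
    pending = None  # offset of the line whose effect we have not yet observed

    def tracer(frame, event, arg):
        nonlocal pending
        if frame.f_code is not func.__code__:
            return None  # ignore every frame except the function under study
        # A "line" event fires BEFORE a line runs, so the state seen now is
        # the state AFTER the previously announced line.
        if event in ("line", "return") and pending is not None:
            # Shallow snapshot: mutable values mutated later would need a deep copy.
            steps.append((pending, dict(frame.f_locals)))
        if event == "line":
            pending = frame.f_lineno - func.__code__.co_firstlineno
        elif event == "return":
            pending = None
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was installed
    return result, steps

def running_total(values):
    total = 0
    for v in values:
        total += v
    return total

result, steps = record_trace(running_total, [2, 5])
for line_offset, state in steps:
    print(line_offset, state)
```

Each printed pair is one "line that ran, state right after" record. The party trick from the top of the episode amounts to asking a model to produce this list without running the tracer.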

Leo: So instead of feeding it the recipe, you feed it footage of the dish being cooked.

Maya: Exactly the move. And they don't stop at a single interpreter. They also drop the model into agentic Docker environments, real containers, and record what happens when an agent pokes at a real system. Commands, outputs, file changes. So it's two flavors of footage: the fine-grained "what does this line do," and the coarse "what does this whole environment do when you act on it."
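[Editor's sketch] The coarse flavor of footage (commands, outputs, file changes) can be illustrated without Docker at all. The sketch below assumes a temporary directory standing in for a container and text-only files; `snapshot` and `run_episode` are hypothetical names, and the record layout is an assumption rather than the format the paper uses.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def snapshot(root):
    """Map each file under root to its text contents (assumes text files only)."""
    root = Path(root)
    return {str(p.relative_to(root)): p.read_text()
            for p in sorted(root.rglob("*")) if p.is_file()}

def run_episode(commands, workdir):
    """Run each command (an argv list) in workdir and record what happened."""
    records = []
    before = snapshot(workdir)
    for argv in commands:
        proc = subprocess.run(argv, cwd=workdir, capture_output=True, text=True)
        after = snapshot(workdir)
        # A file counts as changed if it was created, deleted, or edited.
        changed = sorted(name for name in after.keys() | before.keys()
                         if after.get(name) != before.get(name))
        records.append({
            "action": argv,
            "observation": {"stdout": proc.stdout,
                            "stderr": proc.stderr,
                            "exit_code": proc.returncode},
            "files_changed": changed,
        })
        before = after
    return records

# Two toy "agent" actions: write a file, then read it back.
with tempfile.TemporaryDirectory() as workdir:
    episode = run_episode([
        [sys.executable, "-c", "open('notes.txt', 'w').write('hi')"],
        [sys.executable, "-c", "print(open('notes.txt').read())"],
    ], workdir)

for record in episode:
    print(record["action"][-1], "->", record["files_changed"])
```

A real setup would run the commands inside an isolated container and would need to handle binary files and large outputs. The point here is only the shape of one record: the action, what the environment said back, and what changed on disk.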

Leo: Pause on "mid-train" for a second, because that word's doing work. That's not pretraining and it's not fine-tuning.

Maya: It sits between. You've got a base model that already read the internet. Mid-training is a middle layer where you pour in a specific kind of data (here, execution traces) before the later instruction-tuning and reinforcement-learning stages. Think of it as a semester abroad in "how programs behave," taken after the general degree, before the job training.

Leo: Semester abroad. Okay.

Maya: And the headline capability that falls out is the party trick from the top. You can hand CWM a function and an input and ask it to predict the execution trace, narrating the state after every line, with no interpreter in the loop. It's doing neural execution. Simulating the run.

Leo: Alright, here's where I plant a flag, because this is the exact spot the last episode trained me to be suspicious. Predicting traces is a cute demo. The number anybody actually cares about is: can it fix the bug? Did the agentic coding score move? And those two things are not the same thing.

Maya: Good. Hold that. It's the whole fight.

Maya: Let me put the numbers on the table first so we're arguing about something real. It's a thirty-two-billion-parameter model, open-weights, with checkpoints released, which for this topic matters enormously; we'll come back to that. On SWE-bench Verified, the real-bug-fixing benchmark and our dropped-ticket job, it lands around sixty-six percent with test-time scaling.

Leo: For a thirty-two-billion open model, that's... genuinely strong. That's competitive with much bigger things.

Maya: And it's not a one-trick coder. Strong on LiveCodeBench, near the top on the hard math sets. So the model's real. The question you're actually asking is...

Leo: ...was it the world-model footage that did it? Or was it the RL on verifiable environments at the end, which is the thing everyone does now? You could've gotten that SWE-bench number without the trace stuff at all.

Maya: And that's the attribution fight, right there. Let me take the world-model side honestly, not as a fan. Here's the strongest version. The trace mid-training and the SWE skill aren't two separate ingredients; the claim is one feeds the other. An agent fixing a bug is, secretly, doing execution prediction the whole time. "If I change this line, what breaks downstream?" That is a world-model query. So the bet is the footage builds the muscle the RL then sharpens.

Leo: Nice story, though. A clean story. What would actually convince me is an ablation (same model, same RL, trace mid-training turned off) where you show me the SWE number drops. Otherwise you're crediting the celebrity again.

Maya: And that's the fair demand, and to the paper's credit it's framed as a testbed, not a victory lap, with open checkpoints at each stage exactly so people can run that experiment. So let me concede the hard part: from the headline number alone, you cannot prove the traces caused the SWE gain. That's real.

Leo: Thank you. That's the honest version.

Maya: But here's the piece that survives the concession. Even if trace mid-training added zero to the SWE score, which I doubt, it bought a different thing that the score doesn't capture. A model that can predict execution can be asked to show its simulation. You can interrogate why it thinks the patch works.

Leo: Mm. So the capability isn't only "higher score." It's "inspectable reasoning."

Maya: Right, and that's where I'd land it. On "did it win the benchmark," the verdict is open, and the open weights are an invitation to settle it. On "did it add a capability the benchmark doesn't even measure," meaning predicting and exposing what code does: yes, clearly. Two different questions, and they've got two different answers.

Leo: Train the world model, then judge it on more than the leaderboard. Okay. I'll take that split.

Maya: And notice that's the topic's reflex again. The number went up. We refused to hand the credit to one cause, and we found the artifact hiding behind the score.

Leo: Which... let's do that. Walk it backwards. Lever, today, is the trace mid-training. The setup's the slippery one, like I said.

Maya: The setup's the open question, agreed. Result: strong SWE and math numbers from a mid-size open model. Artifact: this is the part I'd circle in red.

Leo: The artifact's not the model?

Maya: The artifact is the recipe for the footage. The pipeline that turns ordinary Python into observation-action traces, and the choice to record real Docker-environment interactions. Anyone can now think: what would my agent learn if I trained it on traces from my systems?

Leo: So Monday, the implication is: I stop thinking of execution as something that happens at test time, and start thinking of it as training data.

Maya: Exactly the shift. Your CI logs, your interpreter runs, your container sessions: those aren't exhaust. They're world-model fuel. The running program was teaching the whole time and nobody was recording the lecture.

Leo: Nobody recording the lecture. I like that.

Maya: Let me make the world-model payoff concrete on our actual bug, because "predicts state" is still abstract. Say the dropped ticket is a function that builds a cache, and the bug is it returns stale data on the second call.

Leo: Classic. Works once, lies the second time.

Maya: A read-the-code model sees the function, sees a cache, sees a return; all the shapes are correct, nothing's misspelled. It has no reason to suspect line twelve. A model with a world model runs it forward twice in its head and notices: on the second call, this key was never invalidated, so the value it hands back is the old one.

Leo: It doesn't spot the bug by recognizing a bad pattern. It spots it by simulating two calls and watching the state go wrong.
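[Editor's sketch] A toy version of the running bug, to make "simulate two calls and watch the state go wrong" concrete. The episode never shows the ticket's code, so everything here is an assumption: `PriceService`, its methods, and the prices are hypothetical stand-ins.

```python
class PriceService:
    """Hypothetical service whose read cache is never invalidated on write."""

    def __init__(self):
        self._db = {"widget": 10}   # stand-in for the real data source
        self._cache = {}

    def get_price(self, item):
        # Every line here looks reasonable when read in isolation.
        if item not in self._cache:
            self._cache[item] = self._db[item]
        return self._cache[item]

    def set_price(self, item, price):
        # Bug: the source is updated but the cached entry is left in place.
        self._db[item] = price

    def set_price_fixed(self, item, price):
        self._db[item] = price
        self._cache.pop(item, None)  # invalidate so the next read refetches

svc = PriceService()
first = svc.get_price("widget")    # fills the cache with 10
svc.set_price("widget", 12)        # source now says 12
second = svc.get_price("widget")   # cache still answers 10: stale
svc.set_price_fixed("widget", 15)
third = svc.get_price("widget")    # cache was cleared, so this reads 15

print(first, second, third)
```

Nothing in `get_price` is misspelled or malformed. The fault only appears when the state of `_cache` is tracked across the first call, the write, and the second call, which is the kind of step-by-step bookkeeping the world-model framing is about.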

Maya: That's the entire difference between proofreading and tasting, dragged onto your specific ticket. And it's why this sits at the front of the topic: every later gym assumes the agent has at least some grip on what code does, not just how it reads.

Leo: Now give me the part the paper's honest about. Where does the simulation break?

Maya: A few places, and they don't hide them. The neural execution is impressive but it's not a real interpreter. It can be confidently wrong, especially on long or weird runs where the state has to be tracked over many steps. The fog doesn't fully clear; it thins.

Leo: So the model can narrate a trace that sounds right and isn't.

Maya: Which, and this is a lovely setup for where we go next, means the world model itself needs debugging. If your model's internal simulation can be wrong, somebody has to find where it diverges from reality.
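[Editor's sketch] One simple way to "find where it diverges" is to line up a predicted trace against the trace from a real run and report the first step that disagrees. This assumes both traces are lists of (line, state) pairs; `first_divergence` and the sample traces are hypothetical, and the predicted trace is hand-written for illustration, not real model output.

```python
def first_divergence(predicted, actual):
    """Return the index of the first step where two traces disagree.

    Returns None when the traces match exactly. If one trace is a strict
    prefix of the other, the divergence is where the shorter one ends.
    """
    for index, (p_step, a_step) in enumerate(zip(predicted, actual)):
        if p_step != a_step:
            return index
    if len(predicted) != len(actual):
        return min(len(predicted), len(actual))
    return None

# Trace from a real run: (line, variable state after that line).
actual = [
    (1, {"total": 0}),
    (2, {"total": 0, "v": 2}),
    (3, {"total": 2, "v": 2}),
]

# A plausible-sounding prediction that forgets to apply the addition.
predicted = [
    (1, {"total": 0}),
    (2, {"total": 0, "v": 2}),
    (3, {"total": 0, "v": 2}),
]

idx = first_divergence(predicted, actual)
print("first divergence at step", idx)
print("predicted:", predicted[idx], "actual:", actual[idx])
```

Exact equality is the strictest possible check. A looser comparison, for example only over the variables a patch touches, is a design choice this sketch leaves open.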

Leo: Which is literally the next episode, isn't it. A bug in the bug-finder.

Maya: The thing that's supposed to know what code does, occasionally not knowing what code does. We pick that thread up next time.

Leo: Other limits worth flagging?

Maya: The usual size honesty: thirty-two billion is strong-for-its-class, not the absolute frontier, and the authors are clear this is a research platform, early results, "here's a testbed," not "we solved agentic coding." And the deepest open question is the one you swung at: how much of the win is the world-modeling versus the verifiable-environment RL on top. They handed everyone the checkpoints precisely so the field can answer it instead of taking their word.

Leo: Which is the most useful kind of paper, honestly. It ships the argument and the means to check the argument.

Maya: That's the artifact-over-trophy thing again. A closed model that scored the same would be a press release. Open checkpoints at every stage is a laboratory.

Leo: Let me say the spine back. The lever is teaching a model what code does by training it on footage of code actually running. The payoff is a model that can simulate execution, which both helps it fix bugs and lets you inspect its reasoning. And the catch is the simulation can be wrong, and you can't yet cleanly prove the traces, not the RL, earned the benchmark.

Maya: That's it. And the line I'd leave you holding: a model that only reads code knows what a program is supposed to mean. A model with a world model knows what it's actually about to do. The gap between those two is where most bugs live.

Leo: So here's what I'm chewing on. If a model can predict what your code does without ever running it, accurately enough to fix bugs, what's the first thing you'd stop spending compute on, and what's the first thing you'd never trust it on?
