T4E2 · Jul 2, 2026 · 00:14:42

T4E2 · Debugging Code World Models

Topic 4's second deep dive takes the code world model from the last episode and puts it on the workbench. Maya and Leo walk through Debugging Code World Models — a diagnosis, not an improvement study — that asks where a model's in-head simulation of a running program quietly breaks. They name two close-up failures: honest traces that write the full state after every step get token-heavy enough to exhaust the budget on long programs, and errors that pile up in string-valued variables because subword tokenization lets the model see strings only in misaligned pieces. Then they stage the real split on the long run: is the step-thirty drift a fundamental Transformer limit on state tracking, or something more fixable? The paper's controlled permutation-tracking benchmark separates generating the next action from propagating state — and when correct actions are handed in, state tracking holds, so the drift was wrong-action generation, not broken memory. The throughline: the artifact here is a failure map, and a simulator you can't trust everywhere is still useful if you know exactly where not to trust it.

Transcript

LeoTry this. You've got a model that swears it can run code in its head — narrate every variable after every line, no interpreter. You hand it a long function and ask for the play-by-play. It nails the first dozen steps. Then somewhere around step thirty, quietly, the story it's telling stops matching what the program actually did.

MayaAnd nothing flashes red. It keeps narrating in the same confident voice.

LeoThat's the unsettling part. It doesn't crash, doesn't say "I lost the thread." It just keeps going, wrong, in full sentences.

MayaSo today's question isn't "can the model simulate execution." We answered that last time — it can. Today's is the autopsy. Where does the simulation break, and why.

LeoLast time we built the tool that simulates the run. This time somebody's debugging the tool.

MayaWhich is the thread I left dangling at the end of CWM — the world model itself can be confidently wrong, and somebody has to find where. This paper is somebody. The title's just "Debugging Code World Models," and it's exactly that literal: it takes the world model apart on the bench and labels the broken parts.

LeoGood. Let's open it up.

MayaQuick orient. Same gym we toured last episode — where the agent learns what code does when it runs, not how it reads. Last time we admired the equipment. Today it's on the workbench with the panel off.

LeoLess "look what it can do," more "here's the diagnostic clinic." And the patient's our usual one.

MayaThe same team we've followed all topic — one coding agent, one dropped-bug ticket in a real repo. Today's twist: the agent's leaning on its internal simulator to reason about that fix — "if I touch line twelve, what happens on the second call?" If the simulator lies to it partway through, the whole fix is built on a bad mental model. Debugging the world model is debugging the bug-fixer's brain.

LeoA bug in the bug-finder. So what's the lever in this study? Last few episodes it was always "we changed the training." What did they change here?

MayaNothing. That's what's different. This isn't an improvement study with a lever — it's a diagnosis. Not "we made it better," but "we figured out, precisely, how it's broken, so the next person can." Different shape of paper, and the topic needs both.

LeoInstead of "up because of what," it's "wrong because of what."

MayaSteal the reflex, point it at the failure instead of the win.

LeoThen let's diagnose — where do they look?

MayaTwo exams. One's up close — single steps, small stretches of real code: does the model get this line right? The other's long-haul — can it track state over a long run without drifting? Call them the Close-Up and the Long-Haul.

LeoOne's "do you get the details," the other's "do you lose the plot."

MayaThat's the carve, and each surfaced its own failure. Start with the Close-Up — the first finding's almost funny in how mundane it is. The trace gets too long. Every step, the model writes out the full runtime state — every variable, what it now holds. Faithful. Also enormous. A program with a long execution history produces a trace so token-heavy the model runs out of room before the program finishes.

LeoWait. So it doesn't get the answer wrong — it runs out of budget getting there?

MayaIt exhausts the token budget. The paper calls it exactly that. The honesty of the representation — write everything after every step — is what sinks it on long programs.

LeoThat's a real tension, though. The whole pitch last episode was "show your work, write down the state, that's why it's inspectable." Now the same honesty is the failure mode?

MayaThat's the trap, and you've got your finger on it. The property that makes it trustworthy — explicit state — is the property that makes it expensive. You can't have the detailed confession and the short transcript at once. Something gives.

LeoEither it writes less and loses the inspectability, or writes everything and chokes on anything long. And "anything long" is most real software.

MayaRight. Our dropped-ticket bug isn't a five-line toy — it's a function deep in a call stack that's already done a thousand things. Token exhaustion isn't an edge case there. It's the common case.

LeoHuh. That reframes it.

MayaSecond Close-Up finding, sneakier. When the model does get a step wrong, the errors aren't spread evenly. They pile up in one place — variables that hold strings. Text values.

LeoStrings specifically? Not numbers, not lists — words.

MayaDisproportionately strings. And here's the diagnosis I find genuinely sharp. It's not that strings are conceptually harder to track. It's tokenization. The model doesn't see a string as one clean thing — it sees it chopped into subword pieces, and the seams don't line up with where the string's meaning lives.

LeoOh — so it's not a reasoning failure. It's a perception failure. The model can't cleanly see the object it's supposed to track.

MayaThat's the reframe. Picture tracking a word someone keeps handing you torn into random fragments — "str", "aw", "berry" — reassembling it every step. You'd drop characters too. The bug isn't in the thinking. It's in the eyes.

LeoWhich means when the agent reasons about a filename or an error message in our bug — a string — that's exactly where its simulation quietly corrupts. And half of real debugging is strings.

MayaPaths, keys, messages, formatted output. Right where it's weakest.

LeoHold on, though — that's two findings about the Close-Up. The thing I worried about at the top was the long run, the model drifting at step thirty. That's the Long-Haul exam. What did that one say?

MayaThis is the heart of the paper, and the part I want to argue about — because there are two honest readings of it. To study the long run cleanly they ditch messy real code — too many confounds — and build a controlled little world: a permutation-tracking benchmark. Picture a row of cups, and the only thing that ever happens is "swap cup three and cup seven." Swap, swap, swap, hundreds of times. One question: after all those swaps, do you still know where every cup is?

LeoA pure state-tracking test. No clever logic, just "keep the bookkeeping straight over a long sequence."

MayaStrip out everything except memory over time. Now — there's a famous worry hanging over this. Transformers, the architecture under these models, are known to struggle with exactly this kind of long-horizon state tracking. Real theory, real results, saying they degrade as the sequence gets long.

LeoRight, I've seen that. So the easy conclusion is: the model drifts at step thirty because Transformers just can't hold state that long. Architecture's the villain, game over.

MayaThat's the pessimist's reading, and I'll take it hard for a second, because it's the conventional wisdom and it's not stupid. The position: this is fundamental. Train all the traces you want — the substrate can't track state over long horizons. Stop polishing the supervision and go invent a new architecture.

LeoBut the paper actually ran the experiment. So let me take the other chair. There's a clean way to test "architecture, or something else." You separate two jobs the model's doing at once.

MayaName them.

LeoThere's generating the action — deciding "the next thing that happens is swap three and seven." And there's propagating the state — given that swap, correctly updating where the cups are. Two skills, mashed into one stream. When it fails, which half failed?

MayaThat's the whole experiment, and the result genuinely surprised me. They feed the model the correct actions — ground-truth commands handed in, so it never has to guess what happens next. It only does the bookkeeping. And under those conditions, the Transformer tracks state accurately over long horizons.

LeoWait — so the bookkeeping was fine? The thing everyone blamed the architecture for —

Maya— holds up, once you isolate it. State propagation isn't the bottleneck. The drift at step thirty was the model generating a wrong action — picking the wrong swap — then faithfully propagating from the wrong place. It wasn't losing track of the cups. It mis-read which move to make, then did the math correctly on the wrong move.

LeoHuh. So it's a bookkeeper who can add perfectly but keeps misreading the receipts.

MayaPerfect. The arithmetic's sound. The data entry's where it goes wrong. Completely different repair jobs.

LeoOkay, that moves me off the pessimist chair. Partly. I concede the headline: in this controlled benchmark, with actions handed in, state tracking survives — the architecture isn't the immediate villain. That's real, and against the grain.

MayaBut.

LeoBut it's a permutation benchmark. Cups and swaps. Real code isn't that clean — actions are tangled into the state, you can't always hand the model ground-truth moves in the wild, and the Close-Up findings say it's already struggling before we reach the long haul. So "state tracking is fine" is a lab result that still has to survive a real repository.

MayaFair, and I won't oversell it. But here's where we land, and it's not a draw. The pessimist throws out the architecture. The finding says don't — the expensive part, holding state over time, works. Spend your effort on the cheaper, fixable parts: picking the right action, how it sees strings, how long its traces get.

LeoThe disagreement resolves into a to-do list, not a verdict.

MayaA redirected to-do list. Both sides walk away with work, but the work moved — that's what a good diagnosis does. And notice we did the topic's thing again, inverted: refused to hand the failure to the obvious celebrity villain, "Transformers can't count," and went looking for which part actually broke.

LeoWrong because of what — pointed at a loss instead of a win. So bring it back to our bug. What does the autopsy change for the team with the dropped ticket?

MayaThree things, mapped to the three findings. Reasoning about a long function? Expect the simulation to thin out near the end — token exhaustion — so trust a deep trace less than a shallow one. A filename or an error string? Soft spot, where it silently corrupts. And when a long simulation goes wrong, suspect a wrong step it took, not a failure to remember.

LeoIt's almost a reliability map. "Here's where your in-head interpreter is least trustworthy — discount accordingly."

MayaThat's this paper's artifact — not a dataset or a model, a failure map. Last episode's takeaway was "execution is training data." This one's quieter: a simulator you can't trust everywhere is still useful — if you know exactly where not to trust it.

LeoLimitations the paper's honest about — where does the diagnosis itself stop?

MayaIt's a study of failure modes, not a fix — they map the broken parts, they don't ship the repaired model. The permutation benchmark is deliberately artificial, so the cleanest result lives in a clean world and still has to generalize. And it studies this explicit-state style of world model specifically — change how you represent state and some findings shift.

LeoThe usual caveat, in other words. "Here's what's wrong" isn't "here's the cure."

MayaBut it's what you need before the cure. You don't fix what you can't locate.

LeoLet me say the spine back. The world model can simulate execution, but it breaks three findable ways: its honest traces get too long and run out of budget, it corrupts strings because it sees them in broken pieces, and on long runs it drifts because it picks the wrong action — not because it can't track state.

MayaThat's it. And the line I'd leave you with: the headline last week was that the model can run code in its head. The quiet correction this week is that it runs code in its head the way you do at 2 a.m. — fine on the short stuff, fuzzy on the long stuff, and worst exactly where the words are. Knowing that is what makes it safe to use.

LeoHere's what I'm turning over, then. If your most useful tool is also confidently wrong in specific, mappable places — would you rather have a tool that's right less often but tells you when it's unsure, or one that's right more often but never flags its own blind spots?

Source material

← Back to Agentic Coding Capability: From Coding Models to Coding Agents