Transcript
MayaHere's a failure that has nothing to do with whether the model knew the answer. A model reasons its way to a perfect fix — logic right, code that would pass every test if you typed it in by hand. And the run is still scored a miss, because when the model handed over its edit, it handed it over in a shape the tool couldn't apply. The patch never made it onto disk.
LeoWait. So the brain was right and the hands slipped.
MayaThe hands slipped. And that gap — between knowing the fix and delivering it in a form a tool will actually accept — is the whole subject today.
LeoOkay, that's a different axis. Last time we were on SWE-bench-Live — the treadmill benchmark that keeps pulling in fresh GitHub issues every month so nobody can memorize the answer key. That was all about *is the score honest*.
MayaRight, freshness — can you trust the number. Today's source moves the question sideways. It's the Aider polyglot benchmark, and it stops asking only "did the model find the fix" and starts asking "can the model also *hand it over* cleanly, in a format the harness can mechanically apply, across several programming languages."
LeoSo delivery becomes part of the grade, not just the thinking.
MayaDelivery becomes part of the grade. That's the move. And let me set the plain-language version first, because "edit format" sounds like a technical footnote and it is absolutely not a footnote.
LeoYeah, unpack it. When a model "edits code," what's actually happening under the hood?
MayaPicture it physically. The model doesn't reach into your file and change a character. It produces *text* — a blob that's supposed to mean "change this file like so." And then a separate piece of software, the tool, has to read that blob and turn it into a real change on disk.
LeoSo there's a translation step — the model describes the edit, and the tool has to parse the description.
MayaExactly. And there's more than one way to write that description. Aider's writeup walks through a few. One way is brute force: hand back the *whole* file, changes baked in, and the tool overwrites the old file with the new one. Reliable to apply. Another way is a *diff*: the model sends only the lines that changed (here's the old snippet, here's the new one, splice it in). Surgical, but now the tool has to *find* that old snippet and swap it precisely.
LeoAnd if the model's "old snippet" doesn't exactly match what's in the file — a stray space, a line it misremembered — the splice fails.
MayaThe tool can't locate the spot to operate. So the diff is efficient but fragile — a real trade-off baked into how the model chooses to speak.
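To make the trade-off concrete, here is a minimal sketch of the two delivery styles described above. It is not Aider's actual implementation: the `SearchReplaceEdit` class and both `apply_*` functions are hypothetical names, and the exactly-one-match rule is an assumption chosen to show how a single misremembered character stops a splice.

```python
from dataclasses import dataclass


@dataclass
class SearchReplaceEdit:
    """Hypothetical diff-style edit: swap `search` for `replace` in a file's text."""
    search: str
    replace: str


def apply_whole_file(_old_text: str, new_text: str) -> str:
    # Whole-file format: nothing to locate, the new text simply overwrites the old.
    return new_text


def apply_search_replace(old_text: str, edit: SearchReplaceEdit) -> str:
    # Diff-style format: the old snippet must appear exactly once, character for character.
    count = old_text.count(edit.search)
    if count != 1:
        raise ValueError(f"expected exactly 1 match for the search block, found {count}")
    return old_text.replace(edit.search, edit.replace, 1)


source = "def area(w, h):\n    return w + h\n"

# The model quotes the old line exactly, so the splice lands.
good = SearchReplaceEdit("    return w + h\n", "    return w * h\n")
print(apply_search_replace(source, good))

# Same fix, but the model misremembered the spacing around "+".
bad = SearchReplaceEdit("    return w+h\n", "    return w*h\n")
try:
    apply_search_replace(source, bad)
except ValueError as err:
    print("edit rejected:", err)
```

The second edit carries the right logic and still never reaches the file, which is the "hands slipped" failure from the opening.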
LeoHmm. So why use the fragile one? If whole-file always applies, why not always send the whole file?
MayaBecause whole-file doesn't scale — imagine resending a two-thousand-line file to fix one line, every turn. Real coding agents lean on diffs to stay fast and cheap, so the fragile format is the one that matters in practice. And here's the finding from Aider's writeup I love, because it's counterintuitive.
LeoGo on, what'd they find?
MayaThat the harder edit formats don't just cause more *formatting* slips. They make the model write *worse code*. The cognitive overhead of producing a tricky diff format eats into its ability to get the logic right.
LeoOh — that flips it. So it's not two separate skills side by side. Wrestling with the format actually degrades the thinking.
MayaThat's the surprise. You'd assume "solve the problem" and "format the answer" are independent. They're not — juggling a fiddly output structure spends some of the same budget the reasoning needed. The hands slipping can make the brain slip too.
LeoThat genuinely reframes it. Okay — so how does the benchmark turn this into a number? What's it made of?
MayaThe polyglot set is two hundred and twenty-five coding exercises, drawn from the harder end. They come from Exercism — an open learning platform full of small, self-contained programming puzzles.
LeoSelf-contained meaning — one file, write this function, here are the tests it has to pass?
MayaPretty much. Each exercise hands you instructions, a stub of a function, and a hidden set of unit tests. The model writes the implementation, the harness runs the tests, and pass or fail is whether the tests go green. Same hidden-test logic we've seen all topic.
LeoAnd "polyglot" is the headline word, so — many languages.
MayaSix of them. C++, Go, Java, JavaScript, Python, and Rust. The older Aider benchmark was Python-only. Polyglot stretches the same idea across six languages at once.
LeoWhy does the multi-language thing matter, though? A bug is a bug.
MayaBecause edit format and language interact in nasty ways. Take whitespace. In Python, indentation *is* the syntax — get a diff's indentation slightly wrong and you haven't made it ugly, you've changed what the code *means*. In a curly-brace language, indentation is cosmetic. Same diff-application machinery, totally different failure modes by language.
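A small illustration of the Python half of that point, as a sketch only. The `total` function and the `load_total` helper are invented for this example. The two sources differ only in how far the last line is indented, which is the kind of drift a slightly misaligned diff could introduce.

```python
# Two versions of the same function; only the indentation of the
# final line differs. Both are valid Python.
intended = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

shifted = """
def total(values):
    result = 0
    for v in values:
        result += v
        return result
"""


def load_total(source: str):
    namespace = {}
    exec(source, namespace)  # acceptable for a toy snippet; never exec untrusted code
    return namespace["total"]


print(load_total(intended)([1, 2, 3]))  # 6
print(load_total(shifted)([1, 2, 3]))   # 1
```

Neither version raises an error. The shifted one just returns after the first loop iteration, so the mistake only shows up when a test checks the result.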
LeoAh. So six languages stress-tests the *delivery* skill, not just the reasoning — whether the format holds up when the rules of the page change. A polyglot test of handoff reliability, with correctness riding along.
MayaExactly. Which brings us to scoring, where format finally shows up. Two numbers, side by side. The first is the one you'd expect — the pass rate. What fraction of the exercises did the model actually solve, tests green. With one wrinkle: the model gets a second swing. If the first attempt fails, it sees the error output and tries once more, and the reported pass rate reflects that corrected attempt.
LeoA retry. So it's "can you get there, possibly after seeing what broke."
MayaRight, more like real coding — run it, it fails, read the error, fix it. The second number is special to this benchmark's philosophy. It's the percentage of responses that came back *well-formed* — the edit in a format the tool could cleanly apply.
LeoSo that second number is a pure delivery metric. It ignores whether the code was *correct* — only whether the edit was *applicable*.
MayaYou've got it. And reading the two together is the skill. A model with a high pass rate but a sagging well-formed percentage is smart but fumbles the handoff, so some of its good thinking never lands. And the writeup is explicit — failing an exercise only takes a breakdown in *one* step. Wrong code fails you. Right code in the wrong format also fails you.
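The two numbers can be sketched from per-attempt records. This is an illustration under stated assumptions, not the benchmark's real bookkeeping: the `Attempt` record and both functions are hypothetical, and the real harness may count responses differently.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    well_formed: bool   # could the tool apply the edit at all?
    tests_passed: bool  # did the unit tests go green afterwards?


def pass_rate(runs: list[list[Attempt]]) -> float:
    # An exercise counts as solved if either of its (at most two) attempts passed.
    solved = sum(any(a.tests_passed for a in attempts[:2]) for attempts in runs)
    return solved / len(runs)


def well_formed_rate(runs: list[list[Attempt]]) -> float:
    # Counted per response, ignoring whether the code was correct.
    responses = [a for attempts in runs for a in attempts[:2]]
    return sum(a.well_formed for a in responses) / len(responses)


runs = [
    [Attempt(True, True)],                           # solved on the first try
    [Attempt(True, False), Attempt(True, True)],     # solved after seeing the error
    [Attempt(False, False), Attempt(True, False)],   # malformed, then applied but wrong
    [Attempt(False, False), Attempt(False, False)],  # never applied
]
print(pass_rate(runs))         # 0.5
print(well_formed_rate(runs))  # 4 of 7 responses were applicable
```

In this toy data the last two exercises both count as misses, yet only the well-formed figure shows that one of them never got an applicable edit onto disk.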
LeoWhich is exactly your opening. The brain was right, the hands slipped, the run scores a miss. The well-formed number is how you *see* that, instead of lumping it into "the model's bad."
MayaThat's the diagnostic gift. Without it, a fumbled handoff and a genuine reasoning failure look identical — both just "didn't solve it." Split them, and the fix is obvious: a low well-formed rate doesn't call for a smarter model, it calls for a simpler edit format or a more forgiving parser. A knob on the harness, not the model.
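As one reading of "a more forgiving parser", here is a sketch of a matcher that ignores trailing whitespace when locating the old snippet. It is an assumption about what such a knob could look like, not a description of any real tool, and `find_block` and `apply_forgiving` are hypothetical names. Leading indentation is deliberately left significant so Python code keeps its meaning.

```python
def find_block(file_lines: list[str], search_lines: list[str]) -> int:
    """Return the start index of the unique match, or -1 if there is not exactly one."""
    def norm(line: str) -> str:
        return line.rstrip()  # ignore trailing spaces only; indentation stays significant

    target = [norm(line) for line in search_lines]
    n = len(target)
    hits = [
        i for i in range(len(file_lines) - n + 1)
        if [norm(line) for line in file_lines[i:i + n]] == target
    ]
    return hits[0] if len(hits) == 1 else -1


def apply_forgiving(text: str, search: str, replace: str) -> str:
    file_lines = text.splitlines()
    search_lines = search.splitlines()
    start = find_block(file_lines, search_lines)
    if start < 0:
        raise ValueError("search block not found exactly once")
    end = start + len(search_lines)
    new_lines = file_lines[:start] + replace.splitlines() + file_lines[end:]
    return "\n".join(new_lines) + "\n"


# The file has stray trailing spaces that the model's snippet does not reproduce.
on_disk = "def area(w, h):\n    return w + h   \n"
print(apply_forgiving(on_disk, "    return w + h\n", "    return w * h\n"))
```

A strict character-for-character match would reject this edit. The relaxed match accepts it without changing anything about the model.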
LeoAnd tie it to our running team — the folks shipping the fix to that open-source data library. Their agent reasons out the correct change, but emits the diff with the context lines slightly off, and the run logs a failure that, from the dashboard, looks exactly like "the agent couldn't solve it."
MayaWhen really it *did* solve it and the plumbing leaked. The polyglot's two-number view saves them an afternoon of chasing a reasoning problem that isn't there.
LeoOkay. Now the part where you tell me where the benchmark itself is weak. Because everything has a soft spot.
MayaIt does, and the biggest is the honest one to lead with: these are self-contained exercises. Exercism puzzles are single-file, write-this-function problems with clean little test suites. That is *not* what shipping into a real repository feels like.
LeoRight — no sprawling codebase, no "this change ripples into four other modules," no figuring out which of nine hundred files even needs touching.
MayaNone of it. No navigation, no cross-file reasoning, no build system fighting you. So a strong polyglot score says the model can produce and *deliver* a correct edit when the problem's handed to it on a plate — not that it can survive a real repo. Other benchmark families measure that; no single instrument sees everything.
LeoAnd the second soft spot — let me guess, the one that's haunted this whole topic.
Maya[chuckle] Contamination. You know it's coming. Exercism is a public learning site — these exercises and people's solutions have been on the open internet for years. The benchmark's own writeup is candid that the material very likely sits in the models' training data already.
LeoSo a high score might partly be "I've seen this exact puzzle solved" rather than "I reasoned it out fresh."
MayaPartly. But here's the subtle bit — the *format* half of the score resists that. Even if a model memorized the Python solution, it still has to emit a diff the harness accepts, in whatever format you demanded, in whatever language. You can't memorize your way out of clean edit formatting across six languages.
LeoOh, interesting. So the well-formed number is partly contamination-proof even though the pass rate isn't.
MayaThat's a fair read. Correctness is the contamination-vulnerable half; delivery is sturdier, because applying edits reliably is a *behavior*, not a fact you read somewhere.
LeoAnd you mentioned the results wobble even at a fixed setting.
MayaThey're upfront about it. Even with the randomness dialed all the way down, the same request can come back several ways, so an exercise on the edge adds noise. And they admit they mostly didn't average across many runs, because running the whole thing repeatedly costs real money.
LeoSo a point or two between two models is inside the wobble. Read it as "roughly comparable," not "this one's better."
MayaExactly. Small gaps are noise, big gaps are signal — same discipline as every benchmark we've touched. The number is evidence, not a verdict.
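For a rough sense of scale, a back-of-the-envelope sketch. It treats each of the 225 exercises as an independent pass/fail trial, which is a simplifying assumption and not something the writeup claims. It also captures only the noise from having a finite exercise set, not the run-to-run variation mentioned above. The function name is hypothetical.

```python
import math


def pass_rate_standard_error(p: float, n: int) -> float:
    # Binomial approximation: each exercise is treated as an independent pass/fail trial.
    return math.sqrt(p * (1 - p) / n)


n_exercises = 225
for p in (0.3, 0.5, 0.7):
    se = pass_rate_standard_error(p, n_exercises)
    print(f"pass rate {p:.0%}: about +/- {se * 100:.1f} points (one standard error)")
```

Under that assumption, one standard error at a 50% pass rate is about three points, so a one- or two-point gap between two models sits well inside it.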
LeoLet me land the whole thing. The deep idea isn't "another leaderboard." It's that for an *agent* — a thing that has to act on a codebase, not just chat about it — being right isn't enough. You have to be right in a form the machine can use.
MayaThat's the sentence. A model that talks beautifully about the fix but can't emit an applicable edit is, to an agent harness, useless. The polyglot benchmark makes that second skill — clean, cross-language handoff — visible and scoreable instead of invisible.
LeoAnd once it's in the light, you can engineer it. You can't fix what your dashboard quietly folds into "didn't work."
MayaWhich is the through-line of this whole topic — the headline number hides a more useful one underneath, and your job is to go find it.
LeoHere's one to sit with, then. If one model is brilliant at reasoning but unreliable at emitting edits a tool can apply, and another is a slightly weaker reasoner that always hands over a clean, applicable diff, which one would you actually want driving an agent loose in your codebase?