T3E8 · May 31, 2026 · 00:16:27

The Harness Problem

Same model weights, double-digit swings in success: this episode digs into Can Bölük's 'The Harness Problem,' which benchmarks 16 models and shows the edit format — how the agent tells a file what to change — can matter as much as a model upgrade. Maya and Leo walk through three editing dialects (patch, search-and-replace, and the content-hash 'hashline' that swaps copying text for pointing at it), why the weakest models gain the most, and the live fight over whether harness research should happen openly or stay locked inside each vendor's walls.

Transcript

LeoNo — I'm not buying it. You're telling me a model fails to fix a bug, and the *bug* isn't the model? It read the file, it picked the wrong fix, that's on the model.

MayaExcept in today's source that's exactly the case that doesn't hold. The model knows the fix — a guard clause got deleted, two lines up; it even says "I'll restore the missing guard clause." Then it tries to make the edit... and fails. Not because it misunderstood the bug. Because it couldn't *say* the change in a format the tool would accept.

LeoWait, it knew the fix.

MayaKnew the fix. Botched the *handoff*. The author's line is the model often isn't flaky at understanding the task, it's flaky at expressing itself.

LeoThen push on the part I still don't buy — "the model failed" and "the model couldn't express the edit" feel like the same sentence wearing two coats.

MayaThey're not the same, and the gap between them is the whole episode. Today's post is "The Harness Problem," by Can Bölük — a benchmark that pulls those two apart and measures the seam.

LeoAnd if you're right, the headline finding is—

MayaChange *only* how the model is allowed to phrase an edit — touch nothing in the weights — and success rate can swing by double digits. One model went from basically broken to mostly working. Same brain. Different form to fill out.

LeoSo zoom in. Where in the wrapper does this actually live?

MayaThe most concrete, almost insultingly mechanical piece *inside* it — one that's been quietly setting the score all series. And here's the swing: the meta-harness we just spent an episode on was an agent whose whole job was to *redesign* the harness by reading every past attempt in full — the optimizer building the wrapper, way up at the top. Today we drop back down to the single bolt at the bottom.

LeoThe piece being?

MayaHow the model tells a file what to change — the actual sentence the agent emits to edit your code. Sounds like plumbing. It is not.

LeoDefine it plainly first — I want to make sure I'm not picturing something fancier than it is.

MayaThe agent has decided what to do. Now it has to communicate that to the file. There are a few dialects it can speak in, and the harness picks which one. That choice — the *edit format* — is the thing. Not the reasoning. The grammar the reasoning fits through.

LeoAnd your claim is that grammar moves the score as much as a model upgrade.

MayaThat's the claim. Three dialects an agent can speak to edit a file: the Recipe, the Find-and-Replace, and the Tag.

LeoOkay, go.

MayaThe Recipe is the patch format — a unified diff. The model writes a structured blob: here's the chunk, the lines I'm removing, the lines I'm adding, in this exact shape. Like precise cooking instructions where one misplaced symbol ruins the dish.

LeoAnd that's fragile because—

Maya—because the model has to produce the *whole structure* perfectly, and most were never really trained to. In the benchmark, patch was the *worst* format for nearly every model. Grok 4 failed it about half the time. GLM, almost half.

LeoHalf? On editing a file? That's not a model that can't code. That's one that can't fill out the form.

MayaThat is *exactly* the insight, and you got there before I did. Second dialect: Find-and-Replace — the one most assistants use, including a lot of Claude tooling. The model says: find this exact block of text — every space, every indent — and swap it for a new one.

LeoWhich sounds bulletproof until the model misremembers one space.

MayaOne space. One tab versus four. Or the same line appears twice and it's ambiguous which you meant. Any of those, and you get the error every coding-agent user has seen: "string to replace not found." The post points at a GitHub megathread of people hitting exactly that wall.

LeoOh, I've lived that one.

MayaEveryone has. The model knew the change. The *copy* of the surrounding text was off by a whisker, and it silently failed.

LeoOkay, before the third one — there's an option people forget, right? Just rewrite the whole file.

MayaRight, full-file rewrite — Cursor's lane. Regenerate the whole file so you never *match* anything. Sidesteps the copying problem. But it's wasteful — re-emitting hundreds of unchanged lines to fix one — and falls apart on big files. Cursor's own analysis says full rewrites only beat diffs *under* about four hundred lines.

LeoSo below four hundred lines, brute force wins; above it, you're back to the fragile dialects.

MayaAnd here's the tell — Cursor cared so much they *trained a whole separate model* whose only job is to merge edits into files correctly. A dedicated neural net for the handoff.

LeoHang on. They threw an entire fine-tuned model at *applying an edit*? That's a company saying out loud: this is hard enough to deserve its own brain.

MayaThe quietest, loudest evidence in the post that the harness problem is real.

LeoOkay. Third dialect. The Tag. Sell me.

MayaThe failure mode in two of three dialects is the same: the model has to *reproduce existing text* perfectly, as patch context or as the find-block. Bölük's move is to delete that requirement. When you show the model the file, you tag every line with a tiny content hash — two or three near-random characters.

LeoGive me the picture.

MayaLine shows up as — tag `f1`, then the code. Tag `a3`, then the next line. To edit, the model doesn't copy the line back. It just says: replace the line tagged `f1` with this. He calls it hashline.

LeoWait. So the model never has to quote the original at all. It just names the line by its little label.

MayaNames it by label. And here's the clever part — those tags are derived from the line's *content*. So if the file changed underneath the agent since it last looked, the tag won't match, and the edit refuses instead of corrupting silently. The label is also a checksum.

LeoOh, that's nice.

MayaAnd a sneakier benefit. Bölük puts it as — if the model can recall a pseudo-random tag, chances are it knows what it's editing. The tag becomes a proof of attention. You can't fake remembering `f1` for the right line unless you tracked which line you meant.

LeoSo instead of asking "can the model photocopy text," you're asking "can the model point." And pointing is the thing it's actually good at.

MayaThat's the whole inversion in one sentence. Stop testing the model on transcription. Test it on intent.

LeoAlright, I trust mechanisms less than I trust tables. What did the measurements actually look like?

MayaGood — the setup matters. He mutates real files from the React codebase — flips a boolean, swaps an operator, drops a guard clause — then writes a plain-English bug report for each. Fresh agent each time, four tools: read, write, edit, search. Around a hundred and eighty tasks, three runs each, across sixteen models.

LeoOkay, that's a real harness, not a toy.

MayaAnd the results land hard. The headline: a small fast model, Grok Code Fast, went from under seven percent success to over sixty-eight — just by changing the dialect.

LeoWait, say that again. Seven to sixty-eight? That's not an improvement, that's a different model — except it's the *same* model.

MayaSame weights. Tenfold. Another more than doubled. And a beautiful side effect — one model cut its output tokens by about sixty percent, because it stopped flailing through failed-edit retry loops. Cheaper *and* better.

LeoAnd the strong models?

MayaHere's the pattern that makes the whole argument. The weakest models gained the *most*. A frontier model already threads the fragile formats well, so it has less to gain. The struggling model was never bad at coding — it was bad at the *form*. Fix the form, and its real ability shows.

LeoSo the gap we keep calling "model capability" is partly... a paperwork tax that hits the weak models hardest.

MayaPaperwork tax. Yes. And remove it and you find out the model knew the fix all along — it was the form it had to say it on.

LeoThere it is.

MayaAnd one more — switching Gemini to hashline gained about eight percent, and Bölük's line is that's bigger than most *model upgrades* deliver, at zero training compute. The whole benchmark ran for about three hundred dollars.

LeoI have to push there. "Bigger than most model upgrades," measured how? On *his* mutation benchmark, mechanical edits to one codebase. Real signal, but not a general capability jump.

MayaFair, I won't overclaim it. This measures *editing* — not reasoning, not architecture, not the hard creative parts. What it cleanly shows is narrower and still important: a big chunk of what we credit to the model is really the format it's forced to speak.

LeoI'll take that version. The narrow one is still a knife to the throat of "it's all the model."

MayaHere's the image I want for this one — a dictation line. The model is one person on a phone describing, line by line, the change they want made to a document they can't touch. A clerk on the other end makes the edit they hear.

LeoAnd the model's already worked out the fix.

MayaExactly the dropped-transaction bug — payments losing one in a thousand under load — it has *seen* it, it knows precisely which line to change. Now it has to speak that down the line.

LeoAnd if the line's full of static—

Maya—the clerk writes the wrong thing, and everyone blames the caller for a bad document. Patch and Find-and-Replace are a staticky line: spell every character of the surrounding text back perfectly or the clerk can't find the spot. Hashline is a clean line — the caller just says "the line tagged f1," and the clerk knows exactly which one.

LeoSame caller. Same fix in their head. The only thing you cleaned up was the line they say it over.

MayaAnd the success rate jumped tenfold. That's the episode.

LeoOkay. So if the line matters this much, here's where I get genuinely uneasy, and I don't think it's a clean win for the post.

MayaGood, because there's a real split here, and it's not the harness-versus-weights one — we settled that in the overview. This one's about *who is allowed to do harness research at all*. Take a side. I'll take Bölük's.

LeoI'll defend the vendors, not as villains. While Bölük was benchmarking, Anthropic cut off a third-party tool, OpenCode, from using Claude — called it a reverse-engineered private interface. And Google disabled his Gemini account mid-benchmark. He frames it as suppression. But run it from the vendor's chair. You ship a model with a *specific* harness you tuned end to end. Someone wires it into an untested wrapper, through an interface you never sanctioned. Now your model's behavior in the wild is something you can't reproduce or support. Of course you fence it.

MayaPart of that survives. But look at *what got fenced*. His benchmark showed Gemini hitting its best number — over seventy-eight percent — using *his* hashline trick, beating Google's own best config by five points. The community found capability Google hadn't shipped. And the response was to disable the account.

LeoSure, but "found capability" and "violated terms" can be the same act. The number being impressive doesn't make the access authorized.

MayaGranted — the access question is real and the terms exist. But here's the cost of winning that argument. If every vendor locks the harness to itself, nobody tunes the *cross-model* case. Anthropic won't optimize an edit format for Grok; xAI won't tune one for Gemini. The one place harness gains transfer across *everybody's* models — the open, shared layer — is exactly the place that gets shut.

LeoOkay, that's the part that moves me. The transfer argument. A community wrapper gets used against a dozen models by people who each fix the failure *they* personally hit. No single vendor has that surface area or that incentive.

MayaAnd Bölük reaches for an analogy from his old world, game security. When a team finds someone who got around their system, the mature move isn't a permanent ban — eventually you ask, "want to show us how you did it?" You recruit the cheater. Treat the person probing your harness as a *source*, not a threat.

LeoI'll concede the recruitment framing is the smarter long game. Where I won't fully fold: a vendor still has a legitimate interest in a supported, reproducible interface. So it isn't "vendors are wrong." It's — a sanctioned door for harness research, not a wall.

MayaA door instead of a wall. I'll take that as the resolution. We agree the lockdown has real costs to everyone's models; we disagree on how much vendors owe the people at the gate. And what would settle it is watching whether the open formats keep out-improving the private ones.

LeoWhich, conveniently, is a benchmark someone could run.

Maya[chuckle] It always is.

LeoBefore we close — limitations. He's not claiming the harness solves everything, right?

MayaNo, and credit to him for the honesty. Nobody hit a hundred percent — even the best dialect leaves real failures on the table. Other benchmarks back the messy picture: JetBrains looked across formats and models and found *no single dialect wins everywhere*. So hashline isn't a universal key. It's a strong default the data happens to love.

LeoAnd the open question he raises but doesn't answer.

MayaWhether this layer gets solved in the open, by a community hammering on many models, or quietly behind each vendor's walls. The mechanism is settled; the *politics* of who improves it is the live wire he ends on. One future, the best edit format is shared infrastructure. The other, a moat.

MayaSo here's what I want you sitting with. All topic, we've treated the model as the thing that *has* the capability and the harness as the wrapper that delivers it. Today blew a hole in that — change the format, hold the brain fixed, a broken model works.

LeoSo the question that leaves me—

MayaGo ahead, it's a good one.

LeoThe next time you read "our new model is so many percent better on coding" — how would you even *find out* how much of that gain is a smarter model, and how much is just a cleaner line between the model and the file it edits?

Source material

← Back to Agentic Coding Capability: From Coding Models to Coding Agents