T2E12 · May 30, 2026 · 00:13:05

T2E12 · CodeClash

CodeClash evaluates coding agents on open-ended, competitive goals instead of fixed bugs. Two codebases are improved over many tournament rounds, then dropped into a shared arena where a winner falls out of the competition itself — no spec, no hidden test grading the diff, and an opponent that can't be memorized. Maya and Leo unpack what this measures, how it differs from issue-fixing benchmarks like SWE-bench, and the honest limits: results are noisy, the arena is synthetic, and the headline that top models lose every round to human experts measures one narrow skill — open-ended strategic development — not the well-scoped fixes models are already good at.

Transcript

Maya: I'll say it flat out: you cannot grade a coding agent by handing it a hidden test suite anymore. The real test is another agent trying to beat you.

Leo: Whoa, no, hold on, that's a step too far. Hidden tests still tell you if the code works.

Maya: They tell you if it passed a fixed answer key. They don't tell you if it can out-build a live opponent toward an open goal. Here's how today's benchmark sets it up: two teams, same starter codebase, and one instruction, which is to make this thing win. Not "fix this bug." Not "pass these tests." They each edit their code in private, then the two programs get dropped into the same arena and made to fight. Whoever scores higher takes the round.

Leo: Okay, so the test isn't a checklist anymore. The test is the other guy.

Maya: That's exactly the move. And that flip, from "did you pass a fixed answer key" to "did you out-build a live opponent toward an open goal," is the whole idea behind today's benchmark. It's called CodeClash.

Leo: Okay, but before we get into it: CodeClash punishes the exact thing FeatureBench rewarded, doesn't it? Give us a quick refresher for anyone who missed that episode.

Maya: It does. FeatureBench rewarded hitting a spec cleanly. Its central move was scoring agents on building whole features: multi-file, here's a spec, go implement it. That's a bigger task than a one-line fix, but still a fixed target, with a known right answer and a test suite that checks whether you hit it. CodeClash punishes that same instinct, because there's no spec to hit at all.

Leo: And today's paper sharpens that even further.

Maya: It removes the answer key entirely. FeatureBench still says "here is the feature we want." CodeClash says "here is a goal (improve retention, cut costs, survive longer), so figure out what that means yourself and beat the other codebase at it." No spec, no hidden test waiting to grade your diff. Just a goal and an opponent.

Leo: Hmm. So how do you even grade that? With a normal benchmark I know what passing looks like. Here, passing is... beating someone?

Maya: Beating someone, yes. The paper has a name for this kind of work: goal-oriented software engineering.

Leo: As opposed to?

Maya: Task-oriented. Most benchmarks we've covered are task-oriented: a task is a bounded unit of work with a verifiable finish line. Goal-oriented means the finish line is open-ended. The goal is high-level, like "retain users better," and there are countless ways to chase it. The agent has to invent its own subtasks.

Leo: So it's the difference between "screw in this bolt" and "make this car go faster."

Maya: That's a good one. "Screw in this bolt": I check it in a second, the bolt's in or it isn't. "Make the car faster": lighter body? Bigger engine? Better tires? Now you're making engineering judgments, and the only real test is the track, against another car.

Leo: And CodeClash literally puts them on a track.

Maya: It does. Let me describe the loop, because the structure is the clever part. It runs as a tournament. Two agents, each with their own copy of a codebase. And every round has two phases.

Leo: Two phases. Walk me through them.

Maya: First, the editing phase: each agent works alone, reads its own code, makes changes, tries to improve. Then the competition phase: the two finished codebases get loaded into a shared arena and go head to head. The arena decides a winner by some objective measure. Maybe it's who scores more points in a game. Maybe it's who acquires more resources. Maybe it's who survives.
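A minimal sketch of the two-phase round structure Maya describes, offered as an illustration and not as the paper's implementation. Everything here is an assumption made for the example: the `Player` type, the idea of collapsing a whole codebase into a single `strength` number, and the noisy scoring rule in `competition_phase`. A real arena would run actual programs against each other.

```python
import random
from dataclasses import dataclass


@dataclass
class Player:
    """Hypothetical stand-in for one agent plus its private codebase."""
    name: str
    strength: float = 0.0   # toy proxy for "how good the codebase is"
    rounds_won: int = 0


def edit_phase(player: Player, rng: random.Random) -> None:
    """Editing phase: the agent changes its own code in isolation.

    Edits can help or hurt, so the change in strength may be negative.
    """
    player.strength += rng.uniform(-0.5, 1.0)


def competition_phase(a: Player, b: Player, rng: random.Random) -> Player:
    """Competition phase: both codebases enter a shared arena.

    The arena (assumed here to be a noisy score) decides the round.
    """
    score_a = a.strength + rng.gauss(0.0, 1.0)
    score_b = b.strength + rng.gauss(0.0, 1.0)
    return a if score_a >= score_b else b


def run_tournament(a: Player, b: Player, rounds: int, seed: int = 0):
    """Repeat edit-then-compete for a fixed number of rounds."""
    rng = random.Random(seed)
    for _ in range(rounds):
        edit_phase(a, rng)
        edit_phase(b, rng)
        competition_phase(a, b, rng).rounds_won += 1
    return a, b


if __name__ == "__main__":
    a, b = run_tournament(Player("agent_a"), Player("agent_b"), rounds=15)
    print(a.name, a.rounds_won, "|", b.name, b.rounds_won)
```

The point of the sketch is structural: no function ever inspects a diff. The only signal either player receives is which side the arena picked.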

Leo: Wait, survives? Like the programs are fighting to stay alive?

Maya: In some arenas, yes, that's the literal win condition. The point is the win condition is concrete and external. Nobody hand-grades the diff. The codebases just compete, and the scoreboard falls out of the competition itself.

Leo: Okay, that's wild. So there's no human and no hidden test suite reading the patch. The opponent IS the oracle.

Maya: That's the cleanest way to say it. In our topic's language, the verification signal isn't a fixed set of hidden tests anymore. It's a live adversary. And that solves a problem we've circled all topic: contamination. You can't memorize the answer to "beat whatever the other model does today," because it depends on a move that hasn't happened yet.

Leo: Right, there's no answer sitting in the training data to leak. [chuckle] The test rewrites itself every round.

Maya: Exactly. And it's run at real scale. The paper runs well over a thousand tournaments, which adds up to tens of thousands of rounds. So this isn't one cute demo match. It's enough repetition to see patterns in how models behave over a long development arc.

Leo: Let me make sure I've got the contrast with SWE-bench straight, since that's been our anchor all topic. SWE-bench: here's a real GitHub issue, here's the repo, produce a patch, hidden tests say pass or fail. One shot, fixed target.

Maya: One shot, fixed target, binary grade. CodeClash is the opposite on almost every axis. Many rounds, not one. A moving target (the opponent), not a fixed one. And the grade is relative, who-beat-whom, not pass-fail. SWE-bench asks "can you fix this." CodeClash asks "can you keep making something better than a rival is, with nobody telling you how."
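To make "the grade is relative" concrete, here is a small sketch of turning per-round outcomes into head-to-head win rates. It is only an illustration: the `(winner, loser)` record format and the function name `win_rates` are assumptions for this example, and the paper may aggregate results differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def win_rates(results: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Map (player, opponent) to the share of their rounds that player won.

    `results` holds one (winner, loser) pair per round. A pairing that a
    player never won simply has no entry for that direction.
    """
    wins = defaultdict(int)    # (winner, loser) -> rounds won
    played = defaultdict(int)  # unordered pairing -> rounds played
    for winner, loser in results:
        wins[(winner, loser)] += 1
        played[frozenset((winner, loser))] += 1
    return {pair: w / played[frozenset(pair)] for pair, w in wins.items()}


if __name__ == "__main__":
    rounds = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]
    for (player, opponent), rate in sorted(win_rates(rounds).items()):
        print(f"{player} vs {opponent}: {rate:.2f}")
```

Note that there is no absolute score anywhere in this table. A player's number only means something relative to the specific opponent it was measured against, which is the contrast with a pass-fail grade.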

Leo: And that "keep making it better over rounds" part, that's where it gets interesting, I bet. Because it's not one edit. It's edit, compete, see you lost, edit again.

Maya: That's the heart of it, and where the findings get honest. Now you're watching not whether the model writes good code once, but whether it can develop a codebase over time. And that turns out to be a very different skill.

Leo: Uh oh. I'm sensing the models don't come out looking great.

Maya: They come out looking human in their flaws, which is almost funnier. The paper finds models do show genuinely different development styles: some aggressive, some conservative, not all the same agent in a trenchcoat. But they share a deep weakness in strategic reasoning: trouble thinking several rounds ahead about what would actually win.

Leo: So they're tactical but not strategic. They can make this round's code a bit better, but they can't plan the campaign.

Maya: Well put. And there's a second failure that any engineer will wince at. As the rounds pile up, the codebases get, in the paper's words, progressively messy and redundant. The model keeps bolting on changes, never cleans up, and the repo slowly rots into a junk drawer.

Leo: Oh, that's painfully real. It's the agent equivalent of "I'll refactor it later" and then never refactoring it.

Maya: [laugh] Right. No instinct to step back and pay down the mess. It just accumulates. And there's one more result that really sets the ceiling. When you put the top models up against expert human programmers in these tournaments, the humans win. The paper's framing is blunt: the top models lose every round against the human experts.

Leo: Hmm. Every round. That's not "close but behind," that's a clean sweep.

Maya: A clean sweep, in this setting. And I want to be careful here, because this is exactly the kind of headline that gets overread. Let me name the limitations. The authors are upfront about them.

Leo: Yeah, please, because "AI loses to humans every round" is going to end up on someone's slide with no context.

Maya: So, the first caveat. This measures one specific thing: open-ended, competitive, multi-round development against an adversary. Genuinely hard, and a real gap. But it's not the same skill as "resolve this well-scoped bug," which is what SWE-bench measures, and where models actually do quite well. A model can be strong at bounded tasks and weak at open-ended strategy at the same time.

Leo: So it's not "agents are bad," it's "agents are bad at this particular open-ended, plan-ahead, beat-a-rival game."

Maya: Right. The second caveat: competition outcomes are noisy. Who wins a head-to-head can swing on the arena's quirks, on luck, on the matchup. CodeClash leans on running enormous numbers of rounds so the signal averages out, but for any single match, you can't read too much into one win or loss.
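One way to see why a single match says so little is to put an uncertainty interval around an observed win rate. The sketch below uses the standard Wilson score interval for a binomial proportion. Choosing this interval is an assumption made for illustration, not a method taken from the paper, and it treats rounds as independent, which multi-round tournaments may not fully satisfy.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, rounds: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 is roughly 95%)."""
    if rounds <= 0:
        raise ValueError("rounds must be positive")
    p = wins / rounds
    denom = 1.0 + z * z / rounds
    center = (p + z * z / (2 * rounds)) / denom
    half = z * math.sqrt(p * (1 - p) / rounds + z * z / (4 * rounds * rounds)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for wins, rounds in [(1, 1), (6, 10), (600, 1000)]:
        low, high = wilson_interval(wins, rounds)
        print(f"{wins}/{rounds} wins: plausible win rate {low:.2f} to {high:.2f}")
```

With one win out of one round, the interval runs from about 0.21 up to 1.0, so that result is compatible with a player who usually loses. At 600 wins out of 1,000 the interval narrows to a few points on either side of 0.60, which is the "trust the distribution" point in numeric form.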

Leo: That's a theme for the whole topic, honestly: never trust one number, trust the distribution.

Maya: It really is. And a third honest limitation: a synthetic arena is not production software. Winning a resource-grab game in a sandbox says something about strategic development, but it isn't maintaining a real service customers depend on for years. It's a proxy. A clever, contamination-resistant one, but still a proxy.

Leo: Okay, so let me try to place this on the map we've been building all topic. We started at issue-fixing: fixed bug, hidden tests. We went through feature-building: bigger but still a known target. And CodeClash is way out at the far end.

Maya: The open-ended end. Picture the benchmark families as a row of instruments, each shaped to see a different kind of work. The issue-fixing one sees "can you hit a known target." The feature one sees "can you build a bigger known target." CodeClash sees "can you keep getting better at an unknown target while someone fights you." Same agent, three very different readings.

Leo: And the reason you'd want that far-end instrument at all is what, exactly? Why not just keep making harder fixed tests?

Maya: Because fixed tests have a ceiling problem and a contamination problem. Eventually models memorize the shape of your test, or you run out of harder hand-written tasks. A live opponent never runs out, because the difficulty scales with the competitor automatically. That gives you a benchmark that, in principle, can't be saturated.

Leo: Huh. So the opponent is also the difficulty dial. That's kind of elegant.

Maya: That's the elegant part. The not-so-elegant part is what we said: it's noisy, it's synthetic, and the headline is easy to weaponize. A good evaluator holds both: a real, hard, contamination-resistant signal about strategic development, AND one narrow lens that doesn't cancel out what models are genuinely good at.

Leo: So if I'm an engineer reading this, the takeaway isn't "don't use coding agents." It's more like "know which instrument you're reading." If my job is well-scoped fixes, the bounded benchmarks are the relevant signal. If my dream is an agent that runs a codebase on its own for months toward a fuzzy business goal...

Maya: ...then this is the instrument that just told you we're not there yet. The strategic-planning gap and the codebase-rot problem are the two things standing between a good one-shot coder and an autonomous engineer. CodeClash didn't invent those gaps. It built an arena that makes them impossible to hide.

Leo: That's a nice way to end it. The benchmark's real product isn't the leaderboard. It's the exposed weakness.

Maya: The exposed weakness, surfaced by competition instead of by a grader's opinion. Which brings me to the question I want to leave you with. If a coding agent only ever loses these tournaments because it can't plan ahead and can't keep its own codebase clean, would you rather we measured its raw ability to win, or its ability to know when it's making a mess and stop?
