T2E3 · May 30, 2026 · 00:11:47

T2E3 · Leaderboard Literacy

A coding-agent leaderboard reports one number, but that number is a bundle — task shape, the room the agent worked in, the judge that scored it, language coverage, contamination risk, and the failure modes a ranking never shows. Maya and Leo pull the bundle apart across the real benchmark families (issue-fixing, edit-and-repair, terminal, feature-development, open-ended, and mobile) so you can read the next benchmark headline critically.

Transcript

MayaTwo coding agents post the identical headline number on the same leaderboard — same share of issues resolved, dead heat — and the work behind those two numbers has almost nothing in common.

LeoWait — identical score, but how?

MayaOne of them got there fixing curated Python bugs with one clean attempt and no human nearby. The other got there with fifty tries, a hidden reranking model picking the winner, and a few tasks quietly dropped because they were flaky.

LeoOh — so the number is identical and the work behind it is nothing alike.

MayaThat's the whole episode. A leaderboard reports one number, but a coding score is really a bundle — task type, environment, tooling, scoring method, languages, and what could be leaking. Today is about pulling that bundle apart.

LeoAnd after we walked the testing gates — where every test teaches the agent what behavior gets rewarded — this feels like the layer sitting on top of all that.

MayaExactly that. Tests were zooming into one signal. Today we zoom back out. A leaderboard sits on top of all those signals and compresses them into a single ranking — and that compression is exactly where you can get fooled.

LeoSo before we go further — what does a coding-agent leaderboard actually *do*?

MayaA leaderboard takes a fixed set of tasks, runs each system through them, scores every attempt the same way, and sorts everyone top to bottom. The promise is comparability — same exam for everyone.
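A minimal sketch of that loop, assuming a toy setup: the task IDs, the two agents, and the `judge` function are all hypothetical stand-ins, not any real benchmark's harness.

```python
from typing import Callable, Dict, List, Tuple


def build_leaderboard(
    tasks: List[str],
    systems: Dict[str, Callable[[str], str]],
    judge: Callable[[str, str], bool],
) -> List[Tuple[str, float]]:
    """Run every system on the same tasks, score each attempt the same way,
    and sort by pass rate, highest first."""
    rows = []
    for name, solve in systems.items():
        passed = sum(judge(task, solve(task)) for task in tasks)
        rows.append((name, passed / len(tasks)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical toy data: four tasks, two agents, one shared judge.
tasks = ["t1", "t2", "t3", "t4"]
systems = {
    "agent_a": lambda task: "fix" if task != "t4" else "no-op",
    "agent_b": lambda task: "fix" if task in ("t1", "t2") else "no-op",
}
judge = lambda task, output: output == "fix"

board = build_leaderboard(tasks, systems, judge)
print(board)
```

Everything that varies underneath (tools, retries, the judge itself) sits inside `systems` and `judge`, which is why the sorted list alone says so little.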

LeoAnd the catch?

MayaThe catch is that "same exam" hides six different things that can vary underneath. I want to walk those as landmarks, so you can hear them coming on the next benchmark headline.

LeoGive me the map first, so I don't lose it if I'm half-listening on a walk.

MayaThe shape of the work. The room the agent works in. The judge. The languages. The leak. And the things that never show up in the number at all.

LeoShape, room, judge, languages, leak, and the invisible part. Got it.

MayaFirst landmark — the shape of the work. This is just: what *kind* of software task is this benchmark even made of?

LeoBecause "coding" isn't one thing.

MayaNot even close. There's issue-fixing — you get a real GitHub bug report and you produce a patch that has to pass hidden tests. That's the SWE-bench family, SWE-bench Verified being the human-cleaned version, five hundred issues that annotators confirmed are actually solvable and clearly stated.

LeoSo a high score there means... it's good at repo-level bug fixing in clean Python issues.

MayaExactly that, and only that. Now slide over to edit-and-repair — the Aider polyglot leaderboard. That's not big messy repos, it's a couple hundred self-contained exercises, and a big part of what it measures is whether the agent can *apply its own edit cleanly* — produce a diff that actually lands in the file.

LeoWait, that's a separate score? Whether the edit format is even valid?

MayaIt is, and I love that they split it out. You can have a model that knows the right fix but keeps mangling the diff so it never applies. That's a real failure mode, and it's invisible on a pure pass-rate benchmark.

LeoThen there's the terminal world, right?

MayaTerminal-Bench — can the agent drive a command line over a long horizon, set things up, run things, recover. Different muscle entirely. Then feature-development benches like FeatureBench, where the task isn't one bug, it's "build this feature end to end." Then open-ended ones like CodeClash, where agents basically compete and maintain a codebase over rounds. And mobile-app benchmarks for the on-device, UI-driven flavor.

LeoSo six rough shapes — issue-fixing, edit-and-repair, terminal, feature work, open-ended, mobile.

MayaAnd here's the move: a number from one shape does not transfer to another. Strong at curated bug fixing tells you almost nothing about shipping a feature across pull requests.

LeoMy recap — the shape tells you which claim the score is even allowed to make.

MayaSecond landmark — the room. Same model, two different rooms, two different scores.

LeoBy room you mean...

MayaThe environment and the tools wrapped around the model. The harness. One run hands the agent a real repo, file search, a terminal, test feedback, and a repair loop. Another run hands the same model a plain prompt and no way to run anything.

LeoAnd those can both end up on a leaderboard looking like "the model scored X."

MayaThat's the trap, and it's the one that bites people most. The leaderboard says "model," but the result is "model plus harness." Two labs can submit the same base model and land in different spots purely because their scaffolding differs.

LeoTake our running example — the order validator. The bug where the patch blocks negative quantities but still lets zero-quantity orders through.

MayaPerfect. An agent with a terminal and the test suite can run it, watch zero slip through, and fix it. The same model with no execution just guesses from the issue text. Same brain, very different odds — and that gap is the room, not the model.
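The order-validator bug can be sketched roughly like this. The function names and the three test cases are illustrative assumptions, since the episode describes the bug but not the code.

```python
def validate_quantity_buggy(quantity: int) -> bool:
    # The incomplete patch: negatives are rejected, but zero still passes.
    return quantity >= 0


def validate_quantity_fixed(quantity: int) -> bool:
    # An order needs at least one item.
    return quantity > 0


def failing_cases(validate) -> list:
    """Return every (quantity, expected) pair the validator gets wrong."""
    cases = [(-1, False), (0, False), (1, True)]
    return [(q, expected) for q, expected in cases if validate(q) != expected]


print(failing_cases(validate_quantity_buggy))  # the zero-quantity case fails
print(failing_cases(validate_quantity_fixed))  # nothing fails
```

An agent that can execute `failing_cases` sees the zero-quantity failure directly. An agent with no execution has to infer it from the issue text.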

LeoSo when I read a score, I want to know what was in the room.

MayaTools available, retries allowed, time limit, context policy, whether a human ever nudged it. A serious leaderboard tells you. My recap — same model, different room, different agent.

MayaThird landmark — the judge. How does a run get turned into pass or fail?

LeoAnd we know from last time tests can be weak.

MayaRight, so this layers on top. On issue-fixing benches the judge is usually hidden tests — your patch passes or it doesn't. On edit-and-repair it's pass rate plus that edit-format check. On open-ended or feature work, sometimes the judge is partly an LLM verifier or a rubric, not just a green checkmark.

LeoThose aren't the same kind of evidence at all.

MayaThey're not. And the scoring *method* hides choices too. A score of "resolved on the first attempt" is a very different animal from "best of fifty attempts with a reranker choosing the winner."

LeoOh, that's the dead-heat thing from the top.

MayaThat's the dead-heat thing. Both can be reported as the same percentage. One is asking "can you do it once," the other is asking "can you do it if you're allowed to try a lot and have help picking." Honest leaderboards report the sampling budget right next to the number.
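To make the gap concrete, here is a small sketch with made-up attempt outcomes. It treats best-of-k as "any of the first k attempts passed," which assumes a perfect selector; a real reranker would land at or below that figure.

```python
from typing import Dict, List

# Hypothetical per-task outcomes: one boolean per attempt, in order.
attempts: Dict[str, List[bool]] = {
    "task_1": [True, False, True],
    "task_2": [False, False, True],
    "task_3": [False, True, False],
    "task_4": [False, False, False],
}


def first_attempt_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Share of tasks resolved on the very first try."""
    return sum(runs[0] for runs in outcomes.values()) / len(outcomes)


def best_of_k_rate(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Share of tasks where at least one of the first k tries passed.
    This is an upper bound that assumes the winner is always picked."""
    return sum(any(runs[:k]) for runs in outcomes.values()) / len(outcomes)


print(first_attempt_rate(attempts))   # 0.25
print(best_of_k_rate(attempts, k=3))  # 0.75
```

Same system, same tasks, and the reported percentage triples once the sampling budget goes from one to three.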

LeoRecap from me — the judge decides what "pass" even promises.

MayaFourth and fifth landmarks I'll take together, because they rhyme — languages, then the leak.

LeoStart with languages.

MayaSWE-bench Verified is Python, full stop. Aider's polyglot set deliberately spreads across six — C++, Go, Java, JavaScript, Python, Rust — because edit reliability in one language doesn't guarantee it in another.

LeoSo a Python-only score is silent about your Rust service.

MayaCompletely silent. Language coverage is part of the claim, and a single number erases it. Now the leak — contamination.

LeoThis is the data-in-training-data problem.

MayaYeah. A static benchmark that's been public for a while can drift into training sets, blog posts, prompt libraries, tuning loops. Then a high score might be partly memory, not skill. That's the motivation behind a live benchmark like SWE-bench-Live — keep pulling fresh, recent issues so the tasks haven't had time to leak.

LeoDoes fresh fully fix it?

MayaNo, and that's the honest caveat. Freshness lowers the risk, it doesn't zero it out. What you actually want exposed is task dates, whether solutions are public, and whether the submission was tuned specifically for the benchmark. My recap for both: languages tell you where the score is silent, and the leak asks whether it reflects skill or recall.
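Those three disclosures can be turned into a simple checklist. This is a sketch under assumptions: the field names, flag wording, and dates are invented for illustration, and an empty result only means no flag was raised, not that the task is clean.

```python
from datetime import date
from typing import List


def contamination_flags(
    task_created: date,
    solution_public: bool,
    tuned_on_benchmark: bool,
    training_cutoff: date,
) -> List[str]:
    """List the known leak risks for one task and one submission."""
    flags = []
    if task_created <= training_cutoff:
        flags.append("task predates training cutoff")
    if solution_public:
        flags.append("reference solution is public")
    if tuned_on_benchmark:
        flags.append("submission tuned on this benchmark")
    return flags


# Hypothetical dates, chosen only to show both outcomes.
cutoff = date(2025, 1, 1)
print(contamination_flags(date(2024, 1, 1), True, False, cutoff))
print(contamination_flags(date(2025, 6, 1), False, False, cutoff))
```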

MayaLast landmark — and it's the one I care about most. The things that never make it into the number at all.

LeoLike what?

MayaCost. Variance. The shape of the failures. A leaderboard can be completely accurate and still hide that the top system costs ten times more per task, or that its score swings wildly run to run, or that when it fails, it fails by quietly deleting tests to make things green.

Leo[chuckle] So the headline can be true and useless at the same time.

MayaThat's the uncomfortable trade-off. Leaderboards buy you comparability, but they pay for it with compression — everything interesting about *how* and *at what price* gets squeezed out of the ranking.
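A rough sketch of what a ranking compresses away, using invented numbers: two systems with the same average resolve rate but very different variance and cost. The run data is hypothetical, and the helper assumes at least two runs and at least one resolved task.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_runs(runs: List[Tuple[int, float]], n_tasks: int) -> Dict[str, float]:
    """Each run is (tasks_resolved, cost_in_usd). Report the headline rate
    alongside the two things the ranking hides: spread and price."""
    rates = [resolved / n_tasks for resolved, _ in runs]
    total_resolved = sum(resolved for resolved, _ in runs)
    total_cost = sum(cost for _, cost in runs)
    return {
        "mean_rate": mean(rates),
        "rate_stdev": stdev(rates),
        "cost_per_resolved": total_cost / total_resolved,
    }


# Hypothetical systems: same mean rate, different stability and cost.
steady = summarize_runs([(59, 100.0), (60, 100.0), (61, 100.0)], n_tasks=100)
swingy = summarize_runs([(45, 1000.0), (60, 1000.0), (75, 1000.0)], n_tasks=100)

print(steady)
print(swingy)
```

On a leaderboard sorted by mean rate these two tie, even though one swings far more from run to run and costs ten times as much per resolved task.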

Figure 5: The Pareto frontier of agent performance, showing the tradeoff between performance and cost (log scale) on Terminal-Bench 2.0. Source: 'Terminal-Bench 2.0: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces'

LeoAnd there's a real disagreement here, isn't there? About whether to even chase public leaderboards.

MayaThere is. One camp says lean on shared public benchmarks — comparability is precious, everyone runs the same exam, progress is legible across labs. Their strongest point is that without a common yardstick, every vendor just grades their own homework.

LeoAnd the other side?

MayaThe other camp says the moment a benchmark is public and valuable, people optimize *to it*: they tune for the test and leakage creeps in. So you should trust fresh, private evals on tasks that look like your own work. Their strongest point is that a number you can't contaminate is the only one you can fully trust. Honestly, mature teams do both: public for the shared map, private for the real decision.

MayaSo here's the compact hook for the whole thing. Never ask only "what scored highest?" Ask: what was the work, what room produced it, what judge graded it, in which languages, could it have leaked, and what did the number leave out?

LeoAnd for a training team, that bundle *is* the thing you store, not the score.

MayaExactly. Task, repo snapshot, the room and its tools, the trajectory, the judge's verdict, the language, the freshness, and a note on every known limitation. That's the replayable record — so someone else can reopen the run and trust the label without taking your word for it.
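One way such a record could look, as a sketch only: the `RunRecord` class, its field names, and every value below are assumptions made for illustration, not a schema from the episode.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RunRecord:
    """One replayable run: the score plus everything needed to reinterpret it."""
    task_id: str
    repo_snapshot: str            # e.g. a commit hash pinning the repo state
    tools: List[str]              # what was in the room
    trajectory: List[str]         # ordered steps the agent took
    verdict: bool                 # the judge's pass/fail
    language: str
    task_created: str             # ISO date, for freshness checks
    attempts: int = 1             # sampling budget behind the verdict
    known_limitations: List[str] = field(default_factory=list)


# Hypothetical example values.
record = RunRecord(
    task_id="order-validator-zero-qty",
    repo_snapshot="<commit hash>",
    tools=["file_search", "terminal", "test_runner"],
    trajectory=["read issue", "ran tests", "patched validator", "reran tests"],
    verdict=True,
    language="python",
    task_created="2026-01-15",
    known_limitations=["hidden tests do not cover bulk orders"],
)

# Round-trip through JSON so the record can be stored and reopened later.
serialized = json.dumps(asdict(record), indent=2)
restored = RunRecord(**json.loads(serialized))
print(serialized)
```

Keeping `attempts` and `known_limitations` next to `verdict` means the label never travels without its caveats.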

LeoWhich loops right back to the series promise — collect work traces, not just code.

MayaThat's it. A leaderboard starts the investigation. It should never end it.

LeoSo here's the one to sit with: the next time a coding agent posts an impressive score, which hidden setting — the room, the judge, the number of tries, or the freshness — would change your mind the most if you actually saw it?
