William Liu · Podcasts

T2E7 · 00:11:56

SWE-bench Verified

SWE-bench Verified is the human-cleaned subset of SWE-bench that became the default coding-agent score. This episode explains why the original benchmark's auto-harvested tasks were noisy — underspecified issues, over-strict hidden tests, broken environments — how human engineers (with OpenAI) validated 500 tasks to fix that, how the hidden fix-flip and guard tests decide "percent resolved," and what a clean number still cannot tell you about other languages, feature work, review quality, or contamination.

Transcript

Maya: Picture a grading room. A coding agent just submitted a patch for a real bug. The grader reaches for the answer key, and the key says: this issue is impossible to fix from the description given. The agent never had a chance, but the scoreboard still wrote down "failed."

Leo: Oof. So the agent gets blamed for the test being broken.

Maya: Right. And that exact problem, the answer key itself being wrong, is the thing today's source was built to fix. Before any agent gets graded, a human walks through the room and pulls every broken answer key off the wall.

Leo: Okay, so last time we were talking about visible tests versus hidden tests versus a human or model verifier: the different ways you can judge a patch.

Maya: That's the thread. Today we narrow from "how do you judge" to one specific instrument that a huge chunk of the field judges *with*. It's called SWE-bench Verified.

Leo: And "Verified" is doing a lot of work in that name, I'm guessing.

Maya: It is. But let's not start with the fix. Let's start with the original, because Verified only makes sense as a repair job.

Leo: Plain version first: what is plain SWE-bench?

Maya: SWE-bench is a benchmark that asks one blunt question: can a language model resolve a real GitHub issue? Not a toy puzzle, not a LeetCode prompt. An actual bug report, filed against an actual open-source Python project, with an actual repository sitting underneath it.

Leo: So the input isn't "write a function that does X." It's "here's a messy codebase and a complaint, go."

Maya: Exactly. The model gets the issue text and the repo at the moment just before someone fixed it. Its job is to produce a patch, a diff, that makes the project behave correctly.

Leo: And how do they know it's correct? Someone has to decide.

Maya: This is the clever part, and it's worth slowing down on. Each task comes from a real pull request that *already fixed* the issue in the project's history. That pull request came with tests.

Leo: Ah. So the tests already exist. They're not invented for the benchmark.

Maya: They're harvested. And they split into two roles. The first role is the catch test: a test that fails on the broken code and passes once the bug is fixed. If your patch is real, that test should flip from red to green.

Leo: So that's the "did you actually fix it" signal.

Maya: Right. Call it the fix-flip. And the second role is the guard tests: a set of tests that were already passing and must *stay* passing. They're there to catch you if you fixed the bug by smashing something else.

Leo: Hmm. So the guard tests are basically "did you break the rest of the house while patching the one window."

Maya: Perfect way to say it. A patch only counts as resolved if the fix-flip goes green *and* the guard tests stay green. Both. And here's the detail that matters most for agents: the model never sees these tests.

Leo: Wait, the tests are hidden?

Maya: Hidden. The agent gets the issue and the code, writes its patch blind, and only afterward does the harness run the held-out tests against it. So you can't game the grader by writing code that hugs the test cases, because you don't know what they are.
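
The resolved rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the SWE-bench harness: the function names and the dict-of-outcomes shape are assumptions made for the example. SWE-bench's own labels for the two test groups are `FAIL_TO_PASS` (the fix-flip) and `PASS_TO_PASS` (the guard tests).

```python
def is_resolved(outcomes, fix_flip_tests, guard_tests):
    """Return True only if every fix-flip and every guard test passed.

    outcomes maps a test name to True (passed) or False (failed), as a
    hypothetical harness might report after applying the agent's patch.
    """
    # A test missing from the report is treated as a failure.
    fixed = all(outcomes.get(name, False) for name in fix_flip_tests)
    unbroken = all(outcomes.get(name, False) for name in guard_tests)
    return fixed and unbroken


def percent_resolved(resolved_flags):
    """Share of tasks resolved, as a percentage."""
    return 100.0 * sum(resolved_flags) / len(resolved_flags)


fix_flip = ["test_bug"]
guard = ["test_old_a", "test_old_b"]

# One patch fixes the bug cleanly; the other fixes it but breaks a guard test.
good_patch = {"test_bug": True, "test_old_a": True, "test_old_b": True}
collateral = {"test_bug": True, "test_old_a": True, "test_old_b": False}

flags = [
    is_resolved(good_patch, fix_flip, guard),
    is_resolved(collateral, fix_flip, guard),
]
print(flags)                    # [True, False]
print(percent_resolved(flags))  # 50.0
```

The second patch turns the fix-flip green and still counts as unresolved, which is the "both, not either" condition in practice.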

Leo: Okay, that's a genuinely strong design. Real issues, real repos, hidden tests, a clean pass-fail. So... what went wrong?

Maya: Scale exposed cracks. The original set was a couple thousand of these tasks, pulled automatically from a dozen popular Python projects. And when you harvest a couple thousand things automatically, you harvest a lot of garbage along with the gold.

Leo: Garbage how? Give me the failure shapes.

Maya: Let me hand you three, because they're the whole reason Verified exists. The first is the underspecified issue. The bug report says something vague, like "it crashes sometimes," and the real fix depended on context that lived in a side conversation the model never gets.

Leo: So no human could solve that from the prompt alone, let alone an agent.

Maya: Correct. The second shape is the over-strict test. The hidden test doesn't just check that the bug is fixed; it checks that you fixed it the *exact* way the original developer did. Same variable names, same error message, same formatting.

Leo: Oh, that's nasty. So a perfectly correct patch fails because it didn't match one specific person's style.

Maya: That's the trap. A right answer, marked wrong. And the third shape is the flaky or broken environment: a test that fails for reasons that have nothing to do with the patch. A missing dependency, a timing issue, a test that was already failing before anyone touched it.

Leo: So between those three, the scoreboard is lying. Some agents look worse than they are, and the whole number gets noisy.

Maya: And noise is poison for a benchmark, because the entire point is to compare systems. If a chunk of your tasks is unfair or unsolvable, you can't tell whether a higher score means a better agent or just a luckier draw of tasks.
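
The third failure shape, a test that fails for reasons unrelated to the patch, is the one most amenable to an automated check. A rough sketch, assuming you can run the test suite once on the broken code and once with the reference fix applied (the function names and input shape here are hypothetical, not taken from the SWE-bench tooling):

```python
def classify_test(before_passed, after_passed):
    """Label one test by its outcome before and after the reference fix."""
    if not before_passed and after_passed:
        return "fix-flip"   # fails on the broken code, passes once fixed
    if before_passed and after_passed:
        return "guard"      # already passing and still passing
    return "unusable"       # fails even with the reference fix applied


def unusable_tests(before, after):
    """List tests that cannot fairly grade any patch in this environment."""
    return sorted(
        name
        for name in before
        if classify_test(before[name], after.get(name, False)) == "unusable"
    )


# Outcomes on the broken code, then with the reference fix applied.
before = {"test_bug": False, "test_ok": True, "test_dep": False}
after = {"test_bug": True, "test_ok": True, "test_dep": False}

print(unusable_tests(before, after))  # ['test_dep']
```

A check like this catches tests that were already broken, but it cannot judge whether an issue is underspecified or a test is over-strict, since those two shapes depend on a reader's judgment.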

Leo: Okay. So that's the disease. What's the treatment?

Maya: Put a human in the loop before the agents arrive. That's SWE-bench Verified. It's a subset, five hundred tasks pulled from the original couple thousand, where human software engineers read every single one.

Leo: And this was done with OpenAI, right? I remember the collaboration.

Maya: It was a collaboration with OpenAI, yes. And I want to be careful here, because the *how* matters more than the *who*. The annotators weren't solving the tasks. They were checking the answer keys.

Leo: So what's the actual checklist a human runs down?

Maya: Three questions, mostly. Is the issue clearly specified enough that a competent engineer could understand what's being asked? Are the hidden tests fair: do they accept reasonable correct solutions instead of demanding one exact phrasing? And is the task actually solvable in the environment as given?

Leo: And if a task fails any of those?

Maya: It gets pulled, or flagged. The result is a smaller set, but a *trustworthy* set. The number that comes out the other side, percent resolved, now means something closer to "percent of fair, solvable, real-world bugs this agent actually fixed."
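
The three-question checklist can be pictured as a filter that also records why each task was pulled. The sketch below is a simplification under stated assumptions: the `Review` class, its boolean fields, and the task IDs are invented for illustration, and the real annotation process may have used a finer-grained scale than yes or no.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """One annotator verdict on one task (hypothetical structure)."""
    task_id: str
    clearly_specified: bool
    tests_fair: bool
    solvable_in_env: bool


def failed_checks(review):
    """Return the names of the checklist questions this task failed."""
    checks = {
        "clearly_specified": review.clearly_specified,
        "tests_fair": review.tests_fair,
        "solvable_in_env": review.solvable_in_env,
    }
    return [name for name, ok in checks.items() if not ok]


def filter_tasks(reviews):
    """Split tasks into kept IDs and pulled IDs with the reasons recorded."""
    kept, pulled = [], {}
    for review in reviews:
        reasons = failed_checks(review)
        if reasons:
            pulled[review.task_id] = reasons
        else:
            kept.append(review.task_id)
    return kept, pulled


reviews = [
    Review("task-1", True, True, True),
    Review("task-2", False, True, True),   # vague issue text
    Review("task-3", True, False, False),  # over-strict tests, broken env
]
kept, pulled = filter_tasks(reviews)
print(kept)    # ['task-1']
print(pulled)  # {'task-2': ['clearly_specified'], 'task-3': ['tests_fair', 'solvable_in_env']}
```

Keeping the `pulled` reasons alongside the kept list is the design choice that matters: it documents what the filtering removed, not just what survived.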

Leo: Let me make sure I've got the move. The original gave you scale but smuggled in unfair tasks. Verified trades some scale for a human-cleaned answer key, so the score is a cleaner signal.

Maya: That's the whole episode in two sentences. And it's why, when you see a leaderboard today, it almost always says "Verified" next to the number. The plain version became the noisy version, and Verified became the one people quote.

Leo: Which is interesting, because the cleaning is human labor. You can't automate "is this issue fairly described."

Maya: You can't, and that's both its strength and a cost. Five hundred hand-checked tasks is expensive to build and doesn't refresh itself. Which brings us to the part I always want listeners to hold onto: the limits. A clean number is still a narrow number.

Leo: Go on. What does a percent-resolved on Verified *not* tell me?

Maya: Start with what it's made of. These are Python projects. Open-source Python projects, specifically. So a strong score is evidence about Python bug-fixing, and it says almost nothing about your Go service, your Rust kernel module, or your front-end in some framework nobody benchmarked.

Leo: Right, the English-and-Python skew. The issues are written in English, the code is Python, and the world is bigger than that.

Maya: Much bigger. Second limit, and this is the one that haunts every benchmark: contamination. These are public GitHub issues with public fixes. If a model was trained on a scrape of GitHub, it may have *seen* the actual patch during training.

Leo: So it's not solving the bug. It's remembering the solution.

Maya: And you often can't tell the difference from the outside. A high score could be reasoning, could be recall, could be a blend. That's not a knock on the benchmark; it's a permanent tension when your tasks come from the public internet.

Leo: And here's the one I keep coming back to as a builder. Even a perfectly clean, uncontaminated score is a score on *issue-shaped work*. Issue comes in, patch goes out, hidden tests judge it.

Maya: That's the deepest limit, and you named it. Verified measures one shape of engineering: localized bug fixing with a known-good test oracle. It does not measure designing a feature from scratch. It does not measure a long terminal session. It does not measure whether the patch is *readable*, or whether a human reviewer would actually merge it.

Leo: So a team that treats Verified as "the coding agent score, full stop" is making the exact mistake the benchmark's own authors were trying to prevent.

Maya: They cleaned the instrument precisely so people would trust the reading. The irony would be trusting it for measurements it never claimed to take. A thermometer that's beautifully calibrated still can't tell you the weight of the room.

Leo: [chuckle] Okay, that one's going in my notes.

Maya: And there's a connection back to our series spine here. When a training team actually uses Verified, the valuable artifact isn't the final percentage. It's everything underneath each task: the issue, the repo snapshot, the hidden tests, the agent's trajectory, the pass-fail labels, and crucially, the note that says "this is Python, this is issue-shaped, this might be contaminated."

Leo: Collect the work trace, not just the score.

Maya: That's the habit. The score is a headline. The replayable task underneath is the actual data product. Verified's real gift wasn't the number; it was proving that a human pass over your answer keys changes what the number is even allowed to mean.
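
One way to picture the "work trace" is as a single record per task that carries the caveats alongside the result. The sketch below is illustrative only: the `TaskTrace` class, its field names, and the sample values are assumptions, not a format used by SWE-bench or any particular team.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TaskTrace:
    """A replayable record of one graded task (hypothetical schema)."""
    task_id: str
    issue_text: str
    repo_snapshot: str   # identifier for the repo state just before the fix
    hidden_tests: list   # names of the held-out tests used for grading
    trajectory: list     # ordered steps the agent took
    resolved: bool       # the pass-fail label
    caveats: list = field(default_factory=list)


trace = TaskTrace(
    task_id="example-001",
    issue_text="Parser crashes on empty input.",
    repo_snapshot="pre-fix-commit",
    hidden_tests=["test_empty_input", "test_existing_behaviour"],
    trajectory=["read issue", "open parser.py", "write patch"],
    resolved=False,
    caveats=["Python only", "issue-shaped", "possibly contaminated"],
)

# One JSON line per task keeps the score and its context together.
line = json.dumps(asdict(trace))
print(json.loads(line)["caveats"])
```

Storing the caveats in the same record as the label means anyone who later aggregates these lines into a percentage has the limits in front of them.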

Leo: So the lesson generalizes past this one benchmark. Any time you're handed an evaluation set, the first question isn't "what's the score," it's "who checked the answer key, and against what."

Maya: And whether they wrote down what they pulled out, and why. A benchmark that hides its filtering is just a different kind of noisy.

Leo: Here's a question to sit with, then. If you had a budget to hand-validate just one slice of the evaluation set your own team relies on (the task descriptions, the tests, or the environment setup), which slice would buy you the most trustworthy number?
