T2E7 · May 30, 2026 · 00:10:46

T2E7 · SWE-bench Verified

SWE-bench Verified is the human-cleaned subset of SWE-bench that became the default coding-agent score. This episode explains why the original benchmark's auto-harvested tasks were noisy — underspecified issues, over-strict hidden tests, broken environments — how human engineers (with OpenAI) validated 500 tasks to fix that, how the hidden fix-flip and guard tests decide "percent resolved," and what a clean number still cannot tell you about other languages, feature work, review quality, or contamination.

Transcript

Maya: A coding agent has just submitted a patch for a real bug. The grader reaches for the answer key, and the key itself is wrong: the issue was impossible to fix from the description given. The agent never had a chance, but the scoreboard still wrote down "failed."

Leo: Oof. So the agent gets blamed for the test being broken.

Maya: Right. And that exact problem, the answer key itself being wrong, is the thing today's source was built to fix. Before any agent gets graded, a human goes through and pulls every broken answer key out of the set.

Leo: So this is the answer-key problem coming back to bite. We'd put visible tests, hidden tests, and a human or model verifier on that ladder of ways to judge a patch, and now the ladder itself is what's broken.

Maya: That's the thread. Today we narrow from "how do you judge" to one specific instrument that a huge chunk of the field judges *with*. It's called SWE-bench Verified.

Leo: And "Verified" is doing a lot of work in that name, I'm guessing.

Maya: It is. But let's not start with the fix. Let's start with the original, because Verified only makes sense as a repair job.

Leo: So what is plain SWE-bench, then?

Maya: SWE-bench is a benchmark that asks one blunt question: can a language model resolve a real GitHub issue? Not a toy puzzle, not a LeetCode prompt. An actual bug report, filed against an actual open-source Python project, with an actual repository sitting underneath it.

Leo: So the input isn't "write a function that does X." It's "here's a messy codebase and a complaint, go."

Maya: Exactly. The model gets the issue text and the repo at the moment just before someone fixed it. Its job is to produce a patch, a diff, that makes the project behave correctly.

Figure 1: SWE-bench sources task instances from real-world Python repositories by connecting GitHub issues to merged pull request solutions that resolve related tests. Provided with the issue text and [...]
Source: 'SWE-bench: Can Language Models Resolve Real-World GitHub Issues?'

Leo: And how do they know it's correct? Someone has to decide.

Maya: It's the same hidden fail-to-pass gate we built up back on the harness episodes: the fix-flip test that has to go red-to-green, plus the guard tests that have to stay green. Both are harvested from the real pull request, and the model never sees any of them.

Leo: Right, the held-out oracle. You write the patch blind, the harness runs the tests afterward, so you can't hug test cases you can't see.

Maya: Exactly that, lifted straight onto real GitHub issues. So the design is genuinely strong on paper.
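The gate described above can be written down as a small check. This is a minimal sketch, not the benchmark's actual harness: the class name `HiddenTests`, the function `is_resolved`, and the test names are hypothetical, chosen to mirror the "fix-flip" and "guard" vocabulary from the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HiddenTests:
    # Fix-flip tests: failing before the patch, must pass after it.
    fail_to_pass: frozenset[str]
    # Guard tests: passing before the patch, must still pass after it.
    pass_to_pass: frozenset[str]


def is_resolved(tests: HiddenTests, passed_after_patch: set[str]) -> bool:
    """A task counts as resolved only if every fix-flip test and every
    guard test is in the set of tests that passed after the patch."""
    return (
        tests.fail_to_pass <= passed_after_patch
        and tests.pass_to_pass <= passed_after_patch
    )


# Hypothetical test names, for illustration only.
tests = HiddenTests(
    fail_to_pass=frozenset({"test_parse_empty_header"}),
    pass_to_pass=frozenset({"test_parse_basic", "test_roundtrip"}),
)

fixed_and_safe = {"test_parse_empty_header", "test_parse_basic", "test_roundtrip"}
fixed_but_broke_guard = {"test_parse_empty_header", "test_parse_basic"}
```

Under these assumptions, `is_resolved(tests, fixed_and_safe)` is true, while a patch that fixes the bug but breaks a guard test, as in `fixed_but_broke_guard`, is not counted as resolved.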

Leo: Real issues, real repos, hidden tests, a clean pass-fail. So... what went wrong?

Maya: Scale exposed cracks. The original set was a couple thousand of these tasks, pulled automatically from a dozen popular Python projects. And when you harvest a couple thousand things automatically, you harvest a lot of garbage along with the gold.

Leo: Garbage how? Give me the failure shapes.

Maya: Let me hand you three, because they're the whole reason Verified exists. The first is the underspecified issue. The bug report says something vague, like "it crashes sometimes," and the real fix depended on context that lived in a side conversation the model never gets.

Leo: So no human could solve that from the prompt alone, let alone an agent.

Maya: Correct. The second shape is the over-strict test. The hidden test doesn't just check that the bug is fixed. It checks that you fixed it the *exact* way the original developer did. Same variable names, same error message, same formatting.

Leo: Oh, that's nasty. So a perfectly correct patch fails because it didn't match one specific person's style.

Maya: That's the trap. A right answer, marked wrong. And the third shape is the flaky or broken environment: a test that fails for reasons that have nothing to do with the patch. A missing dependency, a timing issue, a test that was already failing before anyone touched it.

Leo: So between those three, the scoreboard is lying. Some agents look worse than they are, and the whole number gets noisy.

Maya: And noise is poison for a benchmark, because the entire point is to compare systems. If a chunk of your tasks are unfair or unsolvable, you can't tell whether a higher score means a better agent or just a luckier draw of tasks.

Leo: Okay. So that's the disease. What's the treatment?

Maya: Put a human in the loop before the agents arrive. That's SWE-bench Verified. It's a subset, five hundred tasks pulled from the original couple thousand, where human software engineers read every single one.

Leo: And this was done with OpenAI, right? I remember the collaboration.

Maya: It was a collaboration with OpenAI, yes. And I want to be careful here, because the *how* matters more than the *who*. The annotators weren't solving the tasks. They were checking the answer keys.

Leo: So what's the actual checklist a human runs down?

Maya: Three questions, mostly. Is the issue clearly specified enough that a competent engineer could understand what's being asked? Are the hidden tests fair, meaning do they accept reasonable correct solutions instead of demanding one exact phrasing? And is the task actually solvable in the environment as given?

Figure 6: We show an example of a formatted task instance, a model prediction, and the testing framework logs. In the patches, red highlights are deletions. Green highlights are additions.
Source: 'SWE-bench: Can Language Models Resolve Real-World GitHub Issues?'

Leo: And if a task fails any of those?

Maya: It gets pulled, or flagged. The result is a smaller set, but a *trustworthy* set. The number that comes out the other side, percent resolved, now means something closer to "percent of fair, solvable, real-world bugs this agent actually fixed."
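The three-question checklist and the resulting score can be sketched as a filter followed by a ratio. This is an illustration under simplifying assumptions: the `Review` record, its three boolean fields, and the task ids are hypothetical, and a real annotation process may well use graded labels instead of plain yes/no answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    task_id: str
    clearly_specified: bool  # could a competent engineer understand the ask?
    tests_fair: bool         # do the hidden tests accept reasonable fixes?
    solvable_in_env: bool    # does the task work in the environment as given?


def keep(review: Review) -> bool:
    """A task stays in the cleaned set only if it passes all three checks."""
    return review.clearly_specified and review.tests_fair and review.solvable_in_env


def percent_resolved(kept_ids, resolved_ids) -> float:
    """Share of kept tasks that the agent resolved, as a percentage."""
    kept = set(kept_ids)
    if not kept:
        return 0.0
    return 100.0 * len(kept & set(resolved_ids)) / len(kept)


# Hypothetical reviews: t2 is underspecified, t3 has an over-strict test.
reviews = [
    Review("t1", True, True, True),
    Review("t2", False, True, True),
    Review("t3", True, False, True),
    Review("t4", True, True, True),
]
kept_ids = [r.task_id for r in reviews if keep(r)]

# The agent resolved t1 and t2, but t2 was pulled, so only t1 counts.
score = percent_resolved(kept_ids, {"t1", "t2"})  # 50.0
```

The point of the sketch is the denominator: the score is computed only over tasks that survived the human pass, which is what lets the number mean "percent of fair, solvable tasks resolved."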

Leo: Let me make sure I've got the move. The original gave you scale but smuggled in unfair tasks. Verified trades some scale for a human-cleaned answer key, so the score is a cleaner signal.

Maya: That's the whole episode in two sentences. And it's why, when you see a leaderboard today, it almost always says "Verified" next to the number. The plain version became the noisy version, and Verified became the one people quote.

Leo: Which is interesting, because the cleaning is human labor. You can't automate "is this issue fairly described."

Maya: You can't, and that's both its strength and a cost. Five hundred hand-checked tasks is expensive to build and doesn't refresh itself. Which brings us to the part I always want listeners to hold onto: the limits. A clean number is still a narrow number.

Leo: Go on. What does a percent-resolved on Verified *not* tell me?

Maya: Start with what it's made of. These are Python projects. Open-source Python projects, specifically. So a strong score is evidence about Python bug-fixing, and it says almost nothing about your Go service, your Rust kernel module, or your front-end in some framework nobody benchmarked.

Leo: Right, the English-and-Python skew. The issues are written in English, the code is Python, and the world is bigger than that.

Maya: Much bigger. The second limit, and this is the one that haunts every benchmark, is contamination. These are public GitHub issues with public fixes. If a model was trained on a scrape of GitHub, it may have *seen* the actual patch during training.

Leo: So it's not solving the bug. It's remembering the solution.

Maya: And you often can't tell the difference from the outside. A high score could be reasoning, could be recall, could be a blend. That's not a knock on the benchmark. It's a permanent tension when your tasks come from the public internet.

Leo: And here's the one I keep coming back to as a builder. Even a perfectly clean, uncontaminated score is a score on *issue-shaped work*. Issue comes in, patch goes out, hidden tests judge it.

Maya: That's the deepest limit, and you named it. Verified measures one shape of engineering: localized bug fixing with a known-good test oracle. It does not measure designing a feature from scratch. It does not measure a long terminal session. It does not measure whether the patch is *readable*, or whether a human reviewer would actually merge it.

Leo: So a team that treats Verified as "the coding agent score, full stop" is making the exact mistake the benchmark's own authors were trying to prevent.

Maya: They cleaned the instrument precisely so people would trust the reading. The irony would be trusting it for measurements it never claimed to take. A thermometer that's beautifully calibrated still can't tell you the weight of the room.

Leo: [chuckle] Okay, that one's going in my notes.

Maya: And there's a connection back to our series spine here. When a training team actually uses Verified, the valuable artifact isn't the final percentage. It's everything underneath each task: the issue, the repo snapshot, the hidden tests, the agent's trajectory, the pass-fail labels, and crucially, the note that says "this is Python, this is issue-shaped, this might be contaminated."

Leo: Collect the work trace, not just the score.

Maya: That's the habit. The score is a headline. The replayable task underneath is the actual data product. Verified's real gift wasn't the number. It was proving that a human pass over your answer keys changes what the number is even allowed to mean.
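One way to picture "collect the work trace, not just the score" is a per-task record that keeps the pieces listed above together. This is a sketch only: the `TaskRecord` type, its field names, and all sample values are hypothetical and do not describe any real dataset's schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TaskRecord:
    task_id: str
    issue_text: str
    repo_snapshot: str       # e.g. the commit the agent started from
    hidden_tests: list[str]  # tests the agent never saw
    trajectory: list[str]    # the agent's steps, in order
    resolved: bool           # pass-fail label from the hidden tests
    caveats: list[str] = field(default_factory=list)


# Hypothetical values, for illustration only.
record = TaskRecord(
    task_id="example-project-0001",
    issue_text="Parser crashes on an empty header.",
    repo_snapshot="commit-before-fix",
    hidden_tests=["test_parse_empty_header", "test_parse_basic"],
    trajectory=["read issue", "open parser module", "edit", "submit patch"],
    resolved=True,
    caveats=["Python only", "issue-shaped", "possibly contaminated"],
)

# One JSON line per task keeps the record replayable and easy to audit.
line = json.dumps(asdict(record))
restored = TaskRecord(**json.loads(line))
```

Keeping the caveats inside the record, next to the label, means the limits travel with the data instead of living in someone's memory of the leaderboard.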

Leo: So the lesson generalizes past this one benchmark. Any time you're handed an evaluation set, the first question isn't "what's the score," it's "who checked the answer key, and against what."

Maya: And whether they wrote down what they pulled out, and why. A benchmark that hides its filtering is just a different kind of noisy.

Leo: Here's a question to sit with, then. If you had a budget to hand-validate just one slice of the evaluation set your own team relies on (the task descriptions, the tests, or the environment setup), which slice would buy you the most trustworthy number?

