Transcript
Maya: Picture an agent that just fixed a billing bug. Green checkmark, all visible tests pass. Then a human reviewer opens the diff, reads four lines, and writes: this also silently disables refunds for one currency. Same patch: one instrument said ship it, the other said absolutely not.
Leo: Oh. So the score and the verdict point in opposite directions.
Maya: That gap is the whole topic. Evaluating a coding agent isn't reading one number off a leaderboard. It's asking several instruments what they each saw, and noticing when they disagree.
Leo: And last topic set this up, right? We framed agentic coding as software work: repos, tools, environments, feedback loops, trajectories.
Maya: Exactly. Topic 1 was: what is the work. Topic 2 is: how do we judge it without fooling ourselves. And the twist that catches people is that judging isn't only about the agent that *wrote* the patch. It's also about an agent that *reviews* a patch. Two different skills.
Leo: Okay, before we go deep, can we define evaluation in plain language? It gets overloaded.
Maya: Sure. Evaluation here means turning messy agent behavior into trustworthy evidence. Not a vibe, not a single pass-fail stamp, but evidence. Enough that someone who wasn't in the room can reopen the run and understand why the label is what it is.
Leo: I like "evidence." It sets the bar higher than "did it pass."
Maya: Right. And to keep us oriented while you're walking, the topic has four stations: a row of benchmark instruments, a wall of verification signals, a reviewer's desk, and a trust meter beside it.
Leo: Benchmark instruments, verification signals, the reviewer's desk, and the trust meter. Those are my landmarks if I lose the thread.
Maya: Let me plant one example we'll keep returning to. A small team ships a fix to an open-source data library. It has to do two things at once: pass a set of hidden tests they can't see in advance, and survive code review by a human who cares about the rest of the codebase.
Leo: Two gates. Hidden tests, and a human reviewer.
Maya: Two gates that can disagree. A patch can sail through the hidden tests and still get rejected in review for being unmaintainable. Or a reviewer loves it, but a hidden edge-case test fails. Hold that picture; we'll revisit it at every station.
Leo: Got it. The patch that has to please a machine and a human at the same time.
Maya: First station: the benchmark instruments. The key idea is that there is no single instrument that measures "coding agent." There are families, and each family is shaped to see a different kind of work.
Leo: Give me the shape of the families. Not a list I have to memorize, just the spread.
Maya: One bench for fixing reported bugs in real repositories: the SWE-bench lineage. One for edit reliability across many languages: the Aider polyglot style. One for long, messy command-line work: Terminal-Bench. One for building whole features, not just patches: FeatureBench. And one where agents compete to improve a codebase over rounds: CodeClash.
Leo: So bug-fixing, multi-language editing, terminal grind, feature building, and competitive improvement. Same agent could look like a genius on one and a mess on another.
Maya: Constantly. And the practitioner consequence: the family you pick sets the action space, the tools, the context, and the scoring function. Change the family, you change what counts as success.
Leo: Back to our data-library fix. Which family is that?
Maya: Closest to SWE-bench: real repo, real issue, hidden tests decide pass-fail. But that family, by construction, can't see whether a human would *accept* the change. It's shaped to measure correctness, not reviewability.
Leo: Which is exactly where the disagreement starts.
Maya: It is. And here's the first real expert split. One camp says: standardize on a few strong public leaderboards, SWE-bench Verified being the obvious one, because shared, comparable numbers are how a field makes honest progress. Without a common ruler, every lab grades its own homework.
Leo: Strong argument. Comparability is how you stop people cherry-picking.
Maya: That's the strongest case for leaderboards. The other camp says: the moment a benchmark is public and popular, it gets gamed and contaminated, because training data leaks the answers in. So they push for fresh evals, either private or continuously refreshed, like SWE-bench-Live, which keeps pulling in new real-world tasks to dodge contamination.
Leo: And their strongest point?
Maya: A number you can overfit to stops measuring capability and starts measuring memorization. Most serious teams land on: you need both. The public leaderboard for comparability, the private eval as the lie detector.
Leo: Quick recap before we move: the first question isn't "who's on top." It's "what work is this instrument even shaped to see, and could the agent have seen the answers already."
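One way to make the "lie detector" idea concrete is to compare the same agent's pass rate on a public set and on a fresh private set. The sketch below is a minimal illustration, not a method taken from any benchmark named here: `EvalResult`, `contamination_gap`, the pass counts, and the 0.15 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Pass counts for one agent on one task set (numbers are made up)."""
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def contamination_gap(public: EvalResult, private: EvalResult) -> float:
    """Public pass rate minus fresh-private pass rate.

    A large positive gap is a hint that the public number may be
    inflated by leaked answers. It is not proof.
    """
    return public.pass_rate - private.pass_rate


public = EvalResult("public leaderboard", passed=60, total=100)
private = EvalResult("fresh private set", passed=35, total=100)

gap = contamination_gap(public, private)

# Arbitrary illustrative cutoff, not a standard value.
GAP_THRESHOLD = 0.15
suspicious = gap > GAP_THRESHOLD

print(f"gap={gap:.2f} suspicious={suspicious}")
```

A gap like this is only a prompt to investigate. The two task sets may also differ in difficulty, so the number cannot establish contamination on its own.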
Maya: Second station: the wall of verification signals. Once you've picked an instrument, something has to actually decide pass or fail. That decider is not free, and it's not always trustworthy.
Leo: This is the answer-key question. Where do the labels come from.
Maya: Right. And there's a spread of signal sources, each with a different blind spot. Visible tests the agent can run. Hidden tests held back so it can't tune to them. Oracle checks, which compare against a known-correct reference. Human review. And LLM-based verifiers, where a model judges another model's work against criteria.
Leo: Ground it in the data-library fix. The hidden tests are the gate the team trusts most, right?
Maya: They trust them, but here's the trap. Tests are a label generator. If the hidden suite only checks the common case, a patch that breaks a rare currency passes anyway. Now you've got a confidently wrong label. And if you train on it, you reward the bad behavior.
Leo: Hmm. So a weak test isn't a small problem. It's a poisoned label factory.
Maya: That's the sharp version. Which is why there's a whole line of work attacking weak suites: strengthening tests, hunting the cases where a wrong patch still passes. Test quality *is* data quality. You can't separate them.
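The "confidently wrong label" can be shown with a toy case. This is a sketch under stated assumptions: the refund functions, the currency codes, and both suites are invented for illustration and do not come from any real library or benchmark.

```python
from typing import Callable, List, Tuple

# ((paid_cents, currency), expected_refund_cents)
Case = Tuple[Tuple[int, str], int]


def refund_correct(paid_cents: int, currency: str) -> int:
    """Reference behavior: a full refund in every currency."""
    return paid_cents


def refund_buggy(paid_cents: int, currency: str) -> int:
    """Hypothetical wrong patch: silently refunds nothing for one currency."""
    if currency == "JPY":
        return 0
    return paid_cents


def run_suite(fn: Callable[[int, str], int], suite: List[Case]) -> bool:
    """Return the pass/fail label the suite would assign to `fn`."""
    return all(fn(*args) == expected for args, expected in suite)


# Weak hidden suite: only the common currencies are checked.
weak_suite: List[Case] = [((1000, "USD"), 1000), ((250, "EUR"), 250)]

# Strengthened suite: one extra case for the rarely used currency.
strong_suite: List[Case] = weak_suite + [((5000, "JPY"), 5000)]

for name, fn in [("correct", refund_correct), ("buggy", refund_buggy)]:
    print(
        name,
        "| weak suite:", run_suite(fn, weak_suite),
        "| strong suite:", run_suite(fn, strong_suite),
    )
```

Under the weak suite both patches receive the same passing label, so anything trained on that label cannot tell them apart. The single added case is what separates them.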
Leo: Okay, and this is where the second big disagreement lives, I'm guessing: tests versus judgment.
Maya: Exactly the fault line. One camp says: executable tests are the only objective oracle. A test either passes or it doesn't. No opinions, no hallucinations, fully reproducible. Anything else is vibes dressed up as evaluation.
Leo: Hard to argue with reproducibility. That's the strongest case for tests-as-oracle.
Maya: It's a genuinely strong case. The other camp says: tests can't see most of what matters. Maintainability, intent match, whether the change is reviewable, whether it risks a regression two modules over that the suite never exercises. A pass-fail suite rarely captures any of that. For those you need human review or an LLM-style verifier that reasons about quality.
Leo: And their strongest point?
Maya: A patch that passes every test and is still un-mergeable is common. Real engineering rejects code for reasons tests never encode. Stop at the test gate and you're blind to half the job. Tools like LLM-as-a-Verifier exist precisely to score what tests can't, decomposing criteria and checking the trajectory, not just the final diff.
Leo: So the honest read is: tests give you a floor you can trust, but they're a floor, not a ceiling.
Maya: Beautifully put. Trust the floor, don't mistake it for the whole building.
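The "floor, not a ceiling" reading can be sketched as a two-stage verdict: the test gate decides first, and review criteria only matter once it passes. The names `PatchEvidence` and `verdict` and the rubric keys are hypothetical; in practice the rubric values would come from a human reviewer or a verifier model.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PatchEvidence:
    """What the two gates reported for one patch."""
    tests_passed: bool
    # Criterion name -> whether the reviewer or verifier judged it satisfied.
    rubric: Dict[str, bool] = field(default_factory=dict)


def verdict(evidence: PatchEvidence) -> str:
    """Tests are the floor; rubric criteria are checked only above it."""
    if not evidence.tests_passed:
        return "reject: fails tests"
    failed = sorted(name for name, ok in evidence.rubric.items() if not ok)
    if failed:
        return "needs review: " + ", ".join(failed)
    return "accept"


# Passes the hidden tests but is judged unmaintainable in review.
passes_tests_only = PatchEvidence(
    tests_passed=True,
    rubric={"matches_intent": True, "maintainable": False},
)

# Liked by the reviewer but fails a hidden edge-case test.
liked_but_failing = PatchEvidence(
    tests_passed=False,
    rubric={"matches_intent": True, "maintainable": True},
)

print(verdict(passes_tests_only))
print(verdict(liked_but_failing))
```

Keeping the failed criteria in the verdict string, rather than collapsing to a boolean, preserves the reason for the label so someone else can reopen it later.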
Maya: Third station, and this is the one people skip: the reviewer's desk. Here's the reframe: code review is not a side effect of writing code. It's its own capability, and it deserves its own evaluation.
Leo: Say more, because intuitively a good coder is a good reviewer.
Maya: Not the same skill. Writing a patch is: given a problem, produce a change. Reviewing is: given a change you didn't write, find what's wrong, localize the defect to an exact line, judge severity, and say something a developer would act on. Different muscle entirely.
Leo: Back to our example: this is the *human* gate on the data-library fix. The reviewer who said "this disables refunds."
Maya: Exactly. Now imagine an *agent* doing that job. To evaluate it, a benchmark can't just ask "did it find a bug, yes or no." It has to check: did it point at the right line, rank severity sensibly, and match what a real reviewer flagged on that pull request.
Leo: So the ground truth is real human review comments on real pull requests.
Maya: That's the move in this newer wave: CR-Bench, SWE-PRBench, CodeReviewBench, and others. They take real PRs, with the issues real reviewers caught, and ask whether the agent finds the same ones. CodeReviewQA even breaks it down: recognize the kind of change, localize it, identify the fix. Earlier work like CodeReviewer laid the groundwork by pretraining on review activity itself.
Leo: And there's a production angle too, I think.
Maya: There is. Work like MetaMateCR shows it at industrial scale: real comments, fixes applied, whether developers accepted them. That's the gap between a lab benchmark and "does this actually help the engineer on Tuesday."
Leo: Recap for me: the reviewer's desk is a separate exam, graded against what humans actually flagged, not against whether tests pass.
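Grading a review agent against human-flagged issues can be sketched as a matching problem. The version below assumes a deliberately simple rule: same file, and a line number within a small tolerance. The benchmarks named above may well match comments differently, for example by meaning rather than position. `Comment`, `score_review`, the file paths, and the tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Comment:
    path: str
    line: int
    severity: str  # "low", "medium", or "high"


def is_match(agent: Comment, human: Comment, line_tolerance: int) -> bool:
    """Same file, and the agent pointed within a few lines of the human."""
    return agent.path == human.path and abs(agent.line - human.line) <= line_tolerance


def score_review(
    agent_comments: List[Comment],
    human_comments: List[Comment],
    line_tolerance: int = 2,
) -> Dict[str, int]:
    """Greedily pair each agent comment with at most one human comment."""
    used: Set[int] = set()
    localized = 0
    severity_agree = 0
    for agent in agent_comments:
        for index, human in enumerate(human_comments):
            if index in used or not is_match(agent, human, line_tolerance):
                continue
            used.add(index)
            localized += 1
            if agent.severity == human.severity:
                severity_agree += 1
            break
    return {
        "localized": localized,
        "human_total": len(human_comments),
        "severity_agree": severity_agree,
        "unmatched_agent": len(agent_comments) - localized,
    }


human = [Comment("src/refunds.py", 42, "high"), Comment("src/tax.py", 10, "low")]
agent = [Comment("src/refunds.py", 43, "high"), Comment("src/utils.py", 5, "low")]

result = score_review(agent, human)
print(result)
```

The `unmatched_agent` count is where speculative comments show up, so it is worth reporting next to the hits instead of discarding it.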
Maya: Last station, sitting right next to the desk: the trust meter. Because here's the failure mode: a review agent can be technically correct and still useless.
Leo: Wait, how does correct become useless?
Maya: Noise. Say the review agent on our data-library fix posts twelve comments. One catches the refund bug. The other eleven are speculative nitpicks. What does the human do?
Leo: Stops reading by comment four. And probably misses the one that mattered.
Maya: And that's the point. Bug recall, meaning how many real defects you catch, is only half the story. The other half is precision and signal-to-noise. A reviewer that's right once and noisy eleven times can be worse than one that says less but is always worth reading.
Leo: So this is the last disagreement: recall versus precision in review.
Maya: A real one. The recall camp says: a missed defect can ship a security hole or corrupt data. Better to over-flag than let a real bug through. Surfacing a true problem is worth some noise.
Leo: Strong. A missed critical bug is unrecoverable in a way an annoying comment isn't.
Maya: That's their best argument. The precision-and-trust camp says: a reviewer nobody trusts catches nothing, because nobody reads it. Trust is the actual product. Lose it to noise and your paper recall never reaches a human who acts.
Leo: And their strongest point is basically that an alarm that cries wolf gets unplugged.
Maya: Precisely. So the better benchmarks measure both, plus actionability and hallucinated issues, meaning flagged problems that aren't real. Because the trust meter, not the raw catch count, predicts whether the tool survives in a real team.
Leo: My recap: correctness gets you on the desk. Trust is what keeps you employed there.
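The twelve-comment example works out as simple arithmetic. The sketch below assumes the patch contains exactly one real defect, which the conversation implies but does not state. `review_metrics` and the quieter comparison reviewer are hypothetical.

```python
from typing import Dict


def review_metrics(true_hits: int, total_comments: int, real_defects: int) -> Dict[str, float]:
    """Precision, recall, and signal-to-noise for one review."""
    noise = total_comments - true_hits
    return {
        # Share of posted comments that flagged a real defect.
        "precision": true_hits / total_comments if total_comments else 0.0,
        # Share of real defects that were flagged.
        "recall": true_hits / real_defects if real_defects else 0.0,
        # Useful comments per noisy comment.
        "signal_to_noise": true_hits / noise if noise else float("inf"),
    }


# Twelve comments, one of which catches the single real defect.
noisy = review_metrics(true_hits=1, total_comments=12, real_defects=1)

# A quieter hypothetical reviewer: two comments, one real catch.
quiet = review_metrics(true_hits=1, total_comments=2, real_defects=1)

for name, metrics in [("noisy", noisy), ("quiet", quiet)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Both reviewers have perfect recall here, so recall alone cannot distinguish them. Precision falls from 0.5 to about 0.083 for the noisy one, which is the difference a human reader actually experiences.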
Maya: Before we close, one honest trade-off. Everything richer than a test (human review labels, verifier judgments, severity ratings) is more expensive and more subjective. Two senior engineers will disagree on whether a comment was "useful."
Leo: So the signals that catch what tests miss are the hardest to standardize.
Maya: That's the tension the topic lives in. Cheap, objective, narrow on one side; rich, subjective, expensive on the other. No free lunch, only a portfolio.
Maya: Let's lock a few terms before the deep dives, quickly.
Leo: Hidden tests means a test suite kept secret from the agent so it can't tune its patch to pass them.
Maya: Oracle check means comparing the agent's output against a known-correct reference answer.
Leo: Verifier means a model or system that judges whether work meets criteria, beyond simple pass-fail tests.
Maya: Trajectory means the recorded path of the run (searches, file reads, commands, edits, retries), not just the final diff.
Leo: Defect localization means pointing to the exact line where a bug lives, not just saying a bug exists somewhere.
Maya: Signal-to-noise means the ratio of useful review comments to speculative or wrong ones.
Leo: And contamination means benchmark answers leaking into training data, so a high score reflects memorization instead of skill.
Maya: That's the working vocabulary for the rest of Topic 2.
Maya: So here's the map into the deep dives. The benchmark instruments: the families and what each is shaped to see. The verification signals: visible, hidden, oracle, human, verifier, with their blind spots. The reviewer's desk: review as its own graded capability. And the trust meter: why precision and signal-to-noise decide whether a reviewer is real.
Leo: And the show notes carry the actual instruments: patch benchmarks like SWE-bench Verified and Terminal-Bench, review benchmarks like CR-Bench and SWE-PRBench. Each one a different lens on the same question.
Maya: Which is the question to carry: never ask only "what scored highest." Ask what work was measured, what signal judged it, whether that signal was trustworthy, and what could still slip through. That holds for our data-library fix and for every leaderboard you'll read after this.
Leo: Here's the one I'll leave people with. The next time you see a coding agent beat a benchmark, would you trust it more if it caught the bugs in someone else's patch, or if it just passed the tests on its own?
Source material
- SWE-bench Verified
- SWE-bench-Live
- Aider Polyglot Benchmark and Leaderboards
- Terminal-Bench 2.0
- FeatureBench: Benchmarking Agentic Coding for Complex Feature-Oriented Development
- CodeClash: Benchmarking Goal-Oriented Software Engineering
- LLM-as-a-Verifier
- CodeReviewer: Pre-Training for Automating Code Review Activities
- CodeReviewQA
- CodeFuse-CR-Bench
- CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents
- SWE-PRBench
- CodeReviewBench
- Code Review Bench by Martian
- MetaMateCR: AI-Assisted Fixes to Code Review Comments at Scale
- UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench