William Liu · Podcasts
2D editorial illustration: a reviewer's loupe over a code-diff panel with a small paper receipt curling out beneath it; a cyan trace ribbon runs from the diff through a hidden-test gate and a short winding trajectory path to a needle trust meter, with a small rack of measuring dials for benchmark families. No text, no 3D.

T2E1 · 00:11:30

How to Evaluate a Coding Agent in Practice

The practical on-ramp to Topic 2: what you actually do when a coding agent lands a real change and someone needs a verdict. Maya and Leo walk the evaluation loop in four moves — pick the instrument that fits the work (bug-fix vs. feature vs. terminal), record what signal judged it (visible tests, hidden tests, oracle, human review), read the trajectory when the path carries signal the diff erased, and judge both the change and the engineering process. Built around one running example: a data-library fix that passes every hidden test but silently breaks refunds for one currency.

Transcript

MayaHere's a moment I want you to sit with. A coding agent finishes a task. It prints one line: resolved, sixty-two percent. Your boss asks, "Is it good?" And the honest answer is — that number alone can't tell you. Because two agents can both print sixty-two and have done completely different work to get there.

LeoHmm. So the number's real, it's just... not the whole story.

MayaIt's a receipt, not the meal. Same total at the bottom, wildly different baskets. One agent fixed curated bugs in a clean sandbox on the first try. The other took fifty attempts, had a hidden helper pick the best one, and quietly skipped the flaky tasks.

LeoOh. Same receipt, nothing alike in the cart.

MayaAnd that's the whole episode. We're not chasing a higher number today. We're building the workflow that tells you what a number is actually made of — so you can judge an agent instead of just reading its scoreboard.

LeoLast time we laid out the map for the whole topic, right? Four stations — the benchmark instruments, the verification signals, the reviewer's desk, and the trust meter beside it.

MayaExactly. The overview was the map. Today is the on-ramp — the practical loop you run when an actual agent lands an actual change and someone needs a verdict. Less "here are the regions," more "here's the walk you take through them."

LeoGood. Because in real life nobody hands me a leaderboard. They hand me a diff and a deadline.

MayaRight. So let's define the thing first, in plain language. Evaluating a coding agent means answering one question carefully: did this change do the job, and can I trust how I know that?

LeoTwo halves. The change, and the trust in the verdict.

MayaAnd here's the move that the rest of the episode unpacks — you never get that from one measurement. You ask several instruments, and the interesting information lives in where they disagree.

MayaFirst stop on the walk. Before you measure anything, you ask: what kind of work is this, and what world is the agent doing it in?

LeoWait — why does the world matter before the score?

MayaBecause the world decides what "success" even means. Fixing a known bug in one file is a different sport than building a feature across five. Driving a terminal for an hour is different again. If you score them on the same ruler, you learn almost nothing.

LeoSo step one isn't "run the test." It's "what am I even watching."

MayaPicture a rack of instruments, each shaped to see one kind of work. There's a bug-fix gauge — the SWE-bench Verified style, real GitHub issues, a repo, and hidden tests the agent can't peek at. That measures: can you land a correct patch in a real codebase you didn't write.

LeoAnd SWE-bench Verified specifically — that's the human-cleaned one?

MayaYes. A subset that human annotators went through by hand to make sure each task is actually solvable, the description is clear, and the tests are correct. Which is itself a lesson — they didn't trust the raw benchmark until people checked it.

LeoThat's kind of the episode in miniature. Even the test maker had to verify the test.

Maya[chuckle] Exactly. Then next to it on the rack, a different dial — Aider's polyglot one. Same idea of solving exercises, but across six languages, and it splits out something subtle: did the model get the answer right, and separately, did it produce edits the tool could actually apply.

LeoHold on. Those come apart?

MayaThey do, and that gap is gold. A model can reason out a correct fix and then format the edit so badly the tool can't apply it. Right answer, unusable delivery. The Aider leaderboard reports those as two different columns on purpose.

LeoOkay, that reframes "it failed." Failed to solve, or failed to hand it over cleanly — totally different bugs.

MayaAnd that's why the world comes first. Same agent, point it at bug-fixing and it looks brilliant; point it at a sprawling feature and it wanders. The instrument you pick is already half your answer.

MayaSecond stop. You've picked the instrument. Now — what actually decided pass or fail? Because "it passed" hides a choice.

LeoThe choice being... what counts as a judge.

MayaRight. There's a wall of signals, from cheap-and-rigid to rich-and-fuzzy. Visible tests the agent can see — easiest to game, because if you can see the test you can write to it. Hidden tests it can't see — much stronger, that's the SWE-bench style. An oracle, where you compare against a known-correct reference. And then human review, or a verifier model giving judgment instead of a stamp.

LeoAnd I'm guessing each one's blind in its own way.

MayaThat's the crucial part. Tests are a label generator — and a weak suite manufactures confident, wrong labels. If the hidden tests only check the easy path, an agent can pass them with a patch that's broken on everything they forgot to check.

LeoOhh. So a green checkmark isn't truth. It's "truth, as far as these particular tests bothered to look."

MayaBeautifully put. There's work showing exactly this — strengthen the test suite, and patches that were marked correct suddenly fail. The patches didn't change. The light got brighter.

LeoSo test quality literally is data quality. A weak test doesn't just miss bugs, it lies to your scoreboard.

MayaAnd that's why this stop matters as much as the first. The same patch can be "pass" or "fail" depending purely on how hard the judge looked. So in practice, you don't just record the result — you record what judged it.

MayaThird stop, and this is the one people skip. Don't only look at the final diff. Look at the path the agent took to get there.

LeoThe trajectory — that word came up in the overview. The whole recorded run, not just the ending.

MayaRight. Every search, file read, command, edit, retry. And it carries information the final patch erases. Two agents land the identical correct fix. One went straight there. The other thrashed for forty steps, broke the build twice, and stumbled into it.

LeoSame diff, same green check. But one of those I'd let near production unsupervised and the other I absolutely would not.

MayaThat's the payoff of trajectory-level signal. The outcome says *whether*. The path says *how*, and *how reliably*. If you only score the diff, a lucky thrash and a confident solve look identical — and they are not the same agent.

LeoIt's like grading a student only on the final answer versus reading their working. The working tells you if they'll get the next one too.

MayaAnd there's a real trade-off here, so let me be honest about it. Trajectories are expensive. They're long, noisy, hard to score automatically, and you can drown in them. Plenty of teams look only at outcomes precisely because the path is so costly to judge.

LeoSo it's not "always read the whole trace." It's "know that the trace holds signal the diff threw away, and decide when that's worth the cost."

MayaExactly the right framing. It's a dial, not a commandment.

MayaLast stop, and it ties the whole walk together. When you finally judge, you judge two things — the change, and the engineering process that produced it.

LeoWhich is the thing the overview kept hammering — review is its own skill, not a side effect of writing the patch.

MayaLet me make it concrete with our running example. A small team ships a fix to an open-source data library. Two gates: it has to pass hidden tests they can't see, and survive review by a human who cares about the rest of the codebase.

LeoTwo gates that can disagree. That's the billing thing from last time.

MayaThe same shape, yes. The patch passes every hidden test — green across the board. Then a reviewer reads four lines and goes, "this also silently disables refunds for one currency." Tests said ship. Human said absolutely not.

LeoAnd no test caught it because... nobody wrote a test for that currency.

MayaRight. The tests checked the common case. The reviewer checked *intent* and *blast radius* — does this match what we meant, and what else does it touch. That's not something a pass/fail oracle encodes. It's a separate capability, and it has to be evaluated as one.

LeoSo the full workflow is: pick the instrument for the work, know what signal judged it, read the path when the path matters, and then judge both the patch and the review of the patch.

MayaThat's the on-ramp. And notice none of those is a single number. Each is a question, and the number is just one cell in the answer.

LeoHmm — and the disagreements between them aren't a bug in the process.

MayaNo, they're the product. When the test gate and the reviewer disagree, that gap is exactly where the real risk was hiding. An evaluation that never surfaces a disagreement probably isn't looking hard enough.

LeoOkay, give me the one honest limitation before we close. Where does this whole careful workflow still let you down?

MayaHere's the uncomfortable one. Every instrument we've described — hidden tests, oracles, even human review — can be gamed or contaminated if the agent saw the answers during training. A benchmark everyone optimizes against slowly stops measuring skill and starts measuring memorization. So even the perfect workflow has a shelf life. The ruler wears down the more it's used.

LeoSo you don't just trust your instruments. You keep asking whether the instruments themselves still mean what they used to.

MayaAnd that humility is the on-ramp to the rest of the topic. We'll take each station — the benchmark families, the verification signals, the reviewer's desk, the trust meter — and go deep on how it can fool you and how to read it anyway.

LeoSo here's where I'll leave you. The next time an agent hands you a green checkmark — what's the one question you'd ask to find out whether that check actually looked where it mattered?

Source material

← Back to Agentic Coding Capability: From Coding Models to Coding Agents