T2E5 · Jun 5, 2026 · 00:12:35

T2E5 · Evaluating Review Quality

Review quality is not bug recall. This concept deep-dive builds the review-quality scorecard row by row — recall and precision, the signal-to-noise line, specificity and actionability, the special danger of hallucinated issues, and the trust row they all feed into — using one running example: two agents reviewing the same data-library patch, where the quieter reviewer that catches fewer bugs is the one the developer keeps reading. Maya and Leo show why noise sits on the cost side of the ledger for review the way it never did for patch generation, and why trust can only be measured by what a human does next.

Transcript

MayaTwo review agents land on the same pull request. The first leaves three comments, and one of them is the real bug. The second leaves thirty comments, and four of them are real bugs.

LeoSo the second one found more. Higher recall, right? It caught four instead of one.

MayaHigher recall. And here's the twist the whole episode turns on — the developer trusts the first agent and mutes the second. The one that caught fewer bugs is the one people actually read.

LeoHuh. So catching more bugs made it *worse* to use.

MayaThat's the move. Today we're not measuring whether a reviewer can find a bug. We're measuring whether a reviewer is worth listening to. Different scorecards.

LeoLast time we treated code review as its own capability — a reviewer agent that reads a diff, understands intent, finds latent defects, and lands its comment on the exact line, graded against what real human reviewers flagged.

MayaRight. That episode was about *can the agent review at all*. Today sharpens it: even a reviewer that finds real bugs can be a bad reviewer. The capability is necessary. It isn't enough.

LeoOkay. Plain language first — when you say "review quality," what are we actually scoring?

MayaThe net value a developer gets from running the reviewer — across every comment it leaves, not just the good ones. A human grades a reviewer the same way. Flag one true problem and bury it under ten guesses, and people stop reading you.

LeoSo it's the reviewer-as-a-whole, including the misses and the false alarms.

MayaIncluding the false alarms especially. That's the part bug-counting throws away.

LeoSet my map so I don't lose it on a walk. What are we walking through?

MayaA small scorecard with a few rows. The first two everybody knows — recall and precision. Then a ratio row, the signal-to-noise line. Then two rows about whether a comment is *usable* at all — specificity and actionability. And the last row the others all feed into: trust.

Figure 2 — the CR-Bench pipeline turning real-world defects into a code review benchmark scored by CR-Evaluator on precision, recall, usefulness, and developer acceptanceSource: CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents (Pereira et al., 2026)

LeoRecall, precision, signal-to-noise, then specificity and actionability, then trust. Got it.

MayaStart with the two everyone reaches for. Recall is: of all the real defects in this pull request, how many did the reviewer catch? A reviewer that misses the data-corruption bug has low recall, full stop.

LeoAnd that one really matters, because a missed bug can ship.

MayaIt can. A missed defect is sometimes unrecoverable — a security hole or corrupted data in production. So there's a real instinct to crank recall up: flag everything, never miss.

LeoAnd precision is the other direction — of all the comments it left, how many were actually real?

MayaExactly. When this reviewer speaks, is it usually right? Thirty comments, four real bugs — that's decent recall and brutal precision. Most of what it said was wrong.

LeoSo they pull against each other. If I flag everything to never miss a bug, I drown in false alarms. If I only flag when I'm sure, I miss some.

MayaThat tension is the heart of it. And here's the thing the curriculum behind this series keeps insisting on — for code review, you cannot just chase recall. The field learned that a reviewer optimized purely for "never miss" becomes a reviewer nobody reads.

LeoWhich means its *effective* recall, the bugs that actually reach a human who fixes them, drops to near zero.

MayaOh, that's the sharp version. Paper recall can be high while real recall — the bugs a person acts on — collapses, because the human stopped reading at comment number six.

MayaWhich brings us to the ratio row. Signal-to-noise. In plain terms: out of everything the reviewer says, what fraction is worth a human's attention?

LeoSo this is basically precision wearing different clothes?

MayaRelated, but the framing matters. Precision is a percentage on a spreadsheet. Signal-to-noise is how it *feels* to live with the tool — a bad ratio actively trains the developer to skim and dismiss.

LeoHmm. Like a smoke alarm that goes off every time you make toast. By the third false alarm you don't even look up.

MayaThat's the exact failure. Once you've taught someone to ignore the alarm, the one real fire gets ignored too. The noise doesn't just waste time — it poisons the signal that's actually correct.

LeoSo a noisy reviewer is genuinely worse than a quiet one, even if the noisy one catches more on paper.

MayaThat's the claim a whole branch of this work is built on. A reviewer nobody trusts catches nothing, because nobody reads it. And this is also where review evaluation splits hard from patch-generation evaluation.

LeoSay more — how is scoring a reviewer different from scoring a patch?

MayaA patch benchmark mostly doesn't care about false positives. Your patch passes the hidden tests or it doesn't. But a review benchmark *has* to count the false alarms, because the false alarms are what destroy the tool's value. Review scoring puts noise on the cost side of the ledger, where patch scoring never had to.

LeoRecap for me: signal-to-noise is the row that turns "was it right" into "is it worth listening to."

MayaAnd it's the row patch benchmarks never needed.

MayaNow the two rows about whether a comment is even usable. Because a comment can be perfectly *true* and still useless. Start with specificity. Did the reviewer point at the exact line, or wave at the whole file?

LeoGive me the data-library example. Say our patch adds a fast path to a function that aggregates numbers.

MayaGood. A low-specificity comment says "there might be an overflow issue somewhere in this change." Technically maybe true. Useless. A high-specificity comment lands on the exact added line and says "this accumulator can overflow when the input column is all large integers."

LeoThat's the difference between "there's a leak in your house" and "the pipe under the kitchen sink."

Maya[chuckle] Exactly. Same truth, wildly different value. Specificity is whether the reviewer localizes — circles the statement, the way a good human reviewer does.

LeoAnd actionability is the next step past that — not just *where*, but *what do I do*.

MayaRight. Actionability asks: can the author actually act on this comment? "This is fragile" is true and unactionable. "Cast the accumulator to a wider integer type before summing" is a move the author can make right now. A reviewer that only produces vague unease scores badly here even when it's correct.

LeoSo a single comment is really a little bundle. Is it real, is it located, can I do something about it.

MayaAnd it can fail any one of those independently. True but vague. Located but unactionable. That's why you score them as separate rows instead of collapsing everything to a thumbs up.

LeoWhich is also where the hallucination problem lives, I'd guess.

MayaThat's the one I'd underline hardest. A hallucinated issue is a comment that's confidently wrong — the reviewer describes a bug that simply isn't there. The variable it's worried about can't be null. The race condition it flags is already guarded. And those are the most expensive kind of noise.

LeoWhy the most expensive? A wrong comment is a wrong comment.

MayaBecause the author has to *investigate* it to find out it's fake. A vague comment you skim past. A confident, specific, hallucinated bug report — "this will deadlock under load" — sends a human off to disprove it. The cost isn't the comment. It's the wild goose chase, paid out of a senior reviewer's afternoon.

LeoOof. So specificity cuts both ways. A specific *correct* comment is gold. A specific *wrong* comment is a trap.

MayaThat's exactly why you can't just reward specificity or just reward recall. The same property that makes a reviewer helpful makes it dangerous when it's wrong.

MayaWhich lands us on the bottom row, the one all the others feed into. Trust. Or, the way practitioners talk about it — usefulness.

LeoThis is the developer-keeps-or-kills-the-tool row.

MayaRight. Trust is whether the developer still opens the reviewer's comments next week. And it's not a separate measurement you bolt on — it's the *consequence* of the rows above. High noise, low specificity, the occasional hallucinated deadlock — those don't just lower a score, they spend down a budget. Once trust is gone, even the correct comments stop landing.

LeoSo how do you even measure trust? It sounds squishy next to "did the test pass."

MayaIt is squishier — that's the honest limitation. The closest proxies are downstream human reactions. Did the author accept the comment or dismiss it? Did they apply a fix in response? At industrial scale, teams have looked at exactly that — accepted comments, applied fixes, developer acceptance — as the real-world ground truth for whether a reviewer is trusted.

LeoAnd those are reactions you can't read off a static diff.

MayaNo oracle of green checkmarks. You only learn it by watching what a human did next — which is why it's the hardest row to benchmark cleanly, and why the field is still building instruments for it.

LeoLet me name the limitation plainly, since every one of these episodes owes one.

MayaPlease.

LeoTrust is a snapshot problem pretending to be a number. A benchmark run measures one moment. But real trust builds and erodes over weeks of a developer living with the tool — one bad week of false alarms can sink a reviewer that scored beautifully on the bench.

MayaThat's it exactly. And there's a second edge: judging whether a comment is "actionable" or "severe" is itself a judgment call. You often need a model or a human in the grading loop, and that grader can be biased or wrong — the usual risk for any setup where one model judges another. So even our trust scorecard has its own trust problem.

Leo[laugh] The reviewers reviewing the reviewers.

MayaAll the way down. But the takeaway survives the messiness. Review quality is not bug recall. It's whether a useful signal survives contact with a real developer's attention.

LeoGive me the memory hook before we close.

MayaHere it is. For a reviewer, recall gets you onto the field. Precision and signal-to-noise decide whether you stay on it. And specificity, actionability, and the absence of confident hallucinations are what earn the one thing that actually matters — that an engineer is still reading by comment number ten.

LeoAnd that ties straight back to the series promise. For review, the score isn't a count of bugs. It's the whole story — the false alarms, the line locations, the things the author could actually do, and whether they did anything at all.

MayaExactly. The good reviewer isn't the one that flags the most. It's the one whose next comment you still bother to open.

LeoHere's the question I'll leave for everyone listening: if you were tuning your own review agent, would you rather it caught one more real bug per week — or left ten fewer false alarms, so your team actually kept reading it?

Source material

← Back to Agentic Coding Capability: From Coding Models to Coding Agents