William Liu · Podcasts

T2E16 · 00:12:55

CodeReviewQA

CodeReviewQA grades the part of code review that comment-generation benchmarks can't: did the model actually understand the change? Maya and Leo unpack how the paper turns review comprehension into multiple-choice question-answering over a diff — recognizing the change type, localizing it to the exact line, and identifying the right fix — so a single fuzzy 'bad comment' becomes a readable per-step diagnosis. They run the topic's data-library patch through the three gates, then weigh the catch: multiple choice tests recognition, not the open-ended review you do in the wild.

Transcript

Maya: Here's a moment every reviewer knows. A model spits out a review comment, "consider handling the null case here," and it sounds right. Polished, confident, on-topic. And you stop and think: wait, does this thing actually understand what changed in this diff? Or did it just pattern-match a sentence that sounds like a review?

Leo: Right. The comment reads fine, but you can't tell from it alone whether there's real comprehension behind it or just good autocomplete.

Maya: That gap is the whole episode. Today's paper, CodeReviewQA, makes one sharp move: before you let a model write a single review comment, you check whether it understood the change at all, by asking it multiple-choice questions about the diff.

Leo: Questions with right answers. Not "write me a comment," but "tell me what you see."

Maya: Exactly. It turns review comprehension into a quiz.

Leo: Okay, before we open that up: last time?

Maya: Last time was CodeReviewer, the granddaddy. It planted the idea that review is a learnable task and split the job into three: judge the diff, write the comment, fix the code. We called them the Verdict, the Note, and the Patch.

Leo: And the honest confession was that the Note, the actual written comment, had no clean answer key. You score it by overlap with the one human comment that happened to be there, which punishes a different-but-correct comment.

Maya: That's the thread today's paper grabs. CodeReviewer couldn't cleanly score the comment because comments are open-ended. CodeReviewQA says: fine, stop scoring the comment. Score the understanding underneath it, and make that part have right answers.

Leo: Oh. So instead of fixing the fuzzy metric, it changes the question being asked.

Maya: That's the pivot. Plain-language idea first. In CodeReviewQA, the QA is question-answering. The benchmark takes a real code change plus the review situation around it, and instead of asking the model to generate a refinement, it asks the model to answer questions about it. Closed questions. Pick the right option.

Leo: And why does "closed questions" matter so much? On the surface it sounds like a downgrade: multiple choice instead of real writing.

Maya: It sounds like a downgrade, but it buys two things. One, a crisp grade: there's a correct answer, no overlap-with-one-human guesswork. And two, the subtle one: you see exactly where the model breaks. Not just "it got the review wrong," but "it understood what kind of change was needed, and where, but picked the wrong fix."

Leo: Hmm. So it's diagnostic. Like a doctor isolating which joint hurts instead of just "your arm's broken."

Maya: That's a good way to hold it. And that diagnosis comes from how they break the comprehension up: three steps, and I want to give them landmark names so you can hear them.

Leo: Please, name them.

Maya: Picture three gates the model walks through, each a question. First gate: call it the What. Given the code and the situation, what kind of change is being asked for? A bug fix, a refactor, a style tweak, a missing edge case? The paper calls it change type recognition. In your head: the What.

Leo: So before anything else, does the model even know what category of problem it's looking at?

Maya: Right. Second gate: the Where. You know what kind of change; now, where in this code does it need to happen? Which line? That's change localization. The Where.

Leo: That's the one I'd expect models to fumble. Knowing something's wrong in the abstract is easy. Putting your finger on the exact line is the real skill.

Maya: It is, and localization is where confident-sounding review falls apart all through this topic. Third gate: the How. What's the actual correct fix? That's solution identification. The How.

Leo: What, Where, How. And each one is a multiple-choice question with one right answer.

Maya: Each one. And here's why the order matters: they're a chain. A model can sail through the What and then completely whiff the Where, point at the wrong line. The three gates catch that. Comprehension didn't fail in one lump; it failed at the Where, specifically.

Leo: Oh, that's the disentangling you meant. The old way, a bad review comment is just a bad review comment. Here you can read the failure.
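
One way to picture the three gates is as data. The sketch below is only an illustration of the What/Where/How chain described in the conversation: the class names, field names, and the toy item are hypothetical and are not the paper's actual schema or dataset.

```python
from dataclasses import dataclass
from typing import Optional

# Gate order follows the chain from the episode:
# change type recognition, change localization, solution identification.
GATES = ("what", "where", "how")


@dataclass
class GateQuestion:
    prompt: str
    options: list[str]
    answer_index: int  # index of the single correct option


@dataclass
class ReviewItem:
    diff: str
    questions: dict[str, GateQuestion]  # keyed by gate name


def grade(item: ReviewItem, picks: dict[str, int]) -> dict[str, bool]:
    """Return, per gate, whether the picked option matches the answer key."""
    return {g: picks[g] == item.questions[g].answer_index for g in GATES}


def first_failed_gate(results: dict[str, bool]) -> Optional[str]:
    """Walk the gates in chain order and report the first one that failed."""
    for g in GATES:
        if not results[g]:
            return g
    return None


# Toy item with placeholder options (hypothetical, for illustration only).
item = ReviewItem(
    diff="(hypothetical diff text)",
    questions={
        "what": GateQuestion(
            "What kind of change is requested?",
            ["bug fix", "refactor", "style tweak", "missing edge case"],
            0,
        ),
        "where": GateQuestion(
            "Which line must change?",
            ["line 3", "line 7", "line 12", "line 15"],
            1,
        ),
        "how": GateQuestion(
            "Which revision resolves the comment?",
            ["option A", "option B", "option C", "option D"],
            2,
        ),
    },
)

# A model that recognizes the change type but points at the wrong line.
results = grade(item, {"what": 0, "where": 3, "how": 2})
print(results, first_failed_gate(results))
```

Keeping one boolean per gate, rather than a single pass/fail, is what makes the failure readable: the same wrong review can be traced to the gate where comprehension first broke.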

Maya: That's the cleanest contribution. The paper's own framing is that it disentangles comprehension from the generative refinement result: it separates "did you understand the change" from "did you produce the right rewritten code." Those come apart. A model might understand perfectly and still botch the generation, or fake-understand and get lucky on the output.

Leo: Let me make sure I see the contrast with comment-generation benchmarks, because that's the heart of it. The old style: you hand the model a diff, it writes a sentence, you grade the sentence.

Maya: Grade it against a reference sentence with text-overlap scoring. That's the soft metric we beat up last time, where a correct comment that shares no words with the reference still scores near zero. The paper points right at this: text-matching metrics give limited insight into model failures.

Leo: Limited insight. The number tells you the model lost, but not why.

Maya: Right. A low overlap score is a black box. Did the model misunderstand the change, or understand fine and just phrase it unlike the human? You can't tell. CodeReviewQA replaces that black box with three readable gates.
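
The overlap problem is easy to reproduce with a toy metric. The sketch below uses a simple unigram F1 as a stand-in for text-overlap scoring; it is not the metric used by any particular benchmark, and the two candidate comments are invented for illustration.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall against one reference."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "consider handling the null case here"
paraphrase = "this will crash when the input is None"  # same concern, new words
near_copy = "consider handling the null case"          # same words

print(round(unigram_f1(paraphrase, reference), 3))
print(round(unigram_f1(near_copy, reference), 3))
```

The paraphrase raises the same concern as the reference but shares only one word with it, so it scores far below the near copy. The number alone cannot say whether a low score reflects a misunderstanding or just different phrasing.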

Leo: And there's a second thing closed questions buy you: the contamination angle.

Maya: Good catch. Real worry across the whole topic. If a benchmark is open-ended generation and the model saw that exact pull request in training, it can regurgitate. With multiple-choice over a curated set, the paper argues the structured QA format helps mitigate that contamination risk: picking the right answer in context is harder to memorize your way through than reproducing a comment you've seen.

Leo: Harder to fake. Okay. What did they build it on? How big, how many models?

Maya: They hand-curated the set across nine programming languages, with the same breadth instinct as CodeReviewer: reviewer judgment is partly language-specific. Then they ran it across a large batch of recent models, dozens of them. It landed in the ACL Findings track. I'll keep the exact counts out of the spoken version.

Leo: And the punchline of running all those models?

Maya: The honest one: it exposes specific weaknesses the generation score hides. Models that look comparable on "write the comment" turn out to differ on which gate they trip at, for example fine on the What but weak on the Where. The benchmark makes those differences visible instead of averaging them into one fuzzy comment score.
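
To see how one averaged score can hide per-gate differences, here is a small sketch. The model names and the counts are made up purely for illustration; they are not results from the paper.

```python
from collections import defaultdict

GATES = ("what", "where", "how")


def per_gate_accuracy(records):
    """records: iterable of (model, gate, is_correct) tuples.

    Returns {model: {gate: accuracy}}.
    """
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for model, gate, ok in records:
        totals[model][gate] += 1
        hits[model][gate] += int(ok)
    return {
        m: {g: hits[m][g] / totals[m][g] for g in GATES if totals[m][g]}
        for m in totals
    }


def overall_accuracy(records):
    """Single averaged score per model, ignoring which gate each answer was for."""
    hits, totals = defaultdict(int), defaultdict(int)
    for model, _gate, ok in records:
        totals[model] += 1
        hits[model] += int(ok)
    return {m: hits[m] / totals[m] for m in totals}


def toy(model, counts):
    """Expand {gate: (correct, total)} into individual answer records."""
    return [(model, g, i < c) for g, (c, t) in counts.items() for i in range(t)]


# Invented counts: both hypothetical models answer 9 of 12 questions correctly.
records = (
    toy("model_a", {"what": (4, 4), "where": (2, 4), "how": (3, 4)})
    + toy("model_b", {"what": (3, 4), "where": (4, 4), "how": (2, 4)})
)

print(overall_accuracy(records))
print(per_gate_accuracy(records))
```

In this toy setup the two models tie on the averaged score, yet one is weak on the Where and the other on the How. That per-gate profile is the kind of difference the episode says a single comment score averages away.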

Leo: Let me pull our running example through this, the one from the topic. Small team, fix to an open-source data library, the patch has to pass hidden tests and survive a human reviewer. And the dangerous version passes the empty case, the common case, but silently disables refunds for one currency.

Maya: Perfect, run it through the three gates.

Leo: So a review model looks at that diff. The What: it might correctly say "this is a logic change to a conditional branch." Good, gate one. The Where...

Maya: And the Where is where it gets interesting. The currency bug lives in one specific branch of one condition. If the model points at that exact line, the refund path for that currency, it passes the Where. If it waves at "somewhere in this function," it fails the Where, and the benchmark records exactly that.

Leo: And the How is the fix. Restore the branch, handle the currency.

Maya: Right. And here's the payoff. With a comment-generation benchmark, a slightly-off comment about that bug just gives you a mediocre overlap score and a shrug. With CodeReviewQA, you'd see: it nailed the What, nailed the Where, picked the wrong How. Now you know the model can find the bug but can't fix it, which is a different problem from one that can't even find it.

Leo: That's genuinely useful. It's the difference between a reviewer who's blind and one who sees but gives bad advice. You'd manage those two very differently.

Maya: And you'd never know which one you had from a single comment score. That's the case for the whole approach.

Leo: Okay, give me the limitation, because every episode in this topic has one and I'm braced.

Maya: Brace, yeah. The big one is the flip side of the strength. Multiple-choice is clean, but real review isn't multiple-choice. Nobody hands you four options for "where's the bug"; you stare at fourteen files and the answer space is open. So CodeReviewQA measures comprehension in a constrained setting, recognizing the right answer among given choices, not producing it from scratch in the wild.

Leo: Ah. Recognition versus generation. I can recognize the right chess move from a list way better than I can find it on a real board.

Maya: That's exactly the gap. A model can be strong at picking the right option and still flounder generating the whole review unprompted. So a high CodeReviewQA score is necessary-ish (if you can't recognize the change, you can't review it), but it's not sufficient. It's a comprehension floor, not proof of real-world review skill.

Leo: And it inherits CodeReviewer's keyhole a bit too, right? It works at the level of a change and questions about it, not whole-PR, cross-file reasoning.

Maya: That's fair. The questions are scoped to a change and its review context, not "reason across the entire repository." So the cross-file judgment our currency bug really demands, the interaction three files over, sits mostly outside these three gates. CodeReviewQA sharpens the comprehension question; it doesn't widen the lens to the whole PR.

Leo: So where does it land in the staircase? CodeReviewer planted "review is learnable." This one...

Maya: This one says "before you grade the review, grade the understanding, and make understanding have right answers." It takes the least-measurable thing from last time, whether the model actually gets the change, and turns it into something you can score cleanly and read diagnostically.

Leo: It's almost the inverse of CodeReviewer's confession. CodeReviewer said the part we most want, we can least measure. CodeReviewQA says: okay, then let's measure the part under it that we can.

Maya: That's a lovely way to put it. It doesn't solve the un-scorable comment. It steps underneath the comment, to comprehension, and builds firm ground there.

Leo: And the cost of that firm ground is the multiple-choice frame. Clean grade, less realism.

Maya: Always the trade. You buy a crisp, diagnostic, contamination-resistant signal, and pay with the artificiality of picking from options instead of reviewing for real. Whether that floor predicts real review skill is the open part.

Leo: Which is a nice thing to sit with, actually.

Maya: It is. So here's the question to carry out. If a review model aces the recognition quiz, picking the right What, the right Where, and the right How every time, but you've never watched it review a messy pull request on its own, how much would that quiz score change how much you trust it?
