William Liu · Podcasts
2D editorial illustration: a vertical ladder of cream rungs (eye, key-locked gate, balance scale, magnifier lens, human eye) re-judging the same code-diff panel, a cyan trace ribbon threading up to a needle trust meter, one diff line faintly flagged. No text, no 3D.

T2E2 · 00:13:30

Public Tests, Hidden Tests, and Oracle Tests

Verification is a ladder, not a switch. Maya and Leo climb it rung by rung — visible tests the agent can read, hidden tests it writes blind against (SWE-bench's catch and guard split), oracle tests that compare to a known-correct answer, LLM-based verifiers that score the whole trajectory and intent, and human review — showing what each rung proves and exactly what it stays blind to. The thread throughout is one false positive: a data-library patch that passes the tested path while silently disabling refunds for an untested currency, and which rung finally catches it.

Transcript

MayaHere's a moment that happens a hundred times a day. An agent finishes a patch, runs the test suite it can see, watches every line go green, and announces: fixed. And it's wrong. The bug is still there — the visible tests just never poked at it.

Leo[chuckle] So the agent aced the open-book exam and failed the actual class.

MayaThat's the whole episode. There isn't one way to judge a patch — there's a *ladder* of signals, and each rung proves something different and stays blind to something else. The trick is knowing which rung you're standing on.

LeoOkay, so last time we did the practical side — how you actually run an evaluation without fooling yourself, what to log, how to make the number honest.

MayaRight. Today we go one level underneath. Forget how you run the eval — what is the *signal* doing the judging? When something says "this patch is good," what did it check, and what did it quietly skip?

LeoA ladder of signals. So let's climb it. Bottom rung first — plainest one.

MayaBottom rung is the visible test. The agent can see it — it's right there in the repo, part of the code you handed over. The agent reads it, runs it, tunes the patch until it passes.

LeoAnd that's... fine? That's how every human developer works too.

MayaFine as a development aid. Dangerous as a *grade*. Because once the agent can see the test, the test stops measuring whether the bug is fixed and starts measuring whether the agent can satisfy that specific test.

LeoOh. So it can write code that hugs the test — hard-code the one example the test checks, and it goes green without the real behavior existing.

MayaPeople call that overfitting to the test, or in the nastier version, gaming the oracle. A visible test is a label generator, and if the agent reads the label before submitting, the label gets corrupted. So it proves one thing: the agent can make *these exact cases* pass. That's all.

LeoWhich is real, but small. Okay, next rung up.

MayaNext rung: the hidden test. Same idea — code in, pass or fail out — but the agent never sees it. It writes the patch blind, and only afterward does the harness run a held-out suite against it.

LeoAnd that fixes the hugging problem, because you can't tune to a test you can't read.

MayaExactly. This is the heart of a benchmark like SWE-bench Verified. Each task harvests a real pull request that already fixed a real bug, and it came with tests that split into two jobs. One is the catch test — failing on the broken code, and it must flip green if you actually fixed the thing.

LeoThe "did you fix it" signal. And the other job?

MayaThe guard tests — already passing, and they must *stay* passing. They catch you breaking the rest of the house while patching one window. The rule is strict: catch test flips green AND every guard stays green, or it doesn't count.

LeoIn the benchmark's own language those are the fail-to-pass and pass-to-pass sets, right?

MayaThat's them. Fail-to-pass is the catch, pass-to-pass is the guard. Both hidden is what makes the number mean something — the agent earned the green without ever reading the answer key.

LeoOkay, hidden tests sound great. What's the catch on the catch test?

Maya[chuckle] The catch is test *adequacy*. A hidden test only proves what it actually exercises. If the held-out suite has no case for some edge — an empty input, a weird currency, a concurrent call — a patch that breaks that edge still sails through green.

LeoWait. So hidden doesn't mean thorough. The test can be hidden and *also* shallow.

MayaThat's the point most people skip. Hiding the test fixes gaming; it does nothing for coverage. There's good work — UTBoost is the one I'd point to — showing that when you strengthen the held-out tests on existing benchmark tasks, a chunk of patches marked "resolved" turn out to be wrong. They passed only because the original tests were too weak to catch them.

LeoOof. So the benchmark confidently labeled a broken patch as correct.

MayaConfidently wrong labels. And that's the deepest thing about tests in general: the suite *is* your label generator, so test quality is data quality. A weak suite doesn't just miss bugs — it manufactures false positives everyone downstream trusts.

LeoLet me anchor this. Our running example all topic has been a fix to an open-source data library — a patch that has to pass hidden tests AND survive a human reviewer. Walk me through a false positive there.

MayaPerfect case. Say the library handles refunds across currencies. The patch fixes the reported crash, the hidden suite checks the common path — dollars, a normal amount — green. Catch test flipped, guards holding. Benchmark says: resolved.

LeoBut there's a "but" coming.

MayaBut the patch silently disabled refunds for one currency the hidden suite never tested. The label says perfect; reality says you shipped a data bug. The test didn't lie — it just never asked the question.

LeoAnd no amount of hiding the test catches that. You'd need a test that *thought* to check that currency.

MayaWhich is why we keep climbing. Next rung: the oracle test.

LeoPlain version — what makes something an oracle versus just a test?

MayaAn oracle is a source of ground truth you compare against — a known-correct reference. The cleanest one is: here's the exact output the right answer must produce, byte for byte. A parser that must emit one specific tree. A function with one provably correct result.

LeoSo where you *have* a true oracle, you're golden. The judgment is exact.

MayaIt's the strongest signal on the ladder. The problem is most real engineering doesn't have one. There's no oracle for "is this maintainable," no reference for "would a reviewer merge this," no byte-for-byte truth for "did you match the developer's *intent*." Oracles are precise exactly where the world is small.

LeoHmm. And this is where experts actually split, isn't it? I can feel a disagreement coming.

MayaThey split hard, and both camps deserve their strongest form. One says: executable tests are the only objective oracle, full stop. A test passes or it doesn't — no opinions, no hallucinations, fully reproducible. Everything else is vibes wearing an evaluation costume.

LeoAnd that's not weak. Reproducibility is the whole reason benchmarks work.

MayaGenuinely strong. The other camp says tests miss most of what matters — maintainability, intent, reviewability, a regression three modules away, none of it encoded in any test. A patch that passes every test and is still un-mergeable is *common*, so stopping at the test gate leaves you blind to half the job.

LeoAnd both are right at once, which is the annoying part.

MayaThat tension is why the next rung exists. When tests run out, you reach for a verifier — a model that judges.

LeoOkay, "a model that judges" makes me nervous immediately. Plain version?

MayaFair to be nervous. An LLM-based verifier is a model you point at the work and ask: does this meet the criteria? It can read things a test can't — the diff, the reasoning, whether the change matches what the issue asked for. There's a framework literally called LLM-as-a-Verifier that's a clean example of the modern version.

LeoAnd what does that do beyond just asking a model "is this good, yes or no?"

MayaThree moves. First, it doesn't grade the final patch alone — it scores the whole *trajectory*: the searches, the file reads, the commands, the retries — the path the agent took, not just where it landed.

LeoTrajectory-level scoring. So it rewards an agent that got there by sound reasoning and docks one that stumbled into a green test by luck.

MayaRight. Second, instead of one yes-or-no, it breaks the judgment into separate criteria with fine-grained scores, leaning on the model's own confidence rather than a hard label. Third, it verifies repeatedly and aggregates, because one model pass is noisy.

LeoDecompose, score finely, repeat. Much more careful than a vibe check. So what's the catch — there's always one on this ladder.

MayaThe catch is the deepest yet. The verifier is itself a model, so it can be confidently wrong — hallucinate a problem that isn't there, or bless a patch that's subtly broken. And the whole approach leans on its confidence being *calibrated*, so that when it's unsure, the score reflects it.

LeoAnd calibration is famously shaky in these systems. So the thing judging the patch has the same failure mode as the thing writing it.

MayaThat's the uncomfortable symmetry. A test can be weak, but it can't *hallucinate*. A verifier reads more than a test ever could — and it can also invent. You're trading coverage for a new kind of error. Not a reason to skip it; a reason to never let it be the *only* judge.

LeoWhich leaves the top rung. Humans.

MayaTop of the ladder: human review — the one signal that can ask the question nobody wrote a test for. A reviewer reads the refund patch and goes — wait, what happens to that other currency? They carry intent and context and taste in a way no oracle encodes.

LeoSo that's the gold standard. Why isn't everything just human review?

MayaBecause it doesn't scale, it isn't reproducible the way a test is, and reviewers disagree — with each other and with themselves on a bad day. And a subtler limit: even "a human would merge this," what we call mergeability, is a judgment, not a measurement. Two senior reviewers can split on the same diff.

LeoSo even the top rung isn't an oracle. It's the *richest* signal and the *least* reproducible one at once.

MayaWhich is the shape of the whole ladder. As you climb — visible test, hidden test, oracle, verifier, human — you gain richness; you can judge more of what matters. And you *lose* hardness. The bottom is reproducible and shallow; the top is deep and slippery.

LeoSo nobody picks one rung. You stack them.

MayaYou stack them on purpose. Hidden tests catch the gaming. Stronger tests catch the coverage gaps. A verifier catches the trajectory and the intent a test can't see. A human catches the thing nobody thought to encode. Each one covers the rung below's blind spot — and none, alone, is the truth.

LeoAnd the false positive we built — the disabled currency — which rung finally catches it?

MayaDepends who's there. The visible test misses it. The hidden test misses it unless someone wrote that case. The verifier *might* flag it if the trajectory looked suspicious around that code. The human, if they care about the rest of the codebase, catches it on sight. The bug doesn't get harder; the ladder just decides how many chances you get to see it.

LeoAnd today's source is really a survey of those rungs more than any single paper.

MayaRight — the verification-signals module is the spine, and LLM-as-a-Verifier is the concrete instance of the judgment rung; links are in the notes. The point isn't to memorize one tool. It's to never again hear "the patch passed" without asking the next question.

LeoWhich is: passed *what*.

MayaPassed what, judged by whom, and blind to which question. That's the literacy.

LeoSo here's where I'd leave a listener. Think about the last green checkmark you trusted — a test, a CI run, an agent saying "done." Which rung of the ladder produced it, and what's the one question it never asked?

Source material

← Back to Agentic Coding Capability: From Coding Models to Coding Agents