William Liu · Podcasts

T2E11 · 00:12:27

FeatureBench

A deep dive on FeatureBench, a benchmark that grades coding agents on building whole features — not fixing single bugs. Maya and Leo unpack its core trick: deriving feature-sized tasks by tracing unit tests along a project's dependency graph (the test suite as a stencil), checking that each feature is cleanly removable, and grading whole capabilities with hidden execution tests. The episode's headline is the score drop — a frontier model strong on bug-fixing benchmarks falls off a cliff on feature tasks — and what that gap reveals about the difference between the narrow patch skill and the wide engineering skill. Built on the running data-library example, stretched from a bug fix to a streaming-support feature ask.

Transcript

Maya: Here's a one-line bug ticket: "search returns deleted users." A patch for that touches maybe two files. Now here's the next ticket from the same product manager: "add saved searches. Users should be able to name a filter, store it, share it with a teammate, and get notified when new results match." That's not a patch. That's a feature. And the move today is a benchmark built entirely out of that second kind of ticket.

Leo: So instead of "find the broken line," it's "build the whole thing."

Maya: Right, and notice what just changed. A bug fix has a known target: the thing that's broken. A feature has no single target. It's a new capability woven across many files, with a spec that's half-written, where "done" means a dozen new behaviors all work together. That's the gap this paper goes after. It's called FeatureBench.

Leo: Okay, and to put it next to last time: Terminal-Bench was the long path, right? Lots of commands, install something, misread the output, recover, keep going. The agency was in surviving a long sequence inside a live environment.

Maya: Exactly. Terminal-Bench stretched the task across *time*, many steps in a command line. FeatureBench stretches it across *scope*. Today the hard part isn't a hundred shell commands in a row. It's that what you're being asked to build is large, under-specified, and spread across the codebase.

Leo: Two different ways of being hard. One is long, one is wide.

Maya: That's a good way to hold it. And here's the dirty secret of the whole topic: most coding benchmarks measure the *narrow* skill. Resolve one issue. Pass one set of hidden tests. Real engineering is mostly the wide skill: shipping a feature nobody fully specified.

Leo: So before we go further, plain version: what *is* a feature task here? Because "build a feature" could mean almost anything.

Maya: Fair, and the authors are precise about it. A feature task is a chunk of new functionality that, in the project's real history, landed across multiple commits and pull requests. Not one tidy diff, but a span of related work that together added one capability.

Leo: So they're mining the actual git history of real projects.

Maya: They are. Twenty-four open-source projects. And the clever part is *how* they carve a feature out of that history, which is the heart of the paper. They start from the tests.

Leo: From the tests. Not from the feature description.

Maya: From the tests. The insight is that a real feature, when it shipped, came with unit tests that exercise it. So they trace those tests along the code's dependency graph (which functions call which, what depends on what), and that web of connected tests-and-code *is* the feature. They reverse-engineer the boundary of a capability by following what its tests touch.
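
A minimal sketch of the tracing idea Maya describes, assuming a toy call graph. The graph, the function names, and the rule of subtracting code that other tests also reach are all illustrative assumptions, not the paper's actual pipeline.

```python
from collections import deque

# Hypothetical dependency graph: each name maps to the names it calls.
# None of these identifiers come from FeatureBench; they are made up.
CALLS = {
    "test_save_search": ["save_search"],
    "test_share_search": ["share_search"],
    "save_search": ["validate_filter", "db_write"],
    "share_search": ["save_search", "notify"],
    "validate_filter": [],
    "notify": [],
    "db_write": [],
    "test_login": ["login"],
    "login": ["db_write"],
}


def trace(tests, calls):
    """Return every node reachable from the given tests (breadth-first)."""
    seen = set(tests)
    queue = deque(tests)
    while queue:
        node = queue.popleft()
        for dep in calls.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


feature_tests = ["test_save_search", "test_share_search"]
other_tests = ["test_login"]

reached = trace(feature_tests, CALLS)
shared = trace(other_tests, CALLS)

# Assumption: code that unrelated tests also reach (here db_write) is shared
# infrastructure, so it stays outside the feature boundary.
feature_code = reached - shared - set(feature_tests)
print(sorted(feature_code))
```

In this toy graph the boundary comes out as the four functions only the saved-search tests touch, while `db_write` stays behind because the login test depends on it too.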

Leo: Huh. So the test suite is acting like a stencil. You spray it over the codebase and whatever lights up is one feature.

Maya: That's a lovely image, and it's basically right. And it's a *scalable* trick because it's automated. They don't have a human hand-drawing the boundary of every feature. The dependency trace derives the task. They mention something like three thousand-plus executable environments coming out of this pipeline, feeding into a couple hundred curated evaluation tasks.

Leo: Wait, hold on. It's the same worry I always have with automated curation. Last few episodes the punchline was that automation smuggles in garbage. Underspecified, over-strict, broken setups. Why doesn't tracing-from-tests have the same disease?

Maya: It's the right reflex, and they have a specific guard for it. When you cut one feature out of a project's history, you have to make sure the *rest* of the project still works without it. So they check that the other features remain functional after the separation. The cut has to be clean: pull this capability out, everything else still stands.

Leo: So the boundary isn't just "tests that happened to touch this code." It's "a piece you can actually remove and re-add as a unit."

Maya: Right. Otherwise you'd hand the agent a task that's secretly entangled with ten other things, and no patch could ever satisfy it. The clean-separation check is what makes the feature a *real* unit of work and not an arbitrary slice.
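
The clean-separation check can be pictured with a small sketch. It swaps in static reachability for what the episode describes as checking that other features still work after the cut, which in practice would mean actually running their tests. The graph and every name in it are hypothetical.

```python
# Hypothetical call graph; names are illustrative, not from the paper.
CALLS = {
    "test_save_search": ["save_search"],
    "save_search": ["validate_filter", "notify"],
    "test_export": ["export"],
    "export": ["validate_filter"],
    "validate_filter": [],
    "notify": [],
}


def reachable(start, calls):
    """Every node reachable from start, including start itself."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(calls.get(node, []))
    return seen


def broken_tests(removed, remaining_tests, calls):
    """Tests outside the feature that still depend on removed code."""
    return sorted(t for t in remaining_tests if reachable(t, calls) & removed)


remaining = ["test_export"]

# Clean cut: nothing the export test needs is removed.
clean_cut = {"save_search", "notify"}
# Entangled cut: validate_filter is also used by export.
entangled_cut = {"save_search", "notify", "validate_filter"}

print(broken_tests(clean_cut, remaining, CALLS))
print(broken_tests(entangled_cut, remaining, CALLS))
```

An empty result means the feature lifts out without taking anything else down. A non-empty one is the "secretly entangled" case, and such a candidate would be rejected or re-cut.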

Leo: Okay. And then how is the agent actually graded? Because with a bug fix it's one catch-test flipping from red to green. A whole feature is fuzzier.

Maya: Same family of idea, scaled up. It's execution-based: they run the code. The agent builds the feature, and then the held-out tests for that feature run against it. Not "does this one assertion pass," but "does this whole bundle of behaviors light up green." It's still hidden tests as the oracle, just the oracle is now checking a capability instead of a single fix.
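
A rough sketch of what grading a capability instead of a single assertion might look like. The all-or-nothing rule, the extra regression check on existing tests, and every name below are assumptions for illustration. The episode only states that held-out tests are executed against the agent's code.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Result of running one test; the names used below are hypothetical."""
    name: str
    passed: bool


def grade(feature_results, existing_results):
    """Resolved only if every held-out feature test passes and no existing
    test regresses (an assumed rule, not a quoted one)."""
    feature_ok = bool(feature_results) and all(r.passed for r in feature_results)
    existing_ok = all(r.passed for r in existing_results)
    return feature_ok and existing_ok


def pass_rate(results):
    """Fraction of tests that passed, useful for diagnosing partial progress."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


feature = [
    Outcome("stores_named_filter", True),
    Outcome("shares_with_teammate", True),
    Outcome("notifies_on_new_match", False),
]
existing = [Outcome("search_excludes_deleted_users", True)]

print(grade(feature, existing), pass_rate(feature))
```

Under this rule two behaviors out of three still counts as unresolved, which is one way a bundle of behaviors ends up much harder to satisfy than one catch-test.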

Leo: So the agent never sees the grading tests, same as always.

Maya: Never sees them. And that's important, because with a feature the temptation to teach-to-the-test is enormous. If you could see the tests, "implement the feature" collapses into "make these specific assertions pass," which is a much smaller, much faker task. Hiding them forces the agent to actually understand what's being asked.

Leo: Which loops back to the under-specification problem. The ticket says "add saved searches." It does *not* say "and here are the fourteen behaviors I'll check."

Maya: And that's the whole difficulty. On a bug benchmark, the issue text usually pins the target down. Here, the agent gets a feature description that's necessarily incomplete, like every real ticket, and has to infer the rest. The edge cases, the shape of the API, the parts of the codebase nobody mentioned. The hidden tests are quietly checking all of it.

Leo: Okay, so now do the part I wait for. How did the agents actually *do*?

Maya: This is the number that makes the paper. They report a frontier model scoring around seventy-four percent on SWE-bench (strong, the kind of figure that lands in a launch post) and that *same* model scoring around eleven percent on FeatureBench.

Leo: Oh, wait. Same model. Seventy-four down to eleven.

Maya: Same model, roughly that drop. And that gap *is* the argument. It says the thing we've been calling "great at coding" is mostly great at the narrow task: resolve a scoped issue. Point it at a wide, under-specified feature and most of that apparent capability evaporates.
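
To put rough numbers on "most of that capability evaporates," here is the arithmetic on the two rounded figures quoted in the episode. The inputs are the approximate spoken values, not exact reported scores, and the two benchmarks use different tasks, so the ratio is only an intuition aid.

```python
# Approximate figures as spoken in the episode, in percent.
swe_bench_score = 74.0
feature_bench_score = 11.0

absolute_drop = swe_bench_score - feature_bench_score  # percentage points
relative_drop = absolute_drop / swe_bench_score        # share of the score lost
retained = feature_bench_score / swe_bench_score       # share of the score kept

print(f"absolute drop: {absolute_drop:.0f} points")
print(f"relative drop: {relative_drop:.0%}")
print(f"retained:      {retained:.0%}")
```

On these rounded inputs the model keeps roughly a seventh of its headline number when the task widens from a patch to a feature.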

Leo: That's a little sobering. We read "beats the benchmark" and assume the agent can build software.

Maya: And it can build *patches*. Building *features* is a different, much harder game, and the benchmark makes that difference impossible to hide behind one number. That's the real contribution: not a new leaderboard, but a new *kind* of question.

Leo: Alright, I'm sold on the design. Now where does it hurt? What's the limit?

Maya: The biggest one is baked right into the clever part. Deriving tasks from existing tests means the benchmark can only see features that were *well-tested* in the first place.

Leo: Ah. Because the test suite is the stencil. No tests, no stencil, no feature.

Maya: Exactly. A capability that shipped with thin tests, or no tests, is invisible to this pipeline. So FeatureBench measures feature-building *as filtered through projects that happened to have good test coverage*, which is a particular, somewhat tidy corner of the software world.

Leo: And a lot of real features are exactly the messy, badly-tested ones.

Maya: Right. There's also under-specification cutting both ways. When an agent fails, you can't always tell *why*: did it misunderstand the feature, or understand fine but miss something the description left out and the hidden tests demanded? A low score signals something's hard, but it's blunt about *what* was hard.

Leo: So it tells you the wide skill is weak, but not precisely which part of "wide" broke.

Maya: That's the honest limit. And one more: the held-out tests define "done," and any test suite, even a good one, encodes one team's idea of the feature. A different-but-valid implementation might fail their tests while being perfectly mergeable. Same tension we keep hitting: passing tests and being right aren't the same thing.

Leo: Which is the topic's whole nervous system, really. The test gate is necessary and it's never quite sufficient.

Maya: Never quite. And notice this lands the benchmark-families idea we've been building. SWE-bench Verified asks "can you fix a known bug." Terminal-Bench asks "can you survive a long terminal task." FeatureBench asks "can you build a whole capability from a vague ask." Each instrument is shaped to see a different kind of work, and a model can ace one and faceplant on the next.

Leo: Which is exactly why one leaderboard number can't carry the claim.

Maya: It can't. You have to ask *which* instrument, measuring *which* skill. A single percentage with no instrument behind it is a rumor, not evidence.

Leo: Let me tie it to our running thread, too. Our little team shipping the fix to that open-source data library.

Maya: Good, let's stretch it. Say the data-library team gets a feature ask instead of a bug: "add streaming support so users can process files too big for memory." That's a FeatureBench-shaped task. It touches the reader, the writer, the buffering, the public API, and nobody handed them the full spec.
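
For a feel of what the reader side of that streaming ask might involve, here is a minimal sketch. The data library in the running example is hypothetical, so `iter_chunks` and `count_lines` are invented names, and a real feature would also have to cover the writer, the buffering policy, and the public API.

```python
import io


def iter_chunks(stream, chunk_size=64 * 1024):
    """Yield successive byte chunks so the whole file never sits in memory."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        yield chunk


def count_lines(stream, chunk_size=64 * 1024):
    """Example consumer: count newlines while holding one chunk at a time."""
    return sum(chunk.count(b"\n") for chunk in iter_chunks(stream, chunk_size))


# In-memory stand-in for a file too big to load at once.
data = io.BytesIO(b"a\nb\nc\n")
print(count_lines(data, chunk_size=4))
```

Even this small piece raises the unspecified questions the hidden tests might probe: what chunk size is the default, what happens on an empty file, and whether records that straddle a chunk boundary are handled.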

Leo: And "done" isn't one test going green. It's a whole bundle of new behaviors, plus not breaking the old ones.

Maya: Which is the clean-separation check, lived out. Add streaming *and* keep every existing capability working. And the valuable artifact, same as always, isn't "we shipped streaming." It's the whole replayable record: the feature spec, the multi-file diff, the held-out tests that defined success, the trajectory of how the agent got there, and the honest note that the spec was partial.
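
One way to hold that replayable record is a small structured type that serializes to JSON. The field names mirror the five items Maya lists. The class name, the types, and the sample values are all assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class WorkTrace:
    """Hypothetical container for one feature attempt."""
    feature_spec: str                 # the ask as the agent received it
    diff_files: list[str]             # files touched by the multi-file diff
    held_out_tests: list[str]         # tests that defined success
    trajectory: list[str] = field(default_factory=list)  # agent steps, in order
    spec_was_partial: bool = True     # the honest note about the spec


record = WorkTrace(
    feature_spec="add streaming support so users can process files too big for memory",
    diff_files=["reader.py", "writer.py", "buffer.py", "api.py"],
    held_out_tests=["test_stream_read", "test_stream_write"],
    trajectory=["read reader.py", "add chunked reader", "run existing tests"],
)

serialized = json.dumps(asdict(record), indent=2)
print(serialized)
```

Keeping `spec_was_partial` as an explicit field, instead of burying it in free text, makes the caveat travel with the record wherever it is replayed.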

Leo: Collect the work trace, not the headline. Even when the work is a whole feature.

Maya: *Especially* then. Because a feature is where the gap between "looks done" and "is done" is widest. A patch that passes one test is easy to trust. A feature that passes a dozen tests but quietly broke a thirteenth behavior nobody wrote a test for, that's the thing that ships and bites you in production.

Leo: Which is the under-specification ghost again. The behaviors nobody specified are exactly the ones that break.

Maya: And that's why the eleven percent is, in a strange way, the most useful number in the topic so far. Not because it's low, but because it's *honest*. It's the frontier admitting that the wide skill, the real-engineering skill, is still mostly ahead of us.

Leo: Here's a question to sit with, then. If your coding agent can fix bugs brilliantly but stumbles on whole features, would you rather it tell you up front "this is a feature, I'm probably going to miss something," or just confidently hand you a diff and let the hidden behaviors surface later?
