William Liu · Podcasts

T2E8 · 00:13:01

SWE-bench-Live

SWE-bench-Live is the continuously refreshed answer to a benchmark's slow expiry. Maya and Leo unpack why it exists (a fixed set of public GitHub issues gets contaminated and goes stale) and how Live sidesteps both problems by automatically curating fresh issues each month, ones that postdate a model's training data, and scoring them with the same hidden catch/guard tests inside auto-built containers. They contrast it with Verified's hand-cleaned frozen set, explain how the frozen lite and verified splits keep rankings fair, and walk through the real costs: environment flakiness, test-validity decay, perpetual maintenance, and noisy small snapshots.

Transcript

Maya: Picture a benchmark that quietly expires. Not all at once, but slowly. Every month it gets a little less trustworthy, because every month the models it grades have read a little more of its answer key. The fix isn't to scrub it. The fix is to keep printing brand-new exam questions, in public, faster than anyone can memorize them.

Leo: So the instrument refreshes itself on a clock.

Maya: That's the whole move today. A benchmark that pulls in problems nobody could have trained on, because they didn't exist yet, and it does it on a rolling schedule.

Leo: Okay, and last time we were on the opposite strategy. SWE-bench Verified: humans walking the room, pulling the broken answer keys off the wall, trading scale for a hand-cleaned, trustworthy five hundred.

Maya: Right. Verified makes a *fixed* set honest by inspecting it once, carefully. Today's source makes a set honest a completely different way: by refusing to let it sit still. It's called SWE-bench-Live.

Leo: And "Live" is the whole thesis in one word.

Maya: It is. Same backbone as the original (resolve a real GitHub issue, get judged by hidden tests), but the question set keeps moving.

Leo: Before we get to how it moves, plain version: what's the problem it's actually solving? Because Verified already fixed the broken-answer-key thing.

Maya: Verified fixed *unfairness*. It did not fix *staleness*, and staleness is really two different diseases. Let me name them, because the rest of the episode hangs on telling them apart.

Leo: Okay, name them for me.

Maya: The first disease is contamination. Take a benchmark made of public GitHub issues, with public fixes, sitting on the internet for a year or two. Train a new model on a scrape of that internet, and the model may have literally seen the patch. So a high score might be reasoning, or it might be recall.

Leo: Which we hit hard last episode. You can't tell memory from skill from the outside.

Maya: You can't. And the second disease is plain aging. The model frontier moves every few months. A static set of tasks from, say, two years ago slowly stops describing the problems people file *today*. The libraries changed, the idioms changed, the bug shapes changed.

Leo: So even an honest, hand-cleaned benchmark gets quietly out of date.

Maya: Exactly. Verified is a beautifully calibrated thermometer, but it's measuring the temperature of one specific afternoon, frozen. SWE-bench-Live asks: what if the thermometer re-reads the room every month?

Leo: Okay. So how does it actually pull fresh problems? Because the magic of the original was that the tests already existed in the project's history. You can't hand-author thousands of new tasks every month.

Maya: You can't, and that's the real engineering story here. The key word on the page is *automated*. There's an automated curation pipeline. Fresh issues come in from real GitHub repositories, recently filed, recently fixed, and the pipeline turns each one into a runnable graded task without a human babysitting every step.

Leo: Hmm. That's the part I want to slow down on, because last time the punchline was that automation is exactly what smuggled in the garbage. Underspecified issues, over-strict tests, broken environments. So what's different now?

Maya: Great instinct, and it's the heart of it. The hardest part of making a SWE task isn't finding the issue. It's building the *environment*: getting the repo to install, compile, and actually run its tests inside a clean container. That setup is the bottleneck, and historically it's where humans had to step in.

Leo: Right, every repo is its own little snowflake of dependencies and build steps.

Maya: So SWE-bench-Live leans on an agentic tool for that part. Think of it as a setup agent whose whole job is: given a fresh repository, produce a testable containerized environment. It figures out how to install the thing and make its tests run, so the grading harness has solid ground to stand on.

Leo: So an agent builds the room the other agents get graded in.

Maya: That's a clean way to put it. The benchmark uses a coding-agent-style tool to manufacture the very environments it then uses to test coding agents. And that's what makes monthly refresh even possible: the expensive step got automated.

Leo: Okay, and once the room is built, the grading is the familiar shape?

Maya: Familiar shape, yes. The agent gets the issue and the repo snapshot just before the fix. It writes a patch, a git diff. The harness runs that patch inside the container against the held-out tests. There's also a reference fix, the gold patch, which is the change those tests were written to verify.

Leo: And the agent never sees those tests.

Maya: Never sees them. Same hidden-test logic as the original: the catch test that has to flip from failing to passing, the guard tests that have to stay green. The only thing that's really new is *when* the task was born. Recent. Unseen. Ideally, postdating the model's training cutoff.
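
Editor's note: the pass rule Maya describes can be written as a small predicate. This is a minimal sketch, not the benchmark's actual harness code. The names `is_resolved`, `catch_tests`, and `guard_tests` are invented here (the SWE-bench dataset calls the two groups FAIL_TO_PASS and PASS_TO_PASS), and the test results are assumed to arrive as a mapping from test name to pass or fail.

```python
from typing import Iterable, Mapping


def is_resolved(
    results_after_patch: Mapping[str, bool],
    catch_tests: Iterable[str],
    guard_tests: Iterable[str],
) -> bool:
    """Return True only if the patched repo passes every catch and guard test.

    results_after_patch maps a test name to True (passed) or False (failed).
    A test that is missing from the results counts as a failure.
    """
    catch = list(catch_tests)
    if not catch:
        # Without a catch test there is nothing showing the issue was fixed.
        raise ValueError("a task needs at least one catch test")
    catch_ok = all(results_after_patch.get(name, False) for name in catch)
    guard_ok = all(results_after_patch.get(name, False) for name in guard_tests)
    return catch_ok and guard_ok
```

Under this rule a patch that fixes the bug but breaks one previously green test is still scored as unresolved.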

Leo: So that's the contamination answer. If the issue was filed and fixed *after* the model finished training, the model couldn't have memorized the patch. Full stop.

Maya: That's the elegant part. You don't have to *prove* a model wasn't contaminated, which is nearly impossible. You sidestep it. Use problems that are newer than the model. The calendar does the work that auditing can't.
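
Editor's note: "use problems that are newer than the model" amounts to a date filter. The sketch below is illustrative only. `Task`, `newer_than_model`, and the field names are hypothetical, and it assumes the model's training cutoff is known and accurate, which in practice it may not be.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Task:
    task_id: str         # hypothetical identifier for the task
    issue_created: date  # when the issue was filed


def newer_than_model(tasks: Iterable[Task], training_cutoff: date) -> List[Task]:
    """Keep only tasks whose issue was filed strictly after the cutoff."""
    return [t for t in tasks if t.issue_created > training_cutoff]
```

An issue filed on the cutoff date itself is excluded, which errs on the side of treating borderline tasks as possibly seen.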

Leo: And the refresh cadence: how fresh is fresh?

Maya: The page describes a monthly rhythm. Each month, a batch of newly verified, high-quality issues gets added to the test split. So the live portion keeps growing forward in time, snapshot by snapshot.

Leo: Newly *verified*, so there's still a quality bar, not just "scrape whatever's on GitHub this week."

Maya: There's still filtering. Verified-and-high-quality is the language on the page. The aspiration is contamination-free *and* not-garbage, which means the automated pipeline has to do some of the work humans did by hand in Verified: checking the task actually runs and the tests actually mean something.

Leo: Let me also ask the thing that trips me up. If it's adding a batch every month, isn't comparing two models unfair? One got tested on the March questions, one on the May questions.

Maya: Sharp, and they thought about it. There are *frozen* splits alongside the rolling one. A lite split and a verified split that stay fixed, so leaderboard comparisons are apples to apples and costs stay manageable. The rolling part is for freshness; the frozen part is for fair head-to-head.

Leo: Ah. So it's not one or the other. The live stream gives you contamination resistance, and the frozen slices give you a stable ruler.

Maya: Two instruments in one kit. And it's not only Python anymore. There's a multi-language slice across many repos, and even a Windows-environment slice. The original was famously English-issues-and-Python; this stretches the lens wider.

Leo: Okay, I'm sold on the design. Now do the part you always make me wait for. Where does it hurt?

Maya: [chuckle] The limits. And they're honest ones. The page states them plainly, which I respect. The first is environment reproducibility. The page literally says a container does not guarantee full isolation, and that tests can become invalid over time.

Leo: Wait, invalid how? A test just... rots?

Maya: It can. A test that passed cleanly in March might start failing in June for reasons that have nothing to do with anyone's patch. An upstream dependency shifted, a timestamp, a network call, something machine-dependent. The world underneath the test moved.

Leo: So the same flakiness ghost from last episode, but worse, because a *live* benchmark is always reaching into the present, where things are still changing.

Maya: Exactly the tradeoff. Freshness and stability pull against each other. The recommendation on the page is telling: run the evaluation with the gold patch a few times first, and use that to filter out instances that are already misbehaving. You're sanity-checking the room before you grade anyone in it.
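
Editor's note: the gold-patch sanity check can be sketched as a filter that keeps an instance only if the reference fix passes on every repeated run. This is an assumption-laden illustration, not the project's tooling. `filter_stable` and the `run_gold` callback are hypothetical, and the default of three repeats is a stand-in for the page's "a few times".

```python
from typing import Callable, Iterable, List, Tuple


def filter_stable(
    instance_ids: Iterable[str],
    run_gold: Callable[[str], bool],
    repeats: int = 3,
) -> Tuple[List[str], List[str]]:
    """Split instances into (keep, drop) by re-running the gold patch.

    run_gold(instance_id) is a caller-supplied function that applies the
    gold patch in the instance's container and returns True if the hidden
    tests pass. An instance is kept only if every one of the runs passes.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    keep: List[str] = []
    drop: List[str] = []
    for instance_id in instance_ids:
        # all() stops at the first failing run, so broken instances are cheap.
        if all(run_gold(instance_id) for _ in range(repeats)):
            keep.append(instance_id)
        else:
            drop.append(instance_id)
    return keep, drop
```

Requiring every run to pass is the strict choice. A task that fails even with the reference fix tells you about the environment, not about the agent being graded.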

Leo: Which is a maintenance tax. Somebody has to keep running that, keep pruning the rotten tasks, keep the pipeline alive every single month.

Maya: That's the second limit, and it's structural. A static benchmark is published once and it's done. A live benchmark is a *commitment*. The automated pipeline lowers the cost, but it doesn't zero it: environments break, the setup agent fails on some repos. The price of never going stale is never being finished.

Leo: And there's a third one I can feel coming. If you're slicing fresh issues every month, each snapshot is small. A monthly batch is nowhere near the thousands the original threw at you at once.

Maya: That's the third, and it's real. Any single month's fresh batch is a small sample. Small samples are noisy: a handful of lucky or unlucky task draws can swing a percentage. So you read the rolling number as a *trend across snapshots*, not as one precise verdict from one month.
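
Editor's note: one way to put a number on "small samples are noisy" is a confidence interval around the resolve rate. The sketch below uses the Wilson score interval, which is a choice made here and not something the source prescribes. It treats tasks as independent draws, and the snapshot sizes in the note after it are illustrative, not figures from the benchmark.

```python
import math
from typing import Tuple


def wilson_interval(resolved: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a resolve rate (z = 1.96 is about 95%)."""
    if total <= 0 or not 0 <= resolved <= total:
        raise ValueError("need total > 0 and 0 <= resolved <= total")
    p = resolved / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)
```

At a 20% resolve rate, a hypothetical 50-task snapshot gives an interval of roughly 11% to 33%, while 500 tasks narrows it to roughly 17% to 24%. That gap is why a single month's number is better read as one point on a trend.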

Leo: So the frozen splits are where you get a stable number, and the live stream is where you watch the trend line and check that nobody's memorizing.

Maya: You've got the mental model. And notice how this speaks to the bigger debate we keep circling. One camp wants a strong, stable, public leaderboard so the whole field shares a ruler. The other camp says any public ruler eventually gets gamed and contaminated, so you need something fresh that nobody trained on.

Leo: And SWE-bench-Live is basically that second camp building a tool, but quietly borrowing from the first by keeping frozen splits for comparability.

Maya: Which is the most honest version of the disagreement, I think. It's not "pick a side." It's "use a fresh stream to catch the cheaters, and a frozen slice to rank the contenders." The strongest case for each is true at the same time.

Leo: Let me tie it back to our running thread, too. Same lesson as last time: the score isn't the product.

Maya: Right. Take our little team shipping a fix to that open-source data library. If they evaluate on a live, uncontaminated issue, the valuable artifact isn't "we resolved it." It's the whole replayable task underneath: the fresh issue, the container the setup agent built, the hidden tests, the trajectory, the pass-fail labels, and the date stamp proving the model couldn't have seen it.
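
Editor's note: the replayable artifact Maya lists maps naturally onto a single record. The sketch below is one possible shape, not a format the benchmark defines. `TaskTrace` and every field name are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class TaskTrace:
    issue_id: str         # the fresh issue being resolved
    issue_created: date   # the date stamp
    container_image: str  # environment the setup agent built
    hidden_tests: List[str]  # names of the held-out tests
    trajectory: List[str] = field(default_factory=list)  # agent steps, in order
    test_outcomes: Dict[str, bool] = field(default_factory=dict)  # pass/fail labels

    def to_json(self) -> str:
        """Serialize the trace, writing the date in ISO 8601 form."""
        record = asdict(self)
        record["issue_created"] = self.issue_created.isoformat()
        return json.dumps(record, sort_keys=True)
```

Keeping the date inside the same record as the outcomes means the evidence of freshness travels with the result instead of living in a separate log.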

Leo: Collect the work trace, not the headline. And here the date stamp is part of the trace.

Maya: The date stamp *is* the evidence. On a static benchmark you'd always wonder "did it just remember this?" On a live one, the calendar answers for you. That's the real gift: not a higher number, but a number you're finally allowed to read as skill instead of memory.

Leo: Though only as far as the freshness holds. The minute this month's tasks leak into next year's training scrape...

Maya: ...they're contaminated too, and they retire into the static pile. Which is exactly why the thing has to keep moving. A live benchmark isn't a place you arrive. It's a treadmill you commit to walking.

Leo: Here's a question to sit with, then. If keeping a benchmark fresh means committing to maintain it forever (pruning rotten tasks, rebuilding broken environments every month), at what point does the upkeep cost more than the contamination it's protecting you from?

