T3E6 · Jun 26, 2026 · 00:15:03

Harness Engineering in an Agent-First World

What were the humans doing if agents wrote every line? Maya and Leo dig into OpenAI's harness-engineering report — a five-month run that shipped roughly a million lines of code across about fifteen hundred pull requests with no code written by human hands. The episode names the four jobs the environment had to do: the Map (a ~100-line agent file as a table of contents, because context is scarce), the Rails (mechanical enforcement via layered architecture and linters whose errors carry the fix), the Mirror (verification loops that give the agent senses and hard thresholds), and the Janitor (scheduled agents that clean up AI slop). Leo steel-mans the cost — the heavy harness ran a lot more expensive than a solo agent — and they settle on what's transferable (the discipline) versus local (the economics), landing on the season's turn: in an agent-first world, you build the garage, not the car.

Transcript

Maya: A team ships a real product. A million lines of code — the app, the tests, the CI config, the docs, the observability. It's running, people are using it. And not one of those lines was typed by a human.

Leo: Hold on.

Maya: Five months. Started as three engineers, grew to seven. Around fifteen hundred merged pull requests. And the humans — they didn't write code. Not as a stunt. As the actual working model.

Leo: Okay, but somebody wrote *something*. You don't get a million lines from three people saying "go."

Maya: That's exactly the right pushback, and it's the whole episode. They wrote a *lot*. Just not the code. They wrote the thing around the code. The environment the agent runs inside.

Leo: The harness.

Maya: The harness. This is the OpenAI write-up — leveraging their coding agent in what they call an agent-first world. And the one-line thesis is almost a slogan: humans steer, agents execute.

Leo: So last time we had the Maker and the Critic — two agents, one builds, one tears it apart, and the contract between them defines "done."

Maya: Right. That was about how a single long build judges itself honestly.

Leo: And today?

Maya: Today zooms out one ring further. Forget one build. Imagine the whole codebase, every day, for months, mostly written by agents. The question stops being "how does one agent grade its work" and becomes "what does the *place* it works in have to look like so it can keep working — reliably — without a person in the loop on every line."

Leo: So the unit isn't the task anymore. It's the environment.

Maya: The environment is the product. And building that environment well — that's the discipline the piece is naming.

Leo: Naming, as in it's new?

Maya: New-ish. The term gets credited to Mitchell Hashimoto — earlier in the year he wrote up a habit: every time an agent screwed up, instead of just fixing the bug, he'd engineer a permanent fix into the agent's *environment*, so that mistake became structurally impossible to make again.

Leo: Fix the world, not the answer.

Maya: Fix the world, not the answer. And then OpenAI ran with it at production scale. Their piece is the one that says: here's what that actually looked like across a whole codebase.

Leo: Okay. Let me steel-man my own skepticism for a second, because "we wrote a million lines and no human touched the code" is the kind of sentence that's either the future or a press release. Which is it?

Maya: Both, probably. Let's earn it. I want to give you the four jobs the harness had to do, because each one is a place where they hit a wall and built their way out instead of grabbing a keyboard.

Leo: Name them.

Maya: Think of it as four roles. The Map — how the agent finds its way around a codebase too big to hold in its head. The Rails — what physically stops it from going off the road. The Mirror — how it sees whether its own work actually works. And the Janitor — what cleans up the mess it inevitably makes.

Leo: Map, Rails, Mirror, Janitor. Go.

Maya: The Map first. Their instinct — everybody's instinct — is to write one giant instructions file. The agent file. Everything the agent should know, all in one place. Architecture, conventions, gotchas, history.

Leo: Sounds responsible.

Maya: It failed. And the reason it failed is the line I want everyone to keep: context is a scarce resource. The model only gets to look at so much at once. Stuff that file with everything, and you've crowded out the actual task, the actual code, the actual relevant doc. When everything's marked important —

Leo: — nothing is.

Maya: Nothing is. So they did the opposite. They shrank the agent file down to about a hundred lines and made it a table of contents. Not the encyclopedia — the index. It doesn't *tell* the agent everything; it tells the agent where to look next.

Leo: Huh. So smaller is the feature.

Maya: Smaller is the feature. And the rest of the knowledge — design decisions, specs, execution plans — gets pushed into a structured docs folder the agent can navigate on demand. Pull in what this task needs, leave the rest on the shelf.
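
[Editor's sketch] The "index, not encyclopedia" rule is easy to check mechanically. The Python below is a minimal illustration, not anything taken from the write-up: the function name `check_agent_file`, the 100-line budget constant, and the assumption that the agent file uses Markdown-style links to point into a docs folder are all hypothetical choices made for this example.

```python
import re
from pathlib import Path

MAX_LINES = 100  # assumed budget, based on the "about a hundred lines" figure
# Markdown link target, stopping before any #anchor (assumes Markdown-style links)
LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+)")


def check_agent_file(agent_file: Path, max_lines: int = MAX_LINES) -> list:
    """Return a list of problems; an empty list means the map is short
    and every local pointer in it resolves to a real file."""
    text = agent_file.read_text(encoding="utf-8")
    problems = []

    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(
            f"{agent_file.name} has {line_count} lines (budget {max_lines}): "
            "move detail into the docs folder and link to it."
        )

    for target in LINK.findall(text):
        if "://" in target:
            continue  # external URLs are out of scope for this check
        if not (agent_file.parent / target).exists():
            problems.append(
                f"{agent_file.name} points to missing file '{target}': "
                "add the doc or drop the link."
            )
    return problems
```

Run against a repository, a call such as `check_agent_file(Path("AGENTS.md"))` (the file name is an assumption) would return an empty list only when the map is both small and accurate, which is the property the hosts describe.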

Leo: That has a name in the human world. Progressive disclosure. You don't read the whole manual; you flip to the page.

Maya: Same idea, and there's a sharper version of it underneath. From the agent's point of view, anything it can't access in context while it's running effectively doesn't exist. So your design rationale that lives in someone's head? Doesn't exist. The decision buried in a chat thread three weeks ago? Doesn't exist.

Leo: So the repo becomes the only reality.

Maya: The repo is the only reality. If you want the agent to know it, it has to be a file the agent can open. Versioned, in the tree, on demand.

Leo: Okay, that one I buy, and I'll tell you why — it's not even agent-specific. That's just good documentation hygiene that humans were always too lazy to do. The agent just doesn't *let* you be lazy.

Maya: That's a recurring shape in this whole piece, actually. The agent doesn't invent the discipline. It removes your ability to skip it.

Leo: Alright. The Rails.

Maya: The Rails. Here's the failure mode that surprised me. Agents are pattern-matchers. Drop one into your codebase and it will faithfully copy what's already there — including the bad parts.

Leo: Oh no.

Maya: So if you've got one slightly-wrong pattern, the agent doesn't fix it. It reproduces it. Forty times. And the next agent copies *those*. Bad patterns don't just persist — they compound.

Leo: So you can't fix it with a polite note in the agent file. "Please don't do the bad thing."

Maya: Right, because that's a suggestion competing for scarce context. What they did instead was make the bad thing *mechanically impossible*. A strict layered architecture — types, config, data, service, runtime, interface, in one allowed direction — and then custom linters and structural tests that physically block a pull request if a layer reaches the wrong way.

Leo: The linter is the rail. You don't ask the agent not to swerve; you put up a guardrail.

Maya: And here's the part I love. The error messages aren't just "rejected." They carry the fix. The lint failure says, in effect, "the service layer can't import from the interface layer — move this through a provider." So when the agent fails, the failure itself teaches it what to do instead.
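
[Editor's sketch] A layering rule whose rejection carries the fix can be very small. The Python below is an illustration under stated assumptions, not the team's linter: it assumes each layer is a top-level package named after the layers listed in the episode, that a layer may import from itself or from any earlier layer, and the names `check_import` and `lint_source` are hypothetical. The remediation wording borrows the "move this through a provider" phrasing quoted above.

```python
import ast
from typing import Optional

# Layer order as listed in the episode; earlier layers may not import later ones.
LAYERS = ["types", "config", "data", "service", "runtime", "interface"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def check_import(importer: str, imported: str) -> Optional[str]:
    """Return None if the import is allowed, else an error that includes the fix."""
    src, dst = importer.split(".")[0], imported.split(".")[0]
    if src not in RANK or dst not in RANK:
        return None  # third-party or unlayered module: not this rule's concern
    if RANK[dst] <= RANK[src]:
        return None  # same layer or an earlier one: allowed direction
    return (
        f"{importer} (layer '{src}') must not import {imported} (layer '{dst}'). "
        f"Allowed direction: {' -> '.join(LAYERS)}. "
        "Fix: depend on an earlier layer, or move this through a provider."
    )


def lint_source(module_name: str, source: str) -> list:
    """Check every absolute import in one module's source text."""
    errors = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]
        else:
            continue
        for target in targets:
            message = check_import(module_name, target)
            if message:
                errors.append(f"line {node.lineno}: {message}")
    return errors
```

The design point is in the return value: the string an agent reads on failure already states the allowed direction and a next step, so the rejection doubles as the next instruction.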

Leo: Oh, that's clever. The error message *becomes context*. The agent fails, reads the rejection, and the rejection is already the next instruction.

Maya: The guardrail talks back. And — this is the agent-first part — the linters that enforce all this? The agents wrote those too.

Leo: [chuckles] Of course they did.

Maya: The Mirror next. How does an agent know its code actually works — not compiles, *works*?

Leo: This is the T3E5 problem again. The agent that grades its own broken game "fully playable."

Maya: Same demon, bigger room. Their answer was to give the agent senses. They plugged the browser's developer protocol straight into the agent's runtime. So now the agent can launch the app in its own isolated instance, click through it, snapshot the screen before and after, read the runtime logs, query the metrics.

Leo: So it's not grading a screenshot. It's using the thing.

Maya: It's using the thing. And then you give it hard, mechanical thresholds it can check itself against. One they mention — a service had to start in under eight hundred milliseconds. Not "feels fast." A number a machine can verify.

Leo: A test it can't argue with.

Maya: Right — and that flips the whole problem. The agent doesn't need taste. It needs a signal. Tests pass, types check, build's green, page actually responds. Give it that loop — act, check against concrete criteria, feed the result back, repeat — and it'll just grind. They had single runs going six hours straight. Overnight. While people slept.
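
[Editor's sketch] The loop described here (act, check against a concrete criterion, feed the result back, repeat) fits in a few lines. The Python below is a toy model, not the real harness: `attempt` stands in for the agent producing a candidate given the previous feedback, `measure` stands in for whatever actually times the service, and both names, along with `verify_loop` and the round limit, are hypothetical. Only the 800 ms figure comes from the episode.

```python
from typing import Callable, Optional, Tuple

STARTUP_BUDGET_MS = 800.0  # threshold quoted in the episode


def check_startup(elapsed_ms: float,
                  budget_ms: float = STARTUP_BUDGET_MS) -> Tuple[bool, str]:
    """Mechanical pass/fail: 'under the budget' means strictly less than it."""
    if elapsed_ms < budget_ms:
        return True, f"startup {elapsed_ms:.0f} ms is under the {budget_ms:.0f} ms budget"
    return False, (
        f"startup {elapsed_ms:.0f} ms exceeds the {budget_ms:.0f} ms budget: "
        "reduce the work done before the service reports ready"
    )


def verify_loop(attempt: Callable[[Optional[str]], object],
                measure: Callable[[object], float],
                max_rounds: int = 5) -> Tuple[bool, list]:
    """Act, check, feed the result back, repeat until the check passes
    or the round limit is reached. Returns (passed, feedback history)."""
    feedback = None
    history = []
    for _ in range(max_rounds):
        candidate = attempt(feedback)          # act, informed by the last result
        ok, feedback = check_startup(measure(candidate))  # check a hard number
        history.append(feedback)               # keep the signal for the next round
        if ok:
            return True, history
    return False, history
```

Nothing in the loop asks whether the result "feels fast"; the only judge is a number a machine can verify, which is why such a loop can run unattended.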

Leo: Okay, this is where I want to plant a flag, because I've been nodding too much. Six-hour autonomous runs, a million lines, no human code — and yet there's a catch in this piece that I think you're skating past.

Maya: Say it.

Leo: The cost. The fuller harness — the version with the planner and the separate evaluator on top — they're candid that it ran a lot more expensive than just one agent doing it solo.

Maya: It's in there.

Leo: A lot more. So I want to be honest about what we're actually claiming. We're not claiming agents made coding free. We're claiming they made it *possible* to throw a pile more compute at a problem and get something usable out the other end. Those are really different sentences.

Maya: They are, and I won't dodge it. The way they frame that extra cost is: it moved the output from unusable to usable. Without the heavy harness, you got slop. With it, you got a shippable thing. So the cost isn't buying speed — it's buying *reliability you couldn't get at any price* from the cheap version.

Leo: Fine. The reliability claim survives. But concede me the flip side: this only pencils out where compute is cheaper than the engineer you'd otherwise pay. That's true at a frontier lab with its own models. It is not obviously true for a four-person startup renting tokens.

Maya: Conceded — partially. The exact economics are theirs, their infrastructure, their models. Your startup's number will differ. But here's where I push back: the *structure* is the transferable part, not the price tag. The map, the rails, the mirror — those make agents more reliable whether you run a heavy harness or a light one. The cost ratio is a knob. The discipline is the lever.

Leo: Okay. I'll take that trade. The dollar figure is local; the four jobs probably aren't. That's a claim I can live with.

Maya: And honestly your skepticism is *in* the piece — which is the fourth job. The Janitor.

Leo: Right, the mess.

Maya: They were candid: early on, something like a fifth of every week went to cleaning up what they called the slop. Agents copying mediocre patterns, generating plausible-but-wrong structure. So they automated the cleanup. Background agents running on a schedule, scanning the codebase for deviations from the rules, and opening small refactor pull requests on their own.

Leo: Agents cleaning up after agents.

Maya: And most of those cleanup pull requests were reviewable in under a minute, because each one's tiny and targeted. The phrase they use for the whole move is the one I'd frame: human taste is captured once, then enforced continuously on every line of code.
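
[Editor's sketch] What keeps cleanup pull requests "tiny and targeted" is mostly a batching decision. The Python below illustrates one way to make it, under assumptions that are not in the source: a scanner has already produced a list of rule violations, one pull request should address a single rule, and three files per pull request is an arbitrary cap. `Violation` and `plan_cleanup_prs` are hypothetical names, and neither the scheduling nor the actual opening of pull requests is shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    path: str   # file where the deviation was found
    rule: str   # identifier of the rule that was broken
    line: int   # line number of the deviation


def plan_cleanup_prs(violations, max_files_per_pr: int = 3) -> list:
    """Group violations by rule, then split each rule's files into small
    batches so every planned pull request is single-purpose and quick to review."""
    by_rule = defaultdict(lambda: defaultdict(list))
    for v in violations:
        by_rule[v.rule][v.path].append(v.line)

    plans = []
    for rule in sorted(by_rule):
        paths = sorted(by_rule[rule])
        for start in range(0, len(paths), max_files_per_pr):
            batch = paths[start:start + max_files_per_pr]
            plans.append({
                "title": f"cleanup: fix '{rule}' in {len(batch)} file(s)",
                "rule": rule,
                "files": {p: sorted(by_rule[rule][p]) for p in batch},
            })
    return plans
```

Because each plan touches one rule and a handful of files, a reviewer only has to confirm one kind of change per pull request, which is consistent with the under-a-minute review time mentioned above.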

Leo: Say that one again slower, because that's the actual thesis hiding under the million-line headline.

Maya: Human taste is captured once — you, the engineer, decide what good looks like one time, and you encode it. As a lint rule, an architectural constraint, a threshold, a doc the agent reads. And then it's enforced continuously, automatically, on everything, forever. You're not in the loop on each line. You're in the loop on the *rules*.

Leo: So the job changed. The discipline used to live in the code — the abstractions you chose, the tests you wrote, the style you held.

Maya: And now the discipline lives in the scaffolding. The tooling, the docs structure, the feedback loops, the constraints. You stopped being the person who writes the careful line and became the person who builds the place where careful lines are the only ones that survive.

Leo: Which is — and I mean this as the compliment and the warning — basically infrastructure work.

Maya: It is. The skill being rewarded isn't typing the function. It's designing the environment so that a tireless, fast, slightly-careless worker produces good output anyway. That's not prompting. The piece is pretty blunt that the hard part was never prompting. The hard part is building the environment where agents can do reliable, sustained work.

Leo: So the punchline of the whole topic is: the crew that builds the garage matters more than any single trip into it.

Maya: That's the turn. All season we've talked about the model as the driver and the harness as the crew. This episode is the crew building the *garage* — the rails, the lifts, the diagnostic bay — so that any driver who pulls in produces a car that runs. You're not winning one race. You're making the place where every race gets won by default.

Leo: And the honest asterisk is the one I planted. That garage costs real money to run, and whether it's worth it depends on who's paying for the electricity.

Maya: Right. The discipline generalizes. The economics are local. Both true.

Leo: So let me try the close. If the agent only knows what's written in the repo — if anything outside the tree effectively doesn't exist to it —

Maya: — then the question for anyone building this way is almost uncomfortable. If you had to move everything your team knows into the open — into files, into rails, into the environment itself, where a machine could actually read it — how much of what you call expertise would survive being written down?
