T3E0 · Jun 20, 2026 · 00:19:30

Agent Systems, Harnesses, and Observability

A topic-overview map of the coding agent as a system, not a model. Maya and Leo walk four pillars — product-style async workflows that turn a task into a reviewable pull request, single-vs-multi-agent orchestration, harness engineering (context, tools, memory, validators), and observability as the flywheel that feeds debugging, evaluation, training, and safety. Built around one running example: a pit crew servicing a single race car — a payments service silently dropping one in a thousand transactions — where the model is the driver and the harness, orchestration, and telemetry are the crew that actually wins the race.

Transcript

MayaHere's something that should be impossible. Same model weights. Same task — fix a flaky test in a mid-size repo. Two teams run it. One agent finishes in twenty minutes with a clean diff. The other thrashes for an hour, edits the wrong file twice, and gives up. And nothing about the *brain* changed between them.

LeoWait, nothing?

MayaNothing in the weights. What changed was everything *around* the model — which files it was shown, what its search tool returned, whether it could remember what it already tried, whether a validator caught the bad edit before it compounded. That wrapper has a name.

LeoThe harness.

MayaThe harness. And that's the turn this whole topic makes — a coding agent is not a model. It's a *system*: a model, plus a harness, plus tools, plus memory, plus an environment, plus the logs that let you see what actually happened.

LeoCareful, though — "the system matters" is the kind of thing nobody disagrees with and nobody learns from. What's the *specific* claim?

MayaFair. The specific claim: the harness is part of capability. Hold the model fixed and you can still move the score — up or down — just by changing what it sees, remembers, and gets to validate. The harness isn't packaging around the intelligence. Some of the intelligence lives in the harness.

LeoThat I'll chew on. Because if it's true, a lot of "our new model is X percent better" is actually "our new *plumbing* is X percent better."

MayaAnd that's exactly the fight — where do the gains actually come from. Steel-man both sides before we pick it apart. The weights camp says: look at the jump between model generations. That dwarfs anything you get from tweaking the wrapper. Give a strong model a mediocre harness and it still degrades gracefully — give a weak model the best harness on earth and you can't rescue it. Weights are the lever that scales.

LeoAnd the harness camp's strongest form?

MayaHold the model completely fixed. Change only what it sees, the shape of its tools, what gets validated — and the same task swings up or down in front of you. If the wrapper can move the result, then some of the capability *is* a harness property, and calling every gain a weights gain mismeasures what actually changed.

LeoTwo real positions, both with teeth. Don't resolve it yet — let's earn it. We'll come back and actually fight it out once we've built the vocabulary.

LeoBefore we go deeper — where are we in the arc? Topic one was what the work *is*. Topic two was how we *judge* it.

MayaRight, and notice the gap that leaves. Topic two taught us to score the patch and the trajectory. But it mostly assumed the agent *runs* somehow. Topic three opens the box marked "somehow." How does a task actually become a branch, a set of commands, a diff, a pull request a human can review? That machinery is the topic.

LeoSo less "is the answer good" and more "what is the contraption that produces the answer, and how do we watch it work."

MayaThat's the frame. And the watching part is not a footnote. It's load-bearing — we'll end the whole topic on it.

LeoLet's define the load-bearing words first, plain version, because three of them get thrown around interchangeably and they're not the same.

MayaGood call. Start with *agent*. An agent is a model running in a loop: it looks at the situation, decides on an action, takes it, observes what happened, and goes again. Inspect, act, observe, repeat. The loop is the thing.

LeoAnd the *harness* — separate from the agent?

MayaThe harness is everything that decides what the loop gets to see and do. What context goes into the prompt. Which tools exist and what shape their outputs take. What gets validated before it counts. What gets written down for next time. If the agent is the player, the harness is the entire stadium, the rulebook, and the coach feeding plays.

LeoNice.

MayaThen *observability*, the one people skip. It's whether, after the run, you can reconstruct what happened — every plan, tool call, command, file read, diff, and test result — well enough to debug it, grade it, or train on it.

LeoSo a run with no observability either worked or didn't, and you'll never know why.

MayaAnd you can't learn from it. That's the quiet failure mode of this whole space.

MayaLet me plant one picture we'll keep coming back to, because abstract "systems" talk gets slippery fast. A pit crew working on one race car.

LeoOkay, the car being—

Maya—the task. One concrete job: a payments service is dropping one in a thousand transactions under load, and an agent has to find it and fix it. The *driver* is the model. But a driver alone doesn't win. There's a crew chief deciding what the driver hears over the radio, a pit wall handing over exactly the right tire and nothing else, and a telemetry feed recording every lap to replay tonight.

LeoI'm with you.

MayaCrew chief is the harness deciding context. The pit wall handing one tool at a time is tool design. Telemetry is observability. Same driver, different crew — different race. We'll bring the dropped-transaction car back at every stop.

LeoGood one, actually — that's the kind of bug you can't find by staring at code. You have to run things, watch, retry.

MayaWhich is exactly why it needs a system, not just a smart model.

MayaLet's start where it's most concrete — how production systems actually turn a task into reviewable work. This is the part that's already shipped; you can use it today.

LeoWalk me through the shape, not a feature list.

MayaThe shape is: you hand an agent a task, often the way you'd hand it to a junior engineer — an issue, a sentence, a ticket. It goes off *asynchronously*, meaning you close the laptop and it keeps working. It spins up its own workspace, reads the repo, makes changes, runs the checks, and comes back with a branch and a pull request.

LeoSo the deliverable isn't an answer in a chat window. It's a PR.

MayaThat's the crucial reframe. The unit of work is reviewable engineering output — a branch, a diff, passing or failing tests, and a log of what it did. The systems behind today's episode — GitHub's coding agent, OpenAI's Codex — both land here. Task in, pull request out, human reviews.

LeoLet me push on "asynchronously," though, because that word is doing heavy lifting. If I can't see it while it works, the only thing protecting me is what it shows me *after*. So the review surface has to be honest.

MayaAnd that's the whole bet. The async agent is only as trustworthy as the artifacts it hands back. A green checkmark with no trace behind it is worse than useless — it's a confident black box. Which is exactly why the watching part matters, and we'll close on it.

LeoYou keep teeing up that closing piece. I'm going to hold you to it.

MayaNext knob: orchestration. When you've got a big, fuzzy job, do you run one coherent agent, or do you split it — a lead agent that plans and delegates to subagents that go investigate, reproduce, research, and report back?

LeoAnd this is a real debate, not a settled thing.

MayaThis is a *real* debate. Let's actually stage it, because the strongest version of each side teaches more than a summary. I'll take the multi-agent side. You take single-agent. Go after me.

LeoHappy to. My claim: for coding, one agent should own the work end to end. The moment you split investigation across subagents, you fragment context. Subagent A learns the bug is a race condition; subagent B, who's writing the fix, never sees that insight because it lived in A's separate context window. You spend half your tokens just re-explaining the problem to yourselves.

MayaThat's the strongest objection and I'll grant the core of it — context fragmentation is the real tax. But here's the multi-agent case. Some work is *embarrassingly parallel*. "Search these forty files for every place this API is called" — that's not one deep reasoning chain, that's forty shallow lookups. A lead agent farms those out, each subagent burns its own context budget in parallel, and you get back a synthesis in a fraction of the wall-clock time. The reported wins on broad research-style tasks are large.

LeoFine — for *search*, breadth-first, I'll concede it. Fan-out beats one agent reading forty files one at a time. But that's reconnaissance, not surgery. The actual patch?

MayaAnd there I'll concede *to you*. This is the part the field has mostly converged on: investigation can parallelize, but patch ownership should centralize. One agent holds the pen on the final diff. You don't want three subagents editing the same file and merging their own conflicts.

LeoSo the resolution isn't "one versus many." It's "many eyes, one hand."

MayaThat's it exactly — many eyes, one hand. Delegate the looking, centralize the changing. And the thing that would settle the rest of the disagreement is just better numbers: on which task *shapes* does fan-out actually pay for its coordination overhead. We have anecdotes; we want curves.

LeoAnecdotes want to be curves when they grow up.

Maya[chuckle] Put that on a mug.

MayaNow the heart of the topic: harness engineering. Everything we've circled — context, tools, memory, validation — it's all knobs on the harness. And the deep idea is that those knobs are *engineerable*. You can design them deliberately, and increasingly, you can let a search process *optimize* them for you.

LeoBreak "context" down, because that's the one that sounds simple and isn't.

MayaSounds like "put the relevant stuff in the prompt." But the context window is scarce, contested space. Too little in, the agent flies blind. Too much, the signal drowns — the one important file buried under thirty irrelevant ones, and the model loses the thread. So the harness is constantly deciding: what to retrieve, summarize, drop, or carry forward.

LeoRight.

MayaMemory's the same fight across *time*. On a long task, the agent can't hold the whole history in one window, so the harness decides what gets compressed into a durable note — "already tried restarting the service, didn't help" — versus what gets forgotten. Bad memory, and the agent loops, re-trying what it tried an hour ago.

LeoHere's where I get skeptical, though. "Automatically optimized harness" sounds suspiciously like a perpetual-motion machine. You're using a search process to tune the thing that wraps the model — what stops you from just overfitting the harness to your eval set?

MayaThat's the right worry, and the honest answer is: nothing automatic. It's the same contamination trap from topic two, one level up. If you tune the harness against a fixed benchmark, you get a harness that's brilliant at that benchmark and brittle everywhere else. The meta-harness work that's coming later in the topic is exactly an attempt to optimize the harness end to end *without* that trap — and whether it generalizes is genuinely open.

LeoSo harness engineering is a real discipline, but it inherits all of evaluation's ways to fool yourself.

MayaEvery single one. A better wrapper and a wrapper that *looks* better on your dashboard are not the same object.

LeoOkay. Now we've got the vocabulary. Time for that fight you parked at the top — where the gains actually come from. And I'll take weights, because I think the harness people oversell their hand.

MayaThen I've got the harness. Come at me.

LeoSimple. Look at the curve across model generations. The jump from one to the next *dwarfs* anything you squeeze out of context tricks and tool tweaks. Hand me a frontier model with a sloppy harness and it still limps to a decent answer. Hand me last year's model wrapped in the most exquisite harness you can build —

Maya— and it still can't do the thing, yeah, I know the move. And honestly? That part survives. A weak model is a weak model; no wrapper *rescues* it. I'll grant you the floor.

LeoThank you.

MayaBut don't take the ceiling with it. Hold the model *completely* fixed — same weights, same everything — and just change what it sees, what its tools return, what gets validated before an edit counts. The score on the *same task* moves. Up and down, in front of you. If the wrapper can swing the result with the brain held constant, then some of the capability is living in the harness. Full stop.

LeoFine — fine. That demo's real, I've watched it happen. The swing exists.

MayaSo?

LeoSo I'll move. Not all the way — the big *jumps* are still weights, I won't give that up. But "the harness is just plumbing"? That I retract. It's not packaging. It's part of the engine.

MayaAnd I'll meet you there. The weights set the ceiling; the harness decides how much of that ceiling you actually reach on a given task. Neither one alone is the whole number.

LeoSo the headline "our new model is X percent better" is doing two jobs at once and hiding the seam.

MayaWhich is the part that's genuinely unsettled — and what would settle it is boring and specific: hold the model fixed, vary only the harness, and *measure* the swing across many task shapes. Until someone runs that cleanly, we're both arguing from anecdotes.

LeoAnd until then we agree on the shape, not the split. Ceiling from weights, reach from the harness — measure the rest later.

LeoOkay. The closing piece. You've promised it about four times now.

Maya[laugh] And here it is. Observability is the discipline of recording the run so completely that someone who wasn't there can replay it. Run identifier, the plan, every tool call, every command and its output, every file read, every diff, every test result, every reviewer event.

LeoWhy does that matter so much that you'd put it on equal footing with the rest, and not call it a logging detail?

MayaBecause of who *uses* the trace — four teams from one recording. The engineer debugging *why* it failed today. The eval team scoring trajectory quality, the way topic two needed. The training team turning successful and failed runs into data. The safety team checking it didn't run something it shouldn't have.

LeoSo the trace is almost the product. The patch is one output; the recording is the asset that keeps paying.

MayaThat's the reframe I want you to walk away with. A coding agent that produces great patches and no usable trace is a one-time win. A coding agent with rich, structured traces is a *flywheel* — every run, success or failure, becomes fuel for the next model and the next harness.

LeoCounterpoint, and it's a real cost: traces are heavy. Full observability on every run is a lot of storage, a lot of plumbing, and a lot of sensitive data — private code, secrets, customer context flowing through the logs.

MayaCompletely fair, and that's a genuine tension, not a solved one. Rich traces are a training goldmine *and* a governance liability in the same breath. The teams doing this well treat the trace schema as a first-class design problem — what to capture, what to redact, what to keep — not an afterthought. That governance thread is where this whole series eventually lands.

MayaBefore the close, let's nail down the words, fast and plain, so they're solid when the deep dives use them.

LeoHarness means everything around the model that decides what it sees, remembers, and can do.

MayaTrajectory means the recorded path of a run — searches, reads, commands, edits, retries — not just the final diff.

LeoAsynchronous agent means an agent you hand a task and walk away from; it works on its own and returns a finished branch.

MayaOrchestration means how work is split between a lead agent and subagents — who plans, who investigates, who holds the pen.

LeoContext window means the limited space of text the model can attend to at once — the scarce desk it has to work on.

MayaMemory means what the harness chooses to carry forward across steps so the agent doesn't repeat itself.

LeoObservability means how completely you can reconstruct a run afterward, for debugging, evaluation, training, or safety.

MayaPatch ownership means which single agent is responsible for the final diff, even when investigation was shared out.

MayaSo the map for the rest of the topic: we'll watch the shipped product workflows up close — GitHub's coding agent, then Codex. We'll dig into Anthropic's and OpenAI's writing on building multi-agent systems and effective harnesses. And we'll end in the deep water — the meta-harness paper, the harness problem, harness engineering as a named discipline.

LeoFour pillars to carry in: the product workflow, the one-versus-many question, the harness knobs, and the trace that watches it all.

MayaSame pit crew the whole way. The driver barely changes. The crew is where the race is won.

LeoWhich leaves the question I keep landing on. If you could only upgrade *one* thing on a coding agent for the next year — the model in the driver's seat, or the crew around it — and you had to live with that choice across every task your team throws at it... which one actually moves more races?

Source material

← Back to Agentic Coding Capability: From Coding Models to Coding Agents