SE0 · Series · May 30, 2026 · 00:09:06

Series Overview — The New Data Unit for Coding Agents

The series overview lays out the shared mental models and expert disagreements that recur throughout the podcast.

Transcript

Maya: Here is the whole series in one sentence: the future of coding-data work is not just more code. It is better records of software work.

Leo: That's a big claim. More code has been the default answer for a long time.

Maya: It has. And more code still matters. But when we talk about agentic coding, we're talking about something larger than code completion. We're talking about systems that can work through software tasks by using tools.

Leo: So let's set the north star.

Maya: The north star of this podcast is: what data, evaluation signals, harnesses, and workflows help training teams improve agentic coding capability?

Leo: And by "agentic coding capability," we mean…

Maya: The ability to take a software task, operate inside a repository and environment, make decisions, use tools, observe feedback, revise the plan, and produce reviewable work.

Leo: Let me compare two tasks. First: "Write a Python function that checks whether a string is empty." Second: "Our checkout flow accepts blank shipping addresses. Fix it, update tests, and explain the change."

Maya: Perfect contrast. The first task can be answered in a snippet. The second task requires repository-level work. The agent may need to inspect the frontend form, backend API, shared validation helper, test suite, and maybe even configuration. It has to run commands and interpret failures.

Leo: That's our first shared mental model: agentic coding is software work in an environment.

Maya: Exactly. The old coding-model frame was: prompt to code answer. The agentic coding frame is: task to repo to environment to tool-using trajectory to verifier signal to reviewed patch.

Leo: Trajectory is one of those words we'll use a lot. Define it slowly.

Maya: A trajectory is the path of an attempt. It includes what the agent searched for, what files it opened, what commands it ran, what outputs it saw, what edits it made, what tests failed, what it tried next, and what final patch it produced.
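
The definition above can be sketched as a small data structure. This is only an illustration, not a format described in the episode: the names `Step`, `Trajectory`, `record`, and `failed_tests`, the step kinds, and the example file paths and commands are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    kind: str                   # assumed kinds: "search", "open_file", "command", "edit", "test"
    detail: str                 # the query, file path, or command involved
    output: str = ""            # what the agent observed after the step
    ok: Optional[bool] = None   # pass/fail for commands and tests, None otherwise


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    final_patch: str = ""

    def record(self, kind: str, detail: str, output: str = "", ok: Optional[bool] = None) -> None:
        """Append one step to the work log, in the order it happened."""
        self.steps.append(Step(kind, detail, output, ok))

    def failed_tests(self) -> List[Step]:
        """Return the test steps that failed at some point during the attempt."""
        return [s for s in self.steps if s.kind == "test" and s.ok is False]


# Hypothetical attempt at the blank-shipping-address task.
traj = Trajectory(task="Reject blank shipping addresses")
traj.record("search", "shipping address")
traj.record("open_file", "validation.py")
traj.record("test", "pytest tests/test_validation.py", output="1 failed", ok=False)
traj.record("edit", "validation.py")
traj.record("test", "pytest tests/test_validation.py", output="all passed", ok=True)
traj.final_patch = "<diff placeholder>"
```

Keeping the steps in order is the point of the sketch: the failed test and the edit that followed it stay visible even after the final patch passes.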

Leo: So the trajectory is basically the work log.

Maya: Yes. And that leads to the second mental model: we care about the path, not only the patch. A final patch can pass because the agent reasoned well. Or it can pass because the visible tests were weak. Or it can fail even though the agent had a good process but missed one edge case. If we only store final diffs, we lose the most teachable part of the work.

Leo: Give me an example of two agents with the same final result but different training value.

Maya: Agent A searches for "shipping address," opens the validation helper, checks existing tests, adds a minimal rule for empty and whitespace-only strings, runs targeted tests, then runs the broader suite. Agent B edits random files, breaks a test, reverts something, accidentally lands on a passing patch, and gives a vague summary.

Leo: Both might pass.

Maya: Right. But Agent A teaches a disciplined workflow. Agent B teaches us about failure modes. Both traces are useful, but for different reasons.

Leo: Mental model three: the harness is part of capability.

Maya: Yes. The model is not the whole system. The harness decides what the model sees, what tools it can use, how files are displayed, how edits are applied, how terminal output is summarized, what memory is kept, and how progress is logged.
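
To make "the harness decides what the model sees" concrete, here is a minimal sketch of two harness choices: a configuration object and one way terminal output might be shortened before the model reads it. The names `HarnessConfig` and `summarize_output`, the default values, and the head-and-tail truncation strategy are assumptions for illustration, not a description of any particular harness.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HarnessConfig:
    allowed_tools: Tuple[str, ...] = ("search", "view_file", "edit", "run")  # what tools it can use
    max_view_lines: int = 100       # how much of a file is displayed at once
    max_output_chars: int = 2000    # budget for terminal output shown to the model
    keep_last_steps: int = 20       # how much history is kept as memory


def summarize_output(text: str, cfg: HarnessConfig) -> str:
    """Keep the head and tail of long terminal output and drop the middle."""
    if len(text) <= cfg.max_output_chars:
        return text
    half = cfg.max_output_chars // 2
    omitted = len(text) - 2 * half
    head = text[:half]
    tail = text[len(text) - half:]  # explicit index so half == 0 yields an empty tail
    return f"{head}\n[... {omitted} chars omitted ...]\n{tail}"


cfg = HarnessConfig(max_output_chars=10)
shortened = summarize_output("a" * 100, cfg)
```

Two harnesses that differ only in `max_output_chars` would show the same model different evidence about the same failing test, which is one way the same model can behave like a different agent.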

Leo: Same model, different harness, different agent.

Maya: Exactly. That is why this series spends time on interfaces, product workflows, and harness engineering. A model with clumsy file access may look confused. The same model with good search, scoped file viewing, reliable edits, and useful feedback may look much more careful.

Leo: Mental model four: evaluation is not one number.

Maya: A leaderboard score matters, but it is not the whole story. Evaluation should ask: did the task pass? Were hidden tests strong? Was the patch maintainable? Did the agent use tools well? Did it recover from errors? Did it avoid unsafe commands? Would a human reviewer merge it?
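
The questions above could be kept as separate fields rather than collapsed into one score. A minimal sketch, assuming each question is answered with a simple yes or no; the class name `EvalReport`, its field names, and `failed_dimensions` are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class EvalReport:
    task_passed: bool
    hidden_tests_strong: bool
    patch_maintainable: bool
    used_tools_well: bool
    recovered_from_errors: bool
    avoided_unsafe_commands: bool
    reviewer_would_merge: bool

    def failed_dimensions(self) -> List[str]:
        """List every dimension answered 'no', in field order."""
        return [name for name, ok in asdict(self).items() if not ok]


# A run that passes its task but would still worry a reviewer.
report = EvalReport(
    task_passed=True,
    hidden_tests_strong=False,
    patch_maintainable=True,
    used_tools_well=True,
    recovered_from_errors=True,
    avoided_unsafe_commands=True,
    reviewer_would_merge=False,
)
print(report.failed_dimensions())
```

A single pass/fail number would record this run as a success; the per-dimension view shows that the tests were weak and that a human would not merge the patch.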

Leo: That helps explain why the series has both benchmark episodes and verifier episodes.

Maya: Yes. Benchmarks give shared measurements. Verifiers give signals. Human reviews add judgment. Tests add concrete feedback. Each is useful, and each can fail.

Leo: Mental model five: data for agentic coding is replayable engineering work.

Maya: The best training-data record is not just the prompt and final code. It looks more like: task, repo snapshot, environment, public tests, hidden tests, agent trajectory, final patch, outcome, reviewer notes, failure labels, safety flags.
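
The field list above maps naturally onto a record type. The sketch below is one possible shape under stated assumptions: the class name `AgentRecord`, the choice of strings and lists for each field, the JSON serialization, and every example value (commit id, image tag, test paths, labels) are hypothetical placeholders.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class AgentRecord:
    task: str
    repo_snapshot: str          # assumed: an identifier such as a commit hash
    environment: str            # assumed: an identifier such as a container image tag
    public_tests: List[str]
    hidden_tests: List[str]
    trajectory: List[dict]      # ordered steps of the attempt
    final_patch: str
    outcome: str                # assumed values: "pass" or "fail"
    reviewer_notes: str = ""
    failure_labels: List[str] = field(default_factory=list)
    safety_flags: List[str] = field(default_factory=list)


record = AgentRecord(
    task="Reject blank shipping addresses",
    repo_snapshot="abc123",                      # placeholder commit id
    environment="example-image:latest",          # placeholder image tag
    public_tests=["tests/test_validation.py"],
    hidden_tests=["hidden/test_whitespace.py"],
    trajectory=[{"kind": "test", "detail": "pytest", "ok": False}],
    final_patch="<diff placeholder>",
    outcome="fail",
    reviewer_notes="Good process, missed whitespace-only input.",
    failure_labels=["missed_edge_case"],
)

# One JSON line per record keeps the data easy to store and reload.
line = json.dumps(asdict(record))
restored = AgentRecord(**json.loads(line))
```

Storing the repo snapshot and environment alongside the trajectory is what would make such a record replayable rather than just readable.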

Leo: That sounds heavier to collect.

Maya: It is. But it is also more valuable. If a training team wants an agent that can recover from failed tests, they need examples of failed tests and recovery. If they want safer agents, they need unsafe-command examples and safe alternatives. If they want better repo navigation, they need traces of file search and localization.
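
As a rough illustration of selecting data for one of these goals, the sketch below checks whether a trajectory contains a failed test followed later by a passing one. The step format (dicts with `kind` and `ok` keys) and the function name `shows_recovery` are assumptions; a real pipeline would likely need a stricter definition of recovery.

```python
from typing import Iterable


def shows_recovery(steps: Iterable[dict]) -> bool:
    """Return True if a failed test step is followed later by a passing test step."""
    seen_failure = False
    for step in steps:
        if step.get("kind") != "test":
            continue  # only test steps count toward recovery
        if step.get("ok") is False:
            seen_failure = True
        elif step.get("ok") is True and seen_failure:
            return True
    return False


recovered = [
    {"kind": "test", "ok": False},
    {"kind": "edit"},
    {"kind": "test", "ok": True},
]
first_try = [{"kind": "edit"}, {"kind": "test", "ok": True}]
never_fixed = [{"kind": "test", "ok": False}, {"kind": "edit"}]
```

Here `recovered` would be kept as a recovery example, while `first_try` and `never_fixed` would not. None of the three could be told apart from final diffs alone.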

Leo: Now let's talk disagreements. Because this field is not settled.

Maya: First disagreement: are better coding agents mostly about better model weights, or better harnesses? The model-weight side says stronger base models generalize across tools and tasks. If the model can reason better, it will need less scaffolding. The harness side says the surrounding system still matters enormously. Tools, memory, context construction, validation, and logs can change behavior even with the same model.

Leo: Second disagreement: single agent or multi-agent?

Maya: Single-agent advocates argue it is easier to debug one worker. There is less coordination overhead and fewer conflicting edits. Multi-agent advocates argue that investigation can be parallelized. One subagent can reproduce a bug, another can inspect docs, another can look at recent commits, and a lead agent can synthesize.

Leo: Third disagreement: tests or LLM verifiers?

Maya: The test-first side says executable tests are objective and concrete. The verifier side says many important qualities are not captured by tests: minimality, maintainability, whether the patch matches the request, whether the explanation is honest.

Leo: Fourth: human data or synthetic data?

Maya: Human traces are realistic: real ambiguity, real review comments, real project constraints. Synthetic tasks scale better and can be built with known labels, controlled environments, and systematic coverage. Most serious programs will probably need both.

Leo: And this podcast is not going to pretend every answer is obvious.

Maya: Exactly. We'll give you the mental models, the evidence, and the practical questions to ask.

Leo: Let's preview the path.

Maya: Topic 1 builds foundations: what makes coding agentic, why repositories matter, what the Agent-Computer Interface is, and why trajectories become training data. Topic 2 is evaluation: benchmarks, hidden tests, leaderboards, and practical rubrics. Topic 3 is product workflows and observability: cloud agents, branches, logs, review, and multi-agent delegation. Topic 4 is harness engineering: how the system around the model changes capability. Topic 5 is execution and world-model-style training signals. Topic 6 studies prior experiments that improved coding models and coding agents. Topic 7 turns those lessons into data products for training teams. Topic 8 covers reliability, weak tests, reward hacking, verifiers, and safety. Topic 9 is the capstone: what a real agentic coding data pack should contain.

Leo: So the series starts with concepts and ends with deliverables.

Maya: Yes. We want listeners to leave with a practical answer to a practical question: if your job is to help model-training teams improve agentic coding, what should you collect, label, evaluate, and preserve?

Leo: Let's close with the reflection question.

Maya: When you look at a coding-agent demo, what would you rather see: the final patch only, or the full path the agent took to get there? And what would that path tell you that the patch alone cannot?
