Transcript
MayaA bug report lands on a team's desk: customers can submit an empty shipping address, but only when they use express checkout.
LeoThat sounds like a small validation bug until someone asks where the rule actually lives.
MayaExactly. It might be in the React form, a backend endpoint, a shared schema, a payment integration, or a test fixture that hides the real behavior. That is the opening move for Topic 1.
LeoSo this is the foundation before we get to leaderboards and training data.
MayaRight. Agentic coding means a model is not only producing a code answer. It is working a software task through a repository, tools, environment feedback, tests, and eventually a reviewable patch.
LeoPlain language version: it has to behave more like a junior engineer in a repo than a calculator for code snippets.
MayaGood. And we are going to use that express-checkout bug as the reusable example. The agent must find the right files, understand the current validation pattern, edit safely, run checks, recover from failure, and leave a trail another engineer can inspect.
LeoGive me the map for the topic.
MayaThe first landmark is the Task Door. A completion model answers the text in front of it. A coding agent walks through a task door into a messy workspace. The request is incomplete, the repo is large, and the correct fix may require discovering hidden context.
LeoIn our checkout example, the task door says, "Blank express addresses get accepted." It does not say, "Open `address_schema.py` and add this exact condition."
MayaExactly. The second landmark is the Repository Map. Repository-level work is harder because the answer is distributed. The agent needs search, file reading, dependency clues, tests, and conventions. A local edit can break a distant caller.
LeoSo a model can be brilliant at writing a validator and still fail if it patches the wrong layer.
MayaThat is the core distinction. The third landmark is the Feedback Loop. The agent acts, observes, and revises. It runs tests, reads command output, notices lint failures, and changes plan. Without feedback, it is guessing.
LeoThis is where failed attempts become useful instead of embarrassing.
MayaYes. The fourth landmark is the Trace Ledger. The final patch is only one artifact. The trajectory records searches, opened files, commands, outputs, edits, retries, tests, and outcomes. For training and evaluation, that ledger can reveal why a patch worked or failed.
LeoSo the topic's big claim is not "more code is better data." It is "replayable engineering work is better data."
MayaPrecisely. The final curriculum states the higher-value unit as a task, repo snapshot, executable environment, trajectory, verifier signal, human review, failure labels, and governance metadata.
LeoWhere do serious experts agree?
MayaThey usually share three mental models. One: software work is situated. The same code change can be right in one repo and wrong in another. Two: tools shape behavior. Search, file viewers, terminals, validators, and error messages change how an agent thinks. Three: evaluation has to look at both outcome and process, because a lucky patch and a careful patch can produce the same green test.
LeoThat last one matters for training. If we only keep the green final diff, we lose the path.
MayaExactly. Now, experts disagree on what deserves the most investment. One debate is model weights versus harness. The model-first camp says the durable gains come from stronger reasoning, better code understanding, longer context, and more robust planning inside the model.
LeoTheir strongest argument is portability. A stronger model helps across tools, repos, and products.
MayaYes. The harness-first camp says the same model can look careful or chaotic depending on the interface. If the agent gets poor search, confusing file views, noisy terminal output, and weak validation, capability leaks away before it becomes action.
LeoTheir strongest argument is operational. Software work is interactive, so the interface is part of the intelligence on display.
MayaA second debate is clean success data versus full trajectories. The clean-data camp wants successful patches, because they are easier to supervise and less likely to teach bad moves.
LeoTheir strongest argument is signal quality. Show the model what good work looks like.
MayaThe trajectory camp says failures are where agentic skill becomes visible. Mislocalization, bad assumptions, broken commands, overbroad edits, and recovery attempts are exactly the behaviors future systems need to learn from.
LeoTheir strongest argument is realism. Real agents do not go straight from issue to perfect patch.
MayaThe primary sources for this topic anchor those disagreements. SWE-bench made repository-level issue resolution concrete. SWE-agent argued that an Agent-Computer Interface can change performance. Aider's benchmarks show the importance of edit formats, test feedback, and second attempts. GitHub Copilot's cloud-agent docs show how agent work becomes a product workflow with branches, logs, pull requests, and human review.
LeoAnd before those source episodes, we have three concept episodes: the shift from generation to agentic work, the Agent-Computer Interface, and trajectories as data.
MayaRight. Those concept episodes give the vocabulary. The source episodes show the vocabulary in actual systems and benchmarks.
LeoLet's tie it back to the checkout bug. What would a good agentic run look like?
MayaIt starts by clarifying the task from the issue. It searches for express checkout and address validation. It reads the smallest relevant files before editing. It identifies whether the shared schema, backend validator, or frontend form is authoritative. It patches the rule with minimal blast radius. It adds or updates a test. It runs focused checks, then broader checks if needed. If a test fails, it studies the failure instead of thrashing. Finally, it summarizes the change and leaves enough evidence for review.
LeoAnd the trace ledger would preserve the run, not just the diff.
MayaYes. That is why Topic 1 is the base layer for the rest of the series. Evaluation, harnesses, training signals, data products, and safety all depend on this frame.
LeoThe listener should leave able to watch a coding-agent demo and separate the surface magic from the actual work system.
MayaBefore we close, let's make the vocabulary portable.
LeoAgentic coding means model-driven software work that moves through a repository, tools, feedback, and reviewable output.
MayaHarness means the surrounding system of prompts, tools, context, permissions, validators, and logs that lets the model act.
LeoAgent-Computer Interface means the workbench the model uses to search, read files, edit, run commands, and observe results.
MayaTrajectory means the recorded path of an attempt: actions, observations, edits, tests, failures, retries, and outcome.
LeoVerification signal means the environment's evidence about whether the change behaved as intended.
MayaReview handoff means the branch, diff, summary, test evidence, and context a human uses to judge the work.
LeoGovernance means the rules for what traces can be stored, reused, trained on, shared, or kept eval-only.
MayaExactly. There is one more reason the foundation matters: the same words can hide different systems. "The agent fixed the bug" might mean the model guessed a one-line patch. It might mean a harness retrieved the right file and the model edited it. It might mean the system ran tests, failed, retried, and produced a clean pull request.
LeoSame headline, very different evidence.
MayaRight. Topic 1 gives us a vocabulary for asking better questions. What was the task? What repository state was used? What tools were available? What feedback did the agent observe? What was logged? What did a human review?
LeoIt also keeps us from over-crediting a single component.
MayaYes. If the checkout fix succeeds because the harness always opens the right schema file, that is not the same as the model independently discovering the schema. If the fix succeeds because hidden tests are strong, that is different from succeeding against a thin visible test.
LeoSo the expert habit is attribution discipline.
MayaExactly. Attribute gains to the right layer: model, interface, retrieval, execution environment, verifier, human review, or data collection. That habit is going to matter throughout the series.
LeoWhat would a weak Topic 1 understanding sound like?
MayaIt would sound like "the model is good at coding" or "the benchmark score went up" without asking what kind of coding, what tools, what checks, what repository context, and what failure modes were involved.
LeoAnd a strong understanding?
MayaIt sounds like, "This system improved on repository-level issue tasks because its file search, edit guardrails, and retry loop improved the trajectory, but we still need review labels to judge maintainability."
LeoThat is a much more useful sentence.
MayaIt also gives product teams a sharper acceptance test. A useful agentic system should make the work easier to inspect, not harder. If the demo only shows a polished final diff, the team still has to ask where the evidence lives.
LeoEvidence like the task, the repo state, the commands, the failed checks, the final tests, and the review handoff.
MayaExactly. That evidence is what turns agentic coding from a magic trick into an engineering practice. It lets teams debug failures, compare systems, build training sets, and decide where humans need to stay in the loop.
LeoSo Topic 1 is the vocabulary for responsible optimism.
MayaYes. It lets us be impressed by real capability without losing sight of the work system that produced it.
LeoIt is the sentence this topic is trying to make possible.
MayaWhen you watch a coding assistant handle a real repository, which part of the work is the model doing, which part is the harness doing, and which part is only visible because the trajectory was logged?
Source material
← Back to Agentic Coding Capability: From Coding Models to Coding Agents