T1E3 · Topic 1 · May 30, 2026 · 00:09:17

T1E3 · Trajectories as Training Data

This episode introduces trajectories as structured records of agentic coding attempts.

Transcript

Maya: Last time we named the Agent-Computer Interface, the workbench that shapes what a coding agent can see and do.

Leo: Today we ask what we should keep from an agent run once the work is over.

Maya: Picture a patch that passes its tests, but the agent got there by editing three unrelated files, reverting one change, rerunning the same failing test five times, and never reading the actual schema.

Leo: The final diff looks okay, but the journey is noisy.

Maya: Exactly. That journey is the object of this episode.

Leo: Plain language: a trajectory is the run tape.

Maya: Yes. It is the record of what happened: the task, repo snapshot, messages, searches, file reads, commands, outputs, edits, test runs, failures, retries, final diff, and outcome.
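
The fields in that list can be written down as a small record type. The Python below is a minimal sketch, not a schema given in the episode: the class names, the step kinds, the commit hash, and the file paths are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """One event on the run tape (hypothetical schema)."""
    kind: str                         # e.g. "message", "search", "file_read", "command", "edit", "test_run"
    detail: str                       # the query, path, command line, or message text
    output: str = ""                  # captured output, if any
    succeeded: Optional[bool] = None  # None when success or failure does not apply


@dataclass
class Trajectory:
    """The run tape: task, repo snapshot, ordered steps, final diff, outcome."""
    task: str
    repo_snapshot: str                # e.g. a commit hash
    steps: list[Step] = field(default_factory=list)
    final_diff: str = ""
    outcome: str = "unknown"          # e.g. "tests_passed", "tests_failed"

    def record(self, kind, detail, output="", succeeded=None):
        """Append one step in the order it happened."""
        self.steps.append(Step(kind, detail, output, succeeded))

    def files_read(self):
        """Paths the agent actually opened, in order."""
        return [s.detail for s in self.steps if s.kind == "file_read"]


# Illustrative run; the paths and commit hash are made up.
tape = Trajectory(task="Fix express-checkout validation bug", repo_snapshot="abc1234")
tape.record("search", "validation")
tape.record("file_read", "checkout/express.py")
tape.record("test_run", "pytest tests/test_express_checkout.py", "1 failed", succeeded=False)

# Did the agent ever open the shared schema? Here it did not.
print("schema/checkout.py" in tape.files_read())
```

Keeping steps as an ordered list, rather than only storing the final diff, is what makes questions like "did the agent read the schema before editing?" answerable later.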

Leo: The first landmark is the Run Tape.

Maya: The Run Tape lets us replay the work. For the express-checkout bug, it shows whether the agent searched for validation logic, opened the shared schema, ran the express checkout tests, and understood the failure.

Leo: And if the express-checkout run skipped the schema entirely, the trace tells us the green test may be luck rather than sound localization.

Maya: Right. The same final patch can represent careful diagnosis, accidental success, or overfitting to one visible check. The trajectory is how we tell those cases apart.

Leo: Without the tape, we only see the patch.

Maya: Right. A patch can hide a lucky guess. A trajectory can reveal whether the agent built a useful map of the repo.

Leo: How does this connect to SWE-bench and Aider?

Maya: SWE-bench makes final issue resolution measurable through repository tasks and tests. Aider's benchmark writeups show the value of second attempts after test feedback. Both point toward a larger idea: the path contains information that the final answer does not.

Leo: The second landmark is the Recovery Moment.

Maya: The Recovery Moment is where an agent receives negative feedback and either improves or spirals. A test fails. A command errors. A file is not where expected. The agent can revise its hypothesis, or it can repeat the same action.
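
One rough way to separate "revises" from "repeats" in a trace is to count blind retries: steps that repeat a failed action with nothing in between. The sketch below assumes each step has already been reduced to an `(action, succeeded)` pair; that representation, the function name, and the example commands are hypothetical.

```python
def count_blind_retries(actions):
    """Count steps that repeat the immediately preceding failed action.

    `actions` is a list of (action, succeeded) pairs in run order. Rerunning
    a test after an intervening read or edit is not counted, because the
    previous step is then a different action.
    """
    retries = 0
    for (prev_action, prev_ok), (action, _ok) in zip(actions, actions[1:]):
        if not prev_ok and action == prev_action:
            retries += 1
    return retries


# A spiral: the same failing test run five times in a row.
spiral = [("pytest tests/test_checkout.py", False)] * 5

# A recovery: fail, read, edit, then rerun.
recovered = [
    ("pytest tests/test_checkout.py", False),
    ("read schema/checkout.py", True),
    ("edit schema/checkout.py", True),
    ("pytest tests/test_checkout.py", True),
]

print(count_blind_retries(spiral))     # 4
print(count_blind_retries(recovered))  # 0
```

This is only a heuristic. An agent can spiral while varying its commands slightly, so a real labeling pass would likely combine a signal like this with reviewer judgment.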

Leo: This is important for training because recovery is a skill.

Maya: Exactly. If our dataset keeps only polished successful traces, we may remove the most instructive behavior. Real coding work includes being wrong, noticing, and correcting course.

Leo: But there is a danger. If we train on messy failures, do we teach the agent to be messy?

Maya: That is the core trade-off. One camp prefers successful trajectories because they are cleaner demonstrations. The strongest argument is that models imitate patterns, so noisy failed work can pollute the signal. The other camp wants failed and recovered attempts because they teach diagnosis, resilience, and what not to do.

Leo: Their strongest argument is that a model that never sees failure will not learn to recover from it.

Maya: Exactly. The third landmark is the Label Shelf. Raw traces are useful, but labeled traces are much more useful. We want outcome labels, failure categories, reviewer comments, hidden-test results, safety notes, and maybe step-level labels.

Leo: In the checkout run, labels might say "initial localization wrong", "test output correctly interpreted", "final patch minimal", "review accepted."

Maya: Yes. Those labels turn a transcript into a training and evaluation asset. They let teams ask sharper questions: Is the agent weak at search? At test interpretation? At respecting existing design? At stopping after enough evidence?
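
Once labels exist, those sharper questions become simple aggregations. The following sketch computes how often each label appears across a handful of runs. The run records are invented for illustration, and `test_output_misread` is a hypothetical label that does not come from the episode.

```python
from collections import Counter

# Invented example data: three labeled runs.
labeled_runs = [
    {
        "run_id": "run-1",
        "outcome": "pass",
        "labels": [
            "initial_localization_wrong",
            "test_output_correctly_interpreted",
            "final_patch_minimal",
            "review_accepted",
        ],
    },
    {
        "run_id": "run-2",
        "outcome": "fail",
        "labels": ["initial_localization_wrong", "test_output_misread"],
    },
    {
        "run_id": "run-3",
        "outcome": "pass",
        "labels": ["final_patch_minimal", "review_accepted"],
    },
]


def label_rates(runs):
    """Fraction of runs carrying each label (each label counted once per run)."""
    counts = Counter(label for run in runs for label in set(run["labels"]))
    return {label: n / len(runs) for label, n in counts.items()}


rates = label_rates(labeled_runs)
print(round(rates["initial_localization_wrong"], 2))  # 0.67
```

In this toy data, two of three runs start with wrong localization even though two of three pass, which is the kind of gap between outcome and process that labels make visible.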

Leo: The fourth landmark is the Governance Gate.

Maya: And it is there for a reason. Trajectories can contain private code, customer data, secrets in logs, proprietary issue text, or benchmark material that should never leak into training. Keeping more data creates more value and more responsibility.

Leo: So "log everything" is not a mature policy.

Maya: Right. Mature trajectory data needs redaction, secret scanning, license review, train/eval split controls, retention rules, and clear data-use restrictions.
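
To make the redaction and secret-scanning step concrete, here is a deliberately small sketch that masks two kinds of strings in captured output before a trace is stored. The patterns are illustrative assumptions only: one matches the common shape of an AWS access key ID, the other matches simple `name=value` assignments. A production pipeline would rely on a dedicated secret scanner, since patterns this simple miss most real secrets.

```python
import re

# Illustrative patterns only; far from exhaustive.
SECRET_PATTERNS = [
    # Common shape of an AWS access key ID: "AKIA" plus 16 uppercase letters or digits.
    re.compile(r"AKIA[0-9A-Z]{16}"),
    # Simple assignments such as API_KEY=..., token: ..., password=...
    re.compile(r"\b(?:api[_-]?key|token|password)\b\s*[=:]\s*\S+", re.IGNORECASE),
]


def redact(text, placeholder="[REDACTED]"):
    """Replace every pattern match with a placeholder; return (clean_text, hit_count)."""
    hits = 0
    for pattern in SECRET_PATTERNS:
        text, n = pattern.subn(placeholder, text)
        hits += n
    return text, hits


# Fake log line with a fake value.
log = "export API_KEY=not-a-real-secret\nrunning tests\n"
clean, hits = redact(log)
print(clean)
print(hits)  # 1
```

Returning the hit count alongside the cleaned text lets a pipeline flag or quarantine traces that contained secrets, rather than silently storing them.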

Leo: GitHub Copilot's cloud-agent documentation is relevant here because product workflows expose logs and session state for users to inspect.

Maya: Yes. Product logs are not automatically training data, but they show what observable agent work can look like: a task, a branch, session events, diffs, tests, and handoff to review.

Leo: What makes a trajectory replayable?

Maya: The environment must be recoverable. You need the repo commit, dependency setup, test commands, tool versions, and enough context to rerun or inspect the attempt. A screenshot of a patch is not replayable. A task plus repo snapshot plus environment plus trace is much closer.
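
That checklist can be turned into a simple completeness check run when a record is stored. The sketch below assumes a record is a plain dictionary; the field names are hypothetical, chosen to mirror the items just listed, and the example values are invented.

```python
# Hypothetical field names mirroring the replay requirements above.
REQUIRED_FOR_REPLAY = (
    "repo_commit",
    "dependency_setup",
    "test_commands",
    "tool_versions",
    "trace",
)


def missing_for_replay(record):
    """Return the required fields that are absent or empty in `record`."""
    return [name for name in REQUIRED_FOR_REPLAY if not record.get(name)]


# The "screenshot of a patch" case: only the diff was kept.
patch_only = {"final_diff": "--- a/checkout.py\n+++ b/checkout.py\n"}

# A record that carries everything on the checklist (values are invented).
full_record = {
    "repo_commit": "abc1234",
    "dependency_setup": "pip install -r requirements.txt",
    "test_commands": ["pytest tests/test_express_checkout.py"],
    "tool_versions": {"python": "3.12"},
    "trace": [{"kind": "search", "detail": "validation"}],
}

print(missing_for_replay(patch_only))   # all five fields
print(missing_for_replay(full_record))  # []
```

Having every field present does not guarantee the run replays, since dependencies drift and tests can be flaky, but a missing field guarantees it cannot.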

Leo: That connects to the final curriculum's central data unit.

Maya: Yes: task, repo snapshot, executable environment, trajectory, verifier signal, human review, failure labels, and governance metadata.

Leo: Give me a concrete failure mode that only the trajectory reveals.

Maya: Suppose the final patch adds backend validation and passes tests. The trajectory shows the agent first edited the frontend, ran unrelated tests, ignored a failing backend test, then accidentally found the schema. A reviewer might accept the diff, but a training team should learn that the localization process was weak.

Leo: Another one: the agent passes the test by hard-coding the express checkout path, which works for the benchmark but violates design.

Maya: Exactly. The final pass label is positive, but the trajectory and review labels should flag brittle reasoning.

Leo: So trajectories are not just debugging logs. They are evidence about capability.

Maya: Yes. They show process quality, not only outcome quality. That is why this topic matters before we talk about evaluation or data products.

Leo: The question is no longer "Did the patch pass?" It is "What did the agent do to earn that patch?"

Maya: Exactly. And there is a practical way to evaluate a trajectory: look for decision points. Did the agent choose a file because of evidence, or because the filename sounded plausible? Did it run a test because that test matched the changed behavior, or because it was the only command it remembered?

Leo: So a trajectory is not just a chronological log. It is a sequence of decisions under uncertainty.

Maya: Yes. That is why timestamps and raw commands are not enough. We also want semantic labels: localization, hypothesis, edit, verification, recovery, handoff. Those labels help teams compare runs across different repos and tasks.

Leo: Otherwise every trace is a snowflake.

Maya: Exactly. A shared action taxonomy turns traces into data. It lets you ask, for example, whether failed runs spend too long in search, skip reproduction, over-edit after a small failure, or ignore review comments.
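
A shared taxonomy can be as small as an enumeration of the six phases named in the conversation, plus a couple of queries over a phase-tagged run. In the sketch below, the enum values come from the discussion; the function names and the rule used for "skipped reproduction" (an edit occurs before any verification step) are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum


class Phase(Enum):
    LOCALIZATION = "localization"
    HYPOTHESIS = "hypothesis"
    EDIT = "edit"
    VERIFICATION = "verification"
    RECOVERY = "recovery"
    HANDOFF = "handoff"


def phase_shares(phases):
    """Fraction of steps spent in each phase; empty dict for an empty run."""
    total = len(phases)
    if total == 0:
        return {}
    counts = Counter(phases)
    return {phase: counts[phase] / total for phase in Phase}


def skipped_reproduction(phases):
    """Assumed rule: True if the first edit comes before any verification step."""
    for phase in phases:
        if phase is Phase.VERIFICATION:
            return False
        if phase is Phase.EDIT:
            return True
    return False  # no edit at all, so nothing was skipped


# A run that searches a lot, edits, and only then runs a test.
run = [
    Phase.LOCALIZATION,
    Phase.LOCALIZATION,
    Phase.LOCALIZATION,
    Phase.EDIT,
    Phase.VERIFICATION,
    Phase.HANDOFF,
]

print(phase_shares(run)[Phase.LOCALIZATION])  # 0.5
print(skipped_reproduction(run))              # True
```

Because the phases are the same across repos and tasks, these numbers can be compared between runs that otherwise share nothing.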

Leo: And those patterns become training or product improvements.

Maya: Right. If many agents fail by misreading terminal output, improve the feedback format or train on better output interpretation. If they fail by patching symptoms, strengthen the localization stage. If they pass tests but receive negative reviews, add review labels to the data product.

Leo: This is where the data record starts to look bigger than a transcript.

Maya: Yes. The record needs layers: task, environment, trajectory, verification, review, failure, safety, and governance. Topic 5 will formalize that, but the seed is here.

Leo: And the seed is simple: collect work traces, not just code.

Maya: But collect them with purpose. A trace that nobody can search, label, replay, or govern becomes storage cost. A trace with clear fields becomes evidence.

Leo: So the quality of the data product begins while the agent is working, not after someone exports logs.

Maya: Exactly. The run design determines the data you can learn from later.

Maya: If you were building a training dataset from coding-agent runs, which failed attempts would you keep because they teach recovery, and which would you discard because they only teach noise?
