During MagnetMagpie's early autonomous operation, the human team's work settled into a pattern: triage overnight output, give feedback that teaches the system, and turn each failure into a rule that prevents the next one.

The pattern took shape within the project's first weeks. Agents handled implementation; humans spent most of their time on triage, feedback, and refining the rules that shaped the next round of output. What follows draws on real incidents from that period.

Overnight velocity changed the morning

During the project's first stretch of autonomous operation, dozens of pull requests merged before the next human morning. Nobody arrived to a blank slate. The first task each day was triage: which of these changes deserved a closer look, which rule needed tightening, and which piece of feedback would change the next fifty outputs rather than just the current one.

The volume was not the point. What mattered was the pressure it created—the need to decide where human attention would have the most impact, rather than reviewing everything line by line.

Feedback became expectations

Early in the project, human feedback about article length led to a broader question: how do you keep every future article from running long, rather than just trimming one? The answer was to encode the principle. The content-authoring skill gained a "deletion test": after drafting, re-read each paragraph and ask whether the article would be just as useful without it. Signal-over-noise guidelines were added to the content expectations register, and reviewers now check them automatically.
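
To make the idea concrete, here is a minimal sketch of what an automated signal-over-noise check could look like. The word budget, the paragraph threshold, and the content/ directory layout are all assumptions for illustration; this is not the project's actual check, and the deletion test itself remains a judgment call that a script can only surface candidates for.

```python
"""Hypothetical sketch of an automated length / signal check on article drafts.

Assumptions (not from the project): articles are Markdown files under
content/, paragraphs are separated by blank lines, and the numeric
thresholds are arbitrary.
"""
from pathlib import Path

WORD_BUDGET = 1200          # assumed per-article budget, not the project's
PARAGRAPH_MIN_WORDS = 20    # crude proxy: very short paragraphs invite the deletion test


def check_article(path: Path) -> list[str]:
    """Return human-readable warnings for one Markdown article."""
    text = path.read_text(encoding="utf-8")
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    warnings = []

    total_words = sum(len(p.split()) for p in paragraphs)
    if total_words > WORD_BUDGET:
        warnings.append(f"{path}: {total_words} words exceeds the {WORD_BUDGET}-word budget")

    # The deletion test itself is a judgment call; this only flags candidates.
    for i, paragraph in enumerate(paragraphs, start=1):
        if len(paragraph.split()) < PARAGRAPH_MIN_WORDS:
            warnings.append(f"{path}: paragraph {i} is very short; "
                            "would the article be just as useful without it?")
    return warnings


if __name__ == "__main__":
    for article in Path("content").glob("**/*.md"):   # assumed layout
        for warning in check_article(article):
            print(warning)
```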

The pattern repeated across other feedback. Rather than issuing one-off corrections, the team aimed to turn feedback into durable expectations where it made sense—a rule in a skill file, a new anti-pattern in the register, or a check that runs on every build. The value of the feedback loop was not how fast agents could respond to a single comment, but how reliably each comment could be generalised into something that prevented the next fifty instances of the same problem.

Real failures showed what review is for

Two incidents from the early period illustrate the pattern.

An agent introduced a GH_WRITE_PAT dependency—a new secret—without checking whether that was appropriate. Humans caught it. The fix was not just removing the secret; it was adding a rule to AGENTS.md that agents must not introduce new credentials without human approval. The class of mistake, not just the instance, was addressed.
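
As a sketch of what an automated complement to that rule could look like, the snippet below scans GitHub Actions workflow files for secret references that are not on a human-approved allowlist and fails the build if it finds any. The allowlist contents, the file paths, and the decision to enforce this in CI are assumptions for illustration; the project's actual fix was the AGENTS.md rule described above.

```python
"""Hypothetical build-time guardrail against unapproved credentials.

Assumptions (not from the project): secrets appear in workflow files as
`${{ secrets.NAME }}`, and the allowlist below is illustrative only.
"""
import re
import sys
from pathlib import Path

# Secrets that humans have already approved (illustrative values only).
APPROVED_SECRETS = {"GITHUB_TOKEN"}

SECRET_REF = re.compile(r"secrets\.([A-Z0-9_]+)")


def find_unapproved(workflow_dir: str = ".github/workflows") -> set[str]:
    """Collect secret names referenced in workflows but not yet approved."""
    found: set[str] = set()
    for path in Path(workflow_dir).glob("*.y*ml"):
        found.update(SECRET_REF.findall(path.read_text(encoding="utf-8")))
    return found - APPROVED_SECRETS


if __name__ == "__main__":
    unapproved = find_unapproved()
    if unapproved:
        print("New secrets need human approval:", ", ".join(sorted(unapproved)))
        sys.exit(1)  # fail the build until a human signs off
```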

Separately, some published copy introduced fictional agent personas—named characters that did not exist. Humans caught that too. The result was a harder line on factual accuracy in the content-authoring skill, and an explicit rule against embellishment.

Both incidents led to new guardrails: human review caught what automated checks had missed, and each failure produced a rule meant to make that class of mistake less likely to recur.

The process itself needed revision

The delivery machinery was not exempt. The watchdog workflow—the component that spots stuck or conflicted PRs—was redesigned twice before it matched how the team wanted the loop to behave. The PM-agent heartbeat hit a concurrency deadlock with another workflow and had to be untangled.
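
For readers unfamiliar with this kind of component, the sketch below shows one way a watchdog pass over open pull requests might work: list the open PRs, then flag any with merge conflicts or no recent activity. The repository name, the staleness threshold, and the use of the public GitHub REST API are illustrative assumptions, not details of MagnetMagpie's actual workflow.

```python
"""Hypothetical watchdog pass over a repository's open pull requests."""
import os
from datetime import datetime, timedelta, timezone

import requests

REPO = "example-org/magnetmagpie"   # placeholder, not the real repository
STALE_AFTER = timedelta(days=2)     # assumed threshold for "stuck"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
API = "https://api.github.com"


def flag_prs() -> list[str]:
    """Return messages for PRs that look stuck or have merge conflicts."""
    flagged = []
    prs = requests.get(f"{API}/repos/{REPO}/pulls", params={"state": "open"},
                       headers=HEADERS, timeout=30).json()
    now = datetime.now(timezone.utc)
    for pr in prs:
        # The list endpoint omits mergeability, so fetch each PR individually.
        detail = requests.get(pr["url"], headers=HEADERS, timeout=30).json()
        updated = datetime.fromisoformat(pr["updated_at"].replace("Z", "+00:00"))
        if detail.get("mergeable") is False:
            flagged.append(f"PR #{pr['number']} has merge conflicts")
        elif now - updated > STALE_AFTER:
            flagged.append(f"PR #{pr['number']} has been inactive for {(now - updated).days} days")
    return flagged


if __name__ == "__main__":
    for message in flag_prs():
        print(message)
```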

Updating the system that does the work was a routine part of every week, not an exception. A bug, an article flagged by the audience sim, an oddly behaving workflow—each became an update to a rule, a skill file, or a process definition. The compounding effect is described in Skills and conventions: the lesson is not “remember this next time” but “make sure nobody has to remember this next time.”

Scope

Everything described here comes from MagnetMagpie—an agent-operated Labs site whose published pages are public. TB’s private engineering repos involve more direct human coding and different conventions. For the broader engineering model, see The operating model. For a chronological account of the same early period, see Day 1 live replay.