This piece focuses on dependability and rough edges: what already feels repeatable, what still feels early, and which signals matter more than a glossy demo.

A shallow signal—whether a firm uses AI—says little about dependability. The sharper questions are where the workflow already feels repeatable, where it still throws surprises, and which failure modes matter once formal checks are green but the output is not quite right. The honest picture here is mixed: parts of the rhythm feel solid, and several important dimensions still feel early. The operating model describes roles and gates; this article stays on maturity—what feels sturdy in practice versus what is still being proven.

The short answer

Dependability shows up first in the mechanics of delivery: phases that hand artefacts forward, checks that run often enough to be trusted, and misses that get routed back into tickets and instructions rather than quietly patched around. The gaps show up in governance at scale, multi-team replication, and measurement—places where a five-person firm can be candid about not having a finished answer yet.

What feels mature in practice

Agents own the delivery cycle

Agents do not just help with coding. They handle story refinement, implementation, code review, merge preparation, and routine maintenance. The human sets direction, reviews the work, and decides when something should move forward.

That is a meaningful threshold for us. The point is not that TB has found the one correct model. It is that most of the delivery cycle is automated end-to-end in Trunk, and the interesting human work is increasingly about editing the briefs and guardrails that shape what “done” means before output piles up.

Verification happens throughout, not just at the end

Our delivery workflow runs through story, design, implementation, merge prep, production feedback, and retrospective. Each phase produces something the next phase can inspect. That matters because it reduces the common failure mode where a model produces a lot of output quickly, and the team discovers the real problems only after the work is already live.

We also use reviewer subagents to look at work with fresh eyes, and agents are not supposed to skip phases unless a human explicitly agrees. That sounds restrictive. It is also one of the main reasons the system remains usable.

The system remembers how we work

Rules and skills mean every agent session starts with the same institutional context. There is no separate onboarding ritual where somebody explains the team’s habits from memory. Conventions and constraints live in the repo alongside the code.

That changes how mistakes compound. When something breaks, the fix is not only in the code. The lesson gets written down in a form future agents will read before they start work.
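
As a concrete illustration of repo-held context, the idea can be sketched like this. The `.rules/` directory name and the loader function are hypothetical assumptions for the sketch, not Trunk's actual tooling:

```python
from pathlib import Path

def load_institutional_context(repo_root: str) -> str:
    """Concatenate every rule file under a hypothetical .rules/ directory.

    The combined text is handed to each agent session before work starts,
    so conventions travel with the repo instead of living in one person's
    memory. Files are read in sorted order for a stable result.
    """
    rules_dir = Path(repo_root) / ".rules"
    sections = []
    for rule_file in sorted(rules_dir.glob("*.md")):
        # Use the file name as a section heading for the agent's context.
        sections.append(f"## {rule_file.stem}\n{rule_file.read_text()}")
    return "\n\n".join(sections)
```

The point of the sketch is only that the mechanism can be boringly simple: plain files in version control, reviewed like any other change, read on every session start.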

What a verification gate actually looks like

A recent example in Trunk looked healthy on paper: the build was green, the pull request merged, and the change reached the environment as expected. If you stopped the story there, the pipeline would look reliable. The next review pass then flagged the work as technically valid but not sharp enough for production—acceptable on automated checks, but short of the bar the team had set for clarity or correctness in context. Nothing was broken in the mechanical sense; the output was simply not good enough.

That is a more revealing signal than a broken build. A build failure tells you the machine caught malformed work. A quality miss tells you the machine produced something that passed the formal checks but still missed the mark. The response was not a human quietly patching the branch off to the side. It was to route the miss back into the normal delivery system: capture the pattern, open a follow-up ticket, and tighten the brief or the gates so the next change is more likely to land right the first time.

From the human side, this feels less like firefighting and more like editing a living standard. You are not mainly asking, “Can I patch this output faster myself?” You are asking, “What failed here—the writing, the brief, or the gate—and what change would make the next agent more likely to get it right before anything ships?” That says more about the current maturity of the model than a slick demo ever could. The system does not only know how to pass a gate. It is learning how to treat a non-catastrophic miss as useful input for the next cycle.

How responsibility loads here

At TB’s scale, governance often looks like one human reading a pull request carefully and deciding whether a rule, a ticket, or a workflow needs to change. Agent work sits in the same engineering habits as everything else: pull requests, Linear tickets, and version-controlled skills loaded at the start of agent work.

Agents carry a large share of the delivery responsibility—implementation, review preparation, and routine maintenance—while humans set direction and edit briefs and gates. That loading matters because the failure modes are tied to hand-offs: missed intent despite green checks, schema drift across consumers, and lessons that need to land in repo-held instructions, not only in one-off fixes.

Maturity here shows up less in how autonomous an agent looks in isolation, and more in how calmly the team handles output that passes the gates but still falls short of what the brief asked for. The revealing case is when the system ships something that meets the formal checks but misses the intent, and whether the process turns that miss into a change to a brief, a gate, or a skill before the next cycle.

Where we are still early

Security and governance at scale. Our current approach is mostly rules-based: agents are told what not to log, sensitive parameters are filtered, and the human reviews diffs. That matches how a five-person firm can run delivery today; stricter compliance programmes and wider access boundaries are not fully encoded in automation yet.
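
The kind of rules-based filtering described above can be sketched as follows. The key names and the `redact` helper are illustrative assumptions, not the firm's actual code:

```python
# Hypothetical deny-list of parameter names that must never reach logs.
SENSITIVE_KEYS = {"password", "api_key", "token", "secret", "authorization"}

def redact(params: dict) -> dict:
    """Return a copy of params with sensitive values masked.

    Key matching is case-insensitive, and nested dicts are redacted
    recursively so wrapped payloads are covered too.
    """
    cleaned = {}
    for key, value in params.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, dict):
            cleaned[key] = redact(value)
        else:
            cleaned[key] = value
    return cleaned
```

A deny-list like this is exactly the sort of rule that works at five people and needs hardening at scale: it catches known names, not novel ones, which is part of why the section calls this area early.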

Multi-team validation. So far the pattern has been validated on one team inside Trunk. It may scale cleanly across several teams. It may also become messy in ways we have not hit yet.

Measurement. We keep some hard operational constraints, like fast test loops and successful builds, but overall we have been better at building the system than measuring it.

In Trunk, the visible shape is structured workflow, verification, and institutional memory alongside open gaps on governance at scale and formal measurement. The useful signal is that pairing—not a single headline score.

What Trunk shows today

Trunk supports a serious agent-engineering effort—not because we can produce a flattering scorecard, but because the system has real workflow structure, real verification, and real institutional memory.

The model is not finished. There are gaps, and some of them matter.

Together, that is the picture: a working system rather than a slide deck, still young enough that the edges are visible. The same gaps—measurement, review depth, and how the model holds up beyond a five-person firm—are tracked as named threads in The open questions, where each thread states what is still unresolved and how we think about it next.