This article is the home for a single question: how engineering is organised at Titanium Birch—who owns delivery, where judgement enters, and how the repositories reflect that split. The other pieces under How we engineer zoom in on tooling, testing, skills, and the delivery pipeline; they assume you can jump back here when you need the full picture.
Titanium Birch is a small investment firm—five people and a rotating cast of AI agents. The engineering team builds financial tools: data pipelines, portfolio analytics, investment workflow automation.
Treating the agents as part of the engineering staff is not a metaphor, though it is still a work in progress. In the parts of the system that are working today, agents do most of the implementation work: they write code, run tests, and open pull requests. The humans set direction, define process, and review at gates.
Key takeaways
- Agents are the engineering staff, not assistants—they write code, run tests, and open PRs
- Five humans define process, set priorities, and review at gates
- Trunk is the main monorepo for production apps, with standardised layouts—standalone repos exist for specific purposes
- Documented conventions (rules + skills) do much of the onboarding work—every agent session starts with structured context
- Today the model is applied most in well-defined domains like financial data pipelines; exploration still needs humans
What took a few iterations to learn
The headline version—agents build, humans steer—is true, but it only became workable once the process spelled out boundaries, review gates, and conventions in writing rather than leaving them implicit. If an important boundary lives only in somebody’s head, agents will cross it differently from session to session. If every lesson becomes a rule, the system gets heavy and the humans spend their time feeding the machine instead of improving it.
The current model is the team’s compromise between those two failure modes. Humans still prototype the unfamiliar parts first, then turn what they learn into rules, skills, and review gates. That is slower up front than letting every agent rediscover the same pattern, but it is cheaper than paying for the same mistake over and over.
Who does what
| Area | Humans | Agents |
|---|---|---|
| Delivery workflow | Design and update the pipeline | Follow it end-to-end |
| Rules & skills | Write and maintain documentation | Read and apply every session |
| Code & testing | Occasional prototyping | Tests-first implementation in small fast loops (< 60 s test cap) |
| Code review | Approve PRs, judgement calls | Dedicated reviewer agent on every PR |
| Priorities | Own the roadmap and backlog | — |
| Maintenance | Approve proposed changes | Weekly automated dependency & doc updates |
What the humans do
The humans maintain the system that makes agents productive.
- Define the delivery workflow. Every feature follows a structured pipeline: from rough backlog item through story refinement, technical design, implementation, code review, merge prep, and production feedback (sketched after this list). The humans designed this pipeline and update it when it breaks down.
- Write the rules and skills. Agents follow documented conventions stored in version control. When an agent makes a mistake that a future agent shouldn’t repeat, the humans update the documentation. When a new tool gets adopted, the humans write the skill guide.
- Review at gates. The humans read diffs, approve PRs, and make judgement calls that require business context. They don’t rubber-stamp—they catch the things automated checks can’t.
- Set priorities. What gets built, in what order, and why. The humans own the roadmap and the backlog.
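The pipeline is easiest to picture as an ordered set of stages with human gates at fixed points. The sketch below is purely illustrative Python: the stage names come from the pipeline described above, but the representation and the choice of which stages count as human gates are assumptions, not a description of the real tooling.

```python
from enum import Enum


class Stage(Enum):
    """Delivery pipeline stages, in the order a feature moves through them."""
    BACKLOG = "rough backlog item"
    STORY_REFINEMENT = "story refinement"
    TECHNICAL_DESIGN = "technical design"
    IMPLEMENTATION = "implementation"
    CODE_REVIEW = "code review"
    MERGE_PREP = "merge prep"
    PRODUCTION_FEEDBACK = "production feedback"


# Stages where a human reads the output and approves before work continues.
# This particular set is an assumption for illustration, not the real gate list.
HUMAN_GATES = {Stage.CODE_REVIEW, Stage.MERGE_PREP}
```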
The humans rarely write code. When they do, it’s usually to prototype something that requires tight feedback loops—a new integration, a UI experiment, an unfamiliar API. Once the pattern is established, agents take over.
What the agents do
Agents handle the engineering work.
- Story refinement. An agent interviews the human to turn a vague idea into a concrete, buildable story with acceptance criteria. The agent asks specific questions, challenges unclear requirements, and produces a story document that the next agent can pick up without ambiguity.
- Implementation and testing. These aren’t separate phases; they’re one integrated loop. Tests describe intended behaviour; implementation makes it so. Wherever practical, tests are written first, the agent watches them fail, and then makes them pass, repeating in small, fast loops. The test suite has a hard constraint: it must complete in under 60 seconds, which keeps the loop tight (one way to enforce that cap is sketched after this list). Agents read the relevant AGENTS.md files for context, follow the tech stack patterns, and work within the conventions defined in the repo’s skill files.
- Code review. A dedicated reviewer agent examines every PR, using MoSCoW prioritisation to flag issues. The reviewer focuses on what automated tooling can’t catch: business logic correctness, code organisation, naming clarity, test coverage gaps.
- Maintenance. Automated weekly runs handle dependency updates, reference doc refreshes, and process doc syncing. These produce PRs for human review—the agents propose, the humans approve.
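The 60-second cap is the kind of constraint that can be enforced mechanically rather than by convention. Below is a minimal sketch of one way to do it in a pytest conftest.py; the hooks are standard pytest, but the reporting behaviour and the idea that this is how Trunk actually enforces the cap are assumptions for illustration.

```python
# conftest.py -- a minimal sketch, not the team's actual configuration.
# The 60-second budget comes from the constraint described above; how a
# breach is reported here is an illustrative assumption.
import time

SUITE_BUDGET_SECONDS = 60
_suite_start = 0.0


def pytest_sessionstart(session):
    """Record when the test session begins."""
    global _suite_start
    _suite_start = time.monotonic()


def pytest_sessionfinish(session, exitstatus):
    """Warn loudly if the whole suite ran past its time budget."""
    elapsed = time.monotonic() - _suite_start
    if elapsed > SUITE_BUDGET_SECONDS:
        print(
            f"\nWARNING: test suite took {elapsed:.1f}s, over the "
            f"{SUITE_BUDGET_SECONDS}s cap -- the feedback loop is getting slow."
        )
```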
How the repos are structured
The main monorepo is called Trunk—it holds the production apps, shared libraries, and infrastructure. Standalone repos exist for specific purposes—for example, isolating a public-facing site or a focused tool from the main monorepo. Trunk is where most engineering work happens, and its structure reflects the operating model:
Trunk/
├── apps/ # Five applications (each a uv workspace member)
├── libs/ # Shared Python libraries
├── envs/ # Terraform environments
├── .github/
│ ├── workflows/ # CI/CD pipelines
│ ├── actions/ # Reusable setup actions
│ ├── skills/ # Shared agent skills
│ └── instructions/ # Agent-specific guidelines
├── AGENTS.md # Monorepo-level agent instructions
├── pyproject.toml # uv workspace root
└── Makefile # Root-level targets
Each app follows a standard layout: its own pyproject.toml, Makefile, AGENTS.md, source code, tests, and optionally a .agents/skills/ directory for domain-specific knowledge. The consistency matters because it gives an agent a fighting chance of finding instructions, running tests, and following conventions without relearning the repo from scratch.
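As a concrete illustration, a single app’s slice of the monorepo looks roughly like this; the app name is a placeholder and the exact file set varies by app:

apps/example-app/
├── pyproject.toml          # The app’s own project config (uv workspace member)
├── Makefile                # App-level targets
├── AGENTS.md               # App-specific agent instructions
├── src/                    # Source code
├── tests/                  # Tests
└── .agents/
    └── skills/             # Optional, domain-specific knowledge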
What we’re still figuring out
The cleanest open question is where execution ends and exploration begins. Financial data pipelines, repeatable integrations, and routine product changes fit the agent-heavy model well. Novel product decisions, ambiguous requirements, and cross-cutting architecture changes still pull humans back in earlier. That boundary is real, but it is not fixed.
The second open question is how much process to encode. Every new rule can prevent a repeat mistake. Every new rule can also make the workflow slower and more brittle. The team is still tuning which lessons belong in AGENTS.md or skills, which belong in a ticket or review note, and which should stay as human judgement at the gate.
There is also a public/private boundary to manage. Titanium Birch can describe the delivery model, the repo structure, and the engineering trade-offs in public. It should not publish proprietary business logic, portfolio details, or other sensitive context. Part of operating this model in the open is deciding how to stay candid without getting careless.
Why this works (and where it doesn’t)
The model works because the problem space is well-defined. Financial data pipelines have clear inputs, outputs, and correctness criteria. The apps have well-documented business rules. The humans can express what they want in structured requirements, and agents can verify their work against automated checks.
It doesn’t work for everything. Novel design decisions, ambiguous product requirements, and cross-cutting architectural changes still need human judgement. The model handles execution well; it handles exploration less well.
The humans spend most of their time on the parts that require exploration—and then they encode the results so agents can execute.
The operating model isn’t static. The delivery workflow, the skills library, the conventions—all of it is version-controlled and all of it changes as the team learns what works. The system improves by the same mechanism that builds the product: someone writes a better process, it gets reviewed, it gets merged, and every agent reads it on the next run.
Where the model is still being shaped
This operating model is not preserved in amber. The open questions include where it is too loose, where it is too rigid, and where a human should step in earlier. That might mean tightening a review gate, simplifying a skill that has grown too bulky, or deciding that a certain class of work still needs a human prototype before agents take over.
The work is not just operating a system with agents—it is also defining what a sensible agent system looks like for a small investment firm building financial tools. The current model is real, but it is still young enough that it changes noticeably as the team learns.