Most of the leverage in testing here is not pytest syntax. It is how sharply behaviour is defined before anyone writes an assertion, and how fast the suite stays, so that checks remain inside the edit loop.

Testing here earns its keep when the intended behaviour is specific enough that a failing check means something—and when the suite is fast enough that running it is cheaper than guessing. A lot of the day-to-day work is therefore editorial: tightening definitions of done, deciding which regressions deserve a permanent guardrail, and noticing when a green build is answering the wrong question. The operating model describes how delivery is organised overall; this article stays on the testing slice of that picture.

Key takeaways

  • Tests-first means intended behaviour is decided before implementation—not written in after the code already “basically works”
  • Humans own the behaviour; agents usually own the mechanics of turning that behaviour into tests, code, and repeated green runs
  • Fast feedback is a hard requirement—in Trunk, the full test suite is kept under 60 seconds so testing stays inside the implementation loop
  • CI is a backstop, not the first moment of truth—agents are expected to run checks while they work, not wait for a pull request to discover the answer
  • The testing model is still being tuned—especially around integration coverage, exploratory work, and which checks should become permanent guardrails

Tests-first starts before pytest

The easy misunderstanding is to picture a human writing a wall of test code, tossing it over the fence, and waiting for an agent to make it pass. TB’s current workflow is more practical than that.

The first test artefact is usually a sharper definition of behaviour: a better ticket, clearer acceptance criteria, an explicit edge case, or a story document that leaves less room for guessing. In other words, the human defines what “correct” means in a form an agent can turn into executable checks. That still counts as tests-first, because the intended behaviour exists before the implementation does.

This matters because when the target is clear, iteration tends to move quickly; when the target is fuzzy, tests and outputs can drift without surfacing a crisp failure. A vague requirement produces vague tests, and vague tests produce false confidence. So a lot of the human testing contribution happens before any assertion is written. The human notices the missing edge case, the ambiguous failure mode, or the acceptance criterion that sounds plausible but would not actually prove anything useful.

In practice, that feels less like “please write more unit tests” and more like “make this requirement specific enough that the tests cannot lie”.
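
To make that concrete, here is a hypothetical sketch (the module, function, and cases are invented for this article, not taken from a TB repository) of what "specific enough that the tests cannot lie" can look like once a criterion such as "handles bad input gracefully" has been pinned down:

```python
# Hypothetical example: "handles bad input gracefully" sharpened into
# concrete, checkable behaviour. The cart module and parse_quantity function
# are invented for illustration.
import pytest

from cart import parse_quantity  # hypothetical module under test


@pytest.mark.parametrize("raw, expected", [
    ("3", 3),        # plain integer strings are accepted
    (" 3 ", 3),      # surrounding whitespace is tolerated
])
def test_valid_quantities_are_parsed(raw, expected):
    assert parse_quantity(raw) == expected


@pytest.mark.parametrize("raw", ["", "-1", "1.5", "lots"])
def test_invalid_quantities_are_rejected_explicitly(raw):
    # "Gracefully" now means a specific, observable outcome: a ValueError,
    # not a silent default that would let a vague test pass anyway.
    with pytest.raises(ValueError):
        parse_quantity(raw)
```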

What the agent does with that brief

Once the story is sharp enough, the implementing agent takes over the mechanical loop. The pattern described in the operating model, delivery workflow, and tech stack is straightforward:

  1. Read the story, design, and repo instructions.
  2. Write or update the relevant executable tests first, wherever practical.
  3. Run the checks and watch them fail for the expected reason.
  4. Change the implementation.
  5. Run the checks again.
  6. Repeat until the behaviour and the tests line up.

The important bit is not the order on paper. It is the fact that implementation and testing are treated as one loop rather than two departments. The agent is not expected to “finish the feature” and then think about validation afterwards. The checks are part of the work.
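
A minimal sketch of steps 2 and 3, with invented names: the checks exist before the code does, and the first run is expected to fail in a predictable way.

```python
# test_discounts.py -- written before discounts.py exists; the module and
# function names are invented for illustration. On the first run the suite
# fails at import, which is the "expected reason": the behaviour is
# specified, the implementation is not.
from discounts import apply_discount


def test_discount_is_capped_at_full_price():
    assert apply_discount(price=10.0, percent=150) == 0.0


def test_zero_percent_leaves_price_unchanged():
    assert apply_discount(price=10.0, percent=0) == 10.0
```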

In Trunk, that loop is designed to stay fast enough to be worth repeating. The full pytest suite has a hard under-60-second budget. That rule exists because an agent may run the suite many times in one session. If each run is slow, either the feedback loop drags or the agent starts skipping checks. Neither is acceptable.
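
How the budget is enforced is not spelled out here. As a sketch only, a conftest.py hook could at least keep the number visible on every run; the hooks are standard pytest, but this approach is an assumption rather than a description of Trunk's setup.

```python
# conftest.py -- a minimal sketch of one way to keep the suite's time budget
# visible on every run. The 60-second figure comes from the article; the
# hook-based reporting is an assumption, not Trunk's confirmed mechanism.
import time

_BUDGET_SECONDS = 60.0
_session_start = 0.0


def pytest_sessionstart(session):
    # Record when the test session begins.
    global _session_start
    _session_start = time.monotonic()


def pytest_sessionfinish(session, exitstatus):
    # Report elapsed wall-clock time and flag runs that blow the budget.
    elapsed = time.monotonic() - _session_start
    if elapsed > _BUDGET_SECONDS:
        print(f"\nWARNING: suite took {elapsed:.1f}s, "
              f"over the {_BUDGET_SECONDS:.0f}s budget")
    else:
        print(f"\nSuite finished in {elapsed:.1f}s "
              f"(budget {_BUDGET_SECONDS:.0f}s)")
```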

What counts as a test

At TB, “testing” is broader than a unit test file.

In the production engineering repos, the main automated checks are ordinary software checks: pytest suites, formatting, linting, type checking, and CI runs on pull requests. Some of that is there to prove behaviour, some to catch mechanical errors cheaply, and some to keep review attention on substance instead of style.

Smaller or specialised repositories still follow the same idea: pick automated checks that match the risks of that codebase, and run them early enough that the implementing agent can fix failures before merge. The exact harness differs by project; the habit does not.
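
As an illustration only (the specific tools named below are assumptions, not a statement of what any TB repo actually runs), the "run the checks before the PR" habit can be as small as a single local script:

```python
#!/usr/bin/env python3
# check.py -- hypothetical local "run all checks" helper. The tool choices
# (ruff, mypy) are assumptions standing in for "formatting, linting, type
# checking"; swap in whatever the repo actually uses.
import subprocess
import sys

CHECKS = [
    ["ruff", "format", "--check", "."],   # formatting
    ["ruff", "check", "."],               # linting
    ["mypy", "."],                        # type checking
    ["pytest", "-q"],                     # behaviour
]


def main() -> int:
    for cmd in CHECKS:
        print(f"$ {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            # Stop at the first failure so the fix happens before merge,
            # not in CI.
            return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```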

Tests-first at TB does not mean every repo must look identical. It means every repo needs an automated way to tell the truth about whether the change did what it was supposed to do.

What CI is actually for

CI is not supposed to be the first time anyone discovers whether the change works. It is the second line of defence.

The first line is the agent’s local loop: run the relevant checks, fix the failure, rerun, and only open a PR once the branch is already in a healthy state. The second line is CI on the PR. In Trunk, unit tests run on every PR, and integration tests run for non-draft PRs against staging databases. The job is the same: independently verify what the branch claims.
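
One common way to express that split in pytest, shown as a sketch rather than a description of Trunk's actual configuration, is a marker that the fast local loop excludes and CI opts into for non-draft PRs:

```python
# Illustrative sketch only: a marker-based unit/integration split is a common
# pytest pattern, not a confirmed detail of Trunk's CI. The staging_db
# fixture is hypothetical.
import pytest


# conftest.py fragment: register the marker so pytest does not warn about it.
def pytest_configure(config):
    config.addinivalue_line("markers", "integration: needs staging services")


# A test in the slower tier. The fast local loop runs
# `pytest -m "not integration"`; CI on non-draft PRs runs the integration
# tier against staging.
@pytest.mark.integration
def test_order_round_trips_through_staging(staging_db):
    order_id = staging_db.insert_order({"price": 3, "qty": 2})
    assert staging_db.get_order(order_id)["qty"] == 2
```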

After that, testing continues through review. A reviewer agent can still flag weak coverage or a mismatch between the stated acceptance criteria and what the tests really prove. The human reviewer then looks at the whole picture: not merely whether the checks are green, but whether the checks are pointed at the right thing. A fully green PR can still be unsatisfying if the tests missed the real risk.

That layering matters because each part catches a different failure mode. Fast local runs catch ordinary mistakes cheaply. CI catches integration and environment problems. Review catches the subtler issue where the code passes the formal checks but still solves the wrong problem.

What the human role feels like in practice

The human role in testing is not mainly to police whether anybody remembered to run the suite. The system is supposed to make that the default.

The more interesting human work is deciding what deserves a check in the first place. Which edge cases are worth spending part of the 60-second budget on? Which regression is important enough to turn into a permanent test rather than a one-off review comment? Which class of failure should block the line entirely, and which should surface as a softer signal for review?

This is also where the process stops pretending to be finished. The operating model is explicit that humans still prototype unfamiliar work before agents take over. Testing follows the same rule. If a feature is too new, ambiguous, or exploratory, the first job may be to learn the boundary manually and only then encode it. Tests-first does not remove that judgement call. It moves it earlier, where it is cheaper.

So the day-to-day experience is a bit editorial. Tightening definitions of done. Reading a failing check and asking whether it exposed the right problem. Noticing when a green build feels too reassuring and deciding the suite needs a new case. Less time performing the repetition, more time improving what the repetition is actually measuring.

Where the testing model is still evolving

Three tensions show up repeatedly.

First, there is a real trade-off between speed and coverage. The under-60-second rule keeps the loop healthy, but it also forces sharper choices about which integration scenarios belong in the default path and which belong in slower, later-stage checks.

Second, the team is still deciding how much testing knowledge should live in permanent instructions versus the current ticket. Some lessons deserve to become rules in AGENTS.md or a skill file. Others are specific to one change and would just add noise if promoted to global policy.

Third, there is the boundary between execution and exploration. Tests are excellent when the intended behaviour is already understood. They are less helpful when the real task is figuring out what “good” even looks like. TB has a working answer for that today—humans step in earlier on novel work—but that boundary is still being refined rather than declared solved.

What makes this testing culture distinctive

The testing culture feels familiar in one sense and unusual in another.

The familiar part is that automated checks matter, CI matters, and green tests are expected before merge. The unusual part is where the leverage sits. The main human contribution is not writing the implementation as fast as possible and tacking tests on afterwards. It is making the intended behaviour clear early, improving the checks that guard it, and keeping the feedback loop honest as the agent workflow evolves.

That is what tests-first looks like in an agent-driven team. Not purity, not ceremony, and not an abstract devotion to TDD. A practical division of labour: humans define what should be true, agents do the repeated work of proving and implementing it, and the system gets revised whenever those checks stop telling the truth.