This article is about tool choices—why Trunk converged on Python, dlt, dbt, DuckDB, and a handful of other defaults, and what breaks when a tool is awkward to automate or review.

The stack at Titanium Birch is intentionally boring: one primary language, a small set of data tools, automated formatters and type checkers, and infrastructure expressed as declarations rather than tribal knowledge. The goal is coherence—anyone opening a module should recognise the shape of the project, run the same commands, and get feedback in seconds rather than hours.

That bias toward predictability shows up everywhere. It is why Trunk stays mostly Python with a single lockfile, why the analytical database is in-process, and why the test suite has a hard time budget. The interesting question is not whether each choice is optimal in isolation, but whether the defaults stay legible when work is handed between people, agents, and CI without a bespoke setup ritual each time.

Key takeaways

  • The stack is designed around agent ergonomics—tools are chosen for what agents can use reliably, not for maximum feature depth
  • Trunk is Python 3.12+ with one lockfile, which keeps the environment consistent; uv workspace mode reduces unnecessary context-switching
  • Tool boundaries are intentionally simple—dlt ingests, dbt transforms, and DuckDB stores
  • Code quality checks are automated—ruff, sqlfmt, and pyright enforce standards, so reviews do not spend time on style
  • The full test suite must stay under 60 seconds—agents can test as they work instead of leaving it until the end

One language, everywhere

Trunk’s codebase is Python 3.12+. Not because Python is the best language for everything, but because, in the current workflow, agents are more reliable in Python than they are in a mixed-language stack. The data ecosystem is Python-native. And a single-language monorepo means any agent can work on any part of the system without a context switch.

This matters more than it sounds. When an agent picks up a task—fixing a data pipeline, updating an analytical model, adding a new API integration—it doesn’t need to reason about which language the module uses, which build tool applies, or which dependency manager to invoke. Everything is Python, managed with uv in workspace mode, and resolved against a single lockfile, so the working context stays consistent across the codebase.

In practice, that means any engineer—or any agent—can move across the entire codebase without a language barrier. There is no “I don’t know that language” blocking access to any part of the system.

The trade-off is deliberate. A single-language stack gives up some local optimisation. There will be cases where a specialised language or tool would be stronger in isolation. So far, the team has chosen consistency and debuggability over that extra flexibility because both humans and agents move faster when the environment is predictable.

Data flows through two tools, not five

The data pipeline has two stages: dlt (data load tool) ingests from external APIs, and dbt (data build tool) transforms raw data into clean, modelled tables. The boundary between them is deliberately clear.

In Titanium Birch's experience so far, the workflow stays steadier when each tool has a well-defined job and the boundary between ingest and transform stays explicit. dlt pipelines are Python scripts—agents write them naturally. dbt models are SQL with Jinja templating—agents handle SQL well, and the declarative structure means they can reason about what a model does without tracing imperative code paths.
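
To make that split concrete, here is a minimal sketch of a dlt ingest script. It is a generic shape rather than one of Trunk's pipelines, and the endpoint, resource, and dataset names are made up: a resource yields records from an API, and the pipeline loads them into DuckDB as a raw table for dbt to transform downstream.

import dlt
import requests

@dlt.resource(name="exchange_rates", write_disposition="append")
def exchange_rates():
    # Placeholder endpoint; a real pipeline would handle auth, paging, and retries.
    response = requests.get("https://api.example.com/rates")
    response.raise_for_status()
    yield from response.json()

pipeline = dlt.pipeline(
    pipeline_name="rates_ingest",
    destination="duckdb",
    dataset_name="raw_rates",
)
print(pipeline.run(exchange_rates()))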

The team favours dbt’s native incremental strategies over custom SQL, and inline model configuration over centralised config files. These are not just style preferences. They make the code easier for agents to work with because an agent editing a model can find the relevant logic in one place instead of piecing it together from a separate config directory.

A database with no ops burden

DuckDB is the analytical database—an in-process SQL engine that runs from a local file. There is no separate server process to manage, no connection layer to configure, and no port conflicts to sort out. Agents can run queries directly instead of starting and troubleshooting a database first.
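
A generic sketch, not Trunk's code, of what that looks like from an agent's point of view: the database is just a file on disk, opened and queried with nothing to start or configure in between.

import duckdb

# Opening (or creating) a local database file; there is no server process to manage.
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS trades (id INTEGER, notional DOUBLE)")
con.execute("INSERT INTO trades VALUES (1, 1000000)")
print(con.execute("SELECT sum(notional) FROM trades").fetchall())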

MotherDuck provides the cloud layer. Production data lives there. Staging environments get their own sandboxed databases, spun up per-PR and torn down when the PR closes. An agent working on a feature branch gets an isolated copy of the data—it can break things without affecting anything else.
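
The cloud side follows the same pattern. Assuming MotherDuck's standard connection strings (the sandbox name below is hypothetical, not Trunk's naming scheme), pointing at a per-PR database is a one-line change, with the token usually supplied through the environment rather than in code:

import duckdb

# The "md:" prefix routes the connection to MotherDuck; authentication typically
# comes from a MOTHERDUCK_TOKEN environment variable, not credentials in code.
con = duckdb.connect("md:trunk_pr_1234")  # hypothetical per-PR sandbox name
print(con.execute("SELECT current_database()").fetchone())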

This setup also removes a whole category of toil that comes with remote services and connection layers: diagnosing infrastructure failures. If the database is an in-process file, there’s no “is the server running?” debugging loop. The failure modes are simpler, and simpler failure modes mean agents recover faster.

Quality enforcement without opinions

Code quality is enforced by automated tools, not review discussions:

  • ruff handles Python linting and formatting
  • sqlfmt handles SQL and Jinja formatting
  • pyright handles type checking

Pre-commit hooks run all three before every commit. The code review process explicitly skips formatting and style—those are settled by tools, not debated by reviewers.

This is agent ergonomics at its most direct. Agents shouldn’t spend tokens arguing about line length or bracket placement. Neither should humans. The tools make those decisions once, and everyone moves on to substance.

In practice, that keeps review attention on logic, correctness, and design—the parts that still benefit most from judgement.

It is also a reaction to wasted effort. Review time should go to substance rather than to formatting or environment issues, and the same applies to agent tokens. Codifying those checks in tools is less glamorous than adding another framework, but it has cut repeated style and environment churn in review and in local iteration.

Tests that run in under a minute

Tests run with pytest. The hard constraint is that the full suite completes in under 60 seconds. This is a rule, not an aspiration.

Why 60 seconds? An agent iterating on a fix might run the test suite five or ten times in a single session. If each run takes ten minutes, the agent either waits (wasting compute) or skips testing (shipping broken code). At 60 seconds, testing becomes part of the iteration loop rather than a costly interruption.

CI runs unit tests on every PR. Integration tests run against staging databases for non-draft PRs—fast feedback on logic errors, thorough validation before merge.
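
One common way to wire that split, sketched here with a hypothetical flag rather than Trunk's actual configuration, is a conftest.py that skips integration tests unless they are explicitly requested, so the default local run stays inside the 60-second budget:

# conftest.py: keep the default run fast by making integration tests opt-in.
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "integration: tests that hit a staging database"
    )

def pytest_addoption(parser):
    # Hypothetical flag name, used only for this sketch.
    parser.addoption(
        "--run-integration", action="store_true", default=False,
        help="also run tests marked 'integration'",
    )

def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-integration"):
        return
    skip = pytest.mark.skip(reason="needs --run-integration")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)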

Fast tests are not just an agent concern. They make it more realistic for both humans and agents to run checks as they work instead of treating CI as the first real feedback.

Infrastructure as declarations

Infrastructure runs on Terraform and AWS, deployed through GitHub Actions. CI/CD workflows handle PR validation, staging deploys, production deploys, and sandbox teardown—scoped by change detection so only affected apps get tested and deployed.

Authentication uses AWS OIDC. No stored credentials anywhere. GitHub Actions assume IAM roles via federated identity.

These are standard choices, but they’re standard for a reason: declarative infrastructure and keyless auth are things agents handle well. An agent can modify a Terraform module, open a PR, and let CI validate the plan. It doesn’t need to manage SSH keys, rotate secrets, or remember which AWS account to target. The conventions do that work.

Tools that earn their place

The agent-ergonomics lens also shapes which tools get replaced. Marimo is replacing Jupyter for interactive notebooks. Jupyter notebooks are JSON blobs—hard to diff, hard to review, hard for agents to edit reliably. Marimo notebooks are pure Python scripts, version-controllable and testable like any other file. An agent edits a Marimo notebook the same way it edits application code.

Illustrative before/after. The snippets below are generic shapes, not copied from Trunk—they show what that difference looks like in practice. In Jupyter, a trivial two-line cell sits inside a JSON document; a small logic change rewrites unrelated structure around it, so diffs are noisy and easy to mis-edit:

{
  "cells": [
    {
      "cell_type": "code",
      "source": ["fee_bps = 12\n", "spread = notional * fee_bps / 10_000\n"]
    }
  ]
}

In Marimo, the same idea is ordinary Python: a reviewer (human or agent) sees a named function, imports at the top, and a patch that touches only the lines that changed. The file stays valid Python, so the usual formatter and tests apply:

import marimo

app = marimo.App()

@app.cell
def __():
    notional = 1_000_000
    fee_bps = 12
    spread = notional * fee_bps / 10_000
    return (spread,)

Playwright CLI handles browser-based testing and UI review. Agents use it to navigate pages, fill forms, and inspect accessibility—extending automated quality checks into territory that would otherwise require manual testing.
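
The CLI is what the agents drive, but the shape of such a check is easier to see in Playwright's Python API. This is a generic sketch; the URL and selectors are placeholders, not anything from Trunk.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/login")        # placeholder URL
    page.fill("#email", "reviewer@example.com")   # placeholder selectors
    page.fill("#password", "not-a-real-secret")
    page.click("text=Sign in")
    assert page.title()                           # smoke check: the page rendered
    browser.close()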

Both follow the same logic: if a tool creates friction for agents—opaque file formats, manual interaction requirements—replace it with one that doesn’t.

What we’re still tuning

Marimo replacing Jupyter is the clearest sign that the stack is still moving. The point is not a grand anti-notebook doctrine. The replacement followed the practical failure mode the before/after snippets above illustrate: opaque notebook JSON versus reviewable, agent-editable Python.

The same pattern shows up elsewhere. The 60-second test cap keeps the delivery loop healthy, but it also creates pressure about how much integration coverage fits inside that budget. The single-language rule keeps context simple, but it means any exception has to justify the cognitive load it adds. The open question is not “what is the best tool in the abstract?” It is “when is extra power worth extra complexity in an agent-heavy workflow?”

Boundaries that are still moving

Those boundaries are actively being shaped. The ongoing decisions include when a specialised tool genuinely earns the right to complicate the default, when a test is important enough to spend part of the 60-second budget, and which bits of operational friction should be removed with better tooling rather than tolerated as the cost of doing business.

That is the more interesting dimension of “the stack” here. The list of approved tools is not inherited as fixed. It stays honest by changing when the current defaults stop paying for themselves.

The coherence so far

The stack is not exotic. Python, SQL, Terraform, and GitHub Actions are all familiar tools. What stands out is that the choices are being made through the lens of an agent-heavy workflow, and that lens is still shaping the system in public.

So far, that has led to a coherent set of defaults: a single language makes it easier to move across the codebase, fast tests make it realistic to run them as part of the loop, automated formatting keeps reviews focused on substance, and declarative infrastructure reduces the amount of operational trivia anyone needs to remember. None of that makes the stack finished. It does make the trade-offs easier to inspect.

For context on the firm beyond this stack list, see titaniumbirch.com.