The articles in How we engineer describe how engineering work is organised at Titanium Birch. This page lists design tensions that are still visible in the repo and in AGENTS.md: not a backlog of feature ideas, but the places where the model's edges still show.
How do we measure whether the system is actually improving?
Some signals are binary: CI passes or fails; nav tests pass or fail. Audience Simulation runs violation-based checks against the content expectations register (docs/content-expectations.md), emitting structured PAGE_REVIEW: and PUBLIC_READINESS: lines in GitHub Actions logs when CURSOR_API_KEY is configured. Those gates shape day-to-day work.
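As a rough sketch of what consuming those gate lines downstream could look like, here is a small parser over a captured Actions log. The `PAGE_REVIEW: <page> <verdict> <detail>` shape is an assumption for illustration; this page does not document the exact payload.

```python
import re
import sys

# Hypothetical line shape: the real payload format emitted by Audience
# Simulation is not documented here, only the PAGE_REVIEW: and
# PUBLIC_READINESS: prefixes.
GATE_LINE = re.compile(r"^(PAGE_REVIEW|PUBLIC_READINESS):\s*(\S+)\s+(\S+)\s*(.*)$")

def summarise(log_lines):
    """Collect gate verdicts per page from raw CI log lines."""
    verdicts = {}
    for line in log_lines:
        match = GATE_LINE.match(line.strip())
        if match:
            check, page, verdict, detail = match.groups()
            verdicts.setdefault(page, []).append((check, verdict, detail))
    return verdicts

if __name__ == "__main__":
    # Usage: pipe a downloaded Actions log through stdin.
    for page, results in summarise(sys.stdin).items():
        failures = [r for r in results if r[1] != "pass"]
        status = "OK" if not failures else f"{len(failures)} violation(s)"
        print(f"{page}: {status}")
```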
The harder question is systemic improvement. If a skill changes, do later outputs improve in a durable way, or does the team just move variance somewhere less visible? If gates tighten, does the quality gain justify the added loop latency? TB already uses fast tests, CI, audience simulation, and explicit retrospectives; none of these fully answers the causal question. Designing feedback that is sharp, without mistaking easy counts for the whole story, remains open work.
How much review should stay with humans?
In principle: agents implement; humans define process, set priorities, and review at gates (docs/operating-model.md, AGENTS.md). In practice the boundary is fuzzy. Day one on Labs showed both sides: fast automation, plus humans catching fictional copy and a disallowed secret dependency.
Automated checks handle mechanical correctness; a reviewer agent examines pull requests; humans intervene where judgement matters. Review too much and the queue recentralises on people; review too little and errors that pass checks still ship. A live question is which categories of judgement can move into better rules, and which will stay explicit human steps for longer than anyone would prefer.
How do we stop skills from becoming a second stale codebase?
Skills are version-controlled files agents load at session start—see Skills and conventions. That compounds good lessons but also replays drift at scale if a skill disagrees with the code or workflow it describes.
Mitigations already exist: metadata, cross-references, archives, and health guidance in the agent docs. The open design space is stronger maintenance as repos multiply and exceptions get subtler: whether some guidance should sit closer to the code it describes, how far to test skills directly, and how to stay short yet precise.
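One minimal form "testing skills directly" could start as is a staleness check: flag skills whose cross-referenced files have moved or disappeared. The `skills/` layout and the backtick-path convention below are both assumptions for illustration, not documented structure.

```python
from pathlib import Path
import re

# Assumed layout: skills are markdown files under skills/, and references to
# code or docs appear as repo-relative paths in backticks. Both conventions
# are assumptions for this sketch.
REPO_ROOT = Path(".")
PATH_REF = re.compile(r"`([\w./-]+\.(?:py|md|yml|yaml|toml))`")

def stale_references(skill_file: Path) -> list[str]:
    """Return repo-relative paths a skill mentions that no longer exist."""
    text = skill_file.read_text(encoding="utf-8")
    return [ref for ref in PATH_REF.findall(text)
            if not (REPO_ROOT / ref).exists()]

for skill in sorted((REPO_ROOT / "skills").glob("**/*.md")):
    dead = stale_references(skill)
    if dead:
        print(f"{skill}: references missing files: {', '.join(dead)}")
```

A check like this only catches the mechanical half of drift; a skill can reference files that still exist while describing a workflow that no longer matches them.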
How fast does a human–agent loop need to be?
TaxToucan is TB’s internal workflow for regulatory-filing automation: humans still review uncertain extractions, and the workflow only works if that review keeps up. On the human side, early operations treat durable improvement as teaching the system, not as patching a single output. Both depend on turnaround, clarity, and proportionate effort.
TB has tried thresholds, resubmit flows, caching, explicit feedback channels, and short test loops. “Fast enough” still varies by work type: content correction, data review, and pipeline-failure triage do not share one number. When loops feel slow or opaque, reviewers report in #magpie-feedback that they route around them; the engineering problem is queue design, interfaces, and instrumentation so that feedback actually changes behaviour.
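One concrete shape the queue-design problem takes is threshold routing: auto-accept confident extractions, queue the rest for a human, and fold corrections back in. The sketch below is hypothetical throughout; the names, the queue interface, and the 0.9 threshold are illustrative, not TaxToucan's actual implementation.

```python
from dataclasses import dataclass, field
from collections import deque

# Hypothetical types: TaxToucan's real extraction and queue structures are
# not public. The threshold is an illustrative value, not a tuned one.
@dataclass
class Extraction:
    field_name: str
    value: str
    confidence: float

@dataclass
class ReviewQueue:
    threshold: float = 0.9
    pending: deque = field(default_factory=deque)
    accepted: list = field(default_factory=list)

    def submit(self, item: Extraction) -> None:
        """Auto-accept confident extractions; queue the rest for a human."""
        if item.confidence >= self.threshold:
            self.accepted.append(item)
        else:
            self.pending.append(item)

    def resolve(self, corrected_value: str) -> Extraction:
        """A reviewer corrects the oldest pending item; it re-enters as trusted."""
        item = self.pending.popleft()
        item.value = corrected_value
        item.confidence = 1.0  # human-verified
        self.accepted.append(item)
        return item

queue = ReviewQueue()
queue.submit(Extraction("filing_date", "2024-03-01", confidence=0.97))
queue.submit(Extraction("total_due", "1,20,0", confidence=0.41))
print(len(queue.accepted), "auto-accepted;", len(queue.pending), "awaiting review")
```

Even in this toy form the design choice is visible: the threshold trades reviewer load against error escape rate, which is exactly the per-work-type tuning "fast enough" demands.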
What should happen when output is technically valid but unsatisfying?
A failing test is simple: fix and reship. Weak output that passes checks is harder. Audience simulation can label an article publishable while its specificity lags; the response pattern is to feed that back as tickets, tighter briefs, or updated rules, not a silent rewrite off-repo.
The tension is how far to push metric-shaped gates before the process optimises the gate instead of the work. How to encode taste without pretending it reduces to checklists is still unresolved.
Where does the model stop working, and how should work cross that boundary?
Public material describes execution-shaped delivery in detail; open-ended exploration is described less and is more often where humans take the pen. Novel design, ambiguous requirements, and unfamiliar integrations still pull humans closer: to prototype, to refine tickets, and to decide when a phase must not be skipped.
The open part is early detection: hand off too late and agent time is wasted on bad framing; hand off too early and humans are overloaded with work agents could carry if the brief were sharper. How a feature ships forces refinement and design before implementation, but it does not remove the judgement call.
How would this model hold up beyond a five-person firm?
Titanium Birch has five humans; a small team can change rules quickly without heavy governance machinery. AGENTS.md notes that security and governance at larger scale are largely unproven here.
Foundations exist—keyless GitHub Actions auth, explicit secret policy, documented review gates—but a stricter environment might need different approval paths, audit trails, or separation of duties. That is both compliance and product design for the operating model: add governance without destroying the clarity that makes the current loop workable.
Why list these at all
These are not abstract thought experiments; each question is attached to real workflows and skills. The edges above are where the next changes to rules, gates, and measurement will matter.