A walk through the six named phases in order—what each step is for, what artefact it leaves behind for the next step, and why skipping a step tends to cost more than it saves.

Key takeaways

  • Six phases—story refinement, technical design, implementation, code review, merge prep, and production feedback—each producing a specific artefact
  • Each phase consumes the last artefact—the pipeline is sequential on purpose, because skipping a step usually costs more time than it saves
  • Concrete examples below—a composite feature walked through start to finish, with real artefact structures
  • The process is version-controlled—it changes when the team learns something, just like code

The six-phase pipeline

Features that follow the agent-led path move through the same documented delivery pipeline. The phases are sequential—each one produces an artefact that the next phase consumes. No phase gets skipped without explicit human agreement.

Six-phase delivery pipeline: Story refinement, then technical design, implementation, code review, merge prep, and production feedback, with feedback informing the next story refinement.

Here’s what each phase actually produces, illustrated through a composite example: adding a new data quality check to one of TB’s data pipelines. The details are representative, not verbatim—patterns and process are shareable, specific business logic isn’t.

Phase 1: story refinement

A human has a rough idea: “We need data quality checks on the incoming market data pipeline so bad records don’t silently corrupt downstream analytics.”

An agent picks up this vague requirement and turns it into a buildable story through structured conversation. The agent asks specific questions, challenges assumptions, and produces a story document.

What the story document looks like

## Story: Add data quality checks to market data ingestion

### Context
The market data pipeline ingests daily feeds and loads them into
the analytics database. Currently there is no validation between
ingestion and load—malformed or missing records propagate silently.

### Requirements
- Validate schema conformance on every ingested record
- Flag records with null values in required fields
- Log rejected records with rejection reason to a quarantine table
- Fail the pipeline run if rejection rate exceeds a configurable threshold (default: 5%)
- Expose rejection metrics in the pipeline's run summary

### Acceptance criteria
- [ ] Pipeline rejects records missing required fields
- [ ] Rejected records appear in quarantine table with reason
- [ ] Pipeline fails when rejection rate exceeds threshold
- [ ] Pipeline succeeds when rejection rate is below threshold
- [ ] Run summary includes rejection count and rate
- [ ] Existing passing tests still pass

### Out of scope
- Retroactive validation of historical data
- Alerting or notification on pipeline failure (separate ticket)

### Open questions (resolved during refinement)
- Q: Should we quarantine or drop bad records?
  A: Quarantine—the team wants to inspect failures.
- Q: What's the threshold for failing the run?
  A: 5% configurable via environment variable.

The key characteristics: requirements are specific enough that an implementation agent can pick this up without unnecessary ambiguity. Acceptance criteria are checkboxes that can be verified by running tests. Open questions are resolved inline—the story does not leave avoidable decisions for the implementing agent to guess at.

Phase 2: technical design

A second agent reads the story document and produces a technical design. The document is a concrete plan covering what changes, where, and how the acceptance criteria will be verified.

What the technical design looks like

## Technical Design: Data quality checks for market data

### Changes

1. **New module: `src/pipeline/validation.py`**
   - `validate_record(record: dict, schema: Schema) -> ValidationResult`
   - `ValidationResult` dataclass with `is_valid`, `errors` fields
   - Schema loaded from existing config, no new config needed

2. **Modified: `src/pipeline/ingest.py`**
   - After fetch, before load: validate each record
   - Valid records proceed to load; invalid records go to quarantine
   - Track counts for run summary

3. **New model: `models/staging/stg_quarantine.sql`**
   - Columns: record_id, source, rejection_reason, rejected_at, raw_payload
   - Incremental model—append-only

4. **Modified: `src/pipeline/summary.py`**
   - Add rejection_count and rejection_rate to run summary output

### Testing approach
- Unit tests for `validate_record` with valid, invalid, and edge-case records
- Integration test: run pipeline with a fixture containing 2% bad records (expect pass)
- Integration test: run pipeline with a fixture containing 10% bad records (expect fail)
- Existing tests unmodified—they use fixtures without bad records

### Risks
- If the schema definition is incomplete, valid records could be falsely rejected.
  Mitigation: log warnings for unknown fields rather than rejecting.

The design names specific files, functions, and test scenarios. An implementing agent reads this and has enough context to start implementation with a clear scope. A human reviewer can approve the approach before any code is written—catching design mistakes before they become code mistakes.
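To make the design's first change concrete, here is a minimal sketch of what the `validation.py` module could look like. The `Schema` shape and the specific checks are assumptions for illustration; only the `validate_record` signature and the `ValidationResult` fields come from the design document.

```python
# Illustrative sketch of src/pipeline/validation.py.
# The Schema structure is assumed: a mapping of required field
# names to their expected Python types.
from dataclasses import dataclass, field


@dataclass
class Schema:
    required_fields: dict  # field name -> expected type


@dataclass
class ValidationResult:
    is_valid: bool
    errors: list = field(default_factory=list)


def validate_record(record: dict, schema: Schema) -> ValidationResult:
    """Check one record against the schema; collect all errors rather
    than stopping at the first, so the quarantine reason is complete."""
    errors = []
    for name, expected_type in schema.required_fields.items():
        if name not in record or record[name] is None:
            errors.append(f"missing required field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(
                f"wrong type for {name}: expected {expected_type.__name__}"
            )
    return ValidationResult(is_valid=not errors, errors=errors)
```

Collecting every error in one pass (rather than failing fast) means a quarantined record carries its full rejection reason, which matches the story's requirement to log rejections for later inspection.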

Phase 3: implementation

An agent reads the story and design documents, checks the relevant AGENTS.md and skill files for conventions, and writes the code.

What this looks like in practice:

  • The agent reads the app’s AGENTS.md for project-specific context—what the app does, how to install dependencies, how to run tests.
  • It reads any relevant shared skills—in this case, the dbt-transformations skill for the quarantine model and the development-conventions skill for commit formatting.
  • It writes the code, following the patterns already established in the codebase.
  • It runs the test suite (hard cap: 60 seconds) and iterates until tests pass.
  • It opens a pull request with a structured description.
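The control flow the agent implements in `ingest.py` can be sketched roughly as follows. This is a simplified illustration, not the real codebase: the environment variable name `REJECTION_THRESHOLD`, the callable parameters, and the summary keys are assumptions; the validate-quarantine-threshold flow itself comes from the design.

```python
# Hedged sketch of the ingest-phase flow: validate each record,
# quarantine invalid ones, then fail the run if the rejection rate
# exceeds a configurable threshold (default 5%, per the story).
import os


def run_ingest(records, validate, load, quarantine):
    """Validate and load records; return rejection metrics for the
    run summary, or raise if too many records were rejected."""
    rejected = 0
    for record in records:
        result = validate(record)
        if result.is_valid:
            load(record)
        else:
            quarantine(record, "; ".join(result.errors))
            rejected += 1

    rate = rejected / len(records) if records else 0.0
    threshold = float(os.environ.get("REJECTION_THRESHOLD", "0.05"))
    if rate > threshold:
        raise RuntimeError(
            f"rejection rate {rate:.1%} exceeds threshold {threshold:.1%}"
        )
    return {"rejection_count": rejected, "rejection_rate": rate}
```

Note that the threshold comparison happens after the whole batch is processed, so the quarantine table is fully populated even on a failed run, which is what makes the failures inspectable.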

What the PR description looks like

## TBI-2847 Add data quality checks to market data ingestion

### What
Adds record-level validation to the market data pipeline. Invalid
records are quarantined with rejection reasons. Pipeline fails if
rejection rate exceeds a configurable threshold.

### Changes
- New `validation.py` module with `validate_record` function
- Modified `ingest.py` to validate before loading
- New `stg_quarantine` incremental dbt model
- Updated run summary to include rejection metrics
- 6 new tests (4 unit, 2 integration)

### Testing
All tests pass. Integration test with 2% bad records: pipeline
succeeds, bad records appear in quarantine. Integration test with
10% bad records: pipeline fails with expected error message.

### Notes
Schema validation uses the existing config schema definition.
Unknown fields log a warning but don't cause rejection—this
avoids false positives if the source adds new fields before
we update the schema.

The PR title includes the ticket number. The description follows a consistent structure that the reviewer knows how to read. The “Notes” section calls out design decisions that might not be obvious from the diff alone.

Phase 4: code review

A dedicated reviewer agent examines the PR. This isn’t the same agent that wrote the code—it’s a separate session with fresh context and a specific review focus.

The reviewer follows ten focus areas (readability, organisation, business logic, documentation, testing, security, performance, type safety, consistency, logging) and tags every comment with a MoSCoW priority.

What review comments look like

[S] Should—validation.py:42
The `validate_record` function catches all exceptions and returns
is_valid=False. This masks genuine bugs (e.g., a TypeError from
a code error) as validation failures. Consider catching only
ValueError and KeyError, and letting unexpected exceptions
propagate.

[C] Could—ingest.py:87
The quarantine insert happens inside the record loop. For large
batches, this could be slow. Consider batching quarantine inserts.
Not blocking—current batch sizes are small enough.

[W] Would—validation.py:15
`Schema` type hint is imported but could use a more specific
generic: `Schema[MarketDataRecord]` would make the expected
record type explicit.

The tags matter. The implementing agent knows that [S] items should be addressed and [C] items are optional. The human reviewer, who approves the final merge, can see at a glance whether the reviewer agent found any [M] blockers.

Phase 5: merge prep

After review comments are addressed, the implementing agent performs merge preparation—a structured check before the human gives final approval.

What the agent verifies:

  • All tests pass—the full test suite, not just the new tests
  • No merge conflicts—the branch is up to date with main
  • Review comments resolved—every [M] and [S] item addressed, with explanation
  • CI pipeline green—all automated checks pass (linting, type checking, formatting, tests)
  • Acceptance criteria met—each checkbox from the story document verified against the implementation

The human reviewer then reads the diff, checks the review conversation, and approves (or requests changes). This is the gate where human judgement matters most—the automated checks have passed, the reviewer agent has flagged issues, and the human decides whether the change is ready for production.

Phase 6: production feedback

After merge, the pipeline runs in production. The team monitors:

  • Did the pipeline succeed on real data?
  • Did the quarantine table receive the expected volume of rejected records (or an unexpected spike)?
  • Did downstream analytics behave correctly?
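The quarantine-volume check in the list above could be as simple as comparing today's rejection count against a trailing baseline. This is a toy sketch under assumptions: the function name, the spike factor, and how counts are fetched are all illustrative.

```python
# Toy sketch of a quarantine spike check: flag today's rejection
# count if it exceeds a multiple of the recent daily average.
def is_rejection_spike(daily_counts: list[int], today: int,
                       factor: float = 3.0) -> bool:
    """True if today's count is more than `factor` times the
    trailing average (floored at 1 to avoid a zero baseline)."""
    if not daily_counts:
        return False
    baseline = sum(daily_counts) / len(daily_counts)
    return today > factor * max(baseline, 1.0)
```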

If something breaks, the feedback becomes the input for the next story. A bug fix follows the same six-phase pipeline—story, design, implementation, review, merge, production. The pipeline handles its own fixes.

Why the phases are sequential

It’s tempting to skip phases for “simple” changes. The rule at TB is that no phase gets skipped without explicit human agreement. Here’s why:

Story refinement catches ambiguity. Without it, the implementing agent guesses at requirements—and guesses wrong in ways that are expensive to fix after code is written.

Technical design catches architectural mistakes. A reviewer agent examines the design before implementation begins. It’s cheaper to change a plan than to change code.

Separate code review catches blind spots. The implementing agent and the reviewer agent read the same conventions but bring different perspectives. The implementer is focused on making it work; the reviewer is focused on making it right.

Merge prep catches integration issues. Tests can pass in isolation and still fail once the branch is merged; CI catches what local testing misses.

Production feedback closes the loop. The system only improves if the team learns from production behaviour. Skipping this phase means repeating mistakes.

What this means for engineers

The delivery pipeline also illustrates how engineering effort is allocated in this model.

The human work is not mainly writing code. It is helping define the process that agents follow, reviewing at gates where judgement matters, and changing the system when it proves too loose, too rigid, or simply wrong. The story refinement conversation, the design review, and the final merge approval are the moments where human expertise most clearly shapes what happens next.

The six-phase pipeline is the current structure the team is using to make that division workable, and like the rest of the operating model, it keeps getting revised as the team learns.