Ten focus areas, the prioritisation system, and the conventions the team is using right now.

Code review at TB currently splits between machines and a reviewer agent, each handling a different part of the job. Automated tools enforce formatting, linting, type checking, and test execution. The reviewer agent handles the parts that still need judgement.

Key takeaways

  • Two-layer review—automated tools handle style; a reviewer agent handles substance
  • Ten focus areas, priority-ordered, so reviewers know where to spend time
  • MoSCoW tags on every comment—authors can see at a glance which comments are intended to block a merge, with calibration of those tags still being tuned
  • Direct, actionable feedback—concrete suggestions with code examples, no vague questions
  • Shared playbook—the reviewer agent and implementing agent follow the same conventions

What failed before this

The current playbook exists because the obvious version of agent review was too noisy. If a reviewer comments on formatting that ruff or sqlfmt already enforce, the author wastes time reading a duplicate note. If a reviewer asks an open-ended question in an asynchronous workflow, it rarely produces an actionable response. If every comment looks equally urgent, the author has to guess what actually blocks merge.

One recurring failure mode is easy to picture: a PR gets one comment about a genuine business-logic risk and three comments about lower-value clean-up, all written in the same tone, and the author has to infer which one actually matters. The tagging system exists partly to stop that kind of ambiguity.

The “What NOT to Review” section, the direct-with-suggestion style, and the MoSCoW tags are all attempts to remove those failure modes. They are process patches, not aesthetic preferences.

Division of responsibility

The code review guidelines include a “What NOT to Review” section: formatting issues, test execution failures, import ordering, and auto-generated files are off-limits. If ruff or sqlfmt can catch it, the reviewer shouldn’t mention it. This keeps reviews focused on substance.

The ten focus areas

The reviewer agent examines every PR against these areas, ordered by priority:

| # | Focus area | What the reviewer checks |
|---|---|---|
| 1 | Readability | Unclear names, complex logic, misleading comments—"can a newcomer understand this?" |
| 2 | Organisation | Duplicate code, unclear module boundaries, inappropriate coupling |
| 3 | Business logic | Does the code do what it intends? Edge cases handled? Are the expectations themselves right? |
| 4 | Documentation | Incorrect, missing, or outdated docstrings and READMEs that would mislead the next reader |
| 5 | Testing | Missing coverage, tests that pass regardless of correctness (see the sketch after this table) |
| 6 | Security | Hardcoded secrets, improper validation, potential vulnerabilities |
| 7 | Performance | N+1 queries, unnecessary loops, inefficient algorithms, memory issues |
| 8 | Type safety | Missing hints, incorrect typing, overuse of Any |
| 9 | Consistency | Adherence to existing patterns and conventions |
| 10 | Logging | Missing/excessive logging, wrong levels, lack of useful context |
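To make row 5 concrete, here is a minimal, entirely hypothetical sketch of the kind of test the reviewer flags: one that passes whether or not the code under test is correct, next to a version that pins down the expected behaviour. The function and test names are invented for illustration.

```python
# Hypothetical code under test: adds a flat fee to an order value.
def total_with_fee(order_value: float, fee: float = 1.5) -> float:
    return order_value + fee


# The kind of test the reviewer flags: it never inspects the result,
# so it passes regardless of whether total_with_fee is correct.
def test_total_with_fee_runs():
    total_with_fee(100.0)
    assert True


# The kind of test the reviewer asks for: it pins down the expected behaviour.
def test_total_with_fee_adds_flat_fee():
    assert total_with_fee(100.0) == 101.5
    assert total_with_fee(0.0, fee=2.0) == 2.0
```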

MoSCoW prioritisation

Review comments typically get a priority tag:

| Tag | Meaning | Merge impact |
|---|---|---|
| [M] Must | Non-negotiable | Blocks merge |
| [S] Should | Important | Doesn't block if there's a good reason to defer |
| [C] Could | Nice to have | Not essential |
| [W] Would | Optional polish | Low priority |

A PR with three [C] comments and no [M] or [S] items is ready to merge. A PR with one [M] item isn’t, regardless of how much else is good.
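Read as a rule, the table boils down to something like the sketch below. The Tag enum and blocks_merge helper are invented for illustration; they are not real TB tooling.

```python
from enum import Enum


class Tag(Enum):
    MUST = "M"    # non-negotiable
    SHOULD = "S"  # important, deferrable with a good reason
    COULD = "C"   # nice to have
    WOULD = "W"   # optional polish


def blocks_merge(comment_tags: list[Tag]) -> bool:
    # Only open [M] Must comments block the merge outright; [S] Should items
    # need an explicit reason to defer but are not a hard block by themselves.
    return Tag.MUST in comment_tags


# Three [C] comments and nothing stronger: ready to merge.
assert not blocks_merge([Tag.COULD, Tag.COULD, Tag.COULD])
# A single [M] comment: not ready, regardless of everything else.
assert blocks_merge([Tag.MUST, Tag.COULD, Tag.WOULD])
```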

Communication style

The reviewer is direct and specific. It states the issue and suggests a fix. It doesn't ask clarifying questions such as "Did you consider using a dictionary here?", because it can't engage in follow-up conversation. If something looks intentional but unusual, it notes that as an observation rather than a question.

Concrete suggestions with code examples are the standard. “This function is hard to read” is not useful. “Extract the validation logic into a validate_holdings function that returns a ValidationResult” is.
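For illustration only, a suggestion like that one might land as something close to the following sketch; the ValidationResult fields and the validation rules are invented here, not TB code.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    """Outcome of validating a set of holdings."""
    is_valid: bool
    errors: list[str] = field(default_factory=list)


def validate_holdings(holdings: list[dict]) -> ValidationResult:
    """Validation logic extracted from the larger function the review flagged."""
    errors: list[str] = []
    for holding in holdings:
        if not holding.get("symbol"):
            errors.append("holding is missing a symbol")
        elif holding.get("quantity", 0) <= 0:
            errors.append(f"{holding['symbol']}: quantity must be positive")
    return ValidationResult(is_valid=not errors, errors=errors)


result = validate_holdings([{"symbol": "ABC", "quantity": 0}])
assert not result.is_valid and result.errors == ["ABC: quantity must be positive"]
```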

Development conventions

Beyond code review, the team follows documented conventions that reduce friction.

Commits use the conventional format <type>(<scope>): <description>, with types including feat, fix, chore, docs, refactor, test, and ci. The result is a parseable history in which every change is categorised by type.
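Because the format is machine-parseable, a commit subject can be checked in a few lines. A rough sketch of what that might look like (the regex and helper name are invented for this example, not part of any TB tooling):

```python
import re

# <type>(<scope>): <description>, with the scope optional.
COMMIT_RE = re.compile(
    r"^(?P<type>feat|fix|chore|docs|refactor|test|ci)"
    r"(?:\((?P<scope>[^)]+)\))?: (?P<description>.+)$"
)


def parse_commit_subject(subject: str) -> dict | None:
    """Return the parsed parts of a conventional commit subject, or None."""
    match = COMMIT_RE.match(subject)
    return match.groupdict() if match else None


assert parse_commit_subject("feat(holdings): add quantity validation") == {
    "type": "feat",
    "scope": "holdings",
    "description": "add quantity validation",
}
assert parse_commit_subject("update stuff") is None
```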

PR titles include a ticket identifier: DEMO-1234 <summary> for ticket-driven work, BUGFIX <summary> for bug fixes without tickets, OPS <summary> for operational changes.

Investigation notes go in the app’s notes/ folder when a change involves uncertainty—design decisions, debugging sessions, research findings. Named after the PR convention (DEMO-1234-investigation-note-example.md), these notes give reviewers context that the code diff alone can’t provide. They answer “why did you do it this way?” before the question is asked.

What we’re still tuning

Calibration is still the hard part. A reviewer agent can become too timid and miss something real, or too noisy and swamp a PR with non-blocking observations. The rules narrow the job, but they do not eliminate judgement.

The same goes for the surrounding process. Investigation notes are useful when a change involves uncertainty, but they would become bureaucracy if every small change needed one. Priority tags help, but the team still has to keep tuning what really counts as [M] Must vs [S] Should in practice.

Why this matters

In TB’s experience, agents are more productive when the standards are explicit. A human reviewer can infer context, ask for clarification, and apply judgement shaped by years of experience. An agent reviewer needs the criteria written down.

The ten focus areas, the MoSCoW tags, and the communication style guidelines are an attempt to make agent-driven code review useful from session to session. They narrow the reviewer’s job and make it clearer what “good review” should look like in practice.

The same principle applies in reverse: when agents write code, they follow the same conventions. The reviewer agent and the implementing agent share the same playbook. Consistency in both directions.

Open design space

The headings and tagging scheme are not frozen. The thresholds underneath them still move: which recurring false positives deserve a rule change, where the line between [M] and [S] is too fuzzy, and when a human reviewer should override the reviewer agent because the code is technically fine but the trade-off is wrong.

The review system is real and usable today, but it is still being sharpened. Improving feedback quality—not only code quality—means tuning how the process works.