Eight Gates and a Loop was the dispatch loop. Eight gates, three lanes, the entire apparatus that decides whether a task should run. That’s half the brain. This is the other half: what happens after the task runs and someone has to decide if the work was any good.
The review pipeline is 804 lines of automated middle management. It assigns reviewers, parses their verdicts, routes revision requests, spawns child tasks, detects suspicious approvals, and hands merge operations to a completely separate queue. It does all of this on a timer, every tick, without complaint.
It also overrides its own reviewers when they get too perfectionist, and flags approvals it doesn’t trust while honoring them anyway. This is the system that developed professional skepticism and then decided skepticism alone isn’t a policy.
The Verdict Parsing Chain
Every review ends with the reviewer writing a verdict. The review-worker’s job is to parse that verdict into one of five paths. The parsing is a regex chain, ordered from most specific to least:
const revisionSpawnMatch = result.match(
/REVISION\+SPAWN(?:\((\w+)\))?:\s*(.*?)\s*\|\s*([\s\S]*)/i
)
const spawnMatch = !revisionSpawnMatch
? result.match(/SPAWN(?:\((\w+)\))?:\s*(.*?)(?:\s*\|\s*([\s\S]*))?$/i)
: null
const revisionMatch = !revisionSpawnMatch
? result.match(/REVISION:\s*([\s\S]*)/i)
: null
const isApproved = !revisionSpawnMatch && !spawnMatch
&& !revisionMatch && result.toUpperCase().includes('APPROVED')
Three regexes and a string search. If none of them match, the verdict is ambiguous, and the task goes back to the review queue for someone else to look at. No auto-approve on ambiguity. Not anymore.
That “not anymore” is doing a lot of work. The System That Approved Its Own Failures documented the era when an ambiguous verdict defaulted to APPROVED. The reviewer said something unclear and the system shrugged and shipped it. Now ambiguity means “I don’t understand what you said, so I’m going to ask someone else.” An improvement. A low bar, but an improvement.
Five Ways to Judge
The five verdicts, in order of complexity:
REVISION+SPAWN. The reviewer says two things at once: “fix this, and also someone else needs to do a prerequisite.” The original task gets revision notes appended to its description. A child task gets created and blocks the original. The parent can’t continue until the child ships. This is the nuclear option: you’re not just wrong, you’re wrong and you need help.
SPAWN. “Someone else needs to do something before I can judge this.” Creates a blocking child task, puts the original in BLOCKED. If the spawn gets deduped (because that child task already exists), the original just goes back to the review queue for a fresh look.
REVISION. “Try again.” Feedback gets appended to the task description, revision counter increments, task goes back to ASSIGNED for the owner to retry. If the owner has been disabled since the task started, a substitute gets assigned. The system handles mid-loop attrition like a temp agency.
APPROVED. “Ship it.” If there’s a worktree with code changes, the task doesn’t go straight to DONE. It goes to BLOCKED and enters a git merge queue. If there’s no worktree, it’s DONE immediately.
AMBIGUOUS. “I have no idea what the reviewer meant.” Reviewer gets cleared, task re-enters the review queue. The system’s way of saying “that wasn’t a verdict, that was a paragraph.”
The parsing order matters. REVISION+SPAWN is checked first because its regex is a superset. If you checked SPAWN first, a REVISION+SPAWN verdict would match the SPAWN pattern and lose the revision feedback. The chain is arranged like exception handlers: most specific catch first.
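Folded into a single function, the chain looks something like this sketch. The `parseVerdict` name and the result shape are mine, not the system's; the regexes and their ordering are taken directly from the snippet above.

```typescript
type Verdict =
  | { kind: 'revision+spawn'; specialty?: string; title: string; feedback: string }
  | { kind: 'spawn'; specialty?: string; title: string; detail?: string }
  | { kind: 'revision'; feedback: string }
  | { kind: 'approved' }
  | { kind: 'ambiguous' }

// Most specific pattern first, exactly as the worker orders its regexes.
function parseVerdict(result: string): Verdict {
  const rs = result.match(/REVISION\+SPAWN(?:\((\w+)\))?:\s*(.*?)\s*\|\s*([\s\S]*)/i)
  if (rs) return { kind: 'revision+spawn', specialty: rs[1], title: rs[2], feedback: rs[3] }

  const sp = result.match(/SPAWN(?:\((\w+)\))?:\s*(.*?)(?:\s*\|\s*([\s\S]*))?$/i)
  if (sp) return { kind: 'spawn', specialty: sp[1], title: sp[2], detail: sp[3] }

  const rev = result.match(/REVISION:\s*([\s\S]*)/i)
  if (rev) return { kind: 'revision', feedback: rev[1] }

  // No regex matched: only an explicit APPROVED survives; everything else is ambiguous.
  if (result.toUpperCase().includes('APPROVED')) return { kind: 'approved' }
  return { kind: 'ambiguous' }
}
```

Feed it a REVISION+SPAWN verdict and the SPAWN pattern never gets a chance to swallow the revision feedback, which is the whole point of the ordering.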
The Circuit Breaker for Perfectionism
Three revisions. That’s the limit.
if (atRevisionCap) {
logger.warn(
{ taskId: task.id, taskTitle: task.title,
revisionCount: task.revisionCount,
maxRevisions, reviewer: task.reviewer?.name },
'Review worker: revision cap reached, auto-approving instead of cycling again'
)
}
When the revision count hits the policy maximum (default: 3), any further REVISION verdict gets silently overridden to APPROVED. The reviewer says “try again.” The system says “no, we’re done here.” The event log records the verdict as APPROVED_AT_CAP so there’s a paper trail, but the reviewer never knows their verdict was overruled.
This is the pragmatic ceiling. Without it, a perfectionist reviewer and an eager-to-please worker could cycle forever: revision, attempt, revision, attempt, each round costing API dollars and producing marginally different output. The cap says “three rounds of feedback is enough. Ship it or don’t, but stop iterating.”
It’s also the system admitting something uncomfortable: past a certain point, more review doesn’t mean better output. It just means more review.
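The cap decision itself reduces to a one-line comparison. A minimal sketch, assuming a per-task revision counter and a policy maximum; the `resolveRevisionVerdict` name is mine, and `APPROVED_AT_CAP` is the event-log label described above:

```typescript
const DEFAULT_MAX_REVISIONS = 3 // policy default from the text

// At or past the cap, a REVISION verdict is overridden. The event log
// records APPROVED_AT_CAP for the paper trail; the reviewer is never told.
function resolveRevisionVerdict(
  revisionCount: number,
  maxRevisions: number = DEFAULT_MAX_REVISIONS
): 'REVISION' | 'APPROVED_AT_CAP' {
  return revisionCount >= maxRevisions ? 'APPROVED_AT_CAP' : 'REVISION'
}
```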
The Suspicion That Doesn’t Act
Here’s the philosophical heart of the whole pipeline. After a task is approved, the review-worker runs a heuristic:
const suspiciousApproval = hasWorktree
? !hasCommittedDiff && !hasUncommittedWorktreeChanges
&& !hasSubstantialResult
: !hasSubstantialResult
Three signals. Did the worktree produce a git diff? Are there uncommitted changes? Does the task result contain at least 50 characters of substance? If all three answers are no, the approval is flagged as suspicious. A task was approved, but there’s no evidence any work was done.
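The three signals can be sketched as a pure predicate. The interface and function names here are illustrative; the 50-character threshold and the branch structure come from the snippet above:

```typescript
const MIN_SUBSTANTIAL_RESULT = 50 // characters, per the heuristic described above

interface ApprovalSignals {
  hasWorktree: boolean
  hasCommittedDiff: boolean
  hasUncommittedWorktreeChanges: boolean
  resultText: string
}

// Observational only: the caller logs, counts, and records the signals,
// but never blocks the approval.
function isSuspiciousApproval(s: ApprovalSignals): boolean {
  const hasSubstantialResult = s.resultText.trim().length >= MIN_SUBSTANTIAL_RESULT
  return s.hasWorktree
    ? !s.hasCommittedDiff && !s.hasUncommittedWorktreeChanges && !hasSubstantialResult
    : !hasSubstantialResult
}
```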
The system logs a warning. Increments a Prometheus counter. Records the signals in a review activity entry for the ops dashboard. And then it approves the task anyway.
This is the part that gets me. The comment in the code reads:
// Suspicious-approval heuristic only: never override an APPROVED verdict here.
// If reviewer quality is weak, fix the reviewer contract instead of creating a
// hidden second gate in the worker.
That comment is arguing with The System That Approved Its Own Failures. In that era, the system had a hard gate: empty diff plus approval equals rejection, full stop. The reviewer said “approved” and the pipeline said “I don’t believe you.” It was a phantom gate that nobody asked for, making decisions that should have been the reviewer’s responsibility.
The current code explicitly refuses to do that. It says: if the reviewer is approving empty work, that’s a reviewer problem. Fix the reviewer’s prompt. Fix the reviewer’s criteria. Don’t build a shadow judge inside the worker that second-guesses verdicts after the fact.
The system went from cop to auditor. It still watches everything. It still keeps score. It just stopped arresting people.
Spawn Routing, or: How Not to Create an Infinite Loop
When a reviewer says SPAWN, someone has to figure out who gets the new task. This is resolveSpawnTarget, and it has three guards because the first two versions of it kept routing tasks in circles.
Guard one: writing detection. If the spawn title matches /\b(write|draft|create|compose|produce|genera)/i and the spawning reviewer is an editor, the task is rerouted to a writer. Without this guard, the editor who said “someone needs to write this” would get assigned their own spawn request back.
Guard two: draft existence. If the target specialty is “editor” but no draft exists for the topic, reroute to a writer. This one is wild. It scans the filesystem:
import { readdirSync } from 'node:fs'
import { join } from 'node:path'

// `cleaned` is assumed to be the lowercased task title, punctuation stripped upstream.
const keywords = cleaned
  .split(/[\s\-_:,|()]+/)
  .filter(w => w.length >= 3 && !TOPIC_STOPWORDS.has(w))
// Short titles may yield a single keyword; never require more hits than exist.
const minMatches = Math.min(2, keywords.length)
for (const dir of DRAFT_SEARCH_DIRS) {
  const files = readdirSync(join(projectPath, dir))
  for (const file of files) {
    const hits = keywords.filter(kw =>
      file.toLowerCase().includes(kw)).length
    if (hits >= minMatches) return true
  }
}
Extract keywords from the task title. Strip stopwords (bilingual: English and Spanish, because we run a Cadiz-based project). Scan draft directories. Require at least two keyword hits in a filename. A reviewer can’t review what hasn’t been written, and the system now checks the literal filesystem to verify the precondition.
Guard three: exclude the spawner. The reviewer who issued the SPAWN is excluded from being assigned the spawned task. Because if you say “someone else needs to handle this,” you shouldn’t be that someone else.
The REVIEWER_TO_WRITER map that powers guard one has exactly one entry: { editor: 'writing' }. The entire spawn routing subsystem, all three guards and the fallback chain, exists because editors kept getting their own work back.
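Guards one and three compose into something like the following sketch. The `resolveSpawnTarget` name, the regex, and the one-entry `REVIEWER_TO_WRITER` map come from the text; the agent shape and the lookup logic are my illustration, and guard two (the filesystem draft scan shown above) is elided:

```typescript
const REVIEWER_TO_WRITER: Record<string, string> = { editor: 'writing' }
const WRITING_VERB = /\b(write|draft|create|compose|produce|genera)/i

function resolveSpawnTarget(
  spawnTitle: string,
  spawnerId: string,
  spawnerSpecialty: string,
  agents: { id: string; specialty: string }[]
): { id: string; specialty: string } | undefined {
  // Guard one: a reviewer asking for writing gets routed to a writer.
  let wanted = spawnerSpecialty
  if (WRITING_VERB.test(spawnTitle) && REVIEWER_TO_WRITER[spawnerSpecialty]) {
    wanted = REVIEWER_TO_WRITER[spawnerSpecialty]
  }
  // Guard three: never hand the spawn back to whoever requested it.
  return agents.find(a => a.specialty === wanted && a.id !== spawnerId)
}
```

With one editor and one writer in the pool, a "Write the intro" spawn from the editor lands on the writer; a spawn with no writing verb that would route back to the lone editor who issued it finds nobody.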
The Git Lane
When a worktree task gets approved, it doesn’t merge immediately. The review-worker creates a GIT_APPLY job in a dedicated queue with a parallelism limit of one. One merge at a time. Review processing and git operations are completely decoupled.
If the merge succeeds: worktree gets cleaned up, task goes to DONE. If there’s a conflict: a resolver task gets spawned, assigned to the original owner (or any agent with Bash and file-editing tools), and the parent task stays BLOCKED until the conflict is resolved. If the resolver task would be a duplicate, or retries are exhausted, the task gets tagged human_review_required and waits for someone with hands.
The separation matters because a merge conflict used to be able to crash the review loop. The git lane means review keeps ticking even when git is having a bad day. Different failure domains, different queues, different parallelism limits. Review runs two concurrent executions. Git runs one. The review-worker hands off the envelope and moves on.
Two Systems That Classify the Same Things Differently
There are two independent classification systems, and they don’t agree with each other.
The executor’s classifyReviewTask determines what review criteria the reviewer sees. Code reviews get told to inspect the diff. Research reviews are told that no-diff outputs are acceptable. Content reviews check editorial standards.
The review-worker’s classifyReviewLane sorts tasks into risk tiers: critical (database, auth, deployment), state-machine (worker, executor, pipeline), content, and safe. It’s tagged “Phase 1: logging only, no behavioral change.” The classification runs, the lane gets recorded, and nothing happens differently.
One system shapes what the reviewer looks for. The other system watches and categorizes, preparing for a future where risk level determines routing. Phase 2 isn’t built yet. The road is paved. Nobody’s driving on it.
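A logging-only lane classifier might look like this sketch. The tier names and the critical and state-machine keywords come from the text; the content keywords and the first-match-wins logic are my guesses at the shape, not the system's actual rules:

```typescript
type ReviewLane = 'critical' | 'state-machine' | 'content' | 'safe'

// First match wins; anything unmatched falls through to 'safe'.
const LANE_KEYWORDS: [ReviewLane, RegExp][] = [
  ['critical', /\b(database|auth|deployment)\b/i],
  ['state-machine', /\b(worker|executor|pipeline)\b/i],
  ['content', /\b(article|draft|copy|editorial)\b/i], // keywords assumed
]

// Phase 1: the lane is recorded for the dashboard, never acted on.
function classifyReviewLane(taskTitle: string): ReviewLane {
  for (const [lane, pattern] of LANE_KEYWORDS) {
    if (pattern.test(taskTitle)) return lane
  }
  return 'safe'
}
```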
The Architecture of Institutional Doubt
The review pipeline is what happens when a system gets burned enough times to develop trust issues but stays functional enough to not let those trust issues run the show.
It checks for suspicious approvals but doesn’t block them. It caps revision loops but doesn’t tell the reviewer. It classifies risk but doesn’t act on it yet. It scans the filesystem for draft files before routing editorial tasks because editors kept getting phantom review requests for articles that didn’t exist.
Every guard in here is a scar from a specific incident. The self-review prevention (three layers deep, because the DB constraint crash was that bad). The PROCESSED status flag on execution runs (because without it, every tick re-processed the same verdict forever). The spawn dedup fallback (because duplicate child tasks used to pile up).
Eight Gates and a Loop was the question: should this task run? This is the answer to the next question: was it good enough? Five verdicts, nine guards, two classification systems, and a heuristic that knows when something smells wrong and has made peace with not doing anything about it.
The system learned to distrust itself. And then it decided that distrust, properly instrumented and logged, is more useful than distrust that blocks things. Observe. Record. Let the reviewer contract do its job. Fix the contract if it’s broken.
804 lines of automated middle management that finally learned: suspicion is data, not policy.