Eight Gates and a Loop documented the worker tick: setInterval(workerTick, 30000). Every 30 seconds. 2,880 ticks per day. The system’s heartbeat. The thing everything depends on.
On March 25, 1,742 of those ticks failed. Consecutively. For 14.5 hours. Zero tasks executed. The queue grew. No alerts fired. The system looked calm from the outside because the only metric that would have caught it was a counter that nobody was watching.
That was one of three failures running simultaneously. None of them were loud.
Failure One: The Silent Crash
The worker tick wraps its entire body in a try/catch:
  } catch (err) {
    tickErrored = true
    consecutiveErrors++
    workerTickCounter.inc('errored')
    logger.error({ err, consecutiveErrors }, 'Worker tick error')
  } finally {
    if (!tickErrored) {
      consecutiveErrors = 0
      lastTickCompletedAt = new Date()
    }
    tickInProgress = false
  }
On success: consecutiveErrors resets to zero. On failure: increment and log. The counter is an in-memory integer. It’s exposed via the metrics endpoint. It’s not monitored by anything.
At 06:00 on March 25, the worker tick started failing. The error:
[Scope Validation] Task scope exceeds file_review limit
One code path in the review worker spawns a child task when a reviewer says REVISION+SPAWN. That spawn goes through createTaskWithPolicies(), which validates scope limits. If the spawned task is too large for the review scope, it throws. The throw was unguarded. It propagated up through processReviews(), through the worker tick, into the catch block.
Same error. Same task. Every tick. For 14.5 hours.
1,742 is not a crash count. It’s a loop. The system was alive. The process was running. The logger was writing. The metrics counter was incrementing. The worker was doing everything except working.
The fix was two try/catch blocks around the spawn paths. Scope validation error on the REVISION+SPAWN path: fall back to applying the revision without spawning the child. Scope validation error on the SPAWN-only path: mark the parent task done and continue. Neither path needed to propagate the exception. The error was about one specific task being too large. It was not about the worker being broken. But without the catch, one task’s problem became every task’s problem, for every tick, until someone looked at a counter that nobody looks at.
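The two guards can be sketched as a per-outcome handler. Everything named here is illustrative, not the system's real API: ScopeValidationError, applyRevision, spawnChildTask, and markDone are assumed names. The point is the shape: scope errors are absorbed per task, anything else still propagates.

```typescript
// Hypothetical sketch: isolate scope-validation failures per task so one
// oversized spawn cannot poison the whole tick. All names are assumptions.
class ScopeValidationError extends Error {}

interface ReviewOutcome { taskId: string; verdict: 'REVISION+SPAWN' | 'SPAWN' }

interface Deps {
  applyRevision: (taskId: string) => void
  spawnChildTask: (taskId: string) => void
  markDone: (taskId: string) => void
}

function handleReviewOutcome(outcome: ReviewOutcome, deps: Deps): void {
  try {
    deps.spawnChildTask(outcome.taskId)
  } catch (err) {
    // Unknown errors still propagate: they may mean the worker is broken.
    if (!(err instanceof ScopeValidationError)) throw err
    if (outcome.verdict === 'REVISION+SPAWN') {
      // Fallback: apply the revision without spawning the child.
      deps.applyRevision(outcome.taskId)
    } else {
      // SPAWN-only path: mark the parent done and continue.
      deps.markDone(outcome.taskId)
    }
  }
}
```

A scope error now costs exactly one task its child spawn; the tick keeps processing everything else.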
Failure Two: The Merge Deadlock
Six tasks had been stuck in BLOCKED for days. All tagged merge-conflict and human_review_required. All owned by the same agent. All sitting in the same project.
The context: tasks that write files run in isolated git worktrees. Each task gets its own branch. When the task is approved, the system merges the branch back to main. Five Verdicts and a Suspicion described this handoff. What it didn’t describe is the merge strategy.
Before the three-way merge, the system attempts a rebase:
if (mergeBase !== mainHash) {
  logger.info({ branch, defaultBranch }, 'mergeWorktree: branch behind default, attempting rebase')
  try {
    execSafe('git', ['rebase', defaultBranch], worktreeDir)
    branchHash = execSafe('git', ['rev-parse', branch], projectPath)
  } catch (rebaseErr: unknown) {
    try { execSafe('git', ['rebase', '--abort'], worktreeDir) } catch { }
    const msg = rebaseErr instanceof Error ? rebaseErr.message : String(rebaseErr)
    logger.warn({ branch, defaultBranch, err: msg },
      'mergeWorktree: rebase failed, falling through to merge-tree')
    try {
      branchHash = execSafe('git', ['rev-parse', branch], projectPath)
    } catch { }
    // Don't return failure here — fall through to merge-tree
  }
}
That comment at the bottom. “Don’t return failure here.” That’s the fix. Before this week, the catch block returned { success: false } immediately. Rebase failed? Conflict. Tag it. Block it. Move on.
The problem: rebase and merge-tree are not the same operation. Rebase replays commits one by one. It’s strict about intermediate states. If commit A modified a file that also changed on main, the rebase fails at commit A, even if the final state of the branch doesn’t conflict with the final state of main. Merge-tree does a three-way merge of the final states. It’s lenient. It answers a different question: “can the endpoints be combined?” instead of “can the journey be replayed?”
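A deliberately coarse toy model makes the distinction concrete. It reduces "conflict" to "both sides touched the same file", which real git does not do (git conflicts at the line level), and every name here is invented. What it does capture is the strict-replay versus lenient-endpoints difference.

```typescript
// Toy model, not real git: a conflict is "both sides touched the same file".
interface Commit { touched: string[] }

// Rebase replays the journey: every intermediate commit is checked, so a
// commit that touched a file changed on main fails even if a later commit
// undid that change.
function rebaseConflicts(branchCommits: Commit[], mainTouched: Set<string>): boolean {
  return branchCommits.some(c => c.touched.some(f => mainTouched.has(f)))
}

// Merge-tree compares the endpoints: only the branch's net final change
// matters.
function mergeTreeConflicts(branchFinalTouched: Set<string>, mainTouched: Set<string>): boolean {
  return Array.from(branchFinalTouched).some(f => mainTouched.has(f))
}
```

In this model, a branch whose intermediate commits brushed against main's files but whose net change didn't fails the replay and passes the endpoint check, which is exactly the stale-branch false positive described below.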
Most of the six blocked tasks were stale-branch false positives. The branches had diverged from main by unrelated commits. The rebase failed because the replay was strict. The merge-tree would have succeeded because the endpoints were compatible. Nobody checked, because the old code returned failure on rebase and never reached merge-tree.
The human_review_required tag was aspirational. No human review mechanism existed. The tag was a Post-it note on a locked door. The tasks were stuck permanently.
The Cascade
The merge fix included a cascade. After a successful merge, the system now re-queues other blocked tasks in the same project:
async function requeueBlockedSiblings(projectId: string, excludeTaskId: string): Promise<number> {
  const siblings = await prisma.task.findMany({
    where: {
      projectId,
      id: { not: excludeTaskId },
      status: 'BLOCKED',
      tags: { has: 'merge-conflict' },
      worktreeBranch: { not: null },
    },
  })
  // Strip human_review_required tag, re-queue for merge attempt
  // Capped at 3 total attempts per task
}
One task merges successfully. The system checks: are there siblings in the same project that are blocked on merge conflicts? Re-queue them. Maybe their conflicts were false positives too. Cap at three attempts, because genuine conflicts exist and infinite retry is not a strategy.
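The elided update step might look something like this pure helper. The mergeAttempts field, the QUEUED status, and the exact shape are assumptions; only the three-attempt cap and the stripped human_review_required tag come from the text.

```typescript
// Hypothetical sketch of the elided re-queue step for one blocked sibling.
// Field names and the QUEUED status are assumed, not the real schema.
interface BlockedSibling { id: string; tags: string[]; mergeAttempts: number }

const MAX_MERGE_ATTEMPTS = 3

function requeueUpdateFor(task: BlockedSibling):
    { status: 'QUEUED'; tags: string[]; mergeAttempts: number } | null {
  // Genuine conflicts exist; after three attempts the task stays blocked.
  if (task.mergeAttempts >= MAX_MERGE_ATTEMPTS) return null
  return {
    status: 'QUEUED',
    // Strip the aspirational human_review_required tag; keep merge-conflict
    // so a repeated failure is still recognizable.
    tags: task.tags.filter(t => t !== 'human_review_required'),
    mergeAttempts: task.mergeAttempts + 1,
  }
}
```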
The controlled re-queue confirmed the theory. Some tasks merged cleanly on first retry. Stale branches, not real conflicts. The rebase had lied about them for days. Others had genuine three-way conflicts: the same files modified on both sides. Those needed re-execution with fresh worktrees, not merge retries.
Failure Three: The Lock That Never Lets Go
Separate investigation, same week. A confirmed code defect:
Eight Gates and a Loop documented step 3b of the worker tick: release worktree locks for BLOCKED tasks. The comment in the code reads:
// Step 3b: Release worktree locks for BLOCKED tasks immediately.
// BLOCKED tasks aren't actively writing; holding the lock creates deadlocks
// because their children need the same project's worktree to dispatch.
Step 3b releases locks for BLOCKED tasks. There is no step 3b for REVIEW.
export const ACTIVE_WORKTREE_TASK_STATUSES: TaskStatus[] = ['IN_PROGRESS', 'REVIEW', 'BLOCKED']
Three statuses hold the worktree lock: IN_PROGRESS, REVIEW, BLOCKED. Step 3b handles BLOCKED. Nothing handles REVIEW. A task that enters REVIEW holds its project’s worktree lock until the review completes and the task transitions to either DONE or BLOCKED. For automated reviews, that’s minutes. For human-reviewed tasks, that’s hours to days.
During that window, no other worktree-writing tasks can dispatch for the same project. They sit in the dispatch queue, evaluated every 30 seconds, skipped every time by the worktree gate. The project is locked by a task that isn’t writing anything. It’s waiting for a review verdict.
From the diagnosis document:
“This is a confirmed code defect, not a hypothesis. The code is unambiguous: step 3b releases BLOCKED locks explicitly, and there is no equivalent for REVIEW.”
Not fixed this week. Documented and queued. But confirmed.
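Since the fix is queued rather than written, this is only a sketch of what a REVIEW analogue of step 3b might look like. The helper names and shapes are assumptions; the idea, releasing the lock for every non-writing status, follows the reasoning in the step 3b comment itself.

```typescript
// Hedged sketch, not the real fix: release locks for every status that holds
// one without writing. Names (LockedTask, releaseLock) are invented.
type TaskStatus = 'IN_PROGRESS' | 'REVIEW' | 'BLOCKED' | 'DONE' | 'QUEUED'

interface LockedTask { id: string; status: TaskStatus }

// Step 3b releases BLOCKED only. BLOCKED and REVIEW both hold the lock
// without writing; only IN_PROGRESS actually needs it.
const NON_WRITING_LOCK_HOLDERS: TaskStatus[] = ['BLOCKED', 'REVIEW']

function releaseIdleLocks(tasks: LockedTask[], releaseLock: (taskId: string) => void): number {
  let released = 0
  for (const task of tasks) {
    if (NON_WRITING_LOCK_HOLDERS.includes(task.status)) {
      releaseLock(task.id)
      released++
    }
  }
  return released
}
```

The open question such a fix would have to answer is re-acquisition: an approved REVIEW task still needs the worktree back for its merge.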
The Counter Nobody Watches
Three failures. Running concurrently. For days.
- A worker generating 1,742 errors over 14.5 hours, executing zero tasks, logging every failure, alerting nobody.
- Six tasks permanently blocked on a condition that wasn’t a real conflict, tagged for human review that didn’t exist.
- Every human-reviewed task holding its project’s worktree lock for the full review window, silently blocking all sibling dispatch.
The queue grew. That was the only visible symptom. Queue length is a lagging indicator on a system that processes work in bursts. A long queue could mean “lots of work came in.” It could also mean “nothing is processing.” Without context, the queue length is ambiguous. Ambiguous metrics are the same as no metrics.
let consecutiveErrors = 0
That’s the variable. An integer. In memory. Exposed on the metrics endpoint. Available if you know to look. Nobody knew to look because the counter had never been interesting before.
Visibility and observability are not the same thing. Visibility means the data exists. Observability means the data acts. A counter that increments to 1,742 without triggering an alert is visible. It’s not observable. It’s a number in a JSON response that nobody fetched. If the system had been failing faster, someone might have noticed. If it had been failing louder, something would have paged. It did neither. It failed at exactly the pace and volume that made it invisible.
The system was breaking in whispers. 1,742 of them.