Post 047 taught the pipeline to say no. Post 048 is about the eighteen hours it couldn’t say anything at all.

Three Root Causes, In Order of Impact

The investigation found 17 distinct restart timestamps across four days, affecting 24 execution runs. They weren’t all the same bug. Three things were killing the server, taking turns:

  1. Manual launchctl kickstart -k during dev sessions (~60%). The operator merging code and restarting. Normal. Expected. Not a bug. But every restart kills running agents mid-execution, and those killed runs count toward the 3-failure retry limit.

  2. Database pool crash loop (~25%). pool.connect() returning undefined, crashing the process. This one’s the story.

  3. tsx compilation failures from bad merges (~15%). An agent wrote a shell HEREDOC (<<EOF) into TypeScript source code. The worktree merged to main. tsx couldn’t compile it. With KeepAlive: true, the server restarted every 30 seconds into the same syntax error. 92 instances of TransformError: Unexpected "<<" in the stderr log.
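For context, the restart cadence in items 2 and 3 comes from the launchd service definition. A minimal sketch of the relevant keys (the label, program, and log path here are placeholders, not the real plist):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.example.pipeline-server</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/tsx</string>
    <string>server.ts</string>
  </array>
  <!-- Relaunch whenever the process exits, for any reason -->
  <key>KeepAlive</key><true/>
  <!-- ...but wait at least 30 seconds between launches -->
  <key>ThrottleInterval</key><integer>30</integer>
  <key>StandardErrorPath</key><string>/tmp/pipeline-server.err.log</string>
</dict>
</plist>
```

With this configuration, a process that dies on startup (a syntax error, a crashing guard clause) relaunches into the same failure every 30 seconds until someone changes the code.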

await void === undefined

The pool bug was introduced in commit d90d907 as a guard clause. Someone had seen pool.connect() return undefined and crash downstream when trying to instrument the result. The fix: add an explicit check.

pool.connect = async function(...args) {
  const connection = await originalConnect(...args)
  if (!connection) throw new Error('Database pool.connect() returned undefined')
  return instrumentClient(connection)
}

Makes sense if you don’t think about it for more than thirty seconds.

Prisma’s PostgreSQL adapter calls pool.connect(callback). Callback-style. The function signature returns void. Not a Promise. Not undefined. void. The return value of a function that doesn’t return anything.

JavaScript lets you await anything. Even void. await void resolves to undefined. The guard clause fires. The process dies. The service restarts in 30 seconds. The service tries the same pool connection. await void. Undefined. Guard. Crash. Restart.
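A minimal reproduction of the trap, with a hypothetical stand-in for the callback-style connect (the names here are illustrative, not the real pg API):

```typescript
// Stand-in for a callback-style connect: hands the client to the callback,
// returns nothing. The declared return type is effectively void.
const callbackStyleConnect: (...args: unknown[]) => unknown = (cb) => {
  if (typeof cb === 'function') {
    (cb as (err?: Error, client?: object) => void)(undefined, { ok: true })
  }
  // no return statement: the call expression evaluates to undefined
}

async function guardedConnect(): Promise<unknown> {
  // The buggy guard, reproduced: awaiting a void-returning call
  // resolves to undefined, so the guard always fires on this path.
  const connection = await callbackStyleConnect(() => {})
  if (!connection) throw new Error('Database pool.connect() returned undefined')
  return connection
}

guardedConnect().catch((err) => console.log((err as Error).message))
// prints: Database pool.connect() returned undefined
```

The connection is perfectly healthy; it went to the callback. The guard is inspecting a value that was never meant to carry anything.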

With KeepAlive: true and ThrottleInterval: 30 in the launchd plist, this produces 2 crashes per minute. 87 occurrences of the pool error in the stderr log. The guard clause designed to prevent connection failures was the thing causing them.

The Fix

The fix is not a one-liner. The old code treated callback-style and promise-style as the same thing. They’re not.

pool.connect = function instrumentedConnect(...args) {
  const callbackArg = (args as unknown[])[0]
  if (typeof callbackArg === 'function') {
    // Callback path: Prisma calls pool.connect(cb)
    // Returns void — don't await, don't check return value
    const callback = callbackArg as Function
    return originalConnect((err, client, release) => {
      if (!err && client) {
        callback(undefined, instrumentClient(client), release)
        return
      }
      callback(err, client, release)
    })
  }

  // Promise path: returns Promise<PoolClient>
  const promise = originalConnect()
  return promise.then((client) => instrumentClient(client))
}

if (typeof callbackArg === 'function'). That’s the fork. Callback path returns void; don’t touch the return value. Promise path returns a Promise; handle it with .then(). Two different calling conventions, two different code paths. The previous version pretended one path could serve both.
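As a sanity check, here is the fork exercised against a stand-in pool. Everything below is a hypothetical sketch (the types and the stand-in originalConnect are invented; the real signatures live in pg):

```typescript
// Hypothetical stand-in for the underlying connect, supporting both conventions.
type Release = () => void
type ConnectCallback = (err: Error | undefined, client?: Client, release?: Release) => void
interface Client { instrumented?: boolean }

function originalConnect(cb?: ConnectCallback): void | Promise<Client> {
  if (cb) {
    cb(undefined, {}, () => {})  // callback convention: client goes to cb, return value is void
    return
  }
  return Promise.resolve({})     // promise convention: client comes back in the Promise
}

const instrumentClient = (c: Client): Client => ({ ...c, instrumented: true })

// The fixed fork: inspect the first argument, pick exactly one convention.
function instrumentedConnect(...args: unknown[]): void | Promise<Client> {
  const callbackArg = args[0]
  if (typeof callbackArg === 'function') {
    const callback = callbackArg as ConnectCallback
    return originalConnect((err, client, release) => {
      if (!err && client) {
        callback(undefined, instrumentClient(client), release)
        return
      }
      callback(err, client, release)
    })
  }
  return (originalConnect() as Promise<Client>).then(instrumentClient)
}

// Callback convention (what Prisma's adapter uses):
instrumentedConnect((err: Error | undefined, client?: Client) => {
  console.log('callback path instrumented:', client?.instrumented)  // true
})

// Promise convention:
;(instrumentedConnect() as Promise<Client>).then((client) => {
  console.log('promise path instrumented:', client.instrumented)    // true
})
```

Both callers get an instrumented client, and neither path ever awaits a void.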

This bug existed for approximately 18 hours on March 4th. In that window, between the guard clause going live and the fix landing, the server crash-looped enough to generate 87 pool errors.

The Downstream Carnage

Impact                                   Magnitude
Execution runs killed                    24 (across 17 restarts)
Agents hitting cooldown                  2 (3 consecutive restart-induced failures trigger a 30-minute cooldown)
Orphaned worktrees retrying cleanup      7 (every 5 minutes, ~200 stderr lines/day)
Provider registration failures           13 UNKNOWN (workers ticked before providers initialized post-restart)

Each killed run consumed retry budget. An agent that got caught in three consecutive restarts hit the 3-failure limit and its task went to permanent FAILED. Not because the agent did anything wrong. Because the server crashed underneath it three times.

The 7 orphaned worktrees were from tasks interrupted mid-execution. Their worktree paths no longer existed, but the heartbeat janitor didn’t check for existence before calling git worktree remove. So every 5 minutes, 7 removal attempts, 7 failures, 7 error lines. Pure noise, drowning the real errors.
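The janitor fix is essentially a one-line existence check. A hedged sketch of the idea (the function name and structure are hypothetical, not the real janitor):

```typescript
import { existsSync } from 'node:fs'
import { execFileSync } from 'node:child_process'

// Hypothetical janitor pass: skip worktree paths that are already gone
// instead of retrying `git worktree remove` every 5 minutes forever.
function cleanupWorktrees(worktreePaths: string[]): string[] {
  const removed: string[] = []
  for (const path of worktreePaths) {
    if (!existsSync(path)) continue  // already gone: nothing to remove, no error to log
    execFileSync('git', ['worktree', 'remove', '--force', path])
    removed.push(path)
  }
  // Clear git's bookkeeping for directories that vanished out from under it.
  if (removed.length > 0) execFileSync('git', ['worktree', 'prune'])
  return removed
}
```

For entries whose directories are already missing, git worktree prune (rather than remove) is the right cleanup, since there is no directory left to operate on.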

The Deeper Rot

While reading the stderr log, the investigation also found 533 P2021 errors (table doesn’t exist) and 321 P2022 errors (column doesn’t exist). These aren’t from the pool bug. They’re from schema drift: prisma db push alters the database, but the running server still has the old Prisma client compiled against the old schema. The client asks for tables and columns that the new schema renamed or removed.
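One mitigation is to never let the pushed schema and the running client drift: regenerate and restart in the same step. A hedged sketch of that sequence (the service label is a placeholder):

```shell
# After any schema change: push, regenerate the Prisma client, then restart
# the server so the compiled client matches the live schema.
npx prisma db push
npx prisma generate
launchctl kickstart -k gui/$(id -u)/com.example.pipeline-server
```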

The stderr log contains 1,035 Node.js v exit markers (the version line Node prints after a fatal exit, one per crash). Most are crash-loop iterations. A single root cause can generate dozens of exits before someone fixes the bad code. The number sounds apocalyptic. The reality is three bugs (pool, HEREDOC, schema drift) feeding a restart loop with no backoff.

The System That Survived Its Own Hardening

Post 047 added content validation gates. Phantom detection. Review hardening. The pipeline got stricter, smarter, more cautious. And then a two-line guard clause took the whole thing down for 18 hours.

The investigation found three classes of failure: human-initiated restarts (expected), a pool instrumentation bug (preventable), and agent-written syntax errors merging to main (architectural). The first is the cost of development. The second was fixed. The third needs a pre-merge tsc --noEmit check before worktree code can land on the main branch.
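That pre-merge gate can be a few lines of shell run before the merge step. A sketch of the shape (the function, branch variable, and wiring are hypothetical; the real check is npx tsc --noEmit inside the worktree):

```shell
# Hypothetical pre-merge gate: refuse to merge a worktree branch whose
# check command fails. In the real pipeline the check would be "npx tsc --noEmit".
premerge_typecheck() {
  # $1: the command that must pass; $2: branch name, used only for the log line
  if ! $1; then
    echo "pre-merge check failed, refusing to merge $2" >&2
    return 1
  fi
  echo "pre-merge check passed for $2"
}

# Usage sketch (run from inside the agent's worktree):
#   premerge_typecheck "npx tsc --noEmit" "$WORKTREE_BRANCH" \
#     && git merge --no-ff "$WORKTREE_BRANCH"
```

This would have stopped the HEREDOC merge cold: tsx would never have seen the broken source on main.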

The system that survived its own hardening is more interesting than the one that ships on the first try. Post 047 made the pipeline say no to phantom deliverables. Post 048 is the pipeline learning that the guard at the gate needs to know the difference between a callback and a promise. Most guards do. The lesson echoes the debugging process that untangled these cascading failures: when the fix creates the problem, you’re working in the wrong abstraction layer.