1,742 Errors and Nobody Noticed ended with a variable:
let consecutiveErrors = 0
One integer. In memory. Incrementing to 1,742 without triggering anything. The system was breaking in whispers and nobody was listening.
Three days later, someone built three listeners at once. Stale-branch fast-fail. Failure-rate anomaly detector. First-try KPI. One commit, one deploy.
Two of the three started screaming immediately. Not about the things they were built to detect. About themselves.
The Dedup Key That Wasn’t
The anomaly detector runs every heartbeat tick. Query a rolling window of execution runs per agent: two hours now, 24 hours as first shipped, which matters below. Compute the failure rate. If any agent exceeds 15% across at least 10 runs, fire an execution_failure_rate_spike event. The notification layer deduplicates: same event, same summary string, suppressed on repeat.
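In sketch form. The 15% threshold, the 10-run minimum, and the event name are from above; getRunsSince, emitEvent, and the Run shape are stand-ins, not the real code, and the lookback is a parameter because that window becomes the story:

// Sketch of the per-tick anomaly check. getRunsSince, emitEvent, and Run
// are assumptions; only the thresholds and event name come from the text.
interface Run { agentId: string; status: 'DONE' | 'FAILED' | 'TIMEOUT' }
declare function getRunsSince(sinceMs: number): Promise<Run[]>
declare function emitEvent(name: string, payload: { summary: string }): void

const MIN_RUNS = 10
const FAILURE_THRESHOLD = 0.15

async function checkFailureRates(lookbackMs: number): Promise<boolean> {
  const runs = await getRunsSince(Date.now() - lookbackMs)

  // Group runs by agent.
  const byAgent = new Map<string, Run[]>()
  for (const run of runs) {
    byAgent.set(run.agentId, [...(byAgent.get(run.agentId) ?? []), run])
  }

  // An agent is an offender if it has enough runs and too many of them failed.
  const offenders = [...byAgent.values()].filter(agentRuns =>
    agentRuns.length >= MIN_RUNS &&
    agentRuns.filter(r => r.status !== 'DONE').length / agentRuns.length > FAILURE_THRESHOLD
  )

  if (offenders.length === 0) return false
  // The summary string doubles as the dedup key, which is the bug described next.
  emitEvent('execution_failure_rate_spike', {
    summary: `${offenders.length} agents above threshold`,
  })
  return true
}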
Within hours: alerts on every tick. Every 30 seconds. The logs were a wall of the same event.
The dedup layer should have caught this. Same alert, same agents, same threshold breach. But the summary string included agent counts: “3 agents above threshold.” Next tick, agents had completed runs, started new ones, drifted in and out of the rolling window. “3 agents above threshold” became “4 agents above threshold” became “2 agents above threshold.” The count wobbled. The wobble produced a different string. The string produced a different dedup key. Every key was unique. Every alert was “new.”
Technically, the dedup system was working perfectly. It was deduplicating identical alerts. These weren’t identical. They were the same alert wearing a different number.
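Assuming the key is derived straight from the summary string (the derivation below is a guess; only "same summary string, suppressed on repeat" is stated above), the wobble looks like this:

// Hypothetical key derivation: event name plus summary string.
const dedupKey = (event: string, summary: string) => `${event}:${summary}`

dedupKey('execution_failure_rate_spike', '3 agents above threshold')
// "execution_failure_rate_spike:3 agents above threshold"
dedupKey('execution_failure_rate_spike', '4 agents above threshold')
// "execution_failure_rate_spike:4 agents above threshold", a different key, a "new" alert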
The deeper problem was the lookback window. Twenty-four hours. On a system processing hundreds of runs per day, that surfaces TIMEOUT failures from two days prior. The failure rate was technically accurate: a history lesson, not a trend signal.
Fix: 2-hour lookback, plus a hard 2-hour cooldown in the heartbeat itself. If the anomaly check fired in the last two hours, skip this tick. A timestamp, not a string comparison. Inelegant. Effective.
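Sketched, reusing the checkFailureRates stand-in from above. The names are illustrative; the timestamp guard is the point:

// Cooldown guard: skip the anomaly check if it fired within the last two hours.
const TWO_HOURS_MS = 2 * 60 * 60 * 1000
let lastAnomalyAlertAt = 0

async function onHeartbeatTick(): Promise<void> {
  if (Date.now() - lastAnomalyAlertAt < TWO_HOURS_MS) return
  const fired = await checkFailureRates(TWO_HOURS_MS)  // lookback cut to 2 hours
  if (fired) lastAnomalyAlertAt = Date.now()
}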
Every Approved Task in Seven Hours
The stale-branch fast-fail was worse. Not because it was louder. Because it destroyed work.
The idea is sound: before merging a worktree branch, check how far it has diverged from main. A branch 10+ commits behind with only a handful of agent commits is stale. Cheaper to nuke it and re-execute from a fresh base than to fight a messy merge. The threshold: divergence.ahead <= 3. Three or fewer agent commits on a heavily-diverged base, fast-fail. Nuke. Re-execute.
Zero is less than or equal to three.
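The check as shipped, roughly. Only divergence.ahead <= 3 is the stated threshold; the 10-commits-behind cutoff is a reading of the paragraph above, and the return shape matches the one-line fix further down:

// Stale-branch fast-fail as first shipped. Divergence is measured against main.
// The behind >= 10 cutoff is an assumption based on "10+ commits behind".
interface Divergence { ahead: number; behind: number }

function checkStale(divergence: Divergence): { stale: boolean } {
  // Heavily diverged base, only a handful of agent commits on top: fast-fail.
  return { stale: divergence.behind >= 10 && divergence.ahead <= 3 }
}

checkStale({ ahead: 2, behind: 23 })  // { stale: true }, the intended case
checkStale({ ahead: 0, behind: 23 })  // { stale: true }, the bug: nothing committed, still nuked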
Certain tasks don’t write files. Analysis tasks. Recommendation tasks. Sheldon’s security scans. They execute in a worktree, produce a verdict, commit nothing. Their branch has zero commits ahead, however many behind. They’re approved by reviewers. They’re queued for the merge step.
The fast-fail ran on each one. Found: branch is 23 commits behind main, 0 ahead. Concluded: stale branch. Nuked the worktree. Marked the task for re-execution.
Every task approved in the seven hours after the fast-fail deployed was killed. Reviewed, approved, queued for merge, destroyed. Sheldon’s careful reviews, specifically. His analysis tasks that execute without committing files, specifically. The system decided his work was stale before it had existed long enough to age.
The fix was one guard:
if (divergence.ahead === 0) {
  return { stale: false }
}
A branch with nothing to merge has nothing to go stale. The check was built to catch small-work branches on old bases. It caught no-work branches on every base.
The tasks were re-queued automatically by the sibling retry mechanism from 1,742 Errors. The work wasn’t lost. But the fact that recovery depended on a different recent fix catching the fallout: that’s the kind of coincidence that teaches you not to trust coincidences.
The Tile That Just Sat There
The first-try KPI survived day one without incident.
It is a single number: the percentage of tasks that completed with zero revision cycles. Task executed, reviewer approved, done. No bounces. First swing, success. Eight Gates and a Loop documented the dispatch pipeline from the throughput side: tasks in, tasks out, rates and latencies. The first-try KPI asks a different question. Not how many tasks complete. How many complete right.
It renders as a color-coded tile in the analytics dashboard. It doesn’t fire alerts. Doesn’t look back 24 hours. Doesn’t check branch divergence. It counts tasks with status = 'DONE' and zero revisions. Divides. Reports.
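The whole computation, give or take field names. Task, revisionCount, and the choice of denominator are assumptions; status = 'DONE' and zero revisions are from above:

// First-try KPI: share of completed tasks that needed no revision cycles.
// Denominator assumed to be completed tasks, since the text only says "divides".
interface Task { status: string; revisionCount: number }

function firstTryRate(tasks: Task[]): number {
  const done = tasks.filter(t => t.status === 'DONE')
  if (done.length === 0) return 0
  return done.filter(t => t.revisionCount === 0).length / done.length
}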
I checked on it after the anomaly detector caught fire and the fast-fail nuked Sheldon’s work. The tile was there. Green. Quietly correct. It had done nothing wrong because it had made no judgments. It counted.
What Runs Loud
The anomaly detector and the fast-fail failed for the same reason: both were calibrated against the theoretical system, not the one running. The 24-hour lookback was generous on paper. In operation it was a telescope pointed at the past. The ahead <= 3 threshold was conservative on paper. In operation it forgot that entire categories of tasks commit nothing.
You can’t calibrate a detector without running it. You can’t run a detector without it firing on things you didn’t anticipate. The first day of a new detection system is not monitoring. It’s calibration. The data it generates is about itself.
The more a detection system has to decide (is this rate too high? is this branch too old?), the more it needs to be tuned against real conditions. The KPI avoided this because it made no decisions. It counted. Counting is hard to get wrong.
The correct first-day behavior of a detection system is to run loud, get fixed, and run quiet.
Two out of three did exactly that. The third one didn’t need to.