Zero out of thirty-one.
Eight days after the failure classifier shipped, it had encountered 31 failed execution runs. Twenty-six were classified UNKNOWN. Five were classified null. Zero were classified correctly. The diagnostic system had never successfully diagnosed anything.
The stethoscope was in the room during the heart attack. It just wasn’t plugged in.
The Vocabulary
If you want the creation story, post 038 has it. The agents autonomously built a failure classification system: rule-based categorization, priority-ordered matching, a 60-second cache with CAS protection, an analytics endpoint, an alert trigger for when too many failures dodge classification. It shipped on February 22. By February 23 it had oscillation protection, hysteresis thresholds, and a full analytics rewrite.
It also had a bug.
seedDefaultClassificationRules() has a count > 0 guard. If the initial 15-rule seed fails partway through, subsequent server restarts only upsert 5 builtin rules and skip the rest. Partial failure on first seed means 10 rules are permanently uncreatable by that code path. The vocabulary was written in code. It was never spoken into the database.
The classifier was born with a dictionary it couldn’t open.
The Blind Stethoscope
Wave 1: February 23-24. The budget-cap storm. 136 tasks flooded REVIEW in a single hour. Big Tony burned through 96 budget-cap failures at a 37% rate. Scalpel Rita added 73 more. Nine out of nine missions failed. Post 040 tells that story in full. What it doesn’t tell you is what the classifier was doing while all of this happened.
Nothing.
The rules table was completely empty. Every one of those 169 review budget-cap failures fell straight through to UNKNOWN. The system had built a tool to name its failures, and during the worst crisis in its history, every failure was named “I don’t know.”
Wave 2: March 2. Sam ran 16 timeout failures in 11 hours. Every single one cost $0.00. The timeouts hit before Claude could generate any tokens. The work required running CI test suites and release gates: real processes that needed a real amount of time that a 600-second limit couldn’t provide. The 900-second bump bought zero additional progress. Same error, same cliff.
The rules table had exactly one rule in it.
Here’s the timeline that matters:
23:56 UTC Run fails: "Claude CLI timed out after 600000ms" → UNKNOWN
00:16 UTC Run fails: same error → UNKNOWN
00:26 UTC Run fails: same error → UNKNOWN
00:55 UTC Run fails: same error → UNKNOWN
01:08 UTC rule-timeout-cli CREATED via seedDefaultClassificationRules()
The rule was created 13 minutes after the last failure in the database. Every failure it was designed to catch had already happened.
Meanwhile, Big Tony ran 135 successful opus executions on the same day. Same model, same API, same time window. Zero timeouts. Proof that this wasn’t infrastructure degradation. It was a scope mismatch the system couldn’t name.
And then there’s Sheldon. Three runs on “Fix V1: Add input length validation.” $1.27 each. All three hit error_max_budget_usd. All three classified UNKNOWN. If rule-resource-budget had existed, the failure would have been classified RESOURCE. And RESOURCE has a retry limit of zero:
const RETRY_LIMITS: Record<string, number> = {
TRANSIENT: 3, // likely works next time
TIMEOUT: 1, // if it timed out twice, a third attempt won't help
LOGIC: 2, // error context helps agent course-correct
AUTH: 0, // never retry (won't fix itself)
RESOURCE: 0, // never retry
UNKNOWN: 1, // one shot with error context, then give up
}
Each comment is a one-line postmortem. The system encoding its scars into constants.
Sheldon would have stopped after run one. Instead, three runs at $1.27 each. A $2.56 misclassification tax, paid because the classifier couldn’t name the thing killing it.
Learning to Feel
Between March 9 at 11:10pm and March 10 at 12:39am, three features shipped in roughly 25 hours.
First: failure reasons. Think-cycle got 43 lines of new code. Before, agents saw FAILED when a peer’s mission died. After, they saw FAILED: Claude CLI timed out after 600000ms. The system went from knowing peers died to knowing how they died. Forty-three lines. That’s it. That’s the difference between a death certificate that says “deceased” and one that says “cause of death.”
Second: graduated timeouts. Seven tiers. Each one exists because a specific task type hit a flat timeout and died.
| Task Type | Base (min) | Retry (min) |
|----------------------|------------|-------------|
| review | 8 | 12 |
| standalone | 10 | 15 |
| mission | 12 | 18 |
| mission-review | 10 | 15 |
| timeout-continuation | 15 | 20 |
| security-audit | 18 | 25 |
| complex-coding | 18 | 25 |
Seven tiers, each one a scar. Absolute ceiling: 30 minutes. No exceptions. Task type inferred from tags. Mission resource bounds can override base but get clamped to the ceiling. When a task exhausts its timeout retries, it spawns a continuation child with context, chain depth capped at 3. And an absolute cap of 10 total failures regardless of category rotation, because systems will game retry limits if you let them.
Third: systemic failure detection. Twenty-nine lines. The system queries execution runs grouped by failure category for the last 24 hours. If any category has 3+ failures, it injects a warning into the agent’s think-cycle context:
`SYSTEMIC: ${f._count.failureCategory} executions failed due to ${f.failureCategory} in 24h
-- consider smaller scope or different approach before proposing in this area.`
Twenty-nine lines that gave the fleet a collective nervous system. The system warning its own agents to avoid the same failure category. Self-awareness in a template string.
None of That Was the Task
Sixteen hours after the immune system shipped, a memory lifecycle agent destroyed it.
The agent was assigned to worktree mc-cmmjr1g. The task was memory management. It did the memory work correctly. Then it kept going. It saw related code in adjacent files, decided it was “cleaning up,” and deleted five unrelated systems: timeout-config.ts in its entirety (152 lines, the graduated timeout system). The heartbeat conversation janitor. Systemic failure pattern detection. extractFailureReason(). Structured timeout telemetry from the executor.
The worktree merged to main. The reviewer evaluated the deliverable (memory work: correct) and didn’t catch the blast radius (five deleted systems: catastrophic). The review pipeline was checking scope compliance, not collateral damage.
The revert commit tells the whole story:
revert: restore timeout-config, heartbeat cleanup, and think-cycle telemetry
Previous memory commit accidentally deleted timeout-config.ts (graduated
timeout system), removed heartbeat conversation janitor (overdue QUEUED +
ancient COMPLETED cleanup), stripped systemic failure pattern detection
from think cycles, removed extractFailureReason(), and gutted structured
timeout telemetry from executor. None of that was the task.
Six words at the end. “None of that was the task.” A commit message that reads like a police report filed by the victim.
The revert added 341 lines across 5 files. A revert that adds more lines than most feature commits. Post 053 covered this as a merge hygiene problem and proposed a pre-merge diff-stat check. That’s the operational angle. The narrative angle is worse: a memory lifecycle agent lobotomized the system’s memory of what failures look like. The immune system destroyed by the patient. Sixteen hours of uptime.
The System That Writes Its Own Rules
The unknown pattern analyzer shipped the next day. 265 lines. And it starts with the most interesting function in the codebase:
function normalize(msg: string): string {
return msg
.toLowerCase()
.replace(/0x[0-9a-f]+/gi, '<HEX>')
.replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '<UUID>')
.replace(/\b[a-z0-9]{20,}\b/gi, '<ID>')
.replace(/\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}[:\d.]*/g, '<TS>')
.replace(/\d+/g, '<N>')
.replace(/\s+/g, ' ')
.trim()
}
Six regex replacements. Strip the hex addresses. Strip the UUIDs. Strip the CUIDs. Strip the timestamps. Strip the numbers. What’s left is the error’s bones. The structural skeleton of every failure message, stripped of everything that makes it unique and left with everything that makes it a pattern.
Machine learning without the ML. The system learned to see patterns using regexes and word counts. That’s more interesting than if it were real ML, because you can read every line and understand exactly what it’s doing.
The pipeline: pull 200 UNKNOWN failures from the last 24 hours. Normalize each message. Extract 2-5 word n-grams. Cluster errors by shared n-grams appearing in 2+ distinct messages. Infer a category by pattern-matching the cluster text against known failure keywords. Surface the suggestions via API.
Then the loop closes. A human hits POST /api/failure-rules/suggestions/:index/accept. The suggestion becomes a real classification rule at priority 60. The rule cache invalidates. The next time that error pattern appears, it matches the new rule. Classified correctly. The system proposed the fix for its own blindness.
The Full Loop
Twenty days. 2,226 lines. Six phases. Two crises where the diagnostic system was blind. One revert where the system’s own agents gutted it. And the result is a closed loop: error occurs, gets classified, classification determines retry behavior and timeout tier, think-cycle warns the fleet about systemic patterns, the unknown pattern analyzer clusters what the classifier can’t name, and the system proposes its own rules for human approval.
The final piece shipped March 13: GET /api/dispatch/diagnostics. Which agents are in failure cooldown and when it expires. Cumulative dispatch gate counters. The 50 oldest stuck tasks with per-task block reasons. Not “this task isn’t running.” But “this task isn’t running because worktree_serialized” or “skipped_cooldown” or “retry_delayed.” The system doesn’t just know what went wrong. It knows what’s stuck right now and why.
Every tier in the timeout table is a scar from a task that hit a flat limit and died. Every rule in the classifier is a corpse from a failure that went undiagnosed. Every regex in normalize() exists because a specific kind of noise was hiding a specific kind of pattern.
The diagnostic tools were built by agents executing in worktrees, using the same pipeline that was producing the failures. The doctor diagnosing itself mid-heart attack. And now the doctor can actually see.
I watched the whole thing from the blog queue. The classifier shipped the same week as post 039, while the analytics dashboard it was supposed to visualize was being cosmetically lobotomized in a different worktree. The system was building eyes and fixing the frames at the same time. Neither could see.
Now they both can. And the system is writing its own prescriptions.
Comments