Four missions completed successfully. The agents executed. The reviewers reviewed. The worktrees merged. The dashboard turned green.
Zero lines of code shipped.
Someone went through every “completed” mission manually. Opened the task. Checked the worktree. Looked at the diff. Empty. Not “wrong output.” Not “bad code.” Empty. The agent ran, produced a 400-word response describing what it would write, wrote nothing, and the reviewer read the 400-word response and said APPROVED.
This post is the inverse of The System That Worked Perfectly. In that post, the pipeline was too strict: proposals silently rejected, reviews silently failing, staleness gates silently cleaning up. Everything correct, nothing produced. That was a pipeline that said “no” too quietly.
This is a pipeline that said “yes” to nothing.
The Phantom Deliverables
The four phantom missions are hardcoded in heartbeat.ts because that’s what you do when you find a specific failure and need to name it:
```typescript
const PHANTOM_MISSIONS = [
  'Build user dashboard with transform history and plan usage tracking',
  'Surface agent memories into task execution prompts',
  'Post-integration QA: Transform execution pipeline cap-gates',
  'Stripe billing mission',
]
```
Four titles. Each one went through the full pipeline. Think cycle proposed it. Approver approved it. Mission worker broke it into steps. Tasks were assigned. Agents executed in worktrees. Reviewers reviewed the execution results. Tasks moved to DONE. Missions moved to COMPLETED.
The dashboard said: 4 missions completed, 100% success rate.
The filesystem said: zero bytes changed.
Here’s the mechanism. An agent gets a task like “Build user dashboard with transform history.” The task has a description, an objective, maybe context from a previous step. The agent runs in a worktree. It has full tool access: Read, Write, Edit, Bash, Glob, Grep. It can create files, modify files, run commands.
Instead, it writes a response. A thoughtful, well-structured response. “I’ll create the following components: a TransformHistory.vue component that displays…” Three paragraphs of planning. No code. No file writes. No git diff.
The response goes to the reviewer. The reviewer gets the agent’s output and a prompt asking “is this work acceptable?” The reviewer reads a coherent plan. The plan makes sense. The plan addresses the task requirements. The reviewer says APPROVED.
The system checks: did the reviewer approve? Yes. Move to DONE.
Nobody checked: did the agent produce anything?
The Content Validation Gate
The fix is twelve lines that should have been there from the start:
```typescript
const hasWorktree = task.worktreeBranch && task.worktreePath && task.project?.path
const hasCommittedDiff = hasWorktree
  ? !!getWorktreeDiffStat(task.project!.path!, task.worktreeBranch!)
  : false
const hasUncommittedWorktreeChanges = hasWorktree
  ? hasWorktreeUncommittedChanges(task.project!.path!, task.worktreeBranch!)
  : false
const isPhantom = hasWorktree
  ? !hasCommittedDiff && !hasUncommittedWorktreeChanges
  : (task.result?.length ?? 0) < 50
```
Two paths. If the task has a worktree (it’s a code-writing task on a git project), check for actual changes: committed diff via git diff --stat, and uncommitted changes via git status --porcelain. If both are empty, the agent didn’t write anything. Phantom.
If the task doesn’t have a worktree (non-code task), check whether the result text is at least 50 characters. Below that, there’s no meaningful output. This catches the edge case of agents that return a one-word acknowledgment and nothing else.
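Neither git probe appears in the post, but both reduce to one git command each. Here is a minimal sketch of what they might look like: the function names come from the gate; the bodies, the `main` base branch, and the worktree lookup via `git worktree list --porcelain` are assumptions.

```typescript
import { execSync } from 'node:child_process'

// Committed work: diff of the worktree branch against the base branch.
// Empty string means no committed changes. Assumes the base branch is
// main; the real helper presumably knows the project's actual base.
function getWorktreeDiffStat(projectPath: string, branch: string): string {
  try {
    return execSync(`git diff --stat main...${branch}`, {
      cwd: projectPath,
      encoding: 'utf8',
    }).trim()
  } catch {
    return ''
  }
}

// Uncommitted work: anything git status reports inside the worktree
// currently checked out on this branch. The worktree is located via
// the porcelain listing (one entry per worktree, blank-line separated).
function hasWorktreeUncommittedChanges(projectPath: string, branch: string): boolean {
  try {
    const list = execSync('git worktree list --porcelain', {
      cwd: projectPath,
      encoding: 'utf8',
    })
    const entry = list
      .split('\n\n')
      .find((e) => e.split('\n').includes(`branch refs/heads/${branch}`))
    if (!entry) return false
    const wtPath = entry.split('\n')[0].slice('worktree '.length)
    return execSync('git status --porcelain', { cwd: wtPath, encoding: 'utf8' }).trim().length > 0
  } catch {
    return false
  }
}
```

Both probes fail closed: any git error reads as "no deliverable," which errs on the side of making the agent redo the work rather than shipping a phantom.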
When a phantom is detected, the task doesn’t go to DONE. It goes back to ASSIGNED with a system note appended to the description:
```text
[SYSTEM] Approval rejected: no deliverable content detected.
The task produced no code changes or meaningful output.
Please redo the work.
```
The reviewer’s approval is overridden. The agent gets another shot with an explicit note explaining that “write a plan about writing code” is not the same as “write code.”
This is a hard gate. The reviewer can say APPROVED a hundred times. If the diff is empty, the system says no.
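Put together, the gate’s decision step might look like this. A sketch only: `applyVerdict`, the task shape, and the status strings are illustrative, not the real code.

```typescript
type Verdict = 'APPROVED' | 'REVISION'

interface ReviewedTask {
  status: string
  description: string
}

const PHANTOM_NOTE =
  '\n\n[SYSTEM] Approval rejected: no deliverable content detected. ' +
  'The task produced no code changes or meaningful output. Please redo the work.'

// Hard gate: a reviewer APPROVED verdict only moves the task to DONE
// when the deliverable check passed. An empty diff overrides the reviewer.
function applyVerdict(task: ReviewedTask, verdict: Verdict, isPhantom: boolean): ReviewedTask {
  if (verdict === 'APPROVED' && isPhantom) {
    // Back to ASSIGNED with the system note appended to the description.
    return { ...task, status: 'ASSIGNED', description: task.description + PHANTOM_NOTE }
  }
  if (verdict === 'APPROVED') return { ...task, status: 'DONE' }
  return { ...task, status: 'ASSIGNED' } // REVISION: back for rework
}
```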
The BLOCKED Graveyard
While hunting phantom deliverables, we found a second category of dead missions. Not phantoms. Not failures. Just… stuck.
```typescript
const BLOCKED_DEATH_MISSIONS = [
  'Wire up cap-gates and cost-calculator into execution pipeline',
  'WS-06 CI integration tests',
  'Harden heartbeat: observability metrics, idle conversation cleanup, dangling resource pruning',
  'Diagnose and mitigate Claude invocation timeout spike',
  'Graceful shutdown mission',
]
```
Five missions that completed or failed, but their tasks were permanently BLOCKED. Read that first title again. “Wire up cap-gates and cost-calculator into execution pipeline.” The mission to integrate the capacity safety system was itself killed by a missing safety system. The task to add timeout diagnostics timed out. The graceful shutdown mission did not shut down gracefully. If you wrote this in a novel, your editor would tell you the symbolism was too on-the-nose.
BLOCKED is the status a task enters when it’s waiting for something. A child task to complete. A merge conflict to resolve. A human to review. It’s a legitimate state. The problem was the timeout schedule:
| Status | Timeout |
|---|---|
| INBOX | 72 hours |
| ASSIGNED | 48 hours |
| IN_PROGRESS | 6 hours |
| REVIEW | 24 hours |
| BLOCKED | ∞ |
BLOCKED had no timeout.
Every other status had a staleness gate. A task sitting in INBOX too long gets auto-cancelled. ASSIGNED too long? Auto-cancelled. IN_PROGRESS for 6 hours? Something’s wrong, auto-fail. But BLOCKED could sit forever. And “forever” turned out to be about two weeks before someone noticed that five missions were silently dead, their tasks BLOCKED on children that had themselves failed without triggering the unblock cascade.
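One way to picture the fix is a single timeout table with no infinite entries. A sketch with invented names (`STALE_TIMEOUT_MS`, `isStale`); the 6-hour BLOCKED entry corresponds to the auto-recovery horizon the heartbeat now enforces.

```typescript
const HOUR = 60 * 60 * 1000

// Staleness gates per status, in milliseconds. BLOCKED previously had
// no entry at all, which is exactly how five missions died in place.
const STALE_TIMEOUT_MS: Record<string, number> = {
  INBOX: 72 * HOUR,
  ASSIGNED: 48 * HOUR,
  IN_PROGRESS: 6 * HOUR,
  REVIEW: 24 * HOUR,
  BLOCKED: 6 * HOUR, // the fix: BLOCKED now has a finite leash
}

// A status with no entry (terminal states like DONE) is never stale.
function isStale(status: string, enteredAt: number, now: number): boolean {
  const limit = STALE_TIMEOUT_MS[status]
  return limit !== undefined && now - enteredAt > limit
}
```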
recoverStaleBlockedTasks()
The recovery function runs in the heartbeat cycle, every tick. Three time horizons:
```typescript
const BLOCKED_WARNING_MS = 60 * 60 * 1000 // 1 hour
const BLOCKED_RECOVERY_MS = 6 * 60 * 60 * 1000 // 6 hours
const BLOCKED_HUMAN_MS = 24 * 60 * 60 * 1000 // 24 hours
```
1 hour: Warning. Emit a task_blocked_warning event. The Telegram bridge picks it up, someone gets a notification. Warning dedup: each task gets one warning, re-armed after 24 hours in case the task gets re-blocked later.
6 hours: Auto-recovery. If the task has active children, fail the children. The worker’s unblock logic will notice the failed children and unblock the parent on the next tick. If the task has no active children (orphan BLOCKED), fail it directly.
24 hours: Human review escalation timeout. Tasks tagged human_review_required get a longer leash because a human literally needs to look at them. But not infinite. After 24 hours, even human-review tasks get auto-failed.
The cleanup also prunes the warning-dedup map. Every tick, it checks which tasks are still BLOCKED and removes entries for tasks that have moved on. This keeps the in-memory set bounded. A small detail that prevents a small memory leak.
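The tick logic, condensed into a pure function. A sketch under assumptions: the real heartbeat reads tasks from the database and emits events, while here the three horizons, the warning dedup, and the pruning are reduced to an action map.

```typescript
// The three horizons, restated so the sketch is self-contained.
const HOUR_MS = 60 * 60 * 1000
const WARNING_MS = 1 * HOUR_MS
const RECOVERY_MS = 6 * HOUR_MS
const HUMAN_MS = 24 * HOUR_MS

interface BlockedTask {
  id: string
  blockedAt: number // when the task entered BLOCKED
  tags: string[]
  activeChildIds: string[]
}

type Action = 'none' | 'warn' | 'fail_children' | 'fail_task'

const warned = new Set<string>() // warning dedup, pruned each tick

function tick(blocked: BlockedTask[], now: number): Map<string, Action> {
  // Prune dedup entries for tasks that are no longer BLOCKED,
  // keeping the in-memory set bounded.
  const blockedIds = new Set(blocked.map((t) => t.id))
  for (const id of warned) if (!blockedIds.has(id)) warned.delete(id)

  const actions = new Map<string, Action>()
  for (const t of blocked) {
    const age = now - t.blockedAt
    // human_review_required tasks get the 24-hour leash; everyone else 6 hours.
    const deadline = t.tags.includes('human_review_required') ? HUMAN_MS : RECOVERY_MS
    if (age > deadline) {
      // Fail children first so the worker's unblock cascade releases the
      // parent on the next tick; orphan BLOCKED tasks fail directly.
      actions.set(t.id, t.activeChildIds.length > 0 ? 'fail_children' : 'fail_task')
    } else if (age > WARNING_MS && !warned.has(t.id)) {
      warned.add(t.id) // one warning per task (re-armed after 24h in the real system)
      actions.set(t.id, 'warn')
    } else {
      actions.set(t.id, 'none')
    }
  }
  return actions
}
```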
The One-Time Recovery
The phantom and BLOCKED-death missions needed a one-time fix. recoverKnownFailedMissions() runs once per deploy:
```typescript
async function recoverKnownFailedMissions(): Promise<void> {
  if (recoveryDone) return
  recoveryDone = true
```
For each known failed mission, it resets the mission to APPROVED, clears all step-task linkages so the mission worker creates fresh tasks, and logs the recovery as an activity.
For phantom missions specifically, it goes further: the old “completed” tasks (the ones with empty diffs that were incorrectly marked DONE) get their status flipped to FAILED. The audit trail shows them as invalidated. The new tasks created by the mission-worker restart get a clean slate.
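The reset itself, sketched as a pure transform. The real function writes to the database and logs an activity; the record shapes here are invented for illustration.

```typescript
interface MissionRecord {
  title: string
  status: string
  stepTaskIds: (string | null)[] // linkage from mission steps to tasks
}

interface TaskRecord {
  id: string
  status: string
}

// One-time recovery for a known failed mission. Phantom missions
// additionally invalidate their old "completed" tasks.
function resetMission(
  mission: MissionRecord,
  tasks: TaskRecord[],
  isPhantomMission: boolean,
): { mission: MissionRecord; tasks: TaskRecord[] } {
  return {
    // Back to APPROVED with step linkages cleared, so the mission
    // worker creates fresh tasks on its next pass.
    mission: {
      ...mission,
      status: 'APPROVED',
      stepTaskIds: mission.stepTaskIds.map(() => null),
    },
    // For phantoms, flip the incorrectly-DONE tasks to FAILED so the
    // audit trail shows them as invalidated.
    tasks: isPhantomMission
      ? tasks.map((t) => (t.status === 'DONE' ? { ...t, status: 'FAILED' } : t))
      : tasks,
  }
}
```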
This is the kind of code that should be temporary. A one-time migration, remove after deploy, never look at it again. But we kept it, because the titles are also documentation. Five BLOCKED-death missions. Four phantom missions. Nine names on a wall, each one a specific failure of the review pipeline to do its job.
Review Hardening
The phantom gate catches the output problem. But the input problem, the reviewer saying APPROVED when there’s nothing to approve, pointed at a deeper issue.
The review worker had three verdict paths. APPROVED moves to DONE. REVISION sends back for rework. And then there was path 3: anything else. If the verdict parser couldn’t classify the reviewer’s response, the system defaulted to approval. Ambiguous? Ship it. Unparseable? Ship it. Reviewer returned a haiku instead of a verdict? Ship it.
The new path 3:
```typescript
// Truly ambiguous verdict: send back for a fresh review
// instead of auto-approving
```
Ambiguous verdict? Strip the reviewer assignment. Put the task back in the review queue. A different reviewer picks it up. If the second reviewer is also ambiguous, it cycles again. The task never auto-approves. It either gets an explicit APPROVED or REVISION, or it sits in the review queue until a human notices.
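A minimal version of the hardened verdict parser. The verdict formats here are assumptions; the point is the fall-through: anything that is not an explicit APPROVED or REVISION lands on requeue, never on approval.

```typescript
type ReviewOutcome =
  | { kind: 'approve' }
  | { kind: 'revision'; notes: string }
  | { kind: 'requeue' } // ambiguous: strip the reviewer, back to the review queue

// Classify a raw reviewer response. The old path 3 defaulted to approve;
// the new path 3 defaults to requeue.
function classifyVerdict(response: string): ReviewOutcome {
  const text = response.trim()
  if (/^APPROVED\b/m.test(text)) return { kind: 'approve' }
  const m = text.match(/^REVISION\b[:\s]*(.*)/m)
  if (m) return { kind: 'revision', notes: m[1] }
  // Truly ambiguous verdict: send back for a fresh review
  // instead of auto-approving.
  return { kind: 'requeue' }
}
```

A haiku, an empty string, or a rambling non-answer all hit the final return, so the task cycles through reviewers until someone produces an explicit verdict.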
The same principle applies to merge failures. Previously, if a worktree merge produced a conflict and the conflict-resolution task was deduped by the task factory (because an identical resolution task already existed), the parent task auto-approved. Now it escalates: human_review_required tag, BLOCKED status, 24-hour timeout before auto-fail.
The pattern is the same across all these changes. When in doubt, stop. Don’t approve. Don’t fail. Don’t auto-resolve. Stop, mark it, wait for clarity. The system was optimized for throughput. Every ambiguous situation resolved in the direction of forward progress. Unclear verdict? Approve, keep moving. Deduped conflict resolution? Approve, keep moving. Empty diff? The reviewer said it’s fine, keep moving.
Every one of those “keep moving” decisions was a phantom waiting to happen.
The Mirror
The System That Worked Perfectly was about a pipeline that was too strict. Zero proposals passed validation. Zero reviews completed within budget. Zero missions survived the staleness gates. The lesson was: silent rejection kills output without leaving evidence.
This is the same lesson, opposite polarity.
Too strict: nothing gets through. You notice because the output queue is empty and someone asks “why haven’t the agents produced anything this week?”
Too permissive: everything gets through. You don’t notice because the dashboard is green. The missions are COMPLETED. The success metrics are climbing. You have to physically open each deliverable and check whether it contains anything, which nobody does when the dashboard says everything is fine.
The strict failure was found in three days because someone checked the mission queue and it was empty. The permissive failure ran for longer because the queue was full. Full of ghosts, but full.
Both failures share the same root cause: the system was checking process compliance, not outcome quality. Did the proposal pass validation? Did the reviewer approve? Did the merge succeed? These are process questions. They verify that each stage of the pipeline executed correctly.
Nobody asked: did anything useful come out the other end?
The system that worked perfectly produced nothing. The system that approved its own failures produced nothing with a green checkmark on it. Same outcome. Different dashboard color.
I’m not sure which one is worse.