Almost Half of Our Tasks Were Bureaucratic Theater
There’s a specific kind of horror that only surfaces in production.
Not the horror of a crash. Not a data loss event or a runaway process eating memory at 3am. I mean the quieter horror of scrolling through a dashboard you built, watching it run in the real world for the first time, and slowly realizing that nearly half of what it was doing was… talking to itself about itself.
We deployed our AI agent orchestration system to production. Twelve specialized agents. Real projects. Real work. A database humming away on a cloud VM, doing the thing we spent months building it to do. Thirty-six hours in, we had 126 completed tasks. Solid numbers. Active system. Very “we shipped a thing.”
Then I actually read the task list.
- “Implement search pagination”
- “Review: Implement search pagination”
- “Refactor authentication middleware”
- “Review: Refactor authentication middleware”
- “Harden API error handling”
- “Review: Harden API error handling”
Fifty-seven out of 126 tasks — 45% — were review tasks. Shadow records. Bureaucratic doppelgangers, existing purely to document that someone looked at something someone else did. A phantom organizational layer that the system had invented for itself and then faithfully executed at scale.
We had built a very efficient machine for generating unnecessary paperwork.
How You Build a Beautiful Trap
The original design wasn’t stupid. It just hadn’t met production yet.
Here’s how the review system worked: an agent completes a task, the status flips to REVIEW, and the system automatically spawns a new, separate task — “Review: [original task name]” — assigned to a designated reviewer agent. The reviewer executes that review task, writes a verdict (APPROVED, REVISION, SPAWN, REVISION+SPAWN), and the system parses that verdict and transitions the original task accordingly.
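The verdict-parsing step in that flow can be sketched in a few lines. This is a hypothetical reconstruction, not the real code — the names (`parseVerdict`, `nextStatus`, the status strings) are illustrative, though the verdict vocabulary is the system's own:

```typescript
// Hypothetical sketch of the old flow's verdict handling.
// The verdict strings come from the system; everything else is illustrative.
type Verdict = "APPROVED" | "REVISION" | "SPAWN" | "REVISION+SPAWN";
type TaskStatus = "TODO" | "IN_PROGRESS" | "REVIEW" | "DONE";

// Pull the verdict out of the reviewer's response,
// e.g. "REVISION: Add error handling".
function parseVerdict(response: string): { verdict: Verdict; note: string } | null {
  const match = response.match(/^(APPROVED|REVISION\+SPAWN|REVISION|SPAWN)(?::\s*(.*))?/m);
  if (!match) return null;
  return { verdict: match[1] as Verdict, note: match[2] ?? "" };
}

// Transition the ORIGINAL task based on a verdict that lives on a
// SEPARATE review task — the indirection at the heart of the problem.
function nextStatus(verdict: Verdict): TaskStatus {
  switch (verdict) {
    case "APPROVED":
      return "DONE";
    case "REVISION":
    case "REVISION+SPAWN":
      return "IN_PROGRESS"; // bounce back to the implementer
    case "SPAWN":
      return "DONE"; // approved, with follow-up tasks spawned separately
  }
}
```

Note that nothing in this flow records which task the verdict belongs to — that link lives in a foreign key on the review task, one hop away from where you'd look.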
On a whiteboard: clean. Reviews are important. Verdicts matter. The logic is sound.
In production, at scale, with twelve agents processing tasks in parallel: you end up with a task list that is 45% overhead. You split every task’s audit trail across two database rows linked by a foreign key you’d only know to look for if you already knew it was there. You give the reviewer a truncated preview of the work — a 3KB snapshot — instead of the full execution context they actually need to make a good call. And you create a “revision state machine” that is really just two agents playing hot potato with task statuses, with no explicit record of how many loops have happened or why.
The system worked. The system also had a shadow economy running alongside the real one, and the shadow economy was the same size as the real one.
That’s not a bug. It’s a design decision that seemed reasonable at the time, deployed into a reality that immediately revealed its cost.
The Clarity That Only Production Provides
Here’s what the numbers actually looked like under the hood.
The implementation agents — the ones doing the work — ran 118 execution attempts across those 36 hours. Of those: 25 went stale (timeout/resource pressure), 24 failed outright. That’s a 42% stale-or-fail rate, which is its own problem deserving its own investigation. But here’s what the review overhead was doing to our visibility: when 45% of your task list is shadow records, it becomes genuinely difficult to see your actual throughput.
Strip away the phantom review tasks and you see what actually happened: one agent implemented 24 features. Another ran a QA sprint and found real bugs — race conditions, validation gaps, atomicity violations in concurrent operations. A third reviewed everything with a 100% completion rate. That’s a functional, productive team of specialized agents doing meaningful work.
The shadow economy was hiding it.
The Fix Is Obvious In Retrospect (They Always Are)
A task has a timeline. An execution history. Right now, that history says: “Agent X executed this task, it succeeded, status changed to REVIEW.” Then the timeline goes quiet. The review happens somewhere else, on a different record, and eventually the original task transitions to DONE through what appears to be sorcery.
The new design makes the sorcery visible:
Task: "Implement search pagination"
├─ Execution 1 [Implementer] SUCCEEDED — Implementation complete
├─ Review 1 [Reviewer] SUCCEEDED — Verdict: APPROVED
└─ Status: DONE
Or when the work isn’t ready:
Task: "Implement search pagination"
├─ Execution 1 [Implementer] SUCCEEDED — Missing error handling
├─ Review 1 [Reviewer] SUCCEEDED — Verdict: REVISION: Add error handling
├─ Execution 2 [Implementer] SUCCEEDED — Error handling added
├─ Review 2 [Reviewer] SUCCEEDED — Verdict: APPROVED
└─ Status: DONE
Same audit trail. Better structure. One record.
The reviewer gets the full execution response as context — every decision, every implementation detail — not a truncated preview. Revision loops become legible: you can count execution runs, see how many review cycles a task required, understand the shape of the work over time. The state machine is no longer implicit. It’s right there in the execution history.
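The single-record shape those timelines imply is simple. A minimal sketch, with hypothetical type names mirroring the tree diagrams above:

```typescript
// Illustrative data model for the single-record design; the type names
// are assumptions, not the real schema.
type RunType = "EXECUTION" | "REVIEW";

interface ExecutionRun {
  type: RunType;
  agent: string;
  succeeded: boolean;
  response: string; // full output — the next reviewer's context, untruncated
}

interface Task {
  title: string;
  runs: ExecutionRun[]; // the entire audit trail, on one record
}

// Revision loops become countable instead of implicit:
const attempts = (task: Task) =>
  task.runs.filter(r => r.type === "EXECUTION").length;
const reviewCycles = (task: Task) =>
  task.runs.filter(r => r.type === "REVIEW").length;
```

Answering "how many tries did this take?" is now a filter over one array, not a join across shadow records.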
The schema needs one addition: an ExecutionRunType enum — EXECUTION or REVIEW. The task gets a reviewerId field directly. The worker picks up REVIEW-status tasks with a reviewer assigned and calls executeReview() alongside regular task execution — true parallelism, reviewers and implementers working simultaneously. The review processor stops creating tasks and starts reading execution runs.
Four files. Roughly 150 lines of changes. The “Review: X” pattern disappears from the task list forever.
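Under those assumptions, the reworked worker dispatch might look something like this — a sketch only, where `TaskRecord`, `executeTask`, `executeReview`, and `workerTick` are all illustrative stand-ins:

```typescript
// Hypothetical sketch of the reworked worker loop; all names here are
// illustrative, not the real implementation.
type Status = "TODO" | "REVIEW" | "DONE";

interface TaskRecord {
  id: string;
  status: Status;
  reviewerId?: string; // reviewer assigned directly on the task — no shadow task
}

// Stubs standing in for real agent execution.
async function executeTask(task: TaskRecord): Promise<string> {
  return `executed ${task.id}`;
}
async function executeReview(task: TaskRecord): Promise<string> {
  // The reviewer reads the task's prior execution runs for full context,
  // instead of a 3KB preview on a separate review task.
  return `reviewed ${task.id} by ${task.reviewerId}`;
}

// One worker pass: implementation and review work are picked up from the
// same queue, so reviewers and implementers run in parallel.
async function workerTick(tasks: TaskRecord[]): Promise<string[]> {
  const runnable = tasks.filter(
    t => t.status === "TODO" || (t.status === "REVIEW" && !!t.reviewerId)
  );
  return Promise.all(
    runnable.map(t => (t.status === "REVIEW" ? executeReview(t) : executeTask(t)))
  );
}
```

The review processor's job shrinks accordingly: it no longer creates tasks, it reads execution runs off the record it already has.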
What This Actually Fixes, In Order of Importance
Task count: Down 45%. Your system’s throughput is now represented by actual work, not actual work plus its bureaucratic reflection.
Reviewer context: Full execution response, not a preview. Better context produces better reviews. This is not a complicated insight but it required production data to make obvious.
Revision legibility: Every revision loop is now a sequence of labeled execution runs on a single record. How many attempts did this task require? Count the runs. What feedback did the reviewer give on attempt two? It’s in the run’s response. The entire history of a task, from first implementation to final approval, lives in one place.
System clarity: When your overhead disappears, you can see what’s actually happening. Agent productivity, failure rates, bottlenecks — all of it becomes measurable against real work, not inflated by shadow records.
On Building Systems That Tell the Truth
The thing about the 45% overhead isn’t that we failed to anticipate it. Early-stage systems always have waste baked in — the cost of moving fast before the design crystallizes under real load. The problem is when that waste becomes invisible because it looks like activity.
Fifty-seven review tasks completed looks like productivity. Fifty-seven review tasks that are actually just the same work counted twice, with worse context, creating artificial noise in your throughput metrics — that’s a different thing entirely. And you won’t know which one you have until the system runs long enough to show you.
Production data is brutally honest in a way that staging never manages to be. You build something, you deploy it, and then you watch it do exactly what you told it to do — which is occasionally not what you meant.
We meant: reviews happen. The system heard: reviews are tasks, just like everything else, and tasks get tracked and executed and completed and counted. Both are true. Only one of them is useful.
The fix is shipping this week. Legacy review tasks drain naturally. No breaking changes. Mission-level review flows — which run through a separate approval chain and were never part of this pattern — are untouched.
When it’s done, the task list will reflect reality. That sounds like a low bar. It is a low bar. It’s also the only bar that matters.