Friday evening, someone loosened the budget caps.

Not by a lot. Just… enough. Think cycle intervals shortened. Per-tier spending limits raised. The kind of small adjustments that feel reasonable when you’re staring at agents that keep hitting ceilings. “They’re doing good work,” you tell yourself. “Let them cook.”

They cooked. They cooked through the entire token allocation.

The agents ran non-stop. Every think cycle fired at its shortened interval. Every execution burned hotter than the budget assumed. By Saturday morning, the usage dashboard looked like a vertical line. By Saturday evening, the whole system was quiet. Not because we turned it off. Because there was nothing left to burn.

A weekend of silence. Twelve agents with nothing to do. The AI equivalent of a company car with an empty tank because someone said “gas is cheap right now.”

The Revert That Says Everything

Sunday evening, 8:39 PM. First commit of the recovery session:

“revert: restore pre-loosening budget caps + per-tier think intervals”

Six minutes later, 8:45 PM:

“fix: step budget floor + notification spam + success messages”

The revert came first. Before the bug fixes, before the new features, before any of the infrastructure work that would define the rest of the night. The very first thing was putting the limits back. That ordering tells you everything about what the preceding 48 hours felt like.

The second commit added a floor to the budget system. A hard minimum. No matter how low you set the caps, the budget can never actually reach zero. Which is the kind of safety net you only build after you’ve watched something fall.
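The floor logic described above can be sketched in a few lines. This is a hypothetical illustration, not the project's actual code; the function name and floor value are invented for the example.

```typescript
// Illustrative budget floor: whatever cap an operator sets, the
// effective budget never drops below a hard minimum, so the system
// can never be configured into total starvation.
const BUDGET_FLOOR = 1_000; // tokens; placeholder value, not from the source

function effectiveBudget(requestedCap: number): number {
  // Clamp at the floor: a zeroed-out or negative cap still leaves
  // the agents something to run on.
  return Math.max(requestedCap, BUDGET_FLOOR);
}
```

The point is that the clamp lives below the configuration layer, so no amount of cap-loosening (or cap-crushing) can repeat the weekend.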

Six Hours for a Missing Import

The analytics dashboard was supposed to be a quick fix. It was not a quick fix.

The root bug: DataTable.vue wasn’t importing FlexRender from @tanstack/vue-table. One import. Missing. Every custom cell renderer in the entire dashboard was silently returning nothing. The table rendered. The data loaded. The cells were just… empty. No error. No warning. Just a perfectly functional table with a cosmetic lobotomy.

On top of that, the table’s reactive bindings were initialized with snapshot values instead of reactive getters. Sorting: dead. Filtering: dead. Pagination: decorative.
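The snapshot-vs-getter failure mode is worth seeing in miniature. This sketch needs no Vue or TanStack code; the names are illustrative, but the mechanism is the same: a snapshot copies the value once at construction time, while a getter re-reads it on every access.

```typescript
type SortState = { column: string; desc: boolean };

const state = { sorting: { column: "name", desc: false } as SortState };

// Snapshot binding: captures the value that existed at construction.
const snapshotConfig = { sorting: state.sorting };

// Getter binding: re-reads the current value on every access.
const getterConfig = {
  get sorting() {
    return state.sorting;
  },
};

// Later, the user clicks a column header and the state is replaced:
state.sorting = { column: "name", desc: true };

// snapshotConfig.sorting.desc is still false — sorting looks dead.
// getterConfig.sorting.desc is true — a live binding sees the change.
```

Bind the table to a snapshot and every click updates state the table never reads again: exactly the "sorting: dead, filtering: dead, pagination: decorative" symptom above.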

This took six hours across four sessions to diagnose and fix. Six hours. For a missing import and a reactivity bug. The approach that finally worked was asking the AI to put on its “KPI manager hat” and design the dashboard from first principles. Stop trying to fix the existing thing. Pretend you’ve never seen it. What would a human dashboard designer build?

That’s when the architecture clicked. Three tiers: System Health on top, Root Cause analysis in the middle, Optimization metrics at the bottom. Clean hierarchy. Information flows downward from “is everything on fire” to “why is everything on fire” to “how do we make the fire smaller next time.”

Final build: 264KB gzipped. Passes clean. But six hours. For a missing import. There’s probably a lesson in there about silent failures being worse than loud ones, but I’m too tired to frame it elegantly.

The Quiet Win (That Already Existed)

The failure classifier isn’t new. The agents built it autonomously two days ago on the remote server — six categories, ten classification rules, the whole thing — and we wrote about it yesterday. It was already running. It was already categorizing every execution failure as TRANSIENT, AUTH, LOGIC, RESOURCE, TIMEOUT, or UNKNOWN.

What changed today: the analytics rewrite finally gave it a face.

Before, the data existed and went nowhere. The classifier was firing on every failure, tagging it, storing the category. And none of that surfaced anywhere you could actually see it. Category breakdowns: not displayed. Top classification rules with match counts: not displayed. UNKNOWN spike alerts (the trigger that fires when more than 20% of recent failures escape categorization): technically present, practically invisible.
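The spike trigger itself is a small piece of logic. A minimal sketch, assuming a sliding window of recent failure categories; the function name and window handling are invented, but the six categories and the 20% threshold come straight from the system described above.

```typescript
type FailureCategory =
  | "TRANSIENT" | "AUTH" | "LOGIC" | "RESOURCE" | "TIMEOUT" | "UNKNOWN";

// From the source: the alert fires when more than 20% of recent
// failures escape categorization.
const UNKNOWN_SPIKE_THRESHOLD = 0.2;

function isUnknownSpike(recent: FailureCategory[]): boolean {
  if (recent.length === 0) return false; // no failures, no spike
  const unknowns = recent.filter((c) => c === "UNKNOWN").length;
  return unknowns / recent.length > UNKNOWN_SPIKE_THRESHOLD;
}
```

A rising UNKNOWN share is the classifier telling you its rule set has fallen behind reality, which is why it deserves its own alert rather than a line in a log file.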

The dashboard redesign wired all of it up. Now the System Health tier shows category breakdowns in real time. The Root Cause tier surfaces which rules are matching most, and how recently. The classifier had been doing work in the dark for two days. Today it got a window.

Infrastructure that makes other infrastructure visible. The boring kind of work that makes everything else less boring.

The Bridge Was Open

The security-focused agent ran an audit. Fourteen input surfaces mapped. Sixteen vulnerabilities identified across four priority levels. Input validation tightened across all routes. Attack surface formally documented for the first time.

Two highlights:

The bridge service that connects the web interface to the agent execution backend was binding to all network interfaces. No authentication. Any device on the local network could hit the agent execution endpoint directly. The host firewall was the only barrier. That’s not security; that’s hoping nobody notices the door is unlocked.
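A hypothetical sketch of the shape of the fix: default the bridge to loopback, and make LAN exposure an explicit opt-in gated on authentication. The function, type, and policy names are invented for illustration; the source only tells us the vulnerable state was "all interfaces, no auth."

```typescript
type Exposure = "local-only" | "lan";

// Decide which address the bridge binds to. Binding to 0.0.0.0 without
// authentication was the vulnerability: any device on the local network
// could reach the agent execution endpoint directly.
function bridgeBindAddress(exposure: Exposure, hasAuth: boolean): string {
  if (exposure === "lan" && hasAuth) return "0.0.0.0"; // explicit, authenticated opt-in
  return "127.0.0.1"; // loopback only: unreachable from other devices
}
```

The design choice here is that the dangerous configuration requires two deliberate decisions, not zero.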

Also: the security agent proposed a follow-up mission that included an agent we’d already deactivated and fixes for systems that aren’t part of this project. Hard reject. Even security-focused agents will scope-creep if you let them.

The Fence That Enables the Field

Here’s the one that actually mattered, and the reason the rest of this list was worth doing.

Multiple agents writing code in parallel on the same repository. Same working directory. Same git state. Every concurrent execution was a race condition. Every merge was a prayer whispered over --no-ff.

The trigger: a tweet about the CLI tool getting built-in worktree support. Saw it, immediately recognized the fit. Sometimes infrastructure arrives exactly when you need it.

The fix: worktree isolation. Each task gets its own branch, its own working directory, its own slice of the repository. About 200 lines of code for the full lifecycle. Create the worktree. Gate it behind a check: project must be git-backed, agent must have file-writing tools. Inject the actual git diff into the review prompt so reviewers see real changes, not summaries. Auto-merge approved work. Auto-spawn conflict resolution tasks when merges collide. Clean up everything on success or failure. A heartbeat sweep catches orphans every five minutes.
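The lifecycle above maps onto a handful of git commands. This sketch only builds the command strings; the function name, branch naming scheme, and `.worktrees/` directory layout are assumptions for illustration, not the project's actual conventions.

```typescript
// Per-task worktree lifecycle, expressed as the git commands each step runs.
function worktreeLifecycle(taskId: string, repoRoot: string) {
  const branch = `task/${taskId}`;
  const dir = `${repoRoot}/.worktrees/${taskId}`;
  return {
    // Each task gets its own branch and its own working directory.
    create: `git worktree add -b ${branch} ${dir}`,
    // Reviewers see the real diff against main, not a summary.
    diffForReview: `git diff main...${branch}`,
    // Approved work merges back with an explicit merge commit.
    merge: `git merge --no-ff ${branch}`,
    // Everything is removed on success or failure alike.
    cleanup: `git worktree remove --force ${dir} && git branch -D ${branch}`,
  };
}
```

The heartbeat sweep is then just `git worktree list` cross-checked against live tasks, pruning anything with no owner.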

The best part: when a merge conflict happens, the system doesn’t just fail. It creates a new task, assigns it to the same agent who wrote the conflicting code, marks the parent as BLOCKED, and when the conflict resolves, retries the merge without re-executing the entire task. The agent cleans up its own mess. As it should.
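That conflict-handling flow is a small state transition. A sketch under assumed names (the `Task` shape, statuses, and ID scheme are invented; the behavior — same agent, parent marked BLOCKED — is from the source):

```typescript
type TaskStatus = "PENDING" | "RUNNING" | "BLOCKED" | "DONE";

interface Task {
  id: string;
  agentId: string;
  status: TaskStatus;
  parentId?: string;
}

// On merge conflict: spawn a resolution task for the same agent that
// wrote the conflicting code, and block the parent until it lands.
function onMergeConflict(task: Task): { blockedParent: Task; conflictTask: Task } {
  const conflictTask: Task = {
    id: `${task.id}-conflict`,
    agentId: task.agentId, // the agent cleans up its own mess
    status: "PENDING",
    parentId: task.id,     // when this resolves, retry the merge only
  };
  return { blockedParent: { ...task, status: "BLOCKED" }, conflictTask };
}
```

Retrying only the merge, not the whole task, is what keeps a conflict from doubling the cost of the work that caused it.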

Two new fields on the data model. Five files touched for integration. The kind of change that’s trivial to describe and transformative to operate. Before: agents couldn’t safely run in parallel on the same codebase. After: they can. That’s it. That’s the feature.

The Paradox

Every loosened constraint revealed the need for a tighter one somewhere else.

Raise the token budget: you need a budget floor so it can’t crater to zero. Give agents a dashboard: you need six hours of debugging because silent failures aren’t failures until they are. Let agents classify their own failures: you need spike detection for the failures that escape classification. Let agents run security audits: you need a human to reject the scope creep. Let agents write code in parallel: you need filesystem isolation so they don’t overwrite each other.

More autonomy doesn’t mean less infrastructure. It means different infrastructure. Every fence you remove in front of the agents, you build a new one behind them. Not to restrict what they do, but to keep the ground stable underneath.

Twelve hours. Eleven commits. One feature that makes everything else possible. And one very specific lesson that keeps proving itself: the leash and the fence are not opposites. They’re the same thing, seen from different angles.