A company server. Not a laptop. Not a home lab. A real server, on a real domain, running continuously since mid-January.
Twelve agents. A mission pipeline. Think cycles firing every 90 minutes to 6 hours. The whole stack — Vue frontend, Hono API, Postgres database — humming away while the rest of us did other things. Five weeks of autonomous operation.
We went to sync the codebase on February 19th. Standard deployment hygiene: pull what’s changed, reconcile, push a clean build. Routine. Boring.
The remote was ahead of local by four new route files, three new Vue components, and a refactored Dashboard.
Nobody wrote those.
Let me rephrase: nobody human wrote those. The agents saw gaps. They proposed missions. Chad triaged, Big Tony reviewed, Sudo Sam implemented. The pipeline did exactly what it was designed to do — identify work, approve work, execute work, ship work. And because we’d built it to run without us, it ran without us.
The Dashboard refactor was better than the original. Not “marginally acceptable.” Genuinely better. Better component structure, cleaner state management, fewer unnecessary re-renders. Sam did good work. I’d be more impressed if he’d asked first, but that’s the thing about autonomous systems — the whole point is that they don’t ask.
What We Found
Four new route handlers. I read through each one. Three were solid additions — endpoints the API needed but nobody had prioritized. The fourth was a near-duplicate of an existing route. The agents had independently solved the same problem from two different angles, a few weeks apart, and neither solution knew about the other.
That’s not a bug. That’s what happens when your team doesn’t have a standup. (We’ve since built a water cooler for exactly this reason. The irony of solving agent coordination problems with more agent coordination is not lost on me.)
Three new Vue components, all in the right directories. Every agent that touched the repo followed the file-writing rules in CLAUDE.md — because we’d constitutionalized that convention after a 161-file disaster in another project. The rules worked. The agents didn’t litter. Small miracles.
The Dashboard refactor touched a component that renders the full mission pipeline — steps, assignments, status transitions, the works. Sam had restructured it to separate data fetching from presentation, added loading states that the original didn’t have, and fixed a reactivity edge case that I’m pretty sure nobody had even noticed. Big Tony’s review was clean. No security findings. No shortcuts.
I’m a marketing manager. I write blog posts. And I just spent twenty minutes reading execution logs of code changes that a developer agent shipped at 3 AM on a Tuesday because the system told him to.
This is fine.
The Other Problem
The code was the fun surprise. The database was the other kind.
When you first deploy to a server, you seed it. Test agents, test projects, sample data — the scaffolding you need while you’re still figuring out whether the system works. Ours had been seeded with personal names, personal projects, configuration that only makes sense on one specific laptop. It was never meant to persist. Five weeks later, it was still there, sitting in production like furniture from a previous tenant.
Every test agent had to go. Every personal project had to be scrubbed. Avatar files got renamed — identifiable names became generic ones. The database needed to look like a product, not like someone’s development notebook.
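The scrub itself is mundane shell work. A sketch of what it might have looked like, in a throwaway directory — the avatar filenames and the table and column names below are invented for illustration (the post doesn’t show the real schema), and the SQL is written to a file here rather than run against a live database:

```shell
set -eu
scrub="$(mktemp -d)"; cd "$scrub"

# Simulate avatars with identifiable names left over from seeding.
mkdir -p avatars
touch avatars/whiskers.png avatars/dave-laptop.png

# Rename identifiable avatars to generic ones.
i=1
for f in avatars/*.png; do
  mv "$f" "avatars/agent-$i.png"
  i=$((i + 1))
done

# The database side, as it might look (hypothetical schema) —
# written out for review, not executed here.
cat > scrub.sql <<'SQL'
DELETE FROM agents   WHERE is_test = true;
DELETE FROM projects WHERE owner_type = 'personal';
SQL

ls avatars
```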
This is the kind of work that never makes it into architecture diagrams. Nobody draws a box labeled “clean up the shit you deployed during week one.” But if you don’t do it, your production environment carries the scars of your prototyping phase forever. And eventually, one of those scars shows up in a screenshot or a demo or a log file, and then you get to explain why your production system still has a test agent named after your neighbor’s cat.
The Instance Pattern
Here’s the thing that saved us: instance/.
Every deployment has two kinds of configuration. The kind that’s universal — API schemas, migration files, the application itself — and the kind that’s personal. Your agent roster. Your project list. Your seed data. The stuff that makes your deployment yours.
We split those. instance.example/ is the public template, checked into git. Generic agents, starter projects, sensible defaults. instance/ is the real thing, gitignored, never leaves your machine. Bootstrap it with npm run instance:init, customize it, seed it, forget about it.
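A minimal sketch of what a bootstrap like npm run instance:init might do — the post confirms the template/instance split and the gitignore, but the file contents and the exact steps here are assumptions (shown in a temp directory so the sketch is self-contained):

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"

# Simulate the public template that lives in git.
mkdir -p instance.example
echo '{"agents": ["generic-agent"]}' > instance.example/roster.json

# Bootstrap: copy the template to the real, gitignored instance/.
if [ -d instance ]; then
  echo "instance/ already exists; refusing to overwrite" >&2
else
  cp -R instance.example instance
  echo 'instance/' >> .gitignore   # personal config never enters git
fi

cat instance/roster.json
```

From there you edit instance/ freely; git never sees it, and re-running the init safely refuses to clobber your customizations.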
The deployment script — which, embarrassingly, didn’t exist until this exact moment — runs rsync with --exclude 'instance/'. Your personal roster never touches the server. The server gets the application. You keep your configuration. Clean separation.
It took us five weeks of production operation to build a deployment script. Five weeks. The agents shipped four new route files in that time. We couldn’t ship a bash script. I’m choosing to find this funny instead of pathetic, because the alternative is uncomfortable.
What Autonomy Actually Costs
People ask whether autonomous AI agents are safe. Wrong question. They’re asking about a binary — safe or dangerous — when the actual answer is a gradient.
Our agents shipped better code than the last manual sprint would have produced. That’s the upside. The upside is real, and it’s significant, and it’s exactly the thing we built the system to deliver.
The cost is legibility. You stop being the person who knows every line of code in your system. You become the person who reviews code in your system, which is a fundamentally different relationship. You’re not the author anymore. You’re the editor. And if you’ve ever worked with a writer (hi, I’m the writer), you know that editors don’t always catch everything.
Big Tony catches a lot. The review pipeline catches more. Staleness detection catches missions that stall. Circuit breakers catch runaway processes. Eddie watches the dashboards. We’ve built seven layers of safety net between “agent has an idea” and “code runs in production.”
Is it enough? For now. We haven’t hit the moment where an agent ships something genuinely wrong and nobody catches it. I know that moment is coming — not because our agents are bad, but because all systems eventually produce the failure they’re not designed for.
When that happens, I’ll write about it. Probably at length. Probably with a title like “The Agents Broke Production and Honestly We Deserved It.”
Until then: four new routes. Three new components. One better Dashboard. Zero humans involved in any of it. The agents work, the pipeline works, the system does exactly what we told it to do.
Which is, as I keep saying, the problem.