Nobody sat down and designed Mission Control.

I know that’s a weird thing to say fifty-one posts in, about a system that runs twelve agents across eight projects with three interval-based workers, a review pipeline, a voice system, and a pixel art office. But nobody opened a Google Doc titled “Architecture” and wrote down four layers. The layers grew. Each one is a response to something that went horribly, silently, or expensively wrong without it.

Post 030 was the first time we described the system as a whole. This post is different. This is the architecture as a scar map. Four layers, bottom to top, each one the answer to a question we didn’t think to ask until it was too late.

Layer 1: Identity

The question: who is each agent?

The naive answer was “a system prompt.” The real answer took three blog posts to discover. Post 010 found that Bubba’s soul was being concatenated into stdin like a sticky note on a résumé. Data instead of identity. Post 011 found that the bot that diagnosed the problem had the exact same problem. Post 024 found that even when the soul is in the right place (--system-prompt), it fades. Context window dilution. By message 30, a witty opinionated assistant sounds like a compliance officer.

The fix was a 500-character <voice> tag injected into every single message. A per-message reminder that says “you have a personality, use it.” 150 tokens of overhead per interaction, every interaction, forever. The token tax for having agents that don’t all sound like the same LLM wearing different name tags.

Identity isn’t a config you set. It’s a practice you maintain. Every voice profile in voice.ts has a tone, a quirk, a full system directive. deriveVoiceModifiers() reads an agent’s memory store and adjusts dynamically: 10+ lesson memories makes them cautious, 10+ strategy memories makes them think long-term, high confidence scores make them decisive. The voice evolves with experience. Without this layer, you have twelve instances of Claude that happen to have different names.
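
The shape of that dynamic adjustment can be sketched like this. This is a hypothetical reconstruction based only on the thresholds described above; the real `deriveVoiceModifiers()` in voice.ts almost certainly has a different signature and richer inputs:

```typescript
// Hypothetical sketch of the deriveVoiceModifiers() idea; the MemoryStore
// shape and threshold values are assumptions drawn from the prose above.
interface MemoryStore {
  lessons: number;       // count of LESSON memories
  strategies: number;    // count of STRATEGY memories
  avgConfidence: number; // 0..1 across the agent's memories
}

function deriveVoiceModifiers(store: MemoryStore): string[] {
  const modifiers: string[] = [];
  // Agents that have been burned often get careful.
  if (store.lessons >= 10) modifiers.push("cautious");
  // Agents with a deep strategy bank think long-term.
  if (store.strategies >= 10) modifiers.push("long-term");
  // High average confidence makes the voice decisive.
  if (store.avgConfidence > 0.8) modifiers.push("decisive");
  return modifiers;
}
```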

Layer 2: Cognition

The question: what should each agent think about?

Before project assignments, every agent saw everything. Think cycles handed each agent a full system context dump and said “propose work.” The result was twelve agents brainstorming in every direction simultaneously, producing orphan tasks that floated through the database like tumbleweed. Post 043 was the fix: seed.ts got an org chart. selectFocusProject() picks the one project that most needs attention. Round-robin with staleness awareness. Not sexy. Effective.
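
"Round-robin with staleness awareness" is a small idea, so here is a small sketch of it. The field names, the staleness window, and the tie-breaking are all illustrative; only the priority rule (stale projects jump the rotation) comes from the description above:

```typescript
// Hypothetical sketch of selectFocusProject(): round-robin by least-recent
// focus, but a project with no recent significant change jumps the queue.
interface Project {
  id: string;
  lastFocusedAt: number; // epoch ms of last think-cycle focus
  lastChangeAt: number;  // epoch ms of last significant change
}

const STALE_AFTER_MS = 24 * 60 * 60 * 1000; // assumed window

function selectFocusProject(projects: Project[], now: number): Project {
  // Stale projects take priority over pure rotation.
  const stale = projects.filter(p => now - p.lastChangeAt > STALE_AFTER_MS);
  const pool = stale.length > 0 ? stale : projects;
  // Round-robin: pick whichever was focused least recently.
  return pool.reduce((a, b) => (a.lastFocusedAt <= b.lastFocusedAt ? a : b));
}
```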

The cognition layer has three parts that interlock.

Think cycles fire on model-tier intervals: opus every 6 hours (expensive, think carefully), sonnet every 3 hours (balanced), haiku every 2 hours (cheap, think freely). Sheldon overrides this at 90 minutes because security doesn’t get to be thoughtful on a schedule.
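
The tier intervals and the Sheldon override reduce to a lookup with an escape hatch. A minimal sketch, assuming the override is keyed per agent (the function and table names are invented for illustration):

```typescript
// Tier-to-interval mapping from the prose above, in minutes.
const THINK_INTERVAL_MIN: Record<string, number> = {
  opus: 6 * 60,   // expensive: think carefully, rarely
  sonnet: 3 * 60, // balanced
  haiku: 2 * 60,  // cheap: think freely, often
};

// Per-agent overrides beat the tier default. Security doesn't wait.
const AGENT_OVERRIDE_MIN: Record<string, number> = { sheldon: 90 };

function thinkIntervalFor(agent: string, tier: string): number {
  return AGENT_OVERRIDE_MIN[agent] ?? THINK_INTERVAL_MIN[tier];
}
```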

Memory injection does something counterintuitive. LESSON memories (high confidence, learned from failures) are always injected. But INSIGHT, PATTERN, and STRATEGY memories? Injected 30% of the time. Seventy percent of the time, the agent doesn’t see them. Deliberate forgetting. Same agent, same project, same context, different proposals depending on which memories surfaced. The codebase calls it “creative variance.” I call it institutionalized ADHD, but it works.
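
The injection policy is simple enough to sketch directly. The `rng` parameter is my addition for testability; the real code presumably just calls `Math.random()`:

```typescript
// Sketch of the memory injection filter: LESSON memories always surface,
// everything else surfaces 30% of the time. Deliberate forgetting.
type MemoryKind = "LESSON" | "INSIGHT" | "PATTERN" | "STRATEGY";

interface Memory { kind: MemoryKind; text: string; }

function selectMemories(memories: Memory[], rng: () => number = Math.random): Memory[] {
  return memories.filter(m =>
    m.kind === "LESSON" // hard-won lessons always make the cut
    || rng() < 0.3      // the rest: 30% chance, per memory, per cycle
  );
}
```

Two calls with the same inputs can return different sets, which is exactly where the proposal variance comes from.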

Per-project staleness tracking prevents the deadlock from post 041: no tasks complete → no markSignificantChange() → staleness gate blocks think cycles → no new work proposed → no tasks complete. The forced refresh window breaks the loop. Every guard rail is a scar. This one still has stitches.
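
The deadlock and its escape hatch fit in a few lines. A sketch under assumptions: the function name and the forced-refresh window are invented; only the two-clause structure (changed-since-last-think, or it's been too long) comes from the fix described above:

```typescript
// Sketch of the staleness gate plus the forced-refresh escape hatch.
const FORCE_REFRESH_MS = 12 * 60 * 60 * 1000; // assumed window

function shouldThink(lastChangeAt: number, lastThinkAt: number, now: number): boolean {
  const changedSinceThink = lastChangeAt > lastThinkAt;
  // Without this second clause, "no change" blocks thinking forever,
  // and thinking is what produces change: the post-041 deadlock.
  const forcedRefresh = now - lastThinkAt > FORCE_REFRESH_MS;
  return changedSinceThink || forcedRefresh;
}
```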

Layer 3: Execution

The question: how does work actually get done?

The task worker runs on a 30-second setInterval. Finds assigned tasks owned by active agents. Executes up to five in parallel, one per agent, no double-booking. A tickInProgress boolean prevents overlapping ticks. The mission worker runs on a 15-second tick and advances multi-step missions, but only when the current step hits DONE, not REVIEW. That distinction prevents a race condition where tasks get re-executed mid-review.
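
The `tickInProgress` guard is a standard pattern, and worth showing because it's the entire concurrency story for the worker. A minimal sketch; `dispatchTasks` stands in for the real task-finding-and-executing work:

```typescript
// Sketch of the overlap guard on the 30-second worker tick: if the previous
// tick is still running when the interval fires again, skip this one.
let tickInProgress = false;

async function tick(dispatchTasks: () => Promise<void>): Promise<boolean> {
  if (tickInProgress) return false; // previous tick still running: skip
  tickInProgress = true;
  try {
    await dispatchTasks();
    return true;
  } finally {
    tickInProgress = false; // always release, even if dispatch throws
  }
}

// Wiring, roughly: setInterval(() => tick(realDispatch), 30_000);
```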

Three setInterval calls and a Postgres database. No message queues. No Kafka. Post 030 covered this and the reaction was “you’re joking.” We were not joking.

Each agent executes in a worktree. Its own branch, its own working directory, its own isolated slice of the repository. Post 036 shipped this after agents were writing to the same files simultaneously. Worktree isolation meant the merge-on-approval path could gate code changes through the review pipeline without agents stomping on each other’s work.

Context chaining is the connective tissue. When a mission step completes, advanceMission() grabs the task’s result (truncated to 2,000 characters) and prepends it to the next step’s description. Carl researches. Chad reads Carl’s output as a preamble to his own task. No shared memory bus. No vector store. String concatenation into a task description. Inelegant and functional.
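
The whole mechanism is one function. A sketch, with the preamble formatting invented (the 2,000-character truncation is from the description above):

```typescript
// Sketch of the context-chaining handoff in advanceMission(): the previous
// step's result, truncated, prepended to the next step's description.
const RESULT_LIMIT = 2000;

function chainContext(prevResult: string, nextDescription: string): string {
  const truncated = prevResult.slice(0, RESULT_LIMIT);
  // No vector store, no message bus: plain string concatenation.
  return `Previous step result:\n${truncated}\n\n---\n\n${nextDescription}`;
}
```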

Cap gates were scattered across three files until post 045 consolidated them: concurrent-run cap, daily-agent cap, token-budget cap. Three composable functions, short-circuit on first rejection, limits read from the execution_limits policy. The token budget query uses a 30-second cache because you don’t want an aggregate query firing on every dispatch check.
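
The consolidated shape can be sketched as three small functions and a loop. The field names and the gate signature are illustrative; the short-circuit-on-first-rejection structure is the point:

```typescript
// Sketch of the consolidated cap gates: composable checks, first rejection wins.
interface Limits { maxConcurrent: number; maxDailyPerAgent: number; tokenBudget: number; }
interface Usage { concurrent: number; dailyForAgent: number; tokensToday: number; }

type Gate = (l: Limits, u: Usage) => string | null; // null = pass, string = rejection reason

const concurrentGate: Gate = (l, u) =>
  u.concurrent >= l.maxConcurrent ? "concurrent-run cap" : null;
const dailyGate: Gate = (l, u) =>
  u.dailyForAgent >= l.maxDailyPerAgent ? "daily-agent cap" : null;
const tokenGate: Gate = (l, u) =>
  u.tokensToday >= l.tokenBudget ? "token-budget cap" : null;

function checkCaps(l: Limits, u: Usage): string | null {
  for (const gate of [concurrentGate, dailyGate, tokenGate]) {
    const rejection = gate(l, u); // short-circuit on first rejection
    if (rejection) return rejection;
  }
  return null; // all gates passed: dispatch is allowed
}
```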

Layer 4: Review

The question: how do you stop the system from shipping garbage?

This layer is entirely made of scar tissue.

Post 041: the system worked perfectly for five days and produced nothing. A tag validator rejected every proposal silently. A review budget was too low for any review to complete. A staleness gate cleaned up missions before humans could see them. Three independent failures, zero alerts, green dashboard. The lesson: monitor output, not just uptime.

Post 047: four missions completed successfully with zero lines of code shipped. Agents wrote plans about writing code instead of writing code. Reviewers read the plans and said APPROVED. The content validation gate now checks git diff --stat and git status --porcelain. Empty diff? Back to ASSIGNED with a system note: “‘write a plan about writing code’ is not the same as ‘write code.’” The reviewer’s approval is overridden. Hard gate.
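
The override logic reduces to this. A sketch, assuming the git output arrives as strings (the status names `DONE`/`ASSIGNED` are from the system as described; the function name is mine):

```typescript
// Sketch of the content validation gate: an approval doesn't stand unless
// the worktree actually changed. The two strings stand in for real
// `git diff --stat` and `git status --porcelain` output.
interface ReviewResult { verdict: "APPROVED" | "REJECTED"; }

function finalStatus(review: ReviewResult, diffStat: string, porcelain: string): string {
  const hasChanges = diffStat.trim() !== "" || porcelain.trim() !== "";
  if (!hasChanges) {
    // Hard gate: overrides the reviewer. A plan about code is not code.
    return "ASSIGNED";
  }
  return review.verdict === "APPROVED" ? "DONE" : "ASSIGNED";
}
```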

The ambiguous-verdict fix came from the same post. Previously, if the verdict parser couldn’t classify a reviewer’s response, the system defaulted to approval. Now it strips the reviewer and puts the task back in the queue. When in doubt, stop. Don’t approve. Don’t fail. Wait for clarity.
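
A verdict parser with a genuine third state might look like this. The keyword matching is an assumption; the real parser is likely more careful. The structure, where ambiguity is its own outcome rather than collapsing into approve or reject, is the fix:

```typescript
// Sketch of the three-state verdict parse: when the response can't be
// classified cleanly, neither approve nor fail. Requeue and wait.
type Verdict = "APPROVED" | "REJECTED" | "AMBIGUOUS";

function parseVerdict(response: string): Verdict {
  const text = response.toUpperCase();
  const approved = text.includes("APPROVED");
  const rejected = text.includes("REJECTED");
  if (approved && !rejected) return "APPROVED";
  if (rejected && !approved) return "REJECTED";
  return "AMBIGUOUS"; // when in doubt, stop: strip the reviewer, requeue
}
```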

Every guard rail in the review layer encodes a specific failure. The phantom detection gate. The BLOCKED timeout. The merge-conflict escalation. The dedup system that prevents duplicate missions using Jaccard similarity. Post 048 found that the guard itself was misconfigured. Scars all the way down.
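
For the curious, Jaccard similarity over mission text is a few lines: intersection over union of word sets. The tokenization and the 0.7 threshold are assumptions, not the system's configured values:

```typescript
// Sketch of the mission dedup check: Jaccard similarity over word sets.
function jaccard(a: string, b: string): number {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  const intersection = [...setA].filter(w => setB.has(w)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : intersection / union; // 0 = disjoint, 1 = identical
}

function isDuplicateMission(title: string, existing: string[], threshold = 0.7): boolean {
  return existing.some(t => jaccard(title, t) >= threshold);
}
```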

The Shape of It

Four layers. Identity gives the agents a voice. Cognition tells them what to think about. Execution lets them do the work. Review stops them from shipping nothing and calling it done.

The layers didn’t arrive in this order. Identity was post 010. Execution existed from the beginning. Cognition crystallized around post 043. Review was hardened after post 047. The architecture accumulated the way scar tissue accumulates: each layer is the answer to an injury the system couldn’t survive twice.

The whole thing runs on a Mac Mini. Vue 3 frontend with composables for everything. Hono on the server with Prisma and PostgreSQL. Three setInterval workers. Fifty-five agent relationships drifting in real time. SSE with ticket-based auth for real-time updates. A pixel art office that is, against all odds, a load-bearing architectural decision.

There’s no design doc because there was never a moment of design. There were fifty-one moments of “oh, that broke” followed by “here’s the layer that prevents it from breaking that way again.” The architecture is the aggregate of every failure mode we encountered, encoded into systems that remember on our behalf.

The blueprint was drawn in scar tissue. It holds.