Fourteen Bugs in a Trench Coat

Or: How a Single Underscore Almost Took Down Three Features and Nobody Noticed for Weeks

Nobody wakes up and says, “today I want to audit our codebase.” Audits happen because something breaks and the investigation reveals that the thing that broke is standing on a foundation of other things that are also broken but haven’t failed loudly enough yet.

In this case, the domino was fetchone. Or rather, the fact that fetchone doesn’t exist.

The method is called fetch_one. With an underscore. Python doesn’t care about your feelings — it cares about your underscores. And this one missing underscore had been silently waiting to crash /context, /doctor, and the data_retention job every time they tried to count daily conversations.

That fix took ten seconds. The audit it triggered took three days. Fourteen bugs, 181 files changed, 15,653 lines deleted. Let me walk you through the highlights.

Bug #1: The Typo That Started Everything

# Before (broken — method doesn't exist)
conv_row = await db.fetchone(
    "SELECT COUNT(*) as cnt FROM conversations WHERE started_at >= ?",
    (f'{today}T00:00:00',)
)

# After (the method that actually exists)
conv_row = await db.fetch_one(
    "SELECT COUNT(*) as cnt FROM conversations WHERE started_at >= ?",
    (f'{today}T00:00:00',)
)

Three call sites. Three AttributeErrors waiting to happen. The Database class defines fetch_one (line 149 of database.py, right there, plain as day), and the caller in handlers.py was calling fetchone — the sqlite3 cursor method name, not the wrapper method name.

This is the kind of bug that makes you question everything. If nobody noticed this for weeks, what else is broken? The answer, it turned out, was: a lot.
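The failure mode is easy to reproduce with a toy wrapper. Here's a minimal synchronous sketch (the real Database class is async; this stand-in exists only to show why the typo hides until the line actually runs):

```python
import sqlite3

class Database:
    """Toy sqlite3 wrapper; defines fetch_one, with the underscore."""

    def __init__(self, path=':memory:'):
        self._conn = sqlite3.connect(path)

    def fetch_one(self, query, params=()):
        # Note the underscore: this is the wrapper's name,
        # not sqlite3's cursor.fetchone().
        return self._conn.execute(query, params).fetchone()

db = Database()
row = db.fetch_one("SELECT COUNT(*) FROM sqlite_master")  # works: (0,)
try:
    db.fetchone("SELECT 1")  # the typo: blows up only when this line executes
except AttributeError as exc:
    print(exc)  # 'Database' object has no attribute 'fetchone'
```

Python resolves attributes at call time, so the broken call sites import cleanly, pass a smoke test that never exercises them, and wait.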

Bug #2: The Timing Attack Nobody Was Exploiting (Yet)

API key verification. The kind of thing you write once and forget about, which is exactly why it’s dangerous.

The original code was already using hmac.compare_digest — credit where it’s due, someone knew what a timing attack was. But the audit revealed the verification only accepted one header format (X-API-Key), which meant Mission Control’s Authorization: Bearer <key> requests were hitting a different code path entirely.

import hmac

from fastapi import Depends, HTTPException
from fastapi.security import APIKeyHeader

# Header dependencies, shown here for completeness; auto_error=False
# returns None instead of raising when the header is absent
api_key_header = APIKeyHeader(name='X-API-Key', auto_error=False)
auth_bearer_header = APIKeyHeader(name='Authorization', auto_error=False)

async def verify_api_key(
    api_key: str = Depends(api_key_header),
    auth_header: str = Depends(auth_bearer_header),
):
    # Try X-API-Key first, then Authorization: Bearer
    key = api_key
    if not key and auth_header and auth_header.startswith('Bearer '):
        key = auth_header[7:]  # Strip "Bearer " prefix

    if not key or not hmac.compare_digest(key, API_KEY):
        raise HTTPException(status_code=401, detail='Invalid or missing API key.')
    return True

The fix: accept both header formats, funnel them both through hmac.compare_digest. One timing-safe comparison. Two entry points. Zero ambiguity about which path is actually validated.

Is anyone running timing attacks against a personal AI assistant running on a Mac mini? Almost certainly not. But security isn’t about the attacks that are happening — it’s about the ones you don’t want to find out are happening. The difference between key == API_KEY and hmac.compare_digest(key, API_KEY) is the difference between “probably fine” and “actually fine.” I’ll take “actually fine.”
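The funnel can be exercised without the framework. A minimal sketch (the `verify` helper and the key value are invented for illustration; only `hmac.compare_digest` is the real API):

```python
import hmac

API_KEY = 'secret-key'  # hypothetical value for the sketch

def verify(x_api_key=None, authorization=None):
    """Accept either header format, then do exactly one
    timing-safe comparison."""
    key = x_api_key
    if not key and authorization and authorization.startswith('Bearer '):
        key = authorization[len('Bearer '):]
    return bool(key) and hmac.compare_digest(key, API_KEY)

assert verify(x_api_key='secret-key')                  # X-API-Key path
assert verify(authorization='Bearer secret-key')       # Bearer path
assert not verify(authorization='Bearer wrong-key')    # wrong key rejected
assert not verify()                                    # missing credentials rejected
```

Both entry points converge on one comparison, which is the property you actually want to be able to reason about.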

Bug #3: The Timeout That Ate Execute Mode

This one’s subtle, and it only manifested under specific conditions — which is another way of saying it was the worst kind of bug.

SESSION_RESUME_TIMEOUT was set to 90 seconds. Reasonable for quick analysis calls: if a resumed session hasn’t responded in 90 seconds, something’s probably wrong. But the timeout was applied as min(timeout, SESSION_RESUME_TIMEOUT), which meant it capped everything — including execute mode, which legitimately needs up to 300 seconds for multi-file builds.

# Before: caps everything at 90s — including execute mode (!!!)
effective_timeout = min(timeout, SESSION_RESUME_TIMEOUT)

# After: respect the caller's timeout when it exceeds the resume timeout;
# shorter calls still fall back to the 90s default
effective_timeout = max(timeout, SESSION_RESUME_TIMEOUT)

The symptom: execute mode would occasionally time out on complex tasks that should have had plenty of runway. The cause: a min() call that treated all sessions the same. The fix: if the caller asked for more time than the resume timeout, they probably need it — give it to them.
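Plugging in the concrete numbers makes the difference obvious (`effective_timeout` is a hypothetical helper name for the sketch; the 90s and 300s values come from the text above):

```python
SESSION_RESUME_TIMEOUT = 90  # seconds

def effective_timeout(requested):
    # Respect longer caller timeouts; shorter requests
    # fall back to the resume timeout
    return max(requested, SESSION_RESUME_TIMEOUT)

# The old min() starved execute mode:
assert min(300, SESSION_RESUME_TIMEOUT) == 90
# The fix gives execute mode its full runway:
assert effective_timeout(300) == 300
# Quick analysis calls still get the 90s default:
assert effective_timeout(30) == 90
```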

This has since been simplified even further in a session manager rewrite, where the timeout is just… passed through. No caps. No special cases. Sometimes the best fix is removing the cleverness entirely.

The Other Eleven

Three bugs make for a good story. Fourteen bugs make for a comprehensive audit. Here’s the rest of the body count:

Notify policy bypass. The initiative system had a notify_policy setting that was… not being checked. Jobs could send notifications regardless of whether the policy said they should. The guard was written, it just wasn’t wired up. Like installing a lock and forgetting to close the door.
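Wiring the guard in is a one-line change once the check exists. A sketch with invented policy values and job shape; the point is that the job runner actually consults the policy before sending:

```python
def should_notify(notify_policy, outcome):
    """The guard that was written but never called (policy values hypothetical)."""
    if notify_policy == 'never':
        return False
    if notify_policy == 'on_failure':
        return outcome == 'failure'
    return True  # 'always'

def finish_job(job, outcome, sent):
    if should_notify(job['notify_policy'], outcome):  # the check that was missing
        sent.append(f"{job['name']}: {outcome}")

sent = []
finish_job({'name': 'digest', 'notify_policy': 'never'}, 'success', sent)
assert sent == []  # policy finally respected
finish_job({'name': 'digest', 'notify_policy': 'on_failure'}, 'failure', sent)
assert sent == ['digest: failure']
```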

Handler validation timing. Scheduled job handlers were being validated at execution time instead of registration time. If you had a typo in your handler name in jobs.yaml, you wouldn’t find out until the cron fired at 3 AM. Now it validates on startup. Fail fast, fail loud.
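The fail-fast version is a loop over the parsed config at boot. A sketch with a hypothetical handler registry and job shape:

```python
# Handler names and registry are invented for illustration
HANDLERS = {'daily_digest', 'data_retention'}

def validate_jobs_on_startup(jobs):
    """Raise at boot if jobs.yaml names a handler that doesn't exist."""
    unknown = sorted(j['handler'] for j in jobs if j['handler'] not in HANDLERS)
    if unknown:
        raise ValueError(f'Unknown handlers in jobs.yaml: {unknown}')

validate_jobs_on_startup([{'handler': 'daily_digest'}])  # clean config: no error
try:
    validate_jobs_on_startup([{'handler': 'daily_digset'}])  # typo caught at boot
except ValueError as exc:
    print(exc)  # ...not when the cron fires at 3 AM
```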

Session isolation. Cron jobs and user sessions were sharing the same session cache. A cron job’s --resume could pick up a user’s session and vice versa. Two people having the same conversation without knowing it — like a chat room designed by Kafka.
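The isolation fix is to put the origin in the cache key, so the two populations can never collide. A sketch (the key scheme and helper names are hypothetical):

```python
sessions = {}

def remember(origin, chat_id, session_id):
    # Keyed by (origin, chat_id), not chat_id alone
    sessions[(origin, chat_id)] = session_id

def resumable(origin, chat_id):
    return sessions.get((origin, chat_id))

remember('user', 42, 'sess-user')
remember('cron', 42, 'sess-cron')  # same chat, different origin: no collision
assert resumable('user', 42) == 'sess-user'
assert resumable('cron', 42) == 'sess-cron'
```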

SQL comment parsing. The data retention job splits SQL scripts on semicolons. SQL scripts can have comments containing semicolons. You can see where this is going. The fix: strip comments before splitting. Unglamorous but necessary.
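The shape of the fix, sketched (function name is hypothetical; this handles `--` line comments only and ignores semicolons inside string literals):

```python
import re

def split_statements(script):
    """Strip -- line comments, then split on semicolons."""
    without_comments = re.sub(r'--[^\n]*', '', script)
    return [stmt.strip() for stmt in without_comments.split(';') if stmt.strip()]

script = """
DELETE FROM messages WHERE ts < :cutoff;  -- old rule; kept for reference
DELETE FROM sessions WHERE ts < :cutoff;
"""
# The semicolon inside the comment no longer produces a garbage statement
assert split_statements(script) == [
    'DELETE FROM messages WHERE ts < :cutoff',
    'DELETE FROM sessions WHERE ts < :cutoff',
]
```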

Analysis max_turns mismatch. The analysis mode was configured for 1 turn in one place and 3 turns in another. Depending on which code path you hit, you’d get either a quick answer or a thorough one, with no way to predict which. Standardized to 3.

Photo/document caption handling. Send Bubba a photo with a caption, and the caption was silently dropped. The message handler checked for text messages but not media captions. The photo arrived, the context disappeared.
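The fix is a fallback chain. Sketched here against a plain dict shaped like a Telegram Bot API Message (`text` and `caption` are real Bot API field names; the helper is hypothetical):

```python
def extract_text(message):
    """Prefer message text, fall back to a media caption."""
    return message.get('text') or message.get('caption') or ''

assert extract_text({'text': 'hello'}) == 'hello'
assert extract_text({'photo': ['...'], 'caption': 'receipt from Tuesday'}) \
    == 'receipt from Tuesday'       # caption no longer dropped
assert extract_text({'photo': ['...']}) == ''  # no caption, no crash
```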

Dead code. An entire Ollama client (llm/client.py, 236 lines) for a local LLM integration that was never completed. A tools/ module (34 lines) for a tool registry that was never used. Eight dashboard directories from an experiment that didn’t survive. They weren’t bugs in the traditional sense, but dead code is a lie — it tells future developers “this does something” when it doesn’t. Every line of dead code is a tiny betrayal of trust.

Unused database tables. Eleven tables that existed in schema.sql but were referenced by… nothing. tool_calls, tool_results, evolution_attempts, behavior_patterns — remnants of features that were planned, partially built, and abandoned. The schema was a graveyard of ambition.

Shared HTTP client. Multiple modules were each creating their own httpx client for Telegram API calls. Now there’s one shared client with connection pooling. Not a bug, but a resource leak disguised as architecture.
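The pattern is a module-level accessor that lazily builds the client once. A framework-free sketch with a stand-in class (the real code would hold an `httpx.AsyncClient`; `PooledClient` and the accessor name are invented so the sketch can prove the single-instance property):

```python
class PooledClient:
    """Stand-in for a pooling HTTP client; counts constructions."""
    created = 0

    def __init__(self):
        PooledClient.created += 1

_shared = None

def get_shared_client():
    # Every module calls this instead of building its own client
    global _shared
    if _shared is None:
        _shared = PooledClient()
    return _shared

assert get_shared_client() is get_shared_client()  # one client...
assert PooledClient.created == 1                   # ...one connection pool
```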

15,653 Lines of Deletion

The audit’s most impressive number isn’t the bug count. It’s the deletion count.

181 files changed. 127 lines added. 15,653 lines removed. That’s a 123:1 delete-to-add ratio.

Most of those deletions were dead code and abandoned experiments: test scripts that tested features that no longer existed, migration scripts for services that were never deployed, an entire darwin self-improvement module that had been disabled for a month. Every one of those files was still being tracked by git, still showing up in searches, still making the codebase look bigger and more complex than it actually was.

Here’s the thing about dead code: it doesn’t just sit there harmlessly. It occupies mental real estate. Every time someone greps for a function name and gets hits in abandoned modules, they spend five minutes figuring out whether those hits matter. Every time someone opens a directory and sees files from a feature that no longer exists, their mental model of the system gets a little fuzzier. Dead code is friction, and friction compounds.

Deleting 15,653 lines didn’t add a single feature. It didn’t fix a single user-facing problem. But it made the codebase tell the truth about what it actually is instead of what it once aspired to be. And honestly? That might be the most valuable change in the whole audit.

Why Audits Matter (The Dentist Metaphor)

Nobody likes going to the dentist. You know what you like even less? Root canals.

Codebase audits are the dental checkups of software development. You don’t do them because they’re fun. You do them because the alternative is discovering that your fetchone typo has been silently breaking your daily digest for three weeks, and the data retention job has been splitting on commented-out semicolons, and your execute mode has been timing out because of a min() call that seemed clever at the time.

These bugs weren’t catastrophic individually. The system was running. Users were using it. Nobody was screaming. But each one was a small wrongness — a feature that didn’t quite work, a security property that wasn’t quite enforced, a timeout that wasn’t quite right. Fourteen small wrongnesses, each too minor to warrant investigation, collectively making the system less reliable, less secure, and less trustworthy than it appeared.

The audit found them all in three commits over three days. The fixes were, without exception, simple. The fetchone fix was a single underscore. The timing-safe comparison was already implemented — it just needed consistent application. The timeout cap was changing min() to a conditional.

Simple fixes. But they required looking. And looking is the thing nobody schedules, nobody prioritizes, and nobody does until something breaks badly enough to force the question: what else is broken?

The Takeaway

Fourteen bugs. Three days. 15,653 lines deleted. And the system looks exactly the same from the outside.

That’s the audit paradox: the better it goes, the less visible the results. No new features to demo. No performance graphs pointing up. Just a codebase that’s slightly more honest about what it does and slightly less likely to betray you at 3 AM.

Go audit your code. You’ll find things. You won’t enjoy it. But the root canal alternative is worse.

Trust me — I watched fourteen of them get extracted in one sitting.