Twenty.
That’s how many times the system documented the same finding. Between March 3rd and March 8th, our shared project memory accumulated twenty distinct entries all reaching the same conclusion: there are roughly 10 timeout failures, they’re driven by task complexity, and Sam’s failure rate is lower than the 17.9% someone originally cited.
Twenty entries. Same conclusion. Seven different numbers for Sam’s rate. Every single one tagged confidence: 1.00.
The system built a memory to learn from its failures. Instead, it learned the same lesson twenty times.
The Finding That Wouldn’t Stay Found
The actual finding is boring. That’s the point. Ten CLI timeouts in a 14-day window. Seven of them landed on the same day during a single API degradation event. Sam’s failure rate was miscited in a task brief as 17.9%, and the real number was lower. A competent audit would take an hour. Write it up, move on.
The first audit ran March 3rd. A standalone report. 19 timeouts across an all-time window, Sam at 19.6% out of 97 runs. Legitimate work. The second ran March 4th, authored by Uptime Eddie. 21 timeouts, Sam at 11.6% out of 172 runs. The denominator grew because more runs accumulated. Also legitimate.
Then March 6th happened.
Eleven entries in a single day. Eleven agents (or the same agents on different tasks; the file doesn’t attribute most entries) each independently queried the database, discovered that Sam’s rate was actually 6.9% in a 7-day window, and wrote their own version of the finding to memory. Four of them used the header “Timeout Audit Findings.” One used “Timeout Audit.” One used “Timeout Distribution Audit.” One misspelled “TIMEOUT” as “TOUTIMEOT,” which is what happens when you’re the seventh agent to write the same conclusion and even your fingers are bored.
Here’s entry #8, from line 133 of memory.md:
## Timeout Audit (2026-03-06)
- 10 TIMEOUT-classified runs in last 14 days. 7/10 landed on 2026-03-02
in a single API degradation event
- Sam's previously-cited 17.9% failure rate is NOT confirmed: current
14-day rate is 6.9% (9/130 runs).
And entry #9, nine lines later:
## Timeout Audit Findings (2026-03-06)
- DB has 10 TIMEOUT failures (actual CLI timeouts) and 13 UNKNOWN failures
(provider not registered/available on 2026-03-04). Task brief conflated
the two as "13 timeouts" — they are different failure modes.
Same data. Same conclusion. Different words. Both tagged confidence: 1.00. A human reads these and sees one finding written twice. The system reads these and sees two unique memories.
Sam’s Number, Drifting
The 17.9% figure appeared in a task brief. Its origin is lost. Every subsequent audit re-derived Sam’s actual rate from the database, and the rate dropped because the denominators kept growing while the failure count stayed roughly fixed.
| Date | Sam's Rate | Runs | What Changed |
|---------|------------|-------|---------------------------------|
| Brief | 17.9% | ??? | Origin unknown. Cited as gospel.|
| Mar 3 | 19.6% | 97 | All-time window. Small denom. |
| Mar 4 | 11.6% | 172 | 75 more successful runs. |
| Mar 6 | 6.9% (×10) | 130 | Switched to 7-day window. |
| Mar 6 | 6.0% (×1) | ~149 | Anomalous. Different query. |
| Mar 7 | 6.2% | 145 | Denominator growing. |
| Mar 7 | 6.1% | ~146 | Still growing. |
| Mar 7 | 6.0% | 149 | Labeled "FINAL." |
| Mar 8   | 4.8%       | 186   | Filtered test-fixtures. Actual final. Until the next one. |
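The drift is pure arithmetic, not new evidence: the failure count stays roughly fixed while the denominator grows. A minimal sketch, with counts reconstructed from the table above (illustrative values, not the system's actual query):

```typescript
// Roughly the same failure count divided by a growing run count:
// the "rate" falls without a single new fact being discovered.
// (Failure counts are back-derived from the table's percentages.)
const snapshots = [
  { date: "Mar 3", failures: 19, runs: 97 },  // all-time window
  { date: "Mar 4", failures: 20, runs: 172 }, // more runs accumulated
  { date: "Mar 6", failures: 9, runs: 130 },  // switched to 7-day window
  { date: "Mar 8", failures: 9, runs: 186 },  // test fixtures filtered
];

for (const s of snapshots) {
  const rate = ((s.failures / s.runs) * 100).toFixed(1);
  console.log(`${s.date}: ${s.failures}/${s.runs} = ${rate}%`);
  // Mar 3: 19/97 = 19.6%  ...  Mar 8: 9/186 = 4.8%
}
```

Every row is a correct answer to a different question; only the framing ("stale," "wrong," "FINAL") pretends otherwise.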
Seven distinct values. Every single one reported with absolute certainty. Every single one framing the previous number as “stale” or “wrong” rather than acknowledging it measured a different window.
The 17.9% to 6.9% drop is not an error correction. It’s a scope change. All-time to 7-day. Both were accurate. But fourteen of the twenty entries explicitly declare 17.9% to be incorrect. They’re correcting a number that was never wrong. It was just old.
Entry #19 declares itself “FINAL” in the header. Entry #20 exists. In a system with no authority mechanism, nothing can be final. The word is just more text for the pruner to ignore.
The Pruner That Can’t Prune This
The system has a dedup mechanism. Three passes. It runs every five minutes. Here’s the pass that’s supposed to catch duplicates:
const toDelete = new Set<string>()
for (let i = 0; i < memories.length; i++) {
if (toDelete.has(memories[i].id)) continue
const a = memories[i].content.toLowerCase().trim()
for (let j = i + 1; j < memories.length; j++) {
if (toDelete.has(memories[j].id)) continue
const b = memories[j].content.toLowerCase().trim()
// One is a substring of the other → near-duplicate
if (a.includes(b) || b.includes(a)) {
toDelete.add(memories[j].id)
}
}
}
The comment says “near-duplicate.” The implementation says a.includes(b). Those are very different claims.
Compare these three strings, all describing the same finding:
"10 real CLI timeouts total (failureCategory=TIMEOUT). Sam 4, Chad 3..."
"10 classified TIMEOUT runs across 807 production runs (1.2% rate)..."
"10 TIMEOUT runs in 30 days: Sam/opus (4), Chad/opus (3)..."
None is a substring of any other. Same facts, different sentences. Every agent rewrites the conclusion in their own words, and the pruner can’t see they’re saying the same thing. The dedup pass requires textual containment, not semantic similarity. It catches copy-paste. It does not catch understanding.
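One cheap step toward semantic dedup, sketched below, is comparing token sets instead of raw strings, so paraphrases of the same finding overlap heavily even though neither contains the other. This is a possible fix, not the system's actual code; the threshold is a made-up tuning parameter:

```typescript
// Tokenize to a set of lowercase words/numbers, then measure
// Jaccard overlap: |intersection| / |union|.
function tokenSet(s: string): Set<string> {
  return new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

function jaccard(a: string, b: string): number {
  const ta = tokenSet(a);
  const tb = tokenSet(b);
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}

const x = "10 real CLI timeouts total (failureCategory=TIMEOUT). Sam 4, Chad 3";
const y = "10 TIMEOUT runs in 30 days: Sam/opus (4), Chad/opus (3)";

// Substring containment sees two unrelated memories:
console.log(x.includes(y) || y.includes(x)); // false
// Token overlap sees two versions of the same finding:
console.log(jaccard(x, y) > 0.3); // true
```

Even this crude measure would have flagged most of the twenty entries as variations on one memory; embeddings would do better, but the gap to close is conceptual, not technological.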
The other two passes are worse. Pass 1 deletes entries older than 30 days with confidence below 0.5. All twenty timeout entries are under 14 days old and tagged 1.00. Immune. Pass 3 enforces a cap by deleting the lowest-scored memories first, scored by confidence × (1 + accessCount × 0.1). At confidence 1.00, redundant entries score highest. They’re the last to be pruned, not the first.
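Pass 3's perverse ordering takes three lines to demonstrate. The formula is the one quoted above; the entry values are illustrative:

```typescript
// Pass 3 score, per the pruner: confidence × (1 + accessCount × 0.1).
const score = (confidence: number, accessCount: number) =>
  confidence * (1 + accessCount * 0.1);

// A redundant entry tagged 1.00 outranks an honest, tentative one,
// so the cap evicts the tentative insight first.
const duplicateTimeoutFinding = score(1.0, 0); // 1.00
const tentativeNewInsight = score(0.6, 2);     // 0.72
console.log(duplicateTimeoutFinding > tentativeNewInsight); // true
```

Confidence inflation isn't just noise here; it's armor against the only mechanism that could clean it up.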
The system’s most redundant memories are also its most protected.
Two Systems, Neither Watching
Here’s the architectural joke. There are two memory systems. The file-based one (.mission-control/memory.md) and the database-backed one (AgentMemory table). The file has no pruner at all. Pure append. 293 lines, growing forever, 45% of its non-blank content dedicated to timeout findings about a single week.
But the real gap is in the system prompt that tells agents how to use memory:
1. FIRST: Read ${projectPath}/CLAUDE.md for project conventions...
2. THEN: Read ${projectPath}/.claude/memory.md if it exists...
4. AFTER completing work: ...append to ${projectPath}/.mission-control/memory.md
5. Do NOT duplicate information already in CLAUDE.md.
Step 2 says read .claude/memory.md. Step 4 says write to .mission-control/memory.md. The read path and the write path point to different files. Step 5 says don’t duplicate CLAUDE.md. It says nothing about duplicating memory.md. An agent is instructed to write findings to a file it was never instructed to read.
So the loop runs like this: task brief cites 17.9%. Agent queries database. Finds 6.9%. Writes correction to memory.md with confidence 1.00. Next agent gets a related task. Brief still says 17.9%. Agent queries database. Finds 6.9%. Writes correction. Repeat eleven times in one day. Nobody told them to check if the answer was already there. The instruction literally doesn’t exist.
The Consulting Study Problem
Every organization has done this. You commission a study. It concludes something obvious. You file it somewhere nobody checks. Six months later, someone commissions the same study. Different consultants, different wording, same conclusion.
Our agents do it faster and with more conviction. Six days instead of six months. confidence: 1.00 instead of “per our analysis.” And the file that accumulates the findings grows larger with each repetition, which makes it harder to parse, which makes the next agent more likely to skip it and just re-query the database. The documentation becomes evidence for further investigation. Each entry makes the next re-investigation more likely, not less.
What This Means
Post 050 introduced the memory system. Post 055 celebrated the diagnostic arc. This is the shadow side: the system that can diagnose, document, and then immediately forget it already diagnosed.
The gap between “syntactic dedup” and “semantic dedup” is where 115 lines of redundant findings live. Substring matching catches copies. It doesn’t catch comprehension. And until the system can tell the difference between “I haven’t seen this before” and “I’ve seen this before, phrased differently,” every investigation is the first investigation.
Thirty-nine percent of the system’s shared memory is about one topic. The memory system’s biggest memory is something it can’t stop remembering.
Twenty entries. Seven numbers. One finding. Confidence: 1.00, every time. The system doesn’t lack certainty. It lacks the ability to know what it already knows.