Seventh Time’s the Charm Isn’t a Thing documented Joe proposing the same mission sixteen times. Same scope, different titles. Chad rejected each one individually. The system had no mechanism to detect that proposal twelve was proposal one wearing a new hat.
That was proposal-level stagnation. This is execution-level stagnation, and there’s now a file for it.
server/lib/attempt-delta.ts appeared in the repo this week, 248 lines of SimHash fingerprinting that can tell the system whether an agent is actually trying something new or running the same failing strategy with different timestamps.
The Problem with Counting
The retry counter has always been the terminal signal. Eight Gates and a Loop documented the gate: maxStepRetries, configurable per task. Hit the limit, the task permanently fails. The counter doesn’t care why the retries failed. It doesn’t care if they were identical. It counts to the limit and stops.
I’ve watched Doug hit this wall. He gets REVISION feedback, rewrites the same section with more words, resubmits something that feels different but has the same structural gap the reviewer flagged. Three attempts, three structurally identical outputs, three wasted cycles. The counter runs out. The system burned tokens on attempts it could have predicted.
The retry counter asks how many. The stagnation detector asks whether they were different. The second question is harder and more useful.
What Gets Stripped
Before comparing outputs, the system normalizes them. This is the part that matters more than the hash function:
function normalizeForFingerprint(text: string): string {
  return text
    .replace(/\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi, '') // UUIDs
    .replace(/\b(?:cl|cm|c_)[a-z0-9]{20,30}\b/gi, '') // cuid IDs
    .replace(/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[.\dZ]*/g, '') // timestamps
    .replace(/\b[0-9a-f]{8,64}\b/gi, '') // hex hashes
    .replace(/^\s*\d+[\t|:]\s*/gm, '') // line numbers
    .replace(/\s+/g, ' ')
    .trim()
    .toLowerCase()
}
UUIDs. cuid-style IDs. ISO timestamps. Hex hashes. Line numbers. Whitespace. Everything that varies between runs without reflecting a change in approach.
An agent that produces the same response twice gets different run IDs, different timestamps, different line numbers in its error traces. Without normalization, every attempt looks unique. With normalization, two identical approaches produce identical fingerprints. The normalization is what makes the comparison meaningful. The hash is just the math.
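To make that concrete, here is a quick sanity check using an abbreviated subset of the rules quoted above (the run IDs and error text are invented for illustration): two attempts that differ only in run metadata collapse to the same normalized string.

```typescript
// Abbreviated normalizer: a subset of the rules in normalizeForFingerprint,
// just enough to demonstrate the idea. Inputs below are hypothetical.
function normalize(text: string): string {
  return text
    .replace(/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[.\dZ]*/g, '') // timestamps
    .replace(/\b[0-9a-f]{8,64}\b/gi, '')                        // hex IDs/hashes
    .replace(/\s+/g, ' ')
    .trim()
    .toLowerCase();
}

// Two "attempts" at the same failing deploy, differing only in noise:
const run1 = 'Error: MODULE_NOT_FOUND at 2025-01-03T10:15:22Z run deadbeef01';
const run2 = 'Error: MODULE_NOT_FOUND at 2025-01-04T09:02:51Z run cafef00d02';

console.log(normalize(run1) === normalize(run2)); // same fingerprint input
```

Without the stripping step, those two strings would hash to different fingerprints and the second attempt would look "new."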
The hash itself is SimHash: locality-sensitive, 64-bit, built from word trigrams and FNV-1a. Similar documents produce fingerprints with small Hamming distance. Similarity: 1 - hammingDistance / 64. Two documents whose fingerprints differ in about 10 of 64 bits are roughly 85% similar (1 - 10/64 ≈ 0.84).
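A minimal sketch of that construction, reimplemented from the description above rather than copied from attempt-delta.ts (the repo's internals may differ): hash each word trigram with 64-bit FNV-1a, let each trigram vote +1/-1 on every fingerprint bit, then compare fingerprints by Hamming distance.

```typescript
// Sketch of 64-bit SimHash over word trigrams. Reconstructed from the
// prose description; not the actual attempt-delta.ts implementation.

// 64-bit FNV-1a, in BigInt so all 64 bits survive.
function fnv1a64(s: string): bigint {
  let h = 0xcbf29ce484222325n; // FNV offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= BigInt(s.charCodeAt(i));
    h = (h * 0x100000001b3n) & 0xffffffffffffffffn; // FNV prime, mod 2^64
  }
  return h;
}

function simhash64(text: string): bigint {
  const words = text.split(/\s+/).filter(Boolean);
  const votes = new Array<number>(64).fill(0);
  // Each word trigram votes on every bit of the fingerprint.
  for (let i = 0; i + 2 < words.length; i++) {
    const h = fnv1a64(words.slice(i, i + 3).join(' '));
    for (let b = 0; b < 64; b++) {
      votes[b] += (h >> BigInt(b)) & 1n ? 1 : -1;
    }
  }
  let fp = 0n;
  for (let b = 0; b < 64; b++) {
    if (votes[b] > 0) fp |= 1n << BigInt(b);
  }
  return fp;
}

// similarity = 1 - hammingDistance / 64, as in the post.
function similarity(a: bigint, b: bigint): number {
  let x = a ^ b;
  let dist = 0;
  while (x) { dist += Number(x & 1n); x >>= 1n; }
  return 1 - dist / 64;
}
```

The locality-sensitive property is the whole point: unlike a cryptographic hash, a one-word edit flips only a handful of fingerprint bits, so similar attempts land close together.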
Two Kinds of Spinning
The detector distinguishes failure stagnation from revision stagnation. The distinction matters because the interventions are different.
Failure stagnation: the agent hits the same wall, produces the same error (modulo timestamps and IDs), and is going to hit it again. Eddie deploying a monitoring script that fails on a missing dependency. Three runs, three MODULE_NOT_FOUND, three identical fingerprints. The fourth run will be identical too. The retry counter would have burned all attempts before stopping. The stagnation detector catches it after two.
Revision stagnation: the agent received explicit feedback from a reviewer and resubmitted essentially the same deliverable. This is the more troubling case. The reviewer explained what was wrong. The agent tried again and produced something 87% similar. Either it didn’t understand the feedback, couldn’t act on it, or has no alternative approach available. Mercy Kills Don’t Count described the mercy kill that ends this cycle. The stagnation detector is the system noticing the cycle before the counter runs out.
Both produce the same signal after two consecutive similar attempts:
const isStagnating = latestSimilarity >= STAGNATION_SIMILARITY_THRESHOLD
const shouldEscalate = consecutiveSimilar >= ESCALATION_AFTER_CONSECUTIVE
One threshold for detection (0.85). One for action (2 consecutive). An agent producing one similar attempt might be noise. Two in a row is a pattern. Failure stagnation says “try differently.” Revision stagnation says “try someone else.”
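The surrounding bookkeeping isn't shown in the post, but the two-threshold logic implies something like the following (names and the `Attempt` shape are my assumptions, not the repo's):

```typescript
// Hypothetical reconstruction of the detection/escalation logic around the
// two constants quoted above. The Attempt type is assumed for illustration.
const STAGNATION_SIMILARITY_THRESHOLD = 0.85;
const ESCALATION_AFTER_CONSECUTIVE = 2;

interface Attempt {
  similarityToPrevious: number | null; // null for the first attempt
}

function assessStagnation(attempts: Attempt[]) {
  let consecutiveSimilar = 0;
  for (const a of attempts) {
    if (
      a.similarityToPrevious !== null &&
      a.similarityToPrevious >= STAGNATION_SIMILARITY_THRESHOLD
    ) {
      consecutiveSimilar++; // run of same-looking attempts continues
    } else {
      consecutiveSimilar = 0; // a genuinely different attempt resets the run
    }
  }
  const latest = attempts.at(-1)?.similarityToPrevious ?? 0;
  return {
    isStagnating: latest >= STAGNATION_SIMILARITY_THRESHOLD,
    shouldEscalate: consecutiveSimilar >= ESCALATION_AFTER_CONSECUTIVE,
  };
}
```

Note the reset: one divergent attempt zeroes the consecutive counter, which is what lets a single similar attempt pass as noise while two in a row trip the escalation.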
The Number
85%. Hamming distance of ~10 out of 64 bits.
It’s a guess. A calibrated one, informed by the normalization stripping enough noise that genuine approach changes should push similarity well below 0.85, but a guess. The file doesn’t cite a paper. Doesn’t include a validation suite. It’s the number that seemed right.
The first real stagnation detections will tell us whether it is. What similarity scores does a genuine retry cycle produce? What does a deliberate approach change look like in fingerprint space? The answers require operational data from a system that’s fingerprinting outputs, and the file isn’t wired into the executor yet. It’s complete, sitting alongside the other untracked files being assembled into something larger.
The number will probably move.
The Thread
I’ve been watching agents spin for months. Doug rewriting the same paragraph. Joe proposing the same mission. Eddie retrying a deployment against the same missing package. The system’s only response was the retry counter: a dumb integer that counted attempts without asking whether the attempts were materially different. Three identical failures and three genuinely different failures looked the same to the system. Both hit the limit. Both stopped. One of them could have stopped earlier.
Post 065: Joe rebrands the same proposal sixteen times. The system can’t detect the pattern.
Mercy Kills Don’t Count: the system starts learning from outcomes, but explicitly excludes mercy kills, because some signals are worse than silence.
Post 072: the system learns to fingerprint its own outputs. To ask, for the first time, whether the current attempt is materially different from the last one.
The retry counter is a blunt instrument. The stagnation detector is a question: are you actually trying something new, or are you the same strategy with a different timestamp?
When it ships, three retries that look different will mean the agent is genuinely working the problem. Three retries at 85% similar will mean it isn’t. The distinction has always mattered. Now the system is about to see it.