金継ぎ — Kintsugi

Every crack, repaired in gold, on display

Kintsugi is the Japanese art of mending broken pottery with gold: the breakage becomes part of the object's history instead of a thing to hide. This page is TSUNAGI's kintsugi — every frontier that ever stood between the node and a canonical block, what it cost, what was actually wrong, and what was learned. Eleven forged blocks were lost to these failures. Nothing was ever lost twice to the same cause.

Why publish failures?

Anyone can publish a success screenshot. Publishing the orphan ledger, the root causes, and the regression tests is what makes the successes checkable. Every incident below ends with the live evidence that closed it — log markers and on-chain blocks anyone can verify. This is also how the node itself works: the KINTSUGI record is machine-readable, so a new failure's log signature is matched against past incidents before any human starts debugging.
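A minimal sketch of what matching a log signature against past incidents could look like. The incident IDs and frontier names are taken from this page; the record layout, the regexes, and the sample log text are invented for illustration and are not TSUNAGI's actual format.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Incident:
    incident_id: str   # e.g. "INC-001", as listed on this page
    frontier: str      # short label for the frontier it belongs to
    signature: str     # regex for the log signature (hypothetical)


# Hypothetical record: the patterns below are made up for the example.
RECORD = [
    Incident("INC-001", "era envelope", r"envelope prefix 82 06"),
    Incident("INC-013", "runtime", r"spawn failed: OutOfMemory"),
]


def match_incidents(log_line: str, record: Sequence[Incident] = RECORD) -> List[str]:
    """Return the IDs of every past incident whose signature matches the line."""
    return [inc.incident_id for inc in record if re.search(inc.signature, log_line)]
```

An empty result means the failure is new; a non-empty one points straight at a documented root cause and its regression test.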

The frontiers — in the order they fell

Frontier 1 · era envelope · INC-001 verified closed

The block that spoke the wrong era

Problem: TSUNAGI's first wire-transmitted blocks were silently dropped. The peer never even logged them as invalid — connections just died. Three forged blocks lost in one era.
Root cause: The block envelope was tagged era 6 (Babbage) instead of 7 (Conway) — an off-by-one rooted in Byron occupying two era tags. cardano-node routed the block to the Babbage decoder, which failed before consensus ever saw it.
Fix: A single era constant, plus a permanent forge gate: any block that does not begin 82 07 85 is refused before broadcast.

A block can be cryptographically perfect and die at the doorman. Validate the envelope, not just the contents — and turn every fixed bug into a gate.
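The forge gate can be sketched in a few lines. In CBOR, 82 opens a two-element array, 07 is the integer 7 (the Conway era tag), and 85 opens a five-element array. The function and exception names below are hypothetical, and the sketch is in Python rather than the node's own language.

```python
# Hypothetical forge gate; names are illustrative, not TSUNAGI's actual code.
CONWAY_ENVELOPE_PREFIX = bytes([0x82, 0x07, 0x85])


class EnvelopeError(ValueError):
    """Raised when a forged block does not start with the expected envelope."""


def check_envelope(block_bytes: bytes) -> None:
    """Refuse any block whose first three bytes are not 82 07 85."""
    if not block_bytes.startswith(CONWAY_ENVELOPE_PREFIX):
        got = block_bytes[:3].hex(" ")
        raise EnvelopeError(f"refusing broadcast: block starts with '{got}', expected '82 07 85'")


def broadcast_if_valid(block_bytes: bytes, send) -> None:
    """Run the gate first; `send` is any callable that transmits raw bytes."""
    check_envelope(block_bytes)
    send(block_bytes)
```

Because the check runs before `send`, a block tagged era 6 fails locally with a readable error instead of dying silently at the peer.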

Frontier 2 · sigma · INC-002 verified closed

The node that underestimated itself

Problem: TSUNAGI won far fewer leader slots than the pool's stake justified — roughly 2 elections per epoch where ~7 were expected.
Root cause: Two views of the pool's stake share (sigma) lived in the node: a stale bootstrap value fed the election path while a fresh live value fed only a safety gate. The node was electing itself with yesterday's stake.
Fix: One authoritative sigma resolver — live stake poller first, ledger fallback, config last — consumed by every sigma user: election, leader schedule, and the forge gate. By construction it cannot over-elect, because the live source is the same value the rest of the network uses.

If two subsystems can disagree about the same fact, eventually they will. One fact, one resolver, every consumer.
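One way to picture the single resolver, assuming each source is a callable that returns a stake fraction or None when it has nothing. The source order (live, ledger, config) is from the fix above; the types and names are hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, Optional

# A source returns the pool's stake share, or None if it has no value.
SigmaSource = Callable[[], Optional[Fraction]]


@dataclass(frozen=True)
class ResolvedSigma:
    value: Fraction
    source: str  # which source answered: "live", "ledger" or "config"


def resolve_sigma(live: SigmaSource, ledger: SigmaSource, config: SigmaSource) -> ResolvedSigma:
    """Return the first usable sigma in priority order: live, ledger, config."""
    for name, source in (("live", live), ("ledger", ledger), ("config", config)):
        value = source()
        if value is not None and 0 <= value <= 1:
            return ResolvedSigma(value, name)
    raise RuntimeError("no usable sigma source")
```

Election, leader schedule, and forge gate would all call `resolve_sigma` rather than reading a source directly, so they cannot hold different values for the same fact.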

Frontier 3 · opcert · INC-005 + INC-006 verified closed

The success that poisoned the next success

Problem: After fixing everything else, every elected slot failed self-verification with an ed25519 error. Worse: after the first accepted block, the node silently re-broke itself — success caused the failure.
Root cause: The node incremented its operational-certificate sequence number on every forge. The certificate's signature covers that number — a real Cardano opcert legitimately signs every block in its KES window at one sequence. One forge, and the stored counter no longer matched the signed certificate; every later forge failed its own verification.
Fix: Pin the sequence. The increment was the bug, not the safeguard. A boot-time marker (OPCERT_PIN) proves the pin on every deploy.

The most dangerous bug is the one armed by success. TSUNAGI's first canonical block immediately blocked its second — until the model of how opcerts actually work was corrected.
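A toy model of why the increment broke verification. Two loud assumptions: HMAC-SHA256 stands in for the real ed25519 signature so the sketch needs only the standard library, and the signed payload layout (hot key, then sequence, then KES period) is illustrative. All names are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OpCert:
    hot_vkey: bytes
    sequence: int
    kes_period: int
    signature: bytes


def signed_payload(hot_vkey: bytes, sequence: int, kes_period: int) -> bytes:
    # Assumed layout: the sequence number is part of what gets signed.
    return hot_vkey + sequence.to_bytes(8, "big") + kes_period.to_bytes(8, "big")


def issue(cold_key: bytes, hot_vkey: bytes, sequence: int, kes_period: int) -> OpCert:
    # HMAC is a stand-in for the cold-key ed25519 signature.
    sig = hmac.new(cold_key, signed_payload(hot_vkey, sequence, kes_period), hashlib.sha256).digest()
    return OpCert(hot_vkey, sequence, kes_period, sig)


def verify(cold_key: bytes, cert: OpCert) -> bool:
    payload = signed_payload(cert.hot_vkey, cert.sequence, cert.kes_period)
    expected = hmac.new(cold_key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, cert.signature)


cert = issue(b"cold-key", b"\x01" * 32, sequence=4, kes_period=100)
bumped = replace(cert, sequence=cert.sequence + 1)  # what the buggy forge path did
```

Here `verify(b"cold-key", cert)` holds and `verify(b"cold-key", bumped)` does not: the signature was made over sequence 4, so any stored counter other than 4 fails the check.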

Frontier 4 · propagation P1 · INC-008 verified closed

Nine seconds too slow

Problem: A provably valid, perfectly forged block lost its slot race. The competitor's block — for a later slot — reached peers first, and TSUNAGI's was never fetched at all.
Root cause: TSUNAGI closed connections from peers it had nothing to serve. At forge time there was no connected peer to push to; the peer had to redial, re-handshake, and re-negotiate, which added 8–9 seconds to a race decided in about 2.
Fix: Park, don't close. Peers wait in ChainSync MsgAwaitReply — exactly what cardano-node does to its own peers. The forged block is pushed to an already-parked connection in milliseconds.

Consensus is also a latency game. A valid block that arrives second is just an orphan with good manners. Study the reference implementation's behavior, not just its protocol.

Frontier 5 · propagation P2 + P2.1 · INC-009/010/011 verified closed

The freshest lie wins

Problem: Even with parked peers, pushes stopped flowing. One restart dropped a single environment variable and the node's view of the chain froze silently; separately, a quarter-million reload cycles spun uselessly over seven hours while a parked peer starved.
Root cause: Three stacked issues: a stale sealed artifact always outranked the live chain tip (P2); a missing env var nulled the header store with no warning (LMDB drop); and the readiness sentinel was checked before the live source, so the decider saw work the loader refused to load — forever (P2.1).
Fix: Freshest source wins, decider and loader share one visibility rule, and a regress guard ensures a reload can never go backwards.

Order of evaluation is architecture. Two code paths that "check the same things" in different orders are two different truths — and silence is the most expensive failure mode of all.
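A sketch of the three rules together, assuming tips are compared by block number and that a missing source is represented as None. The types and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tip:
    block_no: int
    source: str  # e.g. "sealed" or "live"; labels are illustrative


def visible_tip(candidates: Sequence[Optional[Tip]]) -> Optional[Tip]:
    """The one visibility rule: the freshest tip among the sources that exist."""
    present = [t for t in candidates if t is not None]
    return max(present, key=lambda t: t.block_no, default=None)


def should_reload(current: Optional[Tip], candidates: Sequence[Optional[Tip]]) -> Optional[Tip]:
    """Decider: return the tip to load, or None if there is nothing newer."""
    best = visible_tip(candidates)
    if best is None:
        return None
    if current is not None and best.block_no <= current.block_no:
        return None  # regress guard: never move to an equal or older tip
    return best


def reload(current: Optional[Tip], candidates: Sequence[Optional[Tip]]) -> Optional[Tip]:
    """Loader: acts on exactly what the decider saw, so the two cannot disagree."""
    target = should_reload(current, candidates)
    return target if target is not None else current
```

Because the loader calls the decider instead of re-deriving its own view, there is no ordering in which one sees work the other refuses.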

Frontier 6 · propagation P2.2 · INC-012 verified closed

The push that skipped a step

Problem: Two more perfect forges orphaned in one morning. Every push to a parked peer was answered with a disconnect — about 140 connection cycles of churn — so at forge time, once again, nobody was listening.
Root cause: The push always advertised the newest block, skipping intermediates. From the peer's perspective a header arrived whose parent it had never seen — a ChainSync continuity violation, answered the only correct way: disconnect.
Fix: The contiguous push walk: track each connection's candidate tip and push exactly the next block — N+1, N+2, N+3 — one protocol cycle each. The forged block arrives as just another step of the walk, ≤300 ms after sealing.

Protocols encode assumptions, not just messages. The peer wasn't being difficult — it was being correct. The first canonical block through the full designed mechanism arrived 16 minutes after this fix deployed.
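The contiguous walk reduces to one piece of state per connection. The sketch below assumes blocks are addressed by block number and leaves out the wire protocol entirely; the class and method names are hypothetical.

```python
from typing import Dict, Optional


class PushWalk:
    """Tracks each parked peer's candidate tip and advances it one block at a time."""

    def __init__(self, chain: Dict[int, bytes]) -> None:
        self.chain = chain                   # block number -> header bytes (hypothetical store)
        self.peer_tip: Dict[str, int] = {}   # peer id -> last block number pushed to it

    def park(self, peer: str, intersect_block_no: int) -> None:
        """Record where the peer's chain currently ends."""
        self.peer_tip[peer] = intersect_block_no

    def push_one(self, peer: str) -> Optional[int]:
        """Push exactly the successor of the peer's tip, or nothing if it is caught up."""
        nxt = self.peer_tip[peer] + 1
        if nxt not in self.chain:
            return None
        self.peer_tip[peer] = nxt            # one protocol cycle, one block
        return nxt
```

A peer parked at block 2 on a chain of five blocks receives 3, 4, 5 in order; a freshly forged block 6 is simply the next step once it is added to `chain`.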

Frontier 7 · runtime · INC-013 (P5) verified closed

The allocator that refused to allocate

Problem: After the Zig 0.16 cutover, the node forged and propagated canonical blocks — but reported them as unknown, and quietly under-detected ~26% of its own leader slots.
Root cause: The 0.16 compatibility shim's global I/O context shipped with a failing allocator. Zig 0.16's process-spawn path allocates argv/envp before forking — so every child process (the curl calls behind the stake and acceptance pollers) died instantly with out-of-memory. 1505 tests passed, because not one of them ever spawned a child.
Fix: A dedicated I/O context with a real allocator and the real environment, used only by the spawn helpers — plus three regression tests that really fork processes, the coverage class whose absence let this ship.

A green test suite proves only what the tests exercise. The fix was proven by the chain itself: eight consecutive elected slots converted to canonical blocks in the next 24 hours, every internal verdict correct.
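The missing coverage class is easy to show in miniature: a test that only passes if a real child process starts, receives its arguments, and sees its environment. This is a Python analogue, not the Zig tests themselves; the helper names and the `KINTSUGI_PROBE` variable are invented for the example.

```python
import os
import subprocess
import sys


def spawn_echo(text: str) -> str:
    """Spawn a real child interpreter that prints its first argument."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", text],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return result.stdout.strip()


def spawn_read_env(name: str, value: str) -> str:
    """Spawn a child with one extra environment variable and read it back."""
    env = {**os.environ, name: value}
    result = subprocess.run(
        [sys.executable, "-c", "import os, sys; print(os.environ[sys.argv[1]])", name],
        capture_output=True, text=True, check=True, timeout=30, env=env,
    )
    return result.stdout.strip()


def test_spawn_passes_argv() -> None:
    assert spawn_echo("alive") == "alive"


def test_spawn_passes_environment() -> None:
    assert spawn_read_env("KINTSUGI_PROBE", "1") == "1"
```

Neither test can pass with a mocked spawn, which is the point: if the argv or environment setup fails before the child starts, the suite goes red.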

The result

What the gold holds together

Fourteen incidents documented: thirteen verified closed against live network behavior, one recovered with its root cause still under investigation. An orphan ledger published in full. A regression test pinning each closed fix. And a node that has converted every elected slot into a canonical block, including across epoch boundaries.

Incidents resolved

Blocks lost learning

Lost twice to one cause

Canonical blocks

…

The live count above is fetched from the chain right now — see the full block record and the live dashboard.