Substrate-first sequencing prevented two silent prod drifts in one session

What happened

The KAHN Phase B/C/D plan defaulted to “Phase B (SPA polish) first, Phase C (substrate ops) in parallel if there’s bandwidth.” A four-stream verification at session open inverted that sequencing on risk-asymmetry grounds. The decision prevented at least one prod incident and an unknown amount of future cleanup.

What the inversion caught

1. Retention sweep was wired to the wrong table. Migration 006 added agent_runs/agent_transitions with the same retention columns as the runs/transitions pair. The nightly retention job still SELECTed only runs. Every agent batch landing post-flag-flip accumulated with zero pruning, silently. The workflow was reporting success because the query never returned an agent_runs row. Cost profile: monotonically increasing, DB-state-coupled. A code revert alone could not recover; rows past retention_expires_at would have to be hand-deleted retroactively.

2. No real migration runner. Two prior outages this session would have been zero-cost with a real runner. The cloud-schema-migrations procedure was a shell-history workflow; environment drift was unrecoverable from code state alone.

Both failures were invisible to users. Both would have surfaced as incidents. Both were prevented by inverting the default sequencing.
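To make the first failure concrete, here is a minimal sketch of a retention sweep that covers both table pairs, plus a guard that flags any table carrying a retention_expires_at column that the sweep does not know about. It is not the actual KAHN job: the function names, the RETENTION_TABLES constant, and the use of SQLite are assumptions made so the example is self-contained. Only the four table names and the retention_expires_at column come from the description above.

```python
import sqlite3

# Assumption: a single explicit list of tables the sweep is responsible for.
# The four names are the ones mentioned in the post; the constant is hypothetical.
RETENTION_TABLES = ("runs", "transitions", "agent_runs", "agent_transitions")


def sweep_expired(conn: sqlite3.Connection, now_iso: str) -> dict[str, int]:
    """Delete rows past retention_expires_at in every covered table.

    Returns the number of rows deleted per table, so a run that prunes
    nothing from a table is visible instead of silently "successful".
    """
    deleted = {}
    for table in RETENTION_TABLES:
        # Table names come from the fixed constant above, never from user input.
        cur = conn.execute(
            f"DELETE FROM {table} WHERE retention_expires_at < ?", (now_iso,)
        )
        deleted[table] = cur.rowcount
    conn.commit()
    return deleted


def uncovered_retention_tables(conn: sqlite3.Connection) -> list[str]:
    """List tables that have a retention_expires_at column but are not swept.

    This is the check that would have flagged agent_runs after migration 006.
    """
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    uncovered = []
    for name in names:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
        if "retention_expires_at" in columns and name not in RETENTION_TABLES:
            uncovered.append(name)
    return uncovered
```

Running uncovered_retention_tables in CI or at the start of the nightly job, and failing when it returns anything, turns "a migration added a table the sweep ignores" from a silent drift into a loud error. Reporting per-table delete counts serves the same purpose for the sweep itself.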

The principle, generalised

Some failure modes are reversible by code revert: a broken SPA panel, a 500 endpoint, a regressed test. These are bounded. Some failure modes are irreversible by code revert: unbounded table growth, manual migration drift, contract lock-in once consumers parse a new shape. These are unbounded. When sequencing multi-track work, prioritise the irreversible class first — regardless of how invisible the substrate fix looks compared to the visible artefact. The asymmetry, not the political visibility, drives the sequence.

Why the default keeps re-defaulting

Plan authors default to “visible-first” because stakeholders ask about visible things. Substrate failures aren’t visible until they’re catastrophic. The default isn’t wrong by intent — it’s wrong by selection bias. The fix is a 2×2 diagnostic at sequencing time: each candidate track classified as reversible / irreversible × bounded / monotonically increasing. The monotonically-increasing × irreversible quadrant always sequences first.
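The 2×2 diagnostic can be sketched as a small classification step. The Track type, its field names, and the example classifications are hypothetical. The only rule taken from the text is that the monotonically-increasing × irreversible quadrant sequences first; the other three quadrants are left in their original plan order because the post does not rank them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    name: str
    reversible_by_code_revert: bool  # can a code revert alone undo a failure?
    cost_grows_over_time: bool       # monotonically increasing vs bounded


def is_priority_quadrant(track: Track) -> bool:
    """True for the irreversible x monotonically-increasing quadrant."""
    return (not track.reversible_by_code_revert) and track.cost_grows_over_time


def sequence(tracks: list[Track]) -> list[Track]:
    """Move priority-quadrant tracks to the front.

    sorted() is stable, so tracks within the same group keep the order
    they had in the original plan.
    """
    return sorted(tracks, key=lambda t: 0 if is_priority_quadrant(t) else 1)


# Illustrative classification of the tracks discussed in this post.
plan = [
    Track("Phase B: SPA polish", reversible_by_code_revert=True, cost_grows_over_time=False),
    Track("Phase C: retention sweep covers agent_runs", reversible_by_code_revert=False, cost_grows_over_time=True),
]

for track in sequence(plan):
    print(track.name)
# Prints the Phase C track first, then Phase B.
```

The value of writing it down is less the sort itself than the forced classification: each track needs an explicit answer to "can a revert fix this?" and "does the cost keep growing while it waits?" before it gets a place in the sequence.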

Business framing

Investing one week of substrate work (C-track) before any visible artefact (B-track) feels expensive. The price of not doing it is two prod incidents that would have surfaced at customer-onboarding time — exactly when the demo is supposed to land cleanly. Evidence-driven sequencing decisions are cheap to make (≈5 minutes of classification). The incidents they prevent are not. A doctrine has been filed at devarno-cloud/atlas/doctrines/substrate-before-surface-sequencing.doctrine.md so the next planning session inherits the diagnostic.