
What happened

A user reported that casa.devarno.cloud went into an infinite loop after sign-in. Five consecutive PRs shipped, each correctly fixing a real bug. Among them: (1) the dead BetterAuth /sign-in page was removed, (2) the ImpersonationBanner that caused a request storm was removed, (3) the middleware was corrected to set headers on the request rather than the response, (4) the cookie read was moved from headers() to cookies() for Next 16. All deployed cleanly. The loop persisted. Only then did a three-line diagnostic endpoint reveal the actual error, raised by Postgres: column "tier" of relation "users" does not exist. Drizzle migration 0007 had been sitting in the repo unapplied for weeks. CASA’s getAuthUser try/catch had been swallowing this exception on every request since TASKSET 7 shipped.

Why the symptom chain was misleading

Every layer had a real bug. The earlier fixes were correct and worth shipping; they just weren’t this bug. The loop symptom was consistent with all of them. Without data (the probe), pattern-matching on symptoms is a random walk.

The pattern to adopt

When a production symptom is “function X returns null in a hot path” and X has a catch { ... return null } anywhere on its path, stop speculating after one speculative fix. Ship a probe. A three-line route handler that re-runs X in isolation and returns the raw exception body will outperform any number of theory-led patches.
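
A minimal sketch of such a probe, assuming a route handler in the style of the Next.js App Router. The name loadAuthUserOrThrow is a hypothetical stand-in for whatever X does before its catch block, and the error it throws here is hard-coded to mirror the incident; none of this is taken from the CASA codebase.

```typescript
// Hypothetical stand-in for the real lookup. In the incident, the equivalent
// call failed inside Postgres and a surrounding catch hid the error.
type AuthUser = { id: string; tier: string };

async function loadAuthUserOrThrow(): Promise<AuthUser> {
  throw new Error('column "tier" of relation "users" does not exist');
}

// Convert any thrown value into a JSON-serialisable shape.
function describeError(err: unknown): { name: string; message: string; stack?: string } {
  if (err instanceof Error) {
    return { name: err.name, message: err.message, stack: err.stack };
  }
  return { name: "NonError", message: String(err) };
}

function jsonResponse(body: unknown, status: number): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "content-type": "application/json" },
  });
}

// The probe: re-run the lookup in isolation and return the raw failure.
// In an App Router project this would be exported as GET from a route.ts
// file (an assumption about project layout).
async function GET(): Promise<Response> {
  try {
    const user = await loadAuthUserOrThrow();
    return jsonResponse({ ok: true, user }, 200);
  } catch (err) {
    return jsonResponse({ ok: false, error: describeError(err) }, 500);
  }
}
```

Because the response body includes the stack trace, a probe like this is best kept behind an access check and deleted once the root cause is known.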

Cost accounting

| Step | Attempts | Time cost |
|---|---|---|
| Theory-led patches | 5 PRs (all valid bugs, none the root cause) | ~3 hours |
| Probe endpoint | 1 PR | ~15 minutes |
| Actual fix | 1 PR | ~10 minutes |

Probe-first would have saved ~80% of wall time and reduced decision fatigue.

Generalization

  • If a function’s failure mode is return null, wrap it in a probe before you patch upstream. If it’s throw, the stack already tells you the answer and a probe isn’t needed.
  • Treat every layer-4 fix as a layer-5 symptom until proven otherwise. Five successive PRs that “should have fixed it” is a signal you’ve misnamed the problem.
  • Live cookies + live curl beat local reproduction when the bug is environment-specific. Sign up a throwaway airlock user, grab the cookie, curl CASA’s endpoints. Repeatable, deterministic, requires zero IDE state.
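
The live-cookie check in the last bullet can also be scripted. The sketch below uses the built-in fetch in Node 18+ rather than curl, only to keep the examples in one language. The function name callWithSessionCookie, the path, and the cookie value in the usage note are all hypothetical.

```typescript
type ProbeResult = { status: number; body: string };

// fetchImpl is injectable so the function can be exercised without a network.
async function callWithSessionCookie(
  baseUrl: string,
  path: string,
  cookie: string,
  fetchImpl: typeof fetch = fetch,
): Promise<ProbeResult> {
  const res = await fetchImpl(new URL(path, baseUrl), {
    // Replay the cookie copied from a real signed-in session.
    headers: { cookie },
    // Do not follow redirects automatically, so a sign-in loop shows up
    // as a single response to inspect rather than a chain.
    redirect: "manual",
  });
  return { status: res.status, body: await res.text() };
}
```

A call might look like callWithSessionCookie("https://casa.devarno.cloud", "/api/probe", "session=<copied value>"), where both the path and the cookie name are placeholders to be replaced with the real ones.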

What changes

  • doctrines/surface-swallowed-errors-with-probe-endpoint.doctrine.md is the reusable playbook.
  • Agents and devs should reach for a probe after the first failed speculative fix, not the fifth.
  • Consider a lint rule (or atlas ADR) against new catch { return null } patterns that don’t also emit a structured error-counter metric.
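
One shape the last bullet could take is a small wrapper that still maps failures to null but counts and logs them. This is a sketch only: nullOnError, recordSwallowedError, and the in-memory errorCounts map are hypothetical names, and the map stands in for whatever metrics client the project actually uses.

```typescript
// In-memory counter standing in for a real metrics client (hypothetical).
const errorCounts = new Map<string, number>();

function recordSwallowedError(site: string, err: unknown): void {
  errorCounts.set(site, (errorCounts.get(site) ?? 0) + 1);
  const message = err instanceof Error ? err.message : String(err);
  // One structured line per failure so a log search can find it.
  console.error(JSON.stringify({ event: "swallowed_error", site, message }));
}

// Wraps a lookup so failures still become null but are never silent.
async function nullOnError<T>(site: string, fn: () => Promise<T>): Promise<T | null> {
  try {
    return await fn();
  } catch (err) {
    recordSwallowedError(site, err);
    return null;
  }
}
```

With a wrapper like this, callers keep the null contract, while a missing column would show up as a rising counter on the first request instead of staying hidden for weeks.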