What happened
A user reported that casa.devarno.cloud showed an infinite loop after sign-in. Five consecutive PRs shipped, each correctly fixing a real bug: (1) dead BetterAuth /sign-in page removed, (2) request-storm-causing ImpersonationBanner removed, (3) middleware set headers on response instead of request, (4) cookie-read moved from headers() to cookies() for Next 16. All deployed clean. The loop persisted.
Only then did a three-line diagnostic endpoint reveal the actual error — Postgres: column "tier" of relation "users" does not exist. Drizzle migration 0007 had been sitting in the repo un-applied for weeks. CASA’s getAuthUser try/catch had been swallowing this exception on every request since TASKSET 7 shipped.
Why the symptom chain was misleading
Every layer had a real bug. The earlier fixes were correct and shipped; they just weren’t this bug. The loop symptom was consistent with all of them. Without data (the probe), pattern-matching on symptoms is a random walk.
The pattern to adopt
When a production symptom is “function X returns null in a hot path” and X has a `catch { ... return null }` anywhere on its path, stop speculating after one speculative fix. Ship a probe. A three-line route handler that re-runs X in isolation and returns the raw exception body will outperform any number of theory-led patches.
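A minimal sketch of the probe idea, under stated assumptions: `loadUserUnsafe` is a hypothetical stand-in for whatever the swallowing function calls internally, and the simulated error message simply mirrors the one from the incident. The real `getAuthUser` is not shown in this document, so its shape here is a guess. The point is only the contrast: the wrapped function reports `null`, while the probe reports the exception itself.

```typescript
// Hypothetical user shape; the real one is not shown in the source.
type AuthUser = { id: string; tier: string };

// Hypothetical stand-in for the inner call that actually fails.
async function loadUserUnsafe(): Promise<AuthUser> {
  throw new Error('column "tier" of relation "users" does not exist');
}

// The swallowing pattern: every failure collapses to null.
async function getAuthUser(): Promise<AuthUser | null> {
  try {
    return await loadUserUnsafe();
  } catch {
    return null;
  }
}

type ProbeReport =
  | { ok: true; value: unknown }
  | { ok: false; name: string; message: string; stack: string | undefined };

// Re-runs a function in isolation and returns the raw failure
// as serializable data instead of hiding it.
async function probe(fn: () => Promise<unknown>): Promise<ProbeReport> {
  try {
    return { ok: true, value: await fn() };
  } catch (err) {
    const e = err instanceof Error ? err : new Error(String(err));
    return { ok: false, name: e.name, message: e.message, stack: e.stack };
  }
}
```

In a Next.js route handler this could plausibly be exposed as `export async function GET() { return Response.json(await probe(loadUserUnsafe)); }`, ideally behind some access check and removed once the root cause is found.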
Cost accounting
| Step | Attempts | Time cost |
|---|---|---|
| Theory-led patches | 5 PRs (all valid bugs, none the root cause) | ~3 hours |
| Probe endpoint | 1 PR | ~15 minutes |
| Actual fix | 1 PR | ~10 minutes |
Generalization
- If a function’s failure mode is `return null`, wrap it in a probe before you patch upstream. If it’s `throw`, the stack tells you the answer, so a probe isn’t needed.
- Treat every layer-4 fix as a layer-5 symptom until proven otherwise. Five successive PRs that “should have fixed it” is a signal you’ve misnamed the problem.
- Live cookies + live curl beat local reproduction when the bug is environment-specific. Sign up a throwaway airlock user, grab the cookie, curl casa’s endpoints. Repeatable, deterministic, requires zero IDE state.
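The live-cookie replay in the last bullet can also be scripted rather than typed into curl. The sketch below is an assumption-heavy illustration: the endpoint path and cookie string are placeholders, and the `Fetcher` type is a deliberately narrow shape so a fake can be injected. In Node 18+ the global `fetch` should be assignable to it.

```typescript
// Narrow fetch-like shape so the function can be exercised without a network.
type Fetcher = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ status: number; text(): Promise<string> }>;

// Host taken from the incident; paths passed in by the caller are hypothetical.
const BASE_URL = "https://casa.devarno.cloud";

// Sends one request with a captured session cookie and returns the raw result.
async function replayWithCookie(
  fetcher: Fetcher,
  path: string,
  cookie: string,
): Promise<{ status: number; body: string }> {
  const res = await fetcher(BASE_URL + path, { headers: { cookie } });
  return { status: res.status, body: await res.text() };
}
```

A call might look like `await replayWithCookie(fetch, "/api/some-endpoint", "<cookie copied from the throwaway user>")`, where both arguments are placeholders to fill in from the live session.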
What changes
- `doctrines/surface-swallowed-errors-with-probe-endpoint.doctrine.md` is the reusable playbook.
- Agents and devs should reach for a probe after the first failed speculative fix, not the fifth.
- Consider a lint rule (or atlas ADR) against new `catch { return null }` patterns that don’t also emit a structured error-counter metric.
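As an illustration of what such a rule would push code toward, here is a rough sketch of a null-returning wrapper that still counts what it swallows. The in-memory `swallowedErrorCounts` object is a hypothetical stand-in for a real metrics client, and the `site` label format is an assumption.

```typescript
// Hypothetical in-memory counter; a real service would use its metrics client.
const swallowedErrorCounts: Record<string, number> = {};

// Records one swallowed error, keyed by call site and error name.
function countSwallowed(site: string, err: unknown): void {
  const name = err instanceof Error ? err.name : "NonError";
  const key = site + ":" + name;
  swallowedErrorCounts[key] = (swallowedErrorCounts[key] || 0) + 1;
}

// Same contract as `catch { return null }`, but the failure leaves a trace.
function nullOnError<T>(site: string, fn: () => T): T | null {
  try {
    return fn();
  } catch (err) {
    countSwallowed(site, err);
    return null;
  }
}
```

The caller still gets `null`, so behaviour is unchanged, but a counter that climbs on every request is the kind of signal that could have pointed at the failing query without a probe.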