Context
Airlock’s cross-apex handoff (/api/auth/handoff) had been designed and landed in code for weeks, but stratt.dev was the first external apex to actually exercise it end-to-end. Everything that was theoretical about the flow turned out to matter in practice — in three places that were not the places I expected.
What we got right
- Separation of metadata from gating. CASA’s App Registry (`app_registry` table in family-hub) is a launcher/nav catalog, not a security boundary. HATCH’s `handoff_consumers` is the actual allowlist airlock checks on every mint. Registering an app in both is not duplication; it is the correct division of concerns. The 30s cache TTL on the allowlist is appropriate: slow enough to protect the DB, fast enough for operator self-service.
- Asymmetric signatures. Ed25519 + JWKS is the right call. Consumers need only a URL (`AIRLOCK_ISSUER_URL`) to verify. No shared secret to rotate. No key material to leak from the consumer side. Confidence-inducing.
- 60-second TTL on the JWT. Tight enough that a leaked URL is meaningless; generous enough that no real flow hits expiration. In practice we never saw a legitimate expiration during the bring-up.
- Signing failure 302s back with `?error=signing_failed` instead of rendering a 500 on airlock. This turned out to be important: it meant stratt.dev could render a branded error on its own domain rather than dumping users on airlock’s. Worth keeping.
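The 30s allowlist cache can be pictured as a small TTL wrapper around the database read. This is a minimal sketch, not airlock's actual code: `TtlCache`, `isAllowedConsumer`, the injectable clock, and the assumption that the allowlist is a set of apex hostnames are all hypothetical. Only the 30s TTL and the `handoff_consumers` source come from the notes above.

```typescript
// Hypothetical TTL cache; the loader stands in for a query against handoff_consumers.
class TtlCache<T> {
  private value: T | undefined = undefined;
  private fetchedAt = Number.NEGATIVE_INFINITY;
  private readonly load: () => Promise<T>;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(load: () => Promise<T>, ttlMs: number, now: () => number = Date.now) {
    this.load = load;
    this.ttlMs = ttlMs;
    this.now = now; // injectable so expiry can be tested without sleeping
  }

  async get(): Promise<T> {
    const fresh = this.now() - this.fetchedAt < this.ttlMs;
    if (!fresh || this.value === undefined) {
      // Stale or never loaded: hit the backing store once and remember when.
      this.value = await this.load();
      this.fetchedAt = this.now();
    }
    return this.value;
  }
}

// Assumed shape: the allowlist is a set of lowercase apex hostnames.
async function isAllowedConsumer(cache: TtlCache<Set<string>>, apex: string): Promise<boolean> {
  return (await cache.get()).has(apex.toLowerCase());
}
```

With a 30,000 ms TTL, an operator who registers a new apex waits at most 30 seconds before mints succeed, while the database sees at most one allowlist read per TTL window per process.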
What we got wrong (and fixed)
- BetterAuth’s encrypted-at-rest JWK path. Produced `public_key`/`private_key` pair drift silently; every handoff verify failed with `JWSSignatureVerificationFailed` despite matching `kid`. Disabling encryption via `jwt({ jwks: { disablePrivateKeyEncryption: true } })` is now the default. The raw JWK at rest is defensible: the DB volume is the trust boundary, and re-encryption through a buggy upstream path is strictly worse than plaintext.
- Assumed `url.host` under Vercel equals the public apex. It doesn’t. Always source the canonical apex from env, never from the incoming request. The `buildHandoffRedirect` helper in the meridian SDK already had this defense; the callback handler didn’t. Pattern consistency inside a single codebase is what actually protects you.
- `import.meta.env` as the only env source. Build-time substitution fails silently for SSR-only vars on Vercel. `process.env` first, `import.meta.env` as fallback, explicit default third: this belt-and-suspenders pattern is doctrine now (`vercel-astro-runtime-env-sourcing.doctrine.md`).
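The last two fixes combine naturally: read the apex with the runtime-first precedence, then build URLs from that value instead of the request host. The sketch below is an illustration under assumptions, not the code in the meridian SDK. `readEnv`, `canonicalCallbackUrl`, and the `PUBLIC_APEX` variable name are hypothetical, and the two env bags are passed in explicitly so the function stays testable outside Astro.

```typescript
type EnvBag = Record<string, string | undefined>;

// Precedence from the notes: runtime env first, build-time env second, explicit default third.
// Empty or whitespace-only values are treated as unset.
function readEnv(key: string, runtimeEnv: EnvBag, buildEnv: EnvBag, fallback: string): string {
  for (const value of [runtimeEnv[key], buildEnv[key]]) {
    if (value !== undefined && value.trim() !== "") return value.trim();
  }
  return fallback;
}

// Rebuild the URL on the configured apex, keeping only path and query from the request.
// The incoming host is deliberately ignored.
function canonicalCallbackUrl(requestUrl: string, canonicalApex: string): string {
  const incoming = new URL(requestUrl);
  return new URL(incoming.pathname + incoming.search, `https://${canonicalApex}`).toString();
}
```

In an Astro handler the call site would look roughly like `readEnv("PUBLIC_APEX", process.env, import.meta.env, "stratt.dev")`, with the variable name and default chosen by the consumer app.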
What we’d do differently next time
- Ship better consumer error copy from day one. The generic “usually because it expired” wasted rounds. Even a hidden HTML comment with the jose error code would have cut the diagnostic loop by ~80%. Consumer apps should render friendly copy but emit the code + stack to server logs and expose a `?diag=1` mode operators can toggle.
- Reference-implementation the consumer. The `apps/meridian/src/lib/airlock-handoff.ts` module is a good small SDK; future external apexes should copy it verbatim rather than rewriting. Consider publishing it as `@devarno/airlock-consumer` once a second external apex onboards; the abstraction is more obvious after N=2.
- Dedup invariant on `jwks` table. `loadSigningKey` picks the newest row by `createdAt`, but `/api/auth/jwks` exposes `public_key` column content which may not be in 1:1 correspondence with the row the signer picks if multiple rows exist. A unique index on `kid` + a check-constraint that the `public_key` JWK’s public portion derives from the `private_key` would have caught Bug 1 at provisioning time instead of at verify time.
- Bake the CLI check into onboarding. `curl airlock/api/auth/jwks | jq '.keys[0].kid'` and compare to what `decode-jwt` shows in the token header. Five seconds of runbook for a class of bug that cost hours.
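The runbook check in the last bullet can also be scripted. The following is a minimal sketch with hypothetical helper names (`decodeJwtHeader`, `kidIsPublished`); it decodes the token header without verifying anything and checks whether its `kid` appears in a JWKS document that has already been fetched. A matching `kid` only rules out one class of failure: the encrypted-at-rest drift described earlier failed verification despite matching `kid`.

```typescript
type Jwk = { kid?: string; [key: string]: unknown };
type Jwks = { keys: Jwk[] };

// Decode the protected header of a compact JWT. No signature check happens here.
function decodeJwtHeader(token: string): Record<string, unknown> {
  const [headerSegment] = token.split(".");
  if (!headerSegment) throw new Error("not a compact JWT");
  const json = Buffer.from(headerSegment, "base64url").toString("utf8");
  return JSON.parse(json) as Record<string, unknown>;
}

// True when the token's kid is one of the kids the issuer currently publishes.
function kidIsPublished(token: string, jwks: Jwks): boolean {
  const kid = decodeJwtHeader(token).kid;
  return typeof kid === "string" && jwks.keys.some((key) => key.kid === kid);
}
```

Unlike the `jq '.keys[0].kid'` one-liner, this compares against every published key, which matters if the JWKS ever carries more than one entry.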
Non-obvious lessons
- BetterAuth’s own OIDC signing path can succeed while the plugin’s `jwks` row is corrupt. BetterAuth caches the keypair in memory on first boot; if provisioning produced a mismatched pair but the in-memory representation uses only the private scalar, OIDC tokens get signed and verified fine (if verification is done via the same in-memory public). Only `loadSigningKey`’s cold-path through the DB exposed the drift. This is why the `jwks` page looked “healthy” for days while handoff was broken: healthy for BetterAuth’s internal consumers, broken for anyone relying on `/api/auth/jwks` as ground truth.
- “Token expired” in error UIs is almost always lying. jose throws distinct error codes for every failure mode; any consumer error page that condenses them to a single string is actively harmful to diagnosis. Treat the error page itself as a debugging surface, not just a user-facing artifact.
- Airlock logs looked clean the entire time. 302s on every handoff. This is the right design (issuer never knows whether the consumer ultimately accepted the token) but it means cross-apex issues cannot be diagnosed from airlock logs alone. Vercel runtime logs on the consumer side are mandatory.
- The fastest debugging loop was `vercel logs https://<apex> | grep --line-buffered …` piped to Monitor. Streaming the consumer’s runtime errors into chat cut round-trip time by 5-10x versus “retry and paste me the screenshot”.
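One way to treat the error page as a debugging surface is to keep jose's `code` alongside the friendly copy and reveal it only in a diagnostic mode. This is a sketch under assumptions: `describeHandoffError`, `HandoffErrorView`, and the copy strings are hypothetical, and how `?diag=1` is gated to operators is left to the consumer app. The code strings are the `code` values jose attaches to its error classes.

```typescript
// Friendly copy keyed by jose error code. The wording is illustrative only.
const FRIENDLY_COPY: Record<string, string> = {
  ERR_JWT_EXPIRED: "This sign-in link expired. Please try again.",
  ERR_JWS_SIGNATURE_VERIFICATION_FAILED: "We could not verify this sign-in link.",
  ERR_JWKS_NO_MATCHING_KEY: "We could not verify this sign-in link.",
  ERR_JWT_CLAIM_VALIDATION_FAILED: "This sign-in link was not issued for this site.",
};

type HandoffErrorView = { message: string; diagCode: string | null };

// Map a thrown value to user-facing copy, exposing the raw code only when diag is on.
function describeHandoffError(err: unknown, diag: boolean): HandoffErrorView {
  const maybeCode = typeof err === "object" && err !== null ? (err as { code?: unknown }).code : undefined;
  const code = typeof maybeCode === "string" ? maybeCode : "UNKNOWN";
  const message = FRIENDLY_COPY[code] ?? "Sign-in failed. Please try again.";
  return { message, diagCode: diag ? code : null };
}
```

The caller would still log the full error and stack server-side on every failure; `diagCode` only controls what reaches the rendered page.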
Blast radius of the three commits
- `airlock@ff7f7bf`: only affects new JWK provisioning; existing OIDC consumers continue to verify against published JWKS.
- `stratt-run@c2a823f`, `@a21defc`: only affect meridian’s callback and middleware; no other stratt services touched.
- One `DELETE FROM jwks;` in prod: invalidated any long-lived airlock-signed tokens from before 14:18. No known holders.
Follow-ups
- File an upstream issue on `better-auth` about the `jwt` plugin’s encrypt-at-rest pair drift (see finding).
- Add a `/api/auth/jwks` health-check to airlock’s self-monitoring: on every boot, decode the stored `private_key`, derive the public key from the scalar, and assert it matches the `public_key` column. Fail fast at boot rather than at verify time.
- Consider extracting `apps/meridian/src/lib/airlock-handoff.ts` into `@devarno/airlock-consumer` once a second external apex adopts the flow.
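The boot-time health check in the second follow-up could look roughly like this, using Node's built-in `node:crypto` rather than any BetterAuth API. It is a sketch under assumptions: the function names and the row shape are hypothetical, and both columns are assumed to be already parsed into plaintext Ed25519 (OKP) JWK objects.

```typescript
import { createPrivateKey, createPublicKey } from "node:crypto";

type OkpJwk = { kty?: string; crv?: string; x?: string; d?: string };

// Re-derive the public JWK from the private key material.
function derivePublicJwk(privateJwk: OkpJwk): OkpJwk {
  const privateKey = createPrivateKey({ key: privateJwk, format: "jwk" });
  return createPublicKey(privateKey).export({ format: "jwk" });
}

// True when the stored public JWK matches what the private key actually yields.
function keypairIsConsistent(privateJwk: OkpJwk, publicJwk: OkpJwk): boolean {
  const derived = derivePublicJwk(privateJwk);
  return derived.kty === publicJwk.kty && derived.crv === publicJwk.crv && derived.x === publicJwk.x;
}

// Hypothetical row shape: throw at boot instead of failing later at verify time.
function assertKeypairAtBoot(row: { kid: string; privateJwk: OkpJwk; publicJwk: OkpJwk }): void {
  if (!keypairIsConsistent(row.privateJwk, row.publicJwk)) {
    throw new Error(`jwks row ${row.kid}: public_key does not derive from private_key`);
  }
}
```

Running `assertKeypairAtBoot` over every row in `jwks` before the server accepts traffic turns silent pair drift into a startup failure.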