Context

Airlock’s cross-apex handoff (/api/auth/handoff) had been designed and landed in code for weeks, but stratt.dev was the first external apex to actually exercise it end-to-end. Everything that was theoretical about the flow turned out to matter in practice — in three places that were not the places I expected.

What we got right

  • Separation of metadata from gating. CASA’s App Registry (app_registry table in family-hub) is a launcher/nav catalog, not a security boundary. HATCH’s handoff_consumers is the actual allowlist that airlock checks on every mint. Registering an app in both is not duplication; it is the correct division of concerns. The 30s cache TTL on the allowlist is appropriate: long enough to protect the DB, short enough for operator self-service.
  • Asymmetric signatures. Ed25519 + JWKS is the right call. Consumers need only a URL (AIRLOCK_ISSUER_URL) to verify. No shared secret to rotate. No key material to leak from the consumer side. Confidence-inducing.
  • 60-second TTL on the JWT. Tight enough that a leaked URL is meaningless; generous enough that no real flow hits expiration. In practice we never saw a legitimate expiration during the bring-up.
  • Signing failure 302s back with ?error=signing_failed instead of rendering a 500 on airlock. This turned out to be important — it meant stratt.dev could render a branded error on its own domain rather than dumping users on airlock’s. Worth keeping.
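The 60-second TTL and the redirect-on-signing-failure behaviour above can be sketched together. This is a minimal illustration, not airlock's actual handler: the names HANDOFF_TTL_SECONDS, buildClaims, handoffLocation, the Signer type, and the token query parameter are all assumptions made for the example.

```typescript
// Hypothetical sketch; none of these names come from airlock's codebase.
const HANDOFF_TTL_SECONDS = 60;

interface HandoffClaims {
  sub: string;
  aud: string; // consumer origin the token is minted for
  iat: number; // issued-at, seconds since epoch
  exp: number; // expiry, seconds since epoch
}

// Stand-in for the real signer (for example an Ed25519 JWT signer).
type Signer = (claims: HandoffClaims) => Promise<string>;

// Build claims with a 60-second lifetime relative to `nowMs`.
function buildClaims(sub: string, aud: string, nowMs: number): HandoffClaims {
  const iat = Math.floor(nowMs / 1000);
  return { sub, aud, iat, exp: iat + HANDOFF_TTL_SECONDS };
}

// Always answer with a redirect target on the consumer's domain:
// ?token=... on success, ?error=signing_failed if the signer throws.
// The issuer never renders its own 500 page for a signing failure.
async function handoffLocation(
  callbackUrl: string,
  sub: string,
  sign: Signer,
  nowMs: number = Date.now(),
): Promise<string> {
  const target = new URL(callbackUrl);
  try {
    const token = await sign(buildClaims(sub, target.origin, nowMs));
    target.searchParams.set("token", token);
  } catch {
    target.searchParams.set("error", "signing_failed");
  }
  return target.toString();
}
```

The caller would return the resulting string as the Location header of a 302, so the consumer decides how to present the failure.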

What we got wrong (and fixed)

  • BetterAuth’s encrypted-at-rest JWK path. It silently produced a mismatched public_key / private_key pair, so every handoff verify failed with JWSSignatureVerificationFailed despite a matching kid. Disabling encryption via jwt({ jwks: { disablePrivateKeyEncryption: true } }) is now the default. The raw JWK at rest is defensible: the DB volume is the trust boundary, and re-encryption through a buggy upstream path is strictly worse than plaintext.
  • Assumed url.host under Vercel equals the public apex. It doesn’t. Always source the canonical apex from env, never from the incoming request. The buildHandoffRedirect helper in the meridian SDK already had this defense; the callback handler didn’t. Pattern consistency inside a single codebase is what actually protects you.
  • import.meta.env as the only env source. Build-time substitution fails silently for SSR-only vars on Vercel. process.env first, import.meta.env as fallback, explicit default third — this belt-and-suspenders pattern is doctrine now (vercel-astro-runtime-env-sourcing.doctrine.md).
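The env-sourcing order and the "never trust the request host" rule from the last two items might look roughly like this. It is a sketch under assumptions: readEnv, canonicalRedirect, and the PUBLIC_ORIGIN variable name are hypothetical, and the two env sources are passed in as plain records so the helper stays testable outside a Vercel/Astro build.

```typescript
type EnvRecord = Record<string, string | undefined>;

// Resolve a variable: runtime env first, build-time env second, explicit
// default third. Empty strings are treated as unset.
function readEnv(
  name: string,
  runtimeEnv: EnvRecord,
  buildEnv: EnvRecord,
  fallback: string,
): string {
  const pick = (v: string | undefined) => (v && v.length > 0 ? v : undefined);
  return pick(runtimeEnv[name]) ?? pick(buildEnv[name]) ?? fallback;
}

// Rebuild a redirect target on the configured apex, keeping only the path
// and query of the incoming request. The request's own host is ignored
// because it may be an internal deployment hostname.
function canonicalRedirect(requestUrl: string, canonicalOrigin: string): string {
  const incoming = new URL(requestUrl);
  return new URL(incoming.pathname + incoming.search, canonicalOrigin).toString();
}

// In an Astro app the call site would plausibly be (PUBLIC_ORIGIN is a
// hypothetical variable name):
//   readEnv("PUBLIC_ORIGIN", process.env, import.meta.env, "https://stratt.dev")
```

Passing the sources as arguments also makes the precedence explicit at the call site, which is the part that is easy to get wrong when one handler is written months after another.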

What we’d do differently next time

  • Ship better consumer error copy from day one. The generic “usually because it expired” message wasted debugging rounds. Even a hidden HTML comment with the jose error code would have cut the diagnostic loop by ~80%. Consumer apps should render friendly copy, emit the code and stack to server logs, and expose a ?diag=1 mode operators can toggle.
  • Reference-implementation the consumer. The apps/meridian/src/lib/airlock-handoff.ts module is a good small SDK — future external apexes should copy it verbatim rather than rewriting. Consider publishing it as @devarno/airlock-consumer once a second external apex onboards; the abstraction is more obvious after N=2.
  • Dedup invariant on jwks table. loadSigningKey picks the newest row by createdAt, but /api/auth/jwks exposes the public_key column content, which may not correspond 1:1 to the row the signer picks when multiple rows exist. A unique index on kid, plus a check constraint that the public_key JWK matches the public key derived from private_key, would have caught Bug 1 at provisioning time instead of at verify time.
  • Bake the CLI check into onboarding. Run curl airlock/api/auth/jwks | jq '.keys[0].kid' and compare the result to the kid that decode-jwt shows in the token header. Five seconds of runbook for a class of bug that cost hours.
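The kid comparison from the CLI check can also be expressed in code, for example as an onboarding smoke test. This is an illustrative sketch: decodeJwtHeader and kidIsPublished are hypothetical helpers, and the header is decoded without any signature verification.

```typescript
interface Jwks {
  keys: Array<{ kid?: string }>;
}

// Decode the protected header (first segment) of a compact JWT.
// No signature verification happens here; this is inspection only.
function decodeJwtHeader(token: string): { kid?: string; alg?: string } {
  const segment = token.split(".")[0] ?? "";
  // base64url -> base64, then restore the stripped padding.
  const b64 = segment.replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  return JSON.parse(atob(padded));
}

// True when the token's kid appears in the published JWKS document.
function kidIsPublished(token: string, jwks: Jwks): boolean {
  const { kid } = decodeJwtHeader(token);
  return kid !== undefined && jwks.keys.some((k) => k.kid === kid);
}
```

A matching kid is necessary but not sufficient: the failure described earlier occurred despite a matching kid, so this check complements a keypair consistency check rather than replacing it.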

Non-obvious lessons

  • BetterAuth’s own OIDC signing path can succeed while the plugin’s jwks row is corrupt. BetterAuth caches the keypair in memory on first boot; if provisioning produced a mismatched pair but the in-memory representation uses only the private scalar, OIDC tokens get signed and verified fine (if verification is done via the same in-memory public key). Only loadSigningKey’s cold path through the DB exposed the drift. This is why the jwks page looked “healthy” for days while handoff was broken: healthy for BetterAuth’s internal consumers, broken for anyone relying on /api/auth/jwks as ground truth.
  • “Token expired” in error UIs is almost always lying. jose throws distinct error codes for every failure mode; any consumer error page that condenses them to a single string is actively harmful to diagnosis. Treat the error page itself as a debugging surface, not just a user-facing artifact.
  • Airlock logs looked clean the entire time. 302s on every handoff. This is the right design (issuer never knows whether the consumer ultimately accepted the token) but it means cross-apex issues cannot be diagnosed from airlock logs alone. Vercel runtime logs on the consumer side are mandatory.
  • The fastest debugging loop was vercel logs https://<apex> | grep --line-buffered … piped to Monitor. Streaming the consumer’s runtime errors into chat cut round-trip time by 5-10x versus “retry and paste me the screenshot”.
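One way to treat the error page as a debugging surface is to keep the jose error code attached to the failure all the way to the view. The sketch below is an assumption-laden illustration: describeHandoffError, HandoffErrorView, and the friendly strings are hypothetical, while the ERR_* values are the codes jose attaches to its error classes.

```typescript
interface HandoffErrorView {
  userMessage: string;
  diagnostic?: string; // only populated in operator diagnostic mode
}

// Friendly copy keyed by jose error code. The wording is illustrative.
const FRIENDLY: Record<string, string> = {
  ERR_JWT_EXPIRED: "This sign-in link has expired. Please try again.",
  ERR_JWS_SIGNATURE_VERIFICATION_FAILED: "We could not verify this sign-in link.",
  ERR_JWKS_NO_MATCHING_KEY: "We could not verify this sign-in link.",
  ERR_JWT_CLAIM_VALIDATION_FAILED: "This sign-in link was not issued for this site.",
};

// Always log the code and stack server-side; expose the code to the page
// only when diagnostic mode (for example ?diag=1) is on.
function describeHandoffError(
  err: unknown,
  diag: boolean,
  log: (line: string) => void = console.error,
): HandoffErrorView {
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? String((err as { code: unknown }).code)
      : "UNKNOWN";
  const stack = err instanceof Error ? err.stack ?? err.message : String(err);
  log(`handoff verify failed code=${code}\n${stack}`);
  const userMessage = FRIENDLY[code] ?? "Sign-in failed. Please try again.";
  return diag ? { userMessage, diagnostic: code } : { userMessage };
}
```

Because the log line always carries the code, a streamed consumer log is enough to distinguish an expired token from a signature failure without asking the user to retry.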

Blast radius of the three commits

  • airlock@ff7f7bf — only affects new JWK provisioning; existing OIDC consumers continue to verify against published JWKS.
  • stratt-run@c2a823f, @a21defc — only affect meridian’s callback and middleware; no other stratt services touched.
  • One DELETE FROM jwks; in prod — invalidated any long-lived airlock-signed tokens from before 14:18. No known holders.

Follow-ups

  • File an upstream issue on better-auth about the jwt plugin’s encrypt-at-rest pair drift (see finding).
  • Add a /api/auth/jwks health-check to airlock’s self-monitoring: on every boot, decode the stored private_key, derive the public key from the scalar, and assert it matches the public_key column. Fail fast at boot rather than at verify time.
  • Consider extracting apps/meridian/src/lib/airlock-handoff.ts into @devarno/airlock-consumer once a second external apex adopts the flow.
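The boot-time health check proposed above could be sketched with Node's built-in crypto module. This assumes the private_key and public_key columns hold JSON-serialized Ed25519 JWKs, which this document does not state; keyPairMatches, assertJwksRowHealthy, and the row shape are hypothetical names for illustration.

```typescript
import { createPrivateKey, createPublicKey, generateKeyPairSync } from "node:crypto";

// Loose JWK shape; only the fields this check reads.
type StoredJwk = { kty?: string; crv?: string; x?: string; d?: string };

// Derive the public key from the private JWK and compare it with the
// stored public JWK. For OKP keys, `x` is the public key value.
function keyPairMatches(privateJwk: StoredJwk, publicJwk: StoredJwk): boolean {
  const privateKey = createPrivateKey({ key: privateJwk, format: "jwk" });
  const derived = createPublicKey(privateKey).export({ format: "jwk" });
  return derived.crv === publicJwk.crv && derived.x === publicJwk.x;
}

// Hypothetical row shape: both columns assumed to be JSON-serialized JWKs.
function assertJwksRowHealthy(row: { kid: string; privateKey: string; publicKey: string }): void {
  const ok = keyPairMatches(JSON.parse(row.privateKey), JSON.parse(row.publicKey));
  if (!ok) {
    throw new Error(`jwks row ${row.kid}: public_key does not match private_key`);
  }
}

// Demo with a freshly generated pair (no database involved).
const demo = generateKeyPairSync("ed25519");
const demoPrivate = demo.privateKey.export({ format: "jwk" });
const demoPublic = demo.publicKey.export({ format: "jwk" });
console.log(keyPairMatches(demoPrivate, demoPublic)); // true
```

Running assertJwksRowHealthy over every row at boot would turn a mismatched pair into a startup failure instead of a verify-time failure on the consumer.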