Multi-tenant RLS: the illusion vs the reality

What happened

KAHN Cloud’s schema had Row-Level Security policies on every multi-tenant table. Policies referenced a per-request session variable. Handlers set the variable before every query. Tests passed. Schema review passed. On the day of rollout, a cross-tenant probe returned every tenant’s rows.

Why

Railway’s default Postgres role has both rolsuper=t and rolbypassrls=t. Superusers and BYPASSRLS roles skip every RLS policy, unconditionally. The policies we’d written were inert for the entire life of the service: defense in depth on paper, but the “app” was also the “superuser”. Even FORCE ROW LEVEL SECURITY doesn’t help, because that clause binds table owners, not superusers. The only fix is to connect as a role that is neither a superuser nor BYPASSRLS.
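The role attributes can be read straight from the pg_roles catalog. A minimal sketch, assuming a DB-API style cursor (for example one from psycopg); the function name role_obeys_rls is hypothetical, while pg_roles, rolsuper, rolbypassrls, and current_user are standard Postgres names.

```python
from typing import Any

# Standard Postgres catalog query: attributes of the role this session runs as.
ROLE_QUERY = (
    "SELECT rolname, rolsuper, rolbypassrls "
    "FROM pg_roles WHERE rolname = current_user"
)


def role_obeys_rls(cursor: Any) -> bool:
    """Return True if the connected role is neither superuser nor BYPASSRLS."""
    cursor.execute(ROLE_QUERY)
    row = cursor.fetchone()
    if row is None:
        raise RuntimeError("current_user not found in pg_roles")
    _rolname, rolsuper, rolbypassrls = row
    # Either attribute alone is enough to skip every policy.
    return not (rolsuper or rolbypassrls)
```

This only covers the two role attributes. Table ownership is a separate gap, which FORCE ROW LEVEL SECURITY closes.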

The fix (applied same-session)

Three migrations plus an env rotation:

FORCE ROW LEVEL SECURITY on all multi-tenant tables (closes the owner-gap for future role changes).
Create a dedicated app role with NOSUPERUSER NOBYPASSRLS, grant minimum needed privileges, extend every policy with WITH CHECK (required for non-superuser INSERT).
Expose the pre-tenant-context api_keys and tenants lookups as narrow SECURITY DEFINER functions — the one intentional cross-tenant hole.
Rotate DATABASE_URL on the app service to the new credentials.
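The first two migrations could look roughly like the statements generated below. This is a sketch under stated assumptions, not the migrations that were actually applied: the role name app_user, the policy name tenant_isolation, the session variable app.tenant_id, the uuid tenant_id column, and the exact privilege list are all hypothetical. The SECURITY DEFINER functions and the role's password are left out.

```python
import re
from typing import Iterable, List

_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")
_SETTING = re.compile(r"^[a-z_][a-z0-9_]*\.[a-z_][a-z0-9_]*$")


def _ident(name: str) -> str:
    """Allow only plain lowercase identifiers so they can be inlined safely."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def rls_hardening_sql(
    tables: Iterable[str],
    app_role: str = "app_user",          # hypothetical role name
    policy: str = "tenant_isolation",    # hypothetical policy name
    setting: str = "app.tenant_id",      # hypothetical session variable
) -> List[str]:
    """Build SQL for the FORCE step and the app-role step, in that order."""
    if not _SETTING.match(setting):
        raise ValueError(f"unsafe setting name: {setting!r}")
    names = [_ident(t) for t in tables]
    role, pol = _ident(app_role), _ident(policy)
    # One-argument current_setting() raises when the variable is unset,
    # so a request without tenant context fails instead of returning rows.
    check = f"tenant_id = current_setting('{setting}')::uuid"

    # Migration 1: make table owners subject to the policies too.
    stmts = [f"ALTER TABLE {t} FORCE ROW LEVEL SECURITY" for t in names]
    # Migration 2: a login role that cannot skip RLS.
    stmts.append(f"CREATE ROLE {role} LOGIN NOSUPERUSER NOBYPASSRLS")
    for t in names:
        stmts.append(f"GRANT SELECT, INSERT, UPDATE, DELETE ON {t} TO {role}")
        stmts.append(
            f"ALTER POLICY {pol} ON {t} USING ({check}) WITH CHECK ({check})"
        )
    return stmts
```

Calling rls_hardening_sql(["invoices"]) (a hypothetical table name) returns the FORCE statement first, then the role creation, then the grant and policy change for that table.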

Verification probe, run as the new app role: unset context → hard error; tenant A sees only A’s rows; cross-tenant INSERT rejected.
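The three checks can be scripted so they run the same way on every release. A minimal sketch, assuming a fresh, non-autocommit DB-API connection opened with the app role's credentials, a driver that uses %s placeholders (as psycopg does), seeded rows for both tenants, and a table whose only required column is tenant_id. The function name, the app.tenant_id variable, and the table layout are hypothetical.

```python
from typing import Any, List, Type


def cross_tenant_probe(
    conn: Any,
    db_error: Type[Exception],   # e.g. the driver's base Error class
    table: str,                  # trusted identifier, not user input
    tenant_a: str,
    tenant_b: str,
    setting: str = "app.tenant_id",  # hypothetical session variable
) -> List[str]:
    """Run the three checks and return a list of failures (empty = pass)."""
    failures: List[str] = []
    cur = conn.cursor()

    # 1. No tenant context: the policy expression itself should raise.
    #    Any db_error counts as a pass here, so a typo in the table name
    #    would also look like success; check the table exists beforehand.
    try:
        cur.execute(f"SELECT tenant_id FROM {table} LIMIT 1")
        failures.append("query without tenant context did not error")
    except db_error:
        pass
    conn.rollback()  # clear the aborted transaction

    # 2. As tenant A, every visible row must belong to A.
    cur.execute("SELECT set_config(%s, %s, false)", (setting, tenant_a))
    cur.execute(f"SELECT tenant_id FROM {table}")
    foreign = [r[0] for r in cur.fetchall() if str(r[0]) != tenant_a]
    if foreign:
        failures.append(f"tenant A sees {len(foreign)} foreign rows")

    # 3. Still as tenant A, inserting a row for tenant B must be rejected.
    try:
        cur.execute(f"INSERT INTO {table} (tenant_id) VALUES (%s)", (tenant_b,))
        failures.append("cross-tenant INSERT was accepted")
    except db_error:
        pass
    conn.rollback()  # never keep probe rows
    return failures
```

Run against a superuser or BYPASSRLS connection, all three checks would be expected to fail, which is the signal that was missing before rollout.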

The systemic lesson

“Does RLS work?” is two questions, not one:

Are the policies correct? — what schema review catches.
Is the app’s DB role one that obeys policies? — not caught by schema review, tests, or code review. Only a cross-tenant probe executed as the app role catches it.

A cloud-mode release smoke test must include that probe. It takes 60 seconds, it prevents a class of silent data-leak bugs, and it is the only signal you’ll get that the isolation layer is alive rather than just shaped correctly.

What we institutionalised

A doctrine that names the trap and the fix verbatim.
A SECURITY DEFINER doctrine with the four-point audit checklist for every narrow cross-tenant hole.
A production-rollout prompt-op whose Taskset 19 exit criteria embed the cross-tenant probe as non-skippable.

Sister projects (choco, stratt, traceo) that go cloud inherit the pattern. Every future Railway-backed Postgres deployment starts with a non-superuser app role, full stop.

Business impact

A multi-tenant SaaS shipping with inert RLS is a data breach waiting for a handler bug. This session caught it before any external customer existed. Pre-customer discovery cost: ~90 engineer-minutes, three migrations, one env rotation. Post-customer discovery cost: an incident, a public disclosure, and an unknown quantity of trust.
