KAHN Scope: audit flakiness ships
Headline
/#/audits rolls up audit_checkpoint events across every agent run, sorts checkpoints by flakiness (most degrading first), and deep-links each row’s last fail to the failing run. Three clicks from “is anything currently flaky?” to “which agent ran when this failed?”
What changed
A new SPA surface:/#/audits. Backed by /api/audit-checkpoints/flakiness?window=24h|7d|all (sister endpoint to /api/agent-runs/aggregate).
Each row reports:
checkpoint_id— e.g.audit:rls-enforcement,audit:test-coverage,audit:dependency-drift.pass / fail / warncounts per checkpoint across the window.flakiness—1 − |P − 0.5| × 2whereP = pass / (pass + fail). Peaks at 1.0 when pass equals fail (50/50 max flake); is 0.0 for always-pass, always-fail, or no decisive data.warnresults count towardtotalbut never enter the formula (advisory).last_fail_run_id— deep-links to/#/agents/<run_id>so a flaky check is one click from the failing agent run.
What this answers
“Which audit is currently degrading?” — the surface KAHN’s strategy doc named as a gap. Phase A had the events on the wire. Phase B has the rollup that makes the events legible. The endpoint is RLS-scoped on Postgres; FS backend goes through the sameaggregate_flakiness pure helper. Wire-byte parity across both backends — drift is enforced by code, not convention.
What this means for the audit/ pattern
KAHN’s Phase E roadmap names/audit/ as the verb — drop an audit/ folder in your repo, see your agents in Scope. The flakiness surface is what makes that pitch concrete: every checkpoint a producer emits gets a stability score, automatically, without any per-checkpoint configuration.
The 6 TASKSET-5 fixtures (3 audits × pass+fail) that have been sitting in tree since Phase A now light up the dashboard as 3 maximally-flaky checkpoints (each pass+fail = flakiness = 1.0). The fixtures were authored specifically so this surface had data to design against; they have now earned their place.
What’s next
Per-tenant audit-policy webhooks (north-star Phase E E1 — let a tenant define which checkpoints are required vs. advisory, get notified whenresult == fail lands) consume this same wire shape. The substrate is ready; the policy layer is the next deliverable.