KAHN Scope: audit flakiness ships

Headline

/#/audits rolls up audit_checkpoint events across every agent run, sorts checkpoints by flakiness (most degrading first), and deep-links each row’s last fail to the failing run. Three clicks from “is anything currently flaky?” to “which agent ran when this failed?”

What changed

A new SPA surface: /#/audits. Backed by /api/audit-checkpoints/flakiness?window=24h|7d|all (sister endpoint to /api/agent-runs/aggregate). Each row reports:
  • checkpoint_id — e.g. audit:rls-enforcement, audit:test-coverage, audit:dependency-drift.
  • pass / fail / warn counts per checkpoint across the window.
  • flakiness = 1 − |P − 0.5| × 2, where P = pass / (pass + fail). Peaks at 1.0 when pass equals fail (50/50 max flake); is 0.0 for always-pass, always-fail, or no decisive data. warn results count toward total but never enter the formula (advisory).
  • last_fail_run_id — deep-links to /#/agents/<run_id> so a flaky check is one click from the failing agent run.
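The score described in the list above can be sketched in a few lines. This is a minimal illustration of the stated formula, not the shipped helper; the function name and signature are assumptions made for the example.

```python
def flakiness(passes: int, fails: int, warns: int = 0) -> float:
    """Return a score in [0.0, 1.0]; 1.0 means an even pass/fail split.

    Hypothetical signature: `warns` is accepted only to show that advisory
    results are counted elsewhere and never enter the formula.
    """
    decisive = passes + fails
    if decisive == 0:
        return 0.0  # no decisive data (nothing recorded, or warn-only)
    p = passes / decisive
    return 1.0 - abs(p - 0.5) * 2.0


print(flakiness(1, 1))     # even split: maximum flake
print(flakiness(10, 0))    # always-pass
print(flakiness(0, 0, 5))  # warn-only, so no decisive data
```

Under this formula a checkpoint that passes three times out of four scores 0.5, which is the threshold where a row would first be coloured red.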
Visual encoding: 100px inline bar, bucket-coloured (red ≥ 0.5, amber ≥ 0.2, green < 0.2). Numeric flakiness rendered alongside.
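The bucket thresholds can be expressed as a small mapping. The sketch below assumes the bar's filled width is proportional to the score, which the text does not state; only the 100px width and the colour thresholds come from the description above. Both function names are hypothetical.

```python
BAR_WIDTH_PX = 100  # inline bar width from the description above


def bucket_colour(score: float) -> str:
    """Map a flakiness score to its bucket colour."""
    if score >= 0.5:
        return "red"
    if score >= 0.2:
        return "amber"
    return "green"


def bar_fill_px(score: float) -> int:
    """Filled width in pixels. Assumption: fill scales linearly with score."""
    clamped = min(max(score, 0.0), 1.0)
    return round(clamped * BAR_WIDTH_PX)


for s in (0.0, 0.2, 0.5, 1.0):
    print(s, bucket_colour(s), bar_fill_px(s))
```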

What this answers

“Which audit is currently degrading?” That is the question KAHN’s strategy doc named as a gap. Phase A put the events on the wire; Phase B adds the rollup that makes them legible. The endpoint is RLS-scoped on Postgres; the FS backend goes through the same aggregate_flakiness pure helper. Both backends therefore return byte-identical wire output: parity is enforced by code, not convention.
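To illustrate why a shared pure helper keeps the two backends in step, here is a rough sketch of what a function like aggregate_flakiness might look like. It is not the real implementation. The event field names run_id and ts, the Row type, and the tie-break on checkpoint_id are all assumptions; only checkpoint_id, result, and last_fail_run_id appear in the text. Timestamps are assumed to be ISO-8601 strings in one timezone, so string comparison orders them.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Row:
    checkpoint_id: str
    passes: int = 0
    fails: int = 0
    warns: int = 0
    flakiness: float = 0.0
    last_fail_run_id: Optional[str] = None


def aggregate_flakiness(events: Iterable[dict]) -> List[Row]:
    """Pure rollup: no I/O, so any backend can feed it the same events."""
    rows: Dict[str, Row] = {}
    last_fail_ts: Dict[str, str] = {}
    for ev in events:
        cid = ev["checkpoint_id"]
        row = rows.setdefault(cid, Row(cid))
        if ev["result"] == "pass":
            row.passes += 1
        elif ev["result"] == "fail":
            row.fails += 1
            # Keep the run id of the most recent failure for the deep link.
            if ev["ts"] >= last_fail_ts.get(cid, ""):
                last_fail_ts[cid] = ev["ts"]
                row.last_fail_run_id = ev["run_id"]
        elif ev["result"] == "warn":
            row.warns += 1  # advisory: counted, never scored
    for row in rows.values():
        decisive = row.passes + row.fails
        if decisive:
            p = row.passes / decisive
            row.flakiness = 1.0 - abs(p - 0.5) * 2.0
    # Most flaky first; ties broken by id so the output order is deterministic.
    return sorted(rows.values(), key=lambda r: (-r.flakiness, r.checkpoint_id))
```

Because the function takes plain event dicts and returns plain rows, a Postgres-backed handler and an FS-backed handler can each load events their own way and serialise the same result, which is the property the parity claim relies on.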

What this means for the audit/ pattern

KAHN’s Phase E roadmap names /audit/ as the verb: drop an audit/ folder in your repo, see your agents in Scope. The flakiness surface is what makes that pitch concrete: every checkpoint a producer emits gets a stability score automatically, with no per-checkpoint configuration. The 6 TASKSET-5 fixtures (3 audits × pass+fail) that have been sitting in tree since Phase A now light up the dashboard as 3 maximally flaky checkpoints (each has one pass and one fail, so P = 0.5 and flakiness = 1.0). The fixtures were authored specifically so this surface had data to design against; they have now earned their place.

What’s next

Per-tenant audit-policy webhooks (north-star Phase E E1 — let a tenant define which checkpoints are required vs. advisory, get notified when result == fail lands) consume this same wire shape. The substrate is ready; the policy layer is the next deliverable.