Skip to main content

KAHN Scope: tool-call drilldown ships

Headline

/#/agents/<run_id> no longer renders a flat table of tool calls. It groups them per-tool (busiest first by total wall time), and clicking any invocation reveals its input, output, and error inline — no JSON-tab-switching to triage a failure.

What changed

Three behaviours the Phase A flat table couldn’t express:
  • Per-tool grouping. A failing agent typically retries the same tool. The dashboard collapses N invocations of Bash (or Read, or Edit, or any custom tool) under one header showing count, total_duration, errors, and the distinct agents that called it.
  • Click-to-expand. Click any row → inline <pre> blocks reveal input_summary (what the agent sent), output_summary (what came back), and error (when the call failed). 4KB visible cap with explicit (truncated) suffix; underlying event data unchanged.
  • Filter pills. By agent_id (multi-agent runs only) and by ok | error. The most common triage cut is “show me only the errors” — one click.
State persists across the 10s refresh. Expanding a row, then waiting for a refresh, doesn’t reset the operator’s selection.

What this answers

“Why did this agent fail?” used to mean: open the run, find the failing tool call, copy the run_id, search the logs, find the matching event, parse the JSON. Now it means: open the run, click the row. The convergence badge in the run header complements the drilldown: it’s the run’s score with a delta-vs-prior arrow (▲ green when improving, ▼ red when regressing). The triage path is now: badge says “down 0.4 from prior” → drilldown shows which tool errored → fix it.

Why this matters for pilots

External-pilot demos through Phase A had to context-switch into raw event JSON to show “what the agent actually did.” That’s the moment the demo loses authority. The drilldown closes that gap — the operator’s mental model of “I can see what’s happening in my fleet” stays intact through to the failure cause.

What’s next

The drilldown is the substrate for replay (north-star Phase E E2 — scrub through agent transitions to a specific tool_invocation) and for cost attribution (Phase E3 — evidence field can carry tokens_in/tokens_out/cost_usd; the drilldown surfaces them per-invocation). The wire shape that makes both of those possible is already there, tested against checked-in fixtures. The doctrine that gated this taskset’s scope decision (d-probe-before-contract-extension) saved one full schema-extension PR — H1 closed by inspection rather than empirical run.