KAHN Scope: tool-call drilldown ships
Headline
/#/agents/<run_id> no longer renders a flat table of tool calls. It groups them per-tool (busiest first by total wall time), and clicking any invocation reveals its input, output, and error inline — no JSON-tab-switching to triage a failure.
What changed
Three behaviours the Phase A flat table couldn’t express:
- Per-tool grouping. A failing agent typically retries the same tool. The dashboard collapses N invocations of Bash (or Read, or Edit, or any custom tool) under one header showing count, total_duration, errors, and the distinct agents that called it.
- Click-to-expand. Click any row → inline <pre> blocks reveal input_summary (what the agent sent), output_summary (what came back), and error (when the call failed). 4KB visible cap with explicit (truncated) suffix; underlying event data unchanged.
- Filter pills. By agent_id (multi-agent runs only) and by ok | error. The most common triage cut is “show me only the errors”, now one click.
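The three behaviours above can be sketched in a few lines of Python. This is an illustration only, not the dashboard's actual code: the ToolCall shape, the duration_ms field, and the function names are hypothetical, and it assumes the 4KB cap is measured in UTF-8 bytes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

VISIBLE_CAP_BYTES = 4 * 1024  # assumption: the 4KB visible cap is counted in bytes


@dataclass
class ToolCall:
    # Hypothetical event shape. Only agent_id, input_summary, output_summary,
    # and error are named in these notes; the rest is illustrative.
    tool: str
    agent_id: str
    duration_ms: float
    input_summary: str = ""
    output_summary: str = ""
    error: Optional[str] = None


def filter_calls(calls, agent_id=None, status=None):
    """Apply the filter pills: by agent_id and by 'ok' | 'error'."""
    kept = []
    for call in calls:
        if agent_id is not None and call.agent_id != agent_id:
            continue
        if status == "error" and call.error is None:
            continue
        if status == "ok" and call.error is not None:
            continue
        kept.append(call)
    return kept


def group_by_tool(calls):
    """Collapse invocations per tool, busiest first by total wall time."""
    groups = defaultdict(list)
    for call in calls:
        groups[call.tool].append(call)
    headers = [
        {
            "tool": tool,
            "count": len(items),
            "total_duration": sum(c.duration_ms for c in items),
            "errors": sum(1 for c in items if c.error is not None),
            "agents": sorted({c.agent_id for c in items}),
        }
        for tool, items in groups.items()
    ]
    headers.sort(key=lambda h: h["total_duration"], reverse=True)
    return headers


def truncate_for_display(text, cap=VISIBLE_CAP_BYTES):
    """Return at most `cap` bytes of text for display; the input is not modified."""
    raw = text.encode("utf-8")
    if len(raw) <= cap:
        return text
    # errors="ignore" drops a multi-byte character that the cut would split.
    return raw[:cap].decode("utf-8", errors="ignore") + " (truncated)"
```

With two Bash calls (one failed) and one Read call, group_by_tool puts Bash first with count 2 and errors 1, and filter_calls(calls, status="error") returns only the failed Bash call. Truncation happens at display time only, which matches the note that the underlying event data is unchanged.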
What this answers
“Why did this agent fail?” used to mean: open the run, find the failing tool call, copy the run_id, search the logs, find the matching event, parse the JSON. Now it means: open the run, click the row. The convergence badge in the run header complements the drilldown: it’s the run’s score with a delta-vs-prior arrow (▲ green when improving, ▼ red when regressing). The triage path is now: badge says “down 0.4 from prior” → drilldown shows which tool errored → fix it.
Why this matters for pilots
External-pilot demos through Phase A had to context-switch into raw event JSON to show “what the agent actually did.” That’s the moment the demo loses authority. The drilldown closes that gap: the operator’s mental model of “I can see what’s happening in my fleet” stays intact through to the failure cause.
What’s next
The drilldown is the substrate for replay (north-star Phase E E2: scrub through agent transitions to a specific tool_invocation) and for cost attribution (Phase E3: the evidence field can carry tokens_in/tokens_out/cost_usd, and the drilldown surfaces them per-invocation). The wire shape that makes both of those possible is already there, tested against checked-in fixtures.
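As a rough sketch of what per-invocation cost attribution could roll up to, assuming each invocation is a dict with a tool name and an optional evidence dict (the dict layout and the cost_by_tool helper are hypothetical; only the tokens_in, tokens_out, and cost_usd field names come from these notes):

```python
from collections import defaultdict


def cost_by_tool(invocations):
    """Sum tokens and cost per tool; a missing evidence field counts as zero."""
    totals = defaultdict(
        lambda: {"tokens_in": 0, "tokens_out": 0, "cost_usd": 0.0}
    )
    for inv in invocations:
        evidence = inv.get("evidence") or {}  # evidence is optional
        row = totals[inv["tool"]]
        row["tokens_in"] += evidence.get("tokens_in", 0)
        row["tokens_out"] += evidence.get("tokens_out", 0)
        row["cost_usd"] += evidence.get("cost_usd", 0.0)
    return dict(totals)
```

Treating absent evidence as zero keeps the rollup usable on runs recorded before any cost fields are populated.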
The doctrine that gated this taskset’s scope decision (d-probe-before-contract-extension) saved one full schema-extension PR — H1 closed by inspection rather than empirical run.