A2A Prompt Evolution: External Gist as Spec Source + Failure Modes as Acceptance Criteria

What We Learned

The spec-to-package prompt pattern from the fingerprint session (read spec, define API surface, enumerate test cases) scaled to a larger package with two new refinements.

1. External Gist as Spec Reference

The prompt referenced a GitHub gist by ID for the TAD. The agent’s WebFetch failed (404), but it fell back to gh gist view via CLI — and retrieved all 1,670 lines. Then it grep’d for specific sections (Layer 3, failure_modes) and read only the relevant ranges. Takeaway: External spec references are more resilient than pasting content into prompts. The agent can navigate to the right section. If one access method fails, it finds another.

2. Failure Modes as Acceptance Criteria

The prompt specified: “Write tests for each failure mode (8 tests minimum, one per FM) plus a happy-path test.” Each FM-XX maps to a named trigger, a detection function, and a test case. This eliminated the usual negotiation about test coverage — the TAD’s failure mode table IS the test plan. The first test run hit 34/35 (FM-08 needed a try/catch wrapper in the CI orchestrator). One fix, 35/35.

3. Cross-Package Type Reuse via Exploration

The prompt said “parse their imports blocks” and “verify protected agent presence” — it didn’t specify the Unit type shape or CompositionStep fields. The agent explored @stratt/schema to discover these types, then imported and used them directly. This kept the prompt focused on behaviour rather than types.

The Refined Pattern

Read [handbook section] at [path] and [TAD section] at [gist ID] Layer N.
Implement @stratt/{package} as a Bun TypeScript package:
  (1) src/{module}.ts — {behaviour using domain terms from the TAD}
  (2) src/{module}.ts — {behaviour referencing failure mode IDs}
  ...
Write tests for each failure mode ({N} tests minimum, one per FM-XX)
plus a happy-path test where a valid PR passes all checks.

Scaling Observation

The fingerprint prompt produced 4 modules and 98 tests. The graph prompt produced 5 modules and 35 tests. The graph package was architecturally more complex (cross-package imports, DAG algorithms, CI orchestration) but the prompt was shorter — because the failure mode table compressed the test specification. Named failure modes are the most information-dense acceptance criteria available.

​What We Learned

​1. External Gist as Spec Reference

​2. Failure Modes as Acceptance Criteria

​3. Cross-Package Type Reuse via Exploration

​The Refined Pattern

​Scaling Observation