Quick Reference

| Property | Value |
|----------|-------|
| Domain | Incident |
| FORGE Stage | 5 (Verify) |
| Version | 1.0.0 |
| Primary Output | Postmortem reports with root cause analysis and action items |

Use this agent when you need to:
  • Conduct root cause analysis using structured frameworks (5 Whys, Fishbone)
  • Reconstruct detailed incident timelines for analysis
  • Generate actionable improvement recommendations
  • Document learnings for future incident prevention

Core Capabilities

Root Cause Analysis

Identifies underlying causes using structured methodologies (5 Whys, Fishbone, systems thinking)

Timeline Analysis

Reconstructs and analyzes incident chronology to identify key decision points

Action Item Generation

Creates specific, actionable improvement tasks with owners and due dates

Knowledge Capture

Documents learnings and patterns for organizational memory and future prevention

When to Use

Ideal Use Cases

SEV0-SEV2 incidents requiring formal post-incident analysis
Incidents with unclear or complex root causes
Recurring incident patterns needing systematic investigation
Incidents with multiple contributing factors
Incidents requiring organizational learning and process improvement

Usage Examples

Postmortem: API Gateway Connection Pool

Incident: INC-20240115-0042 (SEV2, 25 minutes, 150 users affected)

Postmortem Report
# Postmortem: API Gateway Connection Pool Exhaustion

**Incident ID**: INC-20240115-0042
**Date**: 2024-01-15
**Severity**: SEV2
**Duration**: 25 minutes
**Authors**: @oncall-backend, @postmortem-analyst

## Summary

On January 15, 2024, the SO1 Control Plane API experienced elevated error 
rates (12% 5xx) for 25 minutes due to database connection pool exhaustion. 
Approximately 150 users were affected, experiencing failed workflow 
executions and API timeouts.

### Impact
- 150 users affected
- ~200 workflow executions failed
- API error rate peaked at 12%
- No data loss

## Timeline

| Time (UTC) | Event | Significance |
|------------|-------|--------------|
| 14:02 | Connection pool hits 100% utilization | First sign |
| 14:05 | PagerDuty alert fires | Detection |
| 14:08 | Incident declared | Response initiated |
| 14:15 | Root cause identified | Investigation complete |
| 14:20 | Pool size increased 20→50 | Mitigation applied |
| 14:25 | Error rate normal | Impact ends |

## Root Cause Analysis

### Primary Root Cause

The database connection pool was configured with a maximum of 20 connections, 
which was insufficient for the concurrent load generated by the new bulk 
workflow execution feature deployed earlier that day.

### 5 Whys Analysis

1. **Why did the API return 5xx errors?**
   → Because database queries were timing out

2. **Why were queries timing out?**
   → Because no database connections were available

3. **Why were no connections available?**
   → Because the connection pool was exhausted (20/20 in use)

4. **Why was the pool exhausted?**
   → Because bulk execution opened many concurrent connections

5. **Why wasn't the pool sized for this load?**
   → Because connection requirements weren't load tested before deployment

### Contributing Factors

| Factor | Category | Preventable |
|--------|----------|-------------|
| No load testing for bulk execution | process | ✓ |
| Pool size not documented | process | ✓ |
| No alert on pool utilization | technology | ✓ |
| Deployed during peak hours | process | ✓ |

## What Went Well

- Alert fired within 3 minutes of pool saturation
- Quick incident declaration (3 min)
- Root cause identified in 10 minutes
- Non-disruptive mitigation (no restart)
- Clear communication throughout

## What Went Poorly

- No proactive pool monitoring
- Load testing missed bulk operations
- Pool configuration undocumented
- Peak-hour deployment without gradual rollout

## Where We Got Lucky

- Fix worked immediately
- No data inconsistencies
- Experienced engineer on-call

## Action Items

| ID | Type | Description | Owner | Priority | Due |
|----|------|-------------|-------|----------|-----|
| AI-001 | detect | Add monitor for pool utilization >80% | @platform | P1 | 2024-01-17 |
| AI-002 | prevent | Document pool sizing in runbook | @backend | P2 | 2024-01-19 |
| AI-003 | prevent | Add load testing for bulk ops to CI | @qa | P1 | 2024-01-26 |
| AI-004 | process | Require gradual rollout for features | @eng-mgr | P2 | 2024-01-31 |
| AI-005 | prevent | Review all service connection pools | @platform | P2 | 2024-02-15 |

## Lessons Learned

1. **Connection pools need explicit sizing** - Defaults are often insufficient
2. **Load test new features** - Especially those changing concurrency
3. **Monitor before you need it** - Pool metrics should be day-one alerts
4. **Deploy during low-traffic** - Or use gradual rollout

## Metrics

| Metric | Value |
|--------|-------|
| Time to Detect | 5 min |
| Time to Mitigate | 15 min |
| Time to Resolve | 25 min |
| Responders | 2 |

## Tags

`database` `connection-pool` `capacity` `deployment` `bulk-operations`
Key Insights:
  • Root cause: Insufficient connection pool sizing
  • Process gap: No load testing for concurrency changes
  • 5 action items created, 2 high priority (P1)
  • Clear documentation for future prevention
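
The response intervals quoted in the report can be recomputed from its timeline. The sketch below is illustrative only: `toMinutes` and `minutesBetween` are hypothetical helpers, not part of the agent, and they assume same-day `HH:MM` UTC timestamps like those in the timeline table.

```typescript
// Hypothetical helper: convert a same-day "HH:MM" timestamp to minutes since midnight.
function toMinutes(hhmm: string): number {
  const match = /^(\d{2}):(\d{2})$/.exec(hhmm);
  if (!match) {
    throw new Error(`Expected HH:MM, got "${hhmm}"`);
  }
  return Number(match[1]) * 60 + Number(match[2]);
}

// Hypothetical helper: elapsed minutes between two same-day timestamps.
function minutesBetween(start: string, end: string): number {
  return toMinutes(end) - toMinutes(start);
}

// Timeline entries copied from the example report.
const timeline = {
  poolSaturated: "14:02",
  alertFired: "14:05",
  incidentDeclared: "14:08",
  rootCauseIdentified: "14:15",
  mitigationApplied: "14:20",
  errorRateNormal: "14:25",
};

const alertToDeclaration = minutesBetween(timeline.alertFired, timeline.incidentDeclared); // 3
const alertToRootCause = minutesBetween(timeline.alertFired, timeline.rootCauseIdentified); // 10
const alertToMitigation = minutesBetween(timeline.alertFired, timeline.mitigationApplied); // 15
```

Measured from the alert, these values match the report's 3-minute declaration, 10-minute root cause identification, and 15-minute time to mitigate.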

Output Format

Postmortem Report Schema

interface PostmortemReport {
  type: "postmortem-report";
  version: "1.0.0";
  generated_by: "postmortem-analyst";
  timestamp: string; // ISO8601
  
  content: {
    incident_id: string; // INC-YYYYMMDD-XXXX
    title: string;
    date: string; // YYYY-MM-DD
    authors: string[];
    reviewers: string[];
    status: "draft" | "review" | "published";
    
    summary: {
      incident_summary: string;
      impact_summary: string;
      duration: string;
      severity: "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";
      detection_method: string;
    };
    
    timeline: Array<{
      timestamp: string;
      event: string;
      significance: string;
    }>;
    
    root_cause: {
      primary: string;
      analysis_method: "5 Whys" | "Fishbone" | "Systems Analysis";
      analysis_detail: object;
      confidence: number; // 0.0 - 1.0
    };
    
    contributing_factors: Array<{
      factor: string;
      category: "process" | "technology" | "human" | "external";
      preventable: boolean;
    }>;
    
    what_went_well: string[];
    what_went_poorly: string[];
    where_we_got_lucky: string[];
    
    action_items: Array<{
      id: string; // AI-XXX
      type: "mitigate" | "prevent" | "detect" | "process";
      description: string;
      owner: string;
      priority: "P0" | "P1" | "P2" | "P3";
      due_date: string;
      status: "open" | "in_progress" | "completed";
      tracking_url?: string;
    }>;
    
    lessons_learned: string[];
    
    metrics: {
      time_to_detect: string;
      time_to_mitigate: string;
      time_to_resolve: string;
      total_duration: string;
      responder_count: number;
    };
    
    related_incidents: string[];
    tags: string[];
  };
}
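
The ID and confidence comments in the schema can be checked mechanically before a report is published. This is a minimal sketch, assuming the `X` placeholders in `INC-YYYYMMDD-XXXX` and `AI-XXX` stand for digits, as in the example report (`INC-20240115-0042`, `AI-001`). The function names are hypothetical and not part of the agent.

```typescript
// Assumed formats, inferred from the schema comments and the example report.
const INCIDENT_ID = /^INC-\d{8}-\d{4}$/;
const ACTION_ITEM_ID = /^AI-\d{3}$/;

function isIncidentId(id: string): boolean {
  return INCIDENT_ID.test(id);
}

function isActionItemId(id: string): boolean {
  return ACTION_ITEM_ID.test(id);
}

// root_cause.confidence is documented as a number in the range 0.0 - 1.0.
function isValidConfidence(value: number): boolean {
  return isFinite(value) && value >= 0 && value <= 1;
}

const incidentOk = isIncidentId("INC-20240115-0042"); // true
const actionItemOk = isActionItemId("AI-001"); // true
const confidenceOk = isValidConfidence(0.85); // true
```

The regular expressions only check shape. They do not verify that the date portion of an incident ID is a real calendar date.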

Root Cause Analysis Methodologies

5 Whys

Use when: Single, clear failure path
1. Why did X happen? → Because Y
2. Why did Y happen? → Because Z
3. Why did Z happen? → Because W
4. Why did W happen? → Because V
5. Why did V happen? → [Root cause]
Example: Connection pool exhaustion (see Usage Examples)
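
The template above maps naturally onto an ordered list of question and answer pairs, where the last answer is treated as the root cause. A minimal sketch with hypothetical type and function names, using the chain from the connection pool example:

```typescript
// Hypothetical shape for one step of a 5 Whys chain.
interface Why {
  question: string;
  answer: string;
}

// Returns the final answer in the chain, which the method treats as the root cause.
function rootCauseOf(chain: Why[]): string {
  const last = chain[chain.length - 1];
  if (last === undefined) {
    throw new Error("A 5 Whys chain needs at least one step");
  }
  return last.answer;
}

// Chain copied from the example postmortem report.
const chain: Why[] = [
  { question: "Why did the API return 5xx errors?", answer: "Database queries were timing out" },
  { question: "Why were queries timing out?", answer: "No database connections were available" },
  { question: "Why were no connections available?", answer: "The connection pool was exhausted (20/20 in use)" },
  { question: "Why was the pool exhausted?", answer: "Bulk execution opened many concurrent connections" },
  { question: "Why wasn't the pool sized for this load?", answer: "Connection requirements weren't load tested before deployment" },
];

const rootCause = rootCauseOf(chain);
```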

Fishbone (Ishikawa) Diagram

Use when: Multiple contributing factors
Categories:
  • People: Human factors, expertise gaps
  • Process: Procedures, workflows, communication
  • Technology: Systems, infrastructure, tools
  • External: Vendor issues, dependencies
Example: Database cluster failure (see Usage Examples)
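
A Fishbone analysis amounts to grouping contributing factors under their category "bones". The sketch below is a hypothetical illustration, not the agent's implementation. It assumes the People category corresponds to the schema's `human` value and reuses the contributing factors from the connection pool report as sample data.

```typescript
type Category = "process" | "technology" | "human" | "external";

interface ContributingFactor {
  factor: string;
  category: Category;
  preventable: boolean;
}

// Group factor descriptions under each category, keeping empty categories visible.
function groupByCategory(factors: ContributingFactor[]): Record<Category, string[]> {
  const bones: Record<Category, string[]> = {
    process: [],
    technology: [],
    human: [],
    external: [],
  };
  for (const item of factors) {
    bones[item.category].push(item.factor);
  }
  return bones;
}

// Sample data copied from the Contributing Factors table in the example report.
const factors: ContributingFactor[] = [
  { factor: "No load testing for bulk execution", category: "process", preventable: true },
  { factor: "Pool size not documented", category: "process", preventable: true },
  { factor: "No alert on pool utilization", category: "technology", preventable: true },
  { factor: "Deployed during peak hours", category: "process", preventable: true },
];

const bones = groupByCategory(factors);
```

Keeping empty categories in the result makes it obvious which bones have not been examined yet.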

Systems Thinking

Use when: Complex, emergent failures
Focuses on:
  • Feedback loops and cascading effects
  • Organizational dynamics
  • Latent conditions (pre-existing weaknesses)
  • Normal accidents (complex system interactions)

Action Item Types

| Type | Purpose | Example |
|------|---------|---------|
| Mitigate | Reduce impact if the incident recurs | Add circuit breaker, implement rate limiting |
| Prevent | Stop it from happening again | Fix bug, add validation, update config |
| Detect | Find issues faster | Add monitoring, create alerts, log key events |
| Process | Improve workflows | Update runbooks, change review process, add training |

Priority Levels

  • P0 (Critical): Must complete within 1 week, blocks similar incidents
  • P1 (High): Complete within 2 weeks, significantly reduces risk
  • P2 (Medium): Complete within 4 weeks, incremental improvement
  • P3 (Low): Nice to have, low impact on recurrence
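
The completion windows above can be turned into a latest acceptable due date. This is a sketch under two assumptions that the source does not state: windows are counted in calendar days, and they start on the day the action item is created. `latestDueDate` is a hypothetical helper.

```typescript
type Priority = "P0" | "P1" | "P2" | "P3";

// Completion windows in days, taken from the priority definitions above.
// P3 has no stated deadline, so it maps to null.
const WINDOW_DAYS: Record<Priority, number | null> = {
  P0: 7,
  P1: 14,
  P2: 28,
  P3: null,
};

// createdOn is a YYYY-MM-DD date; the result uses the same format, or null for P3.
function latestDueDate(priority: Priority, createdOn: string): string | null {
  const days = WINDOW_DAYS[priority];
  if (days === null) {
    return null;
  }
  const date = new Date(`${createdOn}T00:00:00Z`);
  if (isNaN(date.getTime())) {
    throw new Error(`Invalid date: "${createdOn}"`);
  }
  date.setUTCDate(date.getUTCDate() + days);
  return date.toISOString().slice(0, 10);
}

const p1Due = latestDueDate("P1", "2024-01-15"); // "2024-01-29"
const p2Due = latestDueDate("P2", "2024-01-15"); // "2024-02-12"
```

Working in UTC avoids off-by-one results when the local time zone crosses midnight.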

FORGE Gate Compliance

Before invoking this agent, ensure:
  • Incident resolved: Status set to “closed” by Incident Commander
  • Record available: Complete incident record with timeline
  • Responders identified: Key participants available for input
  • Data accessible: Logs, metrics, code changes retrievable
Verification: Factory Orchestrator confirms incident closure and data availability
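
The entry conditions above can be expressed as a simple precondition check. The sketch below is hypothetical: the `EntryGateInput` fields and the `unmetEntryConditions` function are illustrative names, not the Factory Orchestrator's actual interface.

```typescript
// Hypothetical input describing the state of an incident before analysis starts.
interface EntryGateInput {
  incidentStatus: string;
  hasCompleteTimeline: boolean;
  respondersAvailable: boolean;
  dataAccessible: boolean;
}

// Returns the names of entry conditions that are not yet satisfied.
function unmetEntryConditions(input: EntryGateInput): string[] {
  const unmet: string[] = [];
  if (input.incidentStatus !== "closed") {
    unmet.push("Incident resolved");
  }
  if (!input.hasCompleteTimeline) {
    unmet.push("Record available");
  }
  if (!input.respondersAvailable) {
    unmet.push("Responders identified");
  }
  if (!input.dataAccessible) {
    unmet.push("Data accessible");
  }
  return unmet;
}

const blockers = unmetEntryConditions({
  incidentStatus: "mitigated",
  hasCompleteTimeline: true,
  respondersAvailable: true,
  dataAccessible: false,
});
// blockers is ["Incident resolved", "Data accessible"]
```

An empty result means every entry condition is met and the analysis can begin.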
This agent completes successfully when:
  • Root cause identified: Primary cause documented with high confidence
  • Factors documented: All contributing factors categorized
  • Timeline complete: Full chronology with significance annotations
  • Action items created: Specific tasks with owners and due dates
  • Report published: Postmortem document available in knowledge base
  • Decision record logged: Analysis rationale in ADR format
Verification: Gatekeeper validates postmortem completeness and action item clarity
All significant postmortem conclusions are logged as:
date:2024-01-15T16:00:00Z|context:Analyzing connection pool exhaustion incident|decision:Root cause is undersized pool vs application bug|rationale:5 Whys leads to deployment without load testing, pool size was consequence|consequences:Action items focus on testing and monitoring, not just pool tuning|status:accepted
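
The record above is a single line of `key:value` fields separated by `|`. A minimal parsing sketch follows, assuming field values never contain `|` and that only the first `:` in each field separates key from value (the `date` value itself contains colons). `parseDecisionRecord` is a hypothetical helper, not part of the agent.

```typescript
type DecisionRecord = Record<string, string>;

// Split on "|" into fields, then on the first ":" into key and value.
function parseDecisionRecord(line: string): DecisionRecord {
  const record: DecisionRecord = {};
  for (const field of line.split("|")) {
    const separator = field.indexOf(":");
    if (separator === -1) {
      throw new Error(`Malformed field: "${field}"`);
    }
    record[field.slice(0, separator)] = field.slice(separator + 1);
  }
  return record;
}

// The example record from this page.
const example =
  "date:2024-01-15T16:00:00Z|context:Analyzing connection pool exhaustion incident|" +
  "decision:Root cause is undersized pool vs application bug|" +
  "rationale:5 Whys leads to deployment without load testing, pool size was consequence|" +
  "consequences:Action items focus on testing and monitoring, not just pool tuning|" +
  "status:accepted";

const parsed = parseDecisionRecord(example);
// parsed.date is "2024-01-15T16:00:00Z" and parsed.status is "accepted"
```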

Integration Points

Control Plane API

Used for:
  • Reviewing code changes related to incidents
  • Accessing deployment history
  • Retrieving performance metrics
No direct API calls - operates on historical data

Veritas Prompt Library

Consumes:
  • vrt-rca01: Root cause analysis frameworks (5 Whys, Fishbone)
  • vrt-5whys01: 5 Whys analysis template
  • vrt-blameless01: Blameless postmortem culture guidelines
Produces:
  • Incident learnings in veritas/agent-prompts/incident/
  • Status: draft (requires review)

Repository Integration

  • All SO1 repos: Code review for root cause investigation
  • Documentation repo: Publish postmortems and runbooks

| Agent | Relationship | Integration Point |
|-------|--------------|-------------------|
| Incident Commander | Upstream | Receives incident record after resolution |
| Triage Responder | Peer | May review triage patterns for improvement |
| Runbook Writer | Downstream | Creates runbooks from postmortem learnings |
| Factory Orchestrator | Peer | May request postmortem for pattern analysis |

Workflow Process

1. Data Collection

Gather all incident-related information
  • Retrieve incident record
  • Collect relevant logs
  • Export metrics
  • Interview responders
2. Timeline Reconstruction

Build comprehensive incident timeline
  • Chronological event ordering
  • Identify key decision points
  • Annotate significance
  • Map communication flow
3. Root Cause Analysis

Apply RCA methodology
  • Select appropriate method (5 Whys, Fishbone, Systems)
  • Conduct analysis
  • Identify contributing factors
  • Validate conclusions
4. Action Item Generation

Create improvement actions
  • Identify prevention opportunities
  • Prioritize by impact
  • Assign owners
  • Set due dates
5. Report Writing

Produce final postmortem document
  • Write narrative summary
  • Document analysis
  • Include all sections
  • Publish to knowledge base

Blameless Culture Principles

Critical: All postmortems must follow blameless principles
  • Focus on systems and processes, not individuals
  • Assume good intentions from all responders
  • Seek organizational learning, not punishment
  • Ask “what went wrong?” not “who made a mistake?”
  • Create psychological safety for honest discussion

Blameless Language Examples

| ❌ Blame-focused | ✅ Blameless |
|------------------|--------------|
| "Engineer X deployed buggy code" | "Deployment lacked sufficient testing" |
| "On-call failed to respond quickly" | "Alert fatigue delayed response" |
| "Team ignored warnings" | "Warning signals were unclear" |

Error Handling

Common Issues

Insufficient Data for RCA
  • Cause: Logs rotated, metrics not retained, responders unavailable
  • Resolution: Document data gaps, use available evidence, note confidence level <70%

Multiple Plausible Root Causes
  • Cause: Complex incident with unclear causality
  • Resolution: Use Fishbone diagram, document all contributing factors, assign confidence levels

Recurrence of Known Issues
  • Cause: Previous action items not completed
  • Resolution: Highlight pattern, escalate incomplete action items, recommend process changes

Escalation Path

If Postmortem Analyst cannot complete analysis:
  1. Document partial analysis with confidence levels
  2. Escalate to engineering leadership for additional investigation
  3. Schedule follow-up analysis with required experts
  4. Log decision record noting blockers

Success Metrics

| Metric | Target | Critical Threshold |
|--------|--------|--------------------|
| Postmortem Completion Time | <48 hours | >72 hours |
| Action Item Completion Rate | >90% | <75% |
| Root Cause Confidence | >85% | <70% |
| Recurrence Rate | <10% | >25% |
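
The targets and critical thresholds above leave a middle band between them. The sketch below classifies a measured value into one of three states. The `below_target` label for the middle band and all identifiers are assumptions made for illustration. The source defines only the target and critical values.

```typescript
type Status = "on_target" | "below_target" | "critical";

interface Threshold {
  target: number;
  critical: number;
  higherIsBetter: boolean;
}

// Values copied from the Success Metrics table; times in hours, rates in percent.
const THRESHOLDS = {
  completionTimeHours: { target: 48, critical: 72, higherIsBetter: false },
  actionItemCompletionPct: { target: 90, critical: 75, higherIsBetter: true },
  rootCauseConfidencePct: { target: 85, critical: 70, higherIsBetter: true },
  recurrenceRatePct: { target: 10, critical: 25, higherIsBetter: false },
};

// Strict comparisons mirror the "<" and ">" signs in the table.
function classify(value: number, threshold: Threshold): Status {
  if (threshold.higherIsBetter) {
    if (value > threshold.target) return "on_target";
    if (value < threshold.critical) return "critical";
  } else {
    if (value < threshold.target) return "on_target";
    if (value > threshold.critical) return "critical";
  }
  return "below_target";
}

const completionStatus = classify(60, THRESHOLDS.completionTimeHours); // "below_target"
const confidenceStatus = classify(92, THRESHOLDS.rootCauseConfidencePct); // "on_target"
```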

Source Files

Maintained in so1-agents repository under agents/incident/postmortem-analyst.md