# Postmortem: API Gateway Connection Pool Exhaustion

**Incident ID**: INC-20240115-0042
**Date**: 2024-01-15
**Severity**: SEV2
**Duration**: 25 minutes
**Authors**: @oncall-backend, @postmortem-analyst

## Summary

On January 15, 2024, the SO1 Control Plane API experienced elevated error rates (12% 5xx) for 25 minutes due to database connection pool exhaustion. Approximately 150 users were affected, experiencing failed workflow executions and API timeouts.

### Impact

- 150 users affected
- ~200 workflow executions failed
- API error rate peaked at 12%
- No data loss

## Timeline

| Time (UTC) | Event | Significance |
|------------|-------|--------------|
| 14:02 | Connection pool hits 100% utilization | First sign |
| 14:05 | PagerDuty alert fires | Detection |
| 14:08 | Incident declared | Response initiated |
| 14:15 | Root cause identified | Investigation complete |
| 14:20 | Pool size increased 20→50 | Mitigation applied |
| 14:25 | Error rate normal | Impact ends |

## Root Cause Analysis

### Primary Root Cause

The database connection pool was configured with a maximum of 20 connections, which was insufficient for the concurrent load generated by the new bulk workflow execution feature deployed earlier that day.

### 5 Whys Analysis

1. **Why did the API return 5xx errors?** → Because database queries were timing out
2. **Why were queries timing out?** → Because no database connections were available
3. **Why were no connections available?** → Because the connection pool was exhausted (20/20 in use)
4. **Why was the pool exhausted?** → Because bulk execution opened many concurrent connections
5. **Why wasn't the pool sized for this load?** → Because connection requirements weren't load tested before deployment

### Contributing Factors

| Factor | Category | Preventable |
|--------|----------|-------------|
| No load testing for bulk execution | process | ✓ |
| Pool size not documented | process | ✓ |
| No alert on pool utilization | technology | ✓ |
| Deployed during peak hours | process | ✓ |

## What Went Well

- Alert fired within 2 minutes
- Quick incident declaration (3 min)
- Root cause identified in 10 minutes
- Non-disruptive mitigation (no restart)
- Clear communication throughout

## What Went Poorly

- No proactive pool monitoring
- Load testing missed bulk operations
- Pool configuration undocumented
- Peak-hour deployment without gradual rollout

## Where We Got Lucky

- Fix worked immediately
- No data inconsistencies
- Experienced engineer on-call

## Action Items

| ID | Type | Description | Owner | Priority | Due |
|----|------|-------------|-------|----------|-----|
| AI-001 | detect | Add monitor for pool utilization >80% | @platform | P1 | 2024-01-17 |
| AI-002 | prevent | Document pool sizing in runbook | @backend | P2 | 2024-01-19 |
| AI-003 | prevent | Add load testing for bulk ops to CI | @qa | P1 | 2024-01-26 |
| AI-004 | process | Require gradual rollout for features | @eng-mgr | P2 | 2024-01-31 |
| AI-005 | prevent | Review all service connection pools | @platform | P2 | 2024-02-15 |

## Lessons Learned

1. **Connection pools need explicit sizing** - Defaults are often insufficient
2. **Load test new features** - Especially those changing concurrency
3. **Monitor before you need it** - Pool metrics should be day-one alerts
4. **Deploy during low-traffic** - Or use gradual rollout

## Metrics

| Metric | Value |
|--------|-------|
| Time to Detect | 5 min |
| Time to Mitigate | 15 min |
| Time to Resolve | 25 min |
| Responders | 2 |

## Tags

`database` `connection-pool` `capacity` `deployment` `bulk-operations`
**Key Insights**:

- Root cause: Insufficient connection pool sizing
- Process gap: No load testing for concurrency changes
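The failure mode and the AI-001 monitor are easy to reproduce in miniature. Below is a stdlib-only Python sketch of a fixed-size pool — an illustration, not the gateway's actual pool implementation, which the report does not show:

```python
import queue


class ConnectionPool:
    """Minimal fixed-size pool: checkout blocks up to `timeout` seconds,
    then fails -- the exhaustion mode behind this incident (20/20 in use)."""

    def __init__(self, factory, size=50, timeout=5.0):
        self._size = size
        self._timeout = timeout
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    def checkout(self):
        try:
            return self._idle.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("connection pool exhausted")

    def checkin(self, conn):
        self._idle.put(conn)

    def utilization(self):
        """Fraction of connections in use; AI-001 alerts above 0.80."""
        return (self._size - self._idle.qsize()) / self._size
```

With `size=20` and sustained bulk load, `checkout()` starts raising `TimeoutError` exactly as the 5 Whys chain describes; alerting when `utilization()` exceeds 0.80 is the shape of the AI-001 monitor.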
# Postmortem: API Response Time Degradation

**Incident ID**: INC-20240118-0023
**Severity**: SEV3
**Duration**: 2 hours (slow mitigation)

## Root Cause

N+1 query pattern introduced in the workflow listing endpoint during a pagination refactor. Each workflow loaded its execution history in a separate query, resulting in hundreds of queries per page load.

## Pattern Recognition

This is the **third N+1 query incident in 6 months**:

1. INC-20231015-0008: User listing endpoint
2. INC-20231201-0019: Organization dashboard
3. INC-20240118-0023: Workflow listing (current)

**Common Pattern**: Code reviews missed performance implications of ORM queries during pagination refactors.

## Root Cause (Organizational)

While the technical root cause is the N+1 query, the **organizational root cause** is the absence of query performance review in our code review checklist and lack of automated detection.

## Action Items

| ID | Type | Description | Owner | Priority |
|----|------|-------------|-------|----------|
| AI-001 | detect | Add query count monitoring per endpoint | @platform | P1 |
| AI-002 | prevent | Add ORM query checklist to PR template | @eng-mgr | P1 |
| AI-003 | prevent | Implement query count lint rule in CI | @platform | P0 |
| AI-004 | prevent | Review all listing endpoints for N+1 | @backend | P2 |
| AI-005 | process | Schedule quarterly ORM training | @eng-mgr | P2 |

## Lessons Learned

1. **Recurring patterns need systemic fixes** - Individual fixes aren't enough
2. **Automated detection > manual review** - Humans miss things under pressure
3. **Query performance isn't obvious** - ORM abstraction hides performance
4. **Training matters** - Team needs ongoing education on performance patterns

## Related Incidents

- INC-20231015-0008: N+1 in user listing (SEV3)
- INC-20231201-0019: N+1 in org dashboard (SEV3)

**Recommendation**: Create runbook for "N+1 Query Detection and Prevention"
**Key Insights**:

- Identified recurring pattern (3rd occurrence)
- Distinguished technical vs organizational root cause
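The N+1 shape, and the batched fix, can be sketched with stdlib `sqlite3` and a simple query counter; the table and column names here are illustrative, not the service's real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workflow (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE execution (id INTEGER PRIMARY KEY, workflow_id INTEGER);
    INSERT INTO workflow VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO execution VALUES (1, 1), (2, 1), (3, 2);
""")

queries = 0  # what AI-001's per-endpoint query count monitoring would observe

def q(sql, args=()):
    global queries
    queries += 1
    return conn.execute(sql, args).fetchall()

# N+1: one query for the page, then one more per workflow for its history.
workflows = q("SELECT id, name FROM workflow")
for wf_id, _ in workflows:
    q("SELECT id FROM execution WHERE workflow_id = ?", (wf_id,))
n_plus_1 = queries  # 1 + N -- grows with page size

# Batched fix: fetch every history for the page in a single IN query.
queries = 0
workflows = q("SELECT id, name FROM workflow")
ids = [wf_id for wf_id, _ in workflows]
placeholders = ",".join("?" * len(ids))
q(f"SELECT workflow_id, id FROM execution WHERE workflow_id IN ({placeholders})", ids)
batched = queries  # always 2, regardless of page size
```

A CI lint rule in the spirit of AI-003 would run an endpoint against a fixture page and fail the build when the observed query count scales with row count rather than staying constant.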
1. Why did X happen? → Because Y
2. Why did Y happen? → Because Z
3. Why did Z happen? → Because W
4. Why did W happen? → Because V
5. Why did V happen? → [Root cause]
Example: Connection pool exhaustion (see Usage Examples)
- **Incident resolved**: Status set to "closed" by Incident Commander
- **Record available**: Complete incident record with timeline
- **Responders identified**: Key participants available for input
- **Data accessible**: Logs, metrics, code changes retrievable

**Verification**: Factory Orchestrator confirms incident closure and data availability
## Exit Gates (Post-conditions)

This agent completes successfully when:

- **Root cause identified**: Primary cause documented with high confidence
- **Factors documented**: All contributing factors categorized
- **Timeline complete**: Full chronology with significance annotations
- **Action items created**: Specific tasks with owners and due dates
- **Report published**: Postmortem document available in knowledge base
- **Decision record logged**: Analysis rationale in ADR format

**Verification**: Gatekeeper validates postmortem completeness and action item clarity
## Decision Record Format

All significant postmortem conclusions are logged as:
```
date:2024-01-15T16:00:00Z|context:Analyzing connection pool exhaustion incident|decision:Root cause is undersized pool vs application bug|rationale:5 Whys leads to deployment without load testing, pool size was consequence|consequences:Action items focus on testing and monitoring, not just pool tuning|status:accepted
```
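A record in this format can be split back into its fields in a few lines. A minimal Python sketch, assuming field values never contain `|` and that only the first `:` in each field is the separator (the ISO timestamp's later colons stay in the value):

```python
def parse_decision_record(line: str) -> dict:
    """Split a pipe-delimited decision record into a field->value dict."""
    return dict(field.split(":", 1) for field in line.strip().split("|"))


record = parse_decision_record(
    "date:2024-01-15T16:00:00Z"
    "|context:Analyzing connection pool exhaustion incident"
    "|decision:Root cause is undersized pool vs application bug"
    "|rationale:5 Whys leads to deployment without load testing, pool size was consequence"
    "|consequences:Action items focus on testing and monitoring, not just pool tuning"
    "|status:accepted"
)
```

After parsing, `record["status"]` is `"accepted"` and `record["date"]` is the full timestamp, so downstream tooling can filter accepted decisions or sort records chronologically.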
**Insufficient Data for RCA**

- Cause: Logs rotated, metrics not retained, responders unavailable
- Resolution: Document data gaps, use available evidence, note confidence level <70%

**Multiple Plausible Root Causes**

- Cause: Complex incident with unclear causality
- Resolution: Use Fishbone diagram, document all contributing factors, assign confidence levels

**Recurrence of Known Issues**

- Cause: Previous action items not completed
- Resolution: Highlight pattern, escalate incomplete action items, recommend process changes