
Quick Reference

| Property | Value |
| --- | --- |
| Domain | Incident |
| FORGE Stage | Cross-cutting (operates during active incidents) |
| Version | 1.0.0 |
| Primary Output | Incident records, response coordination, communications |
Use this agent when you need to:
  • Coordinate incident response activities across teams
  • Manage stakeholder communications during outages
  • Track incident timeline from detection to resolution
  • Make escalation decisions based on severity and impact

Core Capabilities

Incident Orchestration

Coordinates response activities across teams and systems with clear accountability

Communication Management

Ensures stakeholders receive timely, accurate updates via appropriate channels

Escalation Handling

Makes escalation decisions based on severity, impact, and SLA requirements

Resolution Tracking

Tracks progress toward resolution and documents all actions taken

When to Use

Ideal Use Cases

SEV0-SEV2 incidents requiring coordinated response
Incidents affecting multiple services or teams
Customer-facing outages needing communication management
Complex incidents with unclear root cause requiring investigation coordination
Incidents approaching SLA thresholds requiring escalation

Usage Examples

SEV2: API Gateway 5xx Errors

Scenario: Database connection pool exhaustion causing API failures

Incident Record:
{
  "incident_id": "INC-20240115-0042",
  "title": "API Gateway 5xx errors spike affecting workflow executions",
  "severity": "SEV2",
  "status": "resolved",
  "commander": "oncall-backend",
  "timeline": [
    {
      "timestamp": "2024-01-15T14:05:00Z",
      "event": "PagerDuty alert: API error rate >5%",
      "actor": "monitoring",
      "category": "detection"
    },
    {
      "timestamp": "2024-01-15T14:08:00Z",
      "event": "Incident declared, war room opened",
      "actor": "oncall-backend",
      "category": "action"
    },
    {
      "timestamp": "2024-01-15T14:15:00Z",
      "event": "Identified: Database connection pool exhaustion",
      "actor": "oncall-backend",
      "category": "investigation"
    },
    {
      "timestamp": "2024-01-15T14:20:00Z",
      "event": "Mitigation: Increased connection pool size via Railway",
      "actor": "oncall-backend",
      "category": "action"
    },
    {
      "timestamp": "2024-01-15T14:25:00Z",
      "event": "Error rate returning to normal",
      "actor": "monitoring",
      "category": "resolution"
    }
  ],
  "impact": {
    "users_affected": 150,
    "services_affected": ["so1-control-plane-api", "workflow-executions"],
    "revenue_impact": "low",
    "data_impact": "availability"
  },
  "actions": [
    {
      "id": "ACT-001",
      "description": "Increase default connection pool size in Railway config",
      "assignee": "platform-team",
      "status": "completed"
    },
    {
      "id": "ACT-002",
      "description": "Add connection pool exhaustion alert",
      "assignee": "oncall-backend",
      "status": "pending",
      "due": "2024-01-16T17:00:00Z"
    }
  ],
  "communications": [
    {
      "timestamp": "2024-01-15T14:12:00Z",
      "channel": "slack",
      "audience": "internal",
      "message": "🔴 SEV2 Incident declared: API errors affecting workflow executions. Investigating."
    },
    {
      "timestamp": "2024-01-15T14:35:00Z",
      "channel": "slack",
      "audience": "internal",
      "message": "✅ Incident resolved: Connection pool issue mitigated. Monitoring stable."
    }
  ],
  "resolution": {
    "summary": "Increased database connection pool size from 20 to 50 connections",
    "root_cause": "Spike in concurrent workflow executions exceeded connection pool capacity",
    "time_to_detect": "5 minutes",
    "time_to_mitigate": "15 minutes",
    "time_to_resolve": "25 minutes"
  }
}
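The `time_to_*` fields in the resolution block can be derived from the timeline rather than entered by hand. The sketch below assumes an `onset` timestamp for when the fault actually began (the record above implies 14:00Z), and uses the heuristic that the mitigation step is the last `action` event before the first `resolution` event; the helper name `resolutionTimes` is illustrative, not part of the agent.

```typescript
type Category = "detection" | "investigation" | "action" | "communication" | "resolution";

interface TimelineEvent {
  timestamp: string; // ISO8601, UTC
  event: string;
  actor: string;
  category: Category;
}

// Whole-minute difference between two ISO8601 timestamps.
const minutes = (from: string, to: string): string =>
  `${Math.round((Date.parse(to) - Date.parse(from)) / 60_000)} minutes`;

function resolutionTimes(onset: string, timeline: TimelineEvent[]) {
  const detected = timeline.find((e) => e.category === "detection")!.timestamp;
  const resolved = timeline.find((e) => e.category === "resolution")!.timestamp;
  // Heuristic: the mitigation step is the last action taken before resolution.
  const mitigated = timeline
    .filter((e) => e.category === "action" && e.timestamp < resolved)
    .at(-1)!.timestamp;
  return {
    time_to_detect: minutes(onset, detected),
    time_to_mitigate: minutes(detected, mitigated),
    time_to_resolve: minutes(onset, resolved),
  };
}
```

Run against the timeline above with a 14:00Z onset, this reproduces the 5/15/25-minute figures in the resolution block.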
Key Actions:
  • Declared incident within 3 minutes of alert
  • Coordinated investigation (logs, metrics, recent changes)
  • Applied mitigation (pool size increase) in 15 minutes
  • Communicated status to stakeholders
  • Created follow-up action items

Output Format

Incident Record Schema

interface IncidentRecord {
  type: "incident-record";
  version: "1.0.0";
  generated_by: "incident-commander";
  timestamp: string; // ISO8601
  
  content: {
    incident_id: string; // INC-YYYYMMDD-XXXX
    title: string;
    severity: "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";
    status: "detected" | "investigating" | "identified" | "mitigating" | "resolved" | "closed";
    commander: string; // Responder ID
    
    timeline: Array<{
      timestamp: string;
      event: string;
      actor: string; // Person or system
      category: "detection" | "investigation" | "action" | "communication" | "resolution";
    }>;
    
    impact: {
      users_affected: number | "all";
      services_affected: string[];
      revenue_impact: "none" | "low" | "medium" | "high" | "critical";
      data_impact: "none" | "integrity" | "availability" | "confidentiality";
    };
    
    actions: Array<{
      id: string; // ACT-XXX
      description: string;
      assignee: string;
      status: "pending" | "in_progress" | "completed" | "blocked";
      due?: string;
    }>;
    
    communications: Array<{
      timestamp: string;
      channel: "slack" | "email" | "status_page" | "phone";
      audience: "internal" | "customers" | "all";
      message: string;
    }>;
    
    decisions: string[]; // ADR format
    
    resolution?: {
      summary: string;
      root_cause: string;
      time_to_detect: string;
      time_to_mitigate: string;
      time_to_resolve: string;
    };
  };
}
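A minimal sketch of generating and validating the `INC-YYYYMMDD-XXXX` identifier the schema requires. The `sequence` argument would come from a per-day counter in the incident store; that source is an assumption, since the doc specifies only the format.

```typescript
// Matches the incident_id format in the schema: INC-YYYYMMDD-XXXX.
const INCIDENT_ID_PATTERN = /^INC-\d{8}-\d{4}$/;

function incidentId(date: Date, sequence: number): string {
  // e.g. 2024-01-15 -> "20240115"
  const ymd = date.toISOString().slice(0, 10).replace(/-/g, "");
  return `INC-${ymd}-${String(sequence).padStart(4, "0")}`;
}
```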

Severity Definitions

| Level | Criteria | Response Time | Commander | Escalation |
| --- | --- | --- | --- | --- |
| SEV0 | Complete outage, data loss, security breach | Immediate | Senior engineer + management | Executive + all hands |
| SEV1 | Major degradation, significant user impact | <5 minutes | Senior engineer | Engineering leadership |
| SEV2 | Partial functionality loss, moderate impact | <15 minutes | On-call engineer | Team lead (if prolonged) |
| SEV3 | Minor issues, workarounds available | <1 hour | On-call engineer | None (standard queue) |
| SEV4 | Cosmetic issues, no functional impact | Best effort | Any engineer | None (backlog) |
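The response-time column above can be encoded directly as a lookup, which is how an escalation check might be automated. A sketch, with SLAs in minutes (`null` meaning best effort, and SEV0 treated as an immediate, zero-minute SLA); the function name is illustrative.

```typescript
type Severity = "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Response-time SLAs from the severity table, in minutes.
const RESPONSE_SLA_MIN: Record<Severity, number | null> = {
  SEV0: 0,    // immediate
  SEV1: 5,
  SEV2: 15,
  SEV3: 60,
  SEV4: null, // best effort, no SLA
};

// True when an unacknowledged incident has exceeded its response-time SLA.
function slaBreached(severity: Severity, minutesSinceAlert: number): boolean {
  const sla = RESPONSE_SLA_MIN[severity];
  return sla !== null && minutesSinceAlert > sla;
}
```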

Communication Templates

Internal Update (Slack)

🔴 **SEV{X} Incident: {title}**

**Status**: {investigating|identified|mitigating|resolved}
**Impact**: {impact_description}
**Current actions**: {what_we_are_doing}
**ETA**: {estimated_resolution_time}

Commander: @{commander}
War room: #{channel}

Customer Communication (Status Page)

**Incident: {title}**

We are currently experiencing {issue_description}.

**Impact**: {customer_impact}
**Status**: {current_status}
**Next update**: {timestamp}

We apologize for any inconvenience and will provide updates as we work toward resolution.

Executive Briefing (SEV0-SEV1)

**CRITICAL INCIDENT BRIEF**

**Incident**: {title}
**Severity**: {SEV0|SEV1}
**Impact**: {users_affected} users, {revenue_impact} revenue impact
**Status**: {current_status}

**Actions Taken**:
- {action_1}
- {action_2}

**ETA to Resolution**: {estimate}
**Commander**: {name}
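The templates above use single-brace `{placeholder}` slots. A minimal interpolation sketch: known keys are substituted, and unknown placeholders (including alternation slots like `{investigating|identified|mitigating|resolved}`) are left intact for the commander to fill by hand. The helper name is illustrative.

```typescript
// Fill {placeholder} slots from a value map; leave unmatched slots as-is.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{([^{}]+)\}/g, (match, key) => values[key] ?? match);
}
```

For example, `fillTemplate("Commander: @{commander}", { commander: "alice" })` yields `Commander: @alice`.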

FORGE Gate Compliance

Before invoking this agent, ensure:
  • Incident detected: Automated alert or manual declaration received
  • Severity assessed: Triage Responder has classified severity (SEV0-4)
  • Channel established: War room or incident channel created
  • Responder notified: On-call engineer(s) paged and available
Verification: Factory Orchestrator confirms triage complete before Commander activation
This agent completes successfully when:
  • Incident resolved: Service restored or stable mitigation in place
  • Actions documented: All action items captured and assigned
  • Communications sent: Stakeholders notified of resolution
  • Timeline complete: Full incident chronology documented
  • Handoff initiated: Postmortem Analyst engaged for analysis
  • Decision record logged: Key decisions documented in ADR format
Verification: Gatekeeper validates completeness before closing incident
All critical incident decisions are logged as:
date:2024-01-20T09:23:00Z|context:Primary cluster unresponsive for 8 minutes|decision:Initiate immediate failover to secondary cluster|rationale:Recovery time uncertain, customer impact critical, RTO exceeded|consequences:Faster resolution, possible data lag up to 5 minutes|status:accepted
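The decision-log line above is pipe-delimited `key:value` fields. A parsing sketch, assuming keys never contain `|` or `:` and values never contain `|` (values such as the ISO date may contain colons, so only the first colon in each field is treated as the separator).

```typescript
// Parse "key:value|key:value|..." into a plain record.
function parseDecision(line: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const field of line.split("|")) {
    const sep = field.indexOf(":");
    out[field.slice(0, sep)] = field.slice(sep + 1);
  }
  return out;
}
```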

Integration Points

Control Plane API

Used Endpoints:
  • GET /api/v1/health - Service health status checks
  • GET /api/v1/workflows/{id}/executions - Execution history for affected workflows
  • POST /api/v1/workflows/{id}/pause - Pause workflows during mitigation
  • POST /api/incidents - Create incident records
  • PATCH /api/incidents/:id - Update incident status
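A hypothetical sketch of calling `POST /api/incidents` from the list above. The endpoint path comes from the doc; the base URL, bearer-token auth, and response shape are assumptions. The `fetchImpl` parameter exists only so the call can be exercised without a live API.

```typescript
// Create an incident record via the Control Plane API (sketch; auth scheme assumed).
async function createIncident(
  baseUrl: string,
  token: string,
  record: object,
  fetchImpl: typeof fetch = fetch,
): Promise<unknown> {
  const res = await fetchImpl(`${baseUrl}/api/incidents`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(record),
  });
  if (!res.ok) throw new Error(`POST /api/incidents failed: ${res.status}`);
  return res.json();
}
```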

n8n Workflow Integration

  • Pause workflows: Stop affected workflows during incidents
  • Alert enrichment: Fetch execution logs for context
  • Notification dispatch: Trigger stakeholder alerts via n8n workflows
  • Timeline updates: Real-time incident status updates

Veritas Prompt Library

Consumes:
  • vrt-incident01: Incident response playbook templates
  • vrt-comms01: Stakeholder communication templates
  • vrt-escalate01: Escalation decision criteria
Produces:
  • Novel incident response tasks in veritas/agent-prompts/incident/
  • Status: draft (requires review)
| Agent | Relationship | Integration Point |
| --- | --- | --- |
| Triage Responder | Upstream | Receives severity assessment and initial context |
| Postmortem Analyst | Downstream | Hands off resolved incident for RCA |
| Factory Orchestrator | Peer | Escalates for multi-agent coordination needs |
| Railway Deployer | Consumer | May request rollbacks or infrastructure changes |

Workflow Process

1. Incident Declaration: formally declare and initialize the incident
  • Assign incident ID (INC-YYYYMMDD-XXXX)
  • Create war room/channel
  • Set initial severity
  • Assign commander

2. Investigation Coordination: coordinate investigation activities
  • Assign investigation tasks
  • Identify relevant logs/metrics
  • Start timeline tracking
  • Document hypotheses

3. Communication Management: manage stakeholder communications
  • Post internal updates (Slack)
  • Send customer communications (status page)
  • Deliver executive briefings (SEV0-1)
  • Provide regular status updates

4. Mitigation/Resolution: coordinate mitigation and resolution
  • Execute mitigation actions
  • Verify resolution
  • Confirm monitoring stable
  • Prepare rollback plan (if needed)

5. Handoff: complete the incident and hand off
  • Finalize timeline
  • Document action items
  • Schedule postmortem
  • Close incident record
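The five phases above track the `status` values in the schema. A sketch of the implied status state machine; the transition map is an assumption, since the source lists the statuses but not which moves between them are legal. Notably, mitigation may fail and reopen investigation.

```typescript
type Status = "detected" | "investigating" | "identified" | "mitigating" | "resolved" | "closed";

// Assumed legal transitions between the schema's status values.
const NEXT: Record<Status, Status[]> = {
  detected: ["investigating"],
  investigating: ["identified"],
  identified: ["mitigating"],
  mitigating: ["resolved", "investigating"], // a failed mitigation reopens investigation
  resolved: ["closed"],
  closed: [],
};

function canTransition(from: Status, to: Status): boolean {
  return NEXT[from].includes(to);
}
```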

Error Handling

Common Issues

Escalation Delay
  • Cause: Severity underestimated, incident worsening
  • Resolution: Re-assess severity, escalate immediately, update communications

Communication Gap
  • Cause: Stakeholders not receiving updates
  • Resolution: Establish regular cadence (every 15-30 min), use multiple channels

Action Item Loss
  • Cause: Actions discussed but not documented
  • Resolution: Real-time documentation, assign owners immediately, track in incident record

Escalation Path

If Commander cannot manage incident effectively:
  1. Escalate to engineering leadership (SEV0-1)
  2. Request additional responders if needed
  3. Hand off command if commander unavailable
  4. Engage Factory Orchestrator for multi-domain issues

Success Metrics

| Metric | Target | Critical Threshold |
| --- | --- | --- |
| Time to Declaration | <5 minutes | >10 minutes |
| First Update Latency | <10 minutes | >20 minutes |
| Action Item Capture | 100% | <90% |
| Communication Frequency | Every 15-30 min | >1 hour gaps |
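For the latency metrics above, a measured value can be graded against the target and critical thresholds. A sketch (names illustrative); it covers "lower is better" metrics in minutes, while a percentage metric such as Action Item Capture would need the comparisons inverted.

```typescript
// Grade a "lower is better" metric: below target = ok, above the
// critical threshold = crit, anything between = warn.
function grade(valueMin: number, targetMin: number, criticalMin: number): "ok" | "warn" | "crit" {
  if (valueMin < targetMin) return "ok";
  if (valueMin > criticalMin) return "crit";
  return "warn";
}
```

For example, a 7-minute time to declaration misses the <5-minute target but has not yet crossed the 10-minute critical threshold.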

Source Files


Maintained in so1-agents repository under agents/incident/incident-commander.md