Quick Reference
| Property | Value |
| --- | --- |
| Domain | Incident |
| FORGE Stage | Cross-cutting (operates during active incidents) |
| Version | 1.0.0 |
| Primary Output | Incident records, response coordination, communications |
Use this agent when you need to:
Coordinate incident response activities across teams
Manage stakeholder communications during outages
Track incident timeline from detection to resolution
Make escalation decisions based on severity and impact
Core Capabilities
Incident Orchestration Coordinates response activities across teams and systems with clear accountability
Communication Management Ensures stakeholders receive timely, accurate updates via appropriate channels
Escalation Handling Makes escalation decisions based on severity, impact, and SLA requirements
Resolution Tracking Tracks progress toward resolution and documents all actions taken
When to Use
Ideal Use Cases
SEV0-SEV2 incidents requiring coordinated response
Incidents affecting multiple services or teams
Customer-facing outages needing communication management
Complex incidents with unclear root cause requiring investigation coordination
Incidents approaching SLA thresholds requiring escalation
Not Recommended For
Usage Examples
SEV2: API Gateway 5xx Errors
Scenario: Database connection pool exhaustion causing API failures
Incident Record
{
"incident_id" : "INC-20240115-0042" ,
"title" : "API Gateway 5xx errors spike affecting workflow executions" ,
"severity" : "SEV2" ,
"status" : "resolved" ,
"commander" : "oncall-backend" ,
"timeline" : [
{
"timestamp" : "2024-01-15T14:05:00Z" ,
"event" : "PagerDuty alert: API error rate >5%" ,
"actor" : "monitoring" ,
"category" : "detection"
},
{
"timestamp" : "2024-01-15T14:08:00Z" ,
"event" : "Incident declared, war room opened" ,
"actor" : "oncall-backend" ,
"category" : "action"
},
{
"timestamp" : "2024-01-15T14:15:00Z" ,
"event" : "Identified: Database connection pool exhaustion" ,
"actor" : "oncall-backend" ,
"category" : "investigation"
},
{
"timestamp" : "2024-01-15T14:20:00Z" ,
"event" : "Mitigation: Increased connection pool size via Railway" ,
"actor" : "oncall-backend" ,
"category" : "action"
},
{
"timestamp" : "2024-01-15T14:25:00Z" ,
"event" : "Error rate returning to normal" ,
"actor" : "monitoring" ,
"category" : "resolution"
}
],
"impact" : {
"users_affected" : 150 ,
"services_affected" : [ "so1-control-plane-api" , "workflow-executions" ],
"revenue_impact" : "low" ,
"data_impact" : "availability"
},
"actions" : [
{
"id" : "ACT-001" ,
"description" : "Increase default connection pool size in Railway config" ,
"assignee" : "platform-team" ,
"status" : "completed"
},
{
"id" : "ACT-002" ,
"description" : "Add connection pool exhaustion alert" ,
"assignee" : "oncall-backend" ,
"status" : "pending" ,
"due" : "2024-01-16T17:00:00Z"
}
],
"communications" : [
{
"timestamp" : "2024-01-15T14:12:00Z" ,
"channel" : "slack" ,
"audience" : "internal" ,
"message" : "🔴 SEV2 Incident declared: API errors affecting workflow executions. Investigating."
},
{
"timestamp" : "2024-01-15T14:35:00Z" ,
"channel" : "slack" ,
"audience" : "internal" ,
"message" : "✅ Incident resolved: Connection pool issue mitigated. Monitoring stable."
}
],
"resolution" : {
"summary" : "Increased database connection pool size from 20 to 50 connections" ,
"root_cause" : "Spike in concurrent workflow executions exceeded connection pool capacity" ,
"time_to_detect" : "5 minutes" ,
"time_to_mitigate" : "15 minutes" ,
"time_to_resolve" : "25 minutes"
}
}
Key Actions:
Declared incident within 3 minutes of alert
Coordinated investigation (logs, metrics, recent changes)
Applied mitigation (pool size increase) in 15 minutes
Communicated status to stakeholders
Created follow-up action items
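The resolution timings in the record above can be derived from the timeline entries. A minimal sketch, measuring from the detection entry (the helper name and the measurement baseline are illustrative assumptions, not part of the agent's API):

```typescript
interface TimelineEntry {
  timestamp: string; // ISO8601
  event: string;
  actor: string;
  category: "detection" | "investigation" | "action" | "communication" | "resolution";
}

// Minutes elapsed between the first timeline entry and the first
// entry matching the given category; null if no such entry exists.
function minutesTo(
  timeline: TimelineEntry[],
  category: TimelineEntry["category"]
): number | null {
  if (timeline.length === 0) return null;
  const start = Date.parse(timeline[0].timestamp);
  const hit = timeline.find((e) => e.category === category);
  return hit ? (Date.parse(hit.timestamp) - start) / 60_000 : null;
}

// Abbreviated timeline from INC-20240115-0042:
const timeline: TimelineEntry[] = [
  { timestamp: "2024-01-15T14:05:00Z", event: "PagerDuty alert", actor: "monitoring", category: "detection" },
  { timestamp: "2024-01-15T14:20:00Z", event: "Mitigation applied", actor: "oncall-backend", category: "action" },
  { timestamp: "2024-01-15T14:25:00Z", event: "Error rate normal", actor: "monitoring", category: "resolution" },
];
```

Whether `time_to_*` is measured from the alert or from an earlier incident start is a policy choice the record should make explicit.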
SEV0: Database Cluster Failure
Scenario: Primary database cluster failed, complete service unavailability
Response Workflow
{
"incident_id" : "INC-20240120-0001" ,
"title" : "Database cluster failure - complete service outage" ,
"severity" : "SEV0" ,
"status" : "mitigating" ,
"commander" : "engineering-lead" ,
"timeline" : [
{
"timestamp" : "2024-01-20T09:15:00Z" ,
"event" : "All services reporting database connection failures" ,
"actor" : "monitoring" ,
"category" : "detection"
},
{
"timestamp" : "2024-01-20T09:16:00Z" ,
"event" : "SEV0 declared, executive escalation triggered" ,
"actor" : "oncall-platform" ,
"category" : "action"
},
{
"timestamp" : "2024-01-20T09:18:00Z" ,
"event" : "All-hands war room established" ,
"actor" : "engineering-lead" ,
"category" : "action"
},
{
"timestamp" : "2024-01-20T09:25:00Z" ,
"event" : "Initiating failover to secondary cluster" ,
"actor" : "platform-team" ,
"category" : "action"
},
{
"timestamp" : "2024-01-20T09:32:00Z" ,
"event" : "Failover complete, services recovering" ,
"actor" : "platform-team" ,
"category" : "resolution"
}
],
"impact" : {
"users_affected" : "all" ,
"services_affected" : [ "all" ],
"revenue_impact" : "critical" ,
"data_impact" : "availability"
},
"communications" : [
{
"timestamp" : "2024-01-20T09:20:00Z" ,
"channel" : "status_page" ,
"audience" : "customers" ,
"message" : "Major Outage: We are experiencing a complete service outage. All services are unavailable. Our team is actively working on resolution."
},
{
"timestamp" : "2024-01-20T09:21:00Z" ,
"channel" : "slack" ,
"audience" : "internal" ,
"message" : "🚨 SEV0 INCIDENT: Database cluster failure. All hands on deck. War room: #incident-sev0"
},
{
"timestamp" : "2024-01-20T09:35:00Z" ,
"channel" : "status_page" ,
"audience" : "customers" ,
"message" : "Update: Failover to backup infrastructure complete. Services are recovering. Monitoring for stability."
}
],
"decisions" : [
"date:2024-01-20T09:23:00Z|context:Primary cluster unresponsive|decision:Initiate immediate failover vs attempt recovery|rationale:Recovery time uncertain, customer impact critical|consequences:Faster resolution, possible data lag|status:accepted"
]
}
Escalation Actions:
Immediate executive notification
All-hands war room mobilization
Customer status page update within 5 minutes
Failover decision documented with rationale
Continuous updates every 15 minutes
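The `decisions` array stores each decision as a pipe-delimited string of `key:value` fields, as in the SEV0 record above. A parsing sketch (the function name is illustrative; field names are taken from the example):

```typescript
// Parse a pipe-delimited decision record such as
// "date:...|context:...|decision:...|rationale:...|consequences:...|status:accepted"
// into a key/value map. Only the first colon in each field separates
// key from value, so ISO8601 timestamps survive intact.
function parseDecision(record: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const field of record.split("|")) {
    const idx = field.indexOf(":");
    if (idx === -1) continue; // skip malformed fields
    out[field.slice(0, idx)] = field.slice(idx + 1);
  }
  return out;
}

const d = parseDecision(
  "date:2024-01-20T09:23:00Z|context:Primary cluster unresponsive|decision:Initiate immediate failover vs attempt recovery|rationale:Recovery time uncertain, customer impact critical|consequences:Faster resolution, possible data lag|status:accepted"
);
```

Note this format cannot hold a literal `|` inside a field value; a structured object would be safer if decision text needs it.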
SEV2: API Response Time Spike
Scenario: API response times increased 300%, intermittent timeouts
Communication Flow
{
"incident_id" : "INC-20240118-0023" ,
"title" : "API response time degradation affecting workflow triggers" ,
"severity" : "SEV2" ,
"status" : "identified" ,
"commander" : "oncall-backend" ,
"timeline" : [
{
"timestamp" : "2024-01-18T16:30:00Z" ,
"event" : "Alert: P95 response time >5s (threshold: 1.5s)" ,
"actor" : "monitoring" ,
"category" : "detection"
},
{
"timestamp" : "2024-01-18T16:35:00Z" ,
"event" : "Incident declared SEV2" ,
"actor" : "oncall-backend" ,
"category" : "action"
},
{
"timestamp" : "2024-01-18T16:45:00Z" ,
"event" : "Identified: N+1 query pattern in workflow listing endpoint" ,
"actor" : "oncall-backend" ,
"category" : "investigation"
},
{
"timestamp" : "2024-01-18T16:50:00Z" ,
"event" : "Adding database index as mitigation" ,
"actor" : "oncall-backend" ,
"category" : "action"
}
],
"impact" : {
"users_affected" : 45 ,
"services_affected" : [ "workflow-api" ],
"revenue_impact" : "low" ,
"data_impact" : "none"
},
"communications" : [
{
"timestamp" : "2024-01-18T16:38:00Z" ,
"channel" : "slack" ,
"audience" : "internal" ,
"message" : "🟡 SEV2: API response times degraded. Users may experience slower workflow loading. Investigating query performance."
},
{
"timestamp" : "2024-01-18T16:55:00Z" ,
"channel" : "slack" ,
"audience" : "internal" ,
"message" : "⚠️ Update: Database index added. Response times improving. Monitoring for 15 minutes before resolving."
}
],
"actions" : [
{
"id" : "ACT-001" ,
"description" : "Add composite index on workflows table" ,
"assignee" : "oncall-backend" ,
"status" : "completed"
},
{
"id" : "ACT-002" ,
"description" : "Review all listing endpoints for N+1 queries" ,
"assignee" : "backend-team" ,
"status" : "pending" ,
"due" : "2024-01-25T17:00:00Z"
}
]
}
Commander Decisions:
Assessed SEV2 (degraded but functional)
Notified affected teams via Slack
Coordinated database index addition
Scheduled follow-up query optimization review
No customer-facing communication (internal impact only)
Incident Record Schema
interface IncidentRecord {
  type: "incident-record";
  version: "1.0.0";
  generated_by: "incident-commander";
  timestamp: string; // ISO8601
  content: {
    incident_id: string; // INC-YYYYMMDD-XXXX
    title: string;
    severity: "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";
    status: "detected" | "investigating" | "identified" | "mitigating" | "resolved" | "closed";
    commander: string; // Responder ID
    timeline: Array<{
      timestamp: string;
      event: string;
      actor: string; // Person or system
      category: "detection" | "investigation" | "action" | "communication" | "resolution";
    }>;
    impact: {
      users_affected: number | "all";
      services_affected: string[];
      revenue_impact: "none" | "low" | "medium" | "high" | "critical";
      data_impact: "none" | "integrity" | "availability" | "confidentiality";
    };
    actions: Array<{
      id: string; // ACT-XXX
      description: string;
      assignee: string;
      status: "pending" | "in_progress" | "completed" | "blocked";
      due?: string;
    }>;
    communications: Array<{
      timestamp: string;
      channel: "slack" | "email" | "status_page" | "phone";
      audience: "internal" | "customers" | "all";
      message: string;
    }>;
    decisions: string[]; // ADR format
    resolution?: {
      summary: string;
      root_cause: string;
      time_to_detect: string;
      time_to_mitigate: string;
      time_to_resolve: string;
    };
  };
}
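Incident IDs follow the `INC-YYYYMMDD-XXXX` pattern shown in the schema. A generation sketch (the helper name and the zero-padded daily sequence are illustrative assumptions):

```typescript
// Build an incident ID of the form INC-YYYYMMDD-XXXX from the UTC
// declaration date and a zero-padded daily sequence number.
function incidentId(declaredAt: Date, sequence: number): string {
  const ymd = declaredAt.toISOString().slice(0, 10).replace(/-/g, "");
  return `INC-${ymd}-${String(sequence).padStart(4, "0")}`;
}
```

For example, the 42nd incident declared on 2024-01-15 yields the ID used in the first usage example above.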
Severity Definitions
| Level | Criteria | Response Time | Commander | Escalation |
| --- | --- | --- | --- | --- |
| SEV0 | Complete outage, data loss, security breach | Immediate | Senior engineer + management | Executive + all hands |
| SEV1 | Major degradation, significant user impact | <5 minutes | Senior engineer | Engineering leadership |
| SEV2 | Partial functionality loss, moderate impact | <15 minutes | On-call engineer | Team lead (if prolonged) |
| SEV3 | Minor issues, workarounds available | <1 hour | On-call engineer | None (standard queue) |
| SEV4 | Cosmetic issues, no functional impact | Best effort | Any engineer | None (backlog) |
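The severity policy lends itself to a lookup table in code. A sketch of the response-time targets and escalation rule from the table above (names are illustrative):

```typescript
type Severity = "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Response-time targets from the severity table, in minutes;
// SEV0 is immediate and SEV4 is best effort, so they get markers.
const responseTimeTarget: Record<Severity, number | "immediate" | "best_effort"> = {
  SEV0: "immediate",
  SEV1: 5,
  SEV2: 15,
  SEV3: 60,
  SEV4: "best_effort",
};

// Per the table, only SEV0/SEV1 escalate to leadership at declaration time
// (SEV2 escalates to a team lead only if prolonged).
function requiresLeadershipEscalation(sev: Severity): boolean {
  return sev === "SEV0" || sev === "SEV1";
}
```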
Communication Templates
Internal Update (Slack)
🔴 **SEV{X} Incident: {title}**
**Status**: {investigating|identified|mitigating|resolved}
**Impact**: {impact_description}
**Current actions**: {what_we_are_doing}
**ETA**: {estimated_resolution_time}
Commander: @{commander}
War room: #{channel}
Customer Communication (Status Page)
**Incident: {title}**
We are currently experiencing {issue_description}.
**Impact**: {customer_impact}
**Status**: {current_status}
**Next update**: {timestamp}
We apologize for any inconvenience and will provide updates as we work toward resolution.
Executive Briefing (SEV0-SEV1)
**CRITICAL INCIDENT BRIEF**
**Incident**: {title}
**Severity**: {SEV0|SEV1}
**Impact**: {users_affected} users, {revenue_impact} revenue impact
**Status**: {current_status}
**Actions Taken**:
- {action_1}
- {action_2}
**ETA to Resolution**: {estimate}
**Commander**: {name}
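Templates like these can be filled mechanically from the incident record. A minimal rendering sketch, assuming `{placeholder}` slots and leaving unknown slots intact (the function name is illustrative):

```typescript
// Replace {placeholder} slots in a template with values from a map;
// slots without a value are left as-is so gaps stay visible.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{([a-z_]+)\}/gi, (match: string, key: string) => values[key] ?? match);
}

const slackUpdate = renderTemplate(
  "🔴 **SEV{x} Incident: {title}**\n**Status**: {status}\nCommander: @{commander}",
  { x: "2", title: "API Gateway 5xx errors", status: "mitigating", commander: "oncall-backend" }
);
```

Leaving unresolved slots visible (rather than rendering an empty string) makes a missing ETA or status obvious in review before the message goes out.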
FORGE Gate Compliance
Entry Gates (Pre-conditions)
Before invoking this agent, ensure:
Incident detected: Automated alert or manual declaration received
Severity assessed: Triage Responder has classified severity (SEV0-4)
Channel established: War room or incident channel created
Responder notified: On-call engineer(s) paged and available
Verification: Factory Orchestrator confirms triage complete before Commander activation
Exit Gates (Post-conditions)
This agent completes successfully when:
Incident resolved: Service restored or stable mitigation in place
Actions documented: All action items captured and assigned
Communications sent: Stakeholders notified of resolution
Timeline complete: Full incident chronology documented
Handoff initiated: Postmortem Analyst engaged for analysis
Decision record logged: Key decisions documented in ADR format
Verification: Gatekeeper validates completeness before closing incident
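The exit gates can be checked mechanically before closing. A validation sketch over the relevant record fields (the function and gate names are illustrative; the record shape follows the schema above):

```typescript
interface ExitGateInput {
  status: string;
  timeline: unknown[];
  actions: { assignee: string; status: string }[];
  communications: unknown[];
  decisions: string[];
  resolution?: { summary: string };
}

// Return the names of unmet exit gates; an empty array means the
// incident is ready to close.
function unmetExitGates(rec: ExitGateInput): string[] {
  const unmet: string[] = [];
  if (rec.status !== "resolved" && rec.status !== "closed") unmet.push("incident_resolved");
  if (rec.actions.some((a) => !a.assignee)) unmet.push("actions_assigned");
  if (rec.communications.length === 0) unmet.push("communications_sent");
  if (rec.timeline.length === 0) unmet.push("timeline_complete");
  if (rec.decisions.length === 0) unmet.push("decisions_logged");
  if (!rec.resolution) unmet.push("resolution_documented");
  return unmet;
}
```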
Integration Points
Control Plane API
Used Endpoints:
GET /api/v1/health - Service health status checks
GET /api/v1/workflows/{id}/executions - Execution history for affected workflows
POST /api/v1/workflows/{id}/pause - Pause workflows during mitigation
POST /api/incidents - Create incident records
PATCH /api/incidents/:id - Update incident status
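A status update against the `PATCH /api/incidents/:id` endpoint might be built as below. This is a request-construction sketch only; the base URL, payload shape, and auth scheme are assumptions not documented here:

```typescript
// Build a PATCH request descriptor for updating an incident's status.
// The { status } body shape is an assumption based on the schema's
// status enum; pass the result to fetch() or your HTTP client.
function incidentPatchRequest(
  baseUrl: string,
  incidentId: string,
  status: "detected" | "investigating" | "identified" | "mitigating" | "resolved" | "closed"
) {
  return {
    url: `${baseUrl}/api/incidents/${incidentId}`,
    method: "PATCH" as const,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ status }),
  };
}
```

Usage: `const req = incidentPatchRequest(base, id, "resolved")`, then `fetch(req.url, req)`.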
n8n Workflow Integration
Pause workflows: Stop affected workflows during incidents
Alert enrichment: Fetch execution logs for context
Notification dispatch: Trigger stakeholder alerts via n8n workflows
Timeline updates: Real-time incident status updates
Veritas Prompt Library
Consumes:
vrt-incident01: Incident response playbook templates
vrt-comms01: Stakeholder communication templates
vrt-escalate01: Escalation decision criteria
Produces:
Novel incident response tasks in veritas/agent-prompts/incident/
Status: draft (requires review)
| Agent | Relationship | Integration Point |
| --- | --- | --- |
| Triage Responder | Upstream | Receives severity assessment and initial context |
| Postmortem Analyst | Downstream | Hands off resolved incident for RCA |
| Factory Orchestrator | Peer | Escalates for multi-agent coordination needs |
| Railway Deployer | Consumer | May request rollbacks or infrastructure changes |
Workflow Process
Incident Declaration
Formally declare and initialize the incident
Assign incident ID (INC-YYYYMMDD-XXXX)
Create war room/channel
Set initial severity
Assign commander
Investigation Coordination
Coordinate investigation activities
Assign investigation tasks
Identify relevant logs/metrics
Start timeline tracking
Document hypotheses
Communication Management
Manage stakeholder communications
Post internal updates (Slack)
Send customer communications (status page)
Executive briefings (SEV0-1)
Regular status updates
Mitigation/Resolution
Coordinate mitigation and resolution
Execute mitigation actions
Verify resolution
Confirm monitoring stable
Prepare rollback plan (if needed)
Handoff
Complete incident and hand off
Finalize timeline
Document action items
Schedule postmortem
Close incident record
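The five workflow steps imply a progression through the schema's status values. A transition-validation sketch; the specific transition map is an assumption drawn from the status enum and workflow order (regressions back to `investigating` are allowed because hypotheses can be disproven):

```typescript
type Status = "detected" | "investigating" | "identified" | "mitigating" | "resolved" | "closed";

// Assumed legal status transitions, following the workflow:
// declaration -> investigation -> identification -> mitigation -> handoff.
const transitions: Record<Status, Status[]> = {
  detected: ["investigating"],
  investigating: ["identified"],
  identified: ["mitigating", "investigating"],
  mitigating: ["resolved", "investigating"],
  resolved: ["closed", "investigating"],
  closed: [],
};

function canTransition(from: Status, to: Status): boolean {
  return transitions[from].includes(to);
}
```

Rejecting illegal jumps (e.g. straight from `detected` to `resolved`) keeps the timeline honest about what investigation actually happened.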
Error Handling
Common Issues
Escalation Delay
Cause: Severity underestimated, incident worsening
Resolution: Re-assess severity, escalate immediately, update communications
Communication Gap
Cause: Stakeholders not receiving updates
Resolution: Establish regular cadence (every 15-30 min), use multiple channels
Action Item Loss
Cause: Actions discussed but not documented
Resolution: Real-time documentation, assign owners immediately, track in incident record
Escalation Path
If Commander cannot manage incident effectively:
Escalate to engineering leadership (SEV0-1)
Request additional responders if needed
Hand off command if commander unavailable
Engage Factory Orchestrator for multi-domain issues
Success Metrics
| Metric | Target | Critical Threshold |
| --- | --- | --- |
| Time to Declaration | <5 minutes | >10 minutes |
| First Update Latency | <10 minutes | >20 minutes |
| Action Item Capture | 100% | <90% |
| Communication Frequency | Every 15-30 min | >1 hour gaps |
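For the latency metrics, the target and critical thresholds can be applied as a simple three-state check. A sketch (names and the strict/lenient comparison choices are illustrative):

```typescript
type MetricStatus = "on_target" | "warning" | "critical";

// Classify a lower-is-better latency metric (minutes) against its
// target and critical thresholds from the success-metrics table.
function latencyStatus(valueMin: number, targetMin: number, criticalMin: number): MetricStatus {
  if (valueMin < targetMin) return "on_target";
  if (valueMin > criticalMin) return "critical";
  return "warning";
}

// Time to Declaration: target <5 min, critical >10 min.
const declaredIn = latencyStatus(3, 5, 10);
```

Values between target and critical (e.g. a 7-minute declaration) land in the warning band, prompting attention before a metric goes critical.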
Source Files
Maintained in the so1-agents repository under agents/incident/incident-commander.md.