
Quick Reference

| Property | Value |
| --- | --- |
| Domain | Incident |
| FORGE Stage | Cross-cutting (operates during active incidents) |
| Version | 1.0.0 |
| Primary Output | Incident records, response coordination, communications |
Use this agent when you need to:
  • Coordinate incident response activities across teams
  • Manage stakeholder communications during outages
  • Track incident timeline from detection to resolution
  • Make escalation decisions based on severity and impact

Core Capabilities

Incident Orchestration

Coordinates response activities across teams and systems with clear accountability

Communication Management

Ensures stakeholders receive timely, accurate updates via appropriate channels

Escalation Handling

Makes escalation decisions based on severity, impact, and SLA requirements

Resolution Tracking

Tracks progress toward resolution and documents all actions taken

When to Use

Ideal Use Cases

SEV0-SEV2 incidents requiring coordinated response
Incidents affecting multiple services or teams
Customer-facing outages needing communication management
Complex incidents with unclear root cause requiring investigation coordination
Incidents approaching SLA thresholds requiring escalation

Usage Examples

SEV2: API Gateway 5xx Errors

Scenario: Database connection pool exhaustion causing API failures

Incident Record:
{
  "incident_id": "INC-20240115-0042",
  "title": "API Gateway 5xx errors spike affecting workflow executions",
  "severity": "SEV2",
  "status": "resolved",
  "commander": "oncall-backend",
  "timeline": [
    {
      "timestamp": "2024-01-15T14:05:00Z",
      "event": "PagerDuty alert: API error rate >5%",
      "actor": "monitoring",
      "category": "detection"
    },
    {
      "timestamp": "2024-01-15T14:08:00Z",
      "event": "Incident declared, war room opened",
      "actor": "oncall-backend",
      "category": "action"
    },
    {
      "timestamp": "2024-01-15T14:15:00Z",
      "event": "Identified: Database connection pool exhaustion",
      "actor": "oncall-backend",
      "category": "investigation"
    },
    {
      "timestamp": "2024-01-15T14:20:00Z",
      "event": "Mitigation: Increased connection pool size via Railway",
      "actor": "oncall-backend",
      "category": "action"
    },
    {
      "timestamp": "2024-01-15T14:25:00Z",
      "event": "Error rate returning to normal",
      "actor": "monitoring",
      "category": "resolution"
    }
  ],
  "impact": {
    "users_affected": 150,
    "services_affected": ["so1-control-plane-api", "workflow-executions"],
    "revenue_impact": "low",
    "data_impact": "availability"
  },
  "actions": [
    {
      "id": "ACT-001",
      "description": "Increase default connection pool size in Railway config",
      "assignee": "platform-team",
      "status": "completed"
    },
    {
      "id": "ACT-002",
      "description": "Add connection pool exhaustion alert",
      "assignee": "oncall-backend",
      "status": "pending",
      "due": "2024-01-16T17:00:00Z"
    }
  ],
  "communications": [
    {
      "timestamp": "2024-01-15T14:12:00Z",
      "channel": "slack",
      "audience": "internal",
      "message": "🔴 SEV2 Incident declared: API errors affecting workflow executions. Investigating."
    },
    {
      "timestamp": "2024-01-15T14:35:00Z",
      "channel": "slack",
      "audience": "internal",
      "message": "✅ Incident resolved: Connection pool issue mitigated. Monitoring stable."
    }
  ],
  "resolution": {
    "summary": "Increased database connection pool size from 20 to 50 connections",
    "root_cause": "Spike in concurrent workflow executions exceeded connection pool capacity",
    "time_to_detect": "5 minutes",
    "time_to_mitigate": "15 minutes",
    "time_to_resolve": "25 minutes"
  }
}
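The `time_to_*` fields in the resolution block can be derived from the timeline rather than entered by hand. The sketch below assumes an `onset` timestamp for when the fault actually began (the record above implies 14:00Z), and uses the heuristic that the mitigation step is the last `action` event before the first `resolution` event; the helper name `resolutionTimes` is illustrative, not part of the agent.

```typescript
type Category = "detection" | "investigation" | "action" | "communication" | "resolution";

interface TimelineEvent {
  timestamp: string; // ISO8601, UTC
  event: string;
  actor: string;
  category: Category;
}

// Whole-minute difference between two ISO8601 timestamps.
const minutes = (from: string, to: string): string =>
  `${Math.round((Date.parse(to) - Date.parse(from)) / 60_000)} minutes`;

function resolutionTimes(onset: string, timeline: TimelineEvent[]) {
  const detected = timeline.find((e) => e.category === "detection")!.timestamp;
  const resolved = timeline.find((e) => e.category === "resolution")!.timestamp;
  // Heuristic: the mitigation step is the last action taken before resolution.
  const mitigated = timeline
    .filter((e) => e.category === "action" && e.timestamp < resolved)
    .at(-1)!.timestamp;
  return {
    time_to_detect: minutes(onset, detected),
    time_to_mitigate: minutes(detected, mitigated),
    time_to_resolve: minutes(onset, resolved),
  };
}
```

Run against the timeline above with a 14:00Z onset, this reproduces the 5/15/25-minute figures in the resolution block.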
Key Actions:
  • Declared incident within 3 minutes of alert
  • Coordinated investigation (logs, metrics, recent changes)
  • Applied mitigation (pool size increase) in 15 minutes
  • Communicated status to stakeholders
  • Created follow-up action items

Output Format

Incident Record Schema

interface IncidentRecord {
  type: "incident-record";
  version: "1.0.0";
  generated_by: "incident-commander";
  timestamp: string; // ISO8601
  
  content: {
    incident_id: string; // INC-YYYYMMDD-XXXX
    title: string;
    severity: "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";
    status: "detected" | "investigating" | "identified" | "mitigating" | "resolved" | "closed";
    commander: string; // Responder ID
    
    timeline: Array<{
      timestamp: string;
      event: string;
      actor: string; // Person or system
      category: "detection" | "investigation" | "action" | "communication" | "resolution";
    }>;
    
    impact: {
      users_affected: number | "all";
      services_affected: string[];
      revenue_impact: "none" | "low" | "medium" | "high" | "critical";
      data_impact: "none" | "integrity" | "availability" | "confidentiality";
    };
    
    actions: Array<{
      id: string; // ACT-XXX
      description: string;
      assignee: string;
      status: "pending" | "in_progress" | "completed" | "blocked";
      due?: string;
    }>;
    
    communications: Array<{
      timestamp: string;
      channel: "slack" | "email" | "status_page" | "phone";
      audience: "internal" | "customers" | "all";
      message: string;
    }>;
    
    decisions: string[]; // ADR format
    
    resolution?: {
      summary: string;
      root_cause: string;
      time_to_detect: string;
      time_to_mitigate: string;
      time_to_resolve: string;
    };
  };
}
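A minimal sketch of generating and validating the `INC-YYYYMMDD-XXXX` identifier the schema requires. The `sequence` argument would come from a per-day counter in the incident store; that source is an assumption, since the doc specifies only the format.

```typescript
// Matches the incident_id format in the schema: INC-YYYYMMDD-XXXX.
const INCIDENT_ID_PATTERN = /^INC-\d{8}-\d{4}$/;

function incidentId(date: Date, sequence: number): string {
  // e.g. 2024-01-15 -> "20240115"
  const ymd = date.toISOString().slice(0, 10).replace(/-/g, "");
  return `INC-${ymd}-${String(sequence).padStart(4, "0")}`;
}
```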

Severity Definitions

| Level | Criteria | Response Time | Commander | Escalation |
| --- | --- | --- | --- | --- |
| SEV0 | Complete outage, data loss, security breach | Immediate | Senior engineer + management | Executive + all hands |
| SEV1 | Major degradation, significant user impact | <5 minutes | Senior engineer | Engineering leadership |
| SEV2 | Partial functionality loss, moderate impact | <15 minutes | On-call engineer | Team lead (if prolonged) |
| SEV3 | Minor issues, workarounds available | <1 hour | On-call engineer | None (standard queue) |
| SEV4 | Cosmetic issues, no functional impact | Best effort | Any engineer | None (backlog) |
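The response-time column above can be encoded directly as a lookup, which is how an escalation check might be automated. A sketch, with SLAs in minutes (`null` meaning best effort, and SEV0 treated as an immediate, zero-minute SLA); the function name is illustrative.

```typescript
type Severity = "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Response-time SLAs from the severity table, in minutes.
const RESPONSE_SLA_MIN: Record<Severity, number | null> = {
  SEV0: 0,    // immediate
  SEV1: 5,
  SEV2: 15,
  SEV3: 60,
  SEV4: null, // best effort, no SLA
};

// True when an unacknowledged incident has exceeded its response-time SLA.
function slaBreached(severity: Severity, minutesSinceAlert: number): boolean {
  const sla = RESPONSE_SLA_MIN[severity];
  return sla !== null && minutesSinceAlert > sla;
}
```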

Communication Templates

Internal Update (Slack)

🔴 **SEV{X} Incident: {title}**

**Status**: {investigating|identified|mitigating|resolved}
**Impact**: {impact_description}
**Current actions**: {what_we_are_doing}
**ETA**: {estimated_resolution_time}

Commander: @{commander}
War room: #{channel}

Customer Communication (Status Page)

**Incident: {title}**

We are currently experiencing {issue_description}.

**Impact**: {customer_impact}
**Status**: {current_status}
**Next update**: {timestamp}

We apologize for any inconvenience and will provide updates as we work toward resolution.

Executive Briefing (SEV0-SEV1)

**CRITICAL INCIDENT BRIEF**

**Incident**: {title}
**Severity**: {SEV0|SEV1}
**Impact**: {users_affected} users, {revenue_impact} revenue impact
**Status**: {current_status}

**Actions Taken**:
- {action_1}
- {action_2}

**ETA to Resolution**: {estimate}
**Commander**: {name}
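The templates above use single-brace `{placeholder}` slots. A minimal interpolation sketch: known keys are substituted, and unknown placeholders (including alternation slots like `{investigating|identified|mitigating|resolved}`) are left intact for the commander to fill by hand. The helper name is illustrative.

```typescript
// Fill {placeholder} slots from a value map; leave unmatched slots as-is.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{([^{}]+)\}/g, (match, key) => values[key] ?? match);
}
```

For example, `fillTemplate("Commander: @{commander}", { commander: "alice" })` yields `Commander: @alice`.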

FORGE Gate Compliance

Before invoking this agent, ensure:
  • Incident detected: Automated alert or manual declaration received
  • Severity assessed: Triage Responder has classified severity (SEV0-4)
  • Channel established: War room or incident channel created
  • Responder notified: On-call engineer(s) paged and available
Verification: Factory Orchestrator confirms triage complete before Commander activation
This agent completes successfully when:
  • Incident resolved: Service restored or stable mitigation in place
  • Actions documented: All action items captured and assigned
  • Communications sent: Stakeholders notified of resolution
  • Timeline complete: Full incident chronology documented
  • Handoff initiated: Postmortem Analyst engaged for analysis
  • Decision record logged: Key decisions documented in ADR format
Verification: Gatekeeper validates completeness before closing incident
All critical incident decisions are logged as:
date:2024-01-20T09:23:00Z|context:Primary cluster unresponsive for 8 minutes|decision:Initiate immediate failover to secondary cluster|rationale:Recovery time uncertain, customer impact critical, RTO exceeded|consequences:Faster resolution, possible data lag up to 5 minutes|status:accepted
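The decision-log line above is pipe-delimited `key:value` fields. A parsing sketch, assuming keys never contain `|` or `:` and values never contain `|` (values such as the ISO date may contain colons, so only the first colon in each field is treated as the separator).

```typescript
// Parse "key:value|key:value|..." into a plain record.
function parseDecision(line: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const field of line.split("|")) {
    const sep = field.indexOf(":");
    out[field.slice(0, sep)] = field.slice(sep + 1);
  }
  return out;
}
```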

Integration Points

Control Plane API

Used Endpoints:
  • GET /api/v1/health - Service health status checks
  • GET /api/v1/workflows/{id}/executions - Execution history for affected workflows
  • POST /api/v1/workflows/{id}/pause - Pause workflows during mitigation
  • POST /api/incidents - Create incident records
  • PATCH /api/incidents/:id - Update incident status
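A hypothetical sketch of calling `POST /api/incidents` from the list above. The endpoint path comes from the doc; the base URL, bearer-token auth, and response shape are assumptions. The `fetchImpl` parameter exists only so the call can be exercised without a live API.

```typescript
// Create an incident record via the Control Plane API (sketch; auth scheme assumed).
async function createIncident(
  baseUrl: string,
  token: string,
  record: object,
  fetchImpl: typeof fetch = fetch,
): Promise<unknown> {
  const res = await fetchImpl(`${baseUrl}/api/incidents`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(record),
  });
  if (!res.ok) throw new Error(`POST /api/incidents failed: ${res.status}`);
  return res.json();
}
```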

n8n Workflow Integration

  • Pause workflows: Stop affected workflows during incidents
  • Alert enrichment: Fetch execution logs for context
  • Notification dispatch: Trigger stakeholder alerts via n8n workflows
  • Timeline updates: Real-time incident status updates

Veritas Prompt Library

Consumes:
  • vrt-incident01: Incident response playbook templates
  • vrt-comms01: Stakeholder communication templates
  • vrt-escalate01: Escalation decision criteria
Produces:
  • Novel incident response tasks in veritas/agent-prompts/incident/
  • Status: draft (requires review)
| Agent | Relationship | Integration Point |
| --- | --- | --- |
| Triage Responder | Upstream | Receives severity assessment and initial context |
| Postmortem Analyst | Downstream | Hands off resolved incident for RCA |
| Factory Orchestrator | Peer | Escalates for multi-agent coordination needs |
| Railway Deployer | Consumer | May request rollbacks or infrastructure changes |

Workflow Process

1. Incident Declaration: formally declare and initialize the incident
  • Assign incident ID (INC-YYYYMMDD-XXXX)
  • Create war room/channel
  • Set initial severity
  • Assign commander

2. Investigation Coordination: coordinate investigation activities
  • Assign investigation tasks
  • Identify relevant logs/metrics
  • Start timeline tracking
  • Document hypotheses

3. Communication Management: manage stakeholder communications
  • Post internal updates (Slack)
  • Send customer communications (status page)
  • Deliver executive briefings (SEV0-1)
  • Provide regular status updates

4. Mitigation/Resolution: coordinate mitigation and resolution
  • Execute mitigation actions
  • Verify resolution
  • Confirm monitoring stable
  • Prepare rollback plan (if needed)

5. Handoff: complete the incident and hand off
  • Finalize timeline
  • Document action items
  • Schedule postmortem
  • Close incident record
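The five phases above track the `status` values in the schema. A sketch of the implied status state machine; the transition map is an assumption, since the source lists the statuses but not which moves between them are legal. Notably, mitigation may fail and reopen investigation.

```typescript
type Status = "detected" | "investigating" | "identified" | "mitigating" | "resolved" | "closed";

// Assumed legal transitions between the schema's status values.
const NEXT: Record<Status, Status[]> = {
  detected: ["investigating"],
  investigating: ["identified"],
  identified: ["mitigating"],
  mitigating: ["resolved", "investigating"], // a failed mitigation reopens investigation
  resolved: ["closed"],
  closed: [],
};

function canTransition(from: Status, to: Status): boolean {
  return NEXT[from].includes(to);
}
```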

Error Handling

Common Issues

Escalation Delay
  • Cause: Severity underestimated, incident worsening
  • Resolution: Re-assess severity, escalate immediately, update communications

Communication Gap
  • Cause: Stakeholders not receiving updates
  • Resolution: Establish regular cadence (every 15-30 min), use multiple channels

Action Item Loss
  • Cause: Actions discussed but not documented
  • Resolution: Real-time documentation, assign owners immediately, track in incident record

Escalation Path

If Commander cannot manage incident effectively:
  1. Escalate to engineering leadership (SEV0-1)
  2. Request additional responders if needed
  3. Hand off command if commander unavailable
  4. Engage Factory Orchestrator for multi-domain issues

Success Metrics

| Metric | Target | Critical Threshold |
| --- | --- | --- |
| Time to Declaration | <5 minutes | >10 minutes |
| First Update Latency | <10 minutes | >20 minutes |
| Action Item Capture | 100% | <90% |
| Communication Frequency | Every 15-30 min | >1 hour gaps |
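For the latency metrics above, a measured value can be graded against the target and critical thresholds. A sketch (names illustrative); it covers "lower is better" metrics in minutes, while a percentage metric such as Action Item Capture would need the comparisons inverted.

```typescript
// Grade a "lower is better" metric: below target = ok, above the
// critical threshold = crit, anything between = warn.
function grade(valueMin: number, targetMin: number, criticalMin: number): "ok" | "warn" | "crit" {
  if (valueMin < targetMin) return "ok";
  if (valueMin > criticalMin) return "crit";
  return "warn";
}
```

For example, a 7-minute time to declaration misses the <5-minute target but has not yet crossed the 10-minute critical threshold.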

Source Files


Maintained in so1-agents repository under agents/incident/incident-commander.md