
Overview

The Incident domain manages the complete incident lifecycle for the SO1 platform, from initial detection and severity assessment through resolution tracking and blameless post-incident analysis. These agents ensure rapid response, clear communication, and continuous learning from operational incidents.
Production-Critical: Incident agents operate in high-pressure scenarios with strict SLA requirements. All workflows include escalation paths, fallback strategies, and comprehensive logging for post-incident review.

Agents in This Domain

Incident Commander

Orchestrates incident response workflow and stakeholder communication

Triage Responder

Performs initial severity assessment and routing decisions

Postmortem Analyst

Conducts blameless analysis and generates improvement recommendations

Domain Responsibilities

Incident Detection & Triage

  • Severity classification: Assess impact using SEV0-SEV4 scale
  • Automated routing: Direct incidents to appropriate teams based on severity
  • Context gathering: Collect metrics, logs, and recent changes
  • Initial response: Generate action plans and notification templates

Response Orchestration

  • Communication management: Coordinate updates to stakeholders
  • Timeline tracking: Maintain detailed incident chronology
  • Escalation handling: Trigger escalation paths for critical incidents
  • Status monitoring: Track resolution progress and SLA compliance

Post-Incident Analysis

  • Root cause analysis: Identify contributing factors and system weaknesses
  • Timeline reconstruction: Build detailed incident narrative
  • Improvement recommendations: Generate actionable remediation items
  • Knowledge capture: Document learnings for future prevention

Severity Scale

The SO1 platform uses a 5-tier severity classification:
| Level | Impact | Response Time | Escalation |
|-------|--------|---------------|------------|
| SEV0 | Complete outage, data loss risk | Immediate | Executive + All hands |
| SEV1 | Major feature down, significant user impact | <15 min | Engineering leadership |
| SEV2 | Degraded performance, partial functionality | <1 hour | On-call engineer |
| SEV3 | Minor issues, workarounds available | <4 hours | Standard queue |
| SEV4 | Cosmetic issues, no user impact | Best effort | Backlog |
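
The severity scale can be treated as a lookup table that yields a response deadline and an escalation target. The sketch below is illustrative only: `SeverityPolicy`, `SEVERITY_POLICIES`, and `response_deadline` are hypothetical names, "Immediate" is modeled as a zero-length window, and "Best effort" is modeled as no deadline. Both modeling choices are assumptions, not documented SO1 behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    impact: str
    response_within: Optional[timedelta]  # None models "Best effort" (assumption)
    escalation: str


# Values transcribed from the severity table; the data structure is hypothetical.
SEVERITY_POLICIES = {
    "SEV0": SeverityPolicy("Complete outage, data loss risk",
                           timedelta(0), "Executive + All hands"),
    "SEV1": SeverityPolicy("Major feature down, significant user impact",
                           timedelta(minutes=15), "Engineering leadership"),
    "SEV2": SeverityPolicy("Degraded performance, partial functionality",
                           timedelta(hours=1), "On-call engineer"),
    "SEV3": SeverityPolicy("Minor issues, workarounds available",
                           timedelta(hours=4), "Standard queue"),
    "SEV4": SeverityPolicy("Cosmetic issues, no user impact",
                           None, "Backlog"),
}


def response_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None for best-effort tiers."""
    policy = SEVERITY_POLICIES[severity]
    if policy.response_within is None:
        return None
    return detected_at + policy.response_within


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(response_deadline("SEV1", detected))
    print(SEVERITY_POLICIES["SEV1"].escalation)
```

Keeping the table as data rather than branching logic means a change to a response window touches one entry instead of several conditionals.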

Incident Workflow

Integration with FORGE

All incident agents operate within the FORGE execution model:
Triage Responder activates here:
  • Alert data ingestion and normalization
  • Severity classification using historical patterns
  • Context enrichment (metrics, logs, deployments)
  • Initial routing decision
Incident Commander activates here:
  • Response workflow orchestration
  • Stakeholder notification
  • Timeline tracking and updates
  • Escalation management
Postmortem Analyst activates here:
  • Root cause investigation
  • Timeline reconstruction
  • Contributing factor identification
  • Improvement recommendation generation

Common Use Cases

Critical Production Outage (SEV0)

Scenario: Database cluster failure causing complete service unavailability

Workflow:
  1. Triage Responder: Classifies as SEV0 based on metrics (100% error rate)
  2. Incident Commander: Triggers immediate escalation, notifies executives, creates war room
  3. Incident Commander: Orchestrates failover, coordinates communication, tracks SLA
  4. Postmortem Analyst: After resolution, identifies root cause (connection pool exhaustion), recommends circuit breaker implementation

Performance Degradation (SEV2)

Scenario: API response times increased by 300%, with intermittent timeouts

Workflow:
  1. Triage Responder: Classifies as SEV2 (degraded but functional)
  2. Incident Commander: Notifies on-call engineer, creates incident ticket
  3. Incident Commander: Tracks mitigation (database index added), communicates status
  4. Postmortem Analyst: Documents slow query pattern, recommends query optimization review

Minor UI Bug (SEV4)

Scenario: Cosmetic alignment issue on settings page

Workflow:
  1. Triage Responder: Classifies as SEV4 (no functional impact)
  2. Incident Commander: Creates backlog ticket, no immediate escalation
  3. Postmortem Analyst: Skipped (not critical enough for formal analysis)
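
The three scenarios above can be read as per-severity playbooks. The following is a minimal sketch that encodes only the documented cases; `PLAYBOOKS`, `plan_response`, and `runs_postmortem` are hypothetical names, and severities without a documented scenario (SEV1, SEV3) deliberately raise an error rather than guess at behavior.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (agent, action)

# Steps paraphrased from the use cases above; the structure is illustrative.
PLAYBOOKS: Dict[str, List[Step]] = {
    "SEV0": [
        ("Triage Responder", "classify as SEV0"),
        ("Incident Commander", "escalate, notify executives, create war room"),
        ("Incident Commander", "orchestrate failover, coordinate communication, track SLA"),
        ("Postmortem Analyst", "identify root cause, recommend improvements"),
    ],
    "SEV2": [
        ("Triage Responder", "classify as SEV2"),
        ("Incident Commander", "notify on-call engineer, create incident ticket"),
        ("Incident Commander", "track mitigation, communicate status"),
        ("Postmortem Analyst", "document findings, recommend review"),
    ],
    # SEV4 has no Postmortem Analyst step: formal analysis is skipped.
    "SEV4": [
        ("Triage Responder", "classify as SEV4"),
        ("Incident Commander", "create backlog ticket, no immediate escalation"),
    ],
}


def plan_response(severity: str) -> List[Step]:
    """Return a copy of the documented steps for a severity."""
    if severity not in PLAYBOOKS:
        raise LookupError(f"no documented playbook for {severity}")
    return list(PLAYBOOKS[severity])


def runs_postmortem(severity: str) -> bool:
    """True when the playbook includes a Postmortem Analyst step."""
    return any(agent == "Postmortem Analyst" for agent, _ in plan_response(severity))


if __name__ == "__main__":
    for agent, action in plan_response("SEV2"):
        print(f"{agent}: {action}")
```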

Control Plane Integration

Incident agents interact with the SO1 Control Plane API for:

Data Retrieval

  • GET /api/incidents: Fetch historical incident data for pattern matching
  • GET /api/services: Retrieve service topology and ownership information
  • GET /api/metrics: Query time-series metrics for severity assessment

Incident Management

  • POST /api/incidents: Create new incident records
  • PATCH /api/incidents/:id: Update incident status and timeline
  • POST /api/incidents/:id/notifications: Trigger stakeholder notifications

Post-Incident

  • POST /api/postmortems: Create postmortem documents
  • POST /api/action-items: Generate follow-up tasks from postmortem analysis
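
As a rough illustration of the incident management endpoints, the sketch below builds (but does not send) requests for `POST /api/incidents` and `PATCH /api/incidents/:id` using only the Python standard library. The base URL, the JSON field names (`title`, `severity`, `status`), and the helper names are all assumptions for illustration; the actual request schema is not specified here.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Hypothetical host; substitute the real Control Plane address.
BASE_URL = "https://control-plane.example.invalid"


def _json_request(method: str, path: str, payload: dict) -> Request:
    """Build an unsent JSON request against the (assumed) Control Plane base URL."""
    return Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def build_create_incident_request(title: str, severity: str) -> Request:
    # Field names are assumed; the documented part is the method and path.
    return _json_request("POST", "/api/incidents",
                         {"title": title, "severity": severity})


def build_update_incident_request(incident_id: str, status: str) -> Request:
    # The incident id is percent-encoded so it is safe to place in the path.
    path = f"/api/incidents/{quote(incident_id, safe='')}"
    return _json_request("PATCH", path, {"status": status})


if __name__ == "__main__":
    req = build_create_incident_request("Database cluster failure", "SEV0")
    print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the sketch testable without network access; a caller would pass the result to `urllib.request.urlopen` or an equivalent client.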

n8n Workflow Integration

Incident workflows are often automated via n8n:
  • Alert ingestion: Webhook receives alerts from monitoring systems
  • Triage automation: Calls Triage Responder for severity classification
  • Notification dispatch: Sends alerts via Slack, PagerDuty, or email, depending on severity
  • Timeline tracking: Updates incident status in real-time
  • Postmortem scheduling: Automatically schedules postmortem review 24h after resolution
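
The postmortem scheduling rule (review 24 hours after resolution) amounts to a single date calculation. A minimal sketch, assuming timestamps are timezone-aware; `POSTMORTEM_DELAY` and `postmortem_review_time` are hypothetical names, and in practice this logic would live in an n8n node rather than standalone Python.

```python
from datetime import datetime, timedelta, timezone

POSTMORTEM_DELAY = timedelta(hours=24)  # from the scheduling rule above


def postmortem_review_time(resolved_at: datetime) -> datetime:
    """Return when the postmortem review should be scheduled."""
    if resolved_at.tzinfo is None:
        # Naive timestamps are rejected to avoid ambiguous scheduling (assumption).
        raise ValueError("resolved_at must be timezone-aware")
    return resolved_at + POSTMORTEM_DELAY


if __name__ == "__main__":
    resolved = datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc)
    print(postmortem_review_time(resolved).isoformat())
```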

Monitoring & Observability

Agent Performance Metrics

| Metric | Target | Critical Threshold |
|--------|--------|--------------------|
| Triage Latency | <30 seconds | >60 seconds |
| Commander Response Time | <5 minutes | >15 minutes |
| Severity Accuracy | >95% | <90% |
| Postmortem Completion | <48 hours | >72 hours |
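
One way to act on these targets is to classify each observed value as ok, warning, or critical. The sketch below is an assumption-laden illustration: the metric keys, the `Threshold` type, and the "warning" band for values between target and critical threshold are hypothetical and not part of the documented table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    target: float
    critical: float
    higher_is_better: bool = False


# Numbers come from the metrics table; key names and units are illustrative.
THRESHOLDS = {
    "triage_latency_seconds": Threshold(30, 60),
    "commander_response_minutes": Threshold(5, 15),
    "severity_accuracy_percent": Threshold(95, 90, higher_is_better=True),
    "postmortem_completion_hours": Threshold(48, 72),
}


def evaluate(metric: str, value: float) -> str:
    """Classify a value as 'ok', 'warning', or 'critical'."""
    t = THRESHOLDS[metric]
    if t.higher_is_better:
        if value > t.target:
            return "ok"
        if value < t.critical:
            return "critical"
    else:
        if value < t.target:
            return "ok"
        if value > t.critical:
            return "critical"
    # Between target and critical threshold (inclusive): assumed warning band.
    return "warning"


if __name__ == "__main__":
    print(evaluate("triage_latency_seconds", 45))
    print(evaluate("severity_accuracy_percent", 89))
```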

Escalation Triggers

Agents automatically escalate to Factory Orchestrator when:
  • Severity classification confidence <70%
  • Multiple concurrent SEV0 or SEV1 incidents
  • Postmortem analysis blocked (missing data)
  • SLA violation detected
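
The four triggers can be combined into a single check that reports why an escalation is needed. This is a sketch under stated assumptions: `escalation_reasons` and its parameters are hypothetical, confidence is expressed as a fraction between 0 and 1, and "multiple" is interpreted as two or more active SEV0 or SEV1 incidents.

```python
from typing import List, Sequence

CONFIDENCE_FLOOR = 0.70  # from the "<70%" trigger above


def escalation_reasons(
    confidence: float,
    active_severities: Sequence[str],
    postmortem_blocked: bool,
    sla_violated: bool,
) -> List[str]:
    """Return the triggers that apply; an empty list means no escalation."""
    reasons: List[str] = []
    if confidence < CONFIDENCE_FLOOR:
        reasons.append("severity classification confidence below 70%")
    # "Multiple" is read as two or more concurrent SEV0/SEV1 incidents (assumption).
    critical = [s for s in active_severities if s in ("SEV0", "SEV1")]
    if len(critical) > 1:
        reasons.append("multiple concurrent SEV0 or SEV1 incidents")
    if postmortem_blocked:
        reasons.append("postmortem analysis blocked (missing data)")
    if sla_violated:
        reasons.append("SLA violation detected")
    return reasons


if __name__ == "__main__":
    for reason in escalation_reasons(0.65, ["SEV1", "SEV0", "SEV3"], False, True):
        print(reason)
```

Returning the list of reasons, rather than a bare boolean, leaves a record that can be logged for post-incident review.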

Source Repository

All Incident domain agents are maintained in the so1-agents repository under agents/incident/.