Overview

The Incident domain manages the complete incident lifecycle for the SO1 platform, from initial detection and severity assessment through resolution tracking and blameless post-incident analysis. These agents ensure rapid response, clear communication, and continuous learning from operational incidents.
Production-Critical: Incident agents operate in high-pressure scenarios with strict SLA requirements. All workflows include escalation paths, fallback strategies, and comprehensive logging for post-incident review.

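Agents in This Domain

  • Triage Responder: Ingests alerts, classifies severity, enriches context, and makes the initial routing decision
  • Incident Commander: Orchestrates the response, notifies stakeholders, tracks the timeline, and manages escalation
  • Postmortem Analyst: Investigates root cause, reconstructs the timeline, and generates improvement recommendations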

Domain Responsibilities

Incident Detection & Triage

  • Severity classification: Assess impact using SEV0-SEV4 scale
  • Automated routing: Direct incidents to appropriate teams based on severity
  • Context gathering: Collect metrics, logs, and recent changes
  • Initial response: Generate action plans and notification templates

Response Orchestration

  • Communication management: Coordinate updates to stakeholders
  • Timeline tracking: Maintain detailed incident chronology
  • Escalation handling: Trigger escalation paths for critical incidents
  • Status monitoring: Track resolution progress and SLA compliance
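
The timeline tracking responsibility can be pictured as an append-only log of timestamped entries. The sketch below is illustrative only: `TimelineEntry`, `IncidentTimeline`, and their fields are hypothetical names, not part of the SO1 codebase.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class TimelineEntry:
    """One immutable event in the incident chronology (hypothetical shape)."""
    at: datetime
    actor: str
    note: str


@dataclass
class IncidentTimeline:
    """Append-only list of entries; entries are never edited after recording."""
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, actor: str, note: str, at: Optional[datetime] = None) -> TimelineEntry:
        # Default to the current UTC time when no timestamp is supplied.
        if at is None:
            at = datetime.now(timezone.utc)
        entry = TimelineEntry(at=at, actor=actor, note=note)
        self.entries.append(entry)
        return entry

    def chronology(self) -> List[TimelineEntry]:
        # Entries may be recorded out of order, so sort by timestamp on read.
        return sorted(self.entries, key=lambda e: e.at)
```

Sorting on read rather than on write keeps recording cheap during an incident and still gives the postmortem a correctly ordered narrative.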

Post-Incident Analysis

  • Root cause analysis: Identify contributing factors and system weaknesses
  • Timeline reconstruction: Build detailed incident narrative
  • Improvement recommendations: Generate actionable remediation items
  • Knowledge capture: Document learnings for future prevention

Severity Scale

The SO1 platform uses a 5-tier severity classification:
| Level | Impact | Response Time | Escalation |
|-------|--------|---------------|------------|
| SEV0 | Complete outage, data loss risk | Immediate | Executive + All hands |
| SEV1 | Major feature down, significant user impact | <15 min | Engineering leadership |
| SEV2 | Degraded performance, partial functionality | <1 hour | On-call engineer |
| SEV3 | Minor issues, workarounds available | <4 hours | Standard queue |
| SEV4 | Cosmetic issues, no user impact | Best effort | Backlog |
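
As a rough illustration, the severity scale can be encoded as a lookup table that yields a response deadline. The impact and escalation strings come from the table above; the type names, the `response_deadline` helper, and the choice to model "Immediate" as a zero-length window and "Best effort" as no deadline are assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Lower number means higher severity, matching SEV0-SEV4."""
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityPolicy:
    impact: str
    response_within: Optional[timedelta]  # None = best effort, no deadline (assumption)
    escalation: str


POLICIES = {
    Severity.SEV0: SeverityPolicy("Complete outage, data loss risk", timedelta(0), "Executive + All hands"),
    Severity.SEV1: SeverityPolicy("Major feature down, significant user impact", timedelta(minutes=15), "Engineering leadership"),
    Severity.SEV2: SeverityPolicy("Degraded performance, partial functionality", timedelta(hours=1), "On-call engineer"),
    Severity.SEV3: SeverityPolicy("Minor issues, workarounds available", timedelta(hours=4), "Standard queue"),
    Severity.SEV4: SeverityPolicy("Cosmetic issues, no user impact", None, "Backlog"),
}


def response_deadline(severity: Severity, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None for best effort."""
    window = POLICIES[severity].response_within
    return None if window is None else detected_at + window
```

Using an `IntEnum` lets comparisons such as `severity <= Severity.SEV1` read naturally as "SEV1 or worse".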

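Incident Workflow

An incident passes through three stages in order: triage (Triage Responder), response and resolution (Incident Commander), and post-incident analysis (Postmortem Analyst).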

Integration with FORGE

All incident agents operate within the FORGE execution model:
Triage Responder activates here:
  • Alert data ingestion and normalization
  • Severity classification using historical patterns
  • Context enrichment (metrics, logs, deployments)
  • Initial routing decision
Incident Commander activates here:
  • Response workflow orchestration
  • Stakeholder notification
  • Timeline tracking and updates
  • Escalation management
Postmortem Analyst activates here:
  • Root cause investigation
  • Timeline reconstruction
  • Contributing factor identification
  • Improvement recommendation generation

Common Use Cases

Critical Production Outage (SEV0)

Scenario: Database cluster failure causing complete service unavailability
Workflow:
  1. Triage Responder: Classifies as SEV0 based on metrics (100% error rate)
  2. Incident Commander: Triggers immediate escalation, notifies executives, creates war room
  3. Incident Commander: Orchestrates failover, coordinates communication, tracks SLA
  4. Postmortem Analyst: After resolution, identifies root cause (connection pool exhaustion), recommends circuit breaker implementation

Performance Degradation (SEV2)

Scenario: API response times increased by 300%, intermittent timeouts
Workflow:
  1. Triage Responder: Classifies as SEV2 (degraded but functional)
  2. Incident Commander: Notifies on-call engineer, creates incident ticket
  3. Incident Commander: Tracks mitigation (database index added), communicates status
  4. Postmortem Analyst: Documents slow query pattern, recommends query optimization review

Minor UI Bug (SEV4)

Scenario: Cosmetic alignment issue on settings page
Workflow:
  1. Triage Responder: Classifies as SEV4 (no functional impact)
  2. Incident Commander: Creates backlog ticket, no immediate escalation
  3. Postmortem Analyst: Skipped (not critical enough for formal analysis)
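
The three scenarios differ mainly in whether the Postmortem Analyst runs. A minimal sketch of that decision follows. The examples above only establish that SEV0 and SEV2 incidents receive a postmortem and SEV4 incidents do not; the treatment of SEV3 is not stated, so the `postmortem_cutoff` parameter and its default are assumptions, as is the `agents_for` function name.

```python
from typing import List


def agents_for(severity: int, postmortem_cutoff: int = 2) -> List[str]:
    """List the agents involved for a severity level (0 = SEV0 ... 4 = SEV4).

    postmortem_cutoff is the least severe level that still gets a formal
    postmortem. The default of 2 is an assumption, not a documented rule.
    """
    if not 0 <= severity <= 4:
        raise ValueError(f"severity must be between 0 and 4, got {severity}")

    # Triage and command happen for every incident in the examples above.
    agents = ["Triage Responder", "Incident Commander"]
    if severity <= postmortem_cutoff:
        agents.append("Postmortem Analyst")
    return agents
```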

Control Plane Integration

Incident agents interact with the SO1 Control Plane API for:

Data Retrieval

  • GET /api/incidents: Fetch historical incident data for pattern matching
  • GET /api/services: Retrieve service topology and ownership information
  • GET /api/metrics: Query time-series metrics for severity assessment

Incident Management

  • POST /api/incidents: Create new incident records
  • PATCH /api/incidents/:id: Update incident status and timeline
  • POST /api/incidents/:id/notifications: Trigger stakeholder notifications

Post-Incident

  • POST /api/postmortems: Create postmortem documents
  • POST /api/action-items: Generate follow-up tasks from postmortem analysis
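
To make the endpoint list concrete, the sketch below builds (but does not send) requests for three of the incident management endpoints using Python's standard library. The paths and HTTP methods come from the lists above. The host name, the JSON field names, and the helper function names are hypothetical, and authentication is omitted because the source does not describe it.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Hypothetical host; substitute the real Control Plane base URL.
BASE_URL = "https://control-plane.example.invalid"


def _json_request(method: str, path: str, payload: dict) -> Request:
    """Build an HTTP request with a JSON body. Nothing is sent here."""
    return Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def create_incident(title: str, severity: str) -> Request:
    # Field names "title" and "severity" are assumptions for illustration.
    return _json_request("POST", "/api/incidents", {"title": title, "severity": severity})


def update_incident(incident_id: str, status: str) -> Request:
    # quote() keeps an unexpected character in the id from altering the path.
    path = f"/api/incidents/{quote(incident_id, safe='')}"
    return _json_request("PATCH", path, {"status": status})


def notify_stakeholders(incident_id: str, message: str) -> Request:
    path = f"/api/incidents/{quote(incident_id, safe='')}/notifications"
    return _json_request("POST", path, {"message": message})
```

Each returned `Request` can be passed to `urllib.request.urlopen` once the real base URL and any required credentials are in place.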

n8n Workflow Integration

Incident workflows are often automated via n8n:
  • Alert ingestion: Webhook receives alerts from monitoring systems
  • Triage automation: Calls Triage Responder for severity classification
  • Notification dispatch: Sends alerts via Slack, PagerDuty, or email, depending on severity
  • Timeline tracking: Updates incident status in real-time
  • Postmortem scheduling: Automatically schedules postmortem review 24h after resolution
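
Two of the steps above reduce to simple rules, sketched here in Python for illustration (this is not n8n workflow code). The 24-hour delay comes from the list above. The severity-to-channel mapping is entirely hypothetical, since the source only says that channels depend on severity.

```python
from datetime import datetime, timedelta
from typing import List

POSTMORTEM_DELAY = timedelta(hours=24)  # stated in the list above

# Hypothetical mapping; the actual channel policy is not documented here.
CHANNELS_BY_SEVERITY = {
    "SEV0": ["pagerduty", "slack", "email"],
    "SEV1": ["pagerduty", "slack"],
    "SEV2": ["slack"],
    "SEV3": ["email"],
    "SEV4": [],
}


def channels_for(severity: str) -> List[str]:
    """Return the notification channels for a severity label such as 'SEV1'."""
    # Return a copy so callers cannot mutate the shared mapping.
    return list(CHANNELS_BY_SEVERITY[severity])


def postmortem_review_at(resolved_at: datetime) -> datetime:
    """Schedule the postmortem review 24 hours after resolution."""
    return resolved_at + POSTMORTEM_DELAY
```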

Monitoring & Observability

Agent Performance Metrics

| Metric | Target | Critical Threshold |
|--------|--------|--------------------|
| Triage Latency | <30 seconds | >60 seconds |
| Commander Response Time | <5 minutes | >15 minutes |
| Severity Accuracy | >95% | <90% |
| Postmortem Completion | <48 hours | >72 hours |
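
One way to evaluate a measurement against these thresholds is sketched below. The numeric limits come from the table above. The "warning" label for values that fall between target and critical threshold, the dictionary keys, and the `MetricThreshold` type are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricThreshold:
    target: float
    critical: float
    higher_is_better: bool = False  # True for accuracy-style metrics

    def status(self, value: float) -> str:
        """Classify a measurement as 'ok', 'warning', or 'critical'."""
        if self.higher_is_better:
            if value > self.target:
                return "ok"
            if value < self.critical:
                return "critical"
        else:
            if value < self.target:
                return "ok"
            if value > self.critical:
                return "critical"
        # Between target and critical threshold (label is an assumption).
        return "warning"


# Units are encoded in the key names: seconds, minutes, percent, hours.
THRESHOLDS = {
    "triage_latency_s": MetricThreshold(target=30, critical=60),
    "commander_response_min": MetricThreshold(target=5, critical=15),
    "severity_accuracy_pct": MetricThreshold(target=95, critical=90, higher_is_better=True),
    "postmortem_completion_h": MetricThreshold(target=48, critical=72),
}
```

For example, a triage latency of 45 seconds misses the target but has not crossed the critical threshold, so `THRESHOLDS["triage_latency_s"].status(45)` falls into the middle band.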

Escalation Triggers

Agents automatically escalate to Factory Orchestrator when:
  • Severity classification confidence <70%
  • Multiple concurrent SEV0 or SEV1 incidents
  • Postmortem analysis blocked (missing data)
  • SLA violation detected
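
The four triggers above can be expressed as a single check that returns the reasons for escalating. This is a sketch under stated assumptions: `EscalationState` and its field names are hypothetical, confidence is assumed to be a fraction between 0 and 1, and "multiple" is read as two or more.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationState:
    """Hypothetical snapshot of what an agent knows when deciding to escalate."""
    classification_confidence: float  # 0.0 to 1.0
    open_incident_severities: List[int]  # 0 = SEV0 ... 4 = SEV4
    postmortem_blocked: bool
    sla_violated: bool


def escalation_reasons(state: EscalationState) -> List[str]:
    """Return every trigger that applies; an empty list means no escalation."""
    reasons = []
    if state.classification_confidence < 0.70:
        reasons.append("low classification confidence")
    # Count open incidents at SEV0 or SEV1; two or more counts as "multiple".
    if sum(1 for sev in state.open_incident_severities if sev <= 1) >= 2:
        reasons.append("multiple concurrent SEV0-1 incidents")
    if state.postmortem_blocked:
        reasons.append("postmortem analysis blocked")
    if state.sla_violated:
        reasons.append("SLA violation")
    return reasons
```

Returning the list of reasons rather than a bare boolean gives the Factory Orchestrator context about why the escalation happened.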

Source Repository

All Incident domain agents are maintained in the agents/incident/ directory of the so1-agents repository.