Incident Domain - SO1 Field Manual

Overview

The Incident domain manages the complete incident lifecycle for the SO1 platform, from initial detection and severity assessment through resolution tracking and blameless post-incident analysis. These agents ensure rapid response, clear communication, and continuous learning from operational incidents.

Production-Critical: Incident agents operate in high-pressure scenarios with strict SLA requirements. All workflows include escalation paths, fallback strategies, and comprehensive logging for post-incident review.

Agents in This Domain

Incident Commander

Orchestrates incident response workflow and stakeholder communication

Triage Responder

Performs initial severity assessment and routing decisions

Postmortem Analyst

Conducts blameless analysis and generates improvement recommendations

Domain Responsibilities

Incident Detection & Triage

Severity classification: Assess impact using SEV0-SEV4 scale
Automated routing: Direct incidents to appropriate teams based on severity
Context gathering: Collect metrics, logs, and recent changes
Initial response: Generate action plans and notification templates

Response Orchestration

Communication management: Coordinate updates to stakeholders
Timeline tracking: Maintain detailed incident chronology
Escalation handling: Trigger escalation paths for critical incidents
Status monitoring: Track resolution progress and SLA compliance

Post-Incident Analysis

Root cause analysis: Identify contributing factors and system weaknesses
Timeline reconstruction: Build detailed incident narrative
Improvement recommendations: Generate actionable remediation items
Knowledge capture: Document learnings for future prevention

Severity Scale

The SO1 platform uses a 5-tier severity classification:

Level	Impact	Response Time	Escalation
SEV0	Complete outage, data loss risk	Immediate	Executive + All hands
SEV1	Major feature down, significant user impact	<15 min	Engineering leadership
SEV2	Degraded performance, partial functionality	<1 hour	On-call engineer
SEV3	Minor issues, workarounds available	<4 hours	Standard queue
SEV4	Cosmetic issues, no user impact	Best effort	Backlog
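The table above can be encoded as a small lookup structure so that response windows and escalation targets are not re-typed in each workflow. This is a minimal sketch, not the platform's actual data model: the names `Severity`, `SeverityPolicy`, `SEVERITY_POLICIES`, and `response_deadline_minutes` are hypothetical, "Immediate" is modeled as a zero-length window, and "Best effort" is modeled as `None`.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Lower number means higher urgency (SEV0 is the most severe)."""
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityPolicy:
    impact: str
    response_within: Optional[timedelta]  # None models "Best effort"
    escalation: str


# Rows copied from the severity table. Modeling "Immediate" as a zero
# timedelta is an assumption of this sketch.
SEVERITY_POLICIES = {
    Severity.SEV0: SeverityPolicy(
        "Complete outage, data loss risk", timedelta(0), "Executive + All hands"),
    Severity.SEV1: SeverityPolicy(
        "Major feature down, significant user impact", timedelta(minutes=15),
        "Engineering leadership"),
    Severity.SEV2: SeverityPolicy(
        "Degraded performance, partial functionality", timedelta(hours=1),
        "On-call engineer"),
    Severity.SEV3: SeverityPolicy(
        "Minor issues, workarounds available", timedelta(hours=4), "Standard queue"),
    Severity.SEV4: SeverityPolicy(
        "Cosmetic issues, no user impact", None, "Backlog"),
}


def response_deadline_minutes(severity: Severity) -> Optional[float]:
    """Return the response window in minutes, or None for best effort."""
    window = SEVERITY_POLICIES[severity].response_within
    return None if window is None else window.total_seconds() / 60
```

Using an integer-backed enum keeps comparisons natural: `Severity.SEV0 < Severity.SEV2` holds, so "at least as severe as SEV1" can be written as `severity <= Severity.SEV1`.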

Incident Workflow

Integration with FORGE

All incident agents operate within the FORGE execution model:

FORGE Stage: Detect (D)

Triage Responder activates here:

Alert data ingestion and normalization
Severity classification using historical patterns
Context enrichment (metrics, logs, deployments)
Initial routing decision

FORGE Stage: Respond (R)

Incident Commander activates here:

Response workflow orchestration
Stakeholder notification
Timeline tracking and updates
Escalation management

FORGE Stage: Analyze (A)

Postmortem Analyst activates here:

Root cause investigation
Timeline reconstruction
Contributing factor identification
Improvement recommendation generation
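The stage-to-agent mapping described above can be sketched as a lookup table. The stage codes, agent names, and task lists come from this section; `ForgeStage`, `STAGE_PLAN`, and `agent_for` are hypothetical names chosen for illustration and are not taken from the so1-agents repository.

```python
from enum import Enum
from typing import Dict, List, Tuple


class ForgeStage(Enum):
    DETECT = "D"
    RESPOND = "R"
    ANALYZE = "A"


# Each stage maps to (agent name, tasks that agent performs in that stage).
STAGE_PLAN: Dict[ForgeStage, Tuple[str, List[str]]] = {
    ForgeStage.DETECT: ("Triage Responder", [
        "Alert data ingestion and normalization",
        "Severity classification using historical patterns",
        "Context enrichment (metrics, logs, deployments)",
        "Initial routing decision",
    ]),
    ForgeStage.RESPOND: ("Incident Commander", [
        "Response workflow orchestration",
        "Stakeholder notification",
        "Timeline tracking and updates",
        "Escalation management",
    ]),
    ForgeStage.ANALYZE: ("Postmortem Analyst", [
        "Root cause investigation",
        "Timeline reconstruction",
        "Contributing factor identification",
        "Improvement recommendation generation",
    ]),
}


def agent_for(stage_code: str) -> str:
    """Resolve a stage code such as "D" to the agent that activates there."""
    return STAGE_PLAN[ForgeStage(stage_code)][0]
```

An unknown stage code raises `ValueError` from the enum lookup, which is usually preferable to silently routing an incident nowhere.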

Common Use Cases

Critical Production Outage (SEV0)

Scenario: Database cluster failure causing complete service unavailability

Workflow:

Triage Responder: Classifies as SEV0 based on metrics (100% error rate)
Incident Commander: Triggers immediate escalation, notifies executives, creates war room
Incident Commander: Orchestrates failover, coordinates communication, tracks SLA
Postmortem Analyst: After resolution, identifies root cause (connection pool exhaustion), recommends circuit breaker implementation

Performance Degradation (SEV2)

Scenario: API response times increased by 300%, intermittent timeouts

Workflow:

Triage Responder: Classifies as SEV2 (degraded but functional)
Incident Commander: Notifies on-call engineer, creates incident ticket
Incident Commander: Tracks mitigation (database index added), communicates status
Postmortem Analyst: Documents slow query pattern, recommends query optimization review

Minor UI Bug (SEV4)

Scenario: Cosmetic alignment issue on settings page

Workflow:

Triage Responder: Classifies as SEV4 (no functional impact)
Incident Commander: Creates backlog ticket, no immediate escalation
Postmortem Analyst: Skipped (not critical enough for formal analysis)
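The three scenarios above suggest what a rule-based first pass at classification could look like. The sketch below is illustrative only: the source states that the Triage Responder classifies using historical patterns, and it gives no numeric thresholds beyond the examples (100% error rate, 300% latency increase, cosmetic issue). The `Signal` fields, the `classify` function, and every threshold other than the 100% error rate are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    error_rate: float            # fraction of failed requests, 0.0 to 1.0
    latency_increase_pct: float  # increase over baseline, in percent
    user_facing_impact: bool     # False for purely cosmetic issues


def classify(signal: Signal) -> str:
    """Map a signal to a severity label using hypothetical thresholds."""
    if signal.error_rate >= 1.0:
        return "SEV0"  # complete outage, as in the database cluster scenario
    if signal.error_rate >= 0.5:  # hypothetical cutoff for "major feature down"
        return "SEV1"
    if signal.latency_increase_pct >= 100:  # hypothetical cutoff for "degraded"
        return "SEV2"
    if signal.user_facing_impact:
        return "SEV3"  # minor issue that users can still notice
    return "SEV4"      # cosmetic, no user impact


outage = Signal(error_rate=1.0, latency_increase_pct=0.0, user_facing_impact=True)
slow_api = Signal(error_rate=0.02, latency_increase_pct=300.0, user_facing_impact=True)
ui_bug = Signal(error_rate=0.0, latency_increase_pct=0.0, user_facing_impact=False)
```

With these assumed thresholds, the three sample signals reproduce the classifications in the scenarios: `outage` maps to SEV0, `slow_api` to SEV2, and `ui_bug` to SEV4.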

Control Plane Integration

Incident agents interact with the SO1 Control Plane API for:

Data Retrieval

GET /api/incidents: Fetch historical incident data for pattern matching
GET /api/services: Retrieve service topology and ownership information
GET /api/metrics: Query time-series metrics for severity assessment

Incident Management

POST /api/incidents: Create new incident records
PATCH /api/incidents/:id: Update incident status and timeline
POST /api/incidents/:id/notifications: Trigger stakeholder notifications

Post-Incident

POST /api/postmortems: Create postmortem documents
POST /api/action-items: Generate follow-up tasks from postmortem analysis
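As an illustration of how a client might address these endpoints, the sketch below builds (but does not send) HTTP requests with Python's standard library. Only the methods and paths come from the list above. The base URL is a placeholder, and the JSON field names (`title`, `severity`, `status`) are assumptions, since the request schemas are not documented here.

```python
import json
from typing import Optional
from urllib.parse import quote, urljoin
from urllib.request import Request

BASE_URL = "https://control-plane.example.invalid"  # placeholder, not a real host


def build_request(method: str, path: str, payload: Optional[dict] = None) -> Request:
    """Build a JSON request for a Control Plane path without sending it."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(urljoin(BASE_URL, path), data=data, headers=headers, method=method)


def create_incident(title: str, severity: str) -> Request:
    # Field names are hypothetical; check the real schema before use.
    return build_request("POST", "/api/incidents", {"title": title, "severity": severity})


def update_incident(incident_id: str, status: str) -> Request:
    # Percent-encode the id so unexpected characters cannot alter the path.
    path = f"/api/incidents/{quote(incident_id, safe='')}"
    return build_request("PATCH", path, {"status": status})


def list_incidents() -> Request:
    return build_request("GET", "/api/incidents")
```

A caller would pass the resulting object to `urllib.request.urlopen` (or translate it to another HTTP client) once authentication and the real base URL are known.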

n8n Workflow Integration

Incident workflows are often automated via n8n:

Alert ingestion: Webhook receives alerts from monitoring systems
Triage automation: Calls Triage Responder for severity classification
Notification dispatch: Sends alerts via Slack, PagerDuty, or email, depending on severity
Timeline tracking: Updates incident status in real time
Postmortem scheduling: Automatically schedules postmortem review 24h after resolution
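Two of the steps above reduce to small pure functions that an n8n Code node or an external service could call. The sketch below is hedged accordingly: the 24-hour delay and the three channel names come from this section, but the per-severity channel mapping is an assumption, and `channels_for` and `postmortem_review_time` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

POSTMORTEM_DELAY = timedelta(hours=24)  # "24h after resolution"

# Hypothetical mapping; the source only says channels depend on severity.
CHANNELS_BY_SEVERITY: Dict[str, List[str]] = {
    "SEV0": ["pagerduty", "slack", "email"],
    "SEV1": ["pagerduty", "slack"],
    "SEV2": ["pagerduty", "slack"],
    "SEV3": ["slack"],
    "SEV4": [],
}


def channels_for(severity: str) -> List[str]:
    """Return notification channels for a severity label such as "SEV1"."""
    if severity not in CHANNELS_BY_SEVERITY:
        raise ValueError(f"unknown severity: {severity!r}")
    return list(CHANNELS_BY_SEVERITY[severity])  # copy so callers cannot mutate the table


def postmortem_review_time(resolved_at: datetime) -> datetime:
    """Return the review time, 24 hours after the incident was resolved."""
    if resolved_at.tzinfo is None:
        raise ValueError("resolved_at must be timezone-aware")
    return resolved_at + POSTMORTEM_DELAY


resolved = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
review_at = postmortem_review_time(resolved)
```

Requiring timezone-aware timestamps avoids scheduling the review at the wrong hour when alert sources and the scheduler run in different time zones.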

Monitoring & Observability

Agent Performance Metrics

Metric	Target	Critical Threshold
Triage Latency	<30 seconds	>60 seconds
Commander Response Time	<5 minutes	>15 minutes
Severity Accuracy	>95%	<90%
Postmortem Completion	<48 hours	>72 hours
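The targets and critical thresholds above can be checked mechanically. In the sketch below the numbers come from the table, while the metric keys, the `Threshold` type, the `evaluate` function, and the "warning" label for values that fall between target and critical threshold are assumptions, since the source does not name that middle band.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Threshold:
    target: float
    critical: float
    unit: str

    @property
    def lower_is_better(self) -> bool:
        # For latency-style metrics the target sits below the critical value.
        return self.target < self.critical


THRESHOLDS: Dict[str, Threshold] = {
    "triage_latency": Threshold(30, 60, "seconds"),
    "commander_response_time": Threshold(5, 15, "minutes"),
    "severity_accuracy": Threshold(95, 90, "percent"),
    "postmortem_completion": Threshold(48, 72, "hours"),
}


def evaluate(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a measured value."""
    t = THRESHOLDS[metric]
    if t.lower_is_better:
        if value < t.target:
            return "ok"
        if value > t.critical:
            return "critical"
    else:
        if value > t.target:
            return "ok"
        if value < t.critical:
            return "critical"
    return "warning"  # between target and critical threshold (assumed label)
```

For example, a triage latency of 45 seconds misses the 30-second target without crossing the 60-second critical threshold, so `evaluate("triage_latency", 45)` returns `"warning"`.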

Escalation Triggers

Agents automatically escalate to Factory Orchestrator when:

Severity classification confidence <70%
Multiple concurrent SEV0 or SEV1 incidents
Postmortem analysis blocked (missing data)
SLA violation detected
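The four triggers above can be expressed as a single check that returns the reasons for escalation. This is a sketch under stated assumptions: `IncidentState` and `escalation_reasons` are hypothetical names, confidence is represented as a fraction between 0 and 1, and "multiple" is read as two or more.

```python
from dataclasses import dataclass
from typing import List

MIN_CONFIDENCE = 0.70  # "Severity classification confidence <70%"


@dataclass(frozen=True)
class IncidentState:
    classification_confidence: float  # 0.0 to 1.0
    concurrent_sev0_sev1: int         # open SEV0 or SEV1 incidents right now
    postmortem_blocked: bool          # e.g. analysis is missing required data
    sla_violated: bool


def escalation_reasons(state: IncidentState) -> List[str]:
    """Return every trigger that fires; an empty list means no escalation."""
    reasons = []
    if state.classification_confidence < MIN_CONFIDENCE:
        reasons.append("low classification confidence")
    if state.concurrent_sev0_sev1 >= 2:  # "multiple" read as two or more
        reasons.append("multiple concurrent SEV0 or SEV1 incidents")
    if state.postmortem_blocked:
        reasons.append("postmortem analysis blocked")
    if state.sla_violated:
        reasons.append("SLA violation detected")
    return reasons


calm = IncidentState(0.92, 1, False, False)
busy = IncidentState(0.65, 3, False, True)
```

Returning the list of reasons, rather than a bare boolean, gives the Factory Orchestrator (or a human reviewer) the context for why the escalation happened.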

Source Repository

All Incident domain agents are maintained in the so1-agents repository under agents/incident/