Overview
The Incident domain manages the complete incident lifecycle for the SO1 platform, from initial detection and severity assessment through resolution tracking and blameless post-incident analysis. These agents ensure rapid response, clear communication, and continuous learning from operational incidents.
Production-Critical: Incident agents operate in high-pressure scenarios with strict SLA requirements. All workflows include escalation paths, fallback strategies, and comprehensive logging for post-incident review.
Agents in This Domain
Incident Commander
Orchestrates incident response workflow and stakeholder communication
Triage Responder
Performs initial severity assessment and routing decisions
Postmortem Analyst
Conducts blameless analysis and generates improvement recommendations
Domain Responsibilities
Incident Detection & Triage
- Severity classification: Assess impact using SEV0-SEV4 scale
- Automated routing: Direct incidents to appropriate teams based on severity
- Context gathering: Collect metrics, logs, and recent changes
- Initial response: Generate action plans and notification templates
Response Orchestration
- Communication management: Coordinate updates to stakeholders
- Timeline tracking: Maintain detailed incident chronology
- Escalation handling: Trigger escalation paths for critical incidents
- Status monitoring: Track resolution progress and SLA compliance
Post-Incident Analysis
- Root cause analysis: Identify contributing factors and system weaknesses
- Timeline reconstruction: Build detailed incident narrative
- Improvement recommendations: Generate actionable remediation items
- Knowledge capture: Document learnings for future prevention
Severity Scale
The SO1 platform uses a 5-tier severity classification:
| Level | Impact | Response Time | Escalation |
|---|---|---|---|
| SEV0 | Complete outage, data loss risk | Immediate | Executive + All hands |
| SEV1 | Major feature down, significant user impact | <15 min | Engineering leadership |
| SEV2 | Degraded performance, partial functionality | <1 hour | On-call engineer |
| SEV3 | Minor issues, workarounds available | <4 hours | Standard queue |
| SEV4 | Cosmetic issues, no user impact | Best effort | Backlog |
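The table above can be encoded as a lookup so that response deadlines are computed rather than remembered. This is a minimal sketch, not the platform's implementation: the names `Severity`, `SeverityPolicy`, `POLICIES`, and `response_deadline` are hypothetical, "Immediate" is modeled as a zero-length window, and "Best effort" as no deadline at all.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Lower number means higher severity, matching the SEV0-SEV4 scale."""
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityPolicy:
    impact: str
    response_within: Optional[timedelta]  # None models "Best effort" (assumption)
    escalation: str


# Values copied from the severity table; timedelta(0) models "Immediate" (assumption).
POLICIES = {
    Severity.SEV0: SeverityPolicy("Complete outage, data loss risk", timedelta(0), "Executive + All hands"),
    Severity.SEV1: SeverityPolicy("Major feature down, significant user impact", timedelta(minutes=15), "Engineering leadership"),
    Severity.SEV2: SeverityPolicy("Degraded performance, partial functionality", timedelta(hours=1), "On-call engineer"),
    Severity.SEV3: SeverityPolicy("Minor issues, workarounds available", timedelta(hours=4), "Standard queue"),
    Severity.SEV4: SeverityPolicy("Cosmetic issues, no user impact", None, "Backlog"),
}


def response_deadline(severity: Severity, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None for best-effort levels."""
    window = POLICIES[severity].response_within
    return None if window is None else detected_at + window
```

For example, a SEV1 incident detected at 12:00 yields a deadline of 12:15, while a SEV4 incident yields `None`.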
Incident Workflow
Integration with FORGE
All incident agents operate within the FORGE execution model:
FORGE Stage: Detect (D)
Triage Responder activates here:
- Alert data ingestion and normalization
- Severity classification using historical patterns
- Context enrichment (metrics, logs, deployments)
- Initial routing decision
FORGE Stage: Respond (R)
Incident Commander activates here:
- Response workflow orchestration
- Stakeholder notification
- Timeline tracking and updates
- Escalation management
FORGE Stage: Analyze (A)
Postmortem Analyst activates here:
- Root cause investigation
- Timeline reconstruction
- Contributing factor identification
- Improvement recommendation generation
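The stage-to-agent mapping described above could be represented as a small dispatch table. This is only an illustrative sketch: `ForgeStage`, `STAGE_AGENTS`, and `agent_for_stage` are hypothetical names, and the enum covers only the three stages documented on this page.

```python
from enum import Enum


class ForgeStage(Enum):
    """The three FORGE stages in which incident agents activate."""
    DETECT = "D"
    RESPOND = "R"
    ANALYZE = "A"


# Which incident agent activates at each stage, per the lists above.
STAGE_AGENTS = {
    ForgeStage.DETECT: "Triage Responder",
    ForgeStage.RESPOND: "Incident Commander",
    ForgeStage.ANALYZE: "Postmortem Analyst",
}


def agent_for_stage(code: str) -> str:
    """Look up the agent for a stage code such as "D"; raises ValueError if unknown."""
    return STAGE_AGENTS[ForgeStage(code)]
```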
Common Use Cases
Critical Production Outage (SEV0)
Scenario: Database cluster failure causing complete service unavailability
Workflow:
- Triage Responder: Classifies as SEV0 based on metrics (100% error rate)
- Incident Commander: Triggers immediate escalation, notifies executives, creates war room
- Incident Commander: Orchestrates failover, coordinates communication, tracks SLA
- Postmortem Analyst: After resolution, identifies root cause (connection pool exhaustion), recommends circuit breaker implementation
Performance Degradation (SEV2)
Scenario: API response times increased by 300%, intermittent timeouts
Workflow:
- Triage Responder: Classifies as SEV2 (degraded but functional)
- Incident Commander: Notifies on-call engineer, creates incident ticket
- Incident Commander: Tracks mitigation (database index added), communicates status
- Postmortem Analyst: Documents slow query pattern, recommends query optimization review
Minor UI Bug (SEV4)
Scenario: Cosmetic alignment issue on settings page
Workflow:
- Triage Responder: Classifies as SEV4 (no functional impact)
- Incident Commander: Creates backlog ticket, no immediate escalation
- Postmortem Analyst: Skipped (not critical enough for formal analysis)
Control Plane Integration
Incident agents interact with the SO1 Control Plane API for:
Data Retrieval
- GET /api/incidents: Fetch historical incident data for pattern matching
- GET /api/services: Retrieve service topology and ownership information
- GET /api/metrics: Query time-series metrics for severity assessment
Incident Management
- POST /api/incidents: Create new incident records
- PATCH /api/incidents/:id: Update incident status and timeline
- POST /api/incidents/:id/notifications: Trigger stakeholder notifications
Post-Incident
- POST /api/postmortems: Create postmortem documents
- POST /api/action-items: Generate follow-up tasks from postmortem analysis
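As a rough illustration of how a client might target these endpoints, the sketch below builds (but does not send) HTTP requests with Python's standard library. The base URL, the payload fields (`title`, `severity`, `status`), and the helper names are all assumptions made for the example; only the methods and paths come from the lists above.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import Request

# Hypothetical host; the real Control Plane base URL is not given on this page.
BASE_URL = "https://control-plane.example.invalid"


def build_request(method: str, path: str, payload: Optional[dict] = None) -> Request:
    """Build a JSON request for a Control Plane endpoint without sending it."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def create_incident(title: str, severity: str) -> Request:
    # POST /api/incidents; the body fields are illustrative assumptions.
    return build_request("POST", "/api/incidents", {"title": title, "severity": severity})


def update_incident(incident_id: str, status: str) -> Request:
    # PATCH /api/incidents/:id; the id is percent-encoded before being placed in the path.
    path = "/api/incidents/" + quote(str(incident_id), safe="")
    return build_request("PATCH", path, {"status": status})
```

A request built this way could be passed to `urllib.request.urlopen`; authentication is omitted because this page does not describe it.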
n8n Workflow Integration
Incident workflows are often automated via n8n:
- Alert ingestion: Webhook receives alerts from monitoring systems
- Triage automation: Calls Triage Responder for severity classification
- Notification dispatch: Sends alerts via Slack, PagerDuty, email based on severity
- Timeline tracking: Updates incident status in real-time
- Postmortem scheduling: Automatically schedules postmortem review 24h after resolution
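The postmortem scheduling step reduces to simple date arithmetic. A minimal sketch follows, assuming resolution timestamps are timezone-aware; that requirement, and the names `POSTMORTEM_DELAY` and `postmortem_review_time`, are illustrative choices rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone

# Review is scheduled 24h after resolution, per the workflow list above.
POSTMORTEM_DELAY = timedelta(hours=24)


def postmortem_review_time(resolved_at: datetime) -> datetime:
    """Return when the postmortem review should be scheduled."""
    if resolved_at.tzinfo is None:
        # Rejecting naive timestamps avoids ambiguity across regions (assumption of this sketch).
        raise ValueError("resolved_at must be timezone-aware")
    return resolved_at + POSTMORTEM_DELAY


if __name__ == "__main__":
    resolved = datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc)
    print(postmortem_review_time(resolved).isoformat())
```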
Monitoring & Observability
Agent Performance Metrics
| Metric | Target | Critical Threshold |
|---|---|---|
| Triage Latency | <30 seconds | >60 seconds |
| Commander Response Time | <5 minutes | >15 minutes |
| Severity Accuracy | >95% | <90% |
| Postmortem Completion | <48 hours | >72 hours |
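One way to apply the targets and critical thresholds above is a three-state check. The sketch below is an assumption-laden illustration: the metric keys, the `MetricRule` type, and the intermediate "warning" band (values between target and critical threshold) are not defined by this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    target: float
    critical: float
    higher_is_better: bool = False


# Thresholds copied from the table; key names and units are hypothetical.
RULES = {
    "triage_latency_seconds": MetricRule(30, 60),
    "commander_response_minutes": MetricRule(5, 15),
    "severity_accuracy_percent": MetricRule(95, 90, higher_is_better=True),
    "postmortem_completion_hours": MetricRule(48, 72),
}


def evaluate(metric: str, value: float) -> str:
    """Classify a measurement as "ok", "warning", or "critical"."""
    rule = RULES[metric]
    if rule.higher_is_better:
        if value > rule.target:
            return "ok"
        if value < rule.critical:
            return "critical"
    else:
        if value < rule.target:
            return "ok"
        if value > rule.critical:
            return "critical"
    # Between the target and the critical threshold (inclusive of both bounds).
    return "warning"
```

Under these rules a 45-second triage latency is a "warning", while an 89% severity accuracy is "critical".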
Escalation Triggers
Agents automatically escalate to Factory Orchestrator when:
- Severity classification confidence <70%
- Multiple concurrent SEV0-1 incidents
- Postmortem analysis blocked (missing data)
- SLA violation detected
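The four triggers above can be expressed as a single predicate. This is a hedged sketch: `AgentState` and its fields are hypothetical, confidence is assumed to be a fraction between 0 and 1, and "multiple" is read as two or more concurrent SEV0 or SEV1 incidents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentState:
    classification_confidence: float  # assumed to be a fraction in [0, 1]
    active_severities: List[int]      # severity level of each open incident, 0-4
    postmortem_blocked: bool          # e.g. analysis stalled on missing data
    sla_violated: bool


def escalation_reasons(state: AgentState) -> List[str]:
    """Return every trigger that currently applies; an empty list means no escalation."""
    reasons = []
    if state.classification_confidence < 0.70:
        reasons.append("low classification confidence")
    if sum(1 for level in state.active_severities if level <= 1) >= 2:
        reasons.append("multiple concurrent SEV0-1 incidents")
    if state.postmortem_blocked:
        reasons.append("postmortem analysis blocked")
    if state.sla_violated:
        reasons.append("SLA violation")
    return reasons
```

Returning the list of reasons rather than a bare boolean keeps the escalation auditable, which fits the logging emphasis of this domain.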
Source Repository
All Incident domain agents are maintained in the so1-agents repository under agents/incident/.