Overview
This runbook provides step-by-step guidance for responding to production incidents in the SO1 platform, covering the complete lifecycle from initial detection through postmortem analysis.
Production Critical: This runbook contains emergency procedures. Bookmark this page for rapid access during incidents.
Severity Levels
| Level | Criteria | Response Time | Commander |
|---|---|---|---|
| SEV0 | Complete outage, data loss, security breach | Immediate | Engineering lead + Executive |
| SEV1 | Major feature down, significant user impact | <5 minutes | Senior engineer |
| SEV2 | Degraded performance, moderate impact | <15 minutes | On-call engineer |
| SEV3 | Minor issues, workarounds available | <1 hour | On-call engineer |
| SEV4 | Cosmetic issues, no functional impact | Best effort | Any engineer |
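For local tooling, the severity table above can be encoded directly; a minimal sketch (the `response_deadline` helper is hypothetical, not part of any SO1 CLI):

```shell
# Hypothetical helper mirroring the severity table; not part of SO1 tooling.
response_deadline() {
  case "$1" in
    SEV0) echo "immediate" ;;
    SEV1) echo "5 minutes" ;;
    SEV2) echo "15 minutes" ;;
    SEV3) echo "1 hour" ;;
    SEV4) echo "best effort" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_deadline SEV1   # prints "5 minutes"
```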
Quick Start: Incident Response
Detect & Triage (0-2 min)
Alert received → Assess severity → Page responders
Declare & Coordinate (2-5 min)
Create war room → Assign commander → Initial communication
Investigate & Mitigate (5-60 min)
Gather context → Identify root cause → Apply mitigation
Resolve & Verify (Variable)
Confirm resolution → Monitor stability → Close incident
Postmortem (24-48 hours)
Analyze root cause → Create action items → Document learnings
Phase 1: Detection & Triage
Automated Alert Reception
When: Alert fires from monitoring system (Datadog, PagerDuty, Sentry)
Procedure:
1. Acknowledge alert within 2 minutes
# PagerDuty: Press "Acknowledge" in app or SMS reply "ack"
# Slack: React with 👀 to alert message
2. Invoke Triage Responder agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{
"agent": "triage-responder",
"inputs": {
"alert_id": "pd-alert-123",
"alert_source": "pagerduty",
"alert_data": {
"service": "so1-control-plane-api",
"metric": "error_rate",
"current_value": "12%",
"threshold": "< 1%"
}
}
}'
3. Review triage report (30 seconds)
- Check severity classification (SEV0-4)
- Review impact estimate (users affected, services)
- Note initial hypothesis and investigation steps
4. Decide escalation
- SEV0-1: Immediate escalation → Phase 2
- SEV2-3: Standard escalation → Phase 2
- SEV4: Create ticket, no incident
- Non-incident: Close alert, monitor
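The step-4 escalation decision, combined with the <70%-confidence rule used in triage, can be sketched as a small helper (the function name and its outputs are illustrative, not SO1 tooling):

```shell
# Illustrative sketch of the step-4 escalation decision.
# confidence is an integer percent, mirroring the >70% triage threshold.
escalation_path() {
  severity="$1"
  confidence="$2"
  if [ "$confidence" -lt 70 ]; then
    echo "investigation-required"
    return 0
  fi
  case "$severity" in
    SEV0|SEV1) echo "immediate-escalation" ;;
    SEV2|SEV3) echo "standard-escalation" ;;
    SEV4)      echo "ticket-only" ;;
    *)         echo "close-and-monitor" ;;
  esac
}

escalation_path SEV2 85   # prints "standard-escalation"
```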
Verification:
- ✅ Severity assigned with confidence >70%
- ✅ Impact scope estimated
- ✅ Initial hypothesis documented
Troubleshooting:
| Issue | Cause | Resolution |
|---|---|---|
| Alert storm (10+ alerts) | Cascading failure | Correlate alerts, treat as single incident |
| Low confidence (<70%) | Unclear symptoms | Escalate with “investigation required” status |
| Health endpoints down | Service unavailable | Use last known state, assume SEV1 minimum |
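The alert-storm guidance above boils down to correlating alerts by service before declaring separate incidents; a sketch with illustrative alert identifiers (`worker-pool` and the metric names are made up for this example):

```shell
# Collapse a storm of alerts to the set of affected services.
# Identifier format source:service:metric is illustrative.
alerts='pagerduty:control-plane-api:error_rate
pagerduty:control-plane-api:latency_p99
sentry:control-plane-api:exception_rate
datadog:worker-pool:queue_depth'

affected=$(printf '%s\n' "$alerts" | cut -d: -f2 | sort -u)
echo "$affected"
# Two distinct services across four alerts -> likely one cascading
# incident, not four separate ones.
echo "distinct services: $(printf '%s\n' "$affected" | wc -l)"
```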
Phase 2: Incident Declaration
Create War Room & Assign Commander
When: SEV0-3 incident confirmed by Triage Responder
Procedure:
1. Create incident channel (1 minute)
# Slack command
/incident create SEV2 "API error rate spike"
# Or manual:
# Create channel: #incident-YYYYMMDD-XXXX
# Set topic: "SEV2: API error rate spike | Commander: @oncall-backend"
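The manual channel-naming convention can be scripted; a sketch (the sequence number would come from your incident tracker, and 0042 is this runbook's running example):

```shell
# Build the war-room channel name in the #incident-YYYYMMDD-XXXX convention.
# seq_no would come from your incident tracker; 0042 is illustrative.
seq_no="0042"
channel="#incident-$(date -u +%Y%m%d)-${seq_no}"
echo "$channel"
```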
2. Invoke Incident Commander agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{
"agent": "incident-commander",
"inputs": {
"incident_id": "INC-20240115-0042",
"severity": "SEV2",
"triage_report": {
"symptoms": ["High error rate", "Connection pool exhaustion"],
"impact": {"users_affected": 150, "services": ["control-plane-api"]},
"hypothesis": ["Connection pool undersized"]
},
"commander": "oncall-backend",
"channel": "#incident-20240115-0042"
}
}'
3. Send initial communication (2 minutes)
Internal (Slack):
🔴 **SEV2 Incident Declared: API Error Rate Spike**
**Status**: Investigating
**Impact**: 150 users, workflow executions failing
**Commander**: @oncall-backend
**War Room**: #incident-20240115-0042
Investigating database connection pool exhaustion. Updates every 15 min.
Customer (Status Page) - Only if SEV0-1:
**Investigating - API Performance Issues**
We are currently experiencing elevated error rates affecting workflow executions.
Our team is actively investigating. Next update in 15 minutes.
4. Page additional responders (SEV0-1 only)
# Escalate to leadership
/pd escalate to engineering-lead
# All-hands notification (SEV0 only)
@channel 🚨 SEV0 INCIDENT: Complete outage. Join #incident-YYYYMMDD-XXXX
Verification:
- ✅ War room created and active
- ✅ Commander assigned and present
- ✅ Initial communication sent (<5 min from detection)
- ✅ Incident Commander agent running
Troubleshooting:
| Issue | Cause | Resolution |
|---|---|---|
| Commander unavailable | PTO, no response | Assign backup from rotation |
| Unclear ownership | Multi-team incident | Assign most affected service owner |
| Status page down | Meta-incident | Use Twitter/email for customer comms |
Phase 3: Investigation & Mitigation
Root Cause Investigation
Procedure:
1. Gather context (5 minutes)
Check recent deployments:
# Via Control Plane API
curl https://api.so1.io/v1/deployments/recent?hours=24 \
-H "Authorization: Bearer $SO1_API_KEY"
# Via Railway
railway status --service so1-control-plane-api
Review metrics:
# Datadog dashboard
open https://app.datadoghq.com/dashboard/so1-production
# Key metrics:
# - Error rate trend
# - Response latency (p50, p95, p99)
# - Database connection pool utilization
# - CPU/Memory usage
Check logs:
# Via Datadog
# Search: service:control-plane-api status:error
# Time range: Last 30 minutes
# Look for:
# - Error patterns
# - Stack traces
# - Correlation with deployments
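Correlating error onset with recent deployments, as the checklist above suggests, reduces to a timestamp comparison; a sketch with illustrative timestamps and a hypothetical 30-minute suspicion window:

```shell
# Flag a deployment as suspect when errors began shortly after it shipped.
# Timestamps and the 30-minute window are illustrative.
deploy_epoch=$(date -u -d "2024-01-15T10:00:00Z" +%s)
errors_epoch=$(date -u -d "2024-01-15T10:12:00Z" +%s)
delta=$((errors_epoch - deploy_epoch))
if [ "$delta" -ge 0 ] && [ "$delta" -le 1800 ]; then
  echo "suspect: errors began ${delta}s after deploy"
fi
```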
2. Identify root cause (10-30 minutes)
Common patterns:
| Symptom | Likely Cause | Investigation |
|---|---|---|
| 5xx spike + pool at 100% | Connection exhaustion | Check pool config vs load |
| Latency spike | Slow queries, N+1 pattern | Review slow query logs |
| Gradual degradation | Memory leak | Check memory trends over hours |
| Sudden outage | Deployment issue | Compare current vs previous version |
| Regional issues | Infrastructure problem | Check Railway/AWS status |
3. Document hypothesis
# Post in incident channel:
**Hypothesis**: Database connection pool exhaustion
**Confidence**: 85%
**Evidence**:
- Connection pool at 100% utilization
- Errors: "Connection timeout after 5000ms"
- Traffic 2x normal due to bulk execution feature
- Deployed 4 hours ago
**Investigation steps**:
1. Check connection pool configuration ✅
2. Review bulk execution concurrency ✅
3. Assess if pool can be increased 🔄
Apply Mitigation
Procedure:
1. Choose mitigation strategy
| Strategy | When to Use | Risk | Speed |
|---|---|---|---|
| Configuration change | Wrong setting identified | Low | Fast (minutes) |
| Rollback | Recent deployment caused issue | Medium | Fast (5-10 min) |
| Scale up | Resource exhaustion | Low | Medium (10-15 min) |
| Feature flag disable | Specific feature causing issue | Low | Fast (seconds) |
| Restart | Unknown transient issue | Medium | Fast (1-5 min) |
| Failover | Primary system failed | High | Medium (10-20 min) |
2. Execute mitigation
Example: Increase connection pool:
# Via Railway
railway variables set DATABASE_POOL_SIZE=50 --service control-plane-api
railway up --service control-plane-api
# Wait for deployment (2-3 minutes)
railway logs --service control-plane-api --follow
Example: Rollback deployment:
# Via Railway Deployer agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{
"agent": "railway-deployer",
"inputs": {
"action": "rollback",
"service": "control-plane-api",
"target_version": "previous"
}
}'
Example: Disable feature flag:
# Via feature flag service
curl -X PATCH https://api.so1.io/v1/feature-flags/bulk-execution \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{"enabled": false}'
3. Monitor mitigation impact (5-15 minutes)
# Watch error rate
# Datadog: Monitor error rate graph
# Target: Return to <1% within 5 minutes
# Check user impact
# Datadog: Track successful workflow executions
# Target: Success rate >95%
# Verify service health
curl https://api.so1.io/v1/health
# Expected: {"status": "healthy", "checks": {"database": "ok", ...}}
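The monitoring loop above can be automated; a sketch where `check_health` is a stub standing in for the real probe against `https://api.so1.io/v1/health`:

```shell
# Stub for illustration; in production, replace with a curl to the health
# endpoint plus JSON parsing of the "status" field.
check_health() { echo "healthy"; }

# Poll until healthy or the attempt budget runs out.
watch_recovery() {
  attempts="$1"
  interval="$2"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(check_health)" = "healthy" ]; then
      echo "recovered"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "still-degraded"
  return 1
}

watch_recovery 30 10   # with the stub, prints "recovered" immediately
```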
Verification:
- ✅ Error rate returned to baseline (<1%)
- ✅ Response latency normalized (<500ms p99)
- ✅ No new alerts firing
- ✅ Service health checks passing
- ✅ User-facing functionality restored
Troubleshooting:
| Issue | Cause | Resolution |
|---|
| Mitigation ineffective | Wrong root cause | Try alternative hypothesis |
| Partial improvement | Multiple issues | Address remaining factors |
| New symptoms appear | Side effect of mitigation | Rollback mitigation, reassess |
Phase 4: Resolution & Verification
Confirm Stable Resolution
Procedure:
1. Monitor for stability (15-30 minutes)
- Watch key metrics for regression
- Verify no new related alerts
- Check user reports (support channels, social media)
- Confirm database connections stable
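The "watch key metrics for regression" check can be made mechanical: the incident is stable only if every sampled error rate stays under the 1% baseline. A sketch (sample values are illustrative percentages):

```shell
# Return "stable" if every error-rate sample is below the 1% baseline,
# "regressed" at the first sample that is not. Samples are percents.
stable_window() {
  for s in "$@"; do
    # awk handles the floating-point comparison portably
    awk -v v="$s" 'BEGIN { exit (v < 1.0) ? 0 : 1 }' || {
      echo "regressed"
      return 1
    }
  done
  echo "stable"
}

stable_window 0.4 0.6 0.3   # prints "stable"
```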
2. Update status communications
Internal:
✅ **Incident Resolved: API Error Rate Spike**
**Resolution**: Increased database connection pool from 20 to 50 connections
**Root Cause**: Bulk execution feature exceeded pool capacity
**Monitoring**: Stable for 30 minutes, no recurrence
**Next Steps**:
- Postmortem scheduled for tomorrow 2pm
- Action items: Add pool monitoring, load test bulk features
Thanks to @oncall-backend, @platform-team for rapid response.
Customer (if notified):
**Resolved - API Performance Issues**
This incident has been resolved. All services are operating normally.
We apologize for any inconvenience.
3. Close incident record
# Update incident status
curl -X PATCH https://api.so1.io/v1/incidents/INC-20240115-0042 \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{
"status": "resolved",
"resolution_summary": "Increased connection pool size",
"resolved_at": "2024-01-15T14:30:00Z"
}'
4. Schedule postmortem
# Create calendar event (24-48 hours after resolution)
# Title: "Postmortem: INC-20240115-0042 - API Error Rate Spike"
# Attendees: Commander, responders, affected team leads
# Duration: 60 minutes
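The 24-48 hour scheduling window can be computed from the resolution timestamp recorded in step 3; a sketch using GNU `date` and this runbook's example timestamps:

```shell
# Compute the 24-48h postmortem window from the resolved_at timestamp.
# Requires GNU date (-d); the timestamp is this runbook's example.
resolved_epoch=$(date -u -d "2024-01-15T14:30:00Z" +%s)
earliest=$(date -u -d "@$((resolved_epoch + 24 * 3600))" +"%Y-%m-%dT%H:%M:%SZ")
latest=$(date -u -d "@$((resolved_epoch + 48 * 3600))" +"%Y-%m-%dT%H:%M:%SZ")
echo "postmortem window: $earliest .. $latest"
```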
Verification:
- ✅ Metrics stable for 30+ minutes
- ✅ All stakeholders notified of resolution
- ✅ Incident record updated to “resolved”
- ✅ Postmortem scheduled within 48 hours
Phase 5: Postmortem Analysis
Conduct Blameless Postmortem
When: 24-48 hours after incident resolution
Procedure:
1. Invoke Postmortem Analyst agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{
"agent": "postmortem-analyst",
"inputs": {
"incident_id": "INC-20240115-0042",
"analysis_method": "5_whys",
"responders": ["oncall-backend", "platform-team"],
"timeline_source": "incident_record"
}
}'
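For orientation, the agent's `5_whys` method produces a chain of the shape below; the specific answers are illustrative, built from this runbook's running connection-pool example, not real agent output:

```shell
# Print a numbered 5-whys chain. The questions and answers are illustrative.
n=0
while IFS= read -r why; do
  n=$((n + 1))
  printf '%d. %s\n' "$n" "$why"
done <<'EOF'
Why did the API error rate spike? The database connection pool was exhausted.
Why was the pool exhausted? Bulk executions doubled concurrent queries.
Why did load exceed capacity? Pool size was never validated at 2x traffic.
Why was it not validated? No load test covered the bulk execution feature.
Why was there no load test? Pool sizing was not part of release criteria.
EOF
echo "chain length: $n"
```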
2. Review generated postmortem draft
- Verify timeline accuracy
- Validate root cause analysis
- Review action items for completeness
- Add “what went well” and “where we got lucky”
3. Conduct postmortem meeting (60 minutes)
- Present timeline and root cause
- Discuss contributing factors (no blame)
- Review action items and assign owners
- Capture additional learnings
4. Publish final postmortem
# Location: so1-io/so1-content/postmortems/2024-01-15-api-error-rate.md
# Share internally via Slack, email
# Archive in knowledge base
Verification:
- ✅ Root cause identified with >85% confidence
- ✅ Action items created with owners and due dates
- ✅ Postmortem published within 72 hours
- ✅ Learnings incorporated into runbooks
Common Incident Scenarios
Complete Service Outage (SEV0)
Symptoms: All services returning 5xx, users cannot access platform
Quick Actions:
- Declare SEV0, page all hands
- Check Railway infrastructure status
- Verify database cluster health
- Check DNS resolution
- Review recent deployments (last 4 hours)
- If deployment-related: Rollback immediately
- If infrastructure: Contact Railway support, prepare failover
- Communicate to customers within 5 minutes
Related Agents: Triage Responder → Incident Commander → Railway Deployer
Database Failure (SEV0)
Symptoms: Connection errors, query timeouts, replication lag
Quick Actions:
- Check Railway database metrics (CPU, memory, connections)
- Verify primary/replica status
- Check for long-running queries:
SELECT *
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < NOW() - INTERVAL '5 minutes';
- If cluster down: Initiate failover to replica
- If connection exhaustion: Restart connection pooler
- If slow queries: Kill blocking queries
- Monitor replication lag after failover
Critical Decision: Failover within 10 minutes if primary unresponsive
Security Breach (SEV0)
Symptoms: Unauthorized access, data exfiltration, anomalous activity
Quick Actions:
- DO NOT publicly disclose until assessed
- Rotate all API keys and credentials immediately
- Review audit logs for unauthorized access
- Isolate affected services if necessary
- Contact security team and legal
- Preserve evidence (logs, metrics, database dumps)
- Prepare customer communication (coordinate with legal)
Escalation: Immediate executive and legal notification
Escalation Contacts
| Role | Contact Method | Response Time |
|---|---|---|
| On-Call Engineer | PagerDuty | <5 minutes |
| Engineering Lead | PagerDuty + Phone | <10 minutes |
| Platform Team | Slack: @platform-team | <15 minutes |
| Railway Support | support@railway.app | <30 minutes |
| Executive (SEV0) | Phone tree | <15 minutes |