Overview
This runbook provides step-by-step guidance for responding to production incidents in the SO1 platform, covering the complete lifecycle from initial detection through postmortem analysis.
Production Critical: This runbook contains emergency procedures. Bookmark this page for rapid access during incidents.
Severity Levels
| Level | Criteria | Response Time | Commander |
|---|---|---|---|
| SEV0 | Complete outage, data loss, security breach | Immediate | Engineering lead + Executive |
| SEV1 | Major feature down, significant user impact | <5 minutes | Senior engineer |
| SEV2 | Degraded performance, moderate impact | <15 minutes | On-call engineer |
| SEV3 | Minor issues, workarounds available | <1 hour | On-call engineer |
| SEV4 | Cosmetic issues, no functional impact | Best effort | Any engineer |
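For local tooling, the severity table above can be encoded directly; a minimal sketch (the `response_deadline` helper is hypothetical, not part of any SO1 CLI):

```shell
# Hypothetical helper mirroring the severity table; not part of SO1 tooling.
response_deadline() {
  case "$1" in
    SEV0) echo "immediate" ;;
    SEV1) echo "5 minutes" ;;
    SEV2) echo "15 minutes" ;;
    SEV3) echo "1 hour" ;;
    SEV4) echo "best effort" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_deadline SEV1   # prints "5 minutes"
```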
Quick Start: Incident Response
Detect & Triage (0-2 min)
Alert received → Assess severity → Page responders
Declare & Coordinate (2-5 min)
Create war room → Assign commander → Initial communication
Investigate & Mitigate (5-60 min)
Gather context → Identify root cause → Apply mitigation
Resolve & Verify (Variable)
Confirm resolution → Monitor stability → Close incident
Postmortem (24-48 hours)
Analyze root cause → Create action items → Document learnings
Phase 1: Detection & Triage
Automated Alert Reception
When: Alert fires from monitoring system (Datadog, PagerDuty, Sentry)
Procedure:
1. Acknowledge alert within 2 minutes
# PagerDuty: Press "Acknowledge" in app or SMS reply "ack"
# Slack: React with 👀 to alert message
2. Invoke Triage Responder agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{
"agent": "triage-responder",
"inputs": {
"alert_id": "pd-alert-123",
"alert_source": "pagerduty",
"alert_data": {
"service": "so1-control-plane-api",
"metric": "error_rate",
"current_value": "12%",
"threshold": "< 1%"
}
}
}'
3. Review triage report (30 seconds)
- Check severity classification (SEV0-4)
- Review impact estimate (users affected, services)
- Note initial hypothesis and investigation steps
4. Decide escalation
- SEV0-1: Immediate escalation → Phase 2
- SEV2-3: Standard escalation → Phase 2
- SEV4: Create ticket, no incident
- Non-incident: Close alert, monitor
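The step-4 escalation decision, combined with the <70%-confidence rule used in triage, can be sketched as a small helper (the function name and its outputs are illustrative, not SO1 tooling):

```shell
# Illustrative sketch of the step-4 escalation decision.
# confidence is an integer percent, mirroring the >70% triage threshold.
escalation_path() {
  severity="$1"
  confidence="$2"
  if [ "$confidence" -lt 70 ]; then
    echo "investigation-required"
    return 0
  fi
  case "$severity" in
    SEV0|SEV1) echo "immediate-escalation" ;;
    SEV2|SEV3) echo "standard-escalation" ;;
    SEV4)      echo "ticket-only" ;;
    *)         echo "close-and-monitor" ;;
  esac
}

escalation_path SEV2 85   # prints "standard-escalation"
```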
Verification:
- ✅ Severity assigned with confidence >70%
- ✅ Impact scope estimated
- ✅ Initial hypothesis documented
Troubleshooting:
| Issue | Cause | Resolution |
|---|---|---|
| Alert storm (10+ alerts) | Cascading failure | Correlate alerts, treat as single incident |
| Low confidence (<70%) | Unclear symptoms | Escalate with “investigation required” status |
| Health endpoints down | Service unavailable | Use last known state, assume SEV1 minimum |
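The alert-storm guidance above boils down to correlating alerts by service before declaring separate incidents; a sketch with illustrative alert identifiers (`worker-pool` and the metric names are made up for this example):

```shell
# Collapse a storm of alerts to the set of affected services.
# Identifier format source:service:metric is illustrative.
alerts='pagerduty:control-plane-api:error_rate
pagerduty:control-plane-api:latency_p99
sentry:control-plane-api:exception_rate
datadog:worker-pool:queue_depth'

affected=$(printf '%s\n' "$alerts" | cut -d: -f2 | sort -u)
echo "$affected"
# Two distinct services across four alerts -> likely one cascading
# incident, not four separate ones.
echo "distinct services: $(printf '%s\n' "$affected" | wc -l)"
```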
Phase 2: Incident Declaration
Create War Room & Assign Commander
When: SEV0-3 incident confirmed by Triage Responder
Procedure:
1. Create incident channel (1 minute)
# Slack command
/incident create SEV2 "API error rate spike"
# Or manual:
# Create channel: #incident-YYYYMMDD-XXXX
# Set topic: "SEV2: API error rate spike | Commander: @oncall-backend"
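The manual channel-naming convention can be scripted; a sketch (the sequence number would come from your incident tracker, and 0042 is this runbook's running example):

```shell
# Build the war-room channel name in the #incident-YYYYMMDD-XXXX convention.
# seq_no would come from your incident tracker; 0042 is illustrative.
seq_no="0042"
channel="#incident-$(date -u +%Y%m%d)-${seq_no}"
echo "$channel"
```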
2. Invoke Incident Commander agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{
"agent": "incident-commander",
"inputs": {
"incident_id": "INC-20240115-0042",
"severity": "SEV2",
"triage_report": {
"symptoms": ["High error rate", "Connection pool exhaustion"],
"impact": {"users_affected": 150, "services": ["control-plane-api"]},
"hypothesis": ["Connection pool undersized"]
},
"commander": "oncall-backend",
"channel": "#incident-20240115-0042"
}
}'
3. Send initial communication (2 minutes)
Internal (Slack):
🔴 **SEV2 Incident Declared: API Error Rate Spike**
**Status**: Investigating
**Impact**: 150 users, workflow executions failing
**Commander**: @oncall-backend
**War Room**: #incident-20240115-0042
Investigating database connection pool exhaustion. Updates every 15 min.
Customer (Status Page) - Only if SEV0-1:
**Investigating - API Performance Issues**
We are currently experiencing elevated error rates affecting workflow executions.
Our team is actively investigating. Next update in 15 minutes.
4. Page additional responders (SEV0-1 only)
# Escalate to leadership
/pd escalate to engineering-lead
# All-hands notification (SEV0 only)
@channel 🚨 SEV0 INCIDENT: Complete outage. Join #incident-YYYYMMDD-XXXX
Verification:
- ✅ War room created and active
- ✅ Commander assigned and present
- ✅ Initial communication sent (<5 min from detection)
- ✅ Incident Commander agent running
Troubleshooting:
| Issue | Cause | Resolution |
|---|---|---|
| Commander unavailable | PTO, no response | Assign backup from rotation |
| Unclear ownership | Multi-team incident | Assign most affected service owner |
| Status page down | Meta-incident | Use Twitter/email for customer comms |
Phase 3: Investigation & Mitigation
Root Cause Investigation
Procedure:
1. Gather context (5 minutes)
Check recent deployments:
# Via Control Plane API
curl https://api.so1.io/v1/deployments/recent?hours=24 \
-H "Authorization: Bearer $SO1_API_KEY"
# Via Railway
railway status --service so1-control-plane-api
Review metrics:
# Datadog dashboard
open https://app.datadoghq.com/dashboard/so1-production
# Key metrics:
# - Error rate trend
# - Response latency (p50, p95, p99)
# - Database connection pool utilization
# - CPU/Memory usage
Check logs:
# Via Datadog
# Search: service:control-plane-api status:error
# Time range: Last 30 minutes
# Look for:
# - Error patterns
# - Stack traces
# - Correlation with deployments
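Correlating error onset with recent deployments, as the checklist above suggests, reduces to a timestamp comparison; a sketch with illustrative timestamps and a hypothetical 30-minute suspicion window:

```shell
# Flag a deployment as suspect when errors began shortly after it shipped.
# Timestamps and the 30-minute window are illustrative.
deploy_epoch=$(date -u -d "2024-01-15T10:00:00Z" +%s)
errors_epoch=$(date -u -d "2024-01-15T10:12:00Z" +%s)
delta=$((errors_epoch - deploy_epoch))
if [ "$delta" -ge 0 ] && [ "$delta" -le 1800 ]; then
  echo "suspect: errors began ${delta}s after deploy"
fi
```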
2. Identify root cause (10-30 minutes)
Common patterns:
| Symptom | Likely Cause | Investigation |
|---|---|---|
| 5xx spike + pool at 100% | Connection exhaustion | Check pool config vs load |
| Latency spike | Slow queries, N+1 pattern | Review slow query logs |
| Gradual degradation | Memory leak | Check memory trends over hours |
| Sudden outage | Deployment issue | Compare current vs previous version |
| Regional issues | Infrastructure problem | Check Railway/AWS status |
3. Document hypothesis
# Post in incident channel:
**Hypothesis**: Database connection pool exhaustion
**Confidence**: 85%
**Evidence**:
- Connection pool at 100% utilization
- Errors: "Connection timeout after 5000ms"
- Traffic 2x normal due to bulk execution feature
- Deployed 4 hours ago
**Investigation steps**:
1. Check connection pool configuration ✅
2. Review bulk execution concurrency ✅
3. Assess if pool can be increased 🔄
Apply Mitigation
Procedure:
1. Choose mitigation strategy
| Strategy | When to Use | Risk | Speed |
|---|---|---|---|
| Configuration change | Wrong setting identified | Low | Fast (minutes) |
| Rollback | Recent deployment caused issue | Medium | Fast (5-10 min) |
| Scale up | Resource exhaustion | Low | Medium (10-15 min) |
| Feature flag disable | Specific feature causing issue | Low | Fast (seconds) |
| Restart | Unknown transient issue | Medium | Fast (1-5 min) |
| Failover | Primary system failed | High | Medium (10-20 min) |
2. Execute mitigation
Example: Increase connection pool:
# Via Railway
railway variables set DATABASE_POOL_SIZE=50 --service control-plane-api
railway up --service control-plane-api
# Wait for deployment (2-3 minutes)
railway logs --service control-plane-api --follow
Example: Rollback deployment:
# Via Railway Deployer agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{
"agent": "railway-deployer",
"inputs": {
"action": "rollback",
"service": "control-plane-api",
"target_version": "previous"
}
}'
Example: Disable feature flag:
# Via feature flag service
curl -X PATCH https://api.so1.io/v1/feature-flags/bulk-execution \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{"enabled": false}'
3. Monitor mitigation impact (5-15 minutes)
# Watch error rate
# Datadog: Monitor error rate graph
# Target: Return to <1% within 5 minutes
# Check user impact
# Datadog: Track successful workflow executions
# Target: Success rate >95%
# Verify service health
curl https://api.so1.io/v1/health
# Expected: {"status": "healthy", "checks": {"database": "ok", ...}}
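The monitoring loop above can be automated; a sketch where `check_health` is a stub standing in for the real probe against `https://api.so1.io/v1/health`:

```shell
# Stub for illustration; in production, replace with a curl to the health
# endpoint plus JSON parsing of the "status" field.
check_health() { echo "healthy"; }

# Poll until healthy or the attempt budget runs out.
watch_recovery() {
  attempts="$1"
  interval="$2"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(check_health)" = "healthy" ]; then
      echo "recovered"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "still-degraded"
  return 1
}

watch_recovery 30 10   # with the stub, prints "recovered" immediately
```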
Verification:
- ✅ Error rate returned to baseline (<1%)
- ✅ Response latency normalized (<500ms p99)
- ✅ No new alerts firing
- ✅ Service health checks passing
- ✅ User-facing functionality restored
Troubleshooting:
| Issue | Cause | Resolution |
|---|
| Mitigation ineffective | Wrong root cause | Try alternative hypothesis |
| Partial improvement | Multiple issues | Address remaining factors |
| New symptoms appear | Side effect of mitigation | Rollback mitigation, reassess |
Phase 4: Resolution & Verification
Confirm Stable Resolution
Procedure:
1. Monitor for stability (15-30 minutes)
- Watch key metrics for regression
- Verify no new related alerts
- Check user reports (support channels, social media)
- Confirm database connections stable
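The "watch key metrics for regression" check can be made mechanical: the incident is stable only if every sampled error rate stays under the 1% baseline. A sketch (sample values are illustrative percentages):

```shell
# Return "stable" if every error-rate sample is below the 1% baseline,
# "regressed" at the first sample that is not. Samples are percents.
stable_window() {
  for s in "$@"; do
    # awk handles the floating-point comparison portably
    awk -v v="$s" 'BEGIN { exit (v < 1.0) ? 0 : 1 }' || {
      echo "regressed"
      return 1
    }
  done
  echo "stable"
}

stable_window 0.4 0.6 0.3   # prints "stable"
```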
2. Update status communications
Internal:
✅ **Incident Resolved: API Error Rate Spike**
**Resolution**: Increased database connection pool from 20 to 50 connections
**Root Cause**: Bulk execution feature exceeded pool capacity
**Monitoring**: Stable for 30 minutes, no recurrence
**Next Steps**:
- Postmortem scheduled for tomorrow 2pm
- Action items: Add pool monitoring, load test bulk features
Thanks to @oncall-backend, @platform-team for rapid response.
Customer (if notified):
**Resolved - API Performance Issues**
This incident has been resolved. All services are operating normally.
We apologize for any inconvenience.
3. Close incident record
# Update incident status
curl -X PATCH https://api.so1.io/v1/incidents/INC-20240115-0042 \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{
"status": "resolved",
"resolution_summary": "Increased connection pool size",
"resolved_at": "2024-01-15T14:30:00Z"
}'
4. Schedule postmortem
# Create calendar event (24-48 hours after resolution)
# Title: "Postmortem: INC-20240115-0042 - API Error Rate Spike"
# Attendees: Commander, responders, affected team leads
# Duration: 60 minutes
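The 24-48 hour scheduling window can be computed from the resolution timestamp recorded in step 3; a sketch using GNU `date` and this runbook's example timestamps:

```shell
# Compute the 24-48h postmortem window from the resolved_at timestamp.
# Requires GNU date (-d); the timestamp is this runbook's example.
resolved_epoch=$(date -u -d "2024-01-15T14:30:00Z" +%s)
earliest=$(date -u -d "@$((resolved_epoch + 24 * 3600))" +"%Y-%m-%dT%H:%M:%SZ")
latest=$(date -u -d "@$((resolved_epoch + 48 * 3600))" +"%Y-%m-%dT%H:%M:%SZ")
echo "postmortem window: $earliest .. $latest"
```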
Verification:
- ✅ Metrics stable for 30+ minutes
- ✅ All stakeholders notified of resolution
- ✅ Incident record updated to “resolved”
- ✅ Postmortem scheduled within 48 hours
Phase 5: Postmortem Analysis
Conduct Blameless Postmortem
When: 24-48 hours after incident resolution
Procedure:
1. Invoke Postmortem Analyst agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-d '{
"agent": "postmortem-analyst",
"inputs": {
"incident_id": "INC-20240115-0042",
"analysis_method": "5_whys",
"responders": ["oncall-backend", "platform-team"],
"timeline_source": "incident_record"
}
}'
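For orientation, the agent's `5_whys` method produces a chain of the shape below; the specific answers are illustrative, built from this runbook's running connection-pool example, not real agent output:

```shell
# Print a numbered 5-whys chain. The questions and answers are illustrative.
n=0
while IFS= read -r why; do
  n=$((n + 1))
  printf '%d. %s\n' "$n" "$why"
done <<'EOF'
Why did the API error rate spike? The database connection pool was exhausted.
Why was the pool exhausted? Bulk executions doubled concurrent queries.
Why did load exceed capacity? Pool size was never validated at 2x traffic.
Why was it not validated? No load test covered the bulk execution feature.
Why was there no load test? Pool sizing was not part of release criteria.
EOF
echo "chain length: $n"
```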
2. Review generated postmortem draft
- Verify timeline accuracy
- Validate root cause analysis
- Review action items for completeness
- Add “what went well” and “where we got lucky”
3. Conduct postmortem meeting (60 minutes)
- Present timeline and root cause
- Discuss contributing factors (no blame)
- Review action items and assign owners
- Capture additional learnings
4. Publish final postmortem
# Location: so1-io/so1-content/postmortems/2024-01-15-api-error-rate.md
# Share internally via Slack, email
# Archive in knowledge base
Verification:
- ✅ Root cause identified with >85% confidence
- ✅ Action items created with owners and due dates
- ✅ Postmortem published within 72 hours
- ✅ Learnings incorporated into runbooks
Common Incident Scenarios
Complete Service Outage (SEV0)
Symptoms: All services returning 5xx, users cannot access platform
Quick Actions:
- Declare SEV0, page all hands
- Check Railway infrastructure status
- Verify database cluster health
- Check DNS resolution
- Review recent deployments (last 4 hours)
- If deployment-related: Rollback immediately
- If infrastructure: Contact Railway support, prepare failover
- Communicate to customers within 5 minutes
Related Agents: Triage Responder → Incident Commander → Railway Deployer
Database Failure (SEV0)
Symptoms: Connection errors, query timeouts, replication lag
Quick Actions:
- Check Railway database metrics (CPU, memory, connections)
- Verify primary/replica status
- Check for long-running queries:
SELECT *
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < NOW() - INTERVAL '5 minutes';
- If cluster down: Initiate failover to replica
- If connection exhaustion: Restart connection pooler
- If slow queries: Kill blocking queries
- Monitor replication lag after failover
Critical Decision: Failover within 10 minutes if primary unresponsive
Security Breach (SEV0)
Symptoms: Unauthorized access, data exfiltration, anomalous activity
Quick Actions:
- DO NOT publicly disclose until assessed
- Rotate all API keys and credentials immediately
- Review audit logs for unauthorized access
- Isolate affected services if necessary
- Contact security team and legal
- Preserve evidence (logs, metrics, database dumps)
- Prepare customer communication (coordinate with legal)
Escalation: Immediate executive and legal notification
Escalation Contacts
| Role | Contact Method | Response Time |
|---|---|---|
| On-Call Engineer | PagerDuty | <5 minutes |
| Engineering Lead | PagerDuty + Phone | <10 minutes |
| Platform Team | Slack: @platform-team | <15 minutes |
| Railway Support | support@railway.app | <30 minutes |
| Executive (SEV0) | Phone tree | <15 minutes |