
Overview

This runbook provides step-by-step guidance for responding to production incidents in the SO1 platform, covering the complete lifecycle from initial detection through postmortem analysis.
Production Critical: This runbook contains emergency procedures. Bookmark this page for rapid access during incidents.

Severity Levels

| Level | Criteria | Response Time | Commander |
|-------|----------|---------------|-----------|
| SEV0 | Complete outage, data loss, security breach | Immediate | Engineering lead + Executive |
| SEV1 | Major feature down, significant user impact | <5 minutes | Senior engineer |
| SEV2 | Degraded performance, moderate impact | <15 minutes | On-call engineer |
| SEV3 | Minor issues, workarounds available | <1 hour | On-call engineer |
| SEV4 | Cosmetic issues, no functional impact | Best effort | Any engineer |

Quick Start: Incident Response

1. Detect & Triage (0-2 min): Alert received → Assess severity → Page responders
2. Declare & Coordinate (2-5 min): Create war room → Assign commander → Initial communication
3. Investigate & Mitigate (5-60 min): Gather context → Identify root cause → Apply mitigation
4. Resolve & Verify (Variable): Confirm resolution → Monitor stability → Close incident
5. Postmortem (24-48 hours): Analyze root cause → Create action items → Document learnings

Phase 1: Detection & Triage

Automated Alert Reception

When: Alert fires from a monitoring system (Datadog, PagerDuty, Sentry)
Procedure:
  1. Acknowledge alert within 2 minutes
    # PagerDuty: Press "Acknowledge" in app or SMS reply "ack"
    # Slack: React with 👀 to alert message
    
  2. Invoke Triage Responder agent
    curl -X POST https://api.so1.io/v1/orchestrate \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "agent": "triage-responder",
        "inputs": {
          "alert_id": "pd-alert-123",
          "alert_source": "pagerduty",
          "alert_data": {
            "service": "so1-control-plane-api",
            "metric": "error_rate",
            "current_value": "12%",
            "threshold": "< 1%"
          }
        }
      }'
    
  3. Review triage report (30 seconds)
    • Check severity classification (SEV0-4)
    • Review impact estimate (users affected, services)
    • Note initial hypothesis and investigation steps
  4. Decide escalation
    • SEV0-1: Immediate escalation → Phase 2
    • SEV2-3: Standard escalation → Phase 2
    • SEV4: Create ticket, no incident
    • Non-incident: Close alert, monitor
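The escalation decision in step 4 can be sketched as a small function. This is an illustrative sketch, not part of the SO1 tooling: the function name is ours, while the severity-to-action mapping comes from this runbook.

```python
# Hypothetical sketch of step 4's escalation decision; the mapping mirrors
# this runbook, the function itself is illustrative.
def escalation_action(severity: str) -> str:
    """Map a triage severity (or non-incident) to the runbook's next step."""
    if severity in ("SEV0", "SEV1"):
        return "immediate-escalation"  # go straight to Phase 2
    if severity in ("SEV2", "SEV3"):
        return "standard-escalation"   # proceed to Phase 2
    if severity == "SEV4":
        return "create-ticket"         # no incident declared
    return "close-and-monitor"         # non-incident: close alert, monitor
```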
Verification:
  • ✅ Severity assigned with confidence >70%
  • ✅ Impact scope estimated
  • ✅ Initial hypothesis documented
Troubleshooting:
| Issue | Cause | Resolution |
|-------|-------|------------|
| Alert storm (10+ alerts) | Cascading failure | Correlate alerts, treat as single incident |
| Low confidence (<70%) | Unclear symptoms | Escalate with "investigation required" status |
| Health endpoints down | Service unavailable | Use last known state, assume SEV1 minimum |
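For the alert-storm case, correlation can start as simply as grouping the burst by service before declaring a single incident. A minimal sketch; the `service`/`metric` field names are illustrative, not a documented alert schema:

```python
from collections import defaultdict

def correlate_alerts(alerts):
    """Group a burst of alerts by service so a cascading failure can be
    triaged as one incident. The `service`/`metric` keys are illustrative."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["service"]].append(alert["metric"])
    return dict(grouped)
```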

Phase 2: Incident Declaration

Create War Room & Assign Commander

When: SEV0-3 incident confirmed by Triage Responder
Procedure:
  1. Create incident channel (1 minute)
    # Slack command
    /incident create SEV2 "API error rate spike"
    
    # Or manual:
    # Create channel: #incident-YYYYMMDD-XXXX
    # Set topic: "SEV2: API error rate spike | Commander: @oncall-backend"
    
  2. Invoke Incident Commander agent
    curl -X POST https://api.so1.io/v1/orchestrate \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "agent": "incident-commander",
        "inputs": {
          "incident_id": "INC-20240115-0042",
          "severity": "SEV2",
          "triage_report": {
            "symptoms": ["High error rate", "Connection pool exhaustion"],
            "impact": {"users_affected": 150, "services": ["control-plane-api"]},
            "hypothesis": ["Connection pool undersized"]
          },
          "commander": "oncall-backend",
          "channel": "#incident-20240115-0042"
        }
      }'
    
  3. Send initial communication (2 minutes)
    Internal (Slack):
    🔴 **SEV2 Incident Declared: API Error Rate Spike**
    
    **Status**: Investigating
    **Impact**: 150 users, workflow executions failing
    **Commander**: @oncall-backend
    **War Room**: #incident-20240115-0042
    
    Investigating database connection pool exhaustion. Updates every 15 min.
    
    Customer (Status Page) - Only if SEV0-1:
    **Investigating - API Performance Issues**
    
    We are currently experiencing elevated error rates affecting workflow executions. 
    Our team is actively investigating. Next update in 15 minutes.
    
  4. Page additional responders (SEV0-1 only)
    # Escalate to leadership
    /pd escalate to engineering-lead
    
    # All-hands notification (SEV0 only)
    @channel 🚨 SEV0 INCIDENT: Complete outage. Join #incident-YYYYMMDD-XXXX
    
Verification:
  • ✅ War room created and active
  • ✅ Commander assigned and present
  • ✅ Initial communication sent (<5 min from detection)
  • ✅ Incident Commander agent running
Troubleshooting:
| Issue | Cause | Resolution |
|-------|-------|------------|
| Commander unavailable | PTO, no response | Assign backup from rotation |
| Unclear ownership | Multi-team incident | Assign most affected service owner |
| Status page down | Meta-incident | Use Twitter/email for customer comms |
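The incident identifiers used throughout this runbook (`INC-YYYYMMDD-XXXX` records and `#incident-YYYYMMDD-XXXX` channels) are easiest to keep consistent when generated in one place. A minimal sketch, assuming your incident tooling supplies a per-day sequence number; the function name is ours:

```python
from datetime import date

def incident_names(seq: int, day: date) -> tuple[str, str]:
    """Build the incident record ID and Slack channel name in the formats
    this runbook uses. `seq` is a per-day counter (assumed to come from
    your incident tooling)."""
    stamp = day.strftime("%Y%m%d")
    return f"INC-{stamp}-{seq:04d}", f"#incident-{stamp}-{seq:04d}"
```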

Phase 3: Investigation & Mitigation

Root Cause Investigation

Procedure:
  1. Gather context (5 minutes)
    Check recent deployments:
    # Via Control Plane API
    curl https://api.so1.io/v1/deployments/recent?hours=24 \
      -H "Authorization: Bearer $SO1_API_KEY"
    
    # Via Railway
    railway status --service so1-control-plane-api
    
    Review metrics:
    # Datadog dashboard
    open https://app.datadoghq.com/dashboard/so1-production
    
    # Key metrics:
    # - Error rate trend
    # - Response latency (p50, p95, p99)
    # - Database connection pool utilization
    # - CPU/Memory usage
    
    Check logs:
    # Via Datadog
    # Search: service:control-plane-api status:error
    # Time range: Last 30 minutes
    
    # Look for:
    # - Error patterns
    # - Stack traces
    # - Correlation with deployments
    
  2. Identify root cause (10-30 minutes)
    Common patterns:
    | Symptom | Likely Cause | Investigation |
    |---------|--------------|---------------|
    | 5xx spike + pool at 100% | Connection exhaustion | Check pool config vs load |
    | Latency spike | Slow queries, N+1 pattern | Review slow query logs |
    | Gradual degradation | Memory leak | Check memory trends over hours |
    | Sudden outage | Deployment issue | Compare current vs previous version |
    | Regional issues | Infrastructure problem | Check Railway/AWS status |
  3. Document hypothesis
    # Post in incident channel:
    
    **Hypothesis**: Database connection pool exhaustion
    **Confidence**: 85%
    **Evidence**:
    - Connection pool at 100% utilization
    - Errors: "Connection timeout after 5000ms"
    - Traffic 2x normal due to bulk execution feature
    - Deployed 4 hours ago
    
    **Investigation steps**:
    1. Check connection pool configuration ✅
    2. Review bulk execution concurrency ✅
    3. Assess if pool can be increased 🔄
    

Apply Mitigation

Procedure:
  1. Choose mitigation strategy
    | Strategy | When to Use | Risk | Speed |
    |----------|-------------|------|-------|
    | Configuration change | Wrong setting identified | Low | Fast (minutes) |
    | Rollback | Recent deployment caused issue | Medium | Fast (5-10 min) |
    | Scale up | Resource exhaustion | Low | Medium (10-15 min) |
    | Feature flag disable | Specific feature causing issue | Low | Fast (seconds) |
    | Restart | Unknown transient issue | Medium | Fast (1-5 min) |
    | Failover | Primary system failed | High | Medium (10-20 min) |
  2. Execute mitigation
    Example: Increase connection pool:
    # Via Railway
    railway variables set DATABASE_POOL_SIZE=50 --service control-plane-api
    railway up --service control-plane-api
    
    # Wait for deployment (2-3 minutes)
    railway logs --service control-plane-api --follow
    
    Example: Rollback deployment:
    # Via Railway Deployer agent
    curl -X POST https://api.so1.io/v1/orchestrate \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "agent": "railway-deployer",
        "inputs": {
          "action": "rollback",
          "service": "control-plane-api",
          "target_version": "previous"
        }
      }'
    
    Example: Disable feature flag:
    # Via feature flag service
    curl -X PATCH https://api.so1.io/v1/feature-flags/bulk-execution \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"enabled": false}'
    
  3. Monitor mitigation impact (5-15 minutes)
    # Watch error rate
    # Datadog: Monitor error rate graph
    # Target: Return to <1% within 5 minutes
    
    # Check user impact
    # Datadog: Track successful workflow executions
    # Target: Success rate >95%
    
    # Verify service health
    curl https://api.so1.io/v1/health
    # Expected: {"status": "healthy", "checks": {"database": "ok", ...}}
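
The health-check response above can be validated mechanically instead of eyeballed. A minimal sketch that treats the service as healthy only when every sub-check passes; the response shape is taken from the `/v1/health` example above:

```python
def is_healthy(health: dict) -> bool:
    """True when overall status is 'healthy' and every check reports 'ok'.
    The response shape follows the /v1/health example above."""
    checks = health.get("checks", {})
    return health.get("status") == "healthy" and bool(checks) and all(
        v == "ok" for v in checks.values()
    )
```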
    
Verification:
  • ✅ Error rate returned to baseline (<1%)
  • ✅ Response latency normalized (<500ms p99)
  • ✅ No new alerts firing
  • ✅ Service health checks passing
  • ✅ User-facing functionality restored
Troubleshooting:
| Issue | Cause | Resolution |
|-------|-------|------------|
| Mitigation ineffective | Wrong root cause | Try alternative hypothesis |
| Partial improvement | Multiple issues | Address remaining factors |
| New symptoms appear | Side effect of mitigation | Rollback mitigation, reassess |

Phase 4: Resolution & Verification

Confirm Stable Resolution

Procedure:
  1. Monitor for stability (15-30 minutes)
    • Watch key metrics for regression
    • Verify no new related alerts
    • Check user reports (support channels, social media)
    • Confirm database connections stable
  2. Update status communications
    Internal:
    ✅ **Incident Resolved: API Error Rate Spike**
    
    **Resolution**: Increased database connection pool from 20 to 50 connections
    **Root Cause**: Bulk execution feature exceeded pool capacity
    **Monitoring**: Stable for 30 minutes, no recurrence
    
    **Next Steps**:
    - Postmortem scheduled for tomorrow 2pm
    - Action items: Add pool monitoring, load test bulk features
    
    Thanks to @oncall-backend, @platform-team for rapid response.
    
    Customer (if notified):
    **Resolved - API Performance Issues**
    
    This incident has been resolved. All services are operating normally. 
    We apologize for any inconvenience.
    
  3. Close incident record
    # Update incident status
    curl -X PATCH https://api.so1.io/v1/incidents/INC-20240115-0042 \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "status": "resolved",
        "resolution_summary": "Increased connection pool size",
        "resolved_at": "2024-01-15T14:30:00Z"
      }'
    
  4. Schedule postmortem
    # Create calendar event (24-48 hours after resolution)
    # Title: "Postmortem: INC-20240115-0042 - API Error Rate Spike"
    # Attendees: Commander, responders, affected team leads
    # Duration: 60 minutes
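
The "stable for 30+ minutes" criterion used when closing an incident can also be checked mechanically. A minimal sketch, assuming per-minute error-rate samples pulled from your metrics store; the function and parameter names are illustrative:

```python
def stable(error_rates, baseline=0.01, window=30):
    """True when the most recent `window` samples (assumed one per minute)
    are all at or below `baseline` (1%, this runbook's target).
    Requires a full window before declaring stability."""
    recent = error_rates[-window:]
    return len(recent) >= window and all(r <= baseline for r in recent)
```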
    
Verification:
  • ✅ Metrics stable for 30+ minutes
  • ✅ All stakeholders notified of resolution
  • ✅ Incident record updated to “resolved”
  • ✅ Postmortem scheduled within 48 hours

Phase 5: Postmortem Analysis

Conduct Blameless Postmortem

When: 24-48 hours after incident resolution
Procedure:
  1. Invoke Postmortem Analyst agent
    curl -X POST https://api.so1.io/v1/orchestrate \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "agent": "postmortem-analyst",
        "inputs": {
          "incident_id": "INC-20240115-0042",
          "analysis_method": "5_whys",
          "responders": ["oncall-backend", "platform-team"],
          "timeline_source": "incident_record"
        }
      }'
    
  2. Review generated postmortem draft
    • Verify timeline accuracy
    • Validate root cause analysis
    • Review action items for completeness
    • Add “what went well” and “where we got lucky”
  3. Conduct postmortem meeting (60 minutes)
    • Present timeline and root cause
    • Discuss contributing factors (no blame)
    • Review action items and assign owners
    • Capture additional learnings
  4. Publish final postmortem
    # Location: so1-io/so1-content/postmortems/2024-01-15-api-error-rate.md
    # Share internally via Slack, email
    # Archive in knowledge base
    
Verification:
  • ✅ Root cause identified with >85% confidence
  • ✅ Action items created with owners and due dates
  • ✅ Postmortem published within 72 hours
  • ✅ Learnings incorporated into runbooks

Common Incident Scenarios

Complete Service Outage (SEV0)

Symptoms: All services returning 5xx, users cannot access platform
Quick Actions:
  1. Declare SEV0, page all hands
  2. Check Railway infrastructure status
  3. Verify database cluster health
  4. Check DNS resolution
  5. Review recent deployments (last 4 hours)
  6. If deployment-related: Rollback immediately
  7. If infrastructure: Contact Railway support, prepare failover
  8. Communicate to customers within 5 minutes
Related Agents: Triage Responder → Incident Commander → Railway Deployer

Database Failure (SEV0)

Symptoms: Connection errors, query timeouts, replication lag
Quick Actions:
  1. Check Railway database metrics (CPU, memory, connections)
  2. Verify primary/replica status
  3. Check for long-running queries:
    SELECT * FROM pg_stat_activity
    WHERE state = 'active' AND query_start < NOW() - INTERVAL '5 minutes';
  4. If cluster down: Initiate failover to replica
  5. If connection exhaustion: Restart connection pooler
  6. If slow queries: Kill blocking queries
  7. Monitor replication lag after failover
Critical Decision: Failover within 10 minutes if primary unresponsive
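If you pull pg_stat_activity rows programmatically rather than in psql, the 5-minute filter from step 3 can be mirrored client-side. A sketch; the row keys echo pg_stat_activity columns, but the function itself is illustrative:

```python
from datetime import datetime, timedelta

def long_running(rows, now, threshold=timedelta(minutes=5)):
    """Filter pg_stat_activity-style rows to active queries that started
    more than `threshold` ago (mirrors the SQL in step 3 above; the
    `state`/`query_start` keys are illustrative)."""
    return [r for r in rows
            if r["state"] == "active" and now - r["query_start"] > threshold]
```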

Security Breach (SEV0)

Symptoms: Unauthorized access, data exfiltration, anomalous activity
Quick Actions:
  1. DO NOT publicly disclose until assessed
  2. Rotate all API keys and credentials immediately
  3. Review audit logs for unauthorized access
  4. Isolate affected services if necessary
  5. Contact security team and legal
  6. Preserve evidence (logs, metrics, database dumps)
  7. Prepare customer communication (coordinate with legal)
Escalation: Immediate executive and legal notification

Emergency Contacts

| Role | Contact Method | Response Time |
|------|----------------|---------------|
| On-Call Engineer | PagerDuty | <5 minutes |
| Engineering Lead | PagerDuty + Phone | <10 minutes |
| Platform Team | Slack: @platform-team | <15 minutes |
| Railway Support | support@railway.app | <30 minutes |
| Executive (SEV0) | Phone tree | <15 minutes |