Emergency Response Runbook

Incident Classification & Response Matrix


1. Critical: Complete Service Outage

Detection: No requests being processed, all pods down or in crash loop, health checks failing.

Immediate Response (First 5 minutes)

#!/bin/bash
# Execute from any machine with kubectl access

NAMESPACE="sparki-engine"
TIMESTAMP=$(date +%Y-%m-%d_%H-%M-%S)
INCIDENT_ID="INCIDENT-$TIMESTAMP"

echo "🚨 CRITICAL INCIDENT: Complete Service Outage - $INCIDENT_ID"

# 1. Declare Incident
echo "[$(date)] Declaring SEV-1 incident: $INCIDENT_ID"
# Post to #incidents Slack channel (manual or via webhook)
curl -X POST -H 'Content-type: application/json' \
  --data "{\"text\":\"🚨 SEV-1: Complete Service Outage - $INCIDENT_ID. Incident commander assigned. War room open.\"}" \
  $SLACK_WEBHOOK_URL

# 2. Immediate Diagnostic
echo "Collecting diagnostics..."
kubectl get nodes -o wide > $INCIDENT_ID-nodes.txt
kubectl get pods -n $NAMESPACE -o wide > $INCIDENT_ID-pods.txt
kubectl get svc -n $NAMESPACE -o wide > $INCIDENT_ID-services.txt
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' > $INCIDENT_ID-events.txt

# 3. Check Service Status
echo "Service selector:"
kubectl get svc sparki-engine-lb -n $NAMESPACE -o jsonpath='{.spec.selector}' | jq '.'

# 4. Check All Deployments
echo "Deployment status:"
kubectl get deployments -n $NAMESPACE -o wide

# 5. Check Recent Pod Errors
echo "Recent pod errors:"
kubectl describe pod -n $NAMESPACE | grep -E -A 5 "Error|Failed|CrashLoop"

Decision Tree

Service Down?
├─ Yes, pods not running/CrashLoop
│  ├─ Node failures → See "Node Failure" section
│  ├─ Image pull errors → Check ECR, image tag
│  ├─ Database unavailable → Check RDS connectivity
│  └─ Resource exhaustion → Scale up or free resources
└─ Pods running but not responding
   ├─ Check application logs → Look for panic/crashes
   ├─ Check health endpoint → Curl from pod
   └─ Check external connectivity → DNS, load balancer
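The decision tree above can be sketched as a triage helper. This is a minimal sketch: `classify_pod_reason` is a pure mapping from a pod's waiting reason to the branch to follow, and the commented kubectl loop assumes the namespace and labels used elsewhere in this runbook.

```shell
#!/bin/bash
# classify_pod_reason maps a pod status reason to the triage branch above.
classify_pod_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "image-pull: check ECR and image tag" ;;
    CrashLoopBackOff)              echo "crashloop: check logs, DB connectivity" ;;
    OOMKilled|Evicted)             echo "resources: scale up or free resources" ;;
    Running)                       echo "running: check health endpoint from pod" ;;
    *)                             echo "unknown: inspect events and node status" ;;
  esac
}

# Live usage (uncomment to run against the cluster):
# NAMESPACE="sparki-engine"
# for pod in $(kubectl get pods -n $NAMESPACE -o name); do
#   reason=$(kubectl get "$pod" -n $NAMESPACE \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')
#   echo "$pod -> $(classify_pod_reason "${reason:-Running}")"
# done
```

Note that terminated-state reasons (e.g. OOMKilled) surface under `state.terminated.reason` rather than `state.waiting.reason`; extend the lookup as needed.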

Mitigation Steps (In Priority Order)

Option 1: Scale Horizontally (Quick)
# If pods are crashing but some are recovering:
kubectl scale deployment sparki-engine-blue -n $NAMESPACE --replicas=5

# Wait for recovery
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE --timeout=300s

# If still failing, continue to Option 2
Option 2: Restart Deployment (Medium)
# Force restart all pods
kubectl rollout restart deployment/sparki-engine-blue -n $NAMESPACE

# Monitor restart
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE

# Wait 30 seconds for system stabilization
sleep 30

# Check if recovered
# In-cluster service name; from outside the cluster, use kubectl port-forward
curl http://sparki-engine-lb:8080/health
Option 3: Revert to Last Known Good (Nuclear)
# Revert to previous version immediately
kubectl rollout undo deployment/sparki-engine-blue -n $NAMESPACE

# Wait for rollout
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE --timeout=300s

# Verify service restored
sleep 10
curl http://sparki-engine-lb:8080/health
Option 4: Route to Backup Cluster (Last Resort)
# Update load balancer to point to standby cluster
# This requires pre-configured secondary infrastructure

aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456789ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "prod.sparki.tools",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z9999",
          "DNSName": "backup-cluster-lb.region.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'

echo "✅ Routing to backup cluster (if available)"

Success Criteria

  • Service responding to health checks
  • Error rate < 1%
  • At least 2 pods running and healthy
  • Load balancer detecting healthy backends
  • External connectivity restored
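The criteria above can be checked in one pass. A minimal sketch: `error_rate_ok` and `enough_healthy` are pure helpers; the commented live checks assume the namespace, labels, and service name used elsewhere in this runbook.

```shell
#!/bin/bash
# error_rate_ok: succeeds when errors/total is below 1%.
error_rate_ok() {  # usage: error_rate_ok <errors> <total_requests>
  awk -v e="$1" -v t="$2" 'BEGIN { exit !(t > 0 && (e * 100 / t) < 1) }'
}

# enough_healthy: succeeds when at least 2 pods are ready.
enough_healthy() {  # usage: enough_healthy <ready_pod_count>
  [ "$1" -ge 2 ]
}

# Live checks (uncomment):
# READY=$(kubectl get pods -n sparki-engine -l app=sparki-engine \
#   --field-selector=status.phase=Running --no-headers | wc -l)
# enough_healthy "$READY" && echo "pods: OK" || echo "pods: FAIL"
# curl -fsS http://sparki-engine-lb:8080/health && echo "health: OK"
```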

Escalation Path

  • 5 min: No improvement → Activate backup cluster
  • 10 min: No improvement → Page database admin (check RDS)
  • 15 min: No improvement → Page network admin (check load balancer)
  • 20 min: No improvement → Incident commander initiates manual failover

2. Critical: High Error Rate (>5%)

Detection: Error rate spike above 5% sustained for > 2 minutes.

Immediate Response

#!/bin/bash
NAMESPACE="sparki-engine"
ERROR_THRESHOLD=5.0

echo "🚨 CRITICAL: High Error Rate Detected"

# 1. Quantify Issue
# Approximate: percentage of the last 1000 log lines containing ERROR
ERROR_RATE=$(kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=1000 | grep -c "ERROR" | awk '{print ($1/10)}')
echo "Current error rate: $ERROR_RATE%"

# 2. Check Which Component is Failing
echo "Analyzing error logs..."
kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=100 | grep "ERROR" | cut -d' ' -f3-5 | sort | uniq -c | sort -rn

# 3. Categorize Errors
echo "Error categories:"
kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=100 | grep "ERROR" | awk '{print $(NF-1), $NF}' | sort | uniq -c

Root Cause Analysis Matrix

Error Pattern                  Likely Cause                Action
database: connection refused   RDS down or unreachable     Check RDS, security groups
redis: timeout                 Redis overloaded or down    Check Redis, scale cache
http: 502 bad gateway          App ports not responding    Restart pods
panic: nil pointer             Application bug             Roll back to previous version
OOM killed                     Memory exhaustion           Scale pods or increase memory
unauthorized: invalid token    Auth service failure        Check OAuth/JWT provider
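The matrix above can be walked mechanically over a log sample. This is a sketch: `suggest_action` assumes the literal error strings shown in the table actually appear in the application's log lines.

```shell
#!/bin/bash
# suggest_action maps a log line to the Action column of the matrix above.
suggest_action() {  # usage: suggest_action <log line>
  case "$1" in
    *"database: connection refused"*) echo "Check RDS, security groups" ;;
    *"redis: timeout"*)               echo "Check Redis, scale cache" ;;
    *"502"*)                          echo "Restart pods" ;;
    *"panic: nil pointer"*)           echo "Roll back to previous version" ;;
    *"OOM"*)                          echo "Scale pods or increase memory" ;;
    *"invalid token"*)                echo "Check OAuth/JWT provider" ;;
    *)                                echo "Investigate manually" ;;
  esac
}

# Live usage (uncomment):
# kubectl logs deployment/sparki-engine-blue -n sparki-engine --tail=200 \
#   | grep ERROR | while read -r line; do
#       echo "$(suggest_action "$line") <- $line"
#     done
```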

Investigation Commands

# Check Application Health
for pod in $(kubectl get pods -n $NAMESPACE -l app=sparki-engine -o name); do
  echo "Pod: $pod"
  kubectl exec $pod -n $NAMESPACE -- curl -s http://localhost:8080/health
done

# Check Database Connectivity
POD=$(kubectl get pods -n $NAMESPACE -l app=sparki-engine -o name | head -1)
kubectl exec $POD -n $NAMESPACE -- psql \
  -h $DB_HOST -U $DB_USER -d sparki \
  -c "SELECT COUNT(*) as connection_test;"

# Check Cache Connectivity
kubectl exec $POD -n $NAMESPACE -- redis-cli \
  -h $REDIS_HOST \
  ping

# Check Pod Resource Limits
kubectl describe pod $POD -n $NAMESPACE | grep -A 3 "Limits\|Requests"

# Check Recent Restarts
kubectl get pods -n $NAMESPACE -o jsonpath='{.items[*].status.containerStatuses[?(@.restartCount>0)].name}'

Mitigation Options

Option 1: Scale Up (If Resource Constrained)
kubectl scale deployment sparki-engine-blue -n $NAMESPACE --replicas=6
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE

# Monitor improvement
sleep 30
ERROR_RATE=$(kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=100 | grep -c ERROR)
echo "New error rate: $ERROR_RATE% (errors per 100 log lines)"
Option 2: Restart Failing Pods
# Kill and restart specific failing pods
kubectl delete pod -n $NAMESPACE -l app=sparki-engine --grace-period=10

# Wait for replacement pods
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE

# Check error rate improvement
sleep 30
Option 3: Emergency Rollback
kubectl rollout undo deployment/sparki-engine-blue -n $NAMESPACE
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE

# Verify error rate dropped
sleep 30
ERROR_RATE=$(kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=100 | grep -c ERROR)
echo "Post-rollback error rate: $ERROR_RATE% (errors per 100 log lines)"
Option 4: Circuit Breaker (Disable Feature)
# If errors from specific feature, disable via feature flag
kubectl set env deployment/sparki-engine-blue \
  -n $NAMESPACE \
  FEATURE_DETECTION_ENABLED=false

# Monitor impact
sleep 60

Success Criteria

  • Error rate < 1%
  • Error logs show root cause identified
  • Action taken (scale, rollback, or disable feature)
  • Monitoring shows recovery
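The "monitoring shows recovery" criterion can be automated as a simple watch loop. A sketch under the same assumptions as the scripts above: `below_threshold` is the pure comparison, and the commented `sample_error_rate` reuses the log-based estimate from the Immediate Response script.

```shell
#!/bin/bash
# below_threshold: succeeds when rate < threshold (both percentages).
below_threshold() {  # usage: below_threshold <rate> <threshold>
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r < t) }'
}

# watch_recovery polls a sampling command until the rate drops below
# the threshold or max_checks is exhausted.
watch_recovery() {  # usage: watch_recovery <threshold_pct> <max_checks> <sample_cmd>
  local i=0 rate
  while [ "$i" -lt "$2" ]; do
    rate=$("$3")
    if below_threshold "$rate" "$1"; then
      echo "recovered: error rate ${rate}% < $1%"
      return 0
    fi
    sleep 30
    i=$((i + 1))
  done
  echo "not recovered after $2 checks"
  return 1
}

# Live usage (uncomment):
# sample_error_rate() {
#   kubectl logs deployment/sparki-engine-blue -n sparki-engine \
#     --tail=1000 | grep -c ERROR | awk '{print $1 / 10}'
# }
# watch_recovery 1 10 sample_error_rate
```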

Escalation

  • 2 min: Page on-call engineer if still failing
  • 5 min: Page tech lead and incident commander
  • 10 min: Initiate major incident response

3. Major: High Latency (P99 > 10s)

Detection: P99 latency sustained > 10 seconds for > 5 minutes.

Diagnosis Flow

#!/bin/bash
NAMESPACE="sparki-engine"

echo "⚠️ MAJOR: High Latency Incident"

# 1. Check Database Performance
echo "Database query performance (from logs):"
kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=200 | grep "query_duration" | tail -10

# 2. Check Network Latency
POD=$(kubectl get pods -n $NAMESPACE -o name | head -1)
echo "Network to RDS:"
kubectl exec $POD -n $NAMESPACE -- ping -c 3 $DB_HOST

# 3. Check CPU and Memory
echo "Pod resource usage:"
kubectl top pod -n $NAMESPACE

echo "Node resource usage:"
kubectl top node

# 4. Check Database Connections
echo "Database connection count:"
kubectl exec $POD -n $NAMESPACE -- psql -h $DB_HOST -U $DB_USER -d sparki \
  -c "SELECT count(*) as connections FROM pg_stat_activity;"

# 5. Check Slow Queries
echo "Slow queries in RDS:"
# Check RDS Performance Insights console or query pg_stat_statements

Common Causes & Fixes

Symptom                 Cause                        Fix
All queries slow        Database overload            Scale RDS (more CPU/memory)
Specific query slow     Missing index                Add index or optimize query
Intermittent slowness   Network congestion           Check NAT Gateway limits
Increasing latency      Memory leak                  Restart pods
Latency with errors     Timeout threshold exceeded   Increase timeout or scale
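To distinguish "all queries slow" from "intermittent slowness" in the table above, sample the health endpoint repeatedly and compare the best and worst response times. A sketch: `slowest_of` is a pure helper, and the commented sampling loop assumes the in-cluster service name used elsewhere in this runbook (run it from a pod or via port-forward).

```shell
#!/bin/bash
# slowest_of prints the largest of its numeric arguments (seconds).
slowest_of() {  # usage: slowest_of <t1> <t2> ...
  printf '%s\n' "$@" | sort -g | tail -1
}

# Live sampling (uncomment):
# samples=""
# for i in 1 2 3 4 5; do
#   t=$(curl -s -o /dev/null -w '%{time_total}' \
#     http://sparki-engine-lb:8080/health)
#   samples="$samples $t"
# done
# echo "slowest of 5 probes: $(slowest_of $samples)s"
```

A large spread between fastest and slowest probes points at intermittent congestion; uniformly slow probes point at database or CPU saturation.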

Quick Fixes (In Order)

# 1. Restart Pods (Simple Memory Reset)
kubectl rollout restart deployment/sparki-engine-blue -n $NAMESPACE
sleep 60
# Check latency improvement

# 2. Scale RDS Vertically (if database constrained)
# Via AWS Console or CLI:
# - Increase instance class (e.g., db.t3.small → db.t3.medium)
# - With Multi-AZ, this is a brief (~1 minute) failover; without it,
#   expect several minutes of downtime
# - Verify any read replicas are healthy after the change

# 3. Enable Query Caching
kubectl set env deployment/sparki-engine-blue \
  -n $NAMESPACE \
  ENABLE_QUERY_CACHE=true \
  CACHE_TTL_SECONDS=300

# 4. Add Connection Pooling
kubectl set env deployment/sparki-engine-blue \
  -n $NAMESPACE \
  DATABASE_POOL_SIZE=25 \
  DATABASE_POOL_TIMEOUT=10

# 5. Disable Non-Critical Features
kubectl set env deployment/sparki-engine-blue \
  -n $NAMESPACE \
  FEATURE_ANALYTICS_ENABLED=false

Success Criteria

  • P99 latency < 5s
  • P95 latency < 2s
  • P50 latency < 500ms
  • Request success rate > 99%
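The P50/P95/P99 targets above can be spot-checked directly from logged request durations. A sketch: `percentile` is a pure nearest-rank calculator; the log field name `duration_ms` in the commented usage is an assumption about the application's log format.

```shell
#!/bin/bash
# percentile reads one numeric value per line on stdin and prints the
# p-th percentile using the nearest-rank (ceiling) method.
percentile() {  # usage: <values on stdin> | percentile <p>
  sort -g | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int(p / 100 * NR + 0.999999)   # ceiling of p% of N
      if (idx < 1) idx = 1
      print v[idx]
    }'
}

# Live usage (uncomment):
# kubectl logs deployment/sparki-engine-blue -n sparki-engine --tail=1000 \
#   | grep -o 'duration_ms=[0-9]*' | cut -d= -f2 | percentile 99
```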

4. Major: Database Connectivity Issues

Detection: All database operations timing out or rejected.

Emergency Response

#!/bin/bash
NAMESPACE="sparki-engine"

echo "🚨 Database Connectivity Issues Detected"

# 1. Verify RDS Status
echo "RDS Instance Status:"
aws rds describe-db-instances \
  --db-instance-identifier sparki-prod \
  --query 'DBInstances[0].[DBInstanceStatus,DBInstanceIdentifier,DBInstanceClass]'

# 2. Check Security Groups
echo "RDS Security Group Ingress Rules:"
aws ec2 describe-security-groups \
  --group-ids sg-xxxxxx \
  --query 'SecurityGroups[0].IpPermissions'

# 3. Check Network Connectivity from Pod
POD=$(kubectl get pods -n $NAMESPACE -o name | head -1)
echo "Testing connectivity:"
kubectl exec $POD -n $NAMESPACE -- \
  nc -zv $DB_HOST 5432

# 4. Check Connection Pool Status
kubectl exec $POD -n $NAMESPACE -- \
  curl -s http://localhost:8080/metrics | grep "database_connections"

# 5. Check Pod Network Interface
kubectl exec $POD -n $NAMESPACE -- \
  ip route show

# 6. Check NAT Gateway (if using private subnet)
aws ec2 describe-nat-gateways \
  --filter "Name=tag:Environment,Values=prod" \
  --query 'NatGateways[*].[NatGatewayId,State,SubnetId]'

Troubleshooting Matrix

Failing Test                Problem                   Fix
RDS status: not available   DB instance down          Wait for AWS to recover or restore from backup
nc -zv timeout              Network blocked           Check security groups, NACLs
Ingress rules empty         Security group modified   Add rule for worker node security group
NAT Gateway not available   Quota exceeded            Scale NAT or use VPC endpoints
Connection pool exhausted   Leak in app               Restart pods, check for leaked connections
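The matrix above can be walked top to bottom in code. A sketch: `diagnose` is the pure decision logic, and the commented probes assume the identifiers ($DB_HOST, sparki-prod) used elsewhere in this section.

```shell
#!/bin/bash
# diagnose maps the two cheapest checks (RDS status, TCP reachability)
# to the Fix column of the matrix above.
diagnose() {  # usage: diagnose <rds_status> <nc_ok: yes|no>
  if [ "$1" != "available" ]; then
    echo "RDS $1: wait for AWS recovery or restore from backup"
  elif [ "$2" != "yes" ]; then
    echo "network blocked: check security groups, NACLs"
  else
    echo "network OK: check connection pool and app-side leaks"
  fi
}

# Live usage (uncomment):
# STATUS=$(aws rds describe-db-instances --db-instance-identifier sparki-prod \
#   --query 'DBInstances[0].DBInstanceStatus' --output text)
# POD=$(kubectl get pods -n sparki-engine -o name | head -1)
# kubectl exec $POD -n sparki-engine -- nc -zv $DB_HOST 5432 \
#   && NC_OK=yes || NC_OK=no
# diagnose "$STATUS" "$NC_OK"
```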

Recovery Steps (By Severity)

If RDS is DOWN (reported by AWS):
# 1. Wait 5-10 minutes for AWS automated recovery
# 2. Check recent events
aws rds describe-events \
  --source-identifier sparki-prod \
  --source-type db-instance \
  --query 'Events[0:5]'

# 3. If still down after 15 min, restore from backup
./infrastructure/scripts/restore-rds-from-backup.sh sparki-prod
If RDS is UP but unreachable:
# 1. Verify security group (providerID yields the instance ID; resolve its SG)
INSTANCE_ID=$(kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | cut -d/ -f5)
WORKER_SG=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID \
  --query 'Reservations[0].Instances[0].SecurityGroups[0].GroupId' --output text)
RDS_SG=$(aws ec2 describe-security-groups \
  --query "SecurityGroups[?GroupName=='sparki-rds-sg'].GroupId" --output text)

# 2. Add security group rule
aws ec2 authorize-security-group-ingress \
  --group-id $RDS_SG \
  --source-group $WORKER_SG \
  --protocol tcp \
  --port 5432

# 3. Test from pod
kubectl delete pod -n $NAMESPACE -l app=sparki-engine
sleep 30
If NAT Gateway is problematic:
# 1. Create additional NAT Gateway (if quota allows)
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway \
  --subnet-id subnet-xxxxx \
  --allocation-id eipalloc-xxxxx

# 2. Or: Scale down pods to reduce egress traffic
kubectl scale deployment sparki-engine-blue -n $NAMESPACE --replicas=2

Success Criteria

  • Network connectivity restored (nc -zv passes)
  • Queries executing successfully
  • No timeout errors in logs
  • Connection pool healthy

5. Monitoring & Alerting Rules During Incident

Slack Alerting Template

#!/bin/bash
# Send to #incidents channel

INCIDENT_ID="$1"
SEVERITY="$2"  # SEV-1, SEV-2, etc.
TITLE="$3"
DESCRIPTION="$4"
ACTION="$5"

SEVERITY_EMOJI="🚨"
[[ "$SEVERITY" == "SEV-2" ]] && SEVERITY_EMOJI="⚠️"
[[ "$SEVERITY" == "SEV-3" ]] && SEVERITY_EMOJI="ℹ️"

curl -X POST -H 'Content-type: application/json' \
  --data "{
    \"attachments\": [{
      \"color\": \"danger\",
      \"title\": \"$SEVERITY_EMOJI $SEVERITY: $TITLE\",
      \"text\": \"$DESCRIPTION\",
      \"fields\": [
        {\"title\": \"Incident ID\", \"value\": \"$INCIDENT_ID\", \"short\": true},
        {\"title\": \"Severity\", \"value\": \"$SEVERITY\", \"short\": true},
        {\"title\": \"Time\", \"value\": \"$(date)\", \"short\": true},
        {\"title\": \"Action\", \"value\": \"$ACTION\", \"short\": false}
      ],
      \"actions\": [
        {\"type\": \"button\", \"text\": \"Open Grafana\", \"url\": \"https://grafana.sparki.tools\"},
        {\"type\": \"button\", \"text\": \"View Logs\", \"url\": \"https://logs.sparki.tools\"}
      ]
    }]
  }" \
  $SLACK_WEBHOOK_URL

Dashboard Quick Reference

Issue                   Dashboard
Error rate high         Command Center
Latency high            Reliability SLO
Database issues         Infrastructure
Traces showing errors   Jaeger
Detailed logs           Kibana
Service metrics         Prometheus

6. Communication During Incident

Update Frequency

Duration        Interval        Channel
First 15 min    Every 5 min     #incidents Slack
15-60 min       Every 15 min    #incidents Slack
Beyond 60 min   Every 30 min    #status + #incidents
Resolved        Post-mortem     #status + #postmortems

Status Update Template

[HH:MM] Update #N

Status: [Investigating | Mitigating | Recovered]
Impact: [% users affected, error rate, latency]
Root Cause: [identified | under investigation]
Action Taken: [steps taken in last 15 min]
Next Steps: [what we're doing next]
ETA to Resolution: [estimate if known]
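The template above can be posted to Slack mechanically. A minimal sketch: `format_update` is a pure formatter (it emits literal `\n` sequences so the string stays valid inside a JSON payload), and posting reuses $SLACK_WEBHOOK_URL as in the alerting script earlier in this runbook.

```shell
#!/bin/bash
# format_update renders a status update matching the template above.
format_update() {  # usage: format_update <n> <status> <impact> <action>
  printf '[%s] Update #%s\\nStatus: %s\\nImpact: %s\\nAction Taken: %s\n' \
    "$(date +%H:%M)" "$1" "$2" "$3" "$4"
}

# Live posting (uncomment):
# BODY=$(format_update 3 "Mitigating" "2% error rate" "rolled back blue deployment")
# curl -X POST -H 'Content-type: application/json' \
#   --data "{\"text\": \"$BODY\"}" $SLACK_WEBHOOK_URL
```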

7. Post-Incident Checklist

#!/bin/bash

echo "Post-Incident Checklist - INCIDENT-$1"

# 1. Collect All Artifacts
mkdir -p /incidents/$1
kubectl logs deployment/sparki-engine-blue -n sparki-engine \
  --since=30m --all-containers > /incidents/$1/pod-logs.txt
kubectl describe nodes > /incidents/$1/node-status.txt
kubectl describe deployment -n sparki-engine > /incidents/$1/deployment-status.txt

# 2. Export Metrics
# Query Prometheus for metrics during incident window
# Export to CSV for analysis

# 3. Export Traces
# Query Jaeger for errors during incident window
# Export for root cause analysis

# 4. Create Incident Report
cat > /incidents/$1/incident-report.md <<EOF
# Incident Report: $1

## Summary
[One paragraph executive summary]

## Timeline
- HH:MM: Issue detected
- HH:MM: First response action
- HH:MM: Mitigation applied
- HH:MM: Service recovered
- HH:MM: Incident closed

## Root Cause
[What actually caused it]

## Impact
- Duration: X minutes
- Users affected: X%
- Error rate: X%
- SLO impact: X minutes of error budget

## Resolution
[What we did to fix it]

## Prevention
[What prevents this in future]

## Action Items
- [ ] Fix root cause (assign owner + deadline)
- [ ] Add monitoring for this scenario
- [ ] Update runbooks
- [ ] Post-incident review meeting
EOF

echo "✅ Incident artifacts collected"
echo "📋 Review report at: /incidents/$1/incident-report.md"

Emergency Contacts

Role                 Slack                 Phone              Pager
On-Call SRE          @on-call-sre          +1-XXX-SRE-XXXX    PagerDuty
Incident Commander   @incident-commander   +1-XXX-INC-XXXX    PagerDuty
Platform Lead        @alexarno             +1-XXX-PLAT-XXXX   PagerDuty
Database Admin       @dba-team             +1-XXX-DBA-XXXX    PagerDuty

Useful Commands Reference

# Namespace-scoped kubectl alias used by the commands below
alias sparki-incident='kubectl -n sparki-engine'

# Quick Status Check
sparki-incident get pods,svc,deployment -o wide

# Tail Logs
sparki-incident logs -f deployment/sparki-engine-blue --all-containers

# Execute Commands in Pod
POD=$(sparki-incident get po -o name | head -1)
sparki-incident exec $POD -- /bin/bash

# Check Events
sparki-incident get events --sort-by='.lastTimestamp'

# Delete and Restart
sparki-incident delete pod -l app=sparki-engine

# Scale Replicas
sparki-incident scale deployment/sparki-engine-blue --replicas=5

Document Version: 1.0
Last Updated: December 2025
Review Cycle: Quarterly