Emergency Response Runbook

Incident Classification & Response Matrix


1. Critical: Complete Service Outage

Detection: No requests being processed, all pods down or in crash loop, health checks failing.

Immediate Response (First 5 minutes)

#!/bin/bash
# Execute from any machine with kubectl access

NAMESPACE="sparki-engine"
TIMESTAMP=$(date +%Y-%m-%d_%H-%M-%S)
INCIDENT_ID="INCIDENT-$TIMESTAMP"

echo "🚨 CRITICAL INCIDENT: Complete Service Outage - $INCIDENT_ID"

# 1. Declare Incident
echo "[$(date)] Declaring SEV-1 incident: $INCIDENT_ID"
# Post to #incidents Slack channel (manual or via webhook)
curl -X POST -H 'Content-type: application/json' \
  --data "{\"text\":\"🚨 SEV-1: Complete Service Outage - $INCIDENT_ID. Incident commander assigned. War room open.\"}" \
  $SLACK_WEBHOOK_URL

# 2. Immediate Diagnostic
echo "Collecting diagnostics..."
kubectl get nodes -o wide > $INCIDENT_ID-nodes.txt
kubectl get pods -n $NAMESPACE -o wide > $INCIDENT_ID-pods.txt
kubectl get svc -n $NAMESPACE -o wide > $INCIDENT_ID-services.txt
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' > $INCIDENT_ID-events.txt

# 3. Check Service Status
echo "Service selector:"
kubectl get svc sparki-engine-lb -n $NAMESPACE -o jsonpath='{.spec.selector}' | jq '.'

# 4. Check All Deployments
echo "Deployment status:"
kubectl get deployments -n $NAMESPACE -o wide

# 5. Check Recent Pod Errors
echo "Recent pod errors:"
kubectl describe pod -n $NAMESPACE | grep -E -A 5 "Error|Failed|CrashLoop"

Decision Tree

Service Down?
├─ Yes, pods not running/CrashLoop
│  ├─ Node failures → See "Node Failure" section
│  ├─ Image pull errors → Check ECR, image tag
│  ├─ Database unavailable → Check RDS connectivity
│  └─ Resource exhaustion → Scale up or free resources
└─ Pods running but not responding
   ├─ Check application logs → Look for panic/crashes
   ├─ Check health endpoint → Curl from pod
   └─ Check external connectivity → DNS, load balancer
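The decision tree above can be sketched as a triage helper. This is a minimal sketch: `classify_pod_reason` is a pure mapping from a pod's waiting reason to the branch to follow, and the commented kubectl loop assumes the namespace and labels used elsewhere in this runbook.

```shell
#!/bin/bash
# classify_pod_reason maps a pod status reason to the triage branch above.
classify_pod_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "image-pull: check ECR and image tag" ;;
    CrashLoopBackOff)              echo "crashloop: check logs, DB connectivity" ;;
    OOMKilled|Evicted)             echo "resources: scale up or free resources" ;;
    Running)                       echo "running: check health endpoint from pod" ;;
    *)                             echo "unknown: inspect events and node status" ;;
  esac
}

# Live usage (uncomment to run against the cluster):
# NAMESPACE="sparki-engine"
# for pod in $(kubectl get pods -n $NAMESPACE -o name); do
#   reason=$(kubectl get "$pod" -n $NAMESPACE \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')
#   echo "$pod -> $(classify_pod_reason "${reason:-Running}")"
# done
```

Note that terminated-state reasons (e.g. OOMKilled) surface under `state.terminated.reason` rather than `state.waiting.reason`; extend the lookup as needed.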

Mitigation Steps (In Priority Order)

Option 1: Scale Horizontally (Quick)
# If pods are crashing but some are recovering:
kubectl scale deployment sparki-engine-blue -n $NAMESPACE --replicas=5

# Wait for recovery
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE --timeout=300s

# If still failing, continue to Option 2
Option 2: Restart Deployment (Medium)
# Force restart all pods
kubectl rollout restart deployment/sparki-engine-blue -n $NAMESPACE

# Monitor restart
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE

# Wait 30 seconds for system stabilization
sleep 30

# Check if recovered
# In-cluster service name; from outside the cluster, use kubectl port-forward
curl http://sparki-engine-lb:8080/health
Option 3: Revert to Last Known Good (Nuclear)
# Revert to previous version immediately
kubectl rollout undo deployment/sparki-engine-blue -n $NAMESPACE

# Wait for rollout
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE --timeout=300s

# Verify service restored
sleep 10
curl http://sparki-engine-lb:8080/health
Option 4: Route to Backup Cluster (Last Resort)
# Update load balancer to point to standby cluster
# This requires pre-configured secondary infrastructure

aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456789ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "prod.sparki.tools",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z9999",
          "DNSName": "backup-cluster-lb.region.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'

echo "✅ Routing to backup cluster (if available)"

Success Criteria

  • Service responding to health checks
  • Error rate < 1%
  • At least 2 pods running and healthy
  • Load balancer detecting healthy backends
  • External connectivity restored
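The criteria above can be checked in one pass. A minimal sketch: `error_rate_ok` and `enough_healthy` are pure helpers; the commented live checks assume the namespace, labels, and service name used elsewhere in this runbook.

```shell
#!/bin/bash
# error_rate_ok: succeeds when errors/total is below 1%.
error_rate_ok() {  # usage: error_rate_ok <errors> <total_requests>
  awk -v e="$1" -v t="$2" 'BEGIN { exit !(t > 0 && (e * 100 / t) < 1) }'
}

# enough_healthy: succeeds when at least 2 pods are ready.
enough_healthy() {  # usage: enough_healthy <ready_pod_count>
  [ "$1" -ge 2 ]
}

# Live checks (uncomment):
# READY=$(kubectl get pods -n sparki-engine -l app=sparki-engine \
#   --field-selector=status.phase=Running --no-headers | wc -l)
# enough_healthy "$READY" && echo "pods: OK" || echo "pods: FAIL"
# curl -fsS http://sparki-engine-lb:8080/health && echo "health: OK"
```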

Escalation Path

  • 5 min: No improvement → Activate backup cluster
  • 10 min: No improvement → Page database admin (check RDS)
  • 15 min: No improvement → Page network admin (check load balancer)
  • 20 min: No improvement → Incident commander initiates manual failover

2. Critical: High Error Rate (>5%)

Detection: Error rate spike above 5% sustained for > 2 minutes.

Immediate Response

#!/bin/bash
NAMESPACE="sparki-engine"
ERROR_THRESHOLD=5.0

echo "🚨 CRITICAL: High Error Rate Detected"

# 1. Quantify Issue
# Approximate: percentage of the last 1000 log lines containing ERROR
ERROR_RATE=$(kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=1000 | grep -c "ERROR" | awk '{print ($1/10)}')
echo "Current error rate: $ERROR_RATE%"

# 2. Check Which Component is Failing
echo "Analyzing error logs..."
kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=100 | grep "ERROR" | cut -d' ' -f3-5 | sort | uniq -c | sort -rn

# 3. Categorize Errors
echo "Error categories:"
kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=100 | grep "ERROR" | awk '{print $(NF-1), $NF}' | sort | uniq -c

Root Cause Analysis Matrix

Error Pattern                  Likely Cause                Action
database: connection refused   RDS down or unreachable     Check RDS, security groups
redis: timeout                 Redis overloaded or down    Check Redis, scale cache
http: 502 bad gateway          App ports not responding    Restart pods
panic: nil pointer             Application bug             Roll back to previous version
OOM killed                     Memory exhaustion           Scale pods or increase memory
unauthorized: invalid token    Auth service failure        Check OAuth/JWT provider
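The matrix above can be walked mechanically over a log sample. This is a sketch: `suggest_action` assumes the literal error strings shown in the table actually appear in the application's log lines.

```shell
#!/bin/bash
# suggest_action maps a log line to the Action column of the matrix above.
suggest_action() {  # usage: suggest_action <log line>
  case "$1" in
    *"database: connection refused"*) echo "Check RDS, security groups" ;;
    *"redis: timeout"*)               echo "Check Redis, scale cache" ;;
    *"502"*)                          echo "Restart pods" ;;
    *"panic: nil pointer"*)           echo "Roll back to previous version" ;;
    *"OOM"*)                          echo "Scale pods or increase memory" ;;
    *"invalid token"*)                echo "Check OAuth/JWT provider" ;;
    *)                                echo "Investigate manually" ;;
  esac
}

# Live usage (uncomment):
# kubectl logs deployment/sparki-engine-blue -n sparki-engine --tail=200 \
#   | grep ERROR | while read -r line; do
#       echo "$(suggest_action "$line") <- $line"
#     done
```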

Investigation Commands

# Check Application Health
for pod in $(kubectl get pods -n $NAMESPACE -l app=sparki-engine -o name); do
  echo "Pod: $pod"
  kubectl exec $pod -n $NAMESPACE -- curl -s http://localhost:8080/health
done

# Check Database Connectivity
POD=$(kubectl get pods -n $NAMESPACE -l app=sparki-engine -o name | head -1)
kubectl exec $POD -n $NAMESPACE -- psql \
  -h $DB_HOST -U $DB_USER -d sparki \
  -c "SELECT COUNT(*) as connection_test;"

# Check Cache Connectivity
kubectl exec $POD -n $NAMESPACE -- redis-cli \
  -h $REDIS_HOST \
  ping

# Check Pod Resource Limits
kubectl describe pod $POD -n $NAMESPACE | grep -A 3 "Limits\|Requests"

# Check Recent Restarts
kubectl get pods -n $NAMESPACE -o jsonpath='{.items[*].status.containerStatuses[?(@.restartCount>0)].name}'

Mitigation Options

Option 1: Scale Up (If Resource Constrained)
kubectl scale deployment sparki-engine-blue -n $NAMESPACE --replicas=6
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE

# Monitor improvement
sleep 30
ERROR_RATE=$(kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=100 | grep -c ERROR)
echo "New error rate: $ERROR_RATE% (errors per 100 log lines)"
Option 2: Restart Failing Pods
# Kill and restart specific failing pods
kubectl delete pod -n $NAMESPACE -l app=sparki-engine --grace-period=10

# Wait for replacement pods
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE

# Check error rate improvement
sleep 30
Option 3: Emergency Rollback
kubectl rollout undo deployment/sparki-engine-blue -n $NAMESPACE
kubectl rollout status deployment/sparki-engine-blue -n $NAMESPACE

# Verify error rate dropped
sleep 30
ERROR_RATE=$(kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=100 | grep -c ERROR)
echo "Post-rollback error rate: $ERROR_RATE% (errors per 100 log lines)"
Option 4: Circuit Breaker (Disable Feature)
# If errors from specific feature, disable via feature flag
kubectl set env deployment/sparki-engine-blue \
  -n $NAMESPACE \
  FEATURE_DETECTION_ENABLED=false

# Monitor impact
sleep 60

Success Criteria

  • Error rate < 1%
  • Error logs show root cause identified
  • Action taken (scale, rollback, or disable feature)
  • Monitoring shows recovery
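The "monitoring shows recovery" criterion can be automated as a simple watch loop. A sketch under the same assumptions as the scripts above: `below_threshold` is the pure comparison, and the commented `sample_error_rate` reuses the log-based estimate from the Immediate Response script.

```shell
#!/bin/bash
# below_threshold: succeeds when rate < threshold (both percentages).
below_threshold() {  # usage: below_threshold <rate> <threshold>
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r < t) }'
}

# watch_recovery polls a sampling command until the rate drops below
# the threshold or max_checks is exhausted.
watch_recovery() {  # usage: watch_recovery <threshold_pct> <max_checks> <sample_cmd>
  local i=0 rate
  while [ "$i" -lt "$2" ]; do
    rate=$("$3")
    if below_threshold "$rate" "$1"; then
      echo "recovered: error rate ${rate}% < $1%"
      return 0
    fi
    sleep 30
    i=$((i + 1))
  done
  echo "not recovered after $2 checks"
  return 1
}

# Live usage (uncomment):
# sample_error_rate() {
#   kubectl logs deployment/sparki-engine-blue -n sparki-engine \
#     --tail=1000 | grep -c ERROR | awk '{print $1 / 10}'
# }
# watch_recovery 1 10 sample_error_rate
```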

Escalation

  • 2 min: Page on-call engineer if still failing
  • 5 min: Page tech lead and incident commander
  • 10 min: Initiate major incident response

3. Major: High Latency (P99 > 10s)

Detection: P99 latency sustained > 10 seconds for > 5 minutes.

Diagnosis Flow

#!/bin/bash
NAMESPACE="sparki-engine"

echo "⚠️ MAJOR: High Latency Incident"

# 1. Check Database Performance
echo "Database query performance (from logs):"
kubectl logs deployment/sparki-engine-blue -n $NAMESPACE \
  --tail=200 | grep "query_duration" | tail -10

# 2. Check Network Latency
POD=$(kubectl get pods -n $NAMESPACE -o name | head -1)
echo "Network to RDS:"
kubectl exec $POD -n $NAMESPACE -- ping -c 3 $DB_HOST

# 3. Check CPU and Memory
echo "Pod resource usage:"
kubectl top pod -n $NAMESPACE

echo "Node resource usage:"
kubectl top node

# 4. Check Database Connections
echo "Database connection count:"
kubectl exec $POD -n $NAMESPACE -- psql -h $DB_HOST -U $DB_USER -d sparki \
  -c "SELECT count(*) as connections FROM pg_stat_activity;"

# 5. Check Slow Queries
echo "Slow queries in RDS:"
# Check RDS Performance Insights console or query pg_stat_statements

Common Causes & Fixes

Symptom                 Cause                        Fix
All queries slow        Database overload            Scale RDS (more CPU/memory)
Specific query slow     Missing index                Add index or optimize query
Intermittent slowness   Network congestion           Check NAT Gateway limits
Increasing latency      Memory leak                  Restart pods
Latency with errors     Timeout threshold exceeded   Increase timeout or scale
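To distinguish "all queries slow" from "intermittent slowness" in the table above, sample the health endpoint repeatedly and compare the best and worst response times. A sketch: `slowest_of` is a pure helper, and the commented sampling loop assumes the in-cluster service name used elsewhere in this runbook (run it from a pod or via port-forward).

```shell
#!/bin/bash
# slowest_of prints the largest of its numeric arguments (seconds).
slowest_of() {  # usage: slowest_of <t1> <t2> ...
  printf '%s\n' "$@" | sort -g | tail -1
}

# Live sampling (uncomment):
# samples=""
# for i in 1 2 3 4 5; do
#   t=$(curl -s -o /dev/null -w '%{time_total}' \
#     http://sparki-engine-lb:8080/health)
#   samples="$samples $t"
# done
# echo "slowest of 5 probes: $(slowest_of $samples)s"
```

A large spread between fastest and slowest probes points at intermittent congestion; uniformly slow probes point at database or CPU saturation.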

Quick Fixes (In Order)

# 1. Restart Pods (Simple Memory Reset)
kubectl rollout restart deployment/sparki-engine-blue -n $NAMESPACE
sleep 60
# Check latency improvement

# 2. Scale RDS Vertically (if database constrained)
# Via AWS Console or CLI:
# - Increase instance class (e.g., db.t3.small → db.t3.medium)
# - With Multi-AZ, this is a brief (~1 minute) failover; without it,
#   expect several minutes of downtime
# - Verify any read replicas are healthy after the change

# 3. Enable Query Caching
kubectl set env deployment/sparki-engine-blue \
  -n $NAMESPACE \
  ENABLE_QUERY_CACHE=true \
  CACHE_TTL_SECONDS=300

# 4. Add Connection Pooling
kubectl set env deployment/sparki-engine-blue \
  -n $NAMESPACE \
  DATABASE_POOL_SIZE=25 \
  DATABASE_POOL_TIMEOUT=10

# 5. Disable Non-Critical Features
kubectl set env deployment/sparki-engine-blue \
  -n $NAMESPACE \
  FEATURE_ANALYTICS_ENABLED=false

Success Criteria

  • P99 latency < 5s
  • P95 latency < 2s
  • P50 latency < 500ms
  • Request success rate > 99%
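The P50/P95/P99 targets above can be spot-checked directly from logged request durations. A sketch: `percentile` is a pure nearest-rank calculator; the log field name `duration_ms` in the commented usage is an assumption about the application's log format.

```shell
#!/bin/bash
# percentile reads one numeric value per line on stdin and prints the
# p-th percentile using the nearest-rank (ceiling) method.
percentile() {  # usage: <values on stdin> | percentile <p>
  sort -g | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int(p / 100 * NR + 0.999999)   # ceiling of p% of N
      if (idx < 1) idx = 1
      print v[idx]
    }'
}

# Live usage (uncomment):
# kubectl logs deployment/sparki-engine-blue -n sparki-engine --tail=1000 \
#   | grep -o 'duration_ms=[0-9]*' | cut -d= -f2 | percentile 99
```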

4. Major: Database Connectivity Issues

Detection: All database operations timing out or rejected.

Emergency Response

#!/bin/bash
NAMESPACE="sparki-engine"

echo "🚨 Database Connectivity Issues Detected"

# 1. Verify RDS Status
echo "RDS Instance Status:"
aws rds describe-db-instances \
  --db-instance-identifier sparki-prod \
  --query 'DBInstances[0].[DBInstanceStatus,DBInstanceIdentifier,DBInstanceClass]'

# 2. Check Security Groups
echo "RDS Security Group Ingress Rules:"
aws ec2 describe-security-groups \
  --group-ids sg-xxxxxx \
  --query 'SecurityGroups[0].IpPermissions'

# 3. Check Network Connectivity from Pod
POD=$(kubectl get pods -n $NAMESPACE -o name | head -1)
echo "Testing connectivity:"
kubectl exec $POD -n $NAMESPACE -- \
  nc -zv $DB_HOST 5432

# 4. Check Connection Pool Status
kubectl exec $POD -n $NAMESPACE -- \
  curl -s http://localhost:8080/metrics | grep "database_connections"

# 5. Check Pod Network Interface
kubectl exec $POD -n $NAMESPACE -- \
  ip route show

# 6. Check NAT Gateway (if using private subnet)
aws ec2 describe-nat-gateways \
  --filter "Name=tag:Environment,Values=prod" \
  --query 'NatGateways[*].[NatGatewayId,State,SubnetId]'

Troubleshooting Matrix

Failing Test                Problem                   Fix
RDS status: not available   DB instance down          Wait for AWS to recover or restore from backup
nc -zv timeout              Network blocked           Check security groups, NACLs
Ingress rules empty         Security group modified   Add rule for worker node security group
NAT Gateway not available   Quota exceeded            Scale NAT or use VPC endpoints
Connection pool exhausted   Leak in app               Restart pods, check for leaked connections
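The matrix above can be walked top to bottom in code. A sketch: `diagnose` is the pure decision logic, and the commented probes assume the identifiers ($DB_HOST, sparki-prod) used elsewhere in this section.

```shell
#!/bin/bash
# diagnose maps the two cheapest checks (RDS status, TCP reachability)
# to the Fix column of the matrix above.
diagnose() {  # usage: diagnose <rds_status> <nc_ok: yes|no>
  if [ "$1" != "available" ]; then
    echo "RDS $1: wait for AWS recovery or restore from backup"
  elif [ "$2" != "yes" ]; then
    echo "network blocked: check security groups, NACLs"
  else
    echo "network OK: check connection pool and app-side leaks"
  fi
}

# Live usage (uncomment):
# STATUS=$(aws rds describe-db-instances --db-instance-identifier sparki-prod \
#   --query 'DBInstances[0].DBInstanceStatus' --output text)
# POD=$(kubectl get pods -n sparki-engine -o name | head -1)
# kubectl exec $POD -n sparki-engine -- nc -zv $DB_HOST 5432 \
#   && NC_OK=yes || NC_OK=no
# diagnose "$STATUS" "$NC_OK"
```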

Recovery Steps (By Severity)

If RDS is DOWN (reported by AWS):
# 1. Wait 5-10 minutes for AWS automated recovery
# 2. Check recent events
aws rds describe-events \
  --source-identifier sparki-prod \
  --source-type db-instance \
  --query 'Events[0:5]'

# 3. If still down after 15 min, restore from backup
./infrastructure/scripts/restore-rds-from-backup.sh sparki-prod
If RDS is UP but unreachable:
# 1. Verify security group (providerID yields the instance ID; resolve its SG)
INSTANCE_ID=$(kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | cut -d/ -f5)
WORKER_SG=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID \
  --query 'Reservations[0].Instances[0].SecurityGroups[0].GroupId' --output text)
RDS_SG=$(aws ec2 describe-security-groups \
  --query "SecurityGroups[?GroupName=='sparki-rds-sg'].GroupId" --output text)

# 2. Add security group rule
aws ec2 authorize-security-group-ingress \
  --group-id $RDS_SG \
  --source-group $WORKER_SG \
  --protocol tcp \
  --port 5432

# 3. Test from pod
kubectl delete pod -n $NAMESPACE -l app=sparki-engine
sleep 30
If NAT Gateway is problematic:
# 1. Create additional NAT Gateway (if quota allows)
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway \
  --subnet-id subnet-xxxxx \
  --allocation-id eipalloc-xxxxx

# 2. Or: Scale down pods to reduce egress traffic
kubectl scale deployment sparki-engine-blue -n $NAMESPACE --replicas=2

Success Criteria

  • Network connectivity restored (nc -zv passes)
  • Queries executing successfully
  • No timeout errors in logs
  • Connection pool healthy

5. Monitoring & Alerting Rules During Incident

Slack Alerting Template

#!/bin/bash
# Send to #incidents channel

INCIDENT_ID="$1"
SEVERITY="$2"  # SEV-1, SEV-2, etc.
TITLE="$3"
DESCRIPTION="$4"
ACTION="$5"

SEVERITY_EMOJI="🚨"
[[ "$SEVERITY" == "SEV-2" ]] && SEVERITY_EMOJI="⚠️"
[[ "$SEVERITY" == "SEV-3" ]] && SEVERITY_EMOJI="ℹ️"

curl -X POST -H 'Content-type: application/json' \
  --data "{
    \"attachments\": [{
      \"color\": \"danger\",
      \"title\": \"$SEVERITY_EMOJI $SEVERITY: $TITLE\",
      \"text\": \"$DESCRIPTION\",
      \"fields\": [
        {\"title\": \"Incident ID\", \"value\": \"$INCIDENT_ID\", \"short\": true},
        {\"title\": \"Severity\", \"value\": \"$SEVERITY\", \"short\": true},
        {\"title\": \"Time\", \"value\": \"$(date)\", \"short\": true},
        {\"title\": \"Action\", \"value\": \"$ACTION\", \"short\": false}
      ],
      \"actions\": [
        {\"type\": \"button\", \"text\": \"Open Grafana\", \"url\": \"https://grafana.sparki.tools\"},
        {\"type\": \"button\", \"text\": \"View Logs\", \"url\": \"https://logs.sparki.tools\"}
      ]
    }]
  }" \
  $SLACK_WEBHOOK_URL

Dashboard Quick Reference

Issue                   Dashboard
Error rate high         Command Center
Latency high            Reliability SLO
Database issues         Infrastructure
Traces showing errors   Jaeger
Detailed logs           Kibana
Service metrics         Prometheus

6. Communication During Incident

Update Frequency

Duration        Interval        Channel
First 15 min    Every 5 min     #incidents Slack
15-60 min       Every 15 min    #incidents Slack
Beyond 60 min   Every 30 min    #status + #incidents
Resolved        Post-mortem     #status + #postmortems

Status Update Template

[HH:MM] Update #N

Status: [Investigating | Mitigating | Recovered]
Impact: [% users affected, error rate, latency]
Root Cause: [identified | under investigation]
Action Taken: [steps taken in last 15 min]
Next Steps: [what we're doing next]
ETA to Resolution: [estimate if known]
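The template above can be posted to Slack mechanically. A minimal sketch: `format_update` is a pure formatter (it emits literal `\n` sequences so the string stays valid inside a JSON payload), and posting reuses $SLACK_WEBHOOK_URL as in the alerting script earlier in this runbook.

```shell
#!/bin/bash
# format_update renders a status update matching the template above.
format_update() {  # usage: format_update <n> <status> <impact> <action>
  printf '[%s] Update #%s\\nStatus: %s\\nImpact: %s\\nAction Taken: %s\n' \
    "$(date +%H:%M)" "$1" "$2" "$3" "$4"
}

# Live posting (uncomment):
# BODY=$(format_update 3 "Mitigating" "2% error rate" "rolled back blue deployment")
# curl -X POST -H 'Content-type: application/json' \
#   --data "{\"text\": \"$BODY\"}" $SLACK_WEBHOOK_URL
```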

7. Post-Incident Checklist

#!/bin/bash

echo "Post-Incident Checklist - INCIDENT-$1"

# 1. Collect All Artifacts
mkdir -p /incidents/$1
kubectl logs deployment/sparki-engine-blue -n sparki-engine \
  --since=30m --all-containers > /incidents/$1/pod-logs.txt
kubectl describe nodes > /incidents/$1/node-status.txt
kubectl describe deployment -n sparki-engine > /incidents/$1/deployment-status.txt

# 2. Export Metrics
# Query Prometheus for metrics during incident window
# Export to CSV for analysis

# 3. Export Traces
# Query Jaeger for errors during incident window
# Export for root cause analysis

# 4. Create Incident Report
cat > /incidents/$1/incident-report.md <<EOF
# Incident Report: $1

## Summary
[One paragraph executive summary]

## Timeline
- HH:MM: Issue detected
- HH:MM: First response action
- HH:MM: Mitigation applied
- HH:MM: Service recovered
- HH:MM: Incident closed

## Root Cause
[What actually caused it]

## Impact
- Duration: X minutes
- Users affected: X%
- Error rate: X%
- SLO impact: X minutes of error budget

## Resolution
[What we did to fix it]

## Prevention
[What prevents this in future]

## Action Items
- [ ] Fix root cause (assign owner + deadline)
- [ ] Add monitoring for this scenario
- [ ] Update runbooks
- [ ] Post-incident review meeting
EOF

echo "✅ Incident artifacts collected"
echo "📋 Review report at: /incidents/$1/incident-report.md"

Emergency Contacts

Role                 Slack                 Phone              Pager
On-Call SRE          @on-call-sre          +1-XXX-SRE-XXXX    PagerDuty
Incident Commander   @incident-commander   +1-XXX-INC-XXXX    PagerDuty
Platform Lead        @alexarno             +1-XXX-PLAT-XXXX   PagerDuty
Database Admin       @dba-team             +1-XXX-DBA-XXXX    PagerDuty

Useful Commands Reference

# Namespace-scoped kubectl alias used by the commands below
alias sparki-incident='kubectl -n sparki-engine'

# Quick Status Check
sparki-incident get pods,svc,deployment -o wide

# Tail Logs
sparki-incident logs -f deployment/sparki-engine-blue --all-containers

# Execute Commands in Pod
POD=$(sparki-incident get po -o name | head -1)
sparki-incident exec $POD -- /bin/bash

# Check Events
sparki-incident get events --sort-by='.lastTimestamp'

# Delete and Restart
sparki-incident delete pod -l app=sparki-engine

# Scale Replicas
sparki-incident scale deployment/sparki-engine-blue --replicas=5

Document Version: 1.0
Last Updated: December 2025
Review Cycle: Quarterly