Emergency Response Runbook
Incident Classification & Response Matrix
1. Critical: Complete Service Outage
Detection: No requests being processed, all pods down or in a crash loop, health checks failing.
Immediate Response (First 5 minutes)
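A first-pass triage for a complete outage might look like the following sketch; the deployment name, namespace, and health URL are placeholders, not values from this runbook:

```shell
# Check pod status across the service namespace (names are hypothetical)
kubectl get pods -n production -o wide

# Inspect the most recent events for crash-loop clues
kubectl get events -n production --sort-by=.lastTimestamp | tail -20

# Tail logs from the failing workload, including the previous (crashed) container
kubectl logs -n production deploy/api-server --previous --tail=100

# Hit the health endpoint through the load balancer
curl -fsS -m 5 https://api.example.com/healthz || echo "health check FAILED"
```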
Decision Tree
Mitigation Steps (In Priority Order)
Option 1: Scale Horizontally (Quick)
Success Criteria
- Service responding to health checks
- Error rate < 1%
- At least 2 pods running and healthy
- Load balancer detecting healthy backends
- External connectivity restored
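Option 1 above (horizontal scale-out) could be performed roughly as follows; the deployment name, namespace, label, and replica count are assumptions:

```shell
# Scale the deployment out quickly (hypothetical names and counts)
kubectl scale deploy/api-server -n production --replicas=6

# Watch the rollout until pods report Ready
kubectl rollout status deploy/api-server -n production --timeout=120s

# Verify the success criterion of at least 2 healthy pods
kubectl get pods -n production -l app=api-server \
  --field-selector=status.phase=Running
```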
Escalation Path
- 5 min: No improvement → Activate backup cluster
- 10 min: No improvement → Page database admin (check RDS)
- 15 min: No improvement → Page network admin (check load balancer)
- 20 min: No improvement → Incident commander initiates manual failover
2. Critical: High Error Rate (>5%)
Detection: Error rate spike above 5% sustained for > 2 minutes.
Immediate Response
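One way to confirm the spike and sample its shape, assuming Prometheus with standard HTTP metrics (the metric names, Prometheus address, and deployment name are assumptions):

```shell
# Error-rate ratio over the last 5 minutes (metric names are hypothetical)
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

# Sample recent error logs to match against the root-cause matrix below
kubectl logs -n production deploy/api-server --since=5m \
  | grep -iE 'error|panic|refused|timeout' | head -50
```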
Root Cause Analysis Matrix
| Error Pattern | Likely Cause | Action |
|---|---|---|
| database: connection refused | RDS down or unreachable | Check RDS, security groups |
| redis: timeout | Redis overloaded or down | Check Redis, scale cache |
| http: 502 bad gateway | App ports not responding | Restart pods |
| panic: nil pointer | Application bug | Roll back to previous version |
| OOM killed | Memory exhaustion | Scale pods or increase memory |
| unauthorized: invalid token | Auth service failure | Check OAuth/JWT provider |
Investigation Commands
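The commands for this section appear to have been lost; a plausible reconstruction follows (deployment and namespace names are hypothetical):

```shell
# Group error messages to find the dominant pattern
kubectl logs -n production deploy/api-server --since=10m \
  | grep -i error | awk -F': ' '{print $NF}' | sort | uniq -c | sort -rn | head

# Check for OOM kills and restarts
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{" restarts="}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

# Check recent deploys -- a fresh rollout is a prime rollback candidate
kubectl rollout history deploy/api-server -n production
```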
Mitigation Options
Option 1: Scale Up (If Resource Constrained)
Success Criteria
- Error rate < 1%
- Error logs show root cause identified
- Action taken (scale, rollback, or disable feature)
- Monitoring shows recovery
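Option 1 (scale up when resource constrained) might look like this; the resource figures and names are illustrative only, and rollback is shown for the application-bug row of the matrix:

```shell
# Raise resource limits and requests (illustrative values)
kubectl set resources deploy/api-server -n production \
  --limits=cpu=2,memory=4Gi --requests=cpu=1,memory=2Gi

# If the matrix points at an application bug instead, roll back the release
kubectl rollout undo deploy/api-server -n production
```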
Escalation
- 2 min: Page on-call engineer if still failing
- 5 min: Page tech lead and incident commander
- 10 min: Initiate major incident response
3. Major: High Latency (P99 > 10s)
Detection: P99 latency sustained > 10 seconds for > 5 minutes.
Diagnosis Flow
Common Causes & Fixes
| Symptom | Cause | Fix |
|---|---|---|
| All queries slow | Database overload | Scale RDS (more CPU/memory) |
| Specific query slow | Missing index | Add index or optimize query |
| Intermittent slowness | Network congestion | Check NAT Gateway limits |
| Increasing latency | Memory leak | Restart pods |
| Latency with errors | Timeout threshold exceeded | Increase timeout or scale |
Quick Fixes (In Order)
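The quick-fix sequence could be, in the priority order implied by the causes table (deployment name, replica count, and RDS instance class are assumptions):

```shell
# 1. Restart pods (clears memory leaks and stuck connections)
kubectl rollout restart deploy/api-server -n production

# 2. Scale out to spread the load
kubectl scale deploy/api-server -n production --replicas=8

# 3. Scale the database vertically (hypothetical identifier and class)
aws rds modify-db-instance --db-instance-identifier prod-db \
  --db-instance-class db.r6g.2xlarge --apply-immediately
```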
Success Criteria
- P99 latency < 5s
- P95 latency < 2s
- P50 latency < 500ms
- Request success rate > 99%
4. Major: Database Connectivity Issues
Detection: All database operations timing out or rejected.
Emergency Response
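Connectivity tests matching the troubleshooting matrix below; the endpoint, instance identifier, port, and security-group id are placeholders:

```shell
# 1. Is the instance up, according to AWS?
aws rds describe-db-instances --db-instance-identifier prod-db \
  --query 'DBInstances[0].DBInstanceStatus'

# 2. Is the network path open from inside the cluster?
kubectl run -n production nc-test --rm -it --image=busybox --restart=Never -- \
  nc -zv prod-db.xxxx.us-east-1.rds.amazonaws.com 5432

# 3. Are the security-group ingress rules intact?
aws ec2 describe-security-groups --group-ids sg-xxxxxxxx \
  --query 'SecurityGroups[0].IpPermissions'
```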
Troubleshooting Matrix
| Test Fails | Problem | Fix |
|---|---|---|
| RDS status: not available | DB instance down | Wait for AWS to recover or restore from backup |
| nc -zv timeout | Network blocked | Check security groups, NACLs |
| Ingress rules empty | Security group modified | Add rule for worker node security group |
| NAT Gateway: no available | Quota exceeded | Scale NAT or use VPC endpoints |
| Connection pool exhausted | Leak in app | Restart pods, check for leaked connections |
Recovery Steps (By Severity)
If RDS is DOWN (reported by AWS):
Success Criteria
- Network connectivity restored (nc -zv passes)
- Queries executing successfully
- No timeout errors in logs
- Connection pool healthy
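For the "RDS is DOWN" case above, a recovery sketch under the assumption of a Multi-AZ deployment with automated backups (instance identifiers are hypothetical):

```shell
# Force a Multi-AZ failover if a standby exists
aws rds reboot-db-instance --db-instance-identifier prod-db --force-failover

# Otherwise restore from the latest point-in-time backup
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier prod-db \
  --target-db-instance-identifier prod-db-restored \
  --use-latest-restorable-time
```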
5. Monitoring & Alerting Rules During Incident
Slack Alerting Template
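The template itself appears to be missing; one plausible shape, posted via a Slack incoming webhook (the webhook variable and message fields are placeholders):

```shell
# Post an incident alert to #incidents (webhook URL is a placeholder)
curl -sS -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{
    "channel": "#incidents",
    "text": ":rotating_light: [SEV1] Complete service outage\nImpact: all API traffic failing\nOwner: @on-call-sre\nNext update: 5 min"
  }'
```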
Dashboard Links During Incident
| Issue | Dashboard |
|---|---|
| Error rate high | Command Center |
| Latency high | Reliability SLO |
| Database issues | Infrastructure |
| Traces showing errors | Jaeger |
| Detailed logs | Kibana |
| Service metrics | Prometheus |
6. Communication During Incident
Update Frequency
| Duration | Interval | Channel |
|---|---|---|
| First 15 min | Every 5 min | #incidents Slack |
| 15-60 min | Every 15 min | #incidents Slack |
| Beyond 60 min | Every 30 min | #status + #incidents |
| Resolved | Post mortem | #status + #postmortems |
Status Update Template
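The template body appears to have been lost; a plausible reconstruction, consistent with the cadence table above (all fields are placeholders to fill in):

```
[STATUS UPDATE - <HH:MM UTC>] Incident #<id> - <one-line summary>
Severity: <SEV1/SEV2>  Duration: <N> min  Status: Investigating | Mitigating | Monitoring | Resolved
Impact: <who/what is affected; current error rate / latency>
Current action: <what is being done right now>
Next update: <per the cadence table>
```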
7. Post-Incident Checklist
Emergency Contacts
| Role | Slack | Phone | Pager |
|---|---|---|---|
| On-Call SRE | @on-call-sre | +1-XXX-SRE-XXXX | PagerDuty |
| Incident Commander | @incident-commander | +1-XXX-INC-XXXX | PagerDuty |
| Platform Lead | @alexarno | +1-XXX-PLAT-XXXX | PagerDuty |
| Database Admin | @dba-team | +1-XXX-DBA-XXXX | PagerDuty |
Useful Commands Reference
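The reference block appears to have been lost; a minimal set consistent with the procedures above (deployment, namespace, and instance names are placeholders):

```shell
kubectl get pods -n production -o wide                         # pod status
kubectl logs -n production deploy/api-server --since=10m       # recent logs
kubectl rollout restart deploy/api-server -n production        # restart pods
kubectl rollout undo deploy/api-server -n production           # roll back
kubectl scale deploy/api-server -n production --replicas=6     # scale out
aws rds describe-db-instances --db-instance-identifier prod-db # DB status
```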
Document Version: 1.0
Last Updated: December 2025
Review Cycle: Quarterly