Incident Response
Severity
- SEV-1: production outage or data loss risk
- SEV-2: major degradation or partial outage
- SEV-3: minor degradation
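
As a rough illustration, the severity levels above could be encoded so that tooling assigns them consistently. This is a minimal sketch, not an established implementation: the `Severity` enum, the `classify` function, and its flag names are all hypothetical, and it assumes an incident has already been declared (anything that is not SEV-1 or SEV-2 falls through to SEV-3).

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical encoding of the severity levels listed above."""
    SEV1 = 1  # production outage or data loss risk
    SEV2 = 2  # major degradation or partial outage
    SEV3 = 3  # minor degradation


def classify(*, full_outage=False, data_loss_risk=False,
             major_degradation=False, partial_outage=False):
    """Map observed impact flags (assumed names) to a severity level.

    Assumes an incident is already declared, so the lowest result is SEV3.
    """
    if full_outage or data_loss_risk:
        return Severity.SEV1
    if major_degradation or partial_outage:
        return Severity.SEV2
    return Severity.SEV3
```

For example, `classify(partial_outage=True)` returns `Severity.SEV2`. When several flags are set, the most severe level wins because the checks run from SEV-1 downward.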
Playbook
- Triage: confirm impact and scope, then start an incident log
- Stabilize: stop the bleeding (rollback, feature flag, rate limit)
- Diagnose: use dashboards, logs, and traces to find the cause
- Resolve: apply fix, verify, communicate
- Learn: write a postmortem and create follow-up issues
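
The playbook phases above can be sketched as an ordered list with a small incident log that tracks which phase comes next. This is only an illustration under stated assumptions: `PHASES`, `IncidentLog`, `record`, and `next_phase` are hypothetical names, and the log is kept in memory rather than in whatever tool the team actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Phase names taken from the playbook above, in order.
PHASES = ("triage", "stabilize", "diagnose", "resolve", "learn")


@dataclass
class IncidentLog:
    """Hypothetical in-memory incident log, started during triage."""
    title: str
    entries: list = field(default_factory=list)

    def record(self, phase, note):
        """Append a timestamped (UTC) note under a known playbook phase."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        self.entries.append((datetime.now(timezone.utc), phase, note))

    def next_phase(self):
        """Return the first phase with no entry yet, or None if all have one."""
        seen = {phase for _, phase, _ in self.entries}
        for phase in PHASES:
            if phase not in seen:
                return phase
        return None
```

A log with only a triage entry reports `"stabilize"` as the next phase, which gives responders a cheap reminder not to jump to diagnosis before the impact is contained.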
Required artifacts
- Runbook per service (common failures + steps)
- Postmortem template stored in docs/