Incident Response

Severity

  • SEV-1: production outage or data loss risk
  • SEV-2: major degradation or partial outage
  • SEV-3: minor degradation
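The severity levels above can be sketched as a small lookup, which may help when wiring them into alerting or ticketing. This is a minimal illustration only: the `classify` function and its boolean inputs are assumptions, since the levels above do not say how outage or degradation is measured.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"  # production outage or data loss risk
    SEV2 = "SEV-2"  # major degradation or partial outage
    SEV3 = "SEV-3"  # minor degradation


def classify(*, full_outage=False, data_loss_risk=False,
             partial_outage=False, major_degradation=False):
    """Map observed impact to a severity level.

    The four flags are hypothetical inputs; a human responder (or an
    alerting rule) would decide their values.
    """
    if full_outage or data_loss_risk:
        return Severity.SEV1
    if partial_outage or major_degradation:
        return Severity.SEV2
    # Anything that reaches incident response but matches none of the
    # above is treated as minor degradation.
    return Severity.SEV3
```

The checks run from most to least severe, so an incident that matches several conditions gets the highest applicable level.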

Playbook

  • Triage: confirm impact and scope, then start an incident log
  • Stabilize: stop the bleeding (rollback, feature flag, rate limit)
  • Diagnose: use dashboards, logs, and traces to find the cause
  • Resolve: apply the fix, verify recovery, and communicate the outcome
  • Learn: write a postmortem and create follow-up issues
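The incident log started during triage could look something like the sketch below, which tags each timestamped entry with one of the five playbook phases. The `IncidentLog` class, its fields, and the `phases_missing` helper are hypothetical; the playbook does not prescribe a log format or tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Phase names taken from the playbook, in the order listed.
PHASES = ("Triage", "Stabilize", "Diagnose", "Resolve", "Learn")


@dataclass
class IncidentLog:
    title: str
    entries: list = field(default_factory=list)

    def record(self, phase, note):
        """Append a UTC-timestamped note under a playbook phase."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        self.entries.append((datetime.now(timezone.utc), phase, note))

    def phases_missing(self):
        """Return the phases that have no entry yet, in playbook order."""
        seen = {phase for _, phase, _ in self.entries}
        return [p for p in PHASES if p not in seen]
```

A call such as `log.record("Stabilize", "rolled back latest deploy")` adds one entry, and `phases_missing()` gives a quick view of which phases still have nothing recorded when the postmortem is written.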

Required artifacts

  • Runbook per service (common failures and the steps to resolve them)
  • Postmortem template stored in docs/
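A check like the following could flag missing artifacts, for example in CI. It is a sketch under stated assumptions: the `docs/runbooks/<service>.md` layout and the `docs/postmortem-template.md` filename are hypothetical. The list above only states that the postmortem template lives in docs/.

```python
from pathlib import Path


def missing_artifacts(repo_root, services,
                      runbook_dir="docs/runbooks",               # assumed location
                      template="docs/postmortem-template.md"):  # assumed filename
    """Return relative paths of required artifacts that do not exist."""
    root = Path(repo_root)
    missing = []
    # One runbook per service.
    for service in services:
        runbook = root / runbook_dir / f"{service}.md"
        if not runbook.is_file():
            missing.append(runbook.relative_to(root).as_posix())
    # A single shared postmortem template.
    if not (root / template).is_file():
        missing.append(template)
    return missing
```

An empty return value means every listed service has a runbook and the template is present. Adjust the two default paths to match the repository's actual layout.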