Deployment Runbook: Initial Production Deployment
Alert: First-Time Production Deployment
Severity: HIGH
Duration: 2-4 hours
Team: Platform Engineering + SRE
1. Pre-Deployment Checklist (1 hour before)
Infrastructure Readiness
- Terraform plan reviewed and approved
- All AWS resources validated (VPC, EKS, RDS, Redis)
- Database backups verified and tested
- DNS records prepared (ready for cutover)
- TLS certificates obtained and verified
- Load balancer configured and health checks passing
- All security groups properly configured
- IAM roles and policies reviewed
Application Readiness
- All tests passing (unit, integration, e2e)
- Code review completed and merged to main
- Docker images built and pushed to registry
- Helm charts validated against production environment
- Configuration secrets validated in AWS Secrets Manager
- Database migrations tested and verified
- Feature flags configured and tested
- Monitoring dashboards created in Storm
Communication Plan
- Announce maintenance window in Slack #status
- Brief support team on expected behavior
- Create incident channel for coordination
- On-call engineers confirmed and briefed
- Customer communication drafted (if applicable)
Observability Verification
- Storm observability stack healthy (all 8 services)
- Prometheus scraping targets verified
- Grafana dashboards loaded successfully
- Jaeger collector accepting traces
- ELK Stack receiving logs
- AlertManager routing alerts correctly
- Sample alerts tested and verified
2. Deployment Execution (2-3 hours)
Phase 1: Infrastructure Provisioning (30 min)
- All Terraform resources created without errors
- VPC with public/private subnets operational
- EKS cluster healthy with all nodes ready
- RDS database created and accessible
- Redis cluster created and accessible
- Security groups properly configured
Phase 2: Database Initialization (20 min)
- All database migrations completed
- Schema validated
- Seed data loaded
- Database accessible from pods
Phase 3: Application Deployment (40 min)
- All pods running and ready
- No image pull errors
- No crash loops
- Application logs show successful startup
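The pod checks above can be automated against the output of `kubectl get pods -o json`. The following is a minimal sketch, not the team's actual tooling. The function name `unhealthy_pods` is hypothetical, and the sketch assumes completed Job pods (phase `Succeeded`, for example migration jobs) should be ignored.

```python
# Waiting reasons that map to the "no image pull errors" and
# "no crash loops" checks in Phase 3.
BAD_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def unhealthy_pods(pod_list):
    """Return {pod_name: reason} for pods that are not running and ready.

    pod_list is the parsed JSON from `kubectl get pods -o json`.
    """
    problems = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # assumption: finished Job pods are not a failure
        if phase != "Running":
            problems[name] = f"phase={phase}"
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason")
            if reason in BAD_WAITING_REASONS:
                # A specific waiting reason is more useful than the phase.
                problems[name] = reason
            elif not cs.get("ready", False):
                problems.setdefault(name, "container not ready")
    return problems
```

One way to use it is to pipe the `kubectl` output into a script that calls `unhealthy_pods(json.load(sys.stdin))` and treats an empty result as a pass for Phase 3.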
Phase 4: Health Verification (20 min)
- API responds to health checks
- Database queries return results
- Cache is accessible
- No 5xx errors in logs
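A freshly deployed API may need a short time before it answers health checks, so a single failed probe is not conclusive. The sketch below polls a probe with a bounded number of retries. The names `http_ok` and `wait_until_healthy`, the retry counts, and any health endpoint path are assumptions, not values from this runbook.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=3.0):
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and 4xx/5xx responses.
        return False


def wait_until_healthy(probe, attempts=10, delay_s=5.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except Exception:
            pass  # a probe that raises counts as a failed attempt
        if attempt < attempts:
            sleep(delay_s)
    return False
```

For example, `wait_until_healthy(lambda: http_ok("https://prod.sparki.tools/healthz"))` would poll a hypothetical `/healthz` path. Substitute the endpoint the API actually exposes.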
Phase 5: Monitoring Activation (10 min)
- Low error rate (< 0.1%)
- Normal latency (P99 < 5s)
- No 5xx errors in logs
- External connectivity working
3. Post-Deployment Validation (30 min)
Automated Tests
Manual Verification
SLO Verification
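The two SLO thresholds used in this runbook (error rate below 0.1%, P99 latency below 5 s) can be checked from a sample of recent requests. This is a rough sketch under stated assumptions: the function names are hypothetical, only 5xx responses count as errors, and P99 uses the nearest-rank method, which may differ slightly from what the dashboards compute.

```python
ERROR_RATE_MAX = 0.001    # 0.1%, threshold from this runbook
P99_LATENCY_MAX_S = 5.0   # 5 s, threshold from this runbook


def error_rate(status_codes):
    """Fraction of responses with a 5xx status code."""
    if not status_codes:
        raise ValueError("no samples")
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return failures / len(status_codes)


def p99(latencies_s):
    """Nearest-rank 99th percentile of the given latencies (seconds)."""
    if not latencies_s:
        raise ValueError("no samples")
    ordered = sorted(latencies_s)
    # Integer form of ceil(0.99 * n), giving a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_met(status_codes, latencies_s):
    """True if both runbook thresholds hold for the sample."""
    return (error_rate(status_codes) < ERROR_RATE_MAX
            and p99(latencies_s) < P99_LATENCY_MAX_S)
```

Small samples make the percentile noisy, so it is worth collecting at least a few hundred requests before treating the result as a pass or fail.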
4. Rollback Procedure (On Failure)
Decision Tree
If deployment fails during infrastructure provisioning:
Rollback Verification
5. Post-Deployment Communication
Success Notification
Failure Notification
6. Monitoring Schedule (Post-Deployment)
First Hour
- Monitor every 5 minutes
- Check error rate, latency, logs
- Alert on any anomalies
First 4 Hours
- Monitor every 15 minutes
- Check all dashboards
- Verify SLO tracking is working
First 24 Hours
- Monitor every hour
- Check trend analysis
- Compare against baseline
Beyond 24 Hours
- Normal monitoring
- Watch for delayed issues
- Ready for hotfixes
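The schedule above can be encoded as a small lookup so that a reminder bot or script picks the right check interval. This is a minimal sketch. The name `check_interval` is hypothetical, and treating each window boundary as belonging to the next window (for example, exactly 1 hour falls under "First 4 Hours") is an assumption.

```python
from datetime import timedelta

# (end of window since deployment, interval between manual checks)
SCHEDULE = [
    (timedelta(hours=1), timedelta(minutes=5)),
    (timedelta(hours=4), timedelta(minutes=15)),
    (timedelta(hours=24), timedelta(hours=1)),
]


def check_interval(elapsed):
    """Return the check interval for the time elapsed since deployment.

    Returns None once the first 24 hours have passed, meaning normal
    monitoring applies.
    """
    for window_end, interval in SCHEDULE:
        if elapsed < window_end:
            return interval
    return None
```

For instance, `check_interval(timedelta(minutes=90))` falls in the "First 4 Hours" window and yields a 15-minute interval.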
7. Troubleshooting Guide
Issue: Database Connection Errors
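A useful first step is to separate network-level failures (security groups, DNS, routing) from credential or application-level failures. The sketch below only tests whether a TCP connection can be opened. The function name `tcp_reachable` is hypothetical, and the host and port must come from the actual RDS endpoint, which this runbook does not list.

```python
import socket


def tcp_reachable(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
```

If this returns False when run from inside a pod, the problem is likely in networking or security group rules. If it returns True, check credentials in AWS Secrets Manager and the application's connection settings next.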
Issue: High Error Rate
Issue: Slow Latency
Success Criteria Summary
| Metric | Threshold | Check |
|---|---|---|
| Pod Health | 100% running | kubectl get pods |
| Error Rate | < 0.1% | Storm Command Center |
| P99 Latency | < 5s | Storm Reliability Dashboard |
| Error Budget Remaining | > 99% | Grafana SLO dashboard |
| Database | Healthy | Test connection |
| Cache | Healthy | Test GET/SET |
| DNS | Resolving | nslookup prod.sparki.tools |
Contacts & Escalation
| Role | Contact | Slack |
|---|---|---|
| Platform Lead | @alexarno | #platform |
| SRE On-Call | @on-call-sre | #incidents |
| Database Admin | @dba-team | #database |
| Network Admin | @network-team | #infrastructure |
Document Version: 1.0
Last Updated: December 2025
Status: APPROVED FOR PRODUCTION