Deployment Runbook: Initial Production Deployment

Alert: First-Time Production Deployment

Severity: HIGH
Duration: 2-4 hours
Team: Platform Engineering + SRE

1. Pre-Deployment Checklist (1 hour before)

Infrastructure Readiness

  • Terraform plan reviewed and approved
  • All AWS resources validated (VPC, EKS, RDS, Redis)
  • Database backups verified and tested
  • DNS records prepared (ready for cutover)
  • TLS certificates obtained and verified
  • Load balancer configured and health checks passing
  • All security groups properly configured
  • IAM roles and policies reviewed

Application Readiness

  • All tests passing (unit, integration, e2e)
  • Code review completed and merged to main
  • Docker images built and pushed to registry
  • Helm charts validated against production environment
  • Configuration secrets validated in AWS Secrets Manager
  • Database migrations tested and verified
  • Feature flags configured and tested
  • Monitoring dashboards created in Storm

Communication Plan

  • Announce maintenance window in Slack #status
  • Brief support team on expected behavior
  • Create incident channel for coordination
  • On-call engineers confirmed and briefed
  • Customer communication drafted (if applicable)

Observability Verification

  • Storm observability stack healthy (all 8 services)
  • Prometheus scraping targets verified
  • Grafana dashboards loaded successfully
  • Jaeger collector accepting traces
  • ELK Stack receiving logs
  • AlertManager routing alerts correctly
  • Sample alerts tested and verified

2. Deployment Execution (2-3 hours)

Phase 1: Infrastructure Provisioning (30 min)

# 1. Prepare Terraform
cd infrastructure/terraform
terraform init -upgrade
terraform fmt -recursive
terraform validate

# 2. Plan Infrastructure
terraform plan \
  -var-file="environments/prod.tfvars" \
  -out=tfplan

# 3. Review Plan Output
# ❗ MUST review all changes before proceeding

# 4. Apply Infrastructure
terraform apply tfplan

# 5. Capture Outputs
terraform output -json > outputs.json
Success Criteria:
  • All Terraform resources created without errors
  • VPC with public/private subnets operational
  • EKS cluster healthy with all nodes ready
  • RDS database created and accessible
  • Redis cluster created and accessible
  • Security groups properly configured

Phase 2: Database Initialization (20 min)

# 1. Get Database Endpoint
DB_ENDPOINT=$(terraform output -raw database_endpoint)

# 2. Run Migrations
kubectl run migration-job \
  --image=ghcr.io/alexarno/sparki/engine:latest \
  --env="DATABASE_URL=postgres://${DB_USER}:${DB_PASS}@${DB_ENDPOINT}:5432/sparki" \
  -n sparki-engine \
  --restart=Never \
  --command -- /app/migrate up

# 3. Verify Migrations
# (kubectl run creates a pod, not a Job, so tail the pod's logs directly)
kubectl logs -f migration-job -n sparki-engine

# 4. Create Initial Data
kubectl run seed-job \
  --image=ghcr.io/alexarno/sparki/engine:latest \
  -n sparki-engine \
  --restart=Never \
  --command -- /app/seed-data
Success Criteria:
  • All database migrations completed
  • Schema validated
  • Seed data loaded
  • Database accessible from pods

Phase 3: Application Deployment (40 min)

# 1. Deploy Engine
./infrastructure/scripts/deploy.sh prod latest

# 2. Deploy Web UI
./infrastructure/scripts/deploy-web.sh prod latest

# 3. Verify Deployments
kubectl get deployments -n sparki-engine
kubectl get deployments -n sparki-web
kubectl get pods -n sparki-engine
kubectl get pods -n sparki-web

# 4. Check Logs for Errors
kubectl logs -f deployment/sparki-engine -n sparki-engine --tail=50
kubectl logs -f deployment/sparki-web -n sparki-web --tail=50
Success Criteria:
  • All pods running and ready
  • No image pull errors
  • No crash loops
  • Application logs show successful startup
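The "all pods running and ready" criterion can be checked mechanically rather than by eyeballing `kubectl get pods`. A minimal sketch of the parsing logic, shown against a captured sample of `--no-headers` output (the pod names are illustrative):

```shell
# Given `kubectl get pods --no-headers` output on stdin, exit 0 only if
# every pod is Running and fully ready (READY column like 1/1 or 2/2).
all_pods_ready() {
  awk '{
    split($2, r, "/");                                # READY column, e.g. "1/1"
    if ($3 != "Running" || r[1] != r[2]) { bad = 1 }  # any laggard fails the gate
  } END { exit bad }'
}

# Sample as it might appear from: kubectl get pods -n sparki-engine --no-headers
sample="sparki-engine-7d9f-abcde   1/1   Running   0   2m
sparki-engine-7d9f-fghij   1/1   Running   0   2m"

echo "$sample" | all_pods_ready && echo "all pods ready"
```

In the real flow, pipe `kubectl get pods -n sparki-engine --no-headers` into `all_pods_ready` and treat a nonzero exit as "do not proceed to Phase 4".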

Phase 4: Health Verification (20 min)

# 1. Run Health Checks
./infrastructure/scripts/health-check.sh prod

# 2. Test API Endpoints
POD=$(kubectl get pods -n sparki-engine -o name | head -1)
kubectl exec $POD -n sparki-engine -- \
  curl -v http://localhost:8080/health

# 3. Test Database Connectivity
kubectl exec $POD -n sparki-engine -- \
  curl -v http://localhost:8080/api/projects

# 4. Test Cache Connectivity
kubectl exec $POD -n sparki-engine -- \
  curl -v http://localhost:8080/api/cache/health
Success Criteria:
  • API responds to health checks
  • Database queries return results
  • Cache is accessible
  • No 5xx errors in logs
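The "no 5xx errors in logs" criterion reduces to a grep over request-log lines. A minimal sketch; the log format here (method, path, status, duration) is an assumption, so adapt the pattern to your actual access-log shape:

```shell
# Exit 0 if any HTTP 5xx status appears in the log lines on stdin.
has_5xx() {
  grep -Eq ' 5[0-9]{2} '
}

# Sample lines as they might appear in the engine's request log:
sample_logs='GET /health 200 3ms
POST /api/projects 201 88ms
GET /api/projects 200 41ms'

if echo "$sample_logs" | has_5xx; then
  echo "5xx detected -- investigate before proceeding"
else
  echo "no 5xx errors"
fi
```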

Phase 5: Monitoring Activation (10 min)

# 1. Update DNS (if not pre-provisioned)
# Update DNS to point to new load balancer

# 2. Monitor Error Rate
# Check Storm Command Center dashboard
# - Error rate should be < 0.1%
# - P99 latency should be < 5s

# 3. Monitor Log Volume
# Check Kibana for any ERROR level logs
# Review recent ERROR entries

# 4. Test from Outside
curl -v https://prod.sparki.tools/health
Success Criteria:
  • Low error rate (< 0.1%)
  • Normal latency (P99 < 5s)
  • No 5xx errors in logs
  • External connectivity working
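The < 0.1% error-rate gate is easy to get wrong with shell floating point, but it reduces to integer arithmetic. A minimal sketch, assuming you can pull total and 5xx request counts for the window (e.g. from Prometheus); the counts below are stand-ins:

```shell
# Pass (exit 0) iff errors/total < 0.1%.
# Integer form: errors * 1000 < total  <=>  errors/total < 0.001
error_rate_ok() {
  local total="$1" errors="$2"
  [ $((errors * 1000)) -lt "$total" ]
}

error_rate_ok 50000 20 && echo "within budget"                 # 0.04%
error_rate_ok 50000 80 || echo "rollback threshold breached"   # 0.16%
```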

3. Post-Deployment Validation (30 min)

Automated Tests

# 1. Run Smoke Tests
./infrastructure/scripts/smoke-tests.sh prod

# 2. Run E2E Tests
kubectl run e2e-tests \
  --image=ghcr.io/alexarno/sparki/e2e:latest \
  -n sparki-test \
  --env="APP_URL=https://prod.sparki.tools" \
  --restart=Never --attach --rm

Manual Verification

# 1. Create Test Project
curl -X POST https://prod.sparki.tools/api/projects \
  -H "Content-Type: application/json" \
  -d '{"name": "test-project"}'

# 2. Trigger Detection
curl -X POST https://prod.sparki.tools/api/projects/test-project/detect

# 3. Monitor Execution
# Check Storm dashboards for metrics
# - Detection completed successfully
# - No pipeline generation errors

SLO Verification

# 1. Check Error Budget
# Navigate to Grafana → Reliability SLO dashboard
# Verify: Budget remaining > 95%

# 2. Check Burn Rate
# Verify: 30-min burn rate < 1x
# Verify: 24-hour burn rate < 1x

# 3. Check Key Metrics
# P50 latency: < 100ms
# P95 latency: < 1s
# P99 latency: < 5s
# Error rate: < 0.1%
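Burn rate is the observed error rate divided by the error budget (1 minus the SLO target); a value under 1x means the budget lasts the full window. A minimal sketch of that arithmetic, assuming a 99.9% SLO target (adjust to your actual SLO definition):

```shell
# burn_rate <observed error ratio> <SLO target>
# e.g. burn_rate 0.0005 0.999 -> observed 0.05% against a 0.1% budget
burn_rate() {
  awk -v err="$1" -v slo="$2" 'BEGIN { printf "%.2f\n", err / (1 - slo) }'
}

burn_rate 0.0005 0.999   # 0.50 -- consuming budget at half speed, safe
burn_rate 0.002  0.999   # 2.00 -- budget exhausted in half the window, alert
```

This is the same quantity the Grafana burn-rate panels display; computing it by hand is mainly useful for sanity-checking a dashboard you don't yet trust.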

4. Rollback Procedure (On Failure)

Decision Tree

If deployment fails during infrastructure provisioning:

terraform destroy -var-file="environments/prod.tfvars" -auto-approve
# Restart at Phase 1 after fixing issues

If deployment fails during application deployment:

./infrastructure/scripts/rollback.sh prod
# This automatically reverts to previous version

If deployment passes but shows high errors:

# Give system 5 minutes to stabilize
# Monitor error rate in Storm Command Center

# If error rate > 1% sustained for 5 min:
./infrastructure/scripts/rollback.sh prod

# If issues continue after rollback, follow escalation
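The "error rate > 1% sustained for 5 min" rule deserves precision: a single spike should not trigger rollback, only a full window above threshold. A minimal sketch of that decision, assuming one error-rate sample (in %) per minute; the sample values are illustrative:

```shell
# should_rollback <sample> <sample> ... : exit 0 (roll back) only if every
# per-minute sample in the window is above the 1% threshold.
should_rollback() {
  local sample
  for sample in "$@"; do
    # any sample at or below 1% means the breach was not sustained
    awk -v s="$sample" 'BEGIN { exit (s > 1.0) ? 0 : 1 }' || return 1
  done
  return 0
}

should_rollback 1.4 2.1 1.8 1.2 1.6 && echo "rollback"         # sustained breach
should_rollback 1.4 0.3 1.8 1.2 1.6 || echo "keep monitoring"  # transient spike
```

In practice the samples would come from the same query driving the Storm Command Center panel; the point of the sketch is the "every sample, not any sample" semantics.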

Rollback Verification

# 1. Verify Previous Version Active
kubectl get deployments -n sparki-engine -o jsonpath='{.items[0].spec.template.spec.containers[0].image}'

# 2. Verify Health Checks Pass
./infrastructure/scripts/health-check.sh prod

# 3. Monitor Error Rate
# Should return to < 0.1% within 2 minutes

5. Post-Deployment Communication

Success Notification

✅ Production deployment successful!

Version: [git-sha]
Deployed at: [timestamp]
Deployments: engine, web
Status: All healthy

Error Budget: [X]% consumed
Error Rate: [Y]%
P99 Latency: [Z]ms

Next: Monitor for 1 hour

Failure Notification

⚠️ Production deployment rolled back

Version: [git-sha]
Reason: [error description]
Rolled back to: [previous version]
Status: All healthy

Action: Platform team investigating
Next: Post-incident review in 24h

6. Monitoring Schedule (Post-Deployment)

First Hour

  • Monitor every 5 minutes
  • Check error rate, latency, logs
  • Alert on any anomalies

First 4 Hours

  • Monitor every 15 minutes
  • Check all dashboards
  • Verify SLO tracking working

First 24 Hours

  • Monitor every hour
  • Check trend analysis
  • Compare against baseline

Beyond 24 Hours

  • Normal monitoring
  • Watch for delayed issues
  • Ready for hotfixes
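The schedule above can be expressed as a simple lookup, e.g. to drive a watch loop or to tune alert evaluation intervals. A minimal sketch; the return value is the check interval in minutes, with 0 standing in for "normal monitoring cadence":

```shell
# check_interval <minutes since deploy> -> check interval in minutes
check_interval() {
  local elapsed="$1"
  if   [ "$elapsed" -lt 60 ];   then echo 5    # first hour: every 5 min
  elif [ "$elapsed" -lt 240 ];  then echo 15   # first 4 hours: every 15 min
  elif [ "$elapsed" -lt 1440 ]; then echo 60   # first 24 hours: hourly
  else                               echo 0    # beyond 24h: normal cadence
  fi
}

check_interval 30     # -> 5
check_interval 120    # -> 15
check_interval 600    # -> 60
```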

7. Troubleshooting Guide

Issue: Database Connection Errors

# 1. Check Connection
kubectl exec <pod> -n sparki-engine -- \
  psql -h $DB_HOST -U $DB_USER -d sparki -c "SELECT 1"

# 2. Check Security Group
aws ec2 describe-security-groups \
  --group-ids sg-xxxxx \
  --query 'SecurityGroups[0].IpPermissions'

# 3. Check Network (raw TCP reachability; curl does not speak the postgres protocol)
kubectl exec <pod> -n sparki-engine -- \
  curl -v telnet://$DB_HOST:5432

# 4. Check Secrets
kubectl get secrets -n sparki-engine
kubectl describe secret db-credentials -n sparki-engine

Issue: High Error Rate

# 1. Check Application Logs
kubectl logs deployment/sparki-engine -n sparki-engine --tail=100 | grep ERROR

# 2. Check Jaeger Traces
# Filter for errors in Jaeger UI
# Look for common error patterns

# 3. Check Pod Status
kubectl describe pod <pod-name> -n sparki-engine

# 4. Check Resource Limits
kubectl top pods -n sparki-engine
kubectl describe node <node>

Issue: Slow Latency

# 1. Check P99 Latency
# Storm dashboard: Reliability SLO → SLO Compliance panel

# 2. Check Query Performance
# Elasticsearch: Search for slow queries
# Look for duration_ms > 1000

# 3. Check Database Connections
# RDS console: Check connection count vs max

# 4. Check Cache Hit Rate
# Storm dashboard: Look for cache metrics

Success Criteria Summary

| Metric       | Threshold       | Check                       |
|--------------|-----------------|-----------------------------|
| Pod Health   | 100% running    | kubectl get pods            |
| Error Rate   | < 0.1%          | Storm Command Center        |
| P99 Latency  | < 5s            | Storm Reliability Dashboard |
| Error Budget | > 95% remaining | Grafana SLO dashboard       |
| Database     | Healthy         | Test connection             |
| Cache        | Healthy         | Test GET/SET                |
| DNS          | Resolving       | nslookup prod.sparki.tools  |

Contacts & Escalation

| Role           | Contact       | Slack           |
|----------------|---------------|-----------------|
| Platform Lead  | @alexarno     | #platform       |
| SRE On-Call    | @on-call-sre  | #incidents      |
| Database Admin | @dba-team     | #database       |
| Network Admin  | @network-team | #infrastructure |

Document Version: 1.0
Last Updated: December 2025
Status: APPROVED FOR PRODUCTION