Deployment Runbook: Initial Production Deployment

Alert: First-Time Production Deployment

Severity: HIGH
Duration: 2-4 hours
Team: Platform Engineering + SRE

1. Pre-Deployment Checklist (1 hour before)

Infrastructure Readiness

  • Terraform plan reviewed and approved
  • All AWS resources validated (VPC, EKS, RDS, Redis)
  • Database backups verified and tested
  • DNS records prepared (ready for cutover)
  • TLS certificates obtained and verified
  • Load balancer configured and health checks passing
  • All security groups properly configured
  • IAM roles and policies reviewed

Application Readiness

  • All tests passing (unit, integration, e2e)
  • Code review completed and merged to main
  • Docker images built and pushed to registry
  • Helm charts validated against production environment
  • Configuration secrets validated in AWS Secrets Manager
  • Database migrations tested and verified
  • Feature flags configured and tested
  • Monitoring dashboards created in Storm

Communication Plan

  • Announce maintenance window in Slack #status
  • Brief support team on expected behavior
  • Create incident channel for coordination
  • On-call engineers confirmed and briefed
  • Customer communication drafted (if applicable)

Observability Verification

  • Storm observability stack healthy (all 8 services)
  • Prometheus scraping targets verified
  • Grafana dashboards loaded successfully
  • Jaeger collector accepting traces
  • ELK Stack receiving logs
  • AlertManager routing alerts correctly
  • Sample alerts tested and verified

2. Deployment Execution (2-3 hours)

Phase 1: Infrastructure Provisioning (30 min)

# 1. Prepare Terraform
cd infrastructure/terraform
terraform init -upgrade
terraform fmt -recursive
terraform validate

# 2. Plan Infrastructure
terraform plan \
  -var-file="environments/prod.tfvars" \
  -out=tfplan

# 3. Review Plan Output
# ❗ MUST review all changes before proceeding

# 4. Apply Infrastructure
terraform apply tfplan

# 5. Capture Outputs
terraform output -json > outputs.json
Success Criteria:
  • All Terraform resources created without errors
  • VPC with public/private subnets operational
  • EKS cluster healthy with all nodes ready
  • RDS database created and accessible
  • Redis cluster created and accessible
  • Security groups properly configured

Phase 2: Database Initialization (20 min)

# 1. Get Database Endpoint
DB_ENDPOINT=$(terraform output -raw database_endpoint)

# 2. Run Migrations
kubectl run migration-job \
  --image=ghcr.io/alexarno/sparki/engine:latest \
  --env="DATABASE_URL=postgres://${DB_USER}:${DB_PASS}@${DB_ENDPOINT}:5432/sparki" \
  -n sparki-engine \
  --restart=Never \
  --command -- /app/migrate up

# 3. Verify Migrations
# (kubectl run creates a pod, not a Job, so tail the pod's logs directly)
kubectl logs -f migration-job -n sparki-engine

# 4. Create Initial Data
kubectl run seed-job \
  --image=ghcr.io/alexarno/sparki/engine:latest \
  -n sparki-engine \
  --restart=Never \
  --command -- /app/seed-data
Success Criteria:
  • All database migrations completed
  • Schema validated
  • Seed data loaded
  • Database accessible from pods

Phase 3: Application Deployment (40 min)

# 1. Deploy Engine
./infrastructure/scripts/deploy.sh prod latest

# 2. Deploy Web UI
./infrastructure/scripts/deploy-web.sh prod latest

# 3. Verify Deployments
kubectl get deployments -n sparki-engine
kubectl get deployments -n sparki-web
kubectl get pods -n sparki-engine
kubectl get pods -n sparki-web

# 4. Check Logs for Errors
kubectl logs -f deployment/sparki-engine -n sparki-engine --tail=50
kubectl logs -f deployment/sparki-web -n sparki-web --tail=50
Success Criteria:
  • All pods running and ready
  • No image pull errors
  • No crash loops
  • Application logs show successful startup
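The "all pods running and ready" criterion can be checked mechanically rather than by eyeballing `kubectl get pods`. A minimal sketch of the parsing logic, shown against a captured sample of `--no-headers` output (the pod names are illustrative):

```shell
# Given `kubectl get pods --no-headers` output on stdin, exit 0 only if
# every pod is Running and fully ready (READY column like 1/1 or 2/2).
all_pods_ready() {
  awk '{
    split($2, r, "/");                                # READY column, e.g. "1/1"
    if ($3 != "Running" || r[1] != r[2]) { bad = 1 }  # any laggard fails the gate
  } END { exit bad }'
}

# Sample as it might appear from: kubectl get pods -n sparki-engine --no-headers
sample="sparki-engine-7d9f-abcde   1/1   Running   0   2m
sparki-engine-7d9f-fghij   1/1   Running   0   2m"

echo "$sample" | all_pods_ready && echo "all pods ready"
```

In the real flow, pipe `kubectl get pods -n sparki-engine --no-headers` into `all_pods_ready` and treat a nonzero exit as "do not proceed to Phase 4".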

Phase 4: Health Verification (20 min)

# 1. Run Health Checks
./infrastructure/scripts/health-check.sh prod

# 2. Test API Endpoints
POD=$(kubectl get pods -n sparki-engine -o name | head -1)
kubectl exec $POD -n sparki-engine -- \
  curl -v http://localhost:8080/health

# 3. Test Database Connectivity
kubectl exec $POD -n sparki-engine -- \
  curl -v http://localhost:8080/api/projects

# 4. Test Cache Connectivity
kubectl exec $POD -n sparki-engine -- \
  curl -v http://localhost:8080/api/cache/health
Success Criteria:
  • API responds to health checks
  • Database queries return results
  • Cache is accessible
  • No 5xx errors in logs
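The "no 5xx errors in logs" criterion reduces to a grep over request-log lines. A minimal sketch; the log format here (method, path, status, duration) is an assumption, so adapt the pattern to your actual access-log shape:

```shell
# Exit 0 if any HTTP 5xx status appears in the log lines on stdin.
has_5xx() {
  grep -Eq ' 5[0-9]{2} '
}

# Sample lines as they might appear in the engine's request log:
sample_logs='GET /health 200 3ms
POST /api/projects 201 88ms
GET /api/projects 200 41ms'

if echo "$sample_logs" | has_5xx; then
  echo "5xx detected -- investigate before proceeding"
else
  echo "no 5xx errors"
fi
```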

Phase 5: Monitoring Activation (10 min)

# 1. Update DNS (if not pre-provisioned)
# Update DNS to point to new load balancer

# 2. Monitor Error Rate
# Check Storm Command Center dashboard
# - Error rate should be < 0.1%
# - P99 latency should be < 5s

# 3. Monitor Log Volume
# Check Kibana for any ERROR level logs
# Review recent ERROR entries

# 4. Test from Outside
curl -v https://prod.sparki.tools/health
Success Criteria:
  • Low error rate (< 0.1%)
  • Normal latency (P99 < 5s)
  • No 5xx errors in logs
  • External connectivity working
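The < 0.1% error-rate gate is easy to get wrong with shell floating point, but it reduces to integer arithmetic. A minimal sketch, assuming you can pull total and 5xx request counts for the window (e.g. from Prometheus); the counts below are stand-ins:

```shell
# Pass (exit 0) iff errors/total < 0.1%.
# Integer form: errors * 1000 < total  <=>  errors/total < 0.001
error_rate_ok() {
  local total="$1" errors="$2"
  [ $((errors * 1000)) -lt "$total" ]
}

error_rate_ok 50000 20 && echo "within budget"                 # 0.04%
error_rate_ok 50000 80 || echo "rollback threshold breached"   # 0.16%
```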

3. Post-Deployment Validation (30 min)

Automated Tests

# 1. Run Smoke Tests
./infrastructure/scripts/smoke-tests.sh prod

# 2. Run E2E Tests
kubectl run e2e-tests \
  --image=ghcr.io/alexarno/sparki/e2e:latest \
  -n sparki-test \
  --env="APP_URL=https://prod.sparki.tools" \
  --restart=Never --attach --rm

Manual Verification

# 1. Create Test Project
curl -X POST https://prod.sparki.tools/api/projects \
  -H "Content-Type: application/json" \
  -d '{"name": "test-project"}'

# 2. Trigger Detection
curl -X POST https://prod.sparki.tools/api/projects/test-project/detect

# 3. Monitor Execution
# Check Storm dashboards for metrics
# - Detection completed successfully
# - No pipeline generation errors

SLO Verification

# 1. Check Error Budget
# Navigate to Grafana → Reliability SLO dashboard
# Verify: Budget remaining > 95%

# 2. Check Burn Rate
# Verify: 30-min burn rate < 1x
# Verify: 24-hour burn rate < 1x

# 3. Check Key Metrics
# P50 latency: < 100ms
# P95 latency: < 1s
# P99 latency: < 5s
# Error rate: < 0.1%
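Burn rate is the observed error rate divided by the error budget (1 minus the SLO target); a value under 1x means the budget lasts the full window. A minimal sketch of that arithmetic, assuming a 99.9% SLO target (adjust to your actual SLO definition):

```shell
# burn_rate <observed error ratio> <SLO target>
# e.g. burn_rate 0.0005 0.999 -> observed 0.05% against a 0.1% budget
burn_rate() {
  awk -v err="$1" -v slo="$2" 'BEGIN { printf "%.2f\n", err / (1 - slo) }'
}

burn_rate 0.0005 0.999   # 0.50 -- consuming budget at half speed, safe
burn_rate 0.002  0.999   # 2.00 -- budget exhausted in half the window, alert
```

This is the same quantity the Grafana burn-rate panels display; computing it by hand is mainly useful for sanity-checking a dashboard you don't yet trust.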

4. Rollback Procedure (On Failure)

Decision Tree

If deployment fails during infrastructure provisioning:

terraform destroy -var-file="environments/prod.tfvars" -auto-approve
# Restart at Phase 1 after fixing issues

If deployment fails during application deployment:

./infrastructure/scripts/rollback.sh prod
# This automatically reverts to previous version

If deployment passes but shows high errors:

# Give system 5 minutes to stabilize
# Monitor error rate in Storm Command Center

# If error rate > 1% sustained for 5 min:
./infrastructure/scripts/rollback.sh prod

# If issues continue after rollback, follow escalation
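The "error rate > 1% sustained for 5 min" rule deserves precision: a single spike should not trigger rollback, only a full window above threshold. A minimal sketch of that decision, assuming one error-rate sample (in %) per minute; the sample values are illustrative:

```shell
# should_rollback <sample> <sample> ... : exit 0 (roll back) only if every
# per-minute sample in the window is above the 1% threshold.
should_rollback() {
  local sample
  for sample in "$@"; do
    # any sample at or below 1% means the breach was not sustained
    awk -v s="$sample" 'BEGIN { exit (s > 1.0) ? 0 : 1 }' || return 1
  done
  return 0
}

should_rollback 1.4 2.1 1.8 1.2 1.6 && echo "rollback"         # sustained breach
should_rollback 1.4 0.3 1.8 1.2 1.6 || echo "keep monitoring"  # transient spike
```

In practice the samples would come from the same query driving the Storm Command Center panel; the point of the sketch is the "every sample, not any sample" semantics.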

Rollback Verification

# 1. Verify Previous Version Active
kubectl get deployments -n sparki-engine -o jsonpath='{.items[0].spec.template.spec.containers[0].image}'

# 2. Verify Health Checks Pass
./infrastructure/scripts/health-check.sh prod

# 3. Monitor Error Rate
# Should return to < 0.1% within 2 minutes

5. Post-Deployment Communication

Success Notification

✅ Production deployment successful!

Version: [git-sha]
Deployed at: [timestamp]
Deployments: engine, web
Status: All healthy

Error Budget: [X]% consumed
Error Rate: [Y]%
P99 Latency: [Z]ms

Next: Monitor for 1 hour

Failure Notification

⚠️ Production deployment rolled back

Version: [git-sha]
Reason: [error description]
Rolled back to: [previous version]
Status: All healthy

Action: Platform team investigating
Next: Post-incident review in 24h

6. Monitoring Schedule (Post-Deployment)

First Hour

  • Monitor every 5 minutes
  • Check error rate, latency, logs
  • Alert on any anomalies

First 4 Hours

  • Monitor every 15 minutes
  • Check all dashboards
  • Verify SLO tracking working

First 24 Hours

  • Monitor every hour
  • Check trend analysis
  • Compare against baseline

Beyond 24 Hours

  • Normal monitoring
  • Watch for delayed issues
  • Ready for hotfixes
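The schedule above can be expressed as a simple lookup, e.g. to drive a watch loop or to tune alert evaluation intervals. A minimal sketch; the return value is the check interval in minutes, with 0 standing in for "normal monitoring cadence":

```shell
# check_interval <minutes since deploy> -> check interval in minutes
check_interval() {
  local elapsed="$1"
  if   [ "$elapsed" -lt 60 ];   then echo 5    # first hour: every 5 min
  elif [ "$elapsed" -lt 240 ];  then echo 15   # first 4 hours: every 15 min
  elif [ "$elapsed" -lt 1440 ]; then echo 60   # first 24 hours: hourly
  else                               echo 0    # beyond 24h: normal cadence
  fi
}

check_interval 30     # -> 5
check_interval 120    # -> 15
check_interval 600    # -> 60
```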

7. Troubleshooting Guide

Issue: Database Connection Errors

# 1. Check Connection
kubectl exec <pod> -n sparki-engine -- \
  psql -h $DB_HOST -U $DB_USER -d sparki -c "SELECT 1"

# 2. Check Security Group
aws ec2 describe-security-groups \
  --group-ids sg-xxxxx \
  --query 'SecurityGroups[0].IpPermissions'

# 3. Check Network (raw TCP reachability; curl does not speak the postgres protocol)
kubectl exec <pod> -n sparki-engine -- \
  curl -v telnet://$DB_HOST:5432

# 4. Check Secrets
kubectl get secrets -n sparki-engine
kubectl describe secret db-credentials -n sparki-engine

Issue: High Error Rate

# 1. Check Application Logs
kubectl logs deployment/sparki-engine -n sparki-engine --tail=100 | grep ERROR

# 2. Check Jaeger Traces
# Filter for errors in Jaeger UI
# Look for common error patterns

# 3. Check Pod Status
kubectl describe pod <pod-name> -n sparki-engine

# 4. Check Resource Limits
kubectl top pods -n sparki-engine
kubectl describe node <node>

Issue: Slow Latency

# 1. Check P99 Latency
# Storm dashboard: Reliability SLO → SLO Compliance panel

# 2. Check Query Performance
# Elasticsearch: Search for slow queries
# Look for duration_ms > 1000

# 3. Check Database Connections
# RDS console: Check connection count vs max

# 4. Check Cache Hit Rate
# Storm dashboard: Look for cache metrics

Success Criteria Summary

| Metric       | Threshold       | Check                       |
|--------------|-----------------|-----------------------------|
| Pod Health   | 100% running    | kubectl get pods            |
| Error Rate   | < 0.1%          | Storm Command Center        |
| P99 Latency  | < 5s            | Storm Reliability Dashboard |
| Error Budget | > 95% remaining | Grafana SLO dashboard       |
| Database     | Healthy         | Test connection             |
| Cache        | Healthy         | Test GET/SET                |
| DNS          | Resolving       | nslookup prod.sparki.tools  |

Contacts & Escalation

| Role           | Contact       | Slack           |
|----------------|---------------|-----------------|
| Platform Lead  | @alexarno     | #platform       |
| SRE On-Call    | @on-call-sre  | #incidents      |
| Database Admin | @dba-team     | #database       |
| Network Admin  | @network-team | #infrastructure |

Document Version: 1.0
Last Updated: December 2025
Status: APPROVED FOR PRODUCTION