# Runbook: Blue-Green Deployment Strategy

## Overview

Blue-green deployment enables zero-downtime releases by maintaining two identical production environments:

- **Blue**: the currently active environment (receives traffic)
- **Green**: the standby environment (stages the new version)
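The color bookkeeping that the rest of this runbook relies on can be captured in one small helper. This is an illustrative sketch, not part of the original scripts; `standby_of` is a name introduced here:

```shell
# Illustrative helper: given the active color, return the standby color.
# The atomic cutover later in this runbook is then just a selector patch
# on the load-balancer Service, pointing it at the standby color, e.g.:
#   kubectl patch service sparki-engine-lb -n sparki-engine \
#     -p '{"spec":{"selector":{"color":"'$(standby_of "$ACTIVE_COLOR")'"}}}'
standby_of() {
  [ "$1" = "blue" ] && echo "green" || echo "blue"
}
```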
## 1. Pre-Deployment Preparation

### Environment Setup

```bash
# 1. Set variables
CURRENT_VERSION=$(kubectl get deployment sparki-engine -n sparki-engine \
  -o jsonpath='{.spec.template.spec.containers[0].image}' | cut -d: -f2)
NEW_VERSION="v1.2.3"   # New version to deploy
NAMESPACE="sparki-engine"
TIMEOUT=600            # 10 minutes

# 2. Determine blue/green status
ACTIVE_COLOR=$(kubectl get service sparki-engine-lb -n $NAMESPACE \
  -o jsonpath='{.spec.selector.color}')
STANDBY_COLOR=$([ "$ACTIVE_COLOR" = "blue" ] && echo "green" || echo "blue")

# Either color may be active; do not assume blue is live
echo "Active color:  $ACTIVE_COLOR"
echo "Standby color: $STANDBY_COLOR"
echo "Deploying to:  $STANDBY_COLOR"
```
### Pre-Deployment Checks

```bash
# 1. Verify current system health
./infrastructure/scripts/health-check.sh prod

# 2. Back up current configuration
kubectl get deployment sparki-engine -n $NAMESPACE \
  -o yaml > deployment-backup-$CURRENT_VERSION.yaml

# 3. Verify the new image exists
aws ecr describe-images \
  --repository-name sparki/engine \
  --image-ids imageTag=$NEW_VERSION

# 4. Validate Helm charts
helm template sparki ./helm-charts/engine \
  --values values-prod.yaml \
  --set image.tag=$NEW_VERSION > /tmp/manifest-validation.yaml

# 5. Check database compatibility
# Verify migration scripts are included in the new version
# Check the MIGRATION_REQUIRED flag in the release notes
```
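A cheap pre-flight guard worth running before any of the checks above: validate that `NEW_VERSION` at least looks like a release tag before touching the cluster. This is an illustrative addition, not part of the original checklist; the glob is deliberately loose:

```shell
# Illustrative guard: reject obviously malformed version tags (e.g. "latest")
# before any cluster mutation. Accepts vMAJOR.MINOR.PATCH-style tags.
valid_version_tag() {
  case "$1" in
    v[0-9]*.[0-9]*.[0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch:
#   valid_version_tag "$NEW_VERSION" || { echo "bad NEW_VERSION"; exit 1; }
```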
## 2. Deploy to Standby (Green)

### Scale Green Environment

```bash
# 1. Create green deployment
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sparki-engine-green
  namespace: $NAMESPACE
  labels:
    app: sparki-engine
    color: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sparki-engine
      color: green
  template:
    metadata:
      labels:
        app: sparki-engine
        color: green
    spec:
      containers:
      - name: engine
        image: ghcr.io/alexarno/sparki/engine:$NEW_VERSION
        ports:
        - containerPort: 8080
        env:
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: host
        - name: ENVIRONMENT
          value: production
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - sparki-engine
              topologyKey: kubernetes.io/hostname
EOF

# 2. Wait for green deployment
kubectl rollout status deployment/sparki-engine-green \
  -n $NAMESPACE --timeout=${TIMEOUT}s

# 3. Verify green pods ready
READY_PODS=$(kubectl get deployment sparki-engine-green -n $NAMESPACE \
  -o jsonpath='{.status.readyReplicas}')
echo "Green pods ready: $READY_PODS / 3"
```
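One subtlety with the readiness check above: `.status.readyReplicas` is omitted entirely while zero pods are ready, so `$READY_PODS` can be an empty string and break integer comparisons. A small guard sketch (illustrative, not in the original script):

```shell
# Treat an empty readyReplicas field as 0 and fail unless the ready count
# has reached the desired replica count.
all_replicas_ready() {
  local ready="${1:-0}" desired="$2"
  [ "$ready" -ge "$desired" ]
}

# Usage sketch:
#   all_replicas_ready "$READY_PODS" 3 || { echo "green not ready"; exit 1; }
```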
### Smoke Tests on Green

```bash
# 1. Get green service endpoint
GREEN_SERVICE=$(kubectl get svc sparki-engine-green -n $NAMESPACE \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# 2. Run health checks
echo "Testing health endpoint..."
curl -v http://$GREEN_SERVICE:8080/health

# 3. Run API tests
echo "Testing API endpoints..."
curl -X GET http://$GREEN_SERVICE:8080/api/projects \
  -H "Authorization: Bearer $TEST_TOKEN"

# 4. Run integration tests
# Note: kubectl run has no --wait flag; --attach blocks until the pod exits
# and makes kubectl return a non-zero status if the test container fails
echo "Running integration tests..."
kubectl run green-tests \
  --image=ghcr.io/alexarno/sparki/e2e:latest \
  -n sparki-test \
  --env="APP_URL=http://$GREEN_SERVICE:8080" \
  --restart=Never --rm -i --attach \
  --command -- bash -c "npm test -- --testNamePattern='smoke'"

# 5. Verify test results (capture $? immediately after kubectl run)
TEST_EXIT_CODE=$?
if [ $TEST_EXIT_CODE -ne 0 ]; then
  echo "❌ Green tests failed!"
  # See Rollback section below
  exit 1
fi
echo "✅ Green tests passed!"
```
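The curl probes above can fail transiently while the green load balancer is still warming up, which would abort the deploy for no real reason. A generic retry wrapper is a reasonable addition here (illustrative sketch; `retry` is a name introduced in this runbook, not an existing tool):

```shell
# Retry a command up to N times, one second apart, before declaring failure.
# Intended for the health-probe curls, which can 503 during LB warm-up.
retry() {
  local attempts="$1"; shift
  local i=1
  while true; do
    "$@" && return 0                      # command succeeded
    [ "$i" -ge "$attempts" ] && return 1  # out of attempts
    i=$((i + 1))
    sleep 1
  done
}

# Usage sketch:
#   retry 5 curl -fsS "http://$GREEN_SERVICE:8080/health"
```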
### Database Migration (if needed)

```bash
# 1. Check if migration is required via the MIGRATION_REQUIRED image label
# (matches the flag named in the pre-deployment checks; requires the image
# to be available locally, so docker pull it first)
REQUIRES_MIGRATION=$(docker inspect ghcr.io/alexarno/sparki/engine:$NEW_VERSION \
  --format '{{json .Config.Labels}}' | grep -i "MIGRATION_REQUIRED" || true)
if [ -n "$REQUIRES_MIGRATION" ]; then
  echo "Running database migrations..."

  # 2. Create migration job
  kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-$NEW_VERSION
  namespace: $NAMESPACE
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: ghcr.io/alexarno/sparki/engine:$NEW_VERSION
        command: ["./migrate", "up"]
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
      restartPolicy: Never
EOF

  # 3. Wait for migration and capture its result immediately
  # (capturing $? after kubectl logs would test the wrong command)
  kubectl wait --for=condition=complete job/db-migration-$NEW_VERSION \
    -n $NAMESPACE --timeout=${TIMEOUT}s
  MIGRATION_EXIT_CODE=$?

  # 4. Check migration logs
  kubectl logs job/db-migration-$NEW_VERSION -n $NAMESPACE

  # 5. Verify migration success
  if [ $MIGRATION_EXIT_CODE -ne 0 ]; then
    echo "❌ Database migration failed!"
    exit 1
  fi
fi
```
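Parsing the label out of `docker inspect` output can be made more precise than a bare grep over the whole inspect blob. A dependency-free sketch (illustrative; `jq` would be cleaner where available):

```shell
# Given the JSON labels string that
#   docker inspect IMAGE --format '{{json .Config.Labels}}'
# emits, report whether MIGRATION_REQUIRED is set to "true".
migration_required() {
  printf '%s' "$1" | grep -qio '"MIGRATION_REQUIRED" *: *"true"'
}

# Usage sketch:
#   LABELS=$(docker inspect "$IMAGE" --format '{{json .Config.Labels}}')
#   migration_required "$LABELS" && echo "migration needed"
```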
## 3. Validation Phase (Parallel Testing)

### Performance Baseline

```bash
# 1. Generate load on green
# Note: kubectl run has no --detach flag; without --attach it returns
# immediately and the pod keeps running in the cluster
kubectl run load-test \
  --image=ghcr.io/alexarno/sparki/loadtest:latest \
  -n sparki-test \
  --env="TARGET_URL=http://$GREEN_SERVICE:8080" \
  --env="DURATION=300" \
  --env="RPS=100" \
  --restart=Never

# 2. Monitor metrics
# Open Grafana dashboard: Engine Performance (Green)
# Watch:
#   - Request rate (should be ~100 RPS)
#   - Error rate (should be 0%)
#   - P99 latency (should be < 5s)
#   - CPU usage (should be < 70%)
#   - Memory usage (should be < 80%)
sleep 60  # Let metrics stabilize

# 3. Wait for the load test to finish (requires kubectl >= 1.23 for
# --for=jsonpath; otherwise poll the pod phase in a loop)
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/load-test \
  -n sparki-test --timeout=400s

# 4. Verify stability
echo "Waiting for stability after load test..."
sleep 30
```
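When judging the load-test results, note that bash `[ -gt ]` only compares integers, so a percentage threshold like 0.5% cannot be tested directly. A small awk-based sketch (illustrative helper, not in the original script):

```shell
# Return success (exit 0) when errors/total exceeds the threshold percent.
# awk handles the floating-point arithmetic that bash test cannot.
error_rate_exceeds() {
  # $1 = error count, $2 = total requests, $3 = threshold percent
  awk -v e="$1" -v t="$2" -v th="$3" \
    'BEGIN { exit !(t > 0 && e * 100 / t > th) }'
}

# Usage sketch: 120 errors out of 30000 requests is 0.4%, under a 0.5% bar:
#   error_rate_exceeds 120 30000 0.5 && echo "over budget, abort"
```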
### Canary Traffic (Optional 5% Step)

```bash
# Skip this step if load tests passed
# Use only if concerned about new version stability

# 1. Create canary service
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: sparki-engine-canary
  namespace: $NAMESPACE
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    targetPort: 8080
  selector:
    app: sparki-engine
    color: green
EOF

# 2. Route 5% of traffic to green via Ingress
# Update the Ingress configuration:
#   - Blue (main): 95%
#   - Green (canary): 5%

# 3. Monitor for 10 minutes
# Note: this counts ERROR lines in recent logs, a rough proxy for error rate
for i in {1..20}; do
  echo "Canary check $i/20..."
  ERROR_COUNT=$(kubectl logs deployment/sparki-engine-green -n $NAMESPACE \
    --tail=100 | grep -c ERROR || true)
  if [ "${ERROR_COUNT:-0}" -gt 5 ]; then
    echo "❌ High error rate detected in canary traffic!"
    exit 1
  fi
  sleep 30
done
echo "✅ Canary traffic passed!"
```
## 4. Traffic Switch (Atomic)

### Prepare Switch

```bash
# 1. Final health check on both colors
echo "Blue deployment status:"
kubectl get deployment sparki-engine-blue -n $NAMESPACE
echo "Green deployment status:"
kubectl get deployment sparki-engine-green -n $NAMESPACE

# 2. Final verification
# Note: these service names resolve only inside the cluster; run the curls
# from a debug pod, or expose the services locally with kubectl port-forward
echo "Blue health:"
curl http://sparki-engine-blue:8080/health
echo "Green health:"
curl http://sparki-engine-green:8080/health

# 3. Create backup before switch
kubectl get service sparki-engine-lb -n $NAMESPACE \
  -o yaml > service-backup-before-switch.yaml
```
### Execute Switch

```bash
# 1. Update the service selector to route to green
echo "Switching traffic from $ACTIVE_COLOR to $STANDBY_COLOR..."
kubectl patch service sparki-engine-lb \
  -n $NAMESPACE \
  -p '{"spec":{"selector":{"color":"'$STANDBY_COLOR'"}}}'
echo "✅ Traffic switch complete!"
SWITCH_TIME=$(date +%s)

# 2. Verify the service was updated
kubectl get svc sparki-engine-lb -n $NAMESPACE \
  -o jsonpath='{.spec.selector.color}'
```
### Immediate Post-Switch Monitoring (5 minutes)

```bash
# 1. Monitor error rate (CRITICAL)
for i in {1..10}; do
  echo "Error check $i/10..."
  # Check recent logs for errors
  ERROR_COUNT=$(kubectl logs deployment/sparki-engine-green -n $NAMESPACE \
    --tail=50 | grep -c "ERROR" || true)
  if [ "${ERROR_COUNT:-0}" -gt 10 ]; then
    echo "⚠️ High error rate detected!"
    echo "INITIATING ROLLBACK..."
    kubectl patch service sparki-engine-lb \
      -n $NAMESPACE \
      -p '{"spec":{"selector":{"color":"'$ACTIVE_COLOR'"}}}'
    echo "✅ Rolled back to $ACTIVE_COLOR"
    exit 1
  fi
  sleep 30
done
echo "✅ First 5 minutes stable!"
```
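The loop above hardcodes its check, iteration count, and interval. Factored into a reusable watcher, the same pattern serves both this 5-minute window and ad hoc checks, and can be dry-run quickly with a zero interval. An illustrative sketch; `watch_threshold` and the example command line are names introduced here:

```shell
# Run CHECK_CMD (any command printing an integer) `iterations` times,
# `interval` seconds apart; abort as soon as the value exceeds `limit`.
watch_threshold() {
  local iterations="$1" limit="$2" interval="$3"; shift 3
  local i value
  for i in $(seq 1 "$iterations"); do
    value=$("$@")
    if [ "${value:-0}" -gt "$limit" ]; then
      echo "threshold exceeded on check $i: $value > $limit"
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Production use would be roughly:
#   watch_threshold 10 10 30 sh -c \
#     'kubectl logs deployment/sparki-engine-green -n sparki-engine --tail=50 | grep -c ERROR'
```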
## 5. Stability Validation (2 hours)

### Extended Monitoring

```bash
# 1. Create monitoring script
cat > /tmp/post-switch-monitoring.sh <<'MONITOR'
#!/bin/bash
NAMESPACE="sparki-engine"
MONITORING_DURATION=7200   # 2 hours
CHECK_INTERVAL=60          # Every 1 minute
START_TIME=$(date +%s)
# bash integer tests cannot compare floats, so the threshold is an ERROR-line
# count per minute (a log-based proxy for the 0.5% error-rate target)
ALERT_THRESHOLD_ERRORS=5
ALERT_THRESHOLD_LATENCY=10000  # 10 seconds (P99)

while true; do
  CURRENT_TIME=$(date +%s)
  ELAPSED=$((CURRENT_TIME - START_TIME))
  MINUTES=$((ELAPSED / 60))

  # 1. Check error count in the last minute of logs
  ERROR_COUNT=$(kubectl logs deployment/sparki-engine-green -n $NAMESPACE \
    --since=1m --tail=1000 | grep -c "ERROR" || true)
  if [ "${ERROR_COUNT:-0}" -gt $ALERT_THRESHOLD_ERRORS ]; then
    echo "⚠️ [$MINUTES min] Error count elevated: $ERROR_COUNT errors/min"
  fi

  # 2. Check resource usage
  # (kubectl top supports pods and nodes, not deployments, and no -o jsonpath)
  CPU=$(kubectl top pods -n $NAMESPACE -l color=green --no-headers 2>/dev/null \
    | awk '{print $2}' | head -1)

  # 3. Check database connections (no -t: there is no TTY in a script)
  DB_CONNS=$(kubectl exec -n $NAMESPACE \
    $(kubectl get pod -n $NAMESPACE -l color=green -o name | head -1) \
    -- curl -s http://localhost:8080/metrics | grep "db_connections" || true)

  # 4. Log status
  echo "[$(date '+%H:%M:%S')] $MINUTES min - Errors/min: $ERROR_COUNT | CPU: $CPU | DB Conn: $DB_CONNS"

  if [ $ELAPSED -ge $MONITORING_DURATION ]; then
    echo "✅ 2-hour stability window passed!"
    break
  fi
  sleep $CHECK_INTERVAL
done
MONITOR
chmod +x /tmp/post-switch-monitoring.sh
/tmp/post-switch-monitoring.sh &
MONITOR_PID=$!

# 2. Monitor specific metrics
echo "Watching key metrics for 2 hours..."
echo "Check Grafana for:"
echo "  - Command Center → Error Rate (should be < 0.1%)"
echo "  - Reliability SLO → Burn Rate (should be < 1x)"
echo "  - Pipeline Execution → Queue Depth (should be stable)"
echo "  - Debugging → Trace Search (should have normal patterns)"

# 3. Alert configuration
echo "Alerts active:"
echo "  - api_error_rate: > 0.5%"
echo "  - api_p99_latency: > 10s"
echo "  - database_connection_pool_exhausted"
echo "  - redis_connection_errors"
```
### Validation Checkpoints

| Time | Check | Success Criteria |
|---|---|---|
| +5 min | Immediate stability | Error rate < 0.1%, no crash loops |
| +15 min | Performance baseline | P99 latency < 5s, CPU < 70% |
| +1 hour | Extended stability | All metrics stable, no alerting |
| +2 hours | Production readiness | Full SLO compliance achieved |
## 6. Old Environment Cleanup

### After the 2-Hour Validation Window

```bash
# 1. Verify the new (green) deployment is stable
# Integer count of ERROR lines over the last hour (bash [ -gt ] cannot
# compare floats, so the threshold is a count, not a percentage)
CURRENT_ERROR_COUNT=$(kubectl logs deployment/sparki-engine-green \
  -n $NAMESPACE --since=1h --tail=5000 | grep -c "ERROR" || true)
if [ "${CURRENT_ERROR_COUNT:-0}" -gt 1 ]; then
  echo "❌ Error rate elevated! Not cleaning up old environment yet."
  exit 1
fi

# 2. Scale down the old environment
# (use $ACTIVE_COLOR from the deploy session; the old color may not be blue)
echo "Scaling down old $ACTIVE_COLOR environment..."
kubectl scale deployment sparki-engine-$ACTIVE_COLOR \
  -n $NAMESPACE \
  --replicas=0

# 3. Keep the old deployment restorable for 24 hours
# (it is scaled to zero, not deleted, so restoring it is a scale-up,
# not a rollout undo)
echo "Old deployment can be restored for 24 hours:"
echo "  kubectl scale deployment sparki-engine-$ACTIVE_COLOR -n $NAMESPACE --replicas=3"

# 4. Annotate for cleanup
kubectl annotate deployment sparki-engine-$ACTIVE_COLOR \
  -n $NAMESPACE \
  cleanup-after=$(date -u -d "+24 hours" +%Y-%m-%dT%H:%M:%SZ) \
  --overwrite
```
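The `date -u -d "+24 hours"` invocation above is GNU-specific; on BSD/macOS the equivalent is `date -u -v+24H`. A portable sketch for computing the cleanup timestamp (illustrative helper introduced here):

```shell
# Emit a UTC RFC 3339 timestamp N hours from now, trying GNU date syntax
# first and falling back to the BSD/macOS -v form.
cleanup_after_utc() {
  date -u -d "+${1} hours" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
    || date -u -v+"${1}"H +%Y-%m-%dT%H:%M:%SZ
}

# Usage sketch:
#   kubectl annotate deployment sparki-engine-$ACTIVE_COLOR \
#     cleanup-after=$(cleanup_after_utc 24) --overwrite
```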
## 7. Rollback Procedures

### Immediate Rollback (Within Minutes)

```bash
#!/bin/bash
# Use if issues are detected within 5 minutes of the switch
NAMESPACE="sparki-engine"
# Compute the previous color instead of hardcoding blue: after the switch the
# selector points at the new (bad) color, so flip it back
CURRENT_COLOR=$(kubectl get service sparki-engine-lb -n $NAMESPACE \
  -o jsonpath='{.spec.selector.color}')
PREVIOUS_COLOR=$([ "$CURRENT_COLOR" = "blue" ] && echo "green" || echo "blue")
echo "⚠️ INITIATING IMMEDIATE ROLLBACK..."
echo "Rolling back from $CURRENT_COLOR to $PREVIOUS_COLOR..."

# 1. Switch traffic back instantly
kubectl patch service sparki-engine-lb \
  -n $NAMESPACE \
  -p '{"spec":{"selector":{"color":"'$PREVIOUS_COLOR'"}}}'
echo "✅ Traffic switched back to $PREVIOUS_COLOR"

# 2. Monitor the previous deployment
kubectl logs deployment/sparki-engine-$PREVIOUS_COLOR -n $NAMESPACE --tail=50

# 3. Verify rollback succeeded
sleep 30
# Service DNS resolves only inside the cluster; use a debug pod or
# kubectl port-forward for this check
curl http://sparki-engine-lb:8080/health
echo "✅ Rollback complete. Assess issues and prepare fix."
```
### Delayed Rollback (After Hours)

```bash
#!/bin/bash
# Use if issues emerge after the monitoring period
NAMESPACE="sparki-engine"
echo "⚠️ INITIATING DELAYED ROLLBACK..."

# 1. Scale down green (the bad deployment)
kubectl scale deployment sparki-engine-green -n $NAMESPACE --replicas=0

# 2. Scale up blue (the rollback target)
kubectl scale deployment sparki-engine-blue -n $NAMESPACE --replicas=3

# 3. Wait for blue to be ready
kubectl rollout status deployment/sparki-engine-blue \
  -n $NAMESPACE --timeout=600s

# 4. Update the service selector
kubectl patch service sparki-engine-lb \
  -n $NAMESPACE \
  -p '{"spec":{"selector":{"color":"blue"}}}'
echo "✅ Rolled back to blue deployment"

# 5. Assess issues
echo "Check the following:"
echo "  - Application error logs"
echo "  - Jaeger traces for failed requests"
echo "  - Metrics for anomalies"
echo "  - Database migration status (if applicable)"
```
## 8. Post-Rollback Actions

### Analysis

```bash
# 1. Collect logs
kubectl logs deployment/sparki-engine-green -n $NAMESPACE --all-containers=true \
  > green-deployment-logs.txt

# 2. Export traces
# Query Jaeger for errors during the green deployment
# Export to an analysis tool

# 3. Export metrics
# Export Prometheus metrics for the time window of the green deployment

# 4. Create incident report
cat > incident-report.md <<EOF
# Incident Report: Failed Deployment

## Summary
Deployment of version $NEW_VERSION rolled back after $ROLLBACK_TIME minutes.

## Root Cause
[Investigation findings]

## Timeline
- [Start]: Version deployment began
- [Issue]: First error detected
- [Action]: Rollback initiated
- [Recovery]: Service restored

## Impact
- Duration: $DURATION minutes
- Users affected: [estimate]
- SLO impact: [SLO calculation]

## Resolution
[What fixed the issue]

## Prevention
[What to do next time]

## Action Items
- [ ] Fix identified issue
- [ ] Add monitoring for this scenario
- [ ] Update documentation
- [ ] Conduct post-incident review
EOF
```
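The `$DURATION` field in the report can be computed rather than estimated: the switch step records `SWITCH_TIME` as an epoch timestamp, so the outage length is a simple difference. An illustrative helper:

```shell
# Minutes between two epoch-second timestamps (start, end), truncated.
duration_minutes() {
  echo $(( ($2 - $1) / 60 ))
}

# Usage sketch, with SWITCH_TIME captured at the traffic switch:
#   DURATION=$(duration_minutes "$SWITCH_TIME" "$(date +%s)")
```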
## 9. Success Criteria Checklist

- [ ] Green deployment created and running
- [ ] Green smoke tests passing
- [ ] Database migrations completed (if applicable)
- [ ] Canary traffic passed (if applicable)
- [ ] Load tests showed acceptable performance
- [ ] Traffic switched successfully
- [ ] Error rate < 0.1% post-switch
- [ ] No crash loops post-switch
- [ ] P99 latency acceptable post-switch
- [ ] 5-minute stability window passed
- [ ] 2-hour stability window passed
- [ ] All SLOs being met
- [ ] All dashboards showing normal metrics
- [ ] Old deployment scaled down
- [ ] Backup retention configured
## Quick Reference Commands

```bash
# View current status
kubectl get deployments -n sparki-engine
kubectl get svc sparki-engine-lb -n sparki-engine

# View active color
kubectl get svc sparki-engine-lb -n sparki-engine \
  -o jsonpath='{.spec.selector.color}'

# Check green logs
kubectl logs deployment/sparki-engine-green -n sparki-engine -f

# Switch to green (for a normal deploy)
kubectl patch service sparki-engine-lb -n sparki-engine \
  -p '{"spec":{"selector":{"color":"green"}}}'

# Switch to blue (for emergency rollback)
kubectl patch service sparki-engine-lb -n sparki-engine \
  -p '{"spec":{"selector":{"color":"blue"}}}'

# Delete old deployment
kubectl delete deployment sparki-engine-blue -n sparki-engine
```
---

**Runbook Version:** 1.0
**Last Updated:** December 2025
**Approved By:** Platform Engineering
**Next Review:** March 2026