Runbook: Blue-Green Deployment Strategy

Overview

Blue-Green deployment enables zero-downtime releases by maintaining two identical production environments:
  • Blue: the currently active environment (receives production traffic)
  • Green: the standby environment (stages the new version)
Traffic switches atomically after validation, allowing instant rollback. The color labels swap roles after each successful deployment.
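In Kubernetes, the atomic switch described above is typically a one-line change to a Service's label selector. A minimal sketch (service and label names follow the manifests used later in this runbook):

```shell
# Minimal sketch of the blue-green toggle; the service name and the
# "color" label match the manifests later in this runbook.
other_color() {
  # Given one color, return the other; used to find the standby environment.
  [ "$1" = "blue" ] && echo "green" || echo "blue"
}

switch_traffic() {
  # Atomically repoint the load-balancer Service at the given color.
  local target="$1"
  kubectl patch service sparki-engine-lb -n sparki-engine \
    -p '{"spec":{"selector":{"color":"'"$target"'"}}}'
}
```

Because only the selector changes, rollback is the same patch with the colors reversed.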

1. Pre-Deployment Preparation

Environment Setup

# 1. Set Variables
CURRENT_VERSION=$(kubectl get deployment sparki-engine -n sparki-engine \
  -o jsonpath='{.spec.template.spec.containers[0].image}' | cut -d: -f2)
NEW_VERSION="v1.2.3"  # New version to deploy
NAMESPACE="sparki-engine"
TIMEOUT=600  # 10 minutes

# 2. Determine Blue/Green Status
ACTIVE_COLOR=$(kubectl get service sparki-engine-lb -n $NAMESPACE \
  -o jsonpath='{.spec.selector.color}')
STANDBY_COLOR=$([ "$ACTIVE_COLOR" = "blue" ] && echo "green" || echo "blue")

echo "Active color: $ACTIVE_COLOR"
echo "Standby color: $STANDBY_COLOR"
echo "Deploying to: $STANDBY_COLOR"

Pre-Deployment Checks

# 1. Verify Current System Health
./infrastructure/scripts/health-check.sh prod

# 2. Backup Current Configuration
kubectl get deployment sparki-engine -n $NAMESPACE \
  -o yaml > deployment-backup-$CURRENT_VERSION.yaml

# 3. Verify New Image Exists in the Registry
# (the deployment manifests pull from ghcr.io, so check that registry)
docker manifest inspect ghcr.io/alexarno/sparki/engine:$NEW_VERSION

# 4. Validate Helm Charts
helm template sparki ./helm-charts/engine \
  --values values-prod.yaml \
  --set image.tag=$NEW_VERSION > /tmp/manifest-validation.yaml

# 5. Check Database Compatibility
# Verify migration scripts included in new version
# Check MIGRATION_REQUIRED flag in release notes
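Step 5 above is manual; if your release process records the flag in version-controlled release notes, the check can be scripted. A sketch (the `release-notes/` path is an assumption about your repository layout):

```shell
# Sketch: detect the MIGRATION_REQUIRED flag in a release-notes file.
# The release-notes/ path is hypothetical; adjust to your repository.
check_migration_flag() {
  grep -q "MIGRATION_REQUIRED" "$1" 2>/dev/null
}

if check_migration_flag "release-notes/$NEW_VERSION.md"; then
  echo "⚠️ Version $NEW_VERSION requires a database migration (see section 2)"
fi
```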

2. Deploy to Standby (Green)

Scale Green Environment

# 1. Create Green Deployment (this runbook assumes the standby is green; swap colors throughout if $STANDBY_COLOR is blue)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sparki-engine-green
  namespace: $NAMESPACE
  labels:
    app: sparki-engine
    color: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sparki-engine
      color: green
  template:
    metadata:
      labels:
        app: sparki-engine
        color: green
    spec:
      containers:
      - name: engine
        image: ghcr.io/alexarno/sparki/engine:$NEW_VERSION
        ports:
        - containerPort: 8080
        env:
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: host
        - name: ENVIRONMENT
          value: production
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - sparki-engine
              topologyKey: kubernetes.io/hostname
EOF

# 2. Wait for Green Deployment
kubectl rollout status deployment/sparki-engine-green \
  -n $NAMESPACE --timeout=${TIMEOUT}s

# 3. Verify Green Pods Ready
READY_PODS=$(kubectl get deployment sparki-engine-green -n $NAMESPACE \
  -o jsonpath='{.status.readyReplicas}')
echo "Green pods ready: $READY_PODS / 3"
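The echo above reports readiness but does not gate on it; a small guard (sketch) aborts instead of proceeding with missing pods. The empty-string default covers the case where `readyReplicas` is absent from the deployment status:

```shell
# Sketch: fail fast when fewer pods are ready than desired.
require_ready() {
  local ready="${1:-0}" desired="$2"
  if [ "$ready" -lt "$desired" ]; then
    echo "❌ Only $ready/$desired green pods ready; aborting before smoke tests."
    return 1
  fi
  echo "✅ $ready/$desired green pods ready."
}

# require_ready "$READY_PODS" 3 || exit 1   # gate the runbook here
```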

Smoke Tests on Green

# 1. Get Green Service Endpoint (create the Service first if it does not exist yet)
kubectl expose deployment sparki-engine-green -n $NAMESPACE \
  --name=sparki-engine-green --type=LoadBalancer --port=8080 --target-port=8080

GREEN_SERVICE=$(kubectl get svc sparki-engine-green -n $NAMESPACE \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# 2. Run Health Checks
echo "Testing health endpoint..."
curl -fv http://$GREEN_SERVICE:8080/health

# 3. Run API Tests
echo "Testing API endpoints..."
curl -X GET http://$GREEN_SERVICE:8080/api/projects \
  -H "Authorization: Bearer $TEST_TOKEN"

# 4. Run Integration Tests (attached, so the pod's exit code propagates to the shell)
echo "Running integration tests..."
kubectl run green-tests \
  --image=ghcr.io/alexarno/sparki/e2e:latest \
  -n sparki-test \
  --env="APP_URL=http://$GREEN_SERVICE:8080" \
  --attach --rm --restart=Never \
  --command -- bash -c "npm test -- --testNamePattern='smoke'"

# 5. Verify Test Results
TEST_EXIT_CODE=$?
if [ $TEST_EXIT_CODE -ne 0 ]; then
  echo "❌ Green tests failed!"
  # See Rollback section below
  exit 1
fi
echo "✅ Green tests passed!"

Database Migration (if needed)

# 1. Check if Migration Required
# (pull first: docker inspect only reads local images; this assumes the
#  flag is recorded as an image label)
docker pull ghcr.io/alexarno/sparki/engine:$NEW_VERSION
REQUIRES_MIGRATION=$(docker inspect \
  --format '{{ index .Config.Labels "REQUIRES_MIGRATION" }}' \
  ghcr.io/alexarno/sparki/engine:$NEW_VERSION)

if [ -n "$REQUIRES_MIGRATION" ]; then
  echo "Running database migrations..."

  # 2. Create Migration Job
  kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-$NEW_VERSION
  namespace: $NAMESPACE
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: ghcr.io/alexarno/sparki/engine:$NEW_VERSION
        command: ["./migrate", "up"]
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
      restartPolicy: Never
EOF

  # 3. Wait for Migration (kubectl wait's own exit status reflects success or timeout)
  if ! kubectl wait --for=condition=complete job/db-migration-$NEW_VERSION \
      -n $NAMESPACE --timeout=${TIMEOUT}s; then
    echo "❌ Database migration failed or timed out!"
    kubectl logs job/db-migration-$NEW_VERSION -n $NAMESPACE
    exit 1
  fi

  # 4. Check Migration Logs
  kubectl logs job/db-migration-$NEW_VERSION -n $NAMESPACE
fi
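Kubernetes object names must be lowercase DNS subdomains, so an image tag containing uppercase letters or underscores (e.g. a hypothetical `V1.2.3_rc1`) would make the `db-migration-$NEW_VERSION` apply fail. A small sanitizer sketch:

```shell
# Sketch: normalize an arbitrary image tag into a valid Job name
# (lowercase letters, underscores replaced with hyphens).
job_name_for() {
  echo "db-migration-$1" | tr 'A-Z_' 'a-z-'
}
```

Re-running the runbook with the same version also requires deleting the previous Job first (`kubectl delete job <name> -n $NAMESPACE --ignore-not-found`), since completed Jobs cannot be re-applied.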

3. Validation Phase (Parallel Testing)

Performance Baseline

# 1. Generate Load on Green (attached in the background so we can wait on completion)
kubectl run load-test \
  --image=ghcr.io/alexarno/sparki/loadtest:latest \
  -n sparki-test \
  --env="TARGET_URL=http://$GREEN_SERVICE:8080" \
  --env="DURATION=300" \
  --env="RPS=100" \
  --attach --rm --restart=Never &
LOAD_TEST_PID=$!

# 2. Monitor Metrics
# Open Grafana dashboard: Engine Performance (Green)
# Watch:
# - Request rate (should be ~100 RPS)
# - Error rate (should be 0%)
# - P99 latency (should be < 5s)
# - CPU usage (should be < 70%)
# - Memory usage (should be < 80%)

sleep 60  # Let metrics stabilize

# 3. Wait for the Load Test to Finish
wait $LOAD_TEST_PID

# 4. Verify Stability
echo "Waiting for stability after load test..."
sleep 30
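The dashboard thresholds above are fractional (0.1% error rate, 5 s latency), and shell `test` only compares integers. A small awk-based sketch for float comparisons; `P99_LATENCY_S` is a hypothetical variable you would populate from your metrics query:

```shell
# Sketch: floating-point threshold check via awk, since [ -gt ] is integer-only.
exceeds() {
  # exceeds VALUE THRESHOLD -> exit 0 when VALUE > THRESHOLD
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

# Hypothetical usage with a P99 latency value exported from your metrics store:
if exceeds "${P99_LATENCY_S:-0}" 5; then
  echo "⚠️ P99 latency above the 5s baseline"
fi
```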

Canary Traffic (Optional 5% Step)

# Skip this step if load tests passed
# Use only if concerned about new version stability

# 1. Create Canary Service
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: sparki-engine-canary
  namespace: $NAMESPACE
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    targetPort: 8080
  selector:
    app: sparki-engine
    color: green
EOF

# 2. Route 5% Traffic to Green via Ingress
# Update Ingress configuration:
# - Blue (main): 95%
# - Green (canary): 5%

# 3. Monitor for 10 minutes
for i in {1..20}; do
  echo "Canary Check $i/20..."
  ERROR_COUNT=$(kubectl logs deployment/sparki-engine-green -n $NAMESPACE \
    --tail=100 | grep -c ERROR || true)

  if [ "$ERROR_COUNT" -gt 5 ]; then
    echo "❌ High error count detected in canary traffic!"
    exit 1
  fi
  sleep 30
done

echo "✅ Canary traffic passed!"
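The loop above alerts on a raw error count, which drifts with traffic volume; a percentage of sampled log lines is a steadier signal. A sketch (assumes errors and totals are counted over the same log window):

```shell
# Sketch: convert an error count into a percentage of sampled log lines.
error_pct() {
  local errors="$1" total="$2"
  [ "$total" -gt 0 ] || { echo 0; return; }
  awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.1f", e * 100 / t }'
}

# Hypothetical usage against the same 100-line sample as the loop above:
# TOTAL=$(kubectl logs deployment/sparki-engine-green -n $NAMESPACE --tail=100 | wc -l)
# error_pct "$ERROR_COUNT" "$TOTAL"
```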

4. Traffic Switch (Atomic)

Prepare Switch

# 1. Final Health Check on Both Colors
echo "Blue deployment status:"
kubectl get deployment sparki-engine-blue -n $NAMESPACE

echo "Green deployment status:"
kubectl get deployment sparki-engine-green -n $NAMESPACE

# 2. Final Verification (service DNS names resolve in-cluster only;
#    run these from a pod or via kubectl port-forward)
echo "Blue health:"
curl http://sparki-engine-blue:8080/health
echo "Green health:"
curl http://sparki-engine-green:8080/health

# 3. Create Backup Before Switch
kubectl get service sparki-engine-lb -n $NAMESPACE \
  -o yaml > service-backup-before-switch.yaml

Execute Switch

# 1. Update Service to Route to Green
echo "Switching traffic from $ACTIVE_COLOR to $STANDBY_COLOR..."

kubectl patch service sparki-engine-lb \
  -n $NAMESPACE \
  -p '{"spec":{"selector":{"color":"'$STANDBY_COLOR'"}}}'

echo "✅ Traffic switch complete!"
SWITCH_TIME=$(date +%s)

# 2. Verify Service Updated
kubectl get svc sparki-engine-lb -n $NAMESPACE \
  -o jsonpath='{.spec.selector.color}'
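Printing the selector leaves verification to the operator; a sketch that asserts it instead (a pure string check, fed from the same jsonpath query):

```shell
# Sketch: assert the Service selector points at the expected color.
assert_color() {
  # assert_color EXPECTED ACTUAL -> non-zero (with a message) on mismatch
  if [ "$2" != "$1" ]; then
    echo "❌ Service selector is '$2', expected '$1'"
    return 1
  fi
  echo "✅ Service selector confirmed: $1"
}

# Usage against the live Service (same query as above):
# assert_color "$STANDBY_COLOR" "$(kubectl get svc sparki-engine-lb \
#   -n $NAMESPACE -o jsonpath='{.spec.selector.color}')"
```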

Immediate Post-Switch Monitoring (5 minutes)

# 1. Monitor Error Rate (CRITICAL)
for i in {1..10}; do
  echo "Error check $i/10..."

  # Check logs for errors
  ERROR_COUNT=$(kubectl logs deployment/sparki-engine-green -n $NAMESPACE \
    --tail=50 | grep -c "ERROR" || true)

  if [ $ERROR_COUNT -gt 10 ]; then
    echo "⚠️ High error rate detected!"
    echo "INITIATING ROLLBACK..."
    kubectl patch service sparki-engine-lb \
      -n $NAMESPACE \
      -p '{"spec":{"selector":{"color":"'$ACTIVE_COLOR'"}}}'
    echo "✅ Rolled back to $ACTIVE_COLOR"
    exit 1
  fi

  sleep 30
done

echo "✅ First 5 minutes stable!"

5. Stability Validation (2 hours)

Extended Monitoring

# 1. Create Monitoring Dashboard
cat > /tmp/post-switch-monitoring.sh <<'MONITOR'
#!/bin/bash
NAMESPACE="sparki-engine"
MONITORING_DURATION=7200  # 2 hours
CHECK_INTERVAL=60  # Every 1 minute

START_TIME=$(date +%s)
ALERT_THRESHOLD_ERRORS=5        # errors per minute (grep -c yields an integer count)
ALERT_THRESHOLD_LATENCY=10000   # 10 seconds (P99), in milliseconds

while true; do
  CURRENT_TIME=$(date +%s)
  ELAPSED=$((CURRENT_TIME - START_TIME))
  MINUTES=$((ELAPSED / 60))

  # 1. Check Error Count
  ERROR_COUNT=$(kubectl logs deployment/sparki-engine-green -n $NAMESPACE \
    --since=1m --tail=1000 | grep -c "ERROR" || true)

  if [ "$ERROR_COUNT" -gt "$ALERT_THRESHOLD_ERRORS" ]; then
    echo "⚠️ [$MINUTES min] Error count elevated: $ERROR_COUNT/min"
  fi

  # 2. Check Resource Usage (kubectl top supports pods and nodes, not deployments)
  CPU=$(kubectl top pods -n $NAMESPACE -l color=green \
    --no-headers 2>/dev/null | awk '{print $2}' | head -1)

  # 3. Check Database Connections (no -it: this script runs non-interactively)
  DB_CONNS=$(kubectl exec -n $NAMESPACE \
    $(kubectl get pod -n $NAMESPACE -l color=green -o name | head -1) \
    -- curl -s http://localhost:8080/metrics | grep "db_connections" || true)

  # 4. Log Status
  echo "[$(date '+%H:%M:%S')] $MINUTES min - Errors: ${ERROR_COUNT}/min | CPU: $CPU | DB Conn: $DB_CONNS"

  if [ $ELAPSED -ge $MONITORING_DURATION ]; then
    echo "✅ 2-hour stability window passed!"
    break
  fi

  sleep $CHECK_INTERVAL
done
MONITOR

chmod +x /tmp/post-switch-monitoring.sh
/tmp/post-switch-monitoring.sh &
MONITOR_PID=$!

# 2. Monitor Specific Metrics
echo "Watching key metrics for 2 hours..."
echo "Check Grafana for:"
echo "  - Command Center → Error Rate (should be < 0.1%)"
echo "  - Reliability SLO → Burn Rate (should be < 1x)"
echo "  - Pipeline Execution → Queue Depth (should be stable)"
echo "  - Debugging → Trace Search (should have normal patterns)"

# 3. Alert Configuration
echo "Alerts active:"
echo "  - api_error_rate: > 0.5%"
echo "  - api_p99_latency: > 10s"
echo "  - database_connection_pool_exhausted"
echo "  - redis_connection_errors"

Validation Checkpoints

Time      Check                  Success Criteria
+5 min    Immediate stability    Error rate < 0.1%, no crash loops
+15 min   Performance baseline   P99 latency < 5s, CPU < 70%
+1 hour   Extended stability     All metrics stable, no alerting
+2 hours  Production readiness   Full SLO compliance achieved

6. Old Environment Cleanup

After 2-hour Validation

# 1. Verify New (Green) is Stable (integer count; test(1) cannot compare floats)
CURRENT_ERROR_COUNT=$(kubectl logs deployment/sparki-engine-green \
  -n $NAMESPACE --since=1h --tail=5000 | grep -c "ERROR" || true)

if [ "$CURRENT_ERROR_COUNT" -gt 1 ]; then
  echo "❌ Error count elevated! Not cleaning up old environment yet."
  exit 1
fi

# 2. Scale Down Old Environment
echo "Scaling down old $ACTIVE_COLOR environment..."
kubectl scale deployment sparki-engine-$ACTIVE_COLOR \
  -n $NAMESPACE \
  --replicas=0

# 3. Keep Backup for 24 Hours
echo "Old deployment is kept (scaled to zero) and can be restored for 24 hours:"
echo "  kubectl scale deployment sparki-engine-blue -n $NAMESPACE --replicas=3"

# 4. Label for Cleanup
kubectl annotate deployment sparki-engine-blue \
  -n $NAMESPACE \
  cleanup-after=$(date -u -d "+24 hours" +%Y-%m-%dT%H:%M:%SZ) \
  --overwrite
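Note that `date -d "+24 hours"` is GNU-specific; on macOS/BSD the equivalent is `-v+24H`. A portable sketch that tries both:

```shell
# Sketch: compute the 24-hour cleanup timestamp with either GNU or BSD date.
cleanup_after() {
  date -u -d "+24 hours" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
    || date -u -v+24H +%Y-%m-%dT%H:%M:%SZ
}
```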

7. Rollback Procedures

Immediate Rollback (Within Minutes)

#!/bin/bash
# Use if issues detected within 5 minutes of switch

NAMESPACE="sparki-engine"
ACTIVE_COLOR=$(kubectl get service sparki-engine-lb -n $NAMESPACE \
  -o jsonpath='{.spec.selector.color}')

echo "⚠️ INITIATING IMMEDIATE ROLLBACK..."
echo "Rolling back from $ACTIVE_COLOR..."

# 1. Switch Traffic Back Instantly
kubectl patch service sparki-engine-lb \
  -n $NAMESPACE \
  -p '{"spec":{"selector":{"color":"blue"}}}'

echo "✅ Traffic switched back to blue"

# 2. Monitor Previous Deployment
kubectl logs deployment/sparki-engine-blue -n $NAMESPACE --tail=50

# 3. Verify Rollback Successful (service DNS resolves in-cluster only;
#    run from a pod or via kubectl port-forward)
sleep 30
curl http://sparki-engine-lb:8080/health

echo "✅ Rollback complete. Assess issues and prepare fix."

Delayed Rollback (After Hours)

#!/bin/bash
# Use if issues emerge after monitoring period

NAMESPACE="sparki-engine"

echo "⚠️ INITIATING DELAYED ROLLBACK..."

# 1. Scale Down Green (Current Bad Deployment)
kubectl scale deployment sparki-engine-green -n $NAMESPACE --replicas=0

# 2. Scale Up Blue (Rollback Deployment)
kubectl scale deployment sparki-engine-blue -n $NAMESPACE --replicas=3

# 3. Wait for Blue to be Ready
kubectl rollout status deployment/sparki-engine-blue \
  -n $NAMESPACE --timeout=600s

# 4. Update Service
kubectl patch service sparki-engine-lb \
  -n $NAMESPACE \
  -p '{"spec":{"selector":{"color":"blue"}}}'

echo "✅ Rolled back to blue deployment"

# 5. Assess Issues
echo "Check the following:"
echo "  - Application error logs"
echo "  - Jaeger traces for failed requests"
echo "  - Metrics for anomalies"
echo "  - Database migration status (if applicable)"

8. Post-Rollback Actions

Analysis

# 1. Collect Logs
kubectl logs deployment/sparki-engine-green -n $NAMESPACE --all-containers=true \
  > green-deployment-logs.txt

# 2. Export Traces
# Query Jaeger for errors during green deployment
# Export to analysis tool

# 3. Export Metrics
# Export Prometheus metrics for time window of green deployment

# 4. Create Incident Report
cat > incident-report.md <<EOF
# Incident Report: Failed Deployment

## Summary
Deployment of version $NEW_VERSION rolled back after $ROLLBACK_TIME minutes.

## Root Cause
[Investigation findings]

## Timeline
- [Start]: Version deployment began
- [Issue]: First error detected
- [Action]: Rollback initiated
- [Recovery]: Service restored

## Impact
- Duration: $DURATION minutes
- Users affected: [estimate]
- SLO impact: [SLO calculation]

## Resolution
[What fixed the issue]

## Prevention
[What to do next time]

## Action Items
- [ ] Fix identified issue
- [ ] Add monitoring for this scenario
- [ ] Update documentation
- [ ] Conduct post-incident review
EOF

9. Success Criteria Checklist

  • Green deployment created and running
  • Green smoke tests passing
  • Database migrations completed (if applicable)
  • Canary traffic passed (if applicable)
  • Load tests showed acceptable performance
  • Traffic switched successfully
  • Error rate < 0.1% post-switch
  • No crash loops post-switch
  • P99 latency acceptable post-switch
  • 5-minute stability window passed
  • 2-hour stability window passed
  • All SLOs being met
  • All dashboards showing normal metrics
  • Old deployment scaled down
  • Backup retention configured

Quick Reference Commands

# View Current Status
kubectl get deployments -n sparki-engine
kubectl get svc sparki-engine-lb -n sparki-engine

# View Active Color
kubectl get svc sparki-engine-lb -n sparki-engine \
  -o jsonpath='{.spec.selector.color}'

# Check Green Logs
kubectl logs deployment/sparki-engine-green -n sparki-engine -f

# Switch to Green (for normal deploy)
kubectl patch service sparki-engine-lb -n sparki-engine \
  -p '{"spec":{"selector":{"color":"green"}}}'

# Switch to Blue (for emergency rollback)
kubectl patch service sparki-engine-lb -n sparki-engine \
  -p '{"spec":{"selector":{"color":"blue"}}}'

# Delete Old Deployment
kubectl delete deployment sparki-engine-blue -n sparki-engine

Runbook Version: 1.0
Last Updated: December 2025
Approved By: Platform Engineering
Next Review: March 2026