Sparki Infrastructure - Complete Deployment Runbook

Overview

This runbook covers end-to-end deployment of the Sparki observability infrastructure across the staging and production environments. It includes all components from Block 13: Observability Implementation.

Covered Components:
  • ExternalSecrets ClusterSecretStore (AWS Secrets Manager integration)
  • Prometheus + Alertmanager (metrics, alerting)
  • Grafana (dashboards)
  • Elasticsearch + Kibana + Fluentd (centralized logging)
  • Elasticsearch Exporter (metrics scraping)
  • PagerDuty integration (critical alerts)
  • Istio observability (service mesh dashboards)
Target Environments:
  • Staging: Pre-production testing, non-critical workloads
  • Production: Mission-critical services
Estimated Deployment Time:
  • Fresh cluster: 45-60 minutes
  • Staged deployment: 30 minutes per environment

Prerequisites

Before Starting

  1. Cluster Access
    • kubectl authenticated to staging and production clusters
    • kubeconfig configured with correct contexts
    • Admin role in both clusters
  2. AWS Setup
    # Verify AWS CLI configured
    aws sts get-caller-identity
    
    # Create secrets in AWS Secrets Manager (if not already done)
    # See: "AWS Secrets Setup" section below
    
  3. Infrastructure Readiness
    • Kubernetes 1.24+
    • 4+ CPU and 8GB+ RAM available
    • ExternalSecrets Operator deployed in external-secrets namespace
    • Istio 1.15+ deployed
    • ECK (Elastic Cloud on Kubernetes) deployed or provisioned
  4. Required Tools
    • kubectl
    • helm
    • jq (for JSON parsing)
    • aws-cli

AWS Secrets Manager Setup

Create secrets in AWS (one-time setup, shared across environments):
#!/bin/bash
REGION="us-west-2"

# Staging secrets
aws secretsmanager create-secret \
  --name /sparki/staging/pagerduty/routing-key \
  --secret-string "<your-pagerduty-integration-key>" \
  --region $REGION \
  --tags Key=Environment,Value=staging Key=App,Value=sparki

aws secretsmanager create-secret \
  --name /sparki/staging/logging/fluentd-es-password \
  --secret-string "<generate-secure-password>" \
  --region $REGION \
  --tags Key=Environment,Value=staging

aws secretsmanager create-secret \
  --name /sparki/staging/logging/exporter-es-password \
  --secret-string "<generate-secure-password>" \
  --region $REGION \
  --tags Key=Environment,Value=staging

# Production secrets
aws secretsmanager create-secret \
  --name /sparki/prod/pagerduty/routing-key \
  --secret-string "<your-pagerduty-integration-key>" \
  --region $REGION \
  --tags Key=Environment,Value=prod Key=App,Value=sparki

aws secretsmanager create-secret \
  --name /sparki/prod/logging/fluentd-es-password \
  --secret-string "<generate-secure-password>" \
  --region $REGION \
  --tags Key=Environment,Value=prod

aws secretsmanager create-secret \
  --name /sparki/prod/logging/exporter-es-password \
  --secret-string "<generate-secure-password>" \
  --region $REGION \
  --tags Key=Environment,Value=prod

echo "AWS Secrets created successfully"
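
The `<generate-secure-password>` placeholders above can be filled with randomly generated values. One common approach (an illustration, not a mandated policy) is to generate them with openssl just before creating the secrets:

```shell
#!/bin/bash
# Generate 32-character random passwords (24 random bytes, base64-encoded)
FLUENTD_PASSWORD=$(openssl rand -base64 24)
EXPORTER_PASSWORD=$(openssl rand -base64 24)

# Pass these straight into the aws secretsmanager create-secret commands above
# rather than writing them to disk
echo "fluentd password:  $FLUENTD_PASSWORD"
echo "exporter password: $EXPORTER_PASSWORD"
```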

Deployment Procedure

Phase 1: Pre-Deployment Validation (10 min)

1.1 Verify Cluster Connectivity

#!/bin/bash
set -e

STAGING_CONTEXT="sparki-staging"
PROD_CONTEXT="sparki-prod"

echo "=== Cluster Connectivity Check ==="

# Test staging cluster
echo "Testing staging cluster..."
kubectl --context=$STAGING_CONTEXT cluster-info

# Test production cluster
echo "Testing production cluster..."
kubectl --context=$PROD_CONTEXT cluster-info

echo "✓ Both clusters reachable"

1.2 Verify Prerequisites

#!/bin/bash
check_prerequisite() {
  local context=$1
  local name=$2
  local command=$3
  
  echo -n "Checking $name (context: $context)... "
  if kubectl --context=$context $command &>/dev/null; then
    echo "✓"
    return 0
  else
    echo "✗"
    return 1
  fi
}

STAGING_CONTEXT="sparki-staging"
FAILED=0

# Check ExternalSecrets Operator
check_prerequisite $STAGING_CONTEXT "ExternalSecrets Operator" \
  "get pod -n external-secrets" || ((FAILED++))

# Check Istio
check_prerequisite $STAGING_CONTEXT "Istio" \
  "get namespace istio-system" || ((FAILED++))

# Check Kubernetes version
echo -n "Checking Kubernetes version... "
# The minor version may carry a "+" suffix on managed clusters (e.g. "24+"); strip it
VERSION=$(kubectl --context=$STAGING_CONTEXT version -o json | jq -r .serverVersion.minor | tr -d '+')
if [ "$VERSION" -ge 24 ]; then
  echo "✓ (1.$VERSION)"
else
  echo "✗ (requires 1.24+, have 1.$VERSION)"
  ((FAILED++))
fi

if [ $FAILED -gt 0 ]; then
  echo "✗ $FAILED prerequisite(s) failed"
  exit 1
else
  echo "✓ All prerequisites met"
fi

1.3 Verify AWS Connectivity

#!/bin/bash
echo "=== AWS Connectivity Check ==="

# Verify AWS CLI configured
aws sts get-caller-identity

# Verify secrets exist
for SECRET in \
  "/sparki/staging/pagerduty/routing-key" \
  "/sparki/staging/logging/fluentd-es-password" \
  "/sparki/staging/logging/exporter-es-password"
do
  echo -n "Checking secret: $SECRET... "
  if aws secretsmanager describe-secret --secret-id "$SECRET" &>/dev/null; then
    echo "✓"
  else
    echo "✗ Missing - create before deployment"
    exit 1
  fi
done

echo "✓ All AWS secrets present"

Phase 2: ClusterSecretStore Deployment (5 min)

Deploy cluster-wide secret access for both environments.
#!/bin/bash
set -e

echo "=== Deploying ClusterSecretStore ==="

# Apply ClusterSecretStore (works across all namespaces)
kubectl apply -f infra/kubernetes-manifests/base/external-secrets/secretstore/cluster-secret-store.yaml

# Wait for ClusterSecretStore to be ready
echo "Waiting for ClusterSecretStore to be ready..."
kubectl wait --for=condition=SecretStoreReady \
  clustersecretstore/aws-secrets \
  --timeout=120s

echo "✓ ClusterSecretStore ready"
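
If the `kubectl wait` times out, the store's Ready condition can be inspected directly (assuming the store is named `aws-secrets` as above):

```shell
#!/bin/bash
# Print the Ready condition of the ClusterSecretStore and its message
kubectl get clustersecretstore aws-secrets \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{" - "}{.status.conditions[?(@.type=="Ready")].message}{"\n"}'

# Common causes of NotReady: missing IRSA annotation on the service account,
# or the IAM role lacking secretsmanager:GetSecretValue permission
kubectl describe clustersecretstore aws-secrets | tail -20
```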

Phase 3: Staging Environment Deployment (25 min)

3.1 Deploy Observability Stack

#!/bin/bash
set -e

CONTEXT="sparki-staging"
echo "=== Staging: Deploying Observability Stack ==="

# Switch context
kubectl config use-context $CONTEXT

# 1. Create namespace
kubectl apply -f infra/kubernetes-manifests/base/observability/namespace.yaml

# 2. Deploy base observability resources
kubectl apply -f infra/kubernetes-manifests/base/observability/

# 3. Add Prometheus Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# 4. Create values file (heredoc delimiter left unquoted so the openssl
#    command substitution in the grafana section expands)
cat > /tmp/prometheus-values-staging.yaml <<EOF
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector:
      matchLabels:
        release: prometheus
    ruleSelector:
      matchLabels:
        release: prometheus
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

grafana:
  adminPassword: "$(openssl rand -base64 12)"
EOF

# 5. Install kube-prometheus-stack
helm upgrade --install prometheus \
  prometheus-community/kube-prometheus-stack \
  -n observability \
  -f /tmp/prometheus-values-staging.yaml \
  --wait --timeout=10m

# 6. Deploy environment-specific overlay
kubectl apply -k infra/kubernetes-manifests/overlays/staging/observability/

echo "✓ Observability stack deployed"
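
After the overlay is applied, it is worth confirming that the ExternalSecrets in the namespace have actually synced (a sketch; the resource names depend on your overlay contents):

```shell
#!/bin/bash
# List ExternalSecrets and flag any without a Ready=True condition
kubectl get externalsecret -n observability

NOT_SYNCED=$(kubectl get externalsecret -n observability -o json \
  | jq -r '.items[] | select(([.status.conditions[]? | select(.type=="Ready" and .status=="True")] | length) == 0) | .metadata.name')

if [ -n "$NOT_SYNCED" ]; then
  echo "✗ Not synced: $NOT_SYNCED"
  exit 1
fi
echo "✓ All ExternalSecrets synced"
```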

3.2 Deploy Logging Stack

#!/bin/bash
set -e

CONTEXT="sparki-staging"
echo "=== Staging: Deploying Logging Stack ==="

kubectl config use-context $CONTEXT

# 1. Create namespace
kubectl apply -f infra/kubernetes-manifests/base/logging/namespace.yaml

# 2. Deploy Elasticsearch, Kibana, Fluentd
kubectl apply -k infra/kubernetes-manifests/base/logging/

# 3. Wait for Elasticsearch to be ready
echo "Waiting for Elasticsearch (this may take 5-10 minutes)..."
kubectl wait --for=condition=ready pod \
  -l elasticsearch.k8s.elastic.co/cluster-name=elasticsearch \
  -n logging \
  --timeout=600s

# 4. Wait for the Elasticsearch bootstrap jobs to complete
echo "Waiting for Elasticsearch bootstrap jobs..."
kubectl wait --for=condition=complete job \
  -l app=fluentd-es-bootstrap \
  -n logging \
  --timeout=120s

kubectl wait --for=condition=complete job \
  -l app=exporter-es-bootstrap \
  -n logging \
  --timeout=120s

# 5. Deploy environment-specific secrets
kubectl apply -k infra/kubernetes-manifests/overlays/staging/logging/

# 6. Verify Fluentd is running
echo "Verifying Fluentd..."
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=fluentd \
  -n logging \
  --timeout=120s

echo "✓ Logging stack deployed"

3.3 Deploy Istio Observability

#!/bin/bash
set -e

CONTEXT="sparki-staging"
echo "=== Staging: Deploying Istio Observability ==="

kubectl config use-context $CONTEXT

# Deploy Istio observability dashboards and rules
kubectl apply -k infra/kubernetes-manifests/base/istio/observability/

echo "✓ Istio observability deployed"

3.4 Verify Staging Deployment

#!/bin/bash
CONTEXT="sparki-staging"
echo "=== Staging Deployment Verification ==="

kubectl config use-context $CONTEXT

# Run verification script
if bash ./infra/kubernetes-manifests/base/observability/scripts/verify-deployment.sh; then
  echo "✓ Staging deployment verified"
else
  echo "✗ Staging deployment verification failed - check logs"
  exit 1
fi
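
The deployment checklist below calls for verifying that a test alert reaches PagerDuty. One way to exercise the route end-to-end (a sketch using Alertmanager's v2 API; the alert name and labels are placeholders) is to inject a synthetic critical alert:

```shell
#!/bin/bash
# Send a synthetic critical alert into Alertmanager and watch for it in PagerDuty
kubectl port-forward -n observability svc/prometheus-kube-prom-alertmanager 9093:9093 &
sleep 2

curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "DeploymentSmokeTest", "severity": "critical"},
    "annotations": {"summary": "Synthetic alert - verify PagerDuty routing, then resolve"}
  }]'

kill %1 2>/dev/null || true
echo "Check PagerDuty for the DeploymentSmokeTest incident, then acknowledge and resolve it"
```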

Phase 4: Production Deployment (25 min)

Follow the same steps as Phase 3, but with production context and overlays:
#!/bin/bash
set -e

CONTEXT="sparki-prod"
echo "=== Production: Deploying Full Stack ==="

kubectl config use-context $CONTEXT

# 1. Observability (follow Phase 3.1, use prod context)
kubectl apply -f infra/kubernetes-manifests/base/observability/namespace.yaml
kubectl apply -f infra/kubernetes-manifests/base/observability/

# NOTE: create /tmp/prometheus-values-prod.yaml first (as in Phase 3.1 step 4,
# with production-sized retention and storage)
helm upgrade --install prometheus \
  prometheus-community/kube-prometheus-stack \
  -n observability \
  -f /tmp/prometheus-values-prod.yaml \
  --wait --timeout=10m

kubectl apply -k infra/kubernetes-manifests/overlays/prod/observability/

# 2. Logging (follow Phase 3.2, use prod context)
kubectl apply -f infra/kubernetes-manifests/base/logging/namespace.yaml
kubectl apply -k infra/kubernetes-manifests/base/logging/

kubectl wait --for=condition=ready pod \
  -l elasticsearch.k8s.elastic.co/cluster-name=elasticsearch \
  -n logging \
  --timeout=600s

kubectl wait --for=condition=complete job \
  -l app=fluentd-es-bootstrap \
  -n logging \
  --timeout=120s

kubectl apply -k infra/kubernetes-manifests/overlays/prod/logging/

# 3. Istio Observability (follow Phase 3.3, use prod context)
kubectl apply -k infra/kubernetes-manifests/base/istio/observability/

# 4. Verification
bash ./infra/kubernetes-manifests/base/observability/scripts/verify-deployment.sh

echo "✓ Production deployment complete"

Phase 5: Post-Deployment Verification (10 min)

5.1 Component Health

#!/bin/bash
echo "=== Post-Deployment Health Check ==="

# Check all deployments
echo "Checking deployments..."
kubectl get deploy -n observability
kubectl get deploy -n logging
kubectl get deploy -n istio-system

# Check StatefulSets
echo "Checking statefulsets..."
kubectl get sts -n observability
kubectl get sts -n logging

# Check PVCs
echo "Checking persistent volumes..."
kubectl get pvc -n observability
kubectl get pvc -n logging

# Resource usage
echo "Checking resource usage..."
kubectl top pods -n observability
kubectl top pods -n logging
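
To catch anything the per-namespace listings miss, a single sweep for pods that are not Running or Completed can be run across all three namespaces (a convenience sketch):

```shell
#!/bin/bash
# Flag pods that are neither Running nor Completed in the monitored namespaces
UNHEALTHY=0
for NS in observability logging istio-system; do
  BAD=$(kubectl get pods -n "$NS" --no-headers 2>/dev/null \
    | awk '$3 != "Running" && $3 != "Completed" {print $1 " (" $3 ")"}')
  if [ -n "$BAD" ]; then
    echo "✗ $NS: $BAD"
    UNHEALTHY=1
  fi
done
[ "$UNHEALTHY" -eq 0 ] && echo "✓ All pods healthy"
```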

5.2 Integration Test

#!/bin/bash
echo "=== Integration Test ==="

# 1. Log generation test
echo "1. Testing log ingestion..."
# busybox sh lacks {1..10} brace expansion; use a POSIX while loop instead
kubectl run test-log-producer --image=busybox -n logging --restart=Never \
  -- sh -c 'i=1; while [ "$i" -le 10 ]; do echo "Test log $i"; i=$((i+1)); sleep 1; done'

sleep 10

# Verify logs appear in Elasticsearch
ES_PASSWORD=$(kubectl get secret -n logging elasticsearch-es-elastic-user -o jsonpath='{.data.elastic}' | base64 -d)

kubectl exec -n logging elasticsearch-es-default-0 -- \
  curl -s -k -u "elastic:$ES_PASSWORD" \
  'https://localhost:9200/_cat/indices?v&h=index,docs.count' | head -20

# 2. Metrics test
echo "2. Testing metrics scraping..."
kubectl port-forward -n observability svc/prometheus-kube-prom-prometheus 9090:9090 &
sleep 2

# Query a metric
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result | length'

kill %1 2>/dev/null || true

# 3. Alert routing test
echo "3. Testing alert routing..."
kubectl port-forward -n observability svc/prometheus-kube-prom-alertmanager 9093:9093 &
sleep 2

# Check if alertmanager is ready
curl -s http://localhost:9093/-/healthy

kill %1 2>/dev/null || true

echo "✓ Integration tests complete"
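
To confirm the busybox pod's output actually made it into Elasticsearch (not just that indices exist), the cluster can be searched for the test message itself (a sketch; the `logstash-*` index pattern is an assumption, adjust to match your Fluentd output configuration):

```shell
#!/bin/bash
# Search Elasticsearch for the synthetic "Test log" messages
ES_PASSWORD=$(kubectl get secret -n logging elasticsearch-es-elastic-user \
  -o jsonpath='{.data.elastic}' | base64 -d)

HITS=$(kubectl exec -n logging elasticsearch-es-default-0 -- \
  curl -s -k -u "elastic:$ES_PASSWORD" \
  'https://localhost:9200/logstash-*/_search?q=%22Test+log%22' | jq '.hits.total.value')

if [ "$HITS" -gt 0 ] 2>/dev/null; then
  echo "✓ Found $HITS matching log documents"
else
  echo "✗ Test logs not found - check Fluentd output configuration"
fi
```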

5.3 Dashboard Verification

#!/bin/bash
echo "=== Dashboard Verification ==="

# Port-forward to Grafana
kubectl port-forward -n observability \
  svc/prometheus-kube-prom-grafana 3000:80 &
sleep 2

# Check Grafana API
GRAFANA_URL="http://localhost:3000"
GRAFANA_ADMIN_PASSWORD=$(kubectl get secret -n observability prometheus-kube-prom-grafana -o jsonpath='{.data.admin-password}' | base64 -d)

# Get list of dashboards
curl -s -u "admin:$GRAFANA_ADMIN_PASSWORD" \
  "$GRAFANA_URL/api/search?type=dash-db" | jq '.[] | .title'

kill %1 2>/dev/null || true

echo "✓ Dashboards verified"

Deployment Checklist

  • Pre-Deployment
    • kubectl authenticated to both clusters
    • AWS Secrets Manager secrets created
    • ExternalSecrets Operator deployed
    • Istio deployed
    • Prerequisites verified
  • ClusterSecretStore
    • Applied to both clusters
    • Status: SecretStoreReady
  • Staging Observability
    • Prometheus running and scraping targets
    • Alertmanager configured and ready
    • Grafana running with dashboards
    • ExternalSecrets synced
  • Staging Logging
    • Elasticsearch running (3 nodes)
    • Kibana accessible
    • Fluentd DaemonSet running (pods on all nodes)
    • Bootstrap jobs completed
    • Logs flowing into Elasticsearch
  • Staging Istio
    • Dashboards provisioned
    • Alert rules deployed
  • Staging Verification
    • Health check script passes
    • PagerDuty integration verified
    • Test alert routes to PagerDuty
  • Production Deployment
    • Same as Staging, with prod overlays
    • Capacity monitoring in place
    • On-call team notified
  • Post-Deployment
    • All health checks passing
    • Integration tests successful
    • Dashboards visible in Grafana
    • Alerts functioning end-to-end
    • Documentation updated

Rollback Procedures

Rollback Staging

#!/bin/bash
set -e
CONTEXT="sparki-staging"
kubectl config use-context $CONTEXT

# 1. Remove Helm release
helm uninstall prometheus -n observability

# 2. Remove manifests
kubectl delete -k infra/kubernetes-manifests/overlays/staging/observability/
kubectl delete -f infra/kubernetes-manifests/base/observability/

kubectl delete -k infra/kubernetes-manifests/overlays/staging/logging/
kubectl delete -k infra/kubernetes-manifests/base/logging/

kubectl delete -k infra/kubernetes-manifests/base/istio/observability/

# 3. Remove namespaces
kubectl delete namespace observability logging

echo "✓ Staging rolled back"
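
Deleting the namespaces removes the PersistentVolumeClaims, but PersistentVolumes with a Retain reclaim policy are left behind and keep consuming storage. A quick post-rollback check (a sketch; deleting volumes is destructive, so confirm the data is disposable first):

```shell
#!/bin/bash
# After rollback, look for PersistentVolumes stuck in Released state
kubectl get pv | grep -E 'Released' || echo "No released volumes"

# Delete a specific released volume once its data is confirmed disposable:
# kubectl delete pv <volume-name>
```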

Rollback Production

# Same as rollback staging, but use prod context and overlays
CONTEXT="sparki-prod"
# ... (repeat rollback steps with prod context)

Troubleshooting

Common Issues

Issue: ExternalSecret not syncing
# Check ClusterSecretStore
kubectl describe clustersecretstore aws-secrets

# Check IRSA role permissions
aws iam get-role-policy --role-name sparki-external-secrets-irsa-role --policy-name SecretsManager
Issue: Prometheus targets DOWN
# Check ServiceMonitor labels
kubectl get servicemonitor -n logging -o yaml | grep -A 3 labels

# Reload Prometheus config
kubectl port-forward -n observability svc/prometheus-kube-prom-prometheus 9090:9090
# Then: curl -X POST http://localhost:9090/-/reload
Issue: Grafana dashboards missing
# Verify ConfigMaps exist
kubectl get configmap -n observability -l grafana_dashboard=1

# Check Grafana logs
kubectl logs -n observability -l app.kubernetes.io/name=grafana

Sign-Off

After successful deployment, document sign-off:
Environment: [Staging | Production]
Date: YYYY-MM-DD
Deployed By: [Name]
Verified By: [Name]
Issues: [None | List]
Next Steps: [Monitoring baseline, on-call rotation setup, etc.]