Sparki Infrastructure - Complete Deployment Runbook
Overview
This runbook covers end-to-end deployment of the Sparki observability infrastructure across staging and production environments. It includes all components from Block 13: Observability Implementation.
Covered Components:
- ExternalSecrets ClusterSecretStore (AWS Secrets Manager integration)
- Prometheus + Alertmanager (metrics, alerting)
- Grafana (dashboards)
- Elasticsearch + Kibana + Fluentd (centralized logging)
- Elasticsearch Exporter (metrics scraping)
- PagerDuty integration (critical alerts)
- Istio observability (service mesh dashboards)
Environments:
- Staging: Pre-production testing, non-critical workloads
- Production: Mission-critical services
Estimated Deployment Time:
- Fresh cluster: 45-60 minutes
- Staged deployment: 30 minutes per environment
Prerequisites
Before Starting
Cluster Access
- kubectl authenticated to staging and production clusters
- kubeconfig configured with correct contexts
- Admin role in both clusters
AWS Setup
Infrastructure Readiness
- Kubernetes 1.24+
- 4+ CPU and 8 GB+ RAM available
- ExternalSecrets Operator deployed in the external-secrets namespace
- Istio 1.15+ deployed
- ECK (Elastic Cloud on Kubernetes) deployed or provisioned
Required Tools
- kubectl
- helm
- jq (for JSON parsing)
- aws-cli
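A quick way to confirm that the tools listed above are on the PATH before starting. This is a minimal sketch; the function name `check_required_tools` is illustrative and not part of the Sparki repository.

```shell
# Report any required CLI tool that is missing from PATH.
# Prints one line per missing tool and returns non-zero if any is missing.
check_required_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Typical use would be `check_required_tools kubectl helm jq aws` (the aws-cli package installs its binary as `aws`).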
AWS Secrets Manager Setup
Create secrets in AWS (one-time setup, shared across environments).
Deployment Procedure
Phase 1: Pre-Deployment Validation (10 min)
1.1 Verify Cluster Connectivity
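One possible connectivity check, sketched under the assumption that each cluster is reachable through a named kubeconfig context. The function name `verify_cluster` is hypothetical, and the context name is a placeholder for whatever your kubeconfig defines.

```shell
# Check that a kubeconfig context exists and that its API server responds.
# $1 is a kubeconfig context name (placeholder; use your own).
verify_cluster() {
  ctx="$1"
  kubectl config get-contexts "$ctx" >/dev/null 2>&1 \
    || { echo "context not found: $ctx"; return 1; }
  kubectl --context "$ctx" cluster-info >/dev/null 2>&1 \
    || { echo "cluster unreachable: $ctx"; return 1; }
  echo "ok: $ctx"
}
```

Run it once per environment, for example `verify_cluster <staging-context>` followed by `verify_cluster <production-context>`.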
1.2 Verify Prerequisites
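The version and operator prerequisites can be checked roughly as follows. This is a sketch, not the project's own script: the function names are hypothetical, and the ExternalSecrets Operator check only confirms that some pod exists in the `external-secrets` namespace, not that the operator is healthy.

```shell
# Succeed if version $1 (e.g. "v1.27.3") is at least MAJOR.MINOR given in $2 (e.g. "1.24").
version_at_least() {
  have=${1#v}; want=$2
  have_major=${have%%.*}; rest=${have#*.}; have_minor=${rest%%.*}
  want_major=${want%%.*}; want_minor=${want#*.}
  # Drop any non-digit suffix so the numeric comparison is safe.
  have_minor=${have_minor%%[!0-9]*}
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Check the server version and look for ExternalSecrets Operator pods.
# $1 is a kubeconfig context name (placeholder).
verify_prerequisites() {
  ctx="$1"
  server=$(kubectl --context "$ctx" version -o json | jq -r '.serverVersion.gitVersion')
  case "$server" in
    v[0-9]*) ;;
    *) echo "could not read server version"; return 1 ;;
  esac
  version_at_least "$server" 1.24 \
    || { echo "Kubernetes $server is older than 1.24"; return 1; }
  kubectl --context "$ctx" get pods -n external-secrets -o name 2>/dev/null | grep -q . \
    || { echo "no pods found in external-secrets namespace"; return 1; }
  echo "prerequisites ok: $ctx"
}
```

Istio and ECK checks are omitted here because their namespaces and resource names depend on how they were installed.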
1.3 Verify AWS Connectivity
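A minimal sketch of an AWS check: confirm the current credentials resolve to an identity, then confirm each expected secret exists in Secrets Manager. The function name `verify_aws_secrets` is hypothetical, and the secret names passed to it are placeholders, since this runbook does not list them.

```shell
# Confirm AWS credentials work and that each named secret exists in Secrets Manager.
# Arguments are secret names (placeholders; the real names are defined by your setup).
verify_aws_secrets() {
  aws sts get-caller-identity >/dev/null 2>&1 \
    || { echo "AWS credentials not usable"; return 1; }
  status=0
  for name in "$@"; do
    if aws secretsmanager describe-secret --secret-id "$name" >/dev/null 2>&1; then
      echo "found: $name"
    else
      echo "missing: $name"
      status=1
    fi
  done
  return "$status"
}
```

The check uses `describe-secret` rather than `get-secret-value` so that secret contents never reach the terminal.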
Phase 2: ClusterSecretStore Deployment (5 min)
Deploy cluster-wide secret access for both environments.
Phase 3: Staging Environment Deployment (25 min)
3.1 Deploy Observability Stack
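A sketch of what this step might look like, assuming the stack is packaged as kustomize overlays (the runbook mentions overlays but does not show their paths). The function name `deploy_overlay`, the overlay directory, and the namespace are all placeholders.

```shell
# Apply a kustomize overlay to one cluster and wait for its Deployments to become available.
# $1 context, $2 overlay directory, $3 namespace: all placeholders for your own values.
deploy_overlay() {
  ctx="$1"; overlay="$2"; ns="$3"
  kubectl --context "$ctx" apply -k "$overlay" || return 1
  kubectl --context "$ctx" -n "$ns" wait deployment --all \
    --for=condition=Available --timeout=300s
}
```

Note that `kubectl wait` only covers Deployments here; components that run as StatefulSets or DaemonSets need their own `kubectl rollout status` check.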
3.2 Deploy Logging Stack
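After the logging manifests are applied, Elasticsearch typically needs a few minutes to form its cluster. The sketch below polls the `status.health` field that ECK sets on its Elasticsearch resource. The function name `wait_for_es_green`, the namespace, and the resource name are assumptions for illustration.

```shell
# Poll an ECK Elasticsearch resource until it reports green health, or give up.
# $1 context, $2 namespace, $3 resource name, $4 max attempts, $5 seconds between attempts.
wait_for_es_green() {
  ctx="$1"; ns="$2"; name="$3"; attempts="${4:-30}"; delay="${5:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    health=$(kubectl --context "$ctx" -n "$ns" get elasticsearch "$name" \
      -o jsonpath='{.status.health}' 2>/dev/null)
    if [ "$health" = "green" ]; then
      echo "elasticsearch $name is green"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "elasticsearch $name not green after $attempts attempts (last: ${health:-unknown})"
  return 1
}
```

With the defaults this waits up to about five minutes before reporting failure.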
3.3 Deploy Istio Observability
3.4 Verify Staging Deployment
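One verification that is easy to script is whether every ExternalSecret has synced. The sketch below lists ExternalSecrets whose `Ready` condition is not `True`; an empty result means all are synced. The function name `unsynced_external_secrets` and the namespace argument are placeholders.

```shell
# List ExternalSecrets in a namespace whose Ready condition is not "True".
# $1 context, $2 namespace (both placeholders).
unsynced_external_secrets() {
  ctx="$1"; ns="$2"
  kubectl --context "$ctx" -n "$ns" get externalsecrets \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    awk -F '\t' '$1 != "" && $2 != "True" { print $1 }'
}
```

An ExternalSecret with no status yet is also reported, since its condition column is empty.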
Phase 4: Production Deployment (25 min)
Follow the same steps as Phase 3, but with the production context and overlays.
Phase 5: Post-Deployment Verification (10 min)
5.1 Component Health
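A rough health sweep can be sketched as below: for each namespace, list pods that are neither Running nor Succeeded. The function names and namespace arguments are hypothetical. This is a coarse check, because a pod in a crash loop still reports the Running phase, so it does not replace the project's health check script.

```shell
# Print pods that are neither Running nor Succeeded; an empty result means none were found.
unhealthy_pods() {
  kubectl --context "$1" -n "$2" get pods --no-headers \
    --field-selector='status.phase!=Running,status.phase!=Succeeded' 2>/dev/null
}

# Check several namespaces in one context; returns non-zero if any has unhealthy pods.
# $1 context, remaining arguments are namespaces (all placeholders).
check_namespaces() {
  ctx="$1"; shift
  failed=0
  for ns in "$@"; do
    bad=$(unhealthy_pods "$ctx" "$ns")
    if [ -n "$bad" ]; then
      echo "unhealthy in $ns:"
      echo "$bad"
      failed=1
    else
      echo "healthy: $ns"
    fi
  done
  return "$failed"
}
```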
5.2 Integration Test
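To exercise the alerting path end to end, a synthetic alert can be posted to Alertmanager's v2 API. The sketch below assumes Alertmanager is reachable at a local URL (for example through `kubectl port-forward`), that `curl` is installed (it is not in the Required Tools list), and that alerts labelled `severity: critical` are routed to PagerDuty. The alert name and function names are hypothetical.

```shell
# Build the JSON body for a synthetic alert (the Alertmanager v2 API expects an array).
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Runbook integration test"}}]\n' \
    "$1" "$2"
}

# POST the alert to an Alertmanager base URL given in $1.
# $2 alert name and $3 severity are optional.
send_test_alert() {
  url="$1"
  test_alert_payload "${2:-RunbookIntegrationTest}" "${3:-critical}" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$url/api/v2/alerts"
}
```

For example, `send_test_alert http://localhost:9093` after forwarding port 9093. Tell the on-call team before sending a test alert to a production receiver.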
5.3 Dashboard Verification
Deployment Checklist
Pre-Deployment
- kubectl authenticated to both clusters
- AWS Secrets Manager secrets created
- ExternalSecrets Operator deployed
- Istio deployed
- Prerequisites verified
ClusterSecretStore
- Applied to both clusters
- Status: SecretStoreReady
Staging Observability
- Prometheus running and scraping targets
- Alertmanager configured and ready
- Grafana running with dashboards
- ExternalSecrets synced
Staging Logging
- Elasticsearch running (3 nodes)
- Kibana accessible
- Fluentd DaemonSet running (pods on all nodes)
- Bootstrap jobs completed
- Logs flowing into Elasticsearch
Staging Istio
- Dashboards provisioned
- Alert rules deployed
Staging Verification
- Health check script passes
- PagerDuty integration verified
- Test alert routes to PagerDuty
Production Deployment
- Same as Staging, with prod overlays
- Capacity monitoring in place
- On-call team notified
Post-Deployment
- All health checks passing
- Integration tests successful
- Dashboards visible in Grafana
- Alerts functioning end-to-end
- Documentation updated
Rollback Procedures
Rollback Staging
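One possible rollback approach, sketched under the assumption that the components to revert run as Deployments with a previous revision available. The function name `rollback_deployments`, the context, and the namespace are placeholders. StatefulSets and DaemonSets are not covered by this sketch.

```shell
# Roll back every Deployment in a namespace to its previous revision.
# $1 context, $2 namespace (both placeholders).
rollback_deployments() {
  ctx="$1"; ns="$2"
  kubectl --context "$ctx" -n "$ns" get deployments -o name |
    while read -r deploy; do
      echo "rolling back $deploy"
      # The loop runs in a subshell, so exit here stops the loop and fails the function.
      kubectl --context "$ctx" -n "$ns" rollout undo "$deploy" || exit 1
    done
}
```

On a first-time deployment there is no earlier revision, so `kubectl rollout undo` will report an error instead of reverting anything.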
Rollback Production
Troubleshooting
Common Issues
Issue: ExternalSecret not syncing
Support & Documentation
- DEPLOYMENT_GUIDE.md - Detailed deployment steps
- VERIFICATION_GUIDE.md - Comprehensive verification
- OPERATIONS_RUNBOOK.md - Day-2 operations
- Logging Operations - ELK stack guide
- Istio Operations - Service mesh guide