Block 13 Observability - Complete Documentation Summary
Overview
This directory contains comprehensive documentation for the Sparki observability infrastructure deployed in Block 13. All runbooks and guides are production-ready and cover deployment, verification, operations, and troubleshooting.
Documentation Files
1. OBSERVABILITY_DEPLOYMENT_RUNBOOK.md (Root Directory)
Purpose: Complete end-to-end deployment guide for the entire observability infrastructure
Who Should Use:
- DevOps engineers deploying to staging/production
- Platform team members doing initial setup
- Anyone deploying for the first time
Contents:
- Prerequisites (AWS setup, IRSA, cluster requirements)
- 5-phase deployment procedure (validation, ClusterSecretStore, staging, production, verification)
- AWS Secrets Manager setup
- Step-by-step deployment scripts for each component
- Verification checklist
- Rollback procedures
- Troubleshooting common deployment issues
2. OBSERVABILITY_QUICK_REFERENCE.md (Root Directory)
Purpose: Fast copy-paste reference for common operational tasks
Who Should Use:
- On-call engineers troubleshooting issues
- Anyone needing quick answers
- Operators doing daily/weekly tasks
Contents:
- Prometheus quick commands (connect, check targets, reload)
- Alertmanager quick commands (view alerts, silence, test routing)
- Grafana quick commands (connect, list dashboards, restart)
- Elasticsearch quick commands (health, indices, logs)
- ExternalSecrets troubleshooting
- Kubernetes resource debugging
- Health checks (quick & deep)
- Emergency commands
3. infra/kubernetes-manifests/base/observability/DEPLOYMENT_GUIDE.md
Purpose: Detailed step-by-step deployment guide for the observability stack only
Who Should Use:
- DevOps engineers deploying only observability (not full stack)
- Anyone needing detailed explanation of each step
- Customizing observability deployment
Contents:
- Architecture overview (Prometheus, Alertmanager, Grafana, kube-prometheus-stack)
- Prerequisites (cluster, AWS, IRSA)
- 10-step deployment procedure
- Helm values configuration
- Environment-specific overlays
- Troubleshooting specific to observability (ExternalSecrets, ServiceMonitors, dashboards)
- Rollback procedures
- Steps 1-5: Infrastructure setup
- Steps 6-9: Component deployment
- Step 10: Verification
4. infra/kubernetes-manifests/base/observability/VERIFICATION_GUIDE.md
Purpose: Comprehensive verification procedures for post-deployment validation
Who Should Use:
- Verifying deployment succeeded
- Running automated/manual health checks
- Troubleshooting deployment issues
- Setting up monitoring baseline
Contents:
- Pre-deployment checks (AWS, IRSA, Kubernetes version)
- Component health checks (ClusterSecretStore, ExternalSecrets, Prometheus, Alertmanager, Grafana)
- Data flow verification (Prometheus scraping, Fluentd ingestion, metrics collection, dashboard provisioning)
- Integration testing (alert routing, PagerDuty, Elasticsearch exporter)
- Performance baseline (resource usage, latency)
- Rollback validation
- Automated verification script
5. infra/kubernetes-manifests/base/observability/OPERATIONS_RUNBOOK.md
Purpose: Day-2 operations guide for running and troubleshooting the observability stack
Who Should Use:
- On-call engineers
- Daily operational tasks
- Incident response
- Scaling and tuning
Contents:
- Quick reference (commands, endpoints, namespaces)
- Daily/weekly/monthly/quarterly maintenance checklists
- Incident response playbooks:
  - INC-001: Alertmanager Down (symptoms, diagnosis, resolution)
  - INC-002: Prometheus Not Scraping Targets
  - INC-003: Grafana Dashboards Missing
  - INC-004: High Memory Usage in Prometheus
- Prometheus operations (query, reload config, view rules, scale)
- Alertmanager operations (view alerts, silence, test routing)
- Grafana operations (add dashboards, export, password reset)
- Performance tuning procedures
- Maintenance task checklists
6. infra/kubernetes-manifests/base/logging/OPERATIONS_RUNBOOK.md
Purpose: Operations guide for the EFK logging stack (Elasticsearch, Fluentd, Kibana)
Who Should Use:
- Troubleshooting logging issues
- Elasticsearch operations
- Capacity planning for logging
- Disaster recovery
Contents:
- Quick reference (commands, endpoints, index families)
- Health checks (cluster health, node status, Fluentd, Kibana)
- Incident response:
  - INC-001: Elasticsearch Cluster RED
  - INC-002: Log Ingestion Stopped
  - INC-003: Kibana Unavailable
- Elasticsearch operations (scale, rotate indices, reindex, delete)
- Fluentd troubleshooting (buffer, parser, memory)
- Kibana operations (reset password, create patterns, export/import)
- ILM management (view status, retry, move phases)
- Snapshot & recovery procedures
- Performance tuning
How to Use These Documents
Scenario 1: Initial Deployment
- Read: OBSERVABILITY_DEPLOYMENT_RUNBOOK.md (overview)
- Prepare: AWS Secrets, IRSA role, prerequisites
- Execute: Follow Phase 1-5 deployment steps
- Verify: Run VERIFICATION_GUIDE.md automated script
- Document: Capture sign-off in deployment runbook
Scenario 2: Day-2 Operations
- Daily task? → Check OBSERVABILITY_QUICK_REFERENCE.md
- Recurring maintenance? → Check OPERATIONS_RUNBOOK.md maintenance checklists
- Alert issues? → Check OPERATIONS_RUNBOOK.md incident playbooks
- Logging issues? → Check logging/OPERATIONS_RUNBOOK.md
Scenario 3: Troubleshooting an Incident
- Identify symptoms (Alertmanager down? No metrics? Dashboards blank?)
- Find matching INC-XXX in OPERATIONS_RUNBOOK.md
- Follow diagnosis steps
- Follow resolution steps
- Run verification from VERIFICATION_GUIDE.md to confirm fix
- Document incident for future reference
Scenario 4: Customizing Deployment
- Read: DEPLOYMENT_GUIDE.md (architecture overview & architectural decisions)
- Modify: Kustomization overlays in `infra/kubernetes-manifests/overlays/`
- Test: Run VERIFICATION_GUIDE.md to ensure changes work
- Document: Update runbooks if significant changes made
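Before applying a customized overlay, it can help to render it and diff it against the cluster. The helpers below are a minimal sketch, not part of the runbooks: the function names are hypothetical, and the overlay directory names (for example `staging`) are assumed from the environment names used in this summary. `kubectl kustomize` and `kubectl diff -k` are standard kubectl subcommands.

```shell
# Sketch only: OVERLAY_ROOT is the path named in Scenario 4; the overlay
# directory names passed in (e.g. "staging") are assumptions.
OVERLAY_ROOT="infra/kubernetes-manifests/overlays"

# Render an overlay locally without touching the cluster.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to dry-run the helper itself.
preview_overlay() {
  env_name="${1:-}"
  if [ -z "$env_name" ]; then
    echo "usage: preview_overlay <environment>" >&2
    return 1
  fi
  "${KUBECTL:-kubectl}" kustomize "$OVERLAY_ROOT/$env_name"
}

# Show what would change in the cluster if the overlay were applied.
diff_overlay() {
  env_name="${1:-}"
  if [ -z "$env_name" ]; then
    echo "usage: diff_overlay <environment>" >&2
    return 1
  fi
  "${KUBECTL:-kubectl}" diff -k "$OVERLAY_ROOT/$env_name"
}
```

Typical use would be `preview_overlay staging | less` followed by `diff_overlay staging`. Note that `kubectl diff` returns exit code 1 when it finds differences, which is expected rather than a failure.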
Quick Reference Table
| Task | Document | Section |
|---|---|---|
| Initial deployment | OBSERVABILITY_DEPLOYMENT_RUNBOOK.md | Phases 1-5 |
| Verify deployment | VERIFICATION_GUIDE.md | All sections |
| Daily checks | OPERATIONS_RUNBOOK.md | Daily checklist |
| Alert not routing | OPERATIONS_RUNBOOK.md | INC-001 |
| No metrics data | OPERATIONS_RUNBOOK.md | INC-002 |
| Dashboards missing | OPERATIONS_RUNBOOK.md | INC-003 |
| Restart component | QUICK_REFERENCE.md | Kubernetes Resources |
| Query metrics | QUICK_REFERENCE.md | Prometheus |
| Silence alert | QUICK_REFERENCE.md | Alertmanager |
| Check logs | QUICK_REFERENCE.md | Elasticsearch & Logging |
| Scale Prometheus | OPERATIONS_RUNBOOK.md | Prometheus Operations |
| Test PagerDuty | OPERATIONS_RUNBOOK.md | Alertmanager Operations |
Key Architectural Decisions
These are documented in the deployment guides. Key points:
- Least-Privilege Access: Fluentd and Exporter each have minimal required permissions
- GitOps Everything: All dashboards as ConfigMaps, not manual Grafana imports
- ExternalSecrets: Credentials sourced from AWS Secrets Manager rather than manually created K8s Secrets
- Label-Based Discovery: Prometheus discovers scrape targets via `release: prometheus` labels
- Environment Isolation: Separate overlays for staging and production
- No Manual Configuration: All config as YAML (Git source of truth)
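To illustrate label-based discovery, the sketch below prints a ServiceMonitor manifest carrying the `release: prometheus` label. It assumes the Prometheus Operator CRDs that ship with kube-prometheus-stack are installed. The function name, the app name, the namespace, and the port name `metrics` are hypothetical placeholders, not values taken from the Sparki manifests.

```shell
# Print a ServiceMonitor manifest labeled for discovery by Prometheus.
# The app name, namespace, and port name "metrics" are hypothetical.
render_servicemonitor() {
  app="${1:-}"
  ns="${2:-default}"
  if [ -z "$app" ]; then
    echo "usage: render_servicemonitor <app> [namespace]" >&2
    return 1
  fi
  cat <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${app}
  namespace: ${ns}
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: ${app}
  endpoints:
    - port: metrics
      interval: 30s
EOF
}
```

In keeping with the GitOps decision above, the rendered output would normally be committed to the repository rather than applied by hand, for example `render_servicemonitor my-api observability > servicemonitor.yaml`.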
Important Configuration Files
| File | Purpose | Location |
|---|---|---|
| ClusterSecretStore | AWS Secrets Manager access | base/external-secrets/secretstore/cluster-secret-store.yaml |
| AlertmanagerConfig | PagerDuty routing rules | base/observability/alertmanagerconfig-pagerduty.yaml |
| Prometheus values | Helm chart configuration | terraform-infrastructure/modules/observability/values/prometheus-values.yaml |
| Fluentd bootstrap | Create ES user/role | base/logging/elasticsearch/fluentd-bootstrap-job.yaml |
| Exporter bootstrap | Create ES read-only user | base/logging/elasticsearch/exporter-bootstrap-job.yaml |
| Istio dashboards | Grafana dashboard ConfigMaps | base/istio/observability/grafana-dashboards/ |
Maintenance Calendar
Daily (5 min)
- Run health check script from QUICK_REFERENCE.md
- Glance at Grafana dashboards
Weekly (30 min)
- Check Prometheus resource usage
- Review active alerts in Alertmanager
- Test alert routing with test alert
- Verify ExternalSecrets are syncing
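The weekly ExternalSecrets check can be scripted. The following is a sketch under stated assumptions, not the health check from the quick reference: the function names are hypothetical, and it assumes the External Secrets Operator reports a `Ready` condition on each ExternalSecret, which is its usual behavior.

```shell
# List every ExternalSecret with the status of its Ready condition.
# Output columns: namespace, name, Ready status (True / False / <none>).
list_externalsecrets() {
  "${KUBECTL:-kubectl}" get externalsecrets --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status'
}

# Read that listing on stdin and print only entries that are not Ready.
# Returns non-zero when at least one entry is unhealthy.
filter_not_ready() {
  awk '$3 != "True" { print $1 "/" $2 " ready=" $3; bad = 1 } END { exit bad + 0 }'
}
```

Running `list_externalsecrets | filter_not_ready` prints nothing and returns 0 when every ExternalSecret is synced, so it fits into a cron job or CI check.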
Monthly (1-2 hours)
- Capacity planning review
- High-cardinality metric analysis
- Alert rule optimization
- Update thresholds based on trends
Quarterly (4 hours)
- Comprehensive audit of alerting rules
- PagerDuty routing policy review
- Disaster recovery test (export/import dashboards)
- Performance optimization review
- Plan capacity expansion
Support & Escalation
For Documentation Issues
- Unclear instructions? → Update relevant .md file
- Missing information? → Add to appropriate section
- Broken commands? → Test and fix
For Operational Issues
- First: Check QUICK_REFERENCE.md for quick fix
- Second: Find INC-XXX matching symptoms in OPERATIONS_RUNBOOK.md
- Third: Check infrastructure logs: `kubectl logs -f -n observability -l app=<component>`
- Last: Escalate to Platform SME
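When collecting evidence for an escalation, a bounded log snapshot is often easier to share than a live `-f` stream. The helper below is a hedged sketch: the function name and the `--since` / `--tail` defaults are assumptions, while the `observability` namespace and the `app=<component>` selector are taken from the log command in the third step above. If pods in a given deployment carry a different label (for example `app.kubernetes.io/name`), the selector would need adjusting.

```shell
# Print recent logs for one observability component.
# Namespace and app=<component> selector follow the log command in the
# escalation steps; the --since and --tail defaults are assumptions.
component_logs() {
  component="${1:-}"
  if [ -z "$component" ]; then
    echo "usage: component_logs <component> [since]" >&2
    return 1
  fi
  since="${2:-1h}"
  "${KUBECTL:-kubectl}" logs -n observability -l "app=$component" \
    --since="$since" --tail=200 --all-containers --prefix
}
```

For example, `component_logs alertmanager 15m > alertmanager.log` captures the last 15 minutes, up to 200 lines per container, with each line prefixed by its pod and container name.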
For Deployment Issues
- First: Check DEPLOYMENT_GUIDE.md troubleshooting section
- Second: Check VERIFICATION_GUIDE.md for pre-deployment checks
- Third: Re-read prerequisites and AWS setup steps
- Last: Escalate to DevOps lead
Document Maintenance
These documents are living documents and should be updated:
- After major deployments (add learnings)
- When procedures change (keep in sync with code)
- When new incidents occur (add new INC-XXX playbooks)
- Quarterly review (ensure accuracy and completeness)
Navigation Map
Getting Help
- “How do I deploy?” → OBSERVABILITY_DEPLOYMENT_RUNBOOK.md
- “What’s the command to…?” → OBSERVABILITY_QUICK_REFERENCE.md
- “How do I fix…?” → OPERATIONS_RUNBOOK.md or Incident Playbooks
- “Is it working?” → VERIFICATION_GUIDE.md
- “Why is…happening?” → Check troubleshooting in respective guide
Acknowledgments
These runbooks are based on:
- Prometheus best practices
- Alertmanager operational patterns
- ELK stack documentation
- Kubernetes Operator principles
- AWS Secrets Manager integration patterns
- Real-world operational experience from Block 13 implementation
Status: ✅ Complete and Ready for Production
All components deployed, verified, and documented. Ready for deployment to staging and production clusters.