Block 13 Observability - Complete Documentation Summary

Overview

This directory contains comprehensive documentation for the Sparki observability infrastructure deployed in Block 13. All runbooks and guides are production-ready and cover deployment, verification, operations, and troubleshooting.

Documentation Files

1. OBSERVABILITY_DEPLOYMENT_RUNBOOK.md (Root Directory)

Purpose: Complete end-to-end deployment guide for the entire observability infrastructure
Who Should Use:
  • DevOps engineers deploying to staging/production
  • Platform team members doing initial setup
  • Anyone deploying for the first time
What It Covers:
  • Prerequisites (AWS setup, IRSA, cluster requirements)
  • 5-phase deployment procedure (validation, ClusterSecretStore, staging, production, verification)
  • AWS Secrets Manager setup
  • Step-by-step deployment scripts for each component
  • Verification checklist
  • Rollback procedures
  • Troubleshooting common deployment issues
Time to Complete: 45-60 minutes (fresh cluster), 30 minutes (staged)
Quick Start:
# Phase 1: Validate prerequisites
bash -x ./Phase1-validate.sh

# Phase 2: Deploy ClusterSecretStore
kubectl apply -f infra/kubernetes-manifests/base/external-secrets/secretstore/cluster-secret-store.yaml

# Phase 3: Deploy Staging
kubectl apply -k infra/kubernetes-manifests/overlays/staging/

# Phase 4: Deploy Production
kubectl apply -k infra/kubernetes-manifests/overlays/prod/

# Phase 5: Verify
bash ./infra/kubernetes-manifests/base/observability/scripts/verify-deployment.sh
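
When the Quick Start commands are pasted one by one, nothing stops a later phase from running after an earlier one has failed. The following is a minimal sketch of a fail-fast wrapper for Phases 1-4, assuming bash and the paths listed above. `run_phase` and `deploy_all` are hypothetical helper names and are not part of the runbook.

```shell
# Run one deployment phase and report whether it succeeded.
# Usage: run_phase "<label>" <command> [args...]
run_phase() {
  local label="$1"
  shift
  echo "==> ${label}"
  if "$@"; then
    echo "OK: ${label}"
  else
    local rc=$?
    echo "FAILED: ${label} (exit ${rc})" >&2
    return "${rc}"
  fi
}

# Chain the phases with && so that a failure skips everything after it.
# Verification (Phase 5) is left as a separate, deliberate step.
deploy_all() {
  run_phase "Phase 1: validate prerequisites" bash ./Phase1-validate.sh &&
  run_phase "Phase 2: ClusterSecretStore" kubectl apply -f infra/kubernetes-manifests/base/external-secrets/secretstore/cluster-secret-store.yaml &&
  run_phase "Phase 3: staging" kubectl apply -k infra/kubernetes-manifests/overlays/staging/ &&
  run_phase "Phase 4: production" kubectl apply -k infra/kubernetes-manifests/overlays/prod/
}
```

Calling `deploy_all` from the repository root runs the phases in order and returns the exit code of the first phase that fails.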

2. OBSERVABILITY_QUICK_REFERENCE.md (Root Directory)

Purpose: Fast copy-paste reference for common operational tasks
Who Should Use:
  • On-call engineers troubleshooting issues
  • Anyone needing quick answers
  • Operators doing daily/weekly tasks
What It Covers:
  • Prometheus quick commands (connect, check targets, reload)
  • Alertmanager quick commands (view alerts, silence, test routing)
  • Grafana quick commands (connect, list dashboards, restart)
  • Elasticsearch quick commands (health, indices, logs)
  • ExternalSecrets troubleshooting
  • Kubernetes resource debugging
  • Health checks (quick & deep)
  • Emergency commands
Usage: Copy-paste commands directly into a terminal
Example:
# From QUICK_REFERENCE.md:
kubectl port-forward -n observability svc/prometheus-kube-prom-prometheus 9090:9090
# Then navigate to: http://localhost:9090
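
With the port-forward above running, target health can also be checked from the command line through the Prometheus HTTP API (`GET /api/v1/targets`). This is a rough sketch: `count_down_targets` is a hypothetical helper, and it matches text rather than parsing JSON, so it assumes the compact JSON that Prometheus normally returns. A JSON-aware tool such as `jq` would be more robust if it is installed.

```shell
# Count scrape targets that Prometheus reports as down.
# Reads the JSON body of GET /api/v1/targets on stdin and prints a number.
count_down_targets() {
  # grep exits non-zero when nothing matches; "|| true" keeps that from
  # being treated as an error when every target is healthy.
  { grep -o '"health":"down"' || true; } | wc -l | tr -d '[:space:]'
}

# Example (requires the port-forward to be running):
#   curl -s http://localhost:9090/api/v1/targets | count_down_targets
```

A result of `0` means no active target is currently marked down. Anything else is a cue to open the Targets page and look at the scrape errors.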

3. infra/kubernetes-manifests/base/observability/DEPLOYMENT_GUIDE.md

Purpose: Detailed step-by-step deployment guide for the observability stack only
Who Should Use:
  • DevOps engineers deploying only observability (not full stack)
  • Anyone needing detailed explanation of each step
  • Customizing observability deployment
What It Covers:
  • Architecture overview (Prometheus, Alertmanager, Grafana, kube-prometheus-stack)
  • Prerequisites (cluster, AWS, IRSA)
  • 10-step deployment procedure
  • Helm values configuration
  • Environment-specific overlays
  • Troubleshooting specific to observability (ExternalSecrets, ServiceMonitors, dashboards)
  • Rollback procedures
Time to Complete: 20-30 minutes
Key Sections:
  • Steps 1-5: Infrastructure setup
  • Steps 6-9: Component deployment
  • Step 10: Verification

4. infra/kubernetes-manifests/base/observability/VERIFICATION_GUIDE.md

Purpose: Comprehensive verification procedures for post-deployment validation
Who Should Use:
  • Verifying deployment succeeded
  • Running automated/manual health checks
  • Troubleshooting deployment issues
  • Setting up monitoring baseline
What It Covers:
  • Pre-deployment checks (AWS, IRSA, Kubernetes version)
  • Component health checks (ClusterSecretStore, ExternalSecrets, Prometheus, Alertmanager, Grafana)
  • Data flow verification (Prometheus scraping, Fluentd ingestion, metrics collection, dashboard provisioning)
  • Integration testing (alert routing, PagerDuty, Elasticsearch exporter)
  • Performance baseline (resource usage, latency)
  • Rollback validation
  • Automated verification script
Usage:
# Run all verification checks at once
bash ./infra/kubernetes-manifests/base/observability/scripts/verify-deployment.sh

# Or run individual checks from the guide
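
The contents of `verify-deployment.sh` are not reproduced here. As an illustration of how individual checks from the guide could be grouped, the sketch below runs every check, records failures, and reports a summary at the end instead of stopping at the first problem. `check`, `summary`, and `CHECKS_FAILED` are hypothetical names. The example checks assume only the `observability` namespace used elsewhere in this document.

```shell
# Hypothetical check runner: record each result and keep going, so that one
# failing check does not hide the others.
CHECKS_FAILED=0

# Usage: check "<label>" <command> [args...]
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  ${label}"
  else
    echo "FAIL  ${label}"
    CHECKS_FAILED=$((CHECKS_FAILED + 1))
  fi
}

# Print the failure count; succeeds only if every check passed.
summary() {
  echo "${CHECKS_FAILED} check(s) failed"
  [ "${CHECKS_FAILED}" -eq 0 ]
}

# Example:
#   check "observability namespace exists" kubectl get namespace observability
#   check "pods can be listed" kubectl get pods -n observability
#   summary || exit 1
```

Call `check` directly rather than inside `$(...)`, because a command substitution runs in a subshell and would lose the failure counter.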

5. infra/kubernetes-manifests/base/observability/OPERATIONS_RUNBOOK.md

Purpose: Day-2 operations guide for running and troubleshooting the observability stack
Who Should Use:
  • On-call engineers
  • Daily operational tasks
  • Incident response
  • Scaling and tuning
What It Covers:
  • Quick reference (commands, endpoints, namespaces)
  • Daily/weekly/monthly/quarterly maintenance checklists
  • Incident response playbooks:
    • INC-001: Alertmanager Down (symptoms, diagnosis, resolution)
    • INC-002: Prometheus Not Scraping Targets
    • INC-003: Grafana Dashboards Missing
    • INC-004: High Memory Usage in Prometheus
  • Prometheus operations (query, reload config, view rules, scale)
  • Alertmanager operations (view alerts, silence, test routing)
  • Grafana operations (add dashboards, export, password reset)
  • Performance tuning procedures
  • Maintenance task checklists
Usage for Incident:
# Find INC-XXX matching your symptoms
# Follow diagnosis steps to identify root cause
# Follow resolution steps to fix the issue

6. infra/kubernetes-manifests/base/logging/OPERATIONS_RUNBOOK.md

Purpose: Operations guide for the EFK logging stack (Elasticsearch, Fluentd, Kibana)
Who Should Use:
  • Troubleshooting logging issues
  • Elasticsearch operations
  • Capacity planning for logging
  • Disaster recovery
What It Covers:
  • Quick reference (commands, endpoints, index families)
  • Health checks (cluster health, node status, Fluentd, Kibana)
  • Incident response:
    • INC-001: Elasticsearch Cluster RED
    • INC-002: Log Ingestion Stopped
    • INC-003: Kibana Unavailable
  • Elasticsearch operations (scale, rotate indices, reindex, delete)
  • Fluentd troubleshooting (buffer, parser, memory)
  • Kibana operations (reset password, create patterns, export/import)
  • ILM management (view status, retry, move phases)
  • Snapshot & recovery procedures
  • Performance tuning
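
The cluster-health check behind INC-001 can be scripted against Elasticsearch's `GET /_cluster/health` endpoint, which returns a `status` of `green`, `yellow`, or `red`. This is a small sketch only. `es_status` and `es_status_ok` are hypothetical helpers, and treating `yellow` as passing is an assumption to adjust to your own alerting policy. Authentication and the service address are omitted because they depend on the deployment.

```shell
# Extract the "status" field (green, yellow or red) from the JSON body of
# GET /_cluster/health, read on stdin.
es_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Exit-code wrapper for scripts: green or yellow passes (an assumption);
# red or an unreadable response fails.
es_status_ok() {
  case "$(es_status)" in
    green|yellow) return 0 ;;
    *) return 1 ;;
  esac
}

# Example, assuming Elasticsearch is reachable on localhost:9200:
#   curl -s http://localhost:9200/_cluster/health | es_status
```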

How to Use These Documents

Scenario 1: Initial Deployment

  1. Read: OBSERVABILITY_DEPLOYMENT_RUNBOOK.md (overview)
  2. Prepare: AWS Secrets, IRSA role, prerequisites
  3. Execute: Follow Phase 1-5 deployment steps
  4. Verify: Run VERIFICATION_GUIDE.md automated script
  5. Document: Capture sign-off in deployment runbook
Time: 1-2 hours including all phases

Scenario 2: Day-2 Operations

  1. Daily task? → Check OBSERVABILITY_QUICK_REFERENCE.md
  2. Recurring maintenance? → Check OPERATIONS_RUNBOOK.md maintenance checklists
  3. Alert issues? → Check OPERATIONS_RUNBOOK.md incident playbooks
  4. Logging issues? → Check logging/OPERATIONS_RUNBOOK.md

Scenario 3: Troubleshooting an Incident

  1. Identify symptoms (Alertmanager down? No metrics? Dashboards blank?)
  2. Find matching INC-XXX in OPERATIONS_RUNBOOK.md
  3. Follow diagnosis steps
  4. Follow resolution steps
  5. Run verification from VERIFICATION_GUIDE.md to confirm fix
  6. Document incident for future reference

Scenario 4: Customizing Deployment

  1. Read: DEPLOYMENT_GUIDE.md (architecture overview & design decisions)
  2. Modify: Kustomization overlays in infra/kubernetes-manifests/overlays/
  3. Test: Run VERIFICATION_GUIDE.md to ensure changes work
  4. Document: Update runbooks if significant changes made

Quick Reference Table

| Task | Document | Section |
| --- | --- | --- |
| Initial deployment | OBSERVABILITY_DEPLOYMENT_RUNBOOK.md | Phases 1-5 |
| Verify deployment | VERIFICATION_GUIDE.md | All sections |
| Daily checks | OPERATIONS_RUNBOOK.md | Daily checklist |
| Alert not routing | OPERATIONS_RUNBOOK.md | INC-001 |
| No metrics data | OPERATIONS_RUNBOOK.md | INC-002 |
| Dashboards missing | OPERATIONS_RUNBOOK.md | INC-003 |
| Restart component | QUICK_REFERENCE.md | Kubernetes Resources |
| Query metrics | QUICK_REFERENCE.md | Prometheus |
| Silence alert | QUICK_REFERENCE.md | Alertmanager |
| Check logs | QUICK_REFERENCE.md | Elasticsearch & Logging |
| Scale Prometheus | OPERATIONS_RUNBOOK.md | Prometheus Operations |
| Test PagerDuty | OPERATIONS_RUNBOOK.md | Alertmanager Operations |

Key Architectural Decisions

These are documented in the deployment guides. Key points:
  1. Least-Privilege Access: Fluentd and Exporter each have minimal required permissions
  2. GitOps Everything: All dashboards as ConfigMaps, not manual Grafana imports
  3. ExternalSecrets: Credentials are sourced from AWS Secrets Manager and synced into the cluster, not hand-written or committed as K8s Secrets
  4. Label-Based Discovery: Prometheus discovers resources such as ServiceMonitors via the release: prometheus label
  5. Environment Isolation: Separate overlays for staging and production
  6. No Manual Configuration: All config as YAML (Git source of truth)

Important Configuration Files

| File | Purpose | Location |
| --- | --- | --- |
| ClusterSecretStore | AWS Secrets Manager access | base/external-secrets/secretstore/cluster-secret-store.yaml |
| AlertmanagerConfig | PagerDuty routing rules | base/observability/alertmanagerconfig-pagerduty.yaml |
| Prometheus values | Helm chart configuration | terraform-infrastructure/modules/observability/values/prometheus-values.yaml |
| Fluentd bootstrap | Create ES user/role | base/logging/elasticsearch/fluentd-bootstrap-job.yaml |
| Exporter bootstrap | Create ES read-only user | base/logging/elasticsearch/exporter-bootstrap-job.yaml |
| Istio dashboards | Grafana dashboard ConfigMaps | base/istio/observability/grafana-dashboards/ |

Maintenance Calendar

Daily (5 min)

  • Run health check script from QUICK_REFERENCE.md
  • Glance at Grafana dashboards

Weekly (30 min)

  • Check Prometheus resource usage
  • Review active alerts in Alertmanager
  • Test alert routing with test alert
  • Verify ExternalSecrets are syncing
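
For the weekly routing test, a synthetic alert can be posted to Alertmanager's `POST /api/v2/alerts` endpoint, which accepts a JSON array of alerts with a `labels` object. This is a minimal sketch: `build_test_alert` is a hypothetical helper, the alert name and severity are placeholders, and the values are inserted without JSON escaping, so keep them to plain words. Use labels that your routing rules actually match, and remember that a matching route may page a real person.

```shell
# Build the JSON body for a synthetic alert.
# Usage: build_test_alert <alertname> <severity>
build_test_alert() {
  local name="$1" severity="$2"
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Routing test, safe to ignore"}}]\n' \
    "${name}" "${severity}"
}

# Example, assuming Alertmanager has been port-forwarded to localhost:9093
# (the service name depends on the Helm release and is not shown here):
#   build_test_alert RoutingTest warning |
#     curl -s -XPOST -H 'Content-Type: application/json' \
#          --data-binary @- http://localhost:9093/api/v2/alerts
```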

Monthly (1-2 hours)

  • Capacity planning review
  • High-cardinality metric analysis
  • Alert rule optimization
  • Update thresholds based on trends
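
One common starting point for the high-cardinality analysis is a PromQL query that counts series per metric name and keeps the largest. The sketch below only builds that query string, and `top_series_query` is a hypothetical helper. The query touches every series, so it can be slow on a large Prometheus and is better run off-peak.

```shell
# Print a PromQL query listing the N metric names with the most series.
# Usage: top_series_query [N]   (N defaults to 10)
top_series_query() {
  local n="${1:-10}"
  case "${n}" in
    ''|*[!0-9]*)
      echo "top_series_query: N must be a whole number" >&2
      return 1
      ;;
  esac
  printf 'topk(%s, count by (__name__) ({__name__=~".+"}))\n' "${n}"
}

# Example, with Prometheus port-forwarded to localhost:9090:
#   curl -s -G --data-urlencode "query=$(top_series_query 10)" \
#        http://localhost:9090/api/v1/query
```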

Quarterly (4 hours)

  • Comprehensive audit of alerting rules
  • PagerDuty routing policy review
  • Disaster recovery test (export/import dashboards)
  • Performance optimization review
  • Plan capacity expansion

Support & Escalation

For Documentation Issues

  • Unclear instructions? → Update relevant .md file
  • Missing information? → Add to appropriate section
  • Broken commands? → Test and fix

For Operational Issues

  1. First: Check QUICK_REFERENCE.md for quick fix
  2. Second: Find INC-XXX matching symptoms in OPERATIONS_RUNBOOK.md
  3. Third: Check infrastructure logs: kubectl logs -f -n observability -l app=<component>
  4. Last: Escalate to Platform SME

For Deployment Issues

  1. First: Check DEPLOYMENT_GUIDE.md troubleshooting section
  2. Second: Check VERIFICATION_GUIDE.md for pre-deployment checks
  3. Third: Re-read prerequisites and AWS setup steps
  4. Last: Escalate to DevOps lead

Document Maintenance

These documents are living documents and should be updated:
  • After major deployments (add learnings)
  • When procedures change (keep in sync with code)
  • When new incidents occur (add new INC-XXX playbooks)
  • Quarterly review (ensure accuracy and completeness)
Last Updated: 2026-02-28 (Block 13 Completion)
Next Review: 2026-05-28 (3 months)

OBSERVABILITY_DEPLOYMENT_RUNBOOK.md (START HERE for deployment)
├── Phase 1: Validate prerequisites
├── Phase 2: ClusterSecretStore
├── Phase 3: Staging deployment
├── Phase 4: Production deployment
└── Phase 5: Verification

OBSERVABILITY_QUICK_REFERENCE.md (START HERE for operations)
├── Prometheus commands
├── Alertmanager commands
├── Grafana commands
├── Elasticsearch commands
└── Emergency commands

infra/kubernetes-manifests/base/observability/
├── DEPLOYMENT_GUIDE.md (detailed deployment)
├── VERIFICATION_GUIDE.md (comprehensive verification)
├── OPERATIONS_RUNBOOK.md (day-2 operations & incidents)
└── [manifests and dashboards]

infra/kubernetes-manifests/base/logging/
├── OPERATIONS_RUNBOOK.md (EFK logging stack operations)
└── [manifests]

Getting Help

  1. “How do I deploy?” → OBSERVABILITY_DEPLOYMENT_RUNBOOK.md
  2. “What’s the command to…?” → OBSERVABILITY_QUICK_REFERENCE.md
  3. “How do I fix…?” → OPERATIONS_RUNBOOK.md or Incident Playbooks
  4. “Is it working?” → VERIFICATION_GUIDE.md
  5. “Why is…happening?” → Check troubleshooting in respective guide

Acknowledgments

These runbooks are based on:
  • Prometheus best practices
  • Alertmanager operational patterns
  • ELK stack documentation
  • Kubernetes Operator principles
  • AWS Secrets Manager integration patterns
  • Real-world operational experience from Block 13 implementation

Status: ✅ Complete and Ready for Production
All components deployed, verified, and documented. Ready for deployment to staging and production clusters.