Block 13 Observability - Complete Documentation Summary

Overview

This directory contains comprehensive documentation for the Sparki observability infrastructure deployed in Block 13. All runbooks and guides are production-ready and cover deployment, verification, operations, and troubleshooting.

Documentation Files

1. OBSERVABILITY_DEPLOYMENT_RUNBOOK.md (Root Directory)

Purpose: Complete end-to-end deployment guide for the entire observability infrastructure

Who Should Use:

DevOps engineers deploying to staging/production
Platform team members doing initial setup
Anyone deploying for the first time

What It Covers:

Prerequisites (AWS setup, IRSA, cluster requirements)
5-phase deployment procedure (validation, ClusterSecretStore, staging, production, verification)
AWS Secrets Manager setup
Step-by-step deployment scripts for each component
Verification checklist
Rollback procedures
Troubleshooting common deployment issues

Time to Complete: 45-60 minutes (fresh cluster), 30 minutes (staged)

Quick Start:

# Phase 1: Validate prerequisites
bash -x ./Phase1-validate.sh

# Phase 2: Deploy ClusterSecretStore
kubectl apply -f infra/kubernetes-manifests/base/external-secrets/secretstore/cluster-secret-store.yaml

# Phase 3: Deploy Staging
kubectl apply -k infra/kubernetes-manifests/overlays/staging/

# Phase 4: Deploy Production
kubectl apply -k infra/kubernetes-manifests/overlays/prod/

# Phase 5: Verify
bash ./infra/kubernetes-manifests/base/observability/scripts/verify-deployment.sh
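
Pasted one at a time, the Quick Start commands do not stop a later phase from running after an earlier one fails. The following is a minimal sketch of a wrapper that halts at the first failure. `run_phase`, `deploy_all`, and `DRY_RUN` are hypothetical names that do not come from the runbook, and the Phase 5 command assumes the verify-deployment.sh script referenced by the verification guide.

```bash
#!/usr/bin/env bash
# Sketch only: run the five phases in order and stop at the first failure.
# run_phase, deploy_all and DRY_RUN are illustrative names (assumptions).

run_phase() {
  local name="$1"; shift
  echo "==> ${name}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "    (dry run) $*"
    return 0
  fi
  if ! "$@"; then
    echo "!!  ${name} failed; stopping before later phases" >&2
    return 1
  fi
}

deploy_all() {
  run_phase "Phase 1: validate prerequisites" bash ./Phase1-validate.sh &&
  run_phase "Phase 2: ClusterSecretStore" kubectl apply -f infra/kubernetes-manifests/base/external-secrets/secretstore/cluster-secret-store.yaml &&
  run_phase "Phase 3: staging" kubectl apply -k infra/kubernetes-manifests/overlays/staging/ &&
  run_phase "Phase 4: production" kubectl apply -k infra/kubernetes-manifests/overlays/prod/ &&
  run_phase "Phase 5: verify" bash ./infra/kubernetes-manifests/base/observability/scripts/verify-deployment.sh
}
```

Running `DRY_RUN=1 deploy_all` prints each phase and its command without touching the cluster, which may be useful for reviewing the sequence before a real run.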

2. OBSERVABILITY_QUICK_REFERENCE.md (Root Directory)

Purpose: Fast copy-paste reference for common operational tasks

Who Should Use:

On-call engineers troubleshooting issues
Anyone needing quick answers
Operators doing daily/weekly tasks

What It Covers:

Prometheus quick commands (connect, check targets, reload)
Alertmanager quick commands (view alerts, silence, test routing)
Grafana quick commands (connect, list dashboards, restart)
Elasticsearch quick commands (health, indices, logs)
ExternalSecrets troubleshooting
Kubernetes resource debugging
Health checks (quick & deep)
Emergency commands

Usage: Copy-paste commands directly into terminal

Example:

# From QUICK_REFERENCE.md:
kubectl port-forward -n observability svc/prometheus-kube-prom-prometheus 9090:9090
# Then navigate to: http://localhost:9090
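
Once the port-forward is up, target health can also be checked from the terminal through the Prometheus HTTP API (`/api/v1/targets`). The sketch below counts targets per health state using only `grep`, so it does not depend on `jq`. `count_targets` is a hypothetical helper name, and the localhost URL assumes the port-forward shown above.

```bash
# Sketch only: count scrape targets in a given health state.
# Reads the JSON body of /api/v1/targets on stdin.
# count_targets is an illustrative name (assumption).
count_targets() {
  local state="$1"
  # grep exits non-zero when nothing matches; "|| true" keeps the count at 0.
  { grep -oE "\"health\"[[:space:]]*:[[:space:]]*\"${state}\"" || true; } | wc -l | tr -d ' '
}

# Usage, with the port-forward from the example above running:
# curl -s http://localhost:9090/api/v1/targets | count_targets down
```

A non-zero count for `down` is a reasonable cue to open the Prometheus Not Scraping Targets playbook in the operations runbook.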

3. infra/kubernetes-manifests/base/observability/DEPLOYMENT_GUIDE.md

Purpose: Detailed step-by-step deployment guide for the observability stack only

Who Should Use:

DevOps engineers deploying only observability (not full stack)
Anyone needing detailed explanation of each step
Customizing observability deployment

What It Covers:

Architecture overview (Prometheus, Alertmanager, Grafana, kube-prometheus-stack)
Prerequisites (cluster, AWS, IRSA)
10-step deployment procedure
Helm values configuration
Environment-specific overlays
Troubleshooting specific to observability (ExternalSecrets, ServiceMonitors, dashboards)
Rollback procedures

Time to Complete: 20-30 minutes

Key Sections:

Step 1-5: Infrastructure setup
Step 6-9: Component deployment
Step 10: Verification

4. infra/kubernetes-manifests/base/observability/VERIFICATION_GUIDE.md

Purpose: Comprehensive verification procedures for post-deployment validation

Who Should Use:

Verifying deployment succeeded
Running automated/manual health checks
Troubleshooting deployment issues
Setting up monitoring baseline

What It Covers:

Pre-deployment checks (AWS, IRSA, Kubernetes version)
Component health checks (ClusterSecretStore, ExternalSecrets, Prometheus, Alertmanager, Grafana)
Data flow verification (Prometheus scraping, Fluentd ingestion, metrics collection, dashboard provisioning)
Integration testing (alert routing, PagerDuty, Elasticsearch exporter)
Performance baseline (resource usage, latency)
Rollback validation
Automated verification script

Usage:

# Run all verification checks at once
bash ./infra/kubernetes-manifests/base/observability/scripts/verify-deployment.sh

# Or run individual checks from the guide
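
An individual check can be quite small. As an illustration, the sketch below lists ExternalSecrets that are not reporting ready. It assumes that the last column of `kubectl get externalsecrets` is READY, which matches the External Secrets Operator printer columns as far as known but should be confirmed against the installed version. `not_ready_externalsecrets` is a hypothetical helper name.

```bash
# Sketch only: print "namespace/name" for each ExternalSecret whose READY
# column is not "True". Reads `kubectl get externalsecrets -A --no-headers`
# output on stdin. Assumes READY is the last column (assumption).
not_ready_externalsecrets() {
  awk '$NF != "True" { print $1 "/" $2 }'
}

# Usage:
# kubectl get externalsecrets -A --no-headers | not_ready_externalsecrets
```

Empty output means every ExternalSecret reported ready at the time of the check.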

5. infra/kubernetes-manifests/base/observability/OPERATIONS_RUNBOOK.md

Purpose: Day-2 operations guide for running and troubleshooting the observability stack

Who Should Use:

On-call engineers
Daily operational tasks
Incident response
Scaling and tuning

What It Covers:

Quick reference (commands, endpoints, namespaces)
Daily/weekly/monthly/quarterly maintenance checklists
Incident response playbooks:
- INC-001: Alertmanager Down (symptoms, diagnosis, resolution)
- INC-002: Prometheus Not Scraping Targets
- INC-003: Grafana Dashboards Missing
- INC-004: High Memory Usage in Prometheus
Prometheus operations (query, reload config, view rules, scale)
Alertmanager operations (view alerts, silence, test routing)
Grafana operations (add dashboards, export, password reset)
Performance tuning procedures
Maintenance task checklists

Usage for Incident:

# Find INC-XXX matching your symptoms
# Follow diagnosis steps to identify root cause
# Follow resolution steps to fix the issue

6. infra/kubernetes-manifests/base/logging/OPERATIONS_RUNBOOK.md

Purpose: Operations guide for the ELK stack (Elasticsearch, Fluentd, Kibana)

Who Should Use:

Troubleshooting logging issues
Elasticsearch operations
Capacity planning for logging
Disaster recovery

What It Covers:

Quick reference (commands, endpoints, index families)
Health checks (cluster health, node status, Fluentd, Kibana)
Incident response:
- INC-001: Elasticsearch Cluster RED
- INC-002: Log Ingestion Stopped
- INC-003: Kibana Unavailable
Elasticsearch operations (scale, rotate indices, reindex, delete)
Fluentd troubleshooting (buffer, parser, memory)
Kibana operations (reset password, create patterns, export/import)
ILM management (view status, retry, move phases)
Snapshot & recovery procedures
Performance tuning

How to Use These Documents

Scenario 1: Initial Deployment

Read: OBSERVABILITY_DEPLOYMENT_RUNBOOK.md (overview)
Prepare: AWS Secrets, IRSA role, prerequisites
Execute: Follow Phase 1-5 deployment steps
Verify: Run VERIFICATION_GUIDE.md automated script
Document: Capture sign-off in deployment runbook

Time: 1-2 hours including all phases

Scenario 2: Day-2 Operations

Daily task? → Check OBSERVABILITY_QUICK_REFERENCE.md
Recurring maintenance? → Check OPERATIONS_RUNBOOK.md maintenance checklists
Alert issues? → Check OPERATIONS_RUNBOOK.md incident playbooks
Logging issues? → Check logging/OPERATIONS_RUNBOOK.md

Scenario 3: Troubleshooting an Incident

Identify symptoms (Alertmanager down? No metrics? Dashboards blank?)
Find matching INC-XXX in OPERATIONS_RUNBOOK.md
Follow diagnosis steps
Follow resolution steps
Run verification from VERIFICATION_GUIDE.md to confirm fix
Document incident for future reference

Scenario 4: Customizing Deployment

Read: DEPLOYMENT_GUIDE.md (architecture overview & architectural decisions)
Modify: Kustomization overlays in infra/kubernetes-manifests/overlays/
Test: Run VERIFICATION_GUIDE.md to ensure changes work
Document: Update runbooks if significant changes made

Quick Reference Table

Task	Document	Section
Initial deployment	OBSERVABILITY_DEPLOYMENT_RUNBOOK.md	Phases 1-5
Verify deployment	VERIFICATION_GUIDE.md	All sections
Daily checks	OPERATIONS_RUNBOOK.md	Daily checklist
Alert not routing	OPERATIONS_RUNBOOK.md	INC-001
No metrics data	OPERATIONS_RUNBOOK.md	INC-002
Dashboards missing	OPERATIONS_RUNBOOK.md	INC-003
Restart component	QUICK_REFERENCE.md	Kubernetes Resources
Query metrics	QUICK_REFERENCE.md	Prometheus
Silence alert	QUICK_REFERENCE.md	Alertmanager
Check logs	QUICK_REFERENCE.md	Elasticsearch & Logging
Scale Prometheus	OPERATIONS_RUNBOOK.md	Prometheus Operations
Test PagerDuty	OPERATIONS_RUNBOOK.md	Alertmanager Operations

Key Architectural Decisions

These are documented in the deployment guides. Key points:

Least-Privilege Access: Fluentd and Exporter each have minimal required permissions
GitOps Everything: All dashboards as ConfigMaps, not manual Grafana imports
ExternalSecrets: Credentials sourced from AWS Secrets Manager and synced into the cluster, not hand-created or committed as K8s Secrets
Label-Based Discovery: Prometheus discovers ServiceMonitors via the release: prometheus label
Environment Isolation: Separate overlays for staging and production
No Manual Configuration: All config as YAML (Git source of truth)
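
To make the label-based discovery decision concrete, the sketch below writes an example ServiceMonitor that carries the `release: prometheus` label. The name `example-app`, the `metrics` port name, the 30s interval, and the output file name are all hypothetical. The example also assumes the target Service lives in the same namespace as the ServiceMonitor.

```bash
# Sketch only: write an example ServiceMonitor to a file.
# example-app, the "metrics" port and the default file name are assumptions.
write_example_servicemonitor() {
  cat > "${1:-servicemonitor-example.yaml}" <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: observability
  labels:
    release: prometheus   # the label the Prometheus selector matches on
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
      interval: 30s
EOF
}

# List ServiceMonitors that lack the label and so would not be discovered:
# kubectl get servicemonitors -A -l 'release!=prometheus'
```

A ServiceMonitor that appears in the second command's output is a likely cause when an expected target never shows up in Prometheus.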

Important Configuration Files

File	Purpose	Location
ClusterSecretStore	AWS Secrets Manager access	`base/external-secrets/secretstore/cluster-secret-store.yaml`
AlertmanagerConfig	PagerDuty routing rules	`base/observability/alertmanagerconfig-pagerduty.yaml`
Prometheus values	Helm chart configuration	`terraform-infrastructure/modules/observability/values/prometheus-values.yaml`
Fluentd bootstrap	Create ES user/role	`base/logging/elasticsearch/fluentd-bootstrap-job.yaml`
Exporter bootstrap	Create ES read-only user	`base/logging/elasticsearch/exporter-bootstrap-job.yaml`
Istio dashboards	Grafana dashboard ConfigMaps	`base/istio/observability/grafana-dashboards/`

Maintenance Calendar

Daily (5 min)

Run health check script from QUICK_REFERENCE.md
Glance at Grafana dashboards

Weekly (30 min)

Check Prometheus resource usage
Review active alerts in Alertmanager
Test alert routing with test alert
Verify ExternalSecrets are syncing
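
For the weekly routing test, one option is to post a synthetic alert to Alertmanager's v2 API (`POST /api/v2/alerts`). The sketch below only builds the JSON payload. `build_test_alert`, the alert name `RunbookRoutingTest`, and the label values are hypothetical, and the localhost:9093 address assumes a port-forward to Alertmanager. Choose labels that match a low-urgency route so the test does not page anyone unintentionally.

```bash
# Sketch only: build a JSON payload for Alertmanager's POST /api/v2/alerts.
# The function name, alert name and label values are assumptions.
build_test_alert() {
  local name="${1:-RunbookRoutingTest}" severity="${2:-info}"
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Weekly routing test"}}]\n' \
    "$name" "$severity"
}

# Usage, assuming Alertmanager is reachable on localhost:9093:
# build_test_alert | curl -s -X POST -H 'Content-Type: application/json' \
#   --data @- http://localhost:9093/api/v2/alerts
```

The alert carries no end time, so Alertmanager resolves it on its own after its configured resolve timeout.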

Monthly (1-2 hours)

Capacity planning review
High-cardinality metric analysis
Alert rule optimization
Update thresholds based on trends

Quarterly (4 hours)

Comprehensive audit of alerting rules
PagerDuty routing policy review
Disaster recovery test (export/import dashboards)
Performance optimization review
Plan capacity expansion

Support & Escalation

For Documentation Issues

Unclear instructions? → Update relevant .md file
Missing information? → Add to appropriate section
Broken commands? → Test and fix

For Operational Issues

First: Check QUICK_REFERENCE.md for quick fix
Second: Find INC-XXX matching symptoms in OPERATIONS_RUNBOOK.md
Third: Check infrastructure logs: kubectl logs -f -n observability -l app=<component>
Last: Escalate to Platform SME

For Deployment Issues

First: Check DEPLOYMENT_GUIDE.md troubleshooting section
Second: Check VERIFICATION_GUIDE.md for pre-deployment checks
Third: Re-read prerequisites and AWS setup steps
Last: Escalate to DevOps lead

Document Maintenance

These documents are living documents and should be updated:

After major deployments (add learnings)
When procedures change (keep in sync with code)
When new incidents occur (add new INC-XXX playbooks)
Quarterly review (ensure accuracy and completeness)

Last Updated: 2026-02-28 (Block 13 Completion)
Next Review: 2026-05-28 (3 months)

Navigation Map

OBSERVABILITY_DEPLOYMENT_RUNBOOK.md (START HERE for deployment)
├── Phase 1: Validate prerequisites
├── Phase 2: ClusterSecretStore
├── Phase 3: Staging deployment
├── Phase 4: Production deployment
└── Phase 5: Verification

OBSERVABILITY_QUICK_REFERENCE.md (START HERE for operations)
├── Prometheus commands
├── Alertmanager commands
├── Grafana commands
├── Elasticsearch commands
└── Emergency commands

infra/kubernetes-manifests/base/observability/
├── DEPLOYMENT_GUIDE.md (detailed deployment)
├── VERIFICATION_GUIDE.md (comprehensive verification)
├── OPERATIONS_RUNBOOK.md (day-2 operations & incidents)
└── [manifests and dashboards]

infra/kubernetes-manifests/base/logging/
├── OPERATIONS_RUNBOOK.md (ELK stack operations)
└── [manifests]

Getting Help

“How do I deploy?” → OBSERVABILITY_DEPLOYMENT_RUNBOOK.md
“What’s the command to…?” → OBSERVABILITY_QUICK_REFERENCE.md
“How do I fix…?” → OPERATIONS_RUNBOOK.md or Incident Playbooks
“Is it working?” → VERIFICATION_GUIDE.md
“Why is…happening?” → Check troubleshooting in respective guide

Acknowledgments

These runbooks are based on:

Prometheus best practices
Alertmanager operational patterns
ELK stack documentation
Kubernetes Operator principles
AWS Secrets Manager integration patterns
Real-world operational experience from Block 13 implementation

Status: ✅ Complete and Ready for Production

All components deployed, verified, and documented. Ready for deployment to staging and production clusters.
