
Operational Runbooks - v01t.io Production Environment

Table of Contents

  1. System Architecture Overview
  2. Deployment Procedures
  3. Monitoring & Alerting
  4. Incident Response
  5. Disaster Recovery
  6. Performance Optimization
  7. Security Operations
  8. Data Management

System Architecture Overview

Production Environment Architecture

Infrastructure:
    Cloud Provider: AWS (Primary), Azure (DR)
    Orchestration: Kubernetes (EKS)
    Service Mesh: Istio
    Load Balancer: AWS ALB + CloudFlare

Core Services:
    - ecosystem-orchestrator (Node.js)
    - workflow-engine (Python/Celery)
    - integration-hub (Java/Spring)
    - content-scheduler (Python/Django)
    - analytics-engine (Python/Spark)
    - gamification-service (Go)

Data Layer:
    Primary: PostgreSQL cluster (RDS)
    Analytics: ClickHouse cluster
    Cache: Redis Cluster
    Message Queue: Apache Kafka
    File Storage: S3 + CloudFront CDN

Service Dependencies Map

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Gateway   │    │  Load Balancer  │    │   CloudFlare    │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Ecosystem Orch. │◄──►│ Workflow Engine │◄──►│Integration Hub  │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│Content Scheduler│◄──►│Analytics Engine │◄──►│Gamification Svc │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          ▼                      ▼                      ▼
┌─────────────────────────────────────────────────────────────┐
│                    Data Layer                               │
│  PostgreSQL  │  ClickHouse  │  Redis  │  Kafka  │    S3    │
└─────────────────────────────────────────────────────────────┘

Deployment Procedures

Standard Deployment Process

Pre-Deployment Checklist

  • All tests passing in staging environment
  • Security scan completed (no critical vulnerabilities)
  • Performance tests validated
  • Database migrations tested
  • Rollback plan prepared
  • Stakeholder notification sent
  • Deployment window approved

Blue-Green Deployment Steps

  1. Preparation Phase
# Verify current environment status
kubectl get pods -n production
kubectl get svc -n production

# Check application health
curl -f https://api.v01t.io/health
curl -f https://api.v01t.io/ready

# Backup current configuration
kubectl get configmap production-config -o yaml > backup-config-$(date +%Y%m%d-%H%M).yaml
  2. Green Environment Setup
# Deploy to green environment
helm upgrade v01t-green ./helm-charts/v01t \
  --namespace production-green \
  --set image.tag=$NEW_VERSION \
  --set environment=production-green

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod -l app=v01t -n production-green --timeout=600s

# Run smoke tests
./scripts/smoke-tests.sh production-green
  3. Traffic Switch
# Gradually shift traffic (10% increments). Kubernetes Services have no
# selector "weight"; with Istio in the mesh, weighted routing is set on the
# VirtualService (resource and subset names below are illustrative):
kubectl patch virtualservice v01t-service -n production --type merge -p \
  '{"spec":{"http":[{"route":[{"destination":{"host":"v01t-service","subset":"blue"},"weight":90},{"destination":{"host":"v01t-service","subset":"green"},"weight":10}]}]}}'

# Monitor for 5 minutes, check metrics
./scripts/check-metrics.sh

# Continue traffic shift if healthy (e.g. 50/50)
kubectl patch virtualservice v01t-service -n production --type merge -p \
  '{"spec":{"http":[{"route":[{"destination":{"host":"v01t-service","subset":"blue"},"weight":50},{"destination":{"host":"v01t-service","subset":"green"},"weight":50}]}]}}'
# ... continue until 100%
  4. Cleanup
# Once green is stable, cleanup blue
kubectl delete namespace production-blue
kubectl label namespace production-green environment=production-blue
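The traffic-shift steps above can be sketched as a small control loop. `set_green_weight` and `check_metrics` are hypothetical stand-ins for the `kubectl patch` and `./scripts/check-metrics.sh` calls; the increment sequence is illustrative.

```python
# Sketch of the gradual blue-green traffic shift: advance the green weight in
# steps, verify metrics after each step, and roll back to blue on failure.
def shift_traffic(set_green_weight, check_metrics, steps=(10, 25, 50, 75, 100)):
    """Return (weights successfully applied, True if green reached 100%)."""
    applied = []
    for weight in steps:
        set_green_weight(weight)   # e.g. patch the Istio VirtualService
        if not check_metrics():    # e.g. error rate / latency within SLO
            set_green_weight(0)    # roll everything back to blue
            return applied, False
        applied.append(weight)
    return applied, True
```

With healthy metrics the loop walks all increments; on the first bad metrics check it resets green to 0% and stops, matching the "monitor, then continue" cadence in the steps above.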

Emergency Hotfix Procedure

# For critical production issues requiring immediate fix
git checkout main
git cherry-pick $HOTFIX_COMMIT
git tag hotfix-v$(date +%Y%m%d-%H%M)

# Fast-track deployment (skips some checks)
./scripts/emergency-deploy.sh hotfix-v$(date +%Y%m%d-%H%M)

Monitoring & Alerting

Key Performance Indicators (KPIs)

System Health Metrics

API Response Time:
  Target: <500ms (95th percentile)
  Critical: >2000ms
  Alert: engineering-critical@v01t.io

Database Performance:
  Target: <100ms query time
  Critical: >1000ms
  Alert: database-team@v01t.io

Service Availability:
  Target: 99.9% uptime
  Critical: <99% in 24h window
  Alert: oncall-team@v01t.io

Memory Usage:
  Warning: >80% on any pod
  Critical: >95% on any pod
  Alert: infrastructure@v01t.io
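The 99.9% availability target above translates directly into a downtime budget. A minimal sketch of that arithmetic (roughly 43.2 minutes per 30-day window at 99.9%):

```python
# Downtime budget implied by an availability target over a rolling window.
def downtime_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    minutes_in_window = window_days * 24 * 60
    return minutes_in_window * (1 - target_pct / 100)
```

This is the number an on-call engineer compares against when deciding whether an incident has already burned the monthly budget.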

Business Metrics

User Activation Rate:
  Target: >70% weekly
  Warning: <60%
  Alert: product-team@v01t.io

Revenue Recognition:
  Target: Daily MRR tracking
  Critical: >5% deviation from forecast
  Alert: finance@v01t.io

Error Rate:
  Target: <0.1% of requests
  Critical: >1% of requests
  Alert: engineering@v01t.io
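The error-rate thresholds above can be applied as a simple classifier. The runbook only defines the target (<0.1%) and critical (>1%) levels; treating the band in between as a warning is an assumption made here for illustration:

```python
# Classify a request error rate against the thresholds above:
# below 0.1% is within target, above 1% is critical; the intermediate
# "warning" band is an assumption, not defined by the runbook.
def classify_error_rate(errors: int, total: int) -> str:
    rate = errors / total
    if rate > 0.01:
        return "critical"
    if rate >= 0.001:
        return "warning"
    return "ok"
```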

Alerting Rules (Prometheus)

# High Error Rate Alert
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01
  for: 2m
  labels:
      severity: critical
  annotations:
      summary: "High error rate detected"
      description: "5xx responses are {{ $value | humanizePercentage }} of requests"

# Database Connection Pool Exhaustion
- alert: DatabaseConnectionExhaustion
  expr: database_connections_active / database_connections_max > 0.9
  for: 1m
  labels:
      severity: warning
  annotations:
      summary: "Database connection pool nearly exhausted"

# Kafka Consumer Lag
- alert: KafkaConsumerLag
  expr: kafka_consumer_lag_sum > 10000
  for: 5m
  labels:
      severity: critical
  annotations:
      summary: "Kafka consumer lag too high"

Dashboard Configuration

Executive Dashboard (Grafana)

  • Business KPIs: Revenue, Users, Activation Rate
  • System Health: Uptime, Error Rate, Response Time
  • Cost Metrics: Infrastructure spend, Cost per user

Engineering Dashboard

  • Service-level metrics for each microservice
  • Database performance and query analysis
  • Infrastructure utilization and scaling metrics

Operations Dashboard

  • Alert status and incident timeline
  • Deployment history and success rates
  • Security events and compliance status

Incident Response

Severity Levels

Severity 1 (Critical)

  • Definition: Complete service outage or major security breach
  • Response Time: 15 minutes
  • Escalation: Immediate CEO notification
  • Example: API completely down, data breach

Severity 2 (High)

  • Definition: Significant feature degradation affecting >50% users
  • Response Time: 30 minutes
  • Escalation: VP Engineering notification
  • Example: Database slow performance, payment processing issues

Severity 3 (Medium)

  • Definition: Minor feature issues affecting <25% users
  • Response Time: 2 hours
  • Escalation: Team lead notification
  • Example: Single persona functionality impaired

Severity 4 (Low)

  • Definition: Cosmetic issues or minor bugs
  • Response Time: Next business day
  • Escalation: Normal bug tracking process
  • Example: UI display issues, non-critical integrations
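The severity definitions above can be reduced to a small triage function. The exact predicates (full-outage flag, fraction of users affected, cosmetic-only flag) are an illustrative simplification of the definitions, not an official rule:

```python
# Map the severity definitions above to a numeric level.
# Sev 1: complete outage; Sev 2: >50% of users affected;
# Sev 4: cosmetic only; everything else falls to Sev 3.
def classify_severity(full_outage: bool, pct_users_affected: float,
                      cosmetic_only: bool = False) -> int:
    if full_outage:
        return 1
    if cosmetic_only:
        return 4
    return 2 if pct_users_affected > 50 else 3
```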

Incident Response Procedures

Step 1: Detection & Alert

# Automated alerts via PagerDuty
# Manual escalation process
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "YOUR_INTEGRATION_KEY",
    "event_action": "trigger",
    "payload": {
      "summary": "High error rate detected",
      "severity": "critical",
      "source": "monitoring"
    }
  }'

Step 2: Initial Response (War Room)

  1. Acknowledge Alert (< 5 minutes)
  2. Assess Impact (< 10 minutes)
    • Check monitoring dashboards
    • Verify user impact
    • Estimate revenue impact
  3. Form Response Team (< 15 minutes)
    • Incident Commander
    • Technical Lead
    • Communications Lead

Step 3: Investigation & Mitigation

# Quick diagnostic commands
kubectl top nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl logs -f deployment/api-gateway -n production

# Database health check
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "
SELECT
  schemaname,
  tablename,
  n_live_tup,
  n_dead_tup,
  last_vacuum,
  last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;"

# Check external dependencies
./scripts/health-check-external-apis.sh

Step 4: Communication Plan

# Internal Communication (Slack #incidents)

-   Initial assessment within 15 minutes
-   Updates every 30 minutes until resolved
-   Post-mortem scheduled within 24 hours

# External Communication (Status Page)

-   Initial notification within 30 minutes
-   Hourly updates for Severity 1 incidents
-   Resolution notification and summary

Step 5: Resolution & Post-Mortem

# Post-Mortem Template

## Incident Summary

-   Start Time:
-   End Time:
-   Duration:
-   Severity:
-   Impact:

## Root Cause Analysis

-   Immediate cause:
-   Contributing factors:
-   Detection time:

## Action Items

-   [ ] Immediate fixes (Owner: X, Due: Y)
-   [ ] Process improvements (Owner: X, Due: Y)
-   [ ] Monitoring enhancements (Owner: X, Due: Y)

Disaster Recovery

Recovery Time Objectives (RTO) & Recovery Point Objectives (RPO)

Service Tier            RTO          RPO          Recovery Method
Critical (API, Auth)    15 minutes   5 minutes    Hot standby, auto-failover
Important (Analytics)   2 hours      30 minutes   Warm standby, manual failover
Standard (Reporting)    24 hours     4 hours      Cold backup, manual restore

Backup Strategy

Database Backups

#!/bin/bash
# Automated daily backups
BACKUP_DATE=$(date +%Y%m%d-%H%M)
pg_dump -h $PRIMARY_DB_HOST -U backup_user v01t_production > /backups/postgres-$BACKUP_DATE.sql
aws s3 cp /backups/postgres-$BACKUP_DATE.sql s3://v01t-backups/database/

# Retention policy: 7 daily, 4 weekly, 12 monthly
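The "7 daily, 4 weekly, 12 monthly" retention rule can be sketched as a keep/expire predicate. The promotion scheme assumed here (weekly copies are the Sunday backups, monthly copies are the 1st-of-month backups) is one common reading, not something the runbook specifies:

```python
import datetime

# Decide whether a backup taken on `taken` should still be kept as of `today`,
# under a 7-daily / 4-weekly / 12-monthly retention rule. Assumes weekly
# copies land on Sundays and monthly copies on the 1st of the month.
def keep_backup(taken: datetime.date, today: datetime.date) -> bool:
    age = (today - taken).days
    if age < 7:                            # the last 7 daily backups
        return True
    if age < 28 and taken.weekday() == 6:  # 4 weekly (Sunday) backups
        return True
    if age < 365 and taken.day == 1:       # 12 monthly (1st-of-month) backups
        return True
    return False
```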

Application State Backups

# Configuration backups (encrypt the secrets dump before storing it off-cluster)
kubectl get configmaps --all-namespaces -o yaml > config-backup-$BACKUP_DATE.yaml
kubectl get secrets --all-namespaces -o yaml > secrets-backup-$BACKUP_DATE.yaml

# Redis backup
redis-cli -h $REDIS_HOST BGSAVE
redis-cli -h $REDIS_HOST LASTSAVE

Failover Procedures

Automated Failover (RTO < 15 minutes)

# Health check configuration
healthCheck:
    path: /health
    intervalSeconds: 30
    timeoutSeconds: 5
    failureThreshold: 3

# Auto-scaling configuration
autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70
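The health-check settings above bound how long a failure can go undetected: `failureThreshold` consecutive probe failures, spaced `intervalSeconds` apart, with each probe allowed up to `timeoutSeconds`. A rough worst-case estimate (the exact probe timing depends on the load balancer):

```python
# Approximate worst-case failure-detection time from the health-check
# configuration above: 3 failed probes 30s apart, last probe timing out at 5s.
def max_detection_seconds(interval: int = 30, timeout: int = 5,
                          failures: int = 3) -> int:
    return failures * interval + timeout
```

At the configured values this is about 95 seconds, comfortably inside the 15-minute automated-failover RTO.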

Manual Failover (RTO < 2 hours)

# Promote read replica to primary
aws rds promote-read-replica --db-instance-identifier v01t-prod-replica

# Update DNS to point to DR region
aws route53 change-resource-record-sets --hosted-zone-id Z123456789 --change-batch file://dns-failover.json

# Scale up DR environment
kubectl scale deployment --replicas=5 --all -n production-dr

Testing Schedule

  • Monthly: Backup restoration test
  • Quarterly: Partial failover test
  • Annually: Full disaster recovery drill

Performance Optimization

Performance Monitoring

Application Performance Monitoring (APM)

# DataDog APM Configuration
apm:
    enabled: true
    trace_sampling_rate: 0.1
    profiling_enabled: true

metrics:
    - name: request_duration
      type: histogram
      buckets: [0.1, 0.25, 0.5, 1, 2.5, 5, 10]

    - name: database_query_duration
      type: histogram
      labels: [query_type, table]
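The `request_duration` histogram above uses cumulative (`le`, i.e. less-than-or-equal) buckets, Prometheus-style. A minimal sketch of which bucket a given request duration falls into:

```python
import bisect

# Buckets (in seconds) from the request_duration histogram configured above.
BUCKETS = [0.1, 0.25, 0.5, 1, 2.5, 5, 10]

def bucket_for(duration_s: float) -> str:
    """Return the smallest 'le' bucket that contains the observation."""
    i = bisect.bisect_left(BUCKETS, duration_s)
    return f"le={BUCKETS[i]}" if i < len(BUCKETS) else "le=+Inf"
```

For example, a 300 ms request is counted in the `le=0.5` bucket, and anything over 10 s falls into the implicit `+Inf` bucket.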

Database Performance Optimization

-- Identify slow queries (columns are total_exec_time/mean_exec_time on
-- PostgreSQL >= 13; older versions use total_time/mean_time)
SELECT
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Index usage analysis
SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

Caching Strategy

# Redis Cache Configuration
cache_layers:
    - name: api_responses
      ttl: 300s
      max_memory: 2GB
      eviction_policy: allkeys-lru

    - name: user_sessions
      ttl: 3600s
      max_memory: 1GB
      eviction_policy: volatile-ttl

    - name: analytics_data
      ttl: 1800s
      max_memory: 4GB
      eviction_policy: allkeys-lru
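The TTL behaviour of the cache layers above can be illustrated with a minimal in-process sketch; in production Redis handles both expiry and the eviction policies listed:

```python
import time

# Minimal TTL cache sketch mirroring the api_responses layer above (300s TTL).
# The `now` parameter exists only to make expiry testable deterministically.
class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def set(self, key, value, now=None):
        now = time.time() if now is None else now
        self._store[key] = (value, now + self.ttl)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is None or now >= entry[1]:
            self._store.pop(key, None)  # lazy expiry on read
            return None
        return entry[0]
```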

Auto-Scaling Configuration

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
    name: api-gateway-hpa
spec:
    scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api-gateway
    minReplicas: 3
    maxReplicas: 50
    metrics:
        - type: Resource
          resource:
              name: cpu
              target:
                  type: Utilization
                  averageUtilization: 70
        - type: Resource
          resource:
              name: memory
              target:
                  type: Utilization
                  averageUtilization: 80
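The HPA above scales using the documented formula `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min/max bounds. A worked sketch:

```python
import math

# Kubernetes HPA scaling rule, clamped to the minReplicas/maxReplicas bounds
# from the manifest above (3 and 50).
def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 3, max_r: int = 50) -> int:
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))
```

For example, 10 pods running at 90% CPU against a 70% target scale out to 13 pods.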

Database Auto-Scaling

# RDS Auto Scaling for read replicas
aws application-autoscaling register-scalable-target \
  --service-namespace rds \
  --resource-id cluster:v01t-production \
  --scalable-dimension rds:cluster:ReadReplicaCount \
  --min-capacity 1 \
  --max-capacity 10

Security Operations

Security Monitoring

SIEM Configuration (Splunk/ELK)

# Log collection rules
log_sources:
    - name: application_logs
      path: /var/log/app/*.log
      type: json
      index: app_logs

    - name: access_logs
      path: /var/log/nginx/access.log
      type: nginx
      index: web_logs

    - name: audit_logs
      path: /var/log/audit/*.log
      type: linux_audit
      index: security_logs

# Security alerts
alerts:
    - name: suspicious_login_attempts
      query: "index=security_logs action=login result=failure | stats count by source_ip | where count > 10"
      threshold: 1
      action: block_ip

    - name: privilege_escalation
      query: "index=security_logs (sudo OR su) | stats count by user"
      threshold: 5
      action: notify_security_team
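The `suspicious_login_attempts` rule above (count failed logins per source IP, flag anything over 10) reduces to a simple aggregation. The event-dict shape used here is an assumption standing in for the indexed log schema:

```python
from collections import Counter

# Sketch of the suspicious_login_attempts rule above: count failed logins
# per source IP within the search window and flag IPs over the threshold.
def suspicious_ips(events, threshold=10):
    counts = Counter(
        e["source_ip"] for e in events
        if e.get("action") == "login" and e.get("result") == "failure"
    )
    return sorted(ip for ip, n in counts.items() if n > threshold)
```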

Vulnerability Scanning

#!/bin/bash
# Daily security scans
# Container image scanning
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  aquasec/trivy image v01t/api-gateway:latest

# Infrastructure scanning
nmap -sS -O -iL target_hosts.txt
nikto -h https://api.v01t.io

# Dependency scanning
npm audit --audit-level moderate
pip-audit --requirement requirements.txt

Access Control

Role-Based Access Control (RBAC)

# Kubernetes RBAC
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: v01t-developer
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "update"]

# Database access control
GRANT SELECT, INSERT, UPDATE ON user_data TO app_user;
GRANT SELECT ON analytics_data TO readonly_user;
REVOKE ALL ON sensitive_data FROM app_user;

Multi-Factor Authentication

#!/bin/bash
# MFA enforcement script
aws iam put-user-policy --user-name $USERNAME --policy-name EnforceMFA --policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "NotAction": "iam:*",
      "Resource": "*",
      "Condition": {
        "BoolIfExists": {
          "aws:MultiFactorAuthPresent": "false"
        }
      }
    }
  ]
}'

Data Management

Data Lifecycle Management

Data Retention Policies

data_retention:
    user_activity_logs: 90_days
    audit_logs: 7_years
    analytics_data: 3_years
    user_content: until_user_deletion
    financial_records: 7_years

automated_cleanup:
    enabled: true
    schedule: "0 2 * * 0" # Weekly at 2 AM Sunday
    notification: data-team@v01t.io
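The weekly cleanup job needs a cutoff date per retention class: anything older is eligible for deletion. A minimal sketch of that calculation; the day counts for the multi-year classes are approximations (365-day years), and `user_content` is omitted because its retention is event-driven, not time-based:

```python
import datetime

# Approximate day counts for the time-based retention classes above.
RETENTION_DAYS = {
    "user_activity_logs": 90,
    "analytics_data": 3 * 365,
    "audit_logs": 7 * 365,
    "financial_records": 7 * 365,
}

def cutoff_date(record_class: str, today: datetime.date) -> datetime.date:
    """Records older than the returned date are eligible for cleanup."""
    return today - datetime.timedelta(days=RETENTION_DAYS[record_class])
```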

Data Backup & Recovery

#!/bin/bash
# Automated backup script
DATE=$(date +%Y%m%d)

# Database backup
pg_dump -h $DB_HOST -U $DB_USER v01t_production | gzip > backup-$DATE.sql.gz
aws s3 cp backup-$DATE.sql.gz s3://v01t-backups/daily/

# File storage backup
aws s3 sync s3://v01t-production-files s3://v01t-backup-files --delete

# Configuration backup
kubectl get all --all-namespaces -o yaml > k8s-backup-$DATE.yaml
aws s3 cp k8s-backup-$DATE.yaml s3://v01t-backups/config/

Data Privacy & Compliance

GDPR Compliance Procedures

# Data subject access request
./scripts/gdpr-data-export.sh --user-id $USER_ID --output-format json

# Right to be forgotten
./scripts/gdpr-data-deletion.sh --user-id $USER_ID --confirm

# Data processing audit
./scripts/gdpr-audit-trail.sh --start-date 2025-01-01 --end-date 2025-12-31

Data Encryption

# Encryption at rest
database:
  encryption: AES-256
  key_management: AWS_KMS

storage:
  s3_encryption: SSE-S3
  ebs_encryption: true

# Encryption in transit
api:
  tls_version: "1.3"
  cipher_suites: ["TLS_AES_256_GCM_SHA384", "TLS_CHACHA20_POLY1305_SHA256"]

database:
  ssl_mode: require
  ssl_cert: /etc/ssl/certs/db-client.crt

Contact Information & Escalation

On-Call Rotation

Emergency Contacts

  • CTO: +1-555-0001 (24/7)
  • VP Engineering: +1-555-0002
  • Security Lead: +1-555-0003
  • Database Admin: +1-555-0004

Service Vendors

  • AWS Support: Enterprise tier, 15-minute SLA
  • DataDog: Priority support, 1-hour SLA
  • CloudFlare: Enterprise support, 1-hour SLA

Last Updated: 2025-10-31
Next Review: 2025-11-30
Document Owner: VP Engineering