
Operational Runbooks - v01t.io Production Environment

Table of Contents

  1. System Architecture Overview
  2. Deployment Procedures
  3. Monitoring & Alerting
  4. Incident Response
  5. Disaster Recovery
  6. Performance Optimization
  7. Security Operations
  8. Data Management

System Architecture Overview

Production Environment Architecture

Infrastructure:
    Cloud Provider: AWS (Primary), Azure (DR)
    Orchestration: Kubernetes (EKS)
    Service Mesh: Istio
    Load Balancer: AWS ALB + CloudFlare

Core Services:
    - ecosystem-orchestrator (Node.js)
    - workflow-engine (Python/Celery)
    - integration-hub (Java/Spring)
    - content-scheduler (Python/Django)
    - analytics-engine (Python/Spark)
    - gamification-service (Go)

Data Layer:
    Primary: PostgreSQL cluster (RDS)
    Analytics: ClickHouse cluster
    Cache: Redis Cluster
    Message Queue: Apache Kafka
    File Storage: S3 + CloudFront CDN

Service Dependencies Map

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Gateway   │    │  Load Balancer  │    │   CloudFlare    │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Ecosystem Orch. │◄──►│ Workflow Engine │◄──►│Integration Hub  │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│Content Scheduler│◄──►│Analytics Engine │◄──►│Gamification Svc │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          ▼                      ▼                      ▼
┌─────────────────────────────────────────────────────────────┐
│                    Data Layer                               │
│  PostgreSQL  │  ClickHouse  │  Redis  │  Kafka  │    S3    │
└─────────────────────────────────────────────────────────────┘

Deployment Procedures

Standard Deployment Process

Pre-Deployment Checklist

  • All tests passing in staging environment
  • Security scan completed (no critical vulnerabilities)
  • Performance tests validated
  • Database migrations tested
  • Rollback plan prepared
  • Stakeholder notification sent
  • Deployment window approved

Blue-Green Deployment Steps

  1. Preparation Phase
# Verify current environment status
kubectl get pods -n production
kubectl get svc -n production

# Check application health
curl -f https://api.v01t.io/health
curl -f https://api.v01t.io/ready

# Backup current configuration
kubectl get configmap production-config -o yaml > backup-config-$(date +%Y%m%d-%H%M).yaml
  2. Green Environment Setup
# Deploy to green environment
helm upgrade v01t-green ./helm-charts/v01t \
  --namespace production-green \
  --set image.tag=$NEW_VERSION \
  --set environment=production-green

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod -l app=v01t -n production-green --timeout=600s

# Run smoke tests
./scripts/smoke-tests.sh production-green
  3. Traffic Switch
# Gradually shift traffic (10% increments). Kubernetes Services have no
# selector "weight"; with Istio in the mesh, weighted routing is set on the
# VirtualService (resource and subset names below are illustrative):
kubectl patch virtualservice v01t-service -n production --type merge -p \
  '{"spec":{"http":[{"route":[{"destination":{"host":"v01t-service","subset":"blue"},"weight":90},{"destination":{"host":"v01t-service","subset":"green"},"weight":10}]}]}}'

# Monitor for 5 minutes, check metrics
./scripts/check-metrics.sh

# Continue traffic shift if healthy (e.g. 50/50)
kubectl patch virtualservice v01t-service -n production --type merge -p \
  '{"spec":{"http":[{"route":[{"destination":{"host":"v01t-service","subset":"blue"},"weight":50},{"destination":{"host":"v01t-service","subset":"green"},"weight":50}]}]}}'
# ... continue until 100%
  4. Cleanup
# Once green is stable, cleanup blue
kubectl delete namespace production-blue
kubectl label namespace production-green environment=production-blue
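The traffic-shift steps above can be sketched as a small control loop. `set_green_weight` and `check_metrics` are hypothetical stand-ins for the `kubectl patch` and `./scripts/check-metrics.sh` calls; the increment sequence is illustrative.

```python
# Sketch of the gradual blue-green traffic shift: advance the green weight in
# steps, verify metrics after each step, and roll back to blue on failure.
def shift_traffic(set_green_weight, check_metrics, steps=(10, 25, 50, 75, 100)):
    """Return (weights successfully applied, True if green reached 100%)."""
    applied = []
    for weight in steps:
        set_green_weight(weight)   # e.g. patch the Istio VirtualService
        if not check_metrics():    # e.g. error rate / latency within SLO
            set_green_weight(0)    # roll everything back to blue
            return applied, False
        applied.append(weight)
    return applied, True
```

With healthy metrics the loop walks all increments; on the first bad metrics check it resets green to 0% and stops, matching the "monitor, then continue" cadence in the steps above.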

Emergency Hotfix Procedure

# For critical production issues requiring immediate fix
git checkout main
git cherry-pick $HOTFIX_COMMIT
git tag hotfix-v$(date +%Y%m%d-%H%M)

# Fast-track deployment (skips some checks)
./scripts/emergency-deploy.sh hotfix-v$(date +%Y%m%d-%H%M)

Monitoring & Alerting

Key Performance Indicators (KPIs)

System Health Metrics

API Response Time:
  Target: <500ms (95th percentile)
  Critical: >2000ms
  Alert: engineering-critical@v01t.io

Database Performance:
  Target: <100ms query time
  Critical: >1000ms
  Alert: database-team@v01t.io

Service Availability:
  Target: 99.9% uptime
  Critical: <99% in 24h window
  Alert: oncall-team@v01t.io

Memory Usage:
  Warning: >80% on any pod
  Critical: >95% on any pod
  Alert: infrastructure@v01t.io
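The 99.9% availability target above translates directly into a downtime budget. A minimal sketch of that arithmetic (roughly 43.2 minutes per 30-day window at 99.9%):

```python
# Downtime budget implied by an availability target over a rolling window.
def downtime_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    minutes_in_window = window_days * 24 * 60
    return minutes_in_window * (1 - target_pct / 100)
```

This is the number an on-call engineer compares against when deciding whether an incident has already burned the monthly budget.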

Business Metrics

User Activation Rate:
  Target: >70% weekly
  Warning: <60%
  Alert: product-team@v01t.io

Revenue Recognition:
  Target: Daily MRR tracking
  Critical: >5% deviation from forecast
  Alert: finance@v01t.io

Error Rate:
  Target: <0.1% of requests
  Critical: >1% of requests
  Alert: engineering@v01t.io
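The error-rate thresholds above can be applied as a simple classifier. The runbook only defines the target (<0.1%) and critical (>1%) levels; treating the band in between as a warning is an assumption made here for illustration:

```python
# Classify a request error rate against the thresholds above:
# below 0.1% is within target, above 1% is critical; the intermediate
# "warning" band is an assumption, not defined by the runbook.
def classify_error_rate(errors: int, total: int) -> str:
    rate = errors / total
    if rate > 0.01:
        return "critical"
    if rate >= 0.001:
        return "warning"
    return "ok"
```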

Alerting Rules (Prometheus)

# High Error Rate Alert
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01
  for: 2m
  labels:
      severity: critical
  annotations:
      summary: "High error rate detected"
      description: "5xx responses are {{ $value | humanizePercentage }} of requests"

# Database Connection Pool Exhaustion
- alert: DatabaseConnectionExhaustion
  expr: database_connections_active / database_connections_max > 0.9
  for: 1m
  labels:
      severity: warning
  annotations:
      summary: "Database connection pool nearly exhausted"

# Kafka Consumer Lag
- alert: KafkaConsumerLag
  expr: kafka_consumer_lag_sum > 10000
  for: 5m
  labels:
      severity: critical
  annotations:
      summary: "Kafka consumer lag too high"

Dashboard Configuration

Executive Dashboard (Grafana)

  • Business KPIs: Revenue, Users, Activation Rate
  • System Health: Uptime, Error Rate, Response Time
  • Cost Metrics: Infrastructure spend, Cost per user

Engineering Dashboard

  • Service-level metrics for each microservice
  • Database performance and query analysis
  • Infrastructure utilization and scaling metrics

Operations Dashboard

  • Alert status and incident timeline
  • Deployment history and success rates
  • Security events and compliance status

Incident Response

Severity Levels

Severity 1 (Critical)

  • Definition: Complete service outage or major security breach
  • Response Time: 15 minutes
  • Escalation: Immediate CEO notification
  • Example: API completely down, data breach

Severity 2 (High)

  • Definition: Significant feature degradation affecting >50% users
  • Response Time: 30 minutes
  • Escalation: VP Engineering notification
  • Example: Database slow performance, payment processing issues

Severity 3 (Medium)

  • Definition: Minor feature issues affecting <25% users
  • Response Time: 2 hours
  • Escalation: Team lead notification
  • Example: Single persona functionality impaired

Severity 4 (Low)

  • Definition: Cosmetic issues or minor bugs
  • Response Time: Next business day
  • Escalation: Normal bug tracking process
  • Example: UI display issues, non-critical integrations
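The severity definitions above can be reduced to a small triage function. The exact predicates (full-outage flag, fraction of users affected, cosmetic-only flag) are an illustrative simplification of the definitions, not an official rule:

```python
# Map the severity definitions above to a numeric level.
# Sev 1: complete outage; Sev 2: >50% of users affected;
# Sev 4: cosmetic only; everything else falls to Sev 3.
def classify_severity(full_outage: bool, pct_users_affected: float,
                      cosmetic_only: bool = False) -> int:
    if full_outage:
        return 1
    if cosmetic_only:
        return 4
    return 2 if pct_users_affected > 50 else 3
```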

Incident Response Procedures

Step 1: Detection & Alert

# Automated alerts via PagerDuty
# Manual escalation process
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "YOUR_INTEGRATION_KEY",
    "event_action": "trigger",
    "payload": {
      "summary": "High error rate detected",
      "severity": "critical",
      "source": "monitoring"
    }
  }'

Step 2: Initial Response (War Room)

  1. Acknowledge Alert (< 5 minutes)
  2. Assess Impact (< 10 minutes)
    • Check monitoring dashboards
    • Verify user impact
    • Estimate revenue impact
  3. Form Response Team (< 15 minutes)
    • Incident Commander
    • Technical Lead
    • Communications Lead

Step 3: Investigation & Mitigation

# Quick diagnostic commands
kubectl top nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl logs -f deployment/api-gateway -n production

# Database health check
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "
SELECT
  schemaname,
  tablename,
  n_live_tup,
  n_dead_tup,
  last_vacuum,
  last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;"

# Check external dependencies
./scripts/health-check-external-apis.sh

Step 4: Communication Plan

# Internal Communication (Slack #incidents)

-   Initial assessment within 15 minutes
-   Updates every 30 minutes until resolved
-   Post-mortem scheduled within 24 hours

# External Communication (Status Page)

-   Initial notification within 30 minutes
-   Hourly updates for Severity 1 incidents
-   Resolution notification and summary

Step 5: Resolution & Post-Mortem

# Post-Mortem Template

## Incident Summary

-   Start Time:
-   End Time:
-   Duration:
-   Severity:
-   Impact:

## Root Cause Analysis

-   Immediate cause:
-   Contributing factors:
-   Detection time:

## Action Items

-   [ ] Immediate fixes (Owner: X, Due: Y)
-   [ ] Process improvements (Owner: X, Due: Y)
-   [ ] Monitoring enhancements (Owner: X, Due: Y)

Disaster Recovery

Recovery Time Objectives (RTO) & Recovery Point Objectives (RPO)

Service Tier            RTO          RPO          Recovery Method
Critical (API, Auth)    15 minutes   5 minutes    Hot standby, auto-failover
Important (Analytics)   2 hours      30 minutes   Warm standby, manual failover
Standard (Reporting)    24 hours     4 hours      Cold backup, manual restore

Backup Strategy

Database Backups

#!/bin/bash
# Automated daily backups
BACKUP_DATE=$(date +%Y%m%d-%H%M)
pg_dump -h $PRIMARY_DB_HOST -U backup_user v01t_production > /backups/postgres-$BACKUP_DATE.sql
aws s3 cp /backups/postgres-$BACKUP_DATE.sql s3://v01t-backups/database/

# Retention policy: 7 daily, 4 weekly, 12 monthly
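The "7 daily, 4 weekly, 12 monthly" retention rule can be sketched as a keep/expire predicate. The promotion scheme assumed here (weekly copies are the Sunday backups, monthly copies are the 1st-of-month backups) is one common reading, not something the runbook specifies:

```python
import datetime

# Decide whether a backup taken on `taken` should still be kept as of `today`,
# under a 7-daily / 4-weekly / 12-monthly retention rule. Assumes weekly
# copies land on Sundays and monthly copies on the 1st of the month.
def keep_backup(taken: datetime.date, today: datetime.date) -> bool:
    age = (today - taken).days
    if age < 7:                            # the last 7 daily backups
        return True
    if age < 28 and taken.weekday() == 6:  # 4 weekly (Sunday) backups
        return True
    if age < 365 and taken.day == 1:       # 12 monthly (1st-of-month) backups
        return True
    return False
```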

Application State Backups

# Configuration backups (encrypt the secrets dump before storing it off-cluster)
kubectl get configmaps --all-namespaces -o yaml > config-backup-$BACKUP_DATE.yaml
kubectl get secrets --all-namespaces -o yaml > secrets-backup-$BACKUP_DATE.yaml

# Redis backup
redis-cli -h $REDIS_HOST BGSAVE
redis-cli -h $REDIS_HOST LASTSAVE

Failover Procedures

Automated Failover (RTO < 15 minutes)

# Health check configuration
healthCheck:
    path: /health
    intervalSeconds: 30
    timeoutSeconds: 5
    failureThreshold: 3

# Auto-scaling configuration
autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70
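The health-check settings above bound how long a failure can go undetected: `failureThreshold` consecutive probe failures, spaced `intervalSeconds` apart, with each probe allowed up to `timeoutSeconds`. A rough worst-case estimate (the exact probe timing depends on the load balancer):

```python
# Approximate worst-case failure-detection time from the health-check
# configuration above: 3 failed probes 30s apart, last probe timing out at 5s.
def max_detection_seconds(interval: int = 30, timeout: int = 5,
                          failures: int = 3) -> int:
    return failures * interval + timeout
```

At the configured values this is about 95 seconds, comfortably inside the 15-minute automated-failover RTO.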

Manual Failover (RTO < 2 hours)

# Promote read replica to primary
aws rds promote-read-replica --db-instance-identifier v01t-prod-replica

# Update DNS to point to DR region
aws route53 change-resource-record-sets --hosted-zone-id Z123456789 --change-batch file://dns-failover.json

# Scale up DR environment
kubectl scale deployment --replicas=5 --all -n production-dr

Testing Schedule

  • Monthly: Backup restoration test
  • Quarterly: Partial failover test
  • Annually: Full disaster recovery drill

Performance Optimization

Performance Monitoring

Application Performance Monitoring (APM)

# DataDog APM Configuration
apm:
    enabled: true
    trace_sampling_rate: 0.1
    profiling_enabled: true

metrics:
    - name: request_duration
      type: histogram
      buckets: [0.1, 0.25, 0.5, 1, 2.5, 5, 10]

    - name: database_query_duration
      type: histogram
      labels: [query_type, table]
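The `request_duration` histogram above uses cumulative (`le`, i.e. less-than-or-equal) buckets, Prometheus-style. A minimal sketch of which bucket a given request duration falls into:

```python
import bisect

# Buckets (in seconds) from the request_duration histogram configured above.
BUCKETS = [0.1, 0.25, 0.5, 1, 2.5, 5, 10]

def bucket_for(duration_s: float) -> str:
    """Return the smallest 'le' bucket that contains the observation."""
    i = bisect.bisect_left(BUCKETS, duration_s)
    return f"le={BUCKETS[i]}" if i < len(BUCKETS) else "le=+Inf"
```

For example, a 300 ms request is counted in the `le=0.5` bucket, and anything over 10 s falls into the implicit `+Inf` bucket.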

Database Performance Optimization

-- Identify slow queries (columns are total_exec_time/mean_exec_time on
-- PostgreSQL >= 13; older versions use total_time/mean_time)
SELECT
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Index usage analysis
SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

Caching Strategy

# Redis Cache Configuration
cache_layers:
    - name: api_responses
      ttl: 300s
      max_memory: 2GB
      eviction_policy: allkeys-lru

    - name: user_sessions
      ttl: 3600s
      max_memory: 1GB
      eviction_policy: volatile-ttl

    - name: analytics_data
      ttl: 1800s
      max_memory: 4GB
      eviction_policy: allkeys-lru
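The TTL behaviour of the cache layers above can be illustrated with a minimal in-process sketch; in production Redis handles both expiry and the eviction policies listed:

```python
import time

# Minimal TTL cache sketch mirroring the api_responses layer above (300s TTL).
# The `now` parameter exists only to make expiry testable deterministically.
class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def set(self, key, value, now=None):
        now = time.time() if now is None else now
        self._store[key] = (value, now + self.ttl)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is None or now >= entry[1]:
            self._store.pop(key, None)  # lazy expiry on read
            return None
        return entry[0]
```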

Auto-Scaling Configuration

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
    name: api-gateway-hpa
spec:
    scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api-gateway
    minReplicas: 3
    maxReplicas: 50
    metrics:
        - type: Resource
          resource:
              name: cpu
              target:
                  type: Utilization
                  averageUtilization: 70
        - type: Resource
          resource:
              name: memory
              target:
                  type: Utilization
                  averageUtilization: 80
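The HPA above scales using the documented formula `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min/max bounds. A worked sketch:

```python
import math

# Kubernetes HPA scaling rule, clamped to the minReplicas/maxReplicas bounds
# from the manifest above (3 and 50).
def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 3, max_r: int = 50) -> int:
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))
```

For example, 10 pods running at 90% CPU against a 70% target scale out to 13 pods.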

Database Auto-Scaling

# RDS Auto Scaling for read replicas
aws application-autoscaling register-scalable-target \
  --service-namespace rds \
  --resource-id cluster:v01t-production \
  --scalable-dimension rds:cluster:ReadReplicaCount \
  --min-capacity 1 \
  --max-capacity 10

Security Operations

Security Monitoring

SIEM Configuration (Splunk/ELK)

# Log collection rules
log_sources:
    - name: application_logs
      path: /var/log/app/*.log
      type: json
      index: app_logs

    - name: access_logs
      path: /var/log/nginx/access.log
      type: nginx
      index: web_logs

    - name: audit_logs
      path: /var/log/audit/*.log
      type: linux_audit
      index: security_logs

# Security alerts
alerts:
    - name: suspicious_login_attempts
      query: "index=security_logs action=login result=failure | stats count by source_ip | where count > 10"
      threshold: 1
      action: block_ip

    - name: privilege_escalation
      query: "index=security_logs (sudo OR su) | stats count by user"
      threshold: 5
      action: notify_security_team
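The `suspicious_login_attempts` rule above (count failed logins per source IP, flag anything over 10) reduces to a simple aggregation. The event-dict shape used here is an assumption standing in for the indexed log schema:

```python
from collections import Counter

# Sketch of the suspicious_login_attempts rule above: count failed logins
# per source IP within the search window and flag IPs over the threshold.
def suspicious_ips(events, threshold=10):
    counts = Counter(
        e["source_ip"] for e in events
        if e.get("action") == "login" and e.get("result") == "failure"
    )
    return sorted(ip for ip, n in counts.items() if n > threshold)
```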

Vulnerability Scanning

#!/bin/bash
# Daily security scans
# Container image scanning
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  aquasec/trivy image v01t/api-gateway:latest

# Infrastructure scanning
nmap -sS -O -iL target_hosts.txt
nikto -h https://api.v01t.io

# Dependency scanning
npm audit --audit-level moderate
pip-audit --requirement requirements.txt

Access Control

Role-Based Access Control (RBAC)

# Kubernetes RBAC
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: v01t-developer
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "update"]

# Database access control
GRANT SELECT, INSERT, UPDATE ON user_data TO app_user;
GRANT SELECT ON analytics_data TO readonly_user;
REVOKE ALL ON sensitive_data FROM app_user;

Multi-Factor Authentication

#!/bin/bash
# MFA enforcement script
aws iam put-user-policy --user-name $USERNAME --policy-name EnforceMFA --policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "NotAction": "iam:*",
      "Resource": "*",
      "Condition": {
        "BoolIfExists": {
          "aws:MultiFactorAuthPresent": "false"
        }
      }
    }
  ]
}'

Data Management

Data Lifecycle Management

Data Retention Policies

data_retention:
    user_activity_logs: 90_days
    audit_logs: 7_years
    analytics_data: 3_years
    user_content: until_user_deletion
    financial_records: 7_years

automated_cleanup:
    enabled: true
    schedule: "0 2 * * 0" # Weekly at 2 AM Sunday
    notification: data-team@v01t.io
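The weekly cleanup job needs a cutoff date per retention class: anything older is eligible for deletion. A minimal sketch of that calculation; the day counts for the multi-year classes are approximations (365-day years), and `user_content` is omitted because its retention is event-driven, not time-based:

```python
import datetime

# Approximate day counts for the time-based retention classes above.
RETENTION_DAYS = {
    "user_activity_logs": 90,
    "analytics_data": 3 * 365,
    "audit_logs": 7 * 365,
    "financial_records": 7 * 365,
}

def cutoff_date(record_class: str, today: datetime.date) -> datetime.date:
    """Records older than the returned date are eligible for cleanup."""
    return today - datetime.timedelta(days=RETENTION_DAYS[record_class])
```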

Data Backup & Recovery

#!/bin/bash
# Automated backup script
DATE=$(date +%Y%m%d)

# Database backup
pg_dump -h $DB_HOST -U $DB_USER v01t_production | gzip > backup-$DATE.sql.gz
aws s3 cp backup-$DATE.sql.gz s3://v01t-backups/daily/

# File storage backup
aws s3 sync s3://v01t-production-files s3://v01t-backup-files --delete

# Configuration backup
kubectl get all --all-namespaces -o yaml > k8s-backup-$DATE.yaml
aws s3 cp k8s-backup-$DATE.yaml s3://v01t-backups/config/

Data Privacy & Compliance

GDPR Compliance Procedures

# Data subject access request
./scripts/gdpr-data-export.sh --user-id $USER_ID --output-format json

# Right to be forgotten
./scripts/gdpr-data-deletion.sh --user-id $USER_ID --confirm

# Data processing audit
./scripts/gdpr-audit-trail.sh --start-date 2025-01-01 --end-date 2025-12-31

Data Encryption

# Encryption at rest
database:
  encryption: AES-256
  key_management: AWS_KMS

storage:
  s3_encryption: SSE-S3
  ebs_encryption: true

# Encryption in transit
api:
  tls_version: "1.3"
  cipher_suites: ["TLS_AES_256_GCM_SHA384", "TLS_CHACHA20_POLY1305_SHA256"]

database:
  ssl_mode: require
  ssl_cert: /etc/ssl/certs/db-client.crt

Contact Information & Escalation

On-Call Rotation

Emergency Contacts

  • CTO: +1-555-0001 (24/7)
  • VP Engineering: +1-555-0002
  • Security Lead: +1-555-0003
  • Database Admin: +1-555-0004

Service Vendors

  • AWS Support: Enterprise tier, 15-minute SLA
  • DataDog: Priority support, 1-hour SLA
  • CloudFlare: Enterprise support, 1-hour SLA

Last Updated: 2025-10-31
Next Review: 2025-11-30
Document Owner: VP Engineering