
Overview

This runbook covers operational procedures for backing up SO1 platform data, executing disaster recovery protocols, and ensuring business continuity. These procedures minimize data loss and downtime in the event of failures, data corruption, or catastrophic incidents.

Purpose: Provide step-by-step instructions for data protection, backup management, and disaster recovery
Scope: Database backups, configuration backups, disaster recovery testing, restoration procedures
Target Audience: SREs, DevOps engineers, platform operators, incident commanders

Prerequisites

  • Railway project access (database services)
  • PostgreSQL database access (admin privileges)
  • AWS S3 or backup storage access
  • Control Plane API access (CONTROL_PLANE_API_KEY)
  • GitHub repository admin access (configuration backups)
  • Railway CLI (railway command)
  • PostgreSQL client (psql, pg_dump, pg_restore)
  • AWS CLI (for S3 backups)
  • curl or API client
  • jq for JSON parsing
  • Understanding of PostgreSQL backup mechanisms
  • Familiarity with SO1 database schema
  • Basic knowledge of disaster recovery concepts (RPO, RTO)
  • Understanding of Railway platform architecture

Procedure 1: Create Database Backup

Step 1: Identify Databases to Back Up

# List all databases in SO1 platform
DATABASES=(
  "control-plane-db"      # Control Plane API data
  "n8n-db"                # n8n workflow data
  "veritas-db"            # Veritas prompt library
)

# Get database connection strings from Railway
for db in "${DATABASES[@]}"; do
  echo "=== $db ==="
  railway variables --service $db | grep DATABASE_URL
done
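Connection strings scraped from CLI output are easy to mangle. A small guard (a hypothetical helper, not part of the Railway CLI) catches an empty or malformed value before it reaches pg_dump:

```shell
# Hypothetical guard: reject empty or non-Postgres connection strings
# before they are handed to pg_dump/psql.
validate_db_url() {
  case "$1" in
    postgres://*|postgresql://*) return 0 ;;
    *) echo "invalid DATABASE_URL: '${1:-<empty>}'" >&2; return 1 ;;
  esac
}

# Example: validate_db_url "$DATABASE_URL" || exit 1
```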

Step 2: Create Manual Backup

# Backup Control Plane database (-f2- keeps the full value even if it contains '=')
export DATABASE_URL=$(railway variables --service control-plane-db | grep DATABASE_URL | cut -d'=' -f2-)

# Create backup with pg_dump
BACKUP_FILE="backup_control_plane_$(date +%Y%m%d_%H%M%S).sql"

pg_dump "$DATABASE_URL" \
  --format=custom \
  --compress=9 \
  --verbose \
  --file="$BACKUP_FILE"

# Verify backup file created
ls -lh "$BACKUP_FILE"

# Calculate checksum
sha256sum "$BACKUP_FILE" > "${BACKUP_FILE}.sha256"

Step 3: Upload Backup to Storage

# Upload to S3 (or other cloud storage)
aws s3 cp "$BACKUP_FILE" \
  s3://so1-backups/databases/control-plane/ \
  --storage-class STANDARD_IA \
  --metadata "source=control-plane-db,timestamp=$(date -Iseconds),checksum=$(cut -d' ' -f1 "${BACKUP_FILE}.sha256")"

# Upload checksum
aws s3 cp "${BACKUP_FILE}.sha256" \
  s3://so1-backups/databases/control-plane/

# Verify upload
aws s3 ls s3://so1-backups/databases/control-plane/ | grep "$BACKUP_FILE"

# Clean up local backup (optional, after verification)
# rm "$BACKUP_FILE" "${BACKUP_FILE}.sha256"
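The destination layout used above (s3://so1-backups/databases/&lt;service&gt;/) can be derived mechanically, which keeps ad-hoc and automated backups landing in the same place. A sketch (the helper name is illustrative):

```shell
# Illustrative helper: map a database service name to its backup prefix,
# stripping a trailing "-db" so "control-plane-db" lands under "control-plane/".
backup_s3_prefix() {
  local bucket="$1" db="$2"
  echo "s3://${bucket}/databases/${db%-db}/"
}

backup_s3_prefix so1-backups control-plane-db
# → s3://so1-backups/databases/control-plane/
```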

Step 4: Automate Backup with n8n

# Create automated backup workflow
curl -X POST https://n8n.so1.io/api/v1/workflows \
  -H "X-N8N-API-KEY: ${N8N_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Database Backup - Daily",
    "nodes": [
      {
        "name": "Schedule",
        "type": "n8n-nodes-base.scheduleTrigger",
        "parameters": {
          "cronExpression": "0 2 * * *"
        }
      },
      {
        "name": "Trigger Backup",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://control-plane.so1.io/api/v1/admin/backup/create",
          "method": "POST",
          "authentication": "genericCredentialType",
          "headers": {
            "Authorization": "Bearer {{$env.CONTROL_PLANE_API_KEY}}"
          },
          "jsonParameters": true,
          "bodyParameters": {
            "databases": ["control-plane-db", "n8n-db", "veritas-db"],
            "storage": "s3://so1-backups/databases/",
            "retention_days": 30
          }
        }
      },
      {
        "name": "Verify Backup",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://control-plane.so1.io/api/v1/admin/backup/verify",
          "method": "POST",
          "bodyParameters": {
            "backup_id": "={{$json.backup_id}}"
          }
        }
      },
      {
        "name": "Notify Success",
        "type": "n8n-nodes-base.slack",
        "parameters": {
          "channel": "#ops-notifications",
          "text": "✅ Daily database backup completed successfully\nBackup ID: {{$json.backup_id}}\nSize: {{$json.size_mb}}MB"
        }
      }
    ],
    "active": true
  }'
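Before POSTing a workflow definition like the one above, it is cheap to validate the payload locally. A minimal pre-flight check with jq (the field names mirror the request body above; the inline payload is a trimmed sample):

```shell
# Pre-flight check: payload must be valid JSON and contain a schedule trigger.
payload='{"name":"Database Backup - Daily","nodes":[{"name":"Schedule","type":"n8n-nodes-base.scheduleTrigger"}],"active":true}'

echo "$payload" | jq -e '.nodes | length > 0' >/dev/null \
  || { echo "payload has no nodes" >&2; exit 1; }
echo "$payload" | jq -e '[.nodes[].type] | any(. == "n8n-nodes-base.scheduleTrigger")' >/dev/null \
  || { echo "payload has no schedule trigger" >&2; exit 1; }
echo "payload OK"
```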

Procedure 2: Restore Database from Backup

Step 1: Identify Backup to Restore

# List available backups
aws s3 ls s3://so1-backups/databases/control-plane/ --recursive | sort -r

# Get specific backup
BACKUP_FILE="backup_control_plane_20260310_020000.sql"

# Download backup
aws s3 cp "s3://so1-backups/databases/control-plane/${BACKUP_FILE}" .

# Download and verify checksum
aws s3 cp "s3://so1-backups/databases/control-plane/${BACKUP_FILE}.sha256" .
sha256sum -c "${BACKUP_FILE}.sha256"
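Since restoration is destructive, the checksum check is worth hard-failing on. A small wrapper (assumes the .sha256 file created in Procedure 1 sits next to the backup):

```shell
# Refuse to proceed with a restore unless the checksum verifies.
verify_backup() {
  local f="$1"
  if ! sha256sum --status -c "${f}.sha256"; then
    echo "ABORT: checksum mismatch for $f -- do not restore" >&2
    return 1
  fi
  echo "checksum OK: $f"
}

# Example: verify_backup "$BACKUP_FILE" || exit 1
```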

Step 2: Prepare for Restoration

Database restoration is a destructive operation. Always test in a staging environment first and notify the team before restoring production databases.

# Create snapshot of current database (safety measure)
SNAPSHOT_FILE="snapshot_before_restore_$(date +%Y%m%d_%H%M%S).sql"
pg_dump "$DATABASE_URL" --format=custom --file="$SNAPSHOT_FILE"

# Terminate active connections to database
psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = current_database() AND pid <> pg_backend_pid();"

# Optionally: Create new database for restoration
# psql "$DATABASE_URL" -c "CREATE DATABASE control_plane_restore;"

Step 3: Execute Restoration

# Restore database from backup
pg_restore \
  --dbname="$DATABASE_URL" \
  --clean \
  --if-exists \
  --verbose \
  "$BACKUP_FILE"

# Check restoration status
echo $?  # Should be 0 for success

# Verify record counts
psql "$DATABASE_URL" -c "SELECT schemaname, tablename, n_live_tup as rows FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"

Step 4: Verify Restoration

# Run health checks
curl -s https://control-plane.so1.io/health | jq '.'

# Verify critical data
psql "$DATABASE_URL" <<EOF
-- Check workflows exist
SELECT COUNT(*) as workflow_count FROM workflows;

-- Check agents exist
SELECT COUNT(*) as agent_count FROM agents;

-- Check recent activity
SELECT COUNT(*) as recent_executions FROM agent_executions WHERE created_at > NOW() - INTERVAL '1 day';
EOF

# Test API functionality
curl -s https://control-plane.so1.io/api/v1/workflows \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq 'length'

Step 5: Resume Normal Operations

# Restart services if needed
railway service restart control-plane-api

# Notify team
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H "Content-Type: application/json" \
  -d '{
    "text": "✅ Database restoration completed successfully",
    "attachments": [{
      "color": "good",
      "fields": [
        {"title": "Database", "value": "control-plane-db", "short": true},
        {"title": "Backup", "value": "'"$BACKUP_FILE"'", "short": true},
        {"title": "Timestamp", "value": "'"$(date -Iseconds)"'", "short": false}
      ]
    }]
  }'

Procedure 3: Backup Configuration and Code

Step 1: Backup Railway Configuration

# Export Railway service configurations
railway service list --json > railway_services_backup_$(date +%Y%m%d).json

# Backup environment variables (encrypted)
for service in control-plane-api console n8n; do
  railway variables --service $service --json > "railway_vars_${service}_$(date +%Y%m%d).json"
done

# Store in secure location (encrypted)
tar -czf railway_config_backup_$(date +%Y%m%d).tar.gz railway_*.json
gpg --encrypt --recipient ops@so1.io railway_config_backup_$(date +%Y%m%d).tar.gz

# Upload to secure storage
aws s3 cp railway_config_backup_$(date +%Y%m%d).tar.gz.gpg \
  s3://so1-backups/configurations/ \
  --sse aws:kms
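Before uploading, it is worth confirming the tarball is readable and contains what you expect. A quick check (the entry-count threshold is illustrative):

```shell
# Sanity-check an archive before upload: it must list cleanly and contain
# at least the expected number of entries.
verify_archive() {
  local archive="$1" expected="$2"
  tar -tzf "$archive" >/dev/null 2>&1 \
    || { echo "archive $archive is unreadable" >&2; return 1; }
  local count
  count=$(tar -tzf "$archive" | wc -l | tr -d ' ')
  if [ "$count" -lt "$expected" ]; then
    echo "archive $archive has $count entries, expected >= $expected" >&2
    return 1
  fi
  echo "archive OK ($count entries)"
}

# Example: verify_archive "railway_config_backup_$(date +%Y%m%d).tar.gz" 3
```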

Step 2: Backup Veritas Prompt Library

# Clone Veritas repository (if not already local)
git clone https://github.com/so1-io/veritas.git /tmp/veritas-backup

# Create archive
cd /tmp/veritas-backup
git archive --format=tar.gz --prefix=veritas/ HEAD > ../veritas_backup_$(date +%Y%m%d).tar.gz

# Upload to storage
aws s3 cp ../veritas_backup_$(date +%Y%m%d).tar.gz \
  s3://so1-backups/veritas/

# Verify backup
aws s3 ls s3://so1-backups/veritas/ | tail -1

Step 3: Backup n8n Workflows

# Export all n8n workflows
curl -s https://n8n.so1.io/api/v1/workflows \
  -H "X-N8N-API-KEY: ${N8N_API_KEY}" \
  | jq '.' > n8n_workflows_backup_$(date +%Y%m%d).json

# Upload to storage
aws s3 cp n8n_workflows_backup_$(date +%Y%m%d).json \
  s3://so1-backups/n8n-workflows/

# Alternative: Backup n8n database (includes credentials, executions)
# See Procedure 1 for database backup

Procedure 4: Test Disaster Recovery

Disaster Recovery (DR) testing should be performed quarterly to ensure procedures are current and effective.

Step 1: Define DR Test Scope

interface DRTest {
  name: string;
  scenario: string;
  objectives: string[];
  success_criteria: string[];
  estimated_duration: string;
  team: string[];
}

const drTest: DRTest = {
  name: "Q1 2026 DR Test",
  scenario: "Complete data center failure - restore all services from backups",
  objectives: [
    "Restore Control Plane database from latest backup",
    "Restore n8n workflows and configurations",
    "Restore Veritas prompt library",
    "Verify all services operational",
  ],
  success_criteria: [
    "RTO < 4 hours (time to restore services)",
    "RPO < 24 hours (maximum data loss)",
    "All critical services passing health checks",
    "Sample workflows execute successfully",
  ],
  estimated_duration: "4 hours",
  team: ["sre-lead", "devops-engineer", "platform-architect"],
};

Step 2: Create Test Environment

# Create isolated Railway environment for testing
railway environment create dr-test-q1-2026

# Deploy services to test environment
railway service deploy control-plane-api --environment dr-test-q1-2026

# DO NOT use production environment for DR testing

Step 3: Execute DR Test

# Start DR test timer
DR_START=$(date +%s)

# 1. Restore databases
echo "Step 1: Restoring databases..."
# Use Procedure 2 to restore from latest backup

# 2. Restore configurations
echo "Step 2: Restoring configurations..."
# Download and decrypt Railway config backup
aws s3 cp s3://so1-backups/configurations/railway_config_backup_latest.tar.gz.gpg .
gpg --decrypt railway_config_backup_latest.tar.gz.gpg | tar -xzf -

# Apply configurations
for config in railway_vars_*.json; do
  service=$(echo $config | cut -d'_' -f3)
  jq -r 'to_entries[] | "\(.key)=\(.value)"' $config | while read var; do
    railway variables set $var --service $service --environment dr-test-q1-2026
  done
done

# 3. Restore Veritas
echo "Step 3: Restoring Veritas..."
aws s3 cp s3://so1-backups/veritas/veritas_backup_latest.tar.gz .
tar -xzf veritas_backup_latest.tar.gz
cd veritas && git push dr-test --all --force  # "dr-test" remote must point at the test repository, never at the production origin

# 4. Verify services
echo "Step 4: Verifying services..."
for service in control-plane console n8n; do
  health_url="https://${service}.dr-test.so1.io/health"
  status=$(curl -s -o /dev/null -w "%{http_code}" $health_url)
  echo "$service: $status"
done

# Calculate RTO
DR_END=$(date +%s)
RTO=$((DR_END - DR_START))
echo "RTO: $((RTO / 60)) minutes"
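The timer arithmetic above can be wrapped into a pass/fail against the 4-hour RTO target defined in Step 1 (the function name is illustrative):

```shell
# Evaluate measured recovery time against an RTO target (minutes).
rto_within_target() {
  local start_s="$1" end_s="$2" target_min="${3:-240}"
  local minutes=$(( (end_s - start_s) / 60 ))
  echo "RTO: ${minutes} minutes (target: <${target_min})"
  [ "$minutes" -lt "$target_min" ]
}

# Example: rto_within_target "$DR_START" "$DR_END" 240 || echo "RTO TARGET MISSED"
```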

Step 4: Document Test Results

# Generate DR test report
cat > dr_test_report_$(date +%Y%m%d).md <<EOF
# Disaster Recovery Test Report

**Date**: $(date -Iseconds)
**Test Name**: Q1 2026 DR Test
**Scenario**: Complete data center failure

## Results

- **RTO Achieved**: $((RTO / 60)) minutes (Target: <240 minutes)
- **RPO**: 12 hours (last backup: $(aws s3 ls s3://so1-backups/databases/control-plane/ | tail -1 | awk '{print $1, $2}'))
- **Services Restored**: 3/3
- **Data Integrity**: ✅ Verified
- **Functional Tests**: ✅ Passed

## Issues Encountered

1. Database restoration took longer than expected (90 minutes)
   - Resolution: Need to optimize backup compression
2. Railway environment variables required manual re-entry
   - Resolution: Automate variable restoration

## Recommendations

1. Increase backup frequency to every 6 hours
2. Automate Railway configuration restoration
3. Document dependencies between services
4. Schedule next DR test for Q2 2026

## Sign-off

- SRE Lead: _______________
- DevOps Engineer: _______________
- Platform Architect: _______________
EOF

# Upload report
aws s3 cp dr_test_report_$(date +%Y%m%d).md s3://so1-backups/dr-reports/

Procedure 5: Emergency Recovery

Step 1: Assess Incident Severity

When disaster is detected:
# Quickly assess what's down
SERVICES=("control-plane" "console" "n8n")
for service in "${SERVICES[@]}"; do
  status=$(curl -s -o /dev/null -w "%{http_code}" "https://${service}.so1.io/health")
  if [ "$status" != "200" ]; then
    echo "🔴 $service: DOWN ($status)"
  else
    echo "✅ $service: UP"
  fi
done

# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1" >/dev/null 2>&1 && echo "✅ Database: UP" || echo "🔴 Database: DOWN"
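The HTTP loop above can be factored into a small classifier (helper name is illustrative) so ad-hoc checks report in the same format:

```shell
# Map an HTTP status code to the UP/DOWN report format used above.
classify_health() {
  local service="$1" code="$2"
  if [ "$code" = "200" ]; then
    echo "✅ $service: UP"
  else
    echo "🔴 $service: DOWN ($code)"
  fi
}

classify_health control-plane 200   # → ✅ control-plane: UP
classify_health n8n 503             # → 🔴 n8n: DOWN (503)
```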

Step 2: Declare DR Incident

Only declare a DR incident for catastrophic failures (multiple services down, data corruption, region outage). For single service failures, use standard incident response.

# Notify team via Slack
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H "Content-Type: application/json" \
  -d '{
    "text": "🚨 DISASTER RECOVERY INCIDENT DECLARED",
    "attachments": [{
      "color": "danger",
      "title": "DR Incident: Complete Platform Outage",
      "fields": [
        {"title": "Severity", "value": "SEV0", "short": true},
        {"title": "Incident Commander", "value": "@oncall-sre", "short": true},
        {"title": "Status", "value": "Recovery in progress", "short": false}
      ]
    }]
  }'

# Create incident channel
# Manual step: Create #incident-dr-YYYYMMDD channel

Step 3: Execute Emergency Restoration

Follow Procedure 2 (Database Restoration) with these modifications:

# Use most recent verified backup
LATEST_BACKUP=$(aws s3 ls s3://so1-backups/databases/control-plane/ | grep '\.sql$' | sort -r | head -1 | awk '{print $4}')

# Parallel restoration (if multiple DBs affected)
(pg_restore --dbname="$CONTROL_PLANE_DB_URL" "$LATEST_BACKUP") &
(pg_restore --dbname="$N8N_DB_URL" "n8n_backup.sql") &
(pg_restore --dbname="$VERITAS_DB_URL" "veritas_backup.sql") &

# Wait for all restorations
wait

# Verify critical data
psql "$CONTROL_PLANE_DB_URL" -c "SELECT COUNT(*) FROM workflows;"
psql "$N8N_DB_URL" -c "SELECT COUNT(*) FROM workflow_entity;"
psql "$VERITAS_DB_URL" -c "SELECT COUNT(*) FROM prompts;"
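Backup selection is easy to get wrong under pressure, so the filtering logic is worth exercising offline against sample `aws s3 ls` output (the listing below is fabricated for illustration; note the extension anchor that excludes .sha256 sidecars):

```shell
# Select the newest .sql backup from `aws s3 ls` output (DATE TIME SIZE KEY),
# excluding .sha256 sidecar files by anchoring on the extension.
latest_backup() {
  grep '\.sql$' | sort -r -k4,4 | head -1 | awk '{print $4}'
}

listing='2026-03-09 02:00:05    52428800 backup_control_plane_20260309_020000.sql
2026-03-10 02:00:07    52430000 backup_control_plane_20260310_020000.sql
2026-03-10 02:00:09         120 backup_control_plane_20260310_020000.sql.sha256'

echo "$listing" | latest_backup
# → backup_control_plane_20260310_020000.sql
```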

Step 4: Validate Recovery

# Run smoke tests
curl -s https://control-plane.so1.io/api/v1/workflows | jq 'length'
curl -s https://console.so1.io | grep -q "SO1 Platform"
curl -s https://n8n.so1.io/healthz | jq '.status'

# Test critical workflow
curl -X POST https://control-plane.so1.io/api/v1/agents/workflow-architect/execute \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  -d '{"input": "test recovery"}' | jq '.status'

Step 5: Communicate Status

# Update incident channel
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H "Content-Type: application/json" \
  -d '{
    "text": "✅ DISASTER RECOVERY COMPLETED",
    "attachments": [{
      "color": "good",
      "title": "Services Restored",
      "fields": [
        {"title": "RTO", "value": "3.5 hours", "short": true},
        {"title": "Data Loss", "value": "6 hours (RPO)", "short": true},
        {"title": "Services", "value": "All services operational", "short": false}
      ]
    }]
  }'

# Create postmortem (see Incident Response runbook)

Verification Checklist

After completing backup/recovery operations, verify:

  • Backup file created and checksum recorded
  • Backup uploaded to offsite storage and listed successfully
  • Checksum verified after download (before any restore)
  • Restored database passes record-count and health checks
  • All services return 200 on their health endpoints
  • Team notified of completion via Slack

Troubleshooting

| Issue | Symptoms | Root Cause | Resolution |
|---|---|---|---|
| Backup Fails | pg_dump exits with error | Insufficient disk space, connection timeout | Check disk space, increase timeout, verify DB connection |
| S3 Upload Fails | AWS CLI returns 403/500 | Invalid credentials, bucket policy | Verify AWS credentials, check bucket permissions |
| Restoration Slow | pg_restore takes >2 hours | Large database, network latency | Use --jobs flag for parallel restore, restore from same region |
| Data Integrity Issues | Corrupted data after restore | Bad backup, incomplete restore | Verify backup checksum before restore, check pg_restore logs |
| Missing Recent Data | Latest transactions not in backup | Backup timing, RPO exceeded | Restore from more recent backup, review backup frequency |
| Service Won't Start | Health checks fail after restore | Schema mismatch, missing migrations | Check migration status, run pending migrations |
| Checksum Mismatch | Backup file checksum doesn't match | File corruption during transfer | Re-download backup, verify S3 object integrity |

Detailed Troubleshooting: Restoration Failed

# Check pg_restore logs
pg_restore --dbname="$DATABASE_URL" "$BACKUP_FILE" 2>&1 | tee restore.log
grep ERROR restore.log

# Common errors:

# 1. "ERROR: relation already exists"
# Solution: Add --clean flag to drop existing objects
pg_restore --dbname="$DATABASE_URL" --clean --if-exists "$BACKUP_FILE"

# 2. "ERROR: permission denied"
# Solution: Ensure database user has sufficient privileges
# (SUPERUSER shown for illustration only; grant the minimum your restore actually needs)
psql "$DATABASE_URL" -c "ALTER USER dbuser WITH SUPERUSER;"

# 3. "ERROR: could not open file"
# Solution: Verify backup file integrity
sha256sum -c "${BACKUP_FILE}.sha256"

# 4. Restoration hangs
# Solution: Use verbose mode and check for blocking queries
pg_restore --dbname="$DATABASE_URL" --verbose "$BACKUP_FILE"

# Check for locks
psql "$DATABASE_URL" -c "SELECT * FROM pg_locks WHERE NOT granted;"


Best Practices

Backup Strategy

  1. Follow 3-2-1 rule: 3 copies, 2 different media, 1 offsite
  2. Automate backups: Never rely on manual backups alone
  3. Verify backups: Test restoration quarterly
  4. Encrypt sensitive data: Use encryption at rest and in transit
  5. Document procedures: Keep runbooks updated

Retention Policy

  1. Daily backups: Keep for 30 days
  2. Weekly backups: Keep for 12 weeks
  3. Monthly backups: Keep for 12 months
  4. Yearly backups: Keep for 7 years (compliance)
  5. Auto-delete old backups: Enforce retention with lifecycle policies
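The tiers above can be expressed as a single decision function, useful for spot-checking what a lifecycle policy should be doing (a sketch; cadence labels are the ones listed above):

```shell
# Decide whether a backup of a given cadence and age (in days) is still
# inside its retention window under the tiered policy.
keep_backup() {
  local cadence="$1" age_days="$2"
  case "$cadence" in
    daily)   [ "$age_days" -le 30 ] ;;
    weekly)  [ "$age_days" -le $((12 * 7)) ] ;;
    monthly) [ "$age_days" -le 365 ] ;;
    yearly)  [ "$age_days" -le $((7 * 365)) ] ;;
    *)       return 1 ;;
  esac
}

# Example: keep_backup daily 40 exits non-zero (outside the 30-day window)
```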

Disaster Recovery

  1. Define RTO/RPO: Recovery Time Objective <4h, Recovery Point Objective <24h
  2. Test regularly: Quarterly DR drills
  3. Document everything: Clear procedures, contact lists
  4. Automate where possible: Reduce human error
  5. Plan for worst case: Assume total failure, no access to primary systems

Security

  1. Encrypt backups: Use GPG or cloud-native encryption
  2. Restrict access: Limit who can restore production data
  3. Audit backup access: Log all backup downloads/restorations
  4. Rotate credentials: Change backup storage credentials quarterly
  5. Separate accounts: Use different AWS accounts for production and backups