
Overview

This runbook covers operational procedures for backing up SO1 platform data, executing disaster recovery protocols, and ensuring business continuity. These procedures minimize data loss and downtime in the event of failures, data corruption, or catastrophic incidents.

Purpose: Provide step-by-step instructions for data protection, backup management, and disaster recovery
Scope: Database backups, configuration backups, disaster recovery testing, restoration procedures
Target Audience: SREs, DevOps engineers, platform operators, incident commanders

Prerequisites

  • Railway project access (database services)
  • PostgreSQL database access (admin privileges)
  • AWS S3 or backup storage access
  • Control Plane API access (CONTROL_PLANE_API_KEY)
  • GitHub repository admin access (configuration backups)
  • Railway CLI (railway command)
  • PostgreSQL client (psql, pg_dump, pg_restore)
  • AWS CLI (for S3 backups)
  • curl or API client
  • jq for JSON parsing
  • Understanding of PostgreSQL backup mechanisms
  • Familiarity with SO1 database schema
  • Basic knowledge of disaster recovery concepts (RPO, RTO)
  • Understanding of Railway platform architecture

Procedure 1: Create Database Backup

Step 1: Identify Databases to Back Up

# List all databases in SO1 platform
DATABASES=(
  "control-plane-db"      # Control Plane API data
  "n8n-db"                # n8n workflow data
  "veritas-db"            # Veritas prompt library
)

# Get database connection strings from Railway
for db in "${DATABASES[@]}"; do
  echo "=== $db ==="
  railway variables --service $db | grep DATABASE_URL
done
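Connection strings scraped from CLI output are easy to mangle. A small guard (a hypothetical helper, not part of the Railway CLI) catches an empty or malformed value before it reaches pg_dump:

```shell
# Hypothetical guard: reject empty or non-Postgres connection strings
# before they are handed to pg_dump/psql.
validate_db_url() {
  case "$1" in
    postgres://*|postgresql://*) return 0 ;;
    *) echo "invalid DATABASE_URL: '${1:-<empty>}'" >&2; return 1 ;;
  esac
}

# Example: validate_db_url "$DATABASE_URL" || exit 1
```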

Step 2: Create Manual Backup

# Backup Control Plane database (-f2- keeps the full value even if it contains '=')
export DATABASE_URL=$(railway variables --service control-plane-db | grep DATABASE_URL | cut -d'=' -f2-)

# Create backup with pg_dump
BACKUP_FILE="backup_control_plane_$(date +%Y%m%d_%H%M%S).sql"

pg_dump "$DATABASE_URL" \
  --format=custom \
  --compress=9 \
  --verbose \
  --file="$BACKUP_FILE"

# Verify backup file created
ls -lh "$BACKUP_FILE"

# Calculate checksum
sha256sum "$BACKUP_FILE" > "${BACKUP_FILE}.sha256"

Step 3: Upload Backup to Storage

# Upload to S3 (or other cloud storage)
aws s3 cp "$BACKUP_FILE" \
  s3://so1-backups/databases/control-plane/ \
  --storage-class STANDARD_IA \
  --metadata "source=control-plane-db,timestamp=$(date -Iseconds),checksum=$(cut -d' ' -f1 "${BACKUP_FILE}.sha256")"

# Upload checksum
aws s3 cp "${BACKUP_FILE}.sha256" \
  s3://so1-backups/databases/control-plane/

# Verify upload
aws s3 ls s3://so1-backups/databases/control-plane/ | grep "$BACKUP_FILE"

# Clean up local backup (optional, after verification)
# rm "$BACKUP_FILE" "${BACKUP_FILE}.sha256"
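The destination layout used above (s3://so1-backups/databases/&lt;service&gt;/) can be derived mechanically, which keeps ad-hoc and automated backups landing in the same place. A sketch (the helper name is illustrative):

```shell
# Illustrative helper: map a database service name to its backup prefix,
# stripping a trailing "-db" so "control-plane-db" lands under "control-plane/".
backup_s3_prefix() {
  local bucket="$1" db="$2"
  echo "s3://${bucket}/databases/${db%-db}/"
}

backup_s3_prefix so1-backups control-plane-db
# → s3://so1-backups/databases/control-plane/
```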

Step 4: Automate Backup with n8n

# Create automated backup workflow
curl -X POST https://n8n.so1.io/api/v1/workflows \
  -H "X-N8N-API-KEY: ${N8N_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Database Backup - Daily",
    "nodes": [
      {
        "name": "Schedule",
        "type": "n8n-nodes-base.scheduleTrigger",
        "parameters": {
          "cronExpression": "0 2 * * *"
        }
      },
      {
        "name": "Trigger Backup",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://control-plane.so1.io/api/v1/admin/backup/create",
          "method": "POST",
          "authentication": "genericCredentialType",
          "headers": {
            "Authorization": "Bearer {{$env.CONTROL_PLANE_API_KEY}}"
          },
          "jsonParameters": true,
          "bodyParameters": {
            "databases": ["control-plane-db", "n8n-db", "veritas-db"],
            "storage": "s3://so1-backups/databases/",
            "retention_days": 30
          }
        }
      },
      {
        "name": "Verify Backup",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://control-plane.so1.io/api/v1/admin/backup/verify",
          "method": "POST",
          "bodyParameters": {
            "backup_id": "={{$json.backup_id}}"
          }
        }
      },
      {
        "name": "Notify Success",
        "type": "n8n-nodes-base.slack",
        "parameters": {
          "channel": "#ops-notifications",
          "text": "✅ Daily database backup completed successfully\nBackup ID: {{$json.backup_id}}\nSize: {{$json.size_mb}}MB"
        }
      }
    ],
    "active": true
  }'
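Before POSTing a workflow definition like the one above, it is cheap to validate the payload locally. A minimal pre-flight check with jq (the field names mirror the request body above; the inline payload is a trimmed sample):

```shell
# Pre-flight check: payload must be valid JSON and contain a schedule trigger.
payload='{"name":"Database Backup - Daily","nodes":[{"name":"Schedule","type":"n8n-nodes-base.scheduleTrigger"}],"active":true}'

echo "$payload" | jq -e '.nodes | length > 0' >/dev/null \
  || { echo "payload has no nodes" >&2; exit 1; }
echo "$payload" | jq -e '[.nodes[].type] | any(. == "n8n-nodes-base.scheduleTrigger")' >/dev/null \
  || { echo "payload has no schedule trigger" >&2; exit 1; }
echo "payload OK"
```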

Procedure 2: Restore Database from Backup

Step 1: Identify Backup to Restore

# List available backups
aws s3 ls s3://so1-backups/databases/control-plane/ --recursive | sort -r

# Get specific backup
BACKUP_FILE="backup_control_plane_20260310_020000.sql"

# Download backup
aws s3 cp "s3://so1-backups/databases/control-plane/${BACKUP_FILE}" .

# Download and verify checksum
aws s3 cp "s3://so1-backups/databases/control-plane/${BACKUP_FILE}.sha256" .
sha256sum -c "${BACKUP_FILE}.sha256"
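Since restoration is destructive, the checksum check is worth hard-failing on. A small wrapper (assumes the .sha256 file created in Procedure 1 sits next to the backup):

```shell
# Refuse to proceed with a restore unless the checksum verifies.
verify_backup() {
  local f="$1"
  if ! sha256sum --status -c "${f}.sha256"; then
    echo "ABORT: checksum mismatch for $f -- do not restore" >&2
    return 1
  fi
  echo "checksum OK: $f"
}

# Example: verify_backup "$BACKUP_FILE" || exit 1
```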

Step 2: Prepare for Restoration

Database restoration is a destructive operation. Always test in a staging environment first and notify the team before restoring production databases.

# Create snapshot of current database (safety measure)
SNAPSHOT_FILE="snapshot_before_restore_$(date +%Y%m%d_%H%M%S).sql"
pg_dump "$DATABASE_URL" --format=custom --file="$SNAPSHOT_FILE"

# Terminate active connections to database
psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = current_database() AND pid <> pg_backend_pid();"

# Optionally: Create new database for restoration
# psql "$DATABASE_URL" -c "CREATE DATABASE control_plane_restore;"

Step 3: Execute Restoration

# Restore database from backup
pg_restore \
  --dbname="$DATABASE_URL" \
  --clean \
  --if-exists \
  --verbose \
  "$BACKUP_FILE"

# Check restoration status
echo $?  # Should be 0 for success

# Verify record counts
psql "$DATABASE_URL" -c "SELECT schemaname, tablename, n_live_tup as rows FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"

Step 4: Verify Restoration

# Run health checks
curl -s https://control-plane.so1.io/health | jq '.'

# Verify critical data
psql "$DATABASE_URL" <<EOF
-- Check workflows exist
SELECT COUNT(*) as workflow_count FROM workflows;

-- Check agents exist
SELECT COUNT(*) as agent_count FROM agents;

-- Check recent activity
SELECT COUNT(*) as recent_executions FROM agent_executions WHERE created_at > NOW() - INTERVAL '1 day';
EOF

# Test API functionality
curl -s https://control-plane.so1.io/api/v1/workflows \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq 'length'

Step 5: Resume Normal Operations

# Restart services if needed
railway service restart control-plane-api

# Notify team
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H "Content-Type: application/json" \
  -d '{
    "text": "✅ Database restoration completed successfully",
    "attachments": [{
      "color": "good",
      "fields": [
        {"title": "Database", "value": "control-plane-db", "short": true},
        {"title": "Backup", "value": "'"$BACKUP_FILE"'", "short": true},
        {"title": "Timestamp", "value": "'"$(date -Iseconds)"'", "short": false}
      ]
    }]
  }'

Procedure 3: Backup Configuration and Code

Step 1: Backup Railway Configuration

# Export Railway service configurations
railway service list --json > railway_services_backup_$(date +%Y%m%d).json

# Backup environment variables (encrypted)
for service in control-plane-api console n8n; do
  railway variables --service $service --json > "railway_vars_${service}_$(date +%Y%m%d).json"
done

# Store in secure location (encrypted)
tar -czf railway_config_backup_$(date +%Y%m%d).tar.gz railway_*.json
gpg --encrypt --recipient ops@so1.io railway_config_backup_$(date +%Y%m%d).tar.gz

# Upload to secure storage
aws s3 cp railway_config_backup_$(date +%Y%m%d).tar.gz.gpg \
  s3://so1-backups/configurations/ \
  --sse aws:kms
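Before uploading, it is worth confirming the tarball is readable and contains what you expect. A quick check (the entry-count threshold is illustrative):

```shell
# Sanity-check an archive before upload: it must list cleanly and contain
# at least the expected number of entries.
verify_archive() {
  local archive="$1" expected="$2"
  tar -tzf "$archive" >/dev/null 2>&1 \
    || { echo "archive $archive is unreadable" >&2; return 1; }
  local count
  count=$(tar -tzf "$archive" | wc -l | tr -d ' ')
  if [ "$count" -lt "$expected" ]; then
    echo "archive $archive has $count entries, expected >= $expected" >&2
    return 1
  fi
  echo "archive OK ($count entries)"
}

# Example: verify_archive "railway_config_backup_$(date +%Y%m%d).tar.gz" 3
```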

Step 2: Backup Veritas Prompt Library

# Clone Veritas repository (if not already local)
git clone https://github.com/so1-io/veritas.git /tmp/veritas-backup

# Create archive
cd /tmp/veritas-backup
git archive --format=tar.gz --prefix=veritas/ HEAD > ../veritas_backup_$(date +%Y%m%d).tar.gz

# Upload to storage
aws s3 cp ../veritas_backup_$(date +%Y%m%d).tar.gz \
  s3://so1-backups/veritas/

# Verify backup
aws s3 ls s3://so1-backups/veritas/ | tail -1

Step 3: Backup n8n Workflows

# Export all n8n workflows
curl -s https://n8n.so1.io/api/v1/workflows \
  -H "X-N8N-API-KEY: ${N8N_API_KEY}" \
  | jq '.' > n8n_workflows_backup_$(date +%Y%m%d).json

# Upload to storage
aws s3 cp n8n_workflows_backup_$(date +%Y%m%d).json \
  s3://so1-backups/n8n-workflows/

# Alternative: Backup n8n database (includes credentials, executions)
# See Procedure 1 for database backup

Procedure 4: Test Disaster Recovery

Disaster Recovery (DR) testing should be performed quarterly to ensure procedures are current and effective.

Step 1: Define DR Test Scope

interface DRTest {
  name: string;
  scenario: string;
  objectives: string[];
  success_criteria: string[];
  estimated_duration: string;
  team: string[];
}

const drTest: DRTest = {
  name: "Q1 2026 DR Test",
  scenario: "Complete data center failure - restore all services from backups",
  objectives: [
    "Restore Control Plane database from latest backup",
    "Restore n8n workflows and configurations",
    "Restore Veritas prompt library",
    "Verify all services operational",
  ],
  success_criteria: [
    "RTO < 4 hours (time to restore services)",
    "RPO < 24 hours (maximum data loss)",
    "All critical services passing health checks",
    "Sample workflows execute successfully",
  ],
  estimated_duration: "4 hours",
  team: ["sre-lead", "devops-engineer", "platform-architect"],
};

Step 2: Create Test Environment

# Create isolated Railway environment for testing
railway environment create dr-test-q1-2026

# Deploy services to test environment
railway service deploy control-plane-api --environment dr-test-q1-2026

# DO NOT use production environment for DR testing

Step 3: Execute DR Test

# Start DR test timer
DR_START=$(date +%s)

# 1. Restore databases
echo "Step 1: Restoring databases..."
# Use Procedure 2 to restore from latest backup

# 2. Restore configurations
echo "Step 2: Restoring configurations..."
# Download and decrypt Railway config backup
aws s3 cp s3://so1-backups/configurations/railway_config_backup_latest.tar.gz.gpg .
gpg --decrypt railway_config_backup_latest.tar.gz.gpg | tar -xzf -

# Apply configurations
for config in railway_vars_*.json; do
  service=$(echo $config | cut -d'_' -f3)
  jq -r 'to_entries[] | "\(.key)=\(.value)"' $config | while read var; do
    railway variables set $var --service $service --environment dr-test-q1-2026
  done
done

# 3. Restore Veritas
echo "Step 3: Restoring Veritas..."
aws s3 cp s3://so1-backups/veritas/veritas_backup_latest.tar.gz .
tar -xzf veritas_backup_latest.tar.gz
cd veritas && git push dr-test --all --force  # "dr-test" remote must point at the test repository, never at the production origin

# 4. Verify services
echo "Step 4: Verifying services..."
for service in control-plane console n8n; do
  health_url="https://${service}.dr-test.so1.io/health"
  status=$(curl -s -o /dev/null -w "%{http_code}" $health_url)
  echo "$service: $status"
done

# Calculate RTO
DR_END=$(date +%s)
RTO=$((DR_END - DR_START))
echo "RTO: $((RTO / 60)) minutes"
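The timer arithmetic above can be wrapped into a pass/fail against the 4-hour RTO target defined in Step 1 (the function name is illustrative):

```shell
# Evaluate measured recovery time against an RTO target (minutes).
rto_within_target() {
  local start_s="$1" end_s="$2" target_min="${3:-240}"
  local minutes=$(( (end_s - start_s) / 60 ))
  echo "RTO: ${minutes} minutes (target: <${target_min})"
  [ "$minutes" -lt "$target_min" ]
}

# Example: rto_within_target "$DR_START" "$DR_END" 240 || echo "RTO TARGET MISSED"
```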

Step 4: Document Test Results

# Generate DR test report
cat > dr_test_report_$(date +%Y%m%d).md <<EOF
# Disaster Recovery Test Report

**Date**: $(date -Iseconds)
**Test Name**: Q1 2026 DR Test
**Scenario**: Complete data center failure

## Results

- **RTO Achieved**: $((RTO / 60)) minutes (Target: <240 minutes)
- **RPO**: 12 hours (last backup: $(aws s3 ls s3://so1-backups/databases/control-plane/ | tail -1 | awk '{print $1, $2}'))
- **Services Restored**: 3/3
- **Data Integrity**: ✅ Verified
- **Functional Tests**: ✅ Passed

## Issues Encountered

1. Database restoration took longer than expected (90 minutes)
   - Resolution: Need to optimize backup compression
2. Railway environment variables required manual re-entry
   - Resolution: Automate variable restoration

## Recommendations

1. Increase backup frequency to every 6 hours
2. Automate Railway configuration restoration
3. Document dependencies between services
4. Schedule next DR test for Q2 2026

## Sign-off

- SRE Lead: _______________
- DevOps Engineer: _______________
- Platform Architect: _______________
EOF

# Upload report
aws s3 cp dr_test_report_$(date +%Y%m%d).md s3://so1-backups/dr-reports/

Procedure 5: Emergency Recovery

Step 1: Assess Incident Severity

When disaster is detected:
# Quickly assess what's down
SERVICES=("control-plane" "console" "n8n")
for service in "${SERVICES[@]}"; do
  status=$(curl -s -o /dev/null -w "%{http_code}" "https://${service}.so1.io/health")
  if [ "$status" != "200" ]; then
    echo "🔴 $service: DOWN ($status)"
  else
    echo "✅ $service: UP"
  fi
done

# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1" >/dev/null 2>&1 && echo "✅ Database: UP" || echo "🔴 Database: DOWN"
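The HTTP loop above can be factored into a small classifier (helper name is illustrative) so ad-hoc checks report in the same format:

```shell
# Map an HTTP status code to the UP/DOWN report format used above.
classify_health() {
  local service="$1" code="$2"
  if [ "$code" = "200" ]; then
    echo "✅ $service: UP"
  else
    echo "🔴 $service: DOWN ($code)"
  fi
}

classify_health control-plane 200   # → ✅ control-plane: UP
classify_health n8n 503             # → 🔴 n8n: DOWN (503)
```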

Step 2: Declare DR Incident

Only declare a DR incident for catastrophic failures (multiple services down, data corruption, region outage). For single service failures, use standard incident response.

# Notify team via Slack
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H "Content-Type: application/json" \
  -d '{
    "text": "🚨 DISASTER RECOVERY INCIDENT DECLARED",
    "attachments": [{
      "color": "danger",
      "title": "DR Incident: Complete Platform Outage",
      "fields": [
        {"title": "Severity", "value": "SEV0", "short": true},
        {"title": "Incident Commander", "value": "@oncall-sre", "short": true},
        {"title": "Status", "value": "Recovery in progress", "short": false}
      ]
    }]
  }'

# Create incident channel
# Manual step: Create #incident-dr-YYYYMMDD channel

Step 3: Execute Emergency Restoration

Follow Procedure 2 (Database Restoration) with these modifications:

# Use most recent verified backup
LATEST_BACKUP=$(aws s3 ls s3://so1-backups/databases/control-plane/ | grep '\.sql$' | sort -r | head -1 | awk '{print $4}')

# Parallel restoration (if multiple DBs affected)
(pg_restore --dbname="$CONTROL_PLANE_DB_URL" "$LATEST_BACKUP") &
(pg_restore --dbname="$N8N_DB_URL" "n8n_backup.sql") &
(pg_restore --dbname="$VERITAS_DB_URL" "veritas_backup.sql") &

# Wait for all restorations
wait

# Verify critical data
psql "$CONTROL_PLANE_DB_URL" -c "SELECT COUNT(*) FROM workflows;"
psql "$N8N_DB_URL" -c "SELECT COUNT(*) FROM workflow_entity;"
psql "$VERITAS_DB_URL" -c "SELECT COUNT(*) FROM prompts;"
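Backup selection is easy to get wrong under pressure, so the filtering logic is worth exercising offline against sample `aws s3 ls` output (the listing below is fabricated for illustration; note the extension anchor that excludes .sha256 sidecars):

```shell
# Select the newest .sql backup from `aws s3 ls` output (DATE TIME SIZE KEY),
# excluding .sha256 sidecar files by anchoring on the extension.
latest_backup() {
  grep '\.sql$' | sort -r -k4,4 | head -1 | awk '{print $4}'
}

listing='2026-03-09 02:00:05    52428800 backup_control_plane_20260309_020000.sql
2026-03-10 02:00:07    52430000 backup_control_plane_20260310_020000.sql
2026-03-10 02:00:09         120 backup_control_plane_20260310_020000.sql.sha256'

echo "$listing" | latest_backup
# → backup_control_plane_20260310_020000.sql
```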

Step 4: Validate Recovery

# Run smoke tests
curl -s https://control-plane.so1.io/api/v1/workflows | jq 'length'
curl -s https://console.so1.io | grep -q "SO1 Platform"
curl -s https://n8n.so1.io/healthz | jq '.status'

# Test critical workflow
curl -X POST https://control-plane.so1.io/api/v1/agents/workflow-architect/execute \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  -d '{"input": "test recovery"}' | jq '.status'

Step 5: Communicate Status

# Update incident channel
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H "Content-Type: application/json" \
  -d '{
    "text": "✅ DISASTER RECOVERY COMPLETED",
    "attachments": [{
      "color": "good",
      "title": "Services Restored",
      "fields": [
        {"title": "RTO", "value": "3.5 hours", "short": true},
        {"title": "Data Loss", "value": "6 hours (RPO)", "short": true},
        {"title": "Services", "value": "All services operational", "short": false}
      ]
    }]
  }'

# Create postmortem (see Incident Response runbook)

Verification Checklist

After completing backup/recovery operations, verify:

  • Backup file created and checksum recorded
  • Backup uploaded to offsite storage and listed successfully
  • Checksum verified after download (before any restore)
  • Restored database passes record-count and health checks
  • All services return 200 on their health endpoints
  • Team notified of completion via Slack

Troubleshooting

| Issue | Symptoms | Root Cause | Resolution |
|---|---|---|---|
| Backup Fails | pg_dump exits with error | Insufficient disk space, connection timeout | Check disk space, increase timeout, verify DB connection |
| S3 Upload Fails | AWS CLI returns 403/500 | Invalid credentials, bucket policy | Verify AWS credentials, check bucket permissions |
| Restoration Slow | pg_restore takes >2 hours | Large database, network latency | Use --jobs flag for parallel restore, restore from same region |
| Data Integrity Issues | Corrupted data after restore | Bad backup, incomplete restore | Verify backup checksum before restore, check pg_restore logs |
| Missing Recent Data | Latest transactions not in backup | Backup timing, RPO exceeded | Restore from more recent backup, review backup frequency |
| Service Won't Start | Health checks fail after restore | Schema mismatch, missing migrations | Check migration status, run pending migrations |
| Checksum Mismatch | Backup file checksum doesn't match | File corruption during transfer | Re-download backup, verify S3 object integrity |

Detailed Troubleshooting: Restoration Failed

# Check pg_restore logs
pg_restore --dbname="$DATABASE_URL" "$BACKUP_FILE" 2>&1 | tee restore.log
grep ERROR restore.log

# Common errors:

# 1. "ERROR: relation already exists"
# Solution: Add --clean flag to drop existing objects
pg_restore --dbname="$DATABASE_URL" --clean --if-exists "$BACKUP_FILE"

# 2. "ERROR: permission denied"
# Solution: Ensure database user has sufficient privileges
# (SUPERUSER shown for illustration only; grant the minimum your restore actually needs)
psql "$DATABASE_URL" -c "ALTER USER dbuser WITH SUPERUSER;"

# 3. "ERROR: could not open file"
# Solution: Verify backup file integrity
sha256sum -c "${BACKUP_FILE}.sha256"

# 4. Restoration hangs
# Solution: Use verbose mode and check for blocking queries
pg_restore --dbname="$DATABASE_URL" --verbose "$BACKUP_FILE"

# Check for locks
psql "$DATABASE_URL" -c "SELECT * FROM pg_locks WHERE NOT granted;"


Best Practices

Backup Strategy

  1. Follow 3-2-1 rule: 3 copies, 2 different media, 1 offsite
  2. Automate backups: Never rely on manual backups alone
  3. Verify backups: Test restoration quarterly
  4. Encrypt sensitive data: Use encryption at rest and in transit
  5. Document procedures: Keep runbooks updated

Retention Policy

  1. Daily backups: Keep for 30 days
  2. Weekly backups: Keep for 12 weeks
  3. Monthly backups: Keep for 12 months
  4. Yearly backups: Keep for 7 years (compliance)
  5. Auto-delete old backups: Enforce retention with lifecycle policies
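The tiers above can be expressed as a single decision function, useful for spot-checking what a lifecycle policy should be doing (a sketch; cadence labels are the ones listed above):

```shell
# Decide whether a backup of a given cadence and age (in days) is still
# inside its retention window under the tiered policy.
keep_backup() {
  local cadence="$1" age_days="$2"
  case "$cadence" in
    daily)   [ "$age_days" -le 30 ] ;;
    weekly)  [ "$age_days" -le $((12 * 7)) ] ;;
    monthly) [ "$age_days" -le 365 ] ;;
    yearly)  [ "$age_days" -le $((7 * 365)) ] ;;
    *)       return 1 ;;
  esac
}

# Example: keep_backup daily 40 exits non-zero (outside the 30-day window)
```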

Disaster Recovery

  1. Define RTO/RPO: Recovery Time Objective <4h, Recovery Point Objective <24h
  2. Test regularly: Quarterly DR drills
  3. Document everything: Clear procedures, contact lists
  4. Automate where possible: Reduce human error
  5. Plan for worst case: Assume total failure, no access to primary systems

Security

  1. Encrypt backups: Use GPG or cloud-native encryption
  2. Restrict access: Limit who can restore production data
  3. Audit backup access: Log all backup downloads/restorations
  4. Rotate credentials: Change backup storage credentials quarterly
  5. Separate accounts: Use different AWS accounts for production and backups