Overview
This runbook covers operational procedures for backing up SO1 platform data, executing disaster recovery protocols, and ensuring business continuity. These procedures minimize data loss and downtime in the event of failures, data corruption, or catastrophic incidents.
Purpose: Provide step-by-step instructions for data protection, backup management, and disaster recovery
Scope: Database backups, configuration backups, disaster recovery testing, restoration procedures
Target Audience: SREs, DevOps engineers, platform operators, incident commanders
Prerequisites
Railway project access (database services)
PostgreSQL database access (admin privileges)
AWS S3 or backup storage access
Control Plane API access (CONTROL_PLANE_API_KEY)
GitHub repository admin access (configuration backups)
Understanding of PostgreSQL backup mechanisms
Familiarity with SO1 database schema
Basic knowledge of disaster recovery concepts (RPO, RTO)
Understanding of Railway platform architecture
Procedure 1: Create Database Backup
Step 1: Identify Databases to Back Up
# List all databases in SO1 platform
DATABASES=(
  "control-plane-db"  # Control Plane API data
  "n8n-db"            # n8n workflow data
  "veritas-db"        # Veritas prompt library
)
# Get database connection strings from Railway
for db in "${DATABASES[@]}"; do
  echo "=== $db ==="
  railway variables --service "$db" | grep DATABASE_URL
done
Step 2: Create Manual Backup
# Backup Control Plane database
# Note: -f2- keeps the full value even if the URL itself contains '='
export DATABASE_URL=$(railway variables --service control-plane-db | grep DATABASE_URL | cut -d'=' -f2-)
# Create backup with pg_dump
BACKUP_FILE="backup_control_plane_$(date +%Y%m%d_%H%M%S).sql"
pg_dump "$DATABASE_URL" \
  --format=custom \
  --compress=9 \
  --verbose \
  --file="$BACKUP_FILE"
# Verify backup file created
ls -lh "$BACKUP_FILE"
# Calculate checksum
sha256sum "$BACKUP_FILE" > "${BACKUP_FILE}.sha256"
Step 3: Upload Backup to Storage
# Upload to S3 (or other cloud storage)
aws s3 cp "$BACKUP_FILE" \
  s3://so1-backups/databases/control-plane/ \
  --storage-class STANDARD_IA \
  --metadata "source=control-plane-db,timestamp=$(date -Iseconds),checksum=$(awk '{print $1}' "${BACKUP_FILE}.sha256")"
# Upload checksum
aws s3 cp "${BACKUP_FILE}.sha256" \
  s3://so1-backups/databases/control-plane/
# Verify upload
aws s3 ls s3://so1-backups/databases/control-plane/ | grep "$BACKUP_FILE"
# Clean up local backup (optional, after verification)
# rm "$BACKUP_FILE" "${BACKUP_FILE}.sha256"
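Before running the optional cleanup, it is safer to gate deletion on a fresh checksum check so a corrupted local copy is never silently discarded. A minimal sketch, assuming the backup file and its `.sha256` sidecar sit in the current directory (the `verify_and_cleanup` helper name is illustrative, not part of the platform tooling):

```shell
# Re-verify the checksum and only then delete the local backup copy.
verify_and_cleanup() {
  local backup_file="$1"
  if sha256sum -c --quiet "${backup_file}.sha256"; then
    rm -f "$backup_file" "${backup_file}.sha256"
    echo "checksum OK, local copy removed"
  else
    echo "checksum MISMATCH, keeping local copy" >&2
    return 1
  fi
}
```

On mismatch the helper keeps both files and returns non-zero, so it can be chained with `&&` in a backup script.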
Step 4: Automate Backup with n8n
# Create automated backup workflow
curl -X POST https://n8n.so1.io/api/v1/workflows \
-H "X-N8N-API-KEY: ${N8N_API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"name": "Database Backup - Daily",
"nodes": [
{
"name": "Schedule",
"type": "n8n-nodes-base.scheduleTrigger",
"parameters": {
"cronExpression": "0 2 * * *"
}
},
{
"name": "Trigger Backup",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "https://control-plane.so1.io/api/v1/admin/backup/create",
"method": "POST",
"authentication": "genericCredentialType",
"headers": {
"Authorization": "Bearer {{$env.CONTROL_PLANE_API_KEY}}"
},
"jsonParameters": true,
"bodyParameters": {
"databases": ["control-plane-db", "n8n-db", "veritas-db"],
"storage": "s3://so1-backups/databases/",
"retention_days": 30
}
}
},
{
"name": "Verify Backup",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "https://control-plane.so1.io/api/v1/admin/backup/verify",
"method": "POST",
"bodyParameters": {
"backup_id": "={{$json.backup_id}}"
}
}
},
{
"name": "Notify Success",
"type": "n8n-nodes-base.slack",
"parameters": {
"channel": "#ops-notifications",
"text": "✅ Daily database backup completed successfully\nBackup ID: {{$json.backup_id}}\nSize: {{$json.size_mb}}MB"
}
}
],
"active": true
}'
Procedure 2: Restore Database from Backup
Step 1: Identify Backup to Restore
# List available backups
aws s3 ls s3://so1-backups/databases/control-plane/ --recursive | sort -r
# Get specific backup
BACKUP_FILE="backup_control_plane_20260310_020000.sql"
# Download backup
aws s3 cp "s3://so1-backups/databases/control-plane/${BACKUP_FILE}" .
# Download and verify checksum
aws s3 cp "s3://so1-backups/databases/control-plane/${BACKUP_FILE}.sha256" .
sha256sum -c "${BACKUP_FILE}.sha256"
Step 2: Prepare for Restoration
Database restoration is a destructive operation. Always test in a staging environment first and notify the team before restoring production databases.
# Create snapshot of current database (safety measure)
SNAPSHOT_FILE="snapshot_before_restore_$(date +%Y%m%d_%H%M%S).sql"
pg_dump "$DATABASE_URL" --format=custom --file="$SNAPSHOT_FILE"
# Terminate active connections to database
psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = current_database() AND pid <> pg_backend_pid();"
# Optionally: Create new database for restoration
# psql "$DATABASE_URL" -c "CREATE DATABASE control_plane_restore;"
Step 3: Execute Restoration
# Restore database from backup
pg_restore \
  --dbname="$DATABASE_URL" \
  --clean \
  --if-exists \
  --verbose \
  "$BACKUP_FILE"
# Check restoration status
echo $?  # Should be 0 for success
# Verify record counts
psql "$DATABASE_URL" -c "SELECT schemaname, tablename, n_live_tup AS rows FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"
Step 4: Verify Restoration
# Run health checks
curl -s https://control-plane.so1.io/health | jq '.'
# Verify critical data
psql "$DATABASE_URL" << EOF
-- Check workflows exist
SELECT COUNT(*) as workflow_count FROM workflows;
-- Check agents exist
SELECT COUNT(*) as agent_count FROM agents;
-- Check recent activity
SELECT COUNT(*) as recent_executions FROM agent_executions WHERE created_at > NOW() - INTERVAL '1 day';
EOF
# Test API functionality
curl -s https://control-plane.so1.io/api/v1/workflows \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq 'length'
Step 5: Resume Normal Operations
# Restart services if needed
railway service restart control-plane-api
# Notify team
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H "Content-Type: application/json" \
-d '{
"text": "✅ Database restoration completed successfully",
"attachments": [{
"color": "good",
"fields": [
{"title": "Database", "value": "control-plane-db", "short": true},
{"title": "Backup", "value": "'"$BACKUP_FILE"'", "short": true},
{"title": "Timestamp", "value": "'"$(date -Iseconds)"'", "short": false}
]
}]
}'
Procedure 3: Backup Configuration and Code
Step 1: Backup Railway Configuration
# Export Railway service configurations
railway service list --json > railway_services_backup_$(date +%Y%m%d).json
# Backup environment variables (encrypted)
for service in control-plane-api console n8n; do
  railway variables --service "$service" --json > "railway_vars_${service}_$(date +%Y%m%d).json"
done
# Store in secure location (encrypted)
tar -czf railway_config_backup_$(date +%Y%m%d).tar.gz railway_*.json
gpg --encrypt --recipient ops@so1.io railway_config_backup_$(date +%Y%m%d).tar.gz
# Upload to secure storage
aws s3 cp railway_config_backup_$(date +%Y%m%d).tar.gz.gpg \
  s3://so1-backups/configurations/ \
  --sse aws:kms
Step 2: Backup Veritas Prompt Library
# Clone Veritas repository (if not already local)
git clone https://github.com/so1-io/veritas.git /tmp/veritas-backup
# Create archive
cd /tmp/veritas-backup
git archive --format=tar.gz --prefix=veritas/ HEAD > ../veritas_backup_$(date +%Y%m%d).tar.gz
# Upload to storage
aws s3 cp ../veritas_backup_$(date +%Y%m%d).tar.gz \
  s3://so1-backups/veritas/
# Verify backup
aws s3 ls s3://so1-backups/veritas/ | tail -1
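Beyond confirming the object exists in S3, a downloaded copy of the archive can be sanity-checked by listing its contents without extracting: a truncated or corrupt gzip stream fails immediately. A small sketch (the `archive_ok` helper name is illustrative):

```shell
# List an archive's contents without extracting; a non-zero exit means
# the gzip stream or tar structure is damaged.
archive_ok() {
  tar -tzf "$1" > /dev/null 2>&1
}
```

Usage: `archive_ok veritas_backup_20260310.tar.gz || echo "archive damaged"`.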
Step 3: Backup n8n Workflows
# Export all n8n workflows
curl -s https://n8n.so1.io/api/v1/workflows \
  -H "X-N8N-API-KEY: ${N8N_API_KEY}" \
  | jq '.' > n8n_workflows_backup_$(date +%Y%m%d).json
# Upload to storage
aws s3 cp n8n_workflows_backup_$(date +%Y%m%d).json \
  s3://so1-backups/n8n-workflows/
# Alternative: Backup n8n database (includes credentials, executions)
# See Procedure 1 for database backup
Procedure 4: Test Disaster Recovery
Disaster Recovery (DR) testing should be performed quarterly to ensure procedures are current and effective.
Step 1: Define DR Test Scope
interface DRTest {
  name: string;
  scenario: string;
  objectives: string[];
  success_criteria: string[];
  estimated_duration: string;
  team: string[];
}

const drTest: DRTest = {
  name: "Q1 2026 DR Test",
  scenario: "Complete data center failure - restore all services from backups",
  objectives: [
    "Restore Control Plane database from latest backup",
    "Restore n8n workflows and configurations",
    "Restore Veritas prompt library",
    "Verify all services operational",
  ],
  success_criteria: [
    "RTO < 4 hours (time to restore services)",
    "RPO < 24 hours (maximum data loss)",
    "All critical services passing health checks",
    "Sample workflows execute successfully",
  ],
  estimated_duration: "4 hours",
  team: ["sre-lead", "devops-engineer", "platform-architect"],
};
Step 2: Create Test Environment
# Create isolated Railway environment for testing
railway environment create dr-test-q1-2026
# Deploy services to test environment
railway service deploy control-plane-api --environment dr-test-q1-2026
# DO NOT use production environment for DR testing
Step 3: Execute DR Test
# Start DR test timer
DR_START=$(date +%s)
# 1. Restore databases
echo "Step 1: Restoring databases..."
# Use Procedure 2 to restore from latest backup
# 2. Restore configurations
echo "Step 2: Restoring configurations..."
# Download and decrypt Railway config backup
aws s3 cp s3://so1-backups/configurations/railway_config_backup_latest.tar.gz.gpg .
gpg --decrypt railway_config_backup_latest.tar.gz.gpg | tar -xzf -
# Apply configurations
for config in railway_vars_*.json; do
  service=$(echo "$config" | cut -d'_' -f3)
  jq -r 'to_entries[] | "\(.key)=\(.value)"' "$config" | while read -r var; do
    railway variables set "$var" --service "$service" --environment dr-test-q1-2026
  done
done
# 3. Restore Veritas
echo "Step 3: Restoring Veritas..."
aws s3 cp s3://so1-backups/veritas/veritas_backup_latest.tar.gz .
tar -xzf veritas_backup_latest.tar.gz
cd veritas && git push origin --all --force  # Push to test repo, not production
# 4. Verify services
echo "Step 4: Verifying services..."
for service in control-plane console n8n; do
  health_url="https://${service}.dr-test.so1.io/health"
  status=$(curl -s -o /dev/null -w "%{http_code}" "$health_url")
  echo "$service: $status"
done
# Calculate RTO
DR_END=$(date +%s)
RTO=$(( DR_END - DR_START ))
echo "RTO: $(( RTO / 60 )) minutes"
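The measured RTO can also be checked mechanically against the 4-hour target from the test's success criteria, so the pass/fail call isn't left to mental arithmetic mid-drill. A sketch (`rto_within_target` is a hypothetical helper, not existing tooling):

```shell
# Compare a measured RTO in seconds against a target (default 4h = 14400s).
rto_within_target() {
  local rto_seconds="$1" target_seconds="${2:-14400}"
  if [ "$rto_seconds" -le "$target_seconds" ]; then
    echo "PASS: RTO $(( rto_seconds / 60 )) minutes within target"
  else
    echo "FAIL: RTO $(( rto_seconds / 60 )) minutes exceeds target"
    return 1
  fi
}
```

Usage: `rto_within_target "$RTO"` after the timer above.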
Step 4: Document Test Results
# Generate DR test report
cat > dr_test_report_$(date +%Y%m%d).md << EOF
# Disaster Recovery Test Report
**Date**: $(date -Iseconds)
**Test Name**: Q1 2026 DR Test
**Scenario**: Complete data center failure
## Results
- **RTO Achieved**: $(( RTO / 60 )) minutes (Target: <240 minutes)
- **RPO**: 12 hours (last backup: $(aws s3 ls s3://so1-backups/databases/control-plane/ | tail -1 | awk '{print $1, $2}'))
- **Services Restored**: 3/3
- **Data Integrity**: ✅ Verified
- **Functional Tests**: ✅ Passed
## Issues Encountered
1. Database restoration took longer than expected (90 minutes)
- Resolution: Need to optimize backup compression
2. Railway environment variables required manual re-entry
- Resolution: Automate variable restoration
## Recommendations
1. Increase backup frequency to every 6 hours
2. Automate Railway configuration restoration
3. Document dependencies between services
4. Schedule next DR test for Q2 2026
## Sign-off
- SRE Lead: _______________
- DevOps Engineer: _______________
- Platform Architect: _______________
EOF
# Upload report
aws s3 cp dr_test_report_$(date +%Y%m%d).md s3://so1-backups/dr-reports/
Procedure 5: Emergency Recovery
Step 1: Assess Incident Severity
When disaster is detected:
# Quickly assess what's down
SERVICES=("control-plane" "console" "n8n")
for service in "${SERVICES[@]}"; do
  status=$(curl -s -o /dev/null -w "%{http_code}" "https://${service}.so1.io/health")
  if [ "$status" != "200" ]; then
    echo "🔴 $service: DOWN ($status)"
  else
    echo "✅ $service: UP"
  fi
done
# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1" 2>&1 | grep -q "ERROR" && echo "🔴 Database: DOWN" || echo "✅ Database: UP"
Step 2: Declare DR Incident
Only declare a DR incident for catastrophic failures (multiple services down, data corruption, region outage). For single-service failures, use the standard incident response process.
# Notify team via Slack
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H "Content-Type: application/json" \
-d '{
"text": "🚨 DISASTER RECOVERY INCIDENT DECLARED",
"attachments": [{
"color": "danger",
"title": "DR Incident: Complete Platform Outage",
"fields": [
{"title": "Severity", "value": "SEV0", "short": true},
{"title": "Incident Commander", "value": "@oncall-sre", "short": true},
{"title": "Status", "value": "Recovery in progress", "short": false}
]
}]
}'
# Create incident channel
# Manual step: Create #incident-dr-YYYYMMDD channel
Step 3: Execute Emergency Restoration
Follow Procedure 2 (Database Restoration) with these modifications:
# Use most recent verified backup
LATEST_BACKUP=$(aws s3 ls s3://so1-backups/databases/control-plane/ | grep ".sql" | sort -r | head -1 | awk '{print $4}')
# Parallel restoration (if multiple DBs affected)
(pg_restore --dbname="$CONTROL_PLANE_DB_URL" "$LATEST_BACKUP") &
(pg_restore --dbname="$N8N_DB_URL" "n8n_backup.sql") &
(pg_restore --dbname="$VERITAS_DB_URL" "veritas_backup.sql") &
# Wait for all restorations
wait
# Verify critical data
psql "$CONTROL_PLANE_DB_URL" -c "SELECT COUNT(*) FROM workflows;"
psql "$N8N_DB_URL" -c "SELECT COUNT(*) FROM workflow_entity;"
psql "$VERITAS_DB_URL" -c "SELECT COUNT(*) FROM prompts;"
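Because backup filenames embed their creation timestamp (Procedure 1), the data-loss window can be estimated directly from the restored file's name while the incident is still open. A bash sketch, assuming GNU `date` and the `backup_<name>_YYYYMMDD_HHMMSS.sql` naming used above (`backup_age_hours` is an illustrative helper):

```shell
# Estimate backup age in hours from a filename such as
# backup_control_plane_20260310_020000.sql (timestamp in local time).
backup_age_hours() {
  local file="${1##*/}" now_epoch="${2:-$(date +%s)}"
  local base="${file%.sql}" d t backup_epoch
  t="${base##*_}"    # HHMMSS
  base="${base%_*}"
  d="${base##*_}"    # YYYYMMDD
  backup_epoch=$(date -d "${d:0:4}-${d:4:2}-${d:6:2} ${t:0:2}:${t:2:2}:${t:4:2}" +%s)
  echo $(( (now_epoch - backup_epoch) / 3600 ))
}
```

If the result exceeds the 24-hour RPO target, flag the expected data loss in the incident channel before declaring recovery complete.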
Step 4: Validate Recovery
# Run smoke tests
curl -s https://control-plane.so1.io/api/v1/workflows | jq 'length'
curl -s https://console.so1.io | grep -q "SO1 Platform"
curl -s https://n8n.so1.io/healthz | jq '.status'
# Test critical workflow
curl -X POST https://control-plane.so1.io/api/v1/agents/workflow-architect/execute \
-H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
-d '{"input": "test recovery"}' | jq '.status'
Step 5: Communicate Status
# Update incident channel
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H "Content-Type: application/json" \
-d '{
"text": "✅ DISASTER RECOVERY COMPLETED",
"attachments": [{
"color": "good",
"title": "Services Restored",
"fields": [
{"title": "RTO", "value": "3.5 hours", "short": true},
{"title": "Data Loss", "value": "6 hours (RPO)", "short": true},
{"title": "Services", "value": "All services operational", "short": false}
]
}]
}'
# Create postmortem (see Incident Response runbook)
Verification Checklist
After completing backup/recovery operations, verify:
Backup file exists in storage and its checksum matches the recorded SHA-256
All services return HTTP 200 from their /health endpoints
Restored tables contain the expected record counts
A sample API call and workflow execution succeed
The team has been notified and incident channels updated
Troubleshooting
| Issue | Symptoms | Root Cause | Resolution |
| --- | --- | --- | --- |
| Backup Fails | pg_dump exits with error | Insufficient disk space, connection timeout | Check disk space, increase timeout, verify DB connection |
| S3 Upload Fails | AWS CLI returns 403/500 | Invalid credentials, bucket policy | Verify AWS credentials, check bucket permissions |
| Restoration Slow | pg_restore takes >2 hours | Large database, network latency | Use --jobs flag for parallel restore, restore from same region |
| Data Integrity Issues | Corrupted data after restore | Bad backup, incomplete restore | Verify backup checksum before restore, check pg_restore logs |
| Missing Recent Data | Latest transactions not in backup | Backup timing, RPO exceeded | Restore from more recent backup, review backup frequency |
| Service Won't Start | Health checks fail after restore | Schema mismatch, missing migrations | Check migration status, run pending migrations |
| Checksum Mismatch | Backup file checksum doesn't match | File corruption during transfer | Re-download backup, verify S3 object integrity |
Detailed Troubleshooting: Restoration Failed
# Check pg_restore logs
pg_restore --dbname="$DATABASE_URL" "$BACKUP_FILE" 2>&1 | tee restore.log
grep ERROR restore.log
# Common errors:
# 1. "ERROR: relation already exists"
#    Solution: Add --clean flag to drop existing objects
pg_restore --dbname="$DATABASE_URL" --clean --if-exists "$BACKUP_FILE"
# 2. "ERROR: permission denied"
#    Solution: Ensure database user has sufficient privileges
psql "$DATABASE_URL" -c "ALTER USER dbuser WITH SUPERUSER;"
# 3. "ERROR: could not open file"
#    Solution: Verify backup file integrity
sha256sum -c "${BACKUP_FILE}.sha256"
# 4. Restoration hangs
#    Solution: Use verbose mode and check for blocking queries
pg_restore --dbname="$DATABASE_URL" --verbose "$BACKUP_FILE"
# Check for locks
psql "$DATABASE_URL" -c "SELECT * FROM pg_locks WHERE NOT granted;"
Best Practices
Backup Strategy
Follow 3-2-1 rule: 3 copies, 2 different media, 1 offsite
Automate backups: Never rely on manual backups alone
Verify backups: Test restoration quarterly
Encrypt sensitive data: Use encryption at rest and in transit
Document procedures: Keep runbooks updated
Retention Policy
Daily backups: Keep for 30 days
Weekly backups: Keep for 12 weeks
Monthly backups: Keep for 12 months
Yearly backups: Keep for 7 years (compliance)
Auto-delete old backups: Enforce retention with lifecycle policies
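The tiers above can also be checked mechanically, e.g. in a cleanup or audit script. This sketch tests whether a daily backup is still inside its 30-day window, assuming GNU `date` (the `within_daily_retention` helper name is illustrative); in production, S3 lifecycle rules should enforce this automatically, per the last bullet:

```shell
# Succeed if a daily backup dated YYYYMMDD is within 30 days of "today"
# (second argument; defaults to the current date).
within_daily_retention() {
  local backup_date="$1" today="${2:-$(date +%Y%m%d)}"
  local backup_epoch today_epoch age_days
  backup_epoch=$(date -d "$backup_date" +%s)
  today_epoch=$(date -d "$today" +%s)
  age_days=$(( (today_epoch - backup_epoch) / 86400 ))
  [ "$age_days" -le 30 ]
}
```

Usage: `within_daily_retention 20260301 || echo "expired, eligible for deletion"`.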
Disaster Recovery
Define RTO/RPO: Recovery Time Objective <4h, Recovery Point Objective <24h
Test regularly: Quarterly DR drills
Document everything: Clear procedures, contact lists
Automate where possible: Reduce human error
Plan for worst case: Assume total failure, no access to primary systems
Security
Encrypt backups: Use GPG or cloud-native encryption
Restrict access: Limit who can restore production data
Audit backup access: Log all backup downloads/restorations
Rotate credentials: Change backup storage credentials quarterly
Separate accounts: Use different AWS accounts for production and backups