Overview

This runbook covers deployment of all SO1 platform services: pre-deployment validation, deployment execution, post-deployment verification, and rollback.
Target Audience: Platform engineers, DevOps team, release managers

Pre-Deployment Checklist

Before deploying any service, verify:
  1. Code Review Complete
    • All PRs approved by at least 2 reviewers
    • CI/CD pipeline passing (tests, lints, builds)
    • No unresolved comments or blockers
  2. Testing Verified
    • Unit tests passing (>90% coverage)
    • Integration tests passing
    • E2E tests passing (if applicable)
    • Manual testing completed for major features
  3. Dependencies Ready
    • Database migrations tested in staging
    • Feature flags configured
    • Environment variables updated
    • External service dependencies verified
  4. Monitoring Ready
    • Datadog monitors active for new features
    • PagerDuty escalation policies updated
    • Rollback plan documented
  5. Communication
    • Deployment announced in #deployments Slack channel
    • Maintenance window scheduled (if needed)
    • Customer notification sent (for breaking changes)
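
The checklist above can be treated as a simple gate: deployment proceeds only when every item is confirmed. The following is a minimal sketch of that idea, not part of the SO1 tooling; `ChecklistSection`, `unmet_items`, and `ready_to_deploy` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChecklistSection:
    """One numbered section of the pre-deployment checklist."""
    name: str
    # Maps the checklist item text to whether it has been confirmed.
    items: dict[str, bool] = field(default_factory=dict)


def unmet_items(sections: list[ChecklistSection]) -> list[str]:
    """Return 'Section: item' for every item that is not yet confirmed."""
    return [
        f"{section.name}: {item}"
        for section in sections
        for item, done in section.items.items()
        if not done
    ]


def ready_to_deploy(sections: list[ChecklistSection]) -> bool:
    """Deployment may proceed only when no checklist item is outstanding."""
    return not unmet_items(sections)
```

Listing the unmet items, rather than returning only a yes/no answer, gives the deployer something concrete to paste into #deployments when a deployment is held back.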

Deployment: Control Plane API

Standard Deployment

Service: so1-control-plane-api
Platform: Railway
Deployment Strategy: Rolling update with health checks
Procedure:
  1. Announce deployment
    # Post in #deployments
    🚀 Deploying control-plane-api
    Branch: main
    Commit: abc123
    Changes: [Brief description]
    Deployer: @yourname
    ETA: 5 minutes
    
  2. Trigger deployment via Railway Deployer agent
    curl -X POST https://api.so1.io/v1/orchestrate \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -d '{
        "agent": "railway-deployer",
        "inputs": {
          "action": "deploy",
          "service": "control-plane-api",
          "environment": "production",
          "source": {
            "type": "github",
            "branch": "main",
            "commit": "abc123"
          },
          "health_check": {
            "enabled": true,
            "endpoint": "/health",
            "timeout": 60
          }
        }
      }'
    
  3. Monitor deployment progress
    # Watch Railway logs
    railway logs --service so1-control-plane-api --follow
    
    # Expected output:
    # [deploy] Starting deployment...
    # [build] Building Docker image...
    # [build] ✓ Build complete
    # [deploy] Rolling update: 0/3 → 1/3 → 2/3 → 3/3
    # [health] Health check passed: /health returned 200
    # [deploy] ✓ Deployment successful
    
  4. Verify deployment (5 minutes)
    Health check:
    curl https://api.so1.io/v1/health
    # Expected: {"status": "healthy", "version": "1.2.3", "checks": {...}}
    
    Smoke tests:
    # Test authentication
    curl https://api.so1.io/v1/auth/me \
      -H "Authorization: Bearer $TEST_TOKEN"
    
    # Test workflow listing
    curl https://api.so1.io/v1/workflows
    
    # Test workflow execution
    curl -X POST https://api.so1.io/v1/workflows/test-workflow/execute
    
    Monitor metrics (Datadog):
    • Error rate < 1%
    • P99 latency < 500ms
    • Success rate > 99%
  5. Confirm deployment
    # Post in #deployments
    ✅ control-plane-api deployed successfully
    Version: 1.2.3
    Commit: abc123
    Metrics: Error rate 0.2%, p99 latency 320ms
    Rollback plan: Available if needed
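
Hand-editing the JSON body in step 2 makes it easy to leave a stale commit hash in place. A minimal sketch that builds the same body from arguments is shown here; `build_deploy_request` is a hypothetical helper, and the field names are copied from the curl example above rather than from any API specification.

```python
import json


def build_deploy_request(service: str, commit: str, branch: str = "main",
                         environment: str = "production",
                         health_timeout: int = 60) -> str:
    """Return the JSON body for a railway-deployer 'deploy' request."""
    payload = {
        "agent": "railway-deployer",
        "inputs": {
            "action": "deploy",
            "service": service,
            "environment": environment,
            "source": {"type": "github", "branch": branch, "commit": commit},
            # Same health check settings as the curl example in step 2.
            "health_check": {
                "enabled": True,
                "endpoint": "/health",
                "timeout": health_timeout,
            },
        },
    }
    return json.dumps(payload, indent=2)
```

The returned string can be used as the request body in place of the inline JSON.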
    
Verification Checklist:
  • ✅ Railway deployment shows “Active” status
  • ✅ Health endpoint returning 200
  • ✅ Smoke tests passing
  • ✅ Error rate < 1% for 5 minutes
  • ✅ No alerts firing
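
"Error rate < 1% for 5 minutes" means every sample in the window must be below the threshold, and the service must have been observed for the whole window. A rough sketch of that check follows, assuming error-rate samples have already been pulled from monitoring as `(unix_timestamp, percent)` pairs; the function name and input shape are assumptions, not a Datadog API.

```python
def error_rate_stable(samples, now, threshold_pct=1.0, window_s=300):
    """Return True if the error rate stayed below threshold for the full window.

    samples: list of (unix_timestamp_seconds, error_rate_percent) tuples.
    now:     unix timestamp marking the end of the window.
    """
    if not samples:
        return False
    window_start = now - window_s
    # Monitoring must have started at or before the beginning of the window.
    monitored_long_enough = min(ts for ts, _ in samples) <= window_start
    in_window = [rate for ts, rate in samples if window_start <= ts <= now]
    return (
        monitored_long_enough
        and bool(in_window)
        and all(rate < threshold_pct for rate in in_window)
    )
```

Requiring coverage of the full window guards against declaring success 30 seconds after the deploy, when only one or two healthy samples exist.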
Troubleshooting:
| Issue | Cause | Resolution |
|---|---|---|
| Build fails | Dependency issue | Check package.json, rebuild locally |
| Health check fails | Service not ready | Increase timeout, check logs |
| Error rate spikes | Code regression | Rollback immediately (see below) |
| Database migration fails | Schema conflict | Rollback migration, fix locally |
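
When the health check fails because the service is not ready yet, polling with a deadline is usually more informative than a single request. The sketch below shows one way to do that; `wait_until_healthy` is a hypothetical helper, and the `probe` callable (for example, a function that requests the health endpoint and returns True on HTTP 200) is supplied by the caller.

```python
import time
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool],
                       timeout_s: float = 60.0,
                       interval_s: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Call probe() until it returns True or the timeout would be exceeded."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        # Give up if the next attempt would start after the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; in normal use the defaults apply.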

Database Migration Deployment

When: Deploying code that requires schema changes
Procedure:
  1. Run migration in staging first
    # Connect to staging database
    railway connect --service database --environment staging
    
    # Run migration
    npm run migrate:up
    
    # Verify schema
    psql -c "\d workflows"
    
  2. Create migration rollback script
    -- migrations/rollback/20240115_add_workflow_tags.sql
    -- Drop the index first, then the column it depends on
    DROP INDEX IF EXISTS idx_workflows_tags;
    ALTER TABLE workflows DROP COLUMN IF EXISTS tags;
    
  3. Schedule maintenance window (if breaking change)
    # Update status page
    curl -X POST https://api.statuspage.io/v1/pages/$PAGE_ID/incidents \
      -H "Authorization: OAuth $STATUSPAGE_TOKEN" \
      -d '{
        "incident": {
          "name": "Scheduled Maintenance",
          "status": "scheduled",
          "scheduled_for": "2024-01-15T10:00:00Z",
          "scheduled_until": "2024-01-15T10:30:00Z",
          "body": "Brief service interruption for database updates"
        }
      }'
    
  4. Deploy with migration
    # Step 1: Run migration
    railway run --service database npm run migrate:up
    
    # Step 2: Deploy application code
    curl -X POST https://api.so1.io/v1/orchestrate \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -d '{
        "agent": "railway-deployer",
        "inputs": {
          "action": "deploy",
          "service": "control-plane-api",
          "pre_deploy_command": "npm run migrate:up",
          "health_check": {"enabled": true}
        }
      }'
    
  5. Verify migration success
    # Check migration status
    railway run --service database npm run migrate:status
    
    # Expected: All migrations applied, no pending
    
    # Verify data integrity
    psql -c "SELECT COUNT(*) FROM workflows WHERE tags IS NOT NULL;"
    
Rollback Plan:
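# If the new application code is already live, roll it back first
# (see Rollback: Database Migration) so nothing keeps writing to the new schema

# Then roll back the migration
railway run --service database npm run migrate:down

# Or, if the down migration is unavailable, run the manual rollback script
railway run --service database psql < migrations/rollback/20240115_add_workflow_tags.sql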
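
A migration deployment is a sequence of steps that each have an inverse. The following sketch, using hypothetical names (`Step`, `run_with_rollback`), applies steps in order and, on the first failure, undoes the completed ones in reverse. With the order "migrate, deploy, verify", a failed verification rolls the application back before the schema, which matches the ordering required in the database rollback procedure of this runbook.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    name: str
    apply: Callable[[], None]   # performs the step, raises on failure
    undo: Callable[[], None]    # reverses the step


def run_with_rollback(steps: list[Step]) -> list[str]:
    """Apply steps in order; on failure undo completed steps in reverse.

    Returns a log of what was applied, what failed, and what was undone.
    """
    log: list[str] = []
    done: list[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            log.append(f"failed: {step.name} ({exc})")
            for prev in reversed(done):
                prev.undo()
                log.append(f"undone: {prev.name}")
            return log
        done.append(step)
        log.append(f"applied: {step.name}")
    return log
```

In practice each `apply` and `undo` would wrap one of the commands shown above (for example `npm run migrate:up` and `npm run migrate:down`); that wiring is left out here.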

Deployment: Console (Frontend)

Service: so1-console
Platform: Vercel
Deployment Strategy: Preview deployments + production promotion
Procedure:
  1. Review preview deployment
    # Vercel automatically creates preview for each PR
    # URL: https://so1-console-git-feature-branch.vercel.app
    
    # Test key flows:
    # - Login/authentication
    # - Workflow listing
    # - Workflow execution
    # - Settings pages
    
  2. Promote to production
    # Via Vercel CLI (builds and deploys the current checkout to production)
    vercel --prod
    
    # Or via Vercel dashboard:
    # 1. Go to Deployments
    # 2. Find successful deployment
    # 3. Click "Promote to Production"
    
  3. Monitor frontend metrics
    # Vercel Analytics
    open https://vercel.com/so1/so1-console/analytics
    
    # Check:
    # - Page load times < 2s
    # - Lighthouse scores > 90
    # - Error rate < 0.1%
    
  4. Verify production
    # Manual testing in production
    open https://console.so1.io
    
    # Test critical paths:
    # ✓ Login works
    # ✓ Workflows load
    # ✓ Can trigger workflow
    # ✓ No console errors
    
Verification Checklist:
  • ✅ Vercel deployment successful
  • ✅ Production URL serving new version
  • ✅ Lighthouse scores maintained
  • ✅ No JavaScript errors in console
  • ✅ API calls working correctly

Deployment: n8n Workflows

Service: n8n workflow definitions
Platform: n8n Cloud / Self-hosted
Deployment Strategy: Version-controlled workflow JSON
Procedure:
  1. Export workflow from staging
    # Via n8n API
    curl https://n8n.so1.io/api/v1/workflows/$WORKFLOW_ID \
      -H "X-N8N-API-KEY: $N8N_API_KEY" \
      > workflows/production/webhook-processor-v2.json
    
  2. Validate workflow JSON
    # Use Workflow Architect agent
    curl -X POST https://api.so1.io/v1/orchestrate \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -d '{
        "agent": "workflow-architect",
        "inputs": {
          "action": "validate",
          "workflow_json": "<workflow JSON>",
          "check_credentials": true,
          "check_integrations": true
        }
      }'
    
  3. Deploy to production n8n
    # Import workflow
    curl -X POST https://n8n.so1.io/api/v1/workflows \
      -H "X-N8N-API-KEY: $N8N_PROD_API_KEY" \
      -H "Content-Type: application/json" \
      -d @workflows/production/webhook-processor-v2.json
    
    # Activate workflow
    curl -X POST https://n8n.so1.io/api/v1/workflows/$NEW_WORKFLOW_ID/activate \
      -H "X-N8N-API-KEY: $N8N_PROD_API_KEY"
    
  4. Test workflow execution
    # Trigger test execution
    curl -X POST https://n8n.so1.io/webhook/test-webhook-processor \
      -H "Content-Type: application/json" \
      -d '{"test": true, "data": "sample"}'
    
    # Verify execution in n8n UI
    open https://n8n.so1.io/workflows/$NEW_WORKFLOW_ID/executions
    
  5. Update webhook URLs (if changed)
    # Update consumers to use new webhook URL
    # Document in #deployments channel
    
Verification Checklist:
  • ✅ Workflow imported successfully
  • ✅ All credentials connected
  • ✅ Test execution successful
  • ✅ Webhook URL documented
  • ✅ Old workflow deactivated (after monitoring period)
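
A workflow exported from staging typically carries instance-specific fields (for example an `id`, timestamps, or the `active` flag) that the production import may reject or that should not be copied across. The sketch below keeps only an allow-list of fields before import. The allow-list is an assumption; check it against the API documentation for the n8n version in use. `prepare_for_import` is a hypothetical helper.

```python
# Assumed set of fields accepted when creating a workflow; verify for your n8n version.
ALLOWED_FIELDS = ("name", "nodes", "connections", "settings", "staticData")


def prepare_for_import(exported: dict) -> dict:
    """Return a copy of an exported workflow containing only importable fields."""
    return {key: exported[key] for key in ALLOWED_FIELDS if key in exported}
```

Running the exported JSON through this step before the import in step 3 also makes the version-controlled file smaller and its diffs easier to review.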

Rollback Procedures

When to Rollback

Rollback immediately if:
  • Error rate > 5% for 5+ minutes
  • Critical functionality broken
  • Database corruption detected
  • Security vulnerability introduced
  • Performance degradation >50%
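
The criteria above can be written down as an explicit decision function so that the on-call engineer and any automation agree on when to roll back. This is a sketch under stated assumptions: `ServiceSnapshot` and `rollback_reasons` are hypothetical names, and "performance degradation >50%" is interpreted here as p99 latency more than 50% above its pre-deployment baseline.

```python
from dataclasses import dataclass


@dataclass
class ServiceSnapshot:
    minutes_error_rate_above_5pct: float   # how long the error rate has exceeded 5%
    p99_latency_ms: float                  # current p99 latency
    baseline_p99_latency_ms: float         # p99 latency before the deployment
    critical_functionality_broken: bool = False
    database_corruption: bool = False
    security_vulnerability: bool = False


def rollback_reasons(snapshot: ServiceSnapshot) -> list[str]:
    """Return every rollback criterion the snapshot meets (empty list = keep going)."""
    reasons = []
    if snapshot.minutes_error_rate_above_5pct >= 5:
        reasons.append("error rate > 5% for 5+ minutes")
    if snapshot.critical_functionality_broken:
        reasons.append("critical functionality broken")
    if snapshot.database_corruption:
        reasons.append("database corruption detected")
    if snapshot.security_vulnerability:
        reasons.append("security vulnerability introduced")
    if snapshot.p99_latency_ms > 1.5 * snapshot.baseline_p99_latency_ms:
        reasons.append("performance degradation > 50%")
    return reasons
```

The returned reasons can be pasted directly into the "Reason:" line of the rollback announcement.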

Rollback: Control Plane API

Procedure (3-5 minutes):
  1. Announce rollback
    # Post in #deployments and #incidents
    ⚠️ Rolling back control-plane-api
    Reason: [Error rate spike / broken feature / etc]
    Previous version: 1.2.2
    ETA: 3 minutes
    
  2. Execute rollback via Railway
    # Option 1: Via Railway Deployer agent
    curl -X POST https://api.so1.io/v1/orchestrate \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -d '{
        "agent": "railway-deployer",
        "inputs": {
          "action": "rollback",
          "service": "control-plane-api",
          "target_version": "previous"
        }
      }'
    
    # Option 2: Via Railway CLI
    railway rollback --service so1-control-plane-api
    
    # Option 3: Via Railway dashboard
    # 1. Go to Deployments tab
    # 2. Find previous successful deployment
    # 3. Click "Redeploy"
    
  3. Monitor rollback
    railway logs --service so1-control-plane-api --follow
    
    # Expected:
    # [deploy] Rolling back to deployment xyz789...
    # [deploy] 3/3 → 2/3 → 1/3 → 0/3 (new), 0/3 → 1/3 → 2/3 → 3/3 (old)
    # [health] Health check passed
    # [deploy] ✓ Rollback complete
    
  4. Verify rollback success
    # Check version
    curl -s https://api.so1.io/v1/health | jq '.version'
    # Expected: "1.2.2" (previous version)
    
    # Check error rate
    # Datadog: Error rate should return to baseline
    
    # Smoke tests
    curl https://api.so1.io/v1/workflows
    
  5. Confirm rollback
    # Post in #deployments
    ✅ Rollback complete
    Version: 1.2.2
    Error rate: 0.3% (baseline restored)
    Investigation: [Link to incident or issue]
    
Rollback Verification:
  • ✅ Previous version deployed
  • ✅ Error rate < 1%
  • ✅ Critical functionality restored
  • ✅ No new alerts firing

Rollback: Database Migration

Procedure (5-10 minutes):
  1. Assess migration state
    railway run --service database npm run migrate:status
    
    # If migration partially applied:
    # - Some data may be in new schema
    # - Application may be incompatible with old schema
    
  2. Rollback application first
    # Step 1: Rollback application code (see above)
    railway rollback --service so1-control-plane-api
    
    # Wait for app rollback to complete
    # This ensures no new code writes to new schema
    
  3. Rollback migration
    # Run down migration
    railway run --service database npm run migrate:down
    
    # Or run manual rollback script
    railway run --service database psql < migrations/rollback/20240115_add_workflow_tags.sql
    
  4. Verify schema state
    # Check table structure
    railway run --service database psql -c "\d workflows"
    
    # Verify data integrity
    railway run --service database psql -c "SELECT COUNT(*) FROM workflows;"
    
  5. Test application with rolled-back schema
    # Run smoke tests
    curl https://api.so1.io/v1/workflows
    curl -X POST https://api.so1.io/v1/workflows/test/execute
    
Critical: If data was written to the new schema, rolling back may cause data loss. In that case:
  • Preserve data in temporary table before rollback
  • Migrate data back to old schema format
  • Consider forward fix instead of rollback
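
Preserving data before a rollback amounts to copying the affected column, together with its key, into a side table before the column is dropped. The sketch below demonstrates the idea with Python's built-in `sqlite3` as a stand-in; the production database in this runbook is accessed with `psql`, so the real statement would be run there. The `backup_column` helper and the sample rows are illustrative only.

```python
import sqlite3

# In-memory stand-in for the workflows table after the tags migration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workflows (id INTEGER PRIMARY KEY, name TEXT, tags TEXT);
    INSERT INTO workflows (name, tags) VALUES
        ('nightly-sync', 'etl,prod'),
        ('webhook-processor', NULL),
        ('report-builder', 'reports');
""")


def backup_column(conn: sqlite3.Connection, table: str, key: str, column: str) -> int:
    """Copy non-null values of `column` (with their key) into a backup table.

    Identifiers are interpolated directly, so pass only trusted, hard-coded names.
    Returns the number of rows preserved.
    """
    backup = f"{table}_{column}_backup"
    conn.execute(
        f"CREATE TABLE {backup} AS "
        f"SELECT {key}, {column} FROM {table} WHERE {column} IS NOT NULL"
    )
    return conn.execute(f"SELECT COUNT(*) FROM {backup}").fetchone()[0]


saved = backup_column(conn, "workflows", "id", "tags")
```

Comparing the returned row count with the data-integrity query from the migration procedure is a quick way to confirm nothing was missed before the column is dropped.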

Gradual Rollout (Feature Flags)

When to Use

Use gradual rollout for:
  • Major new features
  • High-risk changes
  • Performance-sensitive code
  • Changes affecting many users
Procedure:
  1. Deploy code with feature flag disabled
    // In code
    if (featureFlags.isEnabled('bulk-execution')) {
      // New bulk execution logic
    } else {
      // Old single execution logic
    }
    
  2. Enable for internal testing (1% traffic)
    curl -X PATCH https://api.so1.io/v1/feature-flags/bulk-execution \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -d '{
        "enabled": true,
        "rollout_percentage": 1,
        "user_targeting": {"internal_users": true}
      }'
    
  3. Monitor metrics (1 hour)
    • Error rate
    • Performance metrics
    • User feedback
  4. Gradually increase rollout
    # 5% → 25% → 50% → 100%
    # Each stage: Monitor for 1 hour
    
    curl -X PATCH https://api.so1.io/v1/feature-flags/bulk-execution \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -d '{"rollout_percentage": 25}'
    
  5. Full rollout
    curl -X PATCH https://api.so1.io/v1/feature-flags/bulk-execution \
      -H "Authorization: Bearer $SO1_API_KEY" \
      -d '{"enabled": true, "rollout_percentage": 100}'
    
  6. Remove feature flag (after 1 week)
    // Remove conditional logic, keep only new code
    // Bulk execution logic (now default)
    
Rollback: Set rollout_percentage to 0 to disable the feature immediately.
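
The rollout procedure is a small state machine: after each monitoring period the percentage either advances to the next stage or drops to 0. A minimal sketch of that progression follows, using the stages named in the procedure (1, 5, 25, 50, 100); `next_rollout_percentage` is a hypothetical helper, not part of the feature-flag API.

```python
# Rollout stages from the procedure: internal 1%, then 5% → 25% → 50% → 100%.
STAGES = (1, 5, 25, 50, 100)


def next_rollout_percentage(current: int, healthy: bool) -> int:
    """Return the percentage to set after a monitoring period.

    Unhealthy metrics disable the feature (0); otherwise move to the next stage.
    """
    if not healthy:
        return 0
    for stage in STAGES:
        if stage > current:
            return stage
    return STAGES[-1]  # already fully rolled out
```

The returned value is what would go into the `rollout_percentage` field of the PATCH request shown in the procedure.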

Deployment Schedule

Production Deployment Windows

| Day | Time (UTC) | Window | Notes |
|---|---|---|---|
| Monday | 14:00-17:00 | Standard | Normal deployments |
| Tuesday | 14:00-17:00 | Standard | Normal deployments |
| Wednesday | 14:00-17:00 | Standard | Normal deployments |
| Thursday | 14:00-17:00 | Standard | Last normal window |
| Friday | 10:00-12:00 | Emergency only | Avoid unless critical |
| Saturday | N/A | Emergency only | No planned deployments |
| Sunday | N/A | Emergency only | No planned deployments |
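
The schedule can be encoded so that tooling classifies a proposed deployment time the same way the table does. The sketch below does so under two assumptions: window end times are exclusive, and times on Monday through Friday that fall outside the listed hours are reported as "outside-window", since the table does not say how to treat them. `classify_deployment_time` is a hypothetical name.

```python
from datetime import datetime, time, timezone

STANDARD_WINDOW = (time(14, 0), time(17, 0))   # Monday to Thursday, UTC
FRIDAY_WINDOW = (time(10, 0), time(12, 0))     # Friday, emergency only, UTC


def classify_deployment_time(when: datetime) -> str:
    """Classify a timezone-aware datetime as 'standard', 'emergency-only',
    or 'outside-window' according to the production deployment windows."""
    when = when.astimezone(timezone.utc)
    weekday = when.weekday()  # Monday is 0, Sunday is 6
    now = when.time()
    if weekday <= 3:
        start, end = STANDARD_WINDOW
        return "standard" if start <= now < end else "outside-window"
    if weekday == 4:
        start, end = FRIDAY_WINDOW
        return "emergency-only" if start <= now < end else "outside-window"
    return "emergency-only"  # Saturday and Sunday have no planned window
```

Converting to UTC first means a deployer can pass a local, timezone-aware timestamp without doing the conversion by hand.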

Blackout Periods

No deployments during:
  • Customer onboarding events
  • Major marketing launches
  • December 15 through January 5 (holiday freeze)
  • Identified high-traffic periods
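
The holiday freeze spans a year boundary, which is easy to get wrong in a date check. A minimal sketch, assuming both end dates (December 15 and January 5) are inclusive; `in_holiday_freeze` is a hypothetical helper, and the other blackout periods are event-driven and not covered here.

```python
from datetime import date


def in_holiday_freeze(day: date) -> bool:
    """Return True if `day` falls within December 15 through January 5 (inclusive)."""
    month_day = (day.month, day.day)
    # The range wraps around New Year, so it is the union of two sub-ranges.
    return month_day >= (12, 15) or month_day <= (1, 5)
```

Comparing `(month, day)` tuples keeps the check independent of the year, so it does not need updating each December.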