Overview
This runbook covers deployment of all SO1 platform services: pre-deployment validation, deployment execution, post-deployment verification, and rollback.
Target Audience: Platform engineers, DevOps team, release managers
Pre-Deployment Checklist
Before deploying any service, verify:
Deployment: Control Plane API
Standard Deployment
Service: so1-control-plane-api
Platform: Railway
Deployment Strategy: Rolling update with health checks
Procedure:
- Announce deployment
# Post in #deployments
🚀 Deploying control-plane-api
Branch: main
Commit: abc123
Changes: [Brief description]
Deployer: @yourname
ETA: 5 minutes
- Trigger deployment via Railway Deployer agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"agent": "railway-deployer",
"inputs": {
"action": "deploy",
"service": "control-plane-api",
"environment": "production",
"source": {
"type": "github",
"branch": "main",
"commit": "abc123"
},
"health_check": {
"enabled": true,
"endpoint": "/health",
"timeout": 60
}
}
}'
- Monitor deployment progress
# Watch Railway logs
railway logs --service so1-control-plane-api --follow
# Expected output:
# [deploy] Starting deployment...
# [build] Building Docker image...
# [build] ✓ Build complete
# [deploy] Rolling update: 0/3 → 1/3 → 2/3 → 3/3
# [health] Health check passed: /health returned 200
# [deploy] ✓ Deployment successful
- Verify deployment (5 minutes)
Health check:
curl https://api.so1.io/v1/health
# Expected: {"status": "healthy", "version": "1.2.3", "checks": {...}}
Smoke tests:
# Test authentication
curl https://api.so1.io/v1/auth/me \
-H "Authorization: Bearer $TEST_TOKEN"
# Test workflow listing
curl https://api.so1.io/v1/workflows
# Test workflow execution
curl -X POST https://api.so1.io/v1/workflows/test-workflow/execute
Monitor metrics (Datadog):
- Error rate < 1%
- P99 latency < 500ms
- Success rate > 99%
- Confirm deployment
# Post in #deployments
✅ control-plane-api deployed successfully
Version: 1.2.3
Commit: abc123
Metrics: Error rate 0.2%, p99 latency 320ms
Rollback plan: Available if needed
Verification Checklist:
- ✅ Railway deployment shows “Active” status
- ✅ Health endpoint returning 200
- ✅ Smoke tests passing
- ✅ Error rate < 1% for 5 minutes
- ✅ No alerts firing
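The numeric thresholds from the monitoring step (error rate < 1%, p99 latency < 500ms, success rate > 99%) can be turned into a scripted gate. The sketch below is illustrative only: `metrics_within_thresholds` is a hypothetical helper, not part of any SO1 tooling, and it assumes the three values have already been read from Datadog by some other means.

```shell
# metrics_within_thresholds ERROR_RATE_PCT P99_MS SUCCESS_RATE_PCT
# Hypothetical helper: exits 0 when all three values satisfy the runbook
# thresholds (error rate < 1%, p99 latency < 500 ms, success rate > 99%),
# and exits 1 otherwise. Fetching the values from Datadog is out of scope.
metrics_within_thresholds() {
  awk -v err="$1" -v p99="$2" -v ok="$3" 'BEGIN {
    # "+ 0" forces numeric comparison; awk handles the decimals.
    exit !((err + 0 < 1) && (p99 + 0 < 500) && (ok + 0 > 99))
  }'
}
```

For example, `metrics_within_thresholds 0.2 320 99.8` succeeds, while passing an error rate of `1.5` makes the gate fail.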
Troubleshooting:
| Issue | Cause | Resolution |
|---|---|---|
| Build fails | Dependency issue | Check package.json, rebuild locally |
| Health check fails | Service not ready | Increase timeout, check logs |
| Error rate spikes | Code regression | Roll back immediately (see below) |
| Database migration fails | Schema conflict | Roll back migration, fix locally |
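When a health check fails only because the service is still starting, polling with a bounded number of attempts is usually safer than a single request. A minimal sketch, assuming a generic helper named `retry_until_ok` (hypothetical, not an existing SO1 or Railway command); the attempt count and delay in the usage comment are illustrative values chosen to match the 60-second health check timeout used above.

```shell
# retry_until_ok MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits 0 or MAX_ATTEMPTS is
# reached, sleeping DELAY_SECONDS between attempts. Returns 0 on the first
# success, 1 if every attempt failed.
retry_until_ok() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage (12 attempts, 5 s apart, roughly 60 s in total):
# retry_until_ok 12 5 curl --fail --silent --output /dev/null https://api.so1.io/v1/health
```

Passing the command as arguments keeps the helper independent of curl, so the same loop can wrap a smoke test or a migration status check.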
Database Migration Deployment
When: Deploying code that requires schema changes
Procedure:
- Run migration in staging first
# Connect to staging database
railway connect --service database --environment staging
# Run migration
npm run migrate:up
# Verify schema
psql -c "\d workflows"
- Create migration rollback script
-- migrations/rollback/20240115_add_workflow_tags.sql
ALTER TABLE workflows DROP COLUMN IF EXISTS tags;
DROP INDEX IF EXISTS idx_workflows_tags;
- Schedule maintenance window (if breaking change)
# Update status page
curl -X POST https://api.statuspage.io/v1/pages/$PAGE_ID/incidents \
-H "Authorization: OAuth $STATUSPAGE_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"incident": {
"name": "Scheduled Maintenance",
"status": "scheduled",
"scheduled_for": "2024-01-15T10:00:00Z",
"scheduled_until": "2024-01-15T10:30:00Z",
"body": "Brief service interruption for database updates"
}
}'
- Deploy with migration
# Step 1: Run migration
railway run --service database npm run migrate:up
# Step 2: Deploy application code (pre_deploy_command repeats the migration, which should be a no-op once Step 1 has applied it)
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"agent": "railway-deployer",
"inputs": {
"action": "deploy",
"service": "control-plane-api",
"pre_deploy_command": "npm run migrate:up",
"health_check": {"enabled": true}
}
}'
- Verify migration success
# Check migration status
railway run --service database npm run migrate:status
# Expected: All migrations applied, no pending
# Verify data integrity
psql -c "SELECT COUNT(*) FROM workflows WHERE tags IS NOT NULL;"
Rollback Plan:
# If deployment fails, roll back the application code first (see Rollback: Database Migration),
# then roll back the migration
railway run --service database npm run migrate:down
# Or, if migrate:down cannot be used, run the manual rollback script
railway run --service database psql < migrations/rollback/20240115_add_workflow_tags.sql
Deployment: Console (Frontend)
Service: so1-console
Platform: Vercel
Deployment Strategy: Preview deployments + production promotion
Procedure:
- Review preview deployment
# Vercel automatically creates preview for each PR
# URL: https://so1-console-git-feature-branch.vercel.app
# Test key flows:
# - Login/authentication
# - Workflow listing
# - Workflow execution
# - Settings pages
- Promote to production
# Via Vercel CLI
vercel --prod
# Or via Vercel dashboard:
# 1. Go to Deployments
# 2. Find successful deployment
# 3. Click "Promote to Production"
- Monitor frontend metrics
# Vercel Analytics
open https://vercel.com/so1/so1-console/analytics
# Check:
# - Page load times < 2s
# - Lighthouse scores > 90
# - Error rate < 0.1%
- Verify production
# Manual testing in production
open https://console.so1.io
# Test critical paths:
# ✓ Login works
# ✓ Workflows load
# ✓ Can trigger workflow
# ✓ No console errors
Verification Checklist:
- ✅ Vercel deployment successful
- ✅ Production URL serving new version
- ✅ Lighthouse scores maintained
- ✅ No JavaScript errors in console
- ✅ API calls working correctly
Deployment: n8n Workflows
Service: n8n workflow definitions
Platform: n8n Cloud / Self-hosted
Deployment Strategy: Version-controlled workflow JSON
Procedure:
- Export workflow from staging
# Via n8n API
curl https://n8n.so1.io/api/v1/workflows/$WORKFLOW_ID \
-H "X-N8N-API-KEY: $N8N_API_KEY" \
> workflows/production/webhook-processor-v2.json
- Validate workflow JSON
# Use Workflow Architect agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"agent": "workflow-architect",
"inputs": {
"action": "validate",
"workflow_json": "<workflow JSON>",
"check_credentials": true,
"check_integrations": true
}
}'
- Deploy to production n8n
# Import workflow
curl -X POST https://n8n.so1.io/api/v1/workflows \
-H "X-N8N-API-KEY: $N8N_PROD_API_KEY" \
-H "Content-Type: application/json" \
-d @workflows/production/webhook-processor-v2.json
# Activate workflow
curl -X PATCH https://n8n.so1.io/api/v1/workflows/$NEW_WORKFLOW_ID \
-H "X-N8N-API-KEY: $N8N_PROD_API_KEY" \
-H "Content-Type: application/json" \
-d '{"active": true}'
- Test workflow execution
# Trigger test execution
curl -X POST https://n8n.so1.io/webhook/test-webhook-processor \
-H "Content-Type: application/json" \
-d '{"test": true, "data": "sample"}'
# Verify execution in n8n UI
open https://n8n.so1.io/workflows/$NEW_WORKFLOW_ID/executions
- Update webhook URLs (if changed)
# Update consumers to use new webhook URL
# Document in #deployments channel
Verification Checklist:
- ✅ Workflow imported successfully
- ✅ All credentials connected
- ✅ Test execution successful
- ✅ Webhook URL documented
- ✅ Old workflow deactivated (after monitoring period)
Rollback Procedures
When to Rollback
Roll back immediately if any of the following occurs:
- Error rate > 5% for 5+ minutes
- Critical functionality broken
- Database corruption detected
- Security vulnerability introduced
- Performance degradation > 50%
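The two numeric triggers in this list (error rate and performance degradation) can be checked mechanically; the other three remain a human judgment call. The following is a sketch under that assumption, with `should_rollback` as a hypothetical helper name and the inputs supplied by whoever is watching the dashboards.

```shell
# should_rollback ERROR_RATE_PCT MINUTES_ELEVATED PERF_DEGRADATION_PCT
# Hypothetical helper covering only the numeric rollback triggers:
#   - error rate > 5% sustained for 5 or more minutes, or
#   - performance degradation > 50%.
# Exits 0 when a rollback is indicated, 1 otherwise.
should_rollback() {
  awk -v err="$1" -v mins="$2" -v deg="$3" 'BEGIN {
    exit !(((err + 0 > 5) && (mins + 0 >= 5)) || (deg + 0 > 50))
  }'
}

# Usage:
# if should_rollback 6.2 7 10; then echo "Rollback indicated"; fi
```

A short error spike (for example 6.2% for 2 minutes) does not trigger the check, which mirrors the "5+ minutes" condition in the list.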
Rollback: Control Plane API
Procedure (3-5 minutes):
- Announce rollback
# Post in #deployments and #incidents
⚠️ Rolling back control-plane-api
Reason: [Error rate spike / broken feature / etc]
Previous version: 1.2.2
ETA: 3 minutes
- Execute rollback via Railway
# Option 1: Via Railway Deployer agent
curl -X POST https://api.so1.io/v1/orchestrate \
-H "Authorization: Bearer $SO1_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"agent": "railway-deployer",
"inputs": {
"action": "rollback",
"service": "control-plane-api",
"target_version": "previous"
}
}'
# Option 2: Via Railway CLI
railway rollback --service so1-control-plane-api
# Option 3: Via Railway dashboard
# 1. Go to Deployments tab
# 2. Find previous successful deployment
# 3. Click "Redeploy"
- Monitor rollback
railway logs --service so1-control-plane-api --follow
# Expected:
# [deploy] Rolling back to deployment xyz789...
# [deploy] 3/3 → 2/3 → 1/3 → 0/3 (new), 0/3 → 1/3 → 2/3 → 3/3 (old)
# [health] Health check passed
# [deploy] ✓ Rollback complete
- Verify rollback success
# Check version
curl https://api.so1.io/v1/health | jq '.version'
# Expected: "1.2.2" (previous version)
# Check error rate
# Datadog: Error rate should return to baseline
# Smoke tests
curl https://api.so1.io/v1/workflows
- Confirm rollback
# Post in #deployments
✅ Rollback complete
Version: 1.2.2
Error rate: 0.3% (baseline restored)
Investigation: [Link to incident or issue]
Rollback Verification:
- ✅ Previous version deployed
- ✅ Error rate < 1%
- ✅ Critical functionality restored
- ✅ No new alerts firing
Rollback: Database Migration
Procedure (5-10 minutes):
- Assess migration state
railway run --service database npm run migrate:status
# If migration partially applied:
# - Some data may be in new schema
# - Application may be incompatible with old schema
- Roll back application first
# Step 1: Rollback application code (see above)
railway rollback --service so1-control-plane-api
# Wait for app rollback to complete
# This ensures no new code writes to new schema
- Roll back migration
# Run down migration
railway run --service database npm run migrate:down
# Or run manual rollback script
railway run --service database psql < migrations/rollback/20240115_add_workflow_tags.sql
- Verify schema state
# Check table structure
railway run --service database psql -c "\d workflows"
# Verify data integrity
railway run --service database psql -c "SELECT COUNT(*) FROM workflows;"
- Test application with rolled-back schema
# Run smoke tests
curl https://api.so1.io/v1/workflows
curl -X POST https://api.so1.io/v1/workflows/test/execute
Critical: If data was written to the new schema, rolling back may cause data loss. In that case:
- Preserve data in temporary table before rollback
- Migrate data back to old schema format
- Consider forward fix instead of rollback
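The first item, preserving data before the rollback, could look like the sketch below. It only generates the SQL statement; `backup_column_sql` is a hypothetical helper, and both the `id` key column and the `<table>_<column>_backup` naming are assumptions rather than part of the SO1 schema. A regular table is used instead of a PostgreSQL `TEMPORARY` table because a temporary table disappears when the session that created it ends.

```shell
# backup_column_sql TABLE COLUMN
# Hypothetical helper: prints a statement that copies the non-null values of
# COLUMN (plus an assumed "id" key column) into TABLE_COLUMN_backup.
# The arguments are inserted verbatim, so pass trusted identifiers only.
backup_column_sql() {
  printf 'CREATE TABLE %s_%s_backup AS SELECT id, %s FROM %s WHERE %s IS NOT NULL;\n' \
    "$1" "$2" "$2" "$1" "$2"
}

# Usage, following the psql invocation style used in this runbook:
# backup_column_sql workflows tags > /tmp/backup_tags.sql
# railway run --service database psql < /tmp/backup_tags.sql
```

Review the generated statement before running it, and drop the backup table once the data has been migrated back or the forward fix has shipped.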
Gradual Rollout (Feature Flags)
When to Use
Use gradual rollout for:
- Major new features
- High-risk changes
- Performance-sensitive code
- Changes affecting many users
Procedure:
- Deploy code with feature flag disabled
// In code
if (featureFlags.isEnabled('bulk-execution')) {
// New bulk execution logic
} else {
// Old single execution logic
}
- Enable for internal testing (1% traffic)
curl -X PATCH https://api.so1.io/v1/feature-flags/bulk-execution \
-H "Authorization: Bearer $SO1_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"enabled": true,
"rollout_percentage": 1,
"user_targeting": {"internal_users": true}
}'
- Monitor metrics (1 hour)
  - Error rate
  - Performance metrics
  - User feedback
- Gradually increase rollout
# 5% → 25% → 50% → 100%
# Each stage: Monitor for 1 hour
curl -X PATCH https://api.so1.io/v1/feature-flags/bulk-execution \
-H "Authorization: Bearer $SO1_API_KEY" \
-H "Content-Type: application/json" \
-d '{"rollout_percentage": 25}'
- Full rollout
curl -X PATCH https://api.so1.io/v1/feature-flags/bulk-execution \
-H "Authorization: Bearer $SO1_API_KEY" \
-H "Content-Type: application/json" \
-d '{"enabled": true, "rollout_percentage": 100}'
- Remove feature flag (after 1 week)
// Remove conditional logic, keep only new code
// Bulk execution logic (now default)
Rollback: Set rollout_percentage to 0 to disable the feature immediately.
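The rollout sequence described in this section (1% internal, then 5%, 25%, 50%, 100%) can be encoded so that nobody skips a stage by accident. A minimal sketch, assuming a hypothetical helper named `next_rollout_stage`; it only computes the next percentage and leaves the PATCH request to the operator.

```shell
# next_rollout_stage CURRENT_PERCENTAGE
# Hypothetical helper: prints the next stage in the sequence
# 0 -> 1 -> 5 -> 25 -> 50 -> 100 used in this runbook. 100 stays at 100.
# Any other value is rejected with exit status 1.
next_rollout_stage() {
  case "$1" in
    0)   echo 1 ;;
    1)   echo 5 ;;
    5)   echo 25 ;;
    25)  echo 50 ;;
    50)  echo 100 ;;
    100) echo 100 ;;
    *)   echo "unknown rollout percentage: $1" >&2; return 1 ;;
  esac
}

# Usage:
# next=$(next_rollout_stage 5)   # sets next to 25
```

Rejecting unknown percentages is a deliberate choice: a flag sitting at an unexpected value is worth investigating before it is increased further.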
Deployment Schedule
Production Deployment Windows
| Day | Time (UTC) | Window | Notes |
|---|---|---|---|
| Monday | 14:00-17:00 | Standard | Normal deployments |
| Tuesday | 14:00-17:00 | Standard | Normal deployments |
| Wednesday | 14:00-17:00 | Standard | Normal deployments |
| Thursday | 14:00-17:00 | Standard | Last normal window |
| Friday | 10:00-12:00 | Emergency only | Avoid unless critical |
| Saturday | N/A | Emergency only | No planned deployments |
| Sunday | N/A | Emergency only | No planned deployments |
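The table above can be expressed as a lookup that a deploy script calls before proceeding. This is a sketch only: `deployment_window` is a hypothetical helper, and the `outside-window` result for weekday times not listed in the table is an assumption, since the table does not say how those hours are treated.

```shell
# deployment_window DAY HHMM
# DAY  is 1 (Monday) through 7 (Sunday), as printed by `date -u +%u`.
# HHMM is the UTC time as four digits, as printed by `date -u +%H%M`.
# Prints "standard", "emergency-only", or "outside-window" (assumed label).
deployment_window() {
  day=$1
  # Strip leading zeros so the comparison is plain decimal in every shell.
  hhmm=$(printf '%s' "$2" | sed 's/^0*//')
  hhmm=${hhmm:-0}
  case "$day" in
    1|2|3|4)
      if [ "$hhmm" -ge 1400 ] && [ "$hhmm" -lt 1700 ]; then
        echo standard
      else
        echo outside-window
      fi ;;
    5)
      if [ "$hhmm" -ge 1000 ] && [ "$hhmm" -lt 1200 ]; then
        echo emergency-only
      else
        echo outside-window
      fi ;;
    6|7) echo emergency-only ;;
    *)   echo "invalid day: $day" >&2; return 1 ;;
  esac
}

# Usage:
# deployment_window "$(date -u +%u)" "$(date -u +%H%M)"
```

The end of each window is treated as exclusive, so 17:00 UTC on a Thursday already counts as outside the standard window.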
Blackout Periods
No deployments during:
- Customer onboarding events
- Major marketing launches
- December 15 through January 5 (holiday freeze)
- Identified high-traffic periods
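Of the blackout periods above, only the holiday freeze has fixed dates, so it is the only one that can be checked without a calendar lookup. A minimal sketch, assuming both end dates are inclusive and using a hypothetical helper name, `in_holiday_freeze`:

```shell
# in_holiday_freeze MMDD
# MMDD is the month and day as four digits, as printed by `date -u +%m%d`.
# Exits 0 when the date falls between December 15 and January 5
# (both inclusive, which is an assumption), 1 otherwise.
in_holiday_freeze() {
  # Strip leading zeros so January dates such as 0105 compare as 105.
  mmdd=$(printf '%s' "$1" | sed 's/^0*//')
  mmdd=${mmdd:-0}
  # The freeze wraps around the year end, hence "or" instead of "and".
  [ "$mmdd" -ge 1215 ] || [ "$mmdd" -le 105 ]
}

# Usage:
# if in_holiday_freeze "$(date -u +%m%d)"; then echo "Holiday freeze: no deployments"; fi
```

The other blackout periods (onboarding events, launches, high-traffic periods) depend on a shared calendar and still need a manual check.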