Overview
This runbook covers operational procedures for managing SO1 DevOps agents, including Railway deployments, GitHub Actions workflow management, and CI/CD pipeline auditing. These procedures ensure reliable infrastructure operations and deployment automation.
- Purpose: Provide step-by-step instructions for deploying services, managing CI/CD pipelines, and auditing infrastructure
- Scope: Railway platform operations, GitHub Actions workflows, pipeline security audits, infrastructure monitoring
- Target Audience: DevOps engineers, platform operators, SREs
Prerequisites
Required Access
- Railway account access (team: `so1-io`)
- Railway API token (`RAILWAY_API_TOKEN`)
- GitHub organization access (`so1-io`)
- GitHub Personal Access Token with `repo` and `workflow` scopes
- Control Plane API access (`CONTROL_PLANE_API_KEY`)
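A quick preflight check can catch missing credentials before a procedure starts. The following is a minimal sketch, assuming the two tokens are exposed as environment variables under the names listed above; the helper name `missing_env_vars` is hypothetical, and the GitHub token is left out because this runbook does not name a variable for it.

```python
from __future__ import annotations

import os
from typing import Mapping, Sequence

# Variable names taken from the access list above.
REQUIRED_ENV_VARS = ("RAILWAY_API_TOKEN", "CONTROL_PLANE_API_KEY")


def missing_env_vars(
    names: Sequence[str] = REQUIRED_ENV_VARS,
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the names that are unset or blank in the given environment."""
    return [name for name in names if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment; an empty list means both variables are set.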
Required Tools
- Railway CLI (`railway` command)
- GitHub CLI (`gh` command)
- `curl` or API client
- `jq` for JSON parsing
- Docker CLI (for local testing)
- OpenCode with DevOps agents installed
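The tool list above can be checked the same way. This is a small sketch, assuming each tool is installed under its usual executable name (`railway`, `gh`, `curl`, `jq`, `docker`); the helper name `missing_tools` is hypothetical, and OpenCode is not checked because this runbook does not state its executable name.

```python
from __future__ import annotations

import shutil
from typing import Callable, Iterable, Optional

# Executable names assumed from the tool list above. curl may be replaced by
# another API client, so treat a missing curl as a warning, not a failure.
REQUIRED_TOOLS = ("railway", "gh", "curl", "jq", "docker")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use `missing_tools()` relies on `shutil.which`.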
Required Knowledge
- Understanding of Railway platform concepts (projects, services, environments)
- Familiarity with GitHub Actions YAML syntax
- Basic knowledge of CI/CD principles
- Understanding of infrastructure as code
Procedure 1: Deploy Service to Railway
Step 1: Prepare Service Configuration
Define service requirements:
Step 2: Invoke Railway Deployer Agent
Step 3: Monitor Deployment Progress
Step 4: Verify Deployment Health
Step 5: Configure Custom Domain (Optional)
Procedure 2: Manage GitHub Actions CI/CD
Step 1: Review Current Workflows
Step 2: Generate Workflow with GitHub Actions Agent
Step 3: Deploy Workflow to Repository
Step 4: Trigger Workflow Run
Step 5: Review Workflow Results
Procedure 3: Audit CI/CD Pipelines
Step 1: Invoke Pipeline Auditor Agent
Step 2: Review Critical Issues
Step 3: Apply Remediation
Step 4: Re-run Audit to Verify Fixes
Procedure 4: Rollback Railway Deployment
Step 1: Identify Previous Deployment
Step 2: Initiate Rollback
Step 3: Monitor Rollback
Step 4: Verify Rollback Success
Verification Checklist
After completing DevOps operations, verify:
Troubleshooting
| Issue | Symptoms | Root Cause | Resolution |
|---|---|---|---|
| Deployment Failed | Build errors in Railway logs | Missing dependencies, build script errors | Check package.json, verify buildCommand in railway.yaml |
| Service Not Starting | Deployment succeeds but service crashes | Runtime errors, missing env vars | Check Railway logs, verify startCommand and environment variables |
| Health Check Failing | Railway marks service as unhealthy | Wrong health check path, service not listening on PORT | Update healthCheckPath, ensure app binds to process.env.PORT |
| GitHub Actions Timeout | Workflow runs exceed 6 hours | Long-running tests, inefficient builds | Add caching (actions/cache), parallelize jobs, optimize test suite |
| Secret Not Found | Workflow fails with “secret not found” error | Secret not configured in repo settings | Add secret in GitHub repo settings → Secrets and variables → Actions |
| Railway API 401 | API calls return Unauthorized | Expired or invalid API token | Generate new token at railway.app/account/tokens |
| Domain Not Resolving | Custom domain returns 404 | DNS not configured or propagating | Update DNS CNAME record, wait for propagation (up to 48h) |
| Rollback Failed | Rollback deployment crashes | Target deployment incompatible with current data | Restore database snapshot, apply backward migrations |
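The "Health Check Failing" row depends on the service listening on the port the platform injects. The table names `process.env.PORT` (Node); the sketch below shows the same idea in Python using only the standard library. The `/health` path, the `8080` fallback, and the names `HealthHandler` and `make_server` are assumptions for illustration, not values taken from this runbook.

```python
from __future__ import annotations

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_PATH = "/health"  # hypothetical; must match the configured health check path


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Answer 200 on the health path and 404 everywhere else.
        if self.path == HEALTH_PATH:
            status, body = 200, b"ok"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep the example quiet


def make_server(port: int | None = None) -> HTTPServer:
    """Bind to the PORT environment variable (assumed fallback: 8080)."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    # Listen on all interfaces so the platform's health probe can reach it.
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Running `make_server().serve_forever()` starts the listener; a health check pointed at a different path or port than the one the app serves would produce the symptom described in the table.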
Detailed Troubleshooting: Deployment Failed
Detailed Troubleshooting: GitHub Actions Failing
Related Resources
- Railway Deployer Agent: Railway platform deployments
- GitHub Actions Agent: CI/CD workflow generation
- Pipeline Auditor Agent: Security and compliance auditing
- Deployment Runbook: End-to-end deployment procedures
Best Practices
Railway Deployments
- Always use health checks to ensure Railway can verify service health
- Set resource limits to prevent runaway costs
- Use Railway environments (staging, production) for safe deployments
- Enable auto-scaling for services with variable load
- Monitor deployment metrics: CPU, memory, request rate, error rate
GitHub Actions
- Cache dependencies to speed up workflows (can reduce time by 50-80%)
- Use matrix builds for testing across multiple Node versions/OSes
- Implement branch protection requiring CI checks to pass before merge
- Store secrets securely in GitHub Secrets, never in code
- Use concurrency controls to cancel outdated workflow runs
Pipeline Security
- Run pipeline audits monthly to catch new vulnerabilities
- Use CODEOWNERS file to require reviews on workflow changes
- Enable branch protection on main/production branches
- Rotate secrets regularly (every 90 days)
- Use least-privilege tokens (e.g., GITHUB_TOKEN with minimal permissions)
- Pin action versions to specific commits or tags
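Pinning can be checked mechanically. The sketch below scans workflow text for `uses:` references and reports any that are not pinned to a full 40-character commit SHA. It enforces the stricter of the two options named above, since a tag can be moved to a different commit after the fact. The helper name `unpinned_actions` is hypothetical, and the regex is a simplification that does not parse YAML.

```python
from __future__ import annotations

import re

# Matches a `uses:` entry such as "- uses: actions/checkout@v4".
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^\s'\"#]+)", re.MULTILINE)
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return `uses:` references not pinned to a full commit SHA."""
    unpinned = []
    for ref in USES_RE.findall(workflow_text):
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and Docker images are out of scope here
        _, _, version = ref.partition("@")
        if not FULL_SHA_RE.match(version):
            unpinned.append(ref)
    return unpinned
```

To accept tags as well, as the bullet above allows, the SHA test could be relaxed to flag only references with no `@version` part; which policy fits is a judgment call for the team.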