Overview

This runbook covers operational procedures for managing SO1 DevOps agents, including Railway deployments, GitHub Actions workflow management, and CI/CD pipeline auditing. These procedures ensure reliable infrastructure operations and deployment automation.

Purpose: Provide step-by-step instructions for deploying services, managing CI/CD pipelines, and auditing infrastructure.
Scope: Railway platform operations, GitHub Actions workflows, pipeline security audits, infrastructure monitoring.
Target Audience: DevOps engineers, platform operators, SREs.

Prerequisites

  • Railway account access (team: so1-io)
  • Railway API token (RAILWAY_API_TOKEN)
  • GitHub organization access (so1-io)
  • GitHub Personal Access Token with repo, workflow scopes
  • Control Plane API access (CONTROL_PLANE_API_KEY)
  • Railway CLI (railway command)
  • GitHub CLI (gh command)
  • curl or API client
  • jq for JSON parsing
  • Docker CLI (for local testing)
  • OpenCode with DevOps agents installed
  • Understanding of Railway platform concepts (projects, services, environments)
  • Familiarity with GitHub Actions YAML syntax
  • Basic knowledge of CI/CD principles
  • Understanding of infrastructure as code
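
Before starting any procedure, it can help to confirm that the tools and credentials listed above are actually present in the current shell. The following is a minimal sketch, not part of the SO1 tooling: the function names are hypothetical, and the GitHub token is left out because the runbook does not name the variable that holds it.

```shell
#!/bin/sh
# Print the name of every required tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Print the name of every required environment variable that is unset or empty.
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Check the prerequisites from this runbook; returns non-zero if anything is missing.
preflight() {
  tools=$(missing_tools railway gh curl jq docker)
  vars=$(missing_vars RAILWAY_API_TOKEN CONTROL_PLANE_API_KEY)
  [ -z "$tools" ] || echo "Missing tools: $tools" >&2
  [ -z "$vars" ] || echo "Missing variables: $vars" >&2
  [ -z "$tools" ] && [ -z "$vars" ]
}
```

Running `preflight && echo "ready"` at the start of a session surfaces a missing CLI or token before a deployment is half-way through.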

Procedure 1: Deploy Service to Railway

Step 1: Prepare Service Configuration

Define service requirements:
# railway.yaml
service:
  name: control-plane-api
  runtime: nodejs
  buildCommand: npm run build
  startCommand: npm start
  healthCheckPath: /health
  healthCheckInterval: 30
  environment:
    NODE_ENV: production
    PORT: 3000

Step 2: Invoke Railway Deployer Agent

curl -X POST https://control-plane.so1.io/api/v1/agents/railway-deployer/deploy \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "service_name": "control-plane-api",
    "project": "so1-platform",
    "environment": "production",
    "source": {
      "type": "github",
      "repo": "so1-io/control-plane",
      "branch": "main"
    },
    "config": {
      "buildCommand": "npm run build",
      "startCommand": "npm start",
      "healthCheckPath": "/health",
      "healthCheckInterval": 30,
      "region": "us-west-2"
    },
    "environment_variables": {
      "NODE_ENV": "production",
      "DATABASE_URL": "${DATABASE_URL}",
      "REDIS_URL": "${REDIS_URL}"
    },
    "resources": {
      "memory": "2GB",
      "cpu": "2"
    }
  }' | jq '.'
Expected Response:
{
  "deployment_id": "dep_5Xy9mPqRs",
  "service_id": "srv_3TnVw9Kj2m",
  "status": "deploying",
  "url": "https://control-plane-api.railway.app",
  "estimated_completion": "2026-03-10T14:35:00Z",
  "steps": [
    {"step": "build", "status": "in_progress"},
    {"step": "deploy", "status": "pending"},
    {"step": "health_check", "status": "pending"}
  ]
}
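
Note that the request body in Step 2 is wrapped in single quotes, so the shell sends "${DATABASE_URL}" and "${REDIS_URL}" as literal text rather than expanding them. If the intent is to pass the values from the local shell (an assumption; the agent may instead resolve these placeholders server-side), the environment_variables object can be built separately. The sketch below uses hypothetical helper names and only escapes backslashes and double quotes, so it is not a full JSON encoder.

```shell
#!/bin/sh
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# Control characters are not handled; this is a sketch, not a full encoder.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the "environment_variables" object from the current shell environment.
build_env_json() {
  printf '{"NODE_ENV":"production","DATABASE_URL":"%s","REDIS_URL":"%s"}' \
    "$(json_escape "${DATABASE_URL:-}")" \
    "$(json_escape "${REDIS_URL:-}")"
}
```

The output of `build_env_json` can then be spliced into the payload with the same quote-breaking idiom the rollback request in Procedure 4 uses, for example `"environment_variables": '"$(build_env_json)"',`.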

Step 3: Monitor Deployment Progress

# Watch deployment logs
railway logs --service control-plane-api --environment production --follow

# Check deployment status
curl -s https://backboard.railway.app/graphql \
  -H "Authorization: Bearer ${RAILWAY_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "query { deployment(id: \"dep_5Xy9mPqRs\") { status, logs { message, timestamp } } }"
  }' | jq '.'

Step 4: Verify Deployment Health

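# Wait for deployment to complete (stops on failure or after ~10 minutes)
STATUS=""
for attempt in $(seq 1 120); do
  STATUS=$(curl -s https://backboard.railway.app/graphql \
    -H "Authorization: Bearer ${RAILWAY_API_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"query": "query { deployment(id: \"dep_5Xy9mPqRs\") { status } }"}' \
    | jq -r '.data.deployment.status')
  case "$STATUS" in
    SUCCESS) break ;;
    FAILED|CRASHED) echo "Deployment ended with status ${STATUS}" >&2; break ;;
  esac
  echo "Waiting for deployment... (status: ${STATUS})"
  sleep 5
done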

# Check health endpoint
curl -s https://control-plane-api.railway.app/health | jq '.'

# Expected response:
# {
#   "status": "healthy",
#   "version": "1.2.3",
#   "timestamp": "2026-03-10T14:35:45Z",
#   "checks": {
#     "database": "up",
#     "redis": "up"
#   }
# }

Step 5: Configure Custom Domain (Optional)

# Add custom domain via Railway CLI
railway domain add control-plane.so1.io --service control-plane-api

# Verify DNS configuration
dig control-plane.so1.io CNAME

# Wait for SSL certificate provisioning
railway domain list --service control-plane-api

Procedure 2: Manage GitHub Actions CI/CD

Step 1: Review Current Workflows

# List all GitHub Actions workflows
gh workflow list --repo so1-io/control-plane

# Get workflow details
gh workflow view ci-cd.yml --repo so1-io/control-plane

Step 2: Generate Workflow with GitHub Actions Agent

curl -X POST https://control-plane.so1.io/api/v1/agents/github-actions/generate \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "workflow_name": "ci-cd",
    "repository": "so1-io/control-plane",
    "triggers": ["push", "pull_request"],
    "branches": ["main", "develop"],
    "jobs": [
      {
        "name": "test",
        "runner": "ubuntu-latest",
        "steps": [
          "checkout",
          "setup_node_20",
          "install_dependencies",
          "run_tests",
          "upload_coverage"
        ]
      },
      {
        "name": "build",
        "runner": "ubuntu-latest",
        "needs": ["test"],
        "steps": [
          "checkout",
          "setup_node_20",
          "build_application",
          "build_docker_image"
        ]
      },
      {
        "name": "deploy",
        "runner": "ubuntu-latest",
        "needs": ["build"],
        "if": "github.ref == '\''refs/heads/main'\''",
        "environment": "production",
        "steps": [
          "checkout",
          "deploy_to_railway"
        ]
      }
    ],
    "secrets": [
      "RAILWAY_API_TOKEN",
      "CONTROL_PLANE_API_KEY"
    ]
  }' | jq '.'
Expected Response:
{
  "generation_id": "gen_7TnVw9Kj2m",
  "workflow_file": ".github/workflows/ci-cd.yml",
  "content": "name: CI/CD\n\non:\n  push:\n    branches: [main, develop]\n  pull_request:\n    branches: [main, develop]\n\njobs:\n  test:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n      - uses: actions/setup-node@v4\n        with:\n          node-version: '20'\n          cache: 'npm'\n      - run: npm ci\n      - run: npm test\n      - uses: codecov/codecov-action@v3\n\n  build:\n    runs-on: ubuntu-latest\n    needs: test\n    steps:\n      - uses: actions/checkout@v4\n      - uses: actions/setup-node@v4\n        with:\n          node-version: '20'\n      - run: npm ci\n      - run: npm run build\n      - uses: docker/build-push-action@v5\n        with:\n          context: .\n          push: false\n\n  deploy:\n    runs-on: ubuntu-latest\n    needs: build\n    if: github.ref == 'refs/heads/main'\n    environment: production\n    steps:\n      - uses: actions/checkout@v4\n      - name: Deploy to Railway\n        env:\n          RAILWAY_TOKEN: ${{ secrets.RAILWAY_API_TOKEN }}\n        run: |\n          npm install -g @railway/cli\n          railway up --service control-plane-api\n"
}

Step 3: Deploy Workflow to Repository

# Save generated workflow (create the directory first if the repo has no workflows yet)
mkdir -p .github/workflows
curl -s https://control-plane.so1.io/api/v1/agents/github-actions/generations/gen_7TnVw9Kj2m \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq -r '.content' > .github/workflows/ci-cd.yml

# Commit and push
git add .github/workflows/ci-cd.yml
git commit -m "Add CI/CD workflow generated by github-actions agent"
git push origin main

# Verify workflow is active
gh workflow view ci-cd.yml --repo so1-io/control-plane
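
Because `jq -r '.content'` prints the word null when the field is missing, a failed API call can leave a broken file in .github/workflows. A quick sanity check before committing catches that case. This is a rough sketch with a hypothetical function name; it only looks for the top-level keys and does not validate the YAML itself.

```shell
#!/bin/sh
# Return 0 if the file looks like a workflow; non-zero (with a message) otherwise.
workflow_looks_valid() {
  file=$1
  if [ ! -s "$file" ]; then
    echo "$file is missing or empty" >&2
    return 1
  fi
  # jq -r prints the literal word "null" when .content is absent.
  if [ "$(head -n 1 "$file")" = "null" ]; then
    echo "$file contains 'null'; the API response had no .content" >&2
    return 1
  fi
  # Every workflow needs top-level "on:" and "jobs:" keys.
  if ! grep -q '^on:' "$file" || ! grep -q '^jobs:' "$file"; then
    echo "$file has no top-level on:/jobs: keys" >&2
    return 1
  fi
  return 0
}
```

For example, `workflow_looks_valid .github/workflows/ci-cd.yml && git add .github/workflows/ci-cd.yml` keeps an empty or null file out of the commit.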

Step 4: Trigger Workflow Run

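# Manually trigger workflow
# (requires a workflow_dispatch trigger under "on:"; the workflow generated in
# Step 2 only declares push and pull_request, so add workflow_dispatch first)
gh workflow run ci-cd.yml --repo so1-io/control-plane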

# Watch workflow execution
gh run watch --repo so1-io/control-plane

# Get run details
gh run list --workflow ci-cd.yml --repo so1-io/control-plane --limit 1

Step 5: Review Workflow Results

# Get run logs
RUN_ID=$(gh run list --workflow ci-cd.yml --repo so1-io/control-plane --limit 1 --json databaseId --jq '.[0].databaseId')
gh run view "$RUN_ID" --repo so1-io/control-plane --log

# Check for failures
gh run view "$RUN_ID" --repo so1-io/control-plane --json conclusion,jobs \
  | jq '.jobs[] | select(.conclusion != "success")'

Procedure 3: Audit CI/CD Pipelines

Step 1: Invoke Pipeline Auditor Agent

curl -X POST https://control-plane.so1.io/api/v1/agents/pipeline-auditor/audit \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "scope": "organization",
    "organization": "so1-io",
    "repositories": ["control-plane", "console", "so1-agents"],
    "audit_criteria": [
      "security",
      "compliance",
      "performance",
      "best_practices"
    ],
    "include_recommendations": true
  }' | jq '.'
Expected Response:
{
  "audit_id": "audit_2mP7nQzXy",
  "timestamp": "2026-03-10T14:40:00Z",
  "summary": {
    "total_pipelines": 15,
    "critical_issues": 2,
    "warnings": 8,
    "passed": 5
  },
  "findings": [
    {
      "severity": "critical",
      "category": "security",
      "repository": "control-plane",
      "workflow": "ci-cd.yml",
      "issue": "Secrets exposed in workflow logs",
      "description": "Secret values are being printed to logs in deploy job",
      "recommendation": "Use GitHub Actions secret masking and avoid echoing secrets",
      "remediation": {
        "type": "code_change",
        "diff": "- run: echo ${{ secrets.DATABASE_URL }}\n+ run: echo 'Deploying with configured secrets'"
      }
    },
    {
      "severity": "warning",
      "category": "performance",
      "repository": "console",
      "workflow": "build-test.yml",
      "issue": "No caching configured for npm dependencies",
      "description": "npm install runs on every workflow execution without cache",
      "recommendation": "Use actions/cache or actions/setup-node cache feature",
      "impact": {
        "current_avg_time": "3m 45s",
        "estimated_improvement": "-2m 30s"
      }
    }
  ]
}

Step 2: Review Critical Issues

# Get critical findings only
curl -s https://control-plane.so1.io/api/v1/agents/pipeline-auditor/audits/audit_2mP7nQzXy \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq '.findings[] | select(.severity == "critical")'

# Generate detailed report
curl -s https://control-plane.so1.io/api/v1/agents/pipeline-auditor/audits/audit_2mP7nQzXy/report \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  -H "Accept: application/pdf" \
  > pipeline-audit-report.pdf

Step 3: Apply Remediation

# For each critical/high severity issue:

# 1. Create branch for fix
git checkout -b fix/pipeline-security-issues

# 2. Apply recommended changes
# (from remediation.diff in audit results)

# 3. Commit and create PR
git add .github/workflows/ci-cd.yml
git commit -m "Fix security issues identified in pipeline audit (audit_2mP7nQzXy)"
git push origin fix/pipeline-security-issues

gh pr create \
  --title "Fix pipeline security issues" \
  --body "Addresses critical findings from pipeline audit audit_2mP7nQzXy" \
  --repo so1-io/control-plane

Step 4: Re-run Audit to Verify Fixes

# Wait for PR merge and workflow completion

# Re-run audit
curl -X POST https://control-plane.so1.io/api/v1/agents/pipeline-auditor/audit \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "scope": "repository",
    "repository": "so1-io/control-plane",
    "compare_to_audit": "audit_2mP7nQzXy"
  }' | jq '.summary'

# Expected: critical_issues reduced to 0
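
The "critical_issues reduced to 0" expectation can be turned into a check that fails loudly, which is useful if the re-audit runs from a script. The sketch below is illustrative only: the function name is hypothetical, and it takes the two counts as arguments so that the comparison logic stays independent of how they are fetched.

```shell
#!/bin/sh
# Compare critical-issue counts from two audits.
# Returns 0 only when the new audit reports zero critical issues.
check_audit_fixed() {
  before=$1
  after=$2
  if [ "$after" -eq 0 ]; then
    echo "PASS: critical issues $before -> 0"
    return 0
  elif [ "$after" -lt "$before" ]; then
    echo "PARTIAL: critical issues $before -> $after"
    return 1
  else
    echo "FAIL: critical issues $before -> $after"
    return 1
  fi
}
```

The second argument would typically come from the re-audit response, for instance by piping it through `jq -r '.summary.critical_issues'`, while the first is the count from the original audit (2 in the example above).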

Procedure 4: Rollback Railway Deployment

Step 1: Identify Previous Deployment

# List recent deployments
railway deployments --service control-plane-api --limit 10

# Get deployment details
curl -s https://backboard.railway.app/graphql \
  -H "Authorization: Bearer ${RAILWAY_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "query { service(id: \"srv_3TnVw9Kj2m\") { deployments(first: 10) { edges { node { id, status, createdAt, meta { repo, branch, commitHash } } } } } }"
  }' | jq '.data.service.deployments.edges[].node | {id, status, createdAt, commit: .meta.commitHash}'

Step 2: Initiate Rollback

# Rollback to previous deployment
PREVIOUS_DEPLOYMENT_ID="dep_4WxYz8LmNo"

curl -X POST https://control-plane.so1.io/api/v1/agents/railway-deployer/rollback \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "service_id": "srv_3TnVw9Kj2m",
    "target_deployment_id": "'"${PREVIOUS_DEPLOYMENT_ID}"'",
    "reason": "Critical bug in production deployment",
    "notify_team": true
  }' | jq '.'
Expected Response:
{
  "rollback_id": "rb_9Kj2mP7nQz",
  "status": "in_progress",
  "from_deployment": "dep_5Xy9mPqRs",
  "to_deployment": "dep_4WxYz8LmNo",
  "estimated_completion": "2026-03-10T14:50:00Z"
}

Step 3: Monitor Rollback

# Watch rollback logs
railway logs --service control-plane-api --environment production --follow

# Check rollback status
curl -s https://control-plane.so1.io/api/v1/agents/railway-deployer/rollbacks/rb_9Kj2mP7nQz \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" | jq '.status'
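
Checking the rollback status by hand works, but a small polling helper avoids re-running the command. The sketch below is a generic helper with a hypothetical name. The runbook only shows the in_progress status, so the terminal values "completed" and "failed" used here are assumptions that should be replaced with whatever the rollback endpoint actually returns.

```shell
#!/bin/sh
# poll_until <target> <max_attempts> <delay_seconds> <command...>
# Runs <command...> repeatedly until it prints <target>.
# Returns 0 on match, 1 if the command prints "failed", 2 on timeout.
poll_until() {
  target=$1
  max=$2
  delay=$3
  shift 3
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    status=$("$@")
    [ "$status" = "$target" ] && return 0
    # Treating "failed" as a terminal state is an assumption about the API.
    [ "$status" = "failed" ] && return 1
    echo "attempt $attempt/$max: status=$status" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 2
}

# Hypothetical usage: wrap the status call from this step in a function.
# rollback_status() {
#   curl -s https://control-plane.so1.io/api/v1/agents/railway-deployer/rollbacks/rb_9Kj2mP7nQz \
#     -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" | jq -r '.status'
# }
# poll_until completed 60 5 rollback_status
```

Passing the status command as arguments keeps the loop reusable for deployments as well, and the distinct return codes let a calling script tell a failed rollback from one that is merely slow.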

Step 4: Verify Rollback Success

# Check service health
curl -s https://control-plane-api.railway.app/health | jq '.'

# Verify deployment version
curl -s https://control-plane-api.railway.app/version | jq '.commit'

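# Compare with expected commit
# (HEAD~1 assumes the rollback target was built from the commit just before HEAD;
# otherwise use the commitHash listed for the target deployment in Step 1)
echo "Expected: $(git rev-parse HEAD~1)"
echo "Actual:   $(curl -s https://control-plane-api.railway.app/version | jq -r '.commit')"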
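
Comparing the two values by eye is error-prone, especially if the /version endpoint reports an abbreviated hash while git prints the full one (whether it does is not stated here, so this is an assumption). A small comparison helper, sketched below with hypothetical names, accepts either form.

```shell
#!/bin/sh
# Return 0 if the argument is 7 to 40 lowercase hex characters.
is_hex_sha() {
  case $1 in
    ''|*[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -ge 7 ] && [ "${#1}" -le 40 ]
}

# Return 0 when two commit identifiers refer to the same commit, allowing one
# to be an abbreviated prefix of the other. Returns 2 on malformed input
# (for example the word "null" from jq), 1 on a genuine mismatch.
commits_match() {
  a=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  b=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  is_hex_sha "$a" || return 2
  is_hex_sha "$b" || return 2
  case $a in "$b"*) return 0 ;; esac
  case $b in "$a"*) return 0 ;; esac
  return 1
}
```

Separating "malformed" from "mismatch" matters here: a null from the version endpoint means the check could not be performed, which is a different problem from the rollback landing on the wrong commit.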

Verification Checklist

After completing DevOps operations, verify:

Troubleshooting

| Issue | Symptoms | Root Cause | Resolution |
|---|---|---|---|
| Deployment Failed | Build errors in Railway logs | Missing dependencies, build script errors | Check package.json, verify buildCommand in railway.yaml |
| Service Not Starting | Deployment succeeds but service crashes | Runtime errors, missing env vars | Check Railway logs, verify startCommand and environment variables |
| Health Check Failing | Railway marks service as unhealthy | Wrong health check path, service not listening on PORT | Update healthCheckPath, ensure app binds to process.env.PORT |
| GitHub Actions Timeout | Workflow runs exceed 6 hours | Long-running tests, inefficient builds | Add caching (actions/cache), parallelize jobs, optimize test suite |
| Secret Not Found | Workflow fails with “secret not found” error | Secret not configured in repo settings | Add secret in GitHub repo settings → Secrets and variables → Actions |
| Railway API 401 | API calls return Unauthorized | Expired or invalid API token | Generate new token at railway.app/account/tokens |
| Domain Not Resolving | Custom domain returns 404 | DNS not configured or propagating | Update DNS CNAME record, wait for propagation (up to 48h) |
| Rollback Failed | Rollback deployment crashes | Target deployment incompatible with current data | Restore database snapshot, apply backward migrations |

Detailed Troubleshooting: Deployment Failed

# Get full build logs
railway logs --service control-plane-api --deployment dep_5Xy9mPqRs

# Common issues:

# 1. Missing NODE_ENV
railway variables --service control-plane-api
railway variables set NODE_ENV=production --service control-plane-api

# 2. Build script failing
# Check package.json build script
jq '.scripts.build' package.json

# 3. Dependency installation errors
# Force npm ci instead of npm install
railway run --service control-plane-api 'npm ci --verbose'

# 4. TypeScript compilation errors
railway run --service control-plane-api 'npx tsc --noEmit'

Detailed Troubleshooting: GitHub Actions Failing

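# Get workflow run details
RUN_ID=$(gh run list --workflow ci-cd.yml --repo so1-io/control-plane --limit 1 --json databaseId --jq '.[0].databaseId')
# --job expects a numeric job ID, not a job name, so look up the ID of the "test" job first
JOB_ID=$(gh run view "$RUN_ID" --repo so1-io/control-plane --json jobs --jq '.jobs[] | select(.name == "test") | .databaseId')
gh run view --repo so1-io/control-plane --log --job "$JOB_ID"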

# Common fixes:

# 1. Add dependency caching
# Update workflow:
- uses: actions/setup-node@v4
  with:
    node-version: '20'
    cache: 'npm'

# 2. Fix secret access
# Ensure secrets are defined in:
# GitHub Repo → Settings → Secrets and variables → Actions

# 3. Increase timeout for long jobs
jobs:
  test:
    timeout-minutes: 30

# 4. Split large jobs into parallel matrix
jobs:
  test:
    strategy:
      matrix:
        node: [18, 20]
        test-suite: [unit, integration, e2e]


Best Practices

Railway Deployments

  1. Always use health checks to ensure Railway can verify service health
  2. Set resource limits to prevent runaway costs:
    resources:
      memory: 2GB
      cpu: 2
    
  3. Use Railway environments (staging, production) for safe deployments
  4. Enable auto-scaling for services with variable load
  5. Monitor deployment metrics: CPU, memory, request rate, error rate

GitHub Actions

  1. Cache dependencies to speed up workflows (can reduce time by 50-80%)
  2. Use matrix builds for testing across multiple Node versions/OSes
  3. Implement branch protection requiring CI checks to pass before merge
  4. Store secrets securely in GitHub Secrets, never in code
  5. Use concurrency controls to cancel outdated workflow runs:
    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}
      cancel-in-progress: true
    

Pipeline Security

  1. Run pipeline audits monthly to catch new vulnerabilities
  2. Use CODEOWNERS file to require reviews on workflow changes
  3. Enable branch protection on main/production branches
  4. Rotate secrets regularly (every 90 days)
  5. Use least-privilege tokens (e.g., GITHUB_TOKEN with a minimal permissions block)
  6. Pin action versions to specific commits or tags:
    - uses: actions/checkout@v4.1.1
    # Better: use the full commit SHA and keep the tag as a trailing comment
    - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
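
To find where pinning is still missing, the workflow files can be scanned for `uses:` references that do not end in a full 40-character commit SHA. The following is a rough sketch with a hypothetical function name. It works line by line with grep rather than parsing YAML, so quoted or unusually formatted references may be reported inaccurately.

```shell
#!/bin/sh
# Print every "uses:" reference in the given workflow files that is not pinned
# to a full 40-character commit SHA. Local (./) and docker:// references are skipped.
unpinned_actions() {
  grep -hoE 'uses:[[:space:]]*[^[:space:]#]+' "$@" \
    | sed -E 's/^uses:[[:space:]]*//' \
    | grep -vE '^(\./|docker://)' \
    | grep -vE '@[0-9a-f]{40}$' \
    | sort -u
}
```

Running `unpinned_actions .github/workflows/*.yml` lists each tag-pinned or branch-pinned action once, which gives a concrete worklist for the monthly pipeline audit.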