Overview

This runbook covers operational procedures for monitoring SO1 platform health, collecting metrics, and managing alerts. These procedures ensure proactive detection of issues, rapid response to incidents, and continuous visibility into system performance.

Purpose: Provide step-by-step instructions for setting up monitoring, analyzing metrics, and responding to alerts.
Scope: Health checks, metrics collection, alerting rules, dashboard configuration, log analysis.
Target Audience: SREs, DevOps engineers, on-call operators.

Prerequisites

  • Control Plane API access (CONTROL_PLANE_API_KEY)
  • Railway project access (all services)
  • Vercel project access (Console)
  • n8n workflow access
  • Slack workspace access (alert channels)
  • Monitoring dashboard access (Grafana/DataDog)
  • curl or API client
  • Railway CLI (railway command)
  • jq for JSON parsing
  • Log analysis tools (grep, awk)
  • Monitoring agents (if applicable)
  • Understanding of SO1 architecture
  • Familiarity with HTTP status codes and API health patterns
  • Basic knowledge of metrics and observability
  • Understanding of alert severity levels

Procedure 1: Configure Health Checks

Step 1: Implement Health Endpoints

All services should expose a /health endpoint:
// Example: Hono health endpoint
// (`db` and `redis` below are assumed to be pre-configured clients,
// e.g. a Drizzle/postgres instance and an ioredis connection)
import { Hono } from 'hono';

const app = new Hono();

app.get('/health', async (c) => {
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkRedis(),
    checkExternalAPIs(),
  ]);

  const health = {
    status: checks.every(r => r.status === 'fulfilled') ? 'healthy' : 'degraded',
    timestamp: new Date().toISOString(),
    version: process.env.APP_VERSION || 'unknown',
    checks: {
      database: checks[0].status === 'fulfilled' ? 'up' : 'down',
      redis: checks[1].status === 'fulfilled' ? 'up' : 'down',
      external_apis: checks[2].status === 'fulfilled' ? 'up' : 'down',
    },
  };

  const statusCode = health.status === 'healthy' ? 200 : 503;
  return c.json(health, statusCode);
});

async function checkDatabase() {
  // Simple query to verify DB connection
  await db.execute('SELECT 1');
}

async function checkRedis() {
  await redis.ping();
}

async function checkExternalAPIs() {
  // Check critical external dependencies
  const response = await fetch('https://api.openai.com/v1/models', {
    headers: { 'Authorization': `Bearer ${process.env.OPENAI_API_KEY}` }
  });
  if (!response.ok) throw new Error('OpenAI API unavailable');
}

Step 2: Configure Railway Health Checks

# Set health check in Railway service settings
# (CLI flags vary by Railway CLI version; health checks can also be set in
# the dashboard or via railway.json's "healthcheckPath"/"healthcheckTimeout")
railway service update control-plane-api \
  --healthcheck-path /health \
  --healthcheck-interval 30

# Verify health check configuration
railway service show control-plane-api | grep -i health

Step 3: Test Health Endpoints

# Test all service health endpoints
SERVICES=(
  "https://control-plane.so1.io"
  "https://console.so1.io"
  "https://n8n.so1.io"
)

for service in "${SERVICES[@]}"; do
  echo "Testing: $service/health"
  response=$(curl -s -w "\n%{http_code}" "$service/health")
  body=$(echo "$response" | sed '$d')  # sed '$d' is portable; GNU-only alternative: head -n -1
  status=$(echo "$response" | tail -n 1)
  
  echo "Status: $status"
  echo "$body" | jq '.'
  echo "---"
done

Expected Healthy Response:
{
  "status": "healthy",
  "timestamp": "2026-03-10T15:00:00Z",
  "version": "1.2.3",
  "checks": {
    "database": "up",
    "redis": "up",
    "external_apis": "up"
  }
}

Procedure 2: Set Up Metrics Collection

Step 1: Instrument Application Code

// Example: Add metrics to Hono app
import { Hono } from 'hono';
import { timing } from 'hono/timing';

const app = new Hono();

// Request timing middleware
app.use('*', timing());

// Custom metrics endpoint
app.get('/metrics', async (c) => {
  const metrics = {
    timestamp: Date.now(),
    requests: {
      total: await getMetric('requests_total'),
      success: await getMetric('requests_success'),
      errors: await getMetric('requests_errors'),
    },
    response_times: {
      p50: await getMetric('response_time_p50'),
      p95: await getMetric('response_time_p95'),
      p99: await getMetric('response_time_p99'),
    },
    agents: {
      executions_total: await getMetric('agent_executions_total'),
      executions_success: await getMetric('agent_executions_success'),
      executions_failed: await getMetric('agent_executions_failed'),
    },
    database: {
      connections_active: await getMetric('db_connections_active'),
      query_duration_avg: await getMetric('db_query_duration_avg'),
    },
  };

  return c.json(metrics);
});

// Increment metrics on each request
app.use('*', async (c, next) => {
  await incrementMetric('requests_total');
  const start = Date.now();
  
  try {
    await next();
    await incrementMetric('requests_success');
  } catch (error) {
    await incrementMetric('requests_errors');
    throw error;
  } finally {
    const duration = Date.now() - start;
    await recordMetric('response_time', duration);
  }
});
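The p50/p95/p99 values returned by the `/metrics` endpoint have to be derived from the raw durations recorded by `recordMetric`. As a minimal sketch of the math involved (the in-memory array stands in for whatever store — Redis, a time-series DB — actually backs `getMetric`/`recordMetric`):

```typescript
// Nearest-rank percentile over recorded response times. In production the
// samples would live in Redis or a time-series store; this only illustrates
// how p50/p95/p99 are computed from raw duration_ms samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank method: index of the p-th percentile in the sorted list
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// Hypothetical sample window of duration_ms values
const durations = [120, 85, 240, 95, 1800, 110, 130, 90, 2100, 105];
const summary = {
  p50: percentile(durations, 50),
  p95: percentile(durations, 95),
  p99: percentile(durations, 99),
};
```

Note how a couple of slow outliers dominate p95/p99 while barely moving p50 — which is exactly why the best practices below recommend percentiles over averages.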

Step 2: Create Metrics Collection Workflow

# Create n8n workflow to collect metrics periodically
curl -X POST https://n8n.so1.io/api/v1/workflows \
  -H "X-N8N-API-KEY: ${N8N_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Metrics Collection",
    "nodes": [
      {
        "name": "Schedule",
        "type": "n8n-nodes-base.scheduleTrigger",
        "parameters": {
          "cronExpression": "*/1 * * * *"
        }
      },
      {
        "name": "Fetch Control Plane Metrics",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://control-plane.so1.io/metrics",
          "method": "GET"
        }
      },
      {
        "name": "Store Metrics",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://metrics-db.so1.io/api/v1/metrics",
          "method": "POST",
          "bodyParameters": {
            "service": "control-plane",
            "metrics": "={{$json}}"
          }
        }
      }
    ],
    "active": true
  }'

Step 3: Query Metrics

# Get recent metrics for Control Plane
curl -s https://control-plane.so1.io/metrics \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" | jq '.'

# Get aggregated metrics over time
curl -s "https://metrics-db.so1.io/api/v1/query?service=control-plane&metric=response_time_p95&timerange=1h" \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq '.datapoints'

# Get agent execution metrics
curl -s "https://control-plane.so1.io/api/v1/analytics/agent-executions?timeframe=24h" \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq '{
    total_executions: .total,
    success_rate: .success_rate,
    avg_duration_ms: .avg_duration_ms,
    by_agent: .by_agent
  }'

Procedure 3: Configure Alerting Rules

Step 1: Define Alert Thresholds

// Alert configuration
interface AlertRule {
  name: string;
  metric: string;
  condition: 'above' | 'below' | 'equals';
  threshold: number;
  duration: string; // e.g., "5m" = trigger if condition persists for 5 minutes
  severity: 'critical' | 'warning' | 'info';
  channels: string[];
}

const alertRules: AlertRule[] = [
  {
    name: 'High Error Rate',
    metric: 'error_rate',
    condition: 'above',
    threshold: 0.05, // 5%
    duration: '5m',
    severity: 'critical',
    channels: ['slack_engineering', 'pagerduty'],
  },
  {
    name: 'Slow API Response',
    metric: 'response_time_p95',
    condition: 'above',
    threshold: 2000, // 2000ms
    duration: '10m',
    severity: 'warning',
    channels: ['slack_engineering'],
  },
  {
    name: 'Database Connection Pool Exhausted',
    metric: 'db_connections_active',
    condition: 'above',
    threshold: 90, // 90% of max connections
    duration: '3m',
    severity: 'critical',
    channels: ['slack_engineering', 'pagerduty'],
  },
  {
    name: 'Agent Execution Failure Spike',
    metric: 'agent_failures_per_minute',
    condition: 'above',
    threshold: 10,
    duration: '5m',
    severity: 'warning',
    channels: ['slack_engineering'],
  },
];
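The `duration` field means "only fire if the breach persists". One way to sketch that evaluation, assuming one metric sample per minute (matching the `*/1` cron used by the collection workflow), is to require that many consecutive breaching samples:

```typescript
// Sketch of evaluating AlertRule objects against metric samples. The
// `duration` field ("5m") is approximated as 5 consecutive breaching
// samples at one sample per minute — an assumption, not a full scheduler.
type Condition = 'above' | 'below' | 'equals';

interface AlertRule {
  name: string;
  metric: string;
  condition: Condition;
  threshold: number;
  duration: string; // e.g. "5m"
  severity: 'critical' | 'warning' | 'info';
  channels: string[];
}

function breaches(rule: AlertRule, value: number): boolean {
  switch (rule.condition) {
    case 'above': return value > rule.threshold;
    case 'below': return value < rule.threshold;
    case 'equals': return value === rule.threshold;
  }
}

// Track consecutive breaches per rule; fire once the streak covers `duration`.
const streaks = new Map<string, number>();

function evaluate(rule: AlertRule, value: number): boolean {
  const needed = parseInt(rule.duration, 10); // "5m" -> 5 samples at 1/min
  const streak = breaches(rule, value) ? (streaks.get(rule.name) ?? 0) + 1 : 0;
  streaks.set(rule.name, streak);
  return streak >= needed;
}
```

A single spiky sample resets nothing permanently but also fires nothing — five consecutive `error_rate` samples above 0.05 are needed before "High Error Rate" triggers, which is what keeps transient blips from paging anyone.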

Step 2: Create Alert Workflow

# Create n8n workflow for alerting
curl -X POST https://n8n.so1.io/api/v1/workflows \
  -H "X-N8N-API-KEY: ${N8N_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Alert Manager",
    "nodes": [
      {
        "name": "Schedule Check",
        "type": "n8n-nodes-base.scheduleTrigger",
        "parameters": {
          "cronExpression": "*/1 * * * *"
        }
      },
      {
        "name": "Fetch Metrics",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://control-plane.so1.io/metrics"
        }
      },
      {
        "name": "Evaluate Alert Rules",
        "type": "n8n-nodes-base.function",
        "parameters": {
          "functionCode": "// Check if metrics exceed thresholds\nconst metrics = items[0].json;\nconst alerts = [];\n\nif (metrics.requests.errors / metrics.requests.total > 0.05) {\n  alerts.push({ rule: \"High Error Rate\", severity: \"critical\" });\n}\n\nif (metrics.response_times.p95 > 2000) {\n  alerts.push({ rule: \"Slow API Response\", severity: \"warning\" });\n}\n\nreturn alerts.map(a => ({ json: a }));"
        }
      },
      {
        "name": "Send Slack Alert",
        "type": "n8n-nodes-base.slack",
        "parameters": {
          "channel": "#engineering-alerts",
          "text": "🚨 Alert: {{$json.rule}} ({{$json.severity}})"
        }
      }
    ],
    "active": true
  }'

Step 3: Test Alerting

# Trigger test alert
curl -X POST https://n8n.so1.io/webhook/alerts/test \
  -H "Content-Type: application/json" \
  -d '{
    "rule": "Test Alert",
    "severity": "info",
    "message": "Testing alert delivery system",
    "timestamp": "'"$(date -Iseconds)"'"
  }'

# Verify alert received in Slack
# Check #engineering-alerts channel

Procedure 4: Analyze Logs

Step 1: Access Service Logs

# Railway logs for Control Plane
railway logs --service control-plane-api --environment production --tail 100

# Filter for errors
railway logs --service control-plane-api --environment production \
  | grep -i "error\|exception\|fail"

# Get logs for specific time range
railway logs --service control-plane-api \
  --since "2026-03-10T14:00:00Z" \
  --until "2026-03-10T15:00:00Z"

Step 2: Common Log Patterns

Error Rate Analysis:
# Count errors per minute in last hour
railway logs --service control-plane-api --since "1 hour ago" \
  | grep "ERROR" \
  | awk '{print $1}' \
  | cut -d: -f1,2 \
  | sort | uniq -c

# Find most common error messages
railway logs --service control-plane-api --since "1 hour ago" \
  | grep "ERROR" \
  | awk -F'ERROR' '{print $2}' \
  | sort | uniq -c | sort -rn | head -10

Slow Request Analysis:
# Find requests taking >2s
railway logs --service control-plane-api --since "1 hour ago" \
  | grep "duration_ms" \
  | awk '{for(i=1;i<=NF;i++){if($i~/duration_ms/){print $(i+1)}}}' \
  | awk -F'[":,]' '$2 > 2000 {print}' \
  | wc -l

Agent Execution Analysis:
# Get agent execution stats from logs
railway logs --service control-plane-api --since "1 hour ago" \
  | grep "agent_execution" \
  | jq -r '[.agent_id, .duration_ms, .status] | @tsv' \
  | awk '{
    agent[$1]++;
    total_time[$1]+=$2;
    if($3=="success") success[$1]++;
  } END {
    for(a in agent) {
      print a, agent[a], success[a], int(total_time[a]/agent[a]);
    }
  }' | column -t
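The same per-agent rollup the awk pipeline produces can be sketched in TypeScript, for cases where the log lines have already been parsed as JSON (field names here match the `jq` extraction above):

```typescript
// Per-agent execution rollup, equivalent to the awk pipeline above:
// count, successes, and average duration per agent_id.
interface ExecutionLog {
  agent_id: string;
  duration_ms: number;
  status: 'success' | 'failed';
}

interface AgentStats {
  executions: number;
  successes: number;
  avg_duration_ms: number;
}

function summarize(logs: ExecutionLog[]): Record<string, AgentStats> {
  const acc: Record<string, { n: number; ok: number; total: number }> = {};
  for (const log of logs) {
    const a = (acc[log.agent_id] ??= { n: 0, ok: 0, total: 0 });
    a.n += 1;
    a.total += log.duration_ms;
    if (log.status === 'success') a.ok += 1;
  }
  const out: Record<string, AgentStats> = {};
  for (const [id, a] of Object.entries(acc)) {
    out[id] = {
      executions: a.n,
      successes: a.ok,
      avg_duration_ms: Math.floor(a.total / a.n), // matches awk's int()
    };
  }
  return out;
}
```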

Step 3: Set Up Log Aggregation

For production systems, use centralized logging:
# Example: Configure log shipping to external service
# (DataDog, New Relic, Logtail, etc.)

# Railway automatically ships logs to stdout
# Configure log drain in Railway dashboard:
# Project Settings → Integrations → Add Log Drain

# Verify log drain working
curl -s https://logs-api.logtail.com/query \
  -H "Authorization: Bearer ${LOGTAIL_API_KEY}" \
  -d '{
    "query": "service:control-plane-api level:error",
    "timerange": "1h"
  }' | jq '.results | length'

Procedure 5: Create Monitoring Dashboard

Step 1: Define Dashboard Metrics

Key metrics to display:

Service Health

  • Uptime percentage
  • Health check status
  • Error rate
  • Response time (p50, p95, p99)

Agent Performance

  • Executions per minute
  • Success rate
  • Average duration
  • Failed agents breakdown

Infrastructure

  • CPU usage
  • Memory usage
  • Database connections
  • Network throughput

Business Metrics

  • Workflows created
  • Active users
  • API requests
  • Feature usage
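Of the metrics above, uptime percentage is the one that has to be derived rather than read directly. One minimal sketch, computed from a window of health-check samples (treating an empty window as "no observed downtime" — a deliberate assumption):

```typescript
// Uptime percentage over a window of /health probe results.
interface HealthSample {
  timestamp: number; // ms since epoch
  healthy: boolean;  // did the probe return 200?
}

function uptimePercent(samples: HealthSample[]): number {
  if (samples.length === 0) return 100; // no data: assume no observed downtime
  const up = samples.filter(s => s.healthy).length;
  return Math.round((up / samples.length) * 10000) / 100; // two decimals
}
```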

Step 2: Create Dashboard Configuration

{
  "dashboard": {
    "title": "SO1 Platform Overview",
    "refresh_interval": "30s",
    "panels": [
      {
        "title": "Service Health",
        "type": "status_grid",
        "queries": [
          {
            "name": "Control Plane",
            "endpoint": "https://control-plane.so1.io/health"
          },
          {
            "name": "Console",
            "endpoint": "https://console.so1.io/health"
          },
          {
            "name": "n8n",
            "endpoint": "https://n8n.so1.io/health"
          }
        ]
      },
      {
        "title": "Request Rate",
        "type": "line_chart",
        "query": "SELECT time, count(*) FROM requests GROUP BY time(1m)",
        "timerange": "1h"
      },
      {
        "title": "Error Rate",
        "type": "line_chart",
        "query": "SELECT time, (errors / total) * 100 FROM requests GROUP BY time(1m)",
        "threshold": {
          "warning": 1,
          "critical": 5
        }
      },
      {
        "title": "Response Time (p95)",
        "type": "line_chart",
        "query": "SELECT time, percentile(duration_ms, 95) FROM requests GROUP BY time(1m)",
        "threshold": {
          "warning": 1000,
          "critical": 2000
        }
      }
    ]
  }
}

Step 3: Access Dashboard

# Dashboard URL (example with Grafana)
echo "Dashboard: https://grafana.so1.io/d/platform-overview"

# Or create simple status page
curl -s https://control-plane.so1.io/api/v1/status/dashboard \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq '.'

Verification Checklist

After setting up monitoring and alerting, verify:

  • All /health endpoints return 200 with component-level checks
  • Railway health checks are configured for each service
  • The metrics collection workflow is active in n8n and storing datapoints
  • A test alert is delivered to the #engineering-alerts Slack channel
  • Logs are accessible via railway logs and the configured log drain
  • The dashboard loads and displays data in all panels

Troubleshooting

| Issue | Symptoms | Root Cause | Resolution |
|---|---|---|---|
| Health Check Failing | Railway marks service unhealthy | Service not responding, check timeout | Increase timeout, verify /health endpoint works, check service logs |
| Missing Metrics | Dashboard shows no data | Metrics collection workflow disabled | Check n8n workflow active status, verify metrics endpoint |
| Alerts Not Firing | No alerts despite high error rate | Alert threshold too high, rule disabled | Review alert configuration, test with lower threshold |
| Duplicate Alerts | Same alert firing repeatedly | No alert deduplication | Implement alert grouping, add cooldown period |
| Logs Not Showing | railway logs returns nothing | Service not writing to stdout | Update app logging to console.log, check Railway log drain |
| High Memory Usage | Service restarting frequently | Memory leak, insufficient resources | Analyze heap dumps, increase memory allocation |
| Slow Dashboard | Dashboard takes >10s to load | Too many metrics queries | Reduce query frequency, add caching, optimize queries |
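For the "Duplicate Alerts" row, a cooldown gate is the simplest fix: the same rule fires at most once per window. A minimal sketch (the 15-minute window is illustrative, not a platform default):

```typescript
// Cooldown gate for alert deduplication: suppress re-notification of the
// same rule until the cooldown window has elapsed.
const COOLDOWN_MS = 15 * 60 * 1000; // 15 minutes (illustrative)
const lastFired = new Map<string, number>();

function shouldNotify(ruleName: string, now: number = Date.now()): boolean {
  const last = lastFired.get(ruleName);
  if (last !== undefined && now - last < COOLDOWN_MS) return false;
  lastFired.set(ruleName, now);
  return true;
}
```

This would sit between the rule-evaluation step and the Slack node in the Alert Manager workflow; state would need to live somewhere durable (e.g. n8n static data or Redis) rather than in process memory.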

Detailed Troubleshooting: Health Check Failing

# Check if health endpoint is accessible
curl -v https://control-plane.so1.io/health

# Common issues:

# 1. Endpoint timing out
# Increase Railway health check timeout
railway service update control-plane-api --healthcheck-timeout 10

# 2. Health check path incorrect
railway service update control-plane-api --healthcheck-path /health

# 3. Service dependencies down
# Check each dependency separately
curl https://control-plane.so1.io/health | jq '.checks'

# If database is down:
railway shell --service control-plane-api
# Inside shell: check DB connection
psql $DATABASE_URL -c "SELECT 1"

# 4. Service crashed
railway logs --service control-plane-api --tail 100 | grep -i "error\|crash\|exit"


Best Practices

Health Checks

  1. Keep checks fast: Health checks should complete in <1s
  2. Check critical dependencies: Database, cache, external APIs
  3. Return meaningful status: Include component-level status
  4. Use proper status codes: 200 (healthy), 503 (degraded)
  5. Version your health checks: Include app version in response
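One way to enforce "keep checks fast" is to cap each dependency check with a timeout, so a single slow dependency cannot stall the whole /health response past its budget. A sketch using Promise.race:

```typescript
// Wrap a dependency check in a timeout so /health stays within its budget
// even when one dependency hangs.
function withTimeout<T>(check: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`check timed out after ${ms}ms`)), ms);
  });
  return Promise.race([check, timeout]).finally(() => clearTimeout(timer));
}
```

In the Procedure 1 endpoint this would wrap each entry in the Promise.allSettled call, e.g. `withTimeout(checkDatabase(), 800)`, so a timed-out check simply reports that component as `down`.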

Metrics Collection

  1. Collect at service boundaries: API requests, agent executions, DB queries
  2. Use percentiles, not averages: p50, p95, p99 for latency
  3. Tag metrics appropriately: service, environment, agent_id
  4. Don’t over-collect: Balance visibility with storage costs
  5. Aggregate over time: Reduce granularity for historical data
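"Aggregate over time" typically means downsampling: collapsing raw datapoints into coarser fixed buckets before archiving. A minimal averaging sketch (real systems often also keep min/max/count per bucket):

```typescript
// Downsample raw datapoints into fixed time buckets, averaging values —
// e.g. 1-minute points into 1-hour points for historical storage.
interface Point {
  timestamp: number; // ms since epoch
  value: number;
}

function downsample(points: Point[], bucketMs: number): Point[] {
  const buckets = new Map<number, { sum: number; n: number }>();
  for (const p of points) {
    // Align each point to the start of its bucket
    const key = Math.floor(p.timestamp / bucketMs) * bucketMs;
    const b = buckets.get(key) ?? { sum: 0, n: 0 };
    b.sum += p.value;
    b.n += 1;
    buckets.set(key, b);
  }
  return [...buckets.entries()]
    .sort(([a], [b]) => a - b)
    .map(([timestamp, b]) => ({ timestamp, value: b.sum / b.n }));
}
```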

Alerting

  1. Alert on symptoms, not causes: High error rate (symptom) vs. disk full (cause)
  2. Set appropriate thresholds: Avoid alert fatigue from too many false positives
  3. Use alert severity levels: critical, warning, info
  4. Include actionable information: Link to runbooks, dashboards
  5. Test alerts regularly: Monthly test of critical alert paths

Dashboard Design

  1. Start with service health: Show overall system status prominently
  2. Group related metrics: Service health, infrastructure, business metrics
  3. Use consistent colors: Green (good), yellow (warning), red (critical)
  4. Add context: Thresholds, baselines, trends
  5. Make it actionable: Link to logs, traces, related dashboards