Overview

This runbook covers operational procedures for monitoring SO1 platform health, collecting metrics, and managing alerts. These procedures ensure proactive detection of issues, rapid response to incidents, and continuous visibility into system performance.

Purpose: Provide step-by-step instructions for setting up monitoring, analyzing metrics, and responding to alerts.
Scope: Health checks, metrics collection, alerting rules, dashboard configuration, log analysis.
Target Audience: SREs, DevOps engineers, on-call operators.

Prerequisites

  • Control Plane API access (CONTROL_PLANE_API_KEY)
  • Railway project access (all services)
  • Vercel project access (Console)
  • n8n workflow access
  • Slack workspace access (alert channels)
  • Monitoring dashboard access (Grafana/DataDog)
  • curl or API client
  • Railway CLI (railway command)
  • jq for JSON parsing
  • Log analysis tools (grep, awk)
  • Monitoring agents (if applicable)
  • Understanding of SO1 architecture
  • Familiarity with HTTP status codes and API health patterns
  • Basic knowledge of metrics and observability
  • Understanding of alert severity levels

Procedure 1: Configure Health Checks

Step 1: Implement Health Endpoints

All services should expose a /health endpoint:
// Example: Hono health endpoint
// (`db` and `redis` below are assumed to be pre-configured clients,
// e.g. a Drizzle/postgres instance and an ioredis connection)
import { Hono } from 'hono';

const app = new Hono();

app.get('/health', async (c) => {
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkRedis(),
    checkExternalAPIs(),
  ]);

  const health = {
    status: checks.every(r => r.status === 'fulfilled') ? 'healthy' : 'degraded',
    timestamp: new Date().toISOString(),
    version: process.env.APP_VERSION || 'unknown',
    checks: {
      database: checks[0].status === 'fulfilled' ? 'up' : 'down',
      redis: checks[1].status === 'fulfilled' ? 'up' : 'down',
      external_apis: checks[2].status === 'fulfilled' ? 'up' : 'down',
    },
  };

  const statusCode = health.status === 'healthy' ? 200 : 503;
  return c.json(health, statusCode);
});

async function checkDatabase() {
  // Simple query to verify DB connection
  await db.execute('SELECT 1');
}

async function checkRedis() {
  await redis.ping();
}

async function checkExternalAPIs() {
  // Check critical external dependencies
  const response = await fetch('https://api.openai.com/v1/models', {
    headers: { 'Authorization': `Bearer ${process.env.OPENAI_API_KEY}` }
  });
  if (!response.ok) throw new Error('OpenAI API unavailable');
}

Step 2: Configure Railway Health Checks

# Set health check in Railway service settings
# (CLI flags vary by Railway CLI version; health checks can also be set in
# the dashboard or via railway.json's "healthcheckPath"/"healthcheckTimeout")
railway service update control-plane-api \
  --healthcheck-path /health \
  --healthcheck-interval 30

# Verify health check configuration
railway service show control-plane-api | grep -i health

Step 3: Test Health Endpoints

# Test all service health endpoints
SERVICES=(
  "https://control-plane.so1.io"
  "https://console.so1.io"
  "https://n8n.so1.io"
)

for service in "${SERVICES[@]}"; do
  echo "Testing: $service/health"
  response=$(curl -s -w "\n%{http_code}" "$service/health")
  body=$(echo "$response" | sed '$d')  # sed '$d' is portable; GNU-only alternative: head -n -1
  status=$(echo "$response" | tail -n 1)
  
  echo "Status: $status"
  echo "$body" | jq '.'
  echo "---"
done

Expected Healthy Response:
{
  "status": "healthy",
  "timestamp": "2026-03-10T15:00:00Z",
  "version": "1.2.3",
  "checks": {
    "database": "up",
    "redis": "up",
    "external_apis": "up"
  }
}

Procedure 2: Set Up Metrics Collection

Step 1: Instrument Application Code

// Example: Add metrics to Hono app
import { Hono } from 'hono';
import { timing } from 'hono/timing';

const app = new Hono();

// Request timing middleware
app.use('*', timing());

// Custom metrics endpoint
app.get('/metrics', async (c) => {
  const metrics = {
    timestamp: Date.now(),
    requests: {
      total: await getMetric('requests_total'),
      success: await getMetric('requests_success'),
      errors: await getMetric('requests_errors'),
    },
    response_times: {
      p50: await getMetric('response_time_p50'),
      p95: await getMetric('response_time_p95'),
      p99: await getMetric('response_time_p99'),
    },
    agents: {
      executions_total: await getMetric('agent_executions_total'),
      executions_success: await getMetric('agent_executions_success'),
      executions_failed: await getMetric('agent_executions_failed'),
    },
    database: {
      connections_active: await getMetric('db_connections_active'),
      query_duration_avg: await getMetric('db_query_duration_avg'),
    },
  };

  return c.json(metrics);
});

// Increment metrics on each request
app.use('*', async (c, next) => {
  await incrementMetric('requests_total');
  const start = Date.now();
  
  try {
    await next();
    await incrementMetric('requests_success');
  } catch (error) {
    await incrementMetric('requests_errors');
    throw error;
  } finally {
    const duration = Date.now() - start;
    await recordMetric('response_time', duration);
  }
});
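The p50/p95/p99 values returned by the `/metrics` endpoint have to be derived from the raw durations recorded by `recordMetric`. As a minimal sketch of the math involved (the in-memory array stands in for whatever store — Redis, a time-series DB — actually backs `getMetric`/`recordMetric`):

```typescript
// Nearest-rank percentile over recorded response times. In production the
// samples would live in Redis or a time-series store; this only illustrates
// how p50/p95/p99 are computed from raw duration_ms samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank method: index of the p-th percentile in the sorted list
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// Hypothetical sample window of duration_ms values
const durations = [120, 85, 240, 95, 1800, 110, 130, 90, 2100, 105];
const summary = {
  p50: percentile(durations, 50),
  p95: percentile(durations, 95),
  p99: percentile(durations, 99),
};
```

Note how a couple of slow outliers dominate p95/p99 while barely moving p50 — which is exactly why the best practices below recommend percentiles over averages.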

Step 2: Create Metrics Collection Workflow

# Create n8n workflow to collect metrics periodically
curl -X POST https://n8n.so1.io/api/v1/workflows \
  -H "X-N8N-API-KEY: ${N8N_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Metrics Collection",
    "nodes": [
      {
        "name": "Schedule",
        "type": "n8n-nodes-base.scheduleTrigger",
        "parameters": {
          "cronExpression": "*/1 * * * *"
        }
      },
      {
        "name": "Fetch Control Plane Metrics",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://control-plane.so1.io/metrics",
          "method": "GET"
        }
      },
      {
        "name": "Store Metrics",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://metrics-db.so1.io/api/v1/metrics",
          "method": "POST",
          "bodyParameters": {
            "service": "control-plane",
            "metrics": "={{$json}}"
          }
        }
      }
    ],
    "active": true
  }'

Step 3: Query Metrics

# Get recent metrics for Control Plane
curl -s https://control-plane.so1.io/metrics \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" | jq '.'

# Get aggregated metrics over time
curl -s "https://metrics-db.so1.io/api/v1/query?service=control-plane&metric=response_time_p95&timerange=1h" \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq '.datapoints'

# Get agent execution metrics
curl -s "https://control-plane.so1.io/api/v1/analytics/agent-executions?timeframe=24h" \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq '{
    total_executions: .total,
    success_rate: .success_rate,
    avg_duration_ms: .avg_duration_ms,
    by_agent: .by_agent
  }'

Procedure 3: Configure Alerting Rules

Step 1: Define Alert Thresholds

// Alert configuration
interface AlertRule {
  name: string;
  metric: string;
  condition: 'above' | 'below' | 'equals';
  threshold: number;
  duration: string; // e.g., "5m" = trigger if condition persists for 5 minutes
  severity: 'critical' | 'warning' | 'info';
  channels: string[];
}

const alertRules: AlertRule[] = [
  {
    name: 'High Error Rate',
    metric: 'error_rate',
    condition: 'above',
    threshold: 0.05, // 5%
    duration: '5m',
    severity: 'critical',
    channels: ['slack_engineering', 'pagerduty'],
  },
  {
    name: 'Slow API Response',
    metric: 'response_time_p95',
    condition: 'above',
    threshold: 2000, // 2000ms
    duration: '10m',
    severity: 'warning',
    channels: ['slack_engineering'],
  },
  {
    name: 'Database Connection Pool Exhausted',
    metric: 'db_connections_active',
    condition: 'above',
    threshold: 90, // 90% of max connections
    duration: '3m',
    severity: 'critical',
    channels: ['slack_engineering', 'pagerduty'],
  },
  {
    name: 'Agent Execution Failure Spike',
    metric: 'agent_failures_per_minute',
    condition: 'above',
    threshold: 10,
    duration: '5m',
    severity: 'warning',
    channels: ['slack_engineering'],
  },
];
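The `duration` field means "only fire if the breach persists". One way to sketch that evaluation, assuming one metric sample per minute (matching the `*/1` cron used by the collection workflow), is to require that many consecutive breaching samples:

```typescript
// Sketch of evaluating AlertRule objects against metric samples. The
// `duration` field ("5m") is approximated as 5 consecutive breaching
// samples at one sample per minute — an assumption, not a full scheduler.
type Condition = 'above' | 'below' | 'equals';

interface AlertRule {
  name: string;
  metric: string;
  condition: Condition;
  threshold: number;
  duration: string; // e.g. "5m"
  severity: 'critical' | 'warning' | 'info';
  channels: string[];
}

function breaches(rule: AlertRule, value: number): boolean {
  switch (rule.condition) {
    case 'above': return value > rule.threshold;
    case 'below': return value < rule.threshold;
    case 'equals': return value === rule.threshold;
  }
}

// Track consecutive breaches per rule; fire once the streak covers `duration`.
const streaks = new Map<string, number>();

function evaluate(rule: AlertRule, value: number): boolean {
  const needed = parseInt(rule.duration, 10); // "5m" -> 5 samples at 1/min
  const streak = breaches(rule, value) ? (streaks.get(rule.name) ?? 0) + 1 : 0;
  streaks.set(rule.name, streak);
  return streak >= needed;
}
```

A single spiky sample resets nothing permanently but also fires nothing — five consecutive `error_rate` samples above 0.05 are needed before "High Error Rate" triggers, which is what keeps transient blips from paging anyone.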

Step 2: Create Alert Workflow

# Create n8n workflow for alerting
curl -X POST https://n8n.so1.io/api/v1/workflows \
  -H "X-N8N-API-KEY: ${N8N_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Alert Manager",
    "nodes": [
      {
        "name": "Schedule Check",
        "type": "n8n-nodes-base.scheduleTrigger",
        "parameters": {
          "cronExpression": "*/1 * * * *"
        }
      },
      {
        "name": "Fetch Metrics",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://control-plane.so1.io/metrics"
        }
      },
      {
        "name": "Evaluate Alert Rules",
        "type": "n8n-nodes-base.function",
        "parameters": {
          "functionCode": "// Check if metrics exceed thresholds\nconst metrics = items[0].json;\nconst alerts = [];\n\nif (metrics.requests.errors / metrics.requests.total > 0.05) {\n  alerts.push({ rule: \"High Error Rate\", severity: \"critical\" });\n}\n\nif (metrics.response_times.p95 > 2000) {\n  alerts.push({ rule: \"Slow API Response\", severity: \"warning\" });\n}\n\nreturn alerts.map(a => ({ json: a }));"
        }
      },
      {
        "name": "Send Slack Alert",
        "type": "n8n-nodes-base.slack",
        "parameters": {
          "channel": "#engineering-alerts",
          "text": "🚨 Alert: {{$json.rule}} ({{$json.severity}})"
        }
      }
    ],
    "active": true
  }'

Step 3: Test Alerting

# Trigger test alert
curl -X POST https://n8n.so1.io/webhook/alerts/test \
  -H "Content-Type: application/json" \
  -d '{
    "rule": "Test Alert",
    "severity": "info",
    "message": "Testing alert delivery system",
    "timestamp": "'"$(date -Iseconds)"'"
  }'

# Verify alert received in Slack
# Check #engineering-alerts channel

Procedure 4: Analyze Logs

Step 1: Access Service Logs

# Railway logs for Control Plane
railway logs --service control-plane-api --environment production --tail 100

# Filter for errors
railway logs --service control-plane-api --environment production \
  | grep -i "error\|exception\|fail"

# Get logs for specific time range
railway logs --service control-plane-api \
  --since "2026-03-10T14:00:00Z" \
  --until "2026-03-10T15:00:00Z"

Step 2: Common Log Patterns

Error Rate Analysis:
# Count errors per minute in last hour
railway logs --service control-plane-api --since "1 hour ago" \
  | grep "ERROR" \
  | awk '{print $1}' \
  | cut -d: -f1,2 \
  | sort | uniq -c

# Find most common error messages
railway logs --service control-plane-api --since "1 hour ago" \
  | grep "ERROR" \
  | awk -F'ERROR' '{print $2}' \
  | sort | uniq -c | sort -rn | head -10

Slow Request Analysis:
# Find requests taking >2s
railway logs --service control-plane-api --since "1 hour ago" \
  | grep "duration_ms" \
  | awk '{for(i=1;i<=NF;i++){if($i~/duration_ms/){print $(i+1)}}}' \
  | awk -F'[":,]' '$2 > 2000 {print}' \
  | wc -l

Agent Execution Analysis:
# Get agent execution stats from logs
railway logs --service control-plane-api --since "1 hour ago" \
  | grep "agent_execution" \
  | jq -r '[.agent_id, .duration_ms, .status] | @tsv' \
  | awk '{
    agent[$1]++;
    total_time[$1]+=$2;
    if($3=="success") success[$1]++;
  } END {
    for(a in agent) {
      print a, agent[a], success[a], int(total_time[a]/agent[a]);
    }
  }' | column -t
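The same per-agent rollup the awk pipeline produces can be sketched in TypeScript, for cases where the log lines have already been parsed as JSON (field names here match the `jq` extraction above):

```typescript
// Per-agent execution rollup, equivalent to the awk pipeline above:
// count, successes, and average duration per agent_id.
interface ExecutionLog {
  agent_id: string;
  duration_ms: number;
  status: 'success' | 'failed';
}

interface AgentStats {
  executions: number;
  successes: number;
  avg_duration_ms: number;
}

function summarize(logs: ExecutionLog[]): Record<string, AgentStats> {
  const acc: Record<string, { n: number; ok: number; total: number }> = {};
  for (const log of logs) {
    const a = (acc[log.agent_id] ??= { n: 0, ok: 0, total: 0 });
    a.n += 1;
    a.total += log.duration_ms;
    if (log.status === 'success') a.ok += 1;
  }
  const out: Record<string, AgentStats> = {};
  for (const [id, a] of Object.entries(acc)) {
    out[id] = {
      executions: a.n,
      successes: a.ok,
      avg_duration_ms: Math.floor(a.total / a.n), // matches awk's int()
    };
  }
  return out;
}
```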

Step 3: Set Up Log Aggregation

For production systems, use centralized logging:
# Example: Configure log shipping to external service
# (DataDog, New Relic, Logtail, etc.)

# Railway automatically ships logs to stdout
# Configure log drain in Railway dashboard:
# Project Settings → Integrations → Add Log Drain

# Verify log drain working
curl -s https://logs-api.logtail.com/query \
  -H "Authorization: Bearer ${LOGTAIL_API_KEY}" \
  -d '{
    "query": "service:control-plane-api level:error",
    "timerange": "1h"
  }' | jq '.results | length'

Procedure 5: Create Monitoring Dashboard

Step 1: Define Dashboard Metrics

Key metrics to display:

Service Health

  • Uptime percentage
  • Health check status
  • Error rate
  • Response time (p50, p95, p99)

Agent Performance

  • Executions per minute
  • Success rate
  • Average duration
  • Failed agents breakdown

Infrastructure

  • CPU usage
  • Memory usage
  • Database connections
  • Network throughput

Business Metrics

  • Workflows created
  • Active users
  • API requests
  • Feature usage
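Of the metrics above, uptime percentage is the one that has to be derived rather than read directly. One minimal sketch, computed from a window of health-check samples (treating an empty window as "no observed downtime" — a deliberate assumption):

```typescript
// Uptime percentage over a window of /health probe results.
interface HealthSample {
  timestamp: number; // ms since epoch
  healthy: boolean;  // did the probe return 200?
}

function uptimePercent(samples: HealthSample[]): number {
  if (samples.length === 0) return 100; // no data: assume no observed downtime
  const up = samples.filter(s => s.healthy).length;
  return Math.round((up / samples.length) * 10000) / 100; // two decimals
}
```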

Step 2: Create Dashboard Configuration

{
  "dashboard": {
    "title": "SO1 Platform Overview",
    "refresh_interval": "30s",
    "panels": [
      {
        "title": "Service Health",
        "type": "status_grid",
        "queries": [
          {
            "name": "Control Plane",
            "endpoint": "https://control-plane.so1.io/health"
          },
          {
            "name": "Console",
            "endpoint": "https://console.so1.io/health"
          },
          {
            "name": "n8n",
            "endpoint": "https://n8n.so1.io/health"
          }
        ]
      },
      {
        "title": "Request Rate",
        "type": "line_chart",
        "query": "SELECT time, count(*) FROM requests GROUP BY time(1m)",
        "timerange": "1h"
      },
      {
        "title": "Error Rate",
        "type": "line_chart",
        "query": "SELECT time, (errors / total) * 100 FROM requests GROUP BY time(1m)",
        "threshold": {
          "warning": 1,
          "critical": 5
        }
      },
      {
        "title": "Response Time (p95)",
        "type": "line_chart",
        "query": "SELECT time, percentile(duration_ms, 95) FROM requests GROUP BY time(1m)",
        "threshold": {
          "warning": 1000,
          "critical": 2000
        }
      }
    ]
  }
}

Step 3: Access Dashboard

# Dashboard URL (example with Grafana)
echo "Dashboard: https://grafana.so1.io/d/platform-overview"

# Or create simple status page
curl -s https://control-plane.so1.io/api/v1/status/dashboard \
  -H "Authorization: Bearer ${CONTROL_PLANE_API_KEY}" \
  | jq '.'

Verification Checklist

After setting up monitoring and alerting, verify:

  • All /health endpoints return 200 with component-level checks
  • Railway health checks are configured for each service
  • The metrics collection workflow is active in n8n and storing datapoints
  • A test alert is delivered to the #engineering-alerts Slack channel
  • Logs are accessible via railway logs and the configured log drain
  • The dashboard loads and displays data in all panels

Troubleshooting

| Issue | Symptoms | Root Cause | Resolution |
|---|---|---|---|
| Health Check Failing | Railway marks service unhealthy | Service not responding, check timeout | Increase timeout, verify /health endpoint works, check service logs |
| Missing Metrics | Dashboard shows no data | Metrics collection workflow disabled | Check n8n workflow active status, verify metrics endpoint |
| Alerts Not Firing | No alerts despite high error rate | Alert threshold too high, rule disabled | Review alert configuration, test with lower threshold |
| Duplicate Alerts | Same alert firing repeatedly | No alert deduplication | Implement alert grouping, add cooldown period |
| Logs Not Showing | railway logs returns nothing | Service not writing to stdout | Update app logging to console.log, check Railway log drain |
| High Memory Usage | Service restarting frequently | Memory leak, insufficient resources | Analyze heap dumps, increase memory allocation |
| Slow Dashboard | Dashboard takes >10s to load | Too many metrics queries | Reduce query frequency, add caching, optimize queries |
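For the "Duplicate Alerts" row, a cooldown gate is the simplest fix: the same rule fires at most once per window. A minimal sketch (the 15-minute window is illustrative, not a platform default):

```typescript
// Cooldown gate for alert deduplication: suppress re-notification of the
// same rule until the cooldown window has elapsed.
const COOLDOWN_MS = 15 * 60 * 1000; // 15 minutes (illustrative)
const lastFired = new Map<string, number>();

function shouldNotify(ruleName: string, now: number = Date.now()): boolean {
  const last = lastFired.get(ruleName);
  if (last !== undefined && now - last < COOLDOWN_MS) return false;
  lastFired.set(ruleName, now);
  return true;
}
```

This would sit between the rule-evaluation step and the Slack node in the Alert Manager workflow; state would need to live somewhere durable (e.g. n8n static data or Redis) rather than in process memory.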

Detailed Troubleshooting: Health Check Failing

# Check if health endpoint is accessible
curl -v https://control-plane.so1.io/health

# Common issues:

# 1. Endpoint timing out
# Increase Railway health check timeout
railway service update control-plane-api --healthcheck-timeout 10

# 2. Health check path incorrect
railway service update control-plane-api --healthcheck-path /health

# 3. Service dependencies down
# Check each dependency separately
curl https://control-plane.so1.io/health | jq '.checks'

# If database is down:
railway shell --service control-plane-api
# Inside shell: check DB connection
psql $DATABASE_URL -c "SELECT 1"

# 4. Service crashed
railway logs --service control-plane-api --tail 100 | grep -i "error\|crash\|exit"


Best Practices

Health Checks

  1. Keep checks fast: Health checks should complete in <1s
  2. Check critical dependencies: Database, cache, external APIs
  3. Return meaningful status: Include component-level status
  4. Use proper status codes: 200 (healthy), 503 (degraded)
  5. Version your health checks: Include app version in response
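One way to enforce "keep checks fast" is to cap each dependency check with a timeout, so a single slow dependency cannot stall the whole /health response past its budget. A sketch using Promise.race:

```typescript
// Wrap a dependency check in a timeout so /health stays within its budget
// even when one dependency hangs.
function withTimeout<T>(check: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`check timed out after ${ms}ms`)), ms);
  });
  return Promise.race([check, timeout]).finally(() => clearTimeout(timer));
}
```

In the Procedure 1 endpoint this would wrap each entry in the Promise.allSettled call, e.g. `withTimeout(checkDatabase(), 800)`, so a timed-out check simply reports that component as `down`.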

Metrics Collection

  1. Collect at service boundaries: API requests, agent executions, DB queries
  2. Use percentiles, not averages: p50, p95, p99 for latency
  3. Tag metrics appropriately: service, environment, agent_id
  4. Don’t over-collect: Balance visibility with storage costs
  5. Aggregate over time: Reduce granularity for historical data
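"Aggregate over time" typically means downsampling: collapsing raw datapoints into coarser fixed buckets before archiving. A minimal averaging sketch (real systems often also keep min/max/count per bucket):

```typescript
// Downsample raw datapoints into fixed time buckets, averaging values —
// e.g. 1-minute points into 1-hour points for historical storage.
interface Point {
  timestamp: number; // ms since epoch
  value: number;
}

function downsample(points: Point[], bucketMs: number): Point[] {
  const buckets = new Map<number, { sum: number; n: number }>();
  for (const p of points) {
    // Align each point to the start of its bucket
    const key = Math.floor(p.timestamp / bucketMs) * bucketMs;
    const b = buckets.get(key) ?? { sum: 0, n: 0 };
    b.sum += p.value;
    b.n += 1;
    buckets.set(key, b);
  }
  return [...buckets.entries()]
    .sort(([a], [b]) => a - b)
    .map(([timestamp, b]) => ({ timestamp, value: b.sum / b.n }));
}
```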

Alerting

  1. Alert on symptoms, not causes: High error rate (symptom) vs. disk full (cause)
  2. Set appropriate thresholds: Avoid alert fatigue from too many false positives
  3. Use alert severity levels: critical, warning, info
  4. Include actionable information: Link to runbooks, dashboards
  5. Test alerts regularly: Monthly test of critical alert paths

Dashboard Design

  1. Start with service health: Show overall system status prominently
  2. Group related metrics: Service health, infrastructure, business metrics
  3. Use consistent colors: Green (good), yellow (warning), red (critical)
  4. Add context: Thresholds, baselines, trends
  5. Make it actionable: Link to logs, traces, related dashboards