Overview
This runbook covers operational procedures for monitoring SO1 platform health, collecting metrics, and managing alerts. These procedures ensure proactive detection of issues, rapid response to incidents, and continuous visibility into system performance.

Purpose: Provide step-by-step instructions for setting up monitoring, analyzing metrics, and responding to alerts
Scope: Health checks, metrics collection, alerting rules, dashboard configuration, log analysis
Target Audience: SREs, DevOps engineers, on-call operators

Prerequisites
Required Access
- Control Plane API access (`CONTROL_PLANE_API_KEY`)
- Railway project access (all services)
- Vercel project access (Console)
- n8n workflow access
- Slack workspace access (alert channels)
- Monitoring dashboard access (Grafana/DataDog)
Required Tools
- `curl` or API client
- Railway CLI (`railway` command)
- `jq` for JSON parsing
- Log analysis tools (`grep`, `awk`)
- Monitoring agents (if applicable)
Required Knowledge
- Understanding of SO1 architecture
- Familiarity with HTTP status codes and API health patterns
- Basic knowledge of metrics and observability
- Understanding of alert severity levels
Procedure 1: Configure Health Checks
Step 1: Implement Health Endpoints
All services should expose a `/health` endpoint:
Step 2: Configure Railway Health Checks
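Railway can probe the endpoint automatically. A sketch of the relevant `railway.json` deploy settings (the timeout value is illustrative; Railway measures it in seconds):

```json
{
  "deploy": {
    "healthcheckPath": "/health",
    "healthcheckTimeout": 100
  }
}
```

Raise the timeout if a service legitimately needs longer to boot; otherwise Railway will mark it unhealthy before it finishes starting.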
Step 3: Test Health Endpoints
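A sketch of a probe script, assuming Node 18+ (global `fetch`) and a `HEALTH_URL` environment variable of your choosing:

```javascript
// Only HTTP 200 plus status "healthy" counts as a pass; anything
// else is degraded, and a network failure is unreachable.
function classify(statusCode, body) {
  if (statusCode === 200 && body && body.status === 'healthy') return 'healthy';
  return 'degraded';
}

async function probe(url) {
  try {
    const res = await fetch(url); // global fetch, Node 18+
    const body = await res.json();
    return { verdict: classify(res.status, body), body };
  } catch {
    return { verdict: 'unreachable', body: null };
  }
}

// Live usage: HEALTH_URL is an assumption, e.g. a Railway service URL.
if (process.env.HEALTH_URL) {
  probe(process.env.HEALTH_URL).then((r) => console.log(r.verdict, r.body));
}
```

Run it against each deployed service after every configuration change, not just once at setup time.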
Procedure 2: Set Up Metrics Collection
Step 1: Instrument Application Code
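A sketch of lightweight in-process instrumentation for a Node service (the `instrument` wrapper and metric names are illustrative; a production setup would more likely use a metrics client library):

```javascript
// In-process store: counters plus raw latency samples per metric name.
const metrics = { counters: {}, durations: {} };

function inc(name, by = 1) {
  metrics.counters[name] = (metrics.counters[name] || 0) + by;
}

function observe(name, ms) {
  (metrics.durations[name] = metrics.durations[name] || []).push(ms);
}

// Wrap an async handler so every call is counted and timed,
// splitting successes from errors.
function instrument(name, fn) {
  return async (...args) => {
    const start = Date.now();
    try {
      const out = await fn(...args);
      inc(`${name}.success`);
      return out;
    } catch (err) {
      inc(`${name}.error`);
      throw err;
    } finally {
      observe(`${name}.duration_ms`, Date.now() - start);
    }
  };
}
```

Instrumenting at the wrapper level keeps business logic free of metrics calls and guarantees every execution path (including failures) is recorded.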
Step 2: Create Metrics Collection Workflow
Step 3: Query Metrics
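Querying collected metrics usually means summarizing latency samples. A sketch using the nearest-rank percentile definition (helper names are illustrative):

```javascript
// Nearest-rank percentile: report p50/p95/p99 rather than averages,
// which hide tail latency.
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

function summarize(samples) {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
  };
}
```

Nearest-rank is the simplest percentile definition; monitoring backends often interpolate, so their values can differ slightly from this sketch.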
Procedure 3: Configure Alerting Rules
Step 1: Define Alert Thresholds
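Thresholds are easiest to manage as data. A sketch (the rule names and values are illustrative examples, not SO1 policy):

```javascript
// One rule per alert: which metric, which direction, and how severe.
const alertRules = [
  { name: 'high_error_rate', metric: 'error_rate', op: '>', value: 0.05, severity: 'critical' },
  { name: 'slow_responses', metric: 'latency_p95_ms', op: '>', value: 2000, severity: 'warning' },
  { name: 'low_success_rate', metric: 'agent_success_rate', op: '<', value: 0.9, severity: 'warning' },
];

// True when the observed value breaches the rule's threshold.
function breached(rule, observed) {
  return rule.op === '>' ? observed > rule.value : observed < rule.value;
}
```

Keeping rules as data means thresholds can be tuned (per the alert-fatigue guidance under Best Practices) without touching evaluation logic.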
Step 2: Create Alert Workflow
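The workflow itself lives in n8n, but the deduplication check it needs can be sketched as JavaScript, e.g. inside an n8n Code node (`COOLDOWN_MS` and the 15-minute window are illustrative):

```javascript
// Suppress repeats of the same alert within a cooldown window so a
// flapping metric doesn't re-fire on every evaluation cycle.
const COOLDOWN_MS = 15 * 60 * 1000; // illustrative window
const lastFired = new Map();

function shouldFire(alertName, now = Date.now()) {
  const last = lastFired.get(alertName);
  if (last !== undefined && now - last < COOLDOWN_MS) return false;
  lastFired.set(alertName, now);
  return true;
}
```

Without this check, one sustained incident produces a stream of identical Slack messages (the Duplicate Alerts failure mode in the troubleshooting table).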
Step 3: Test Alerting
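To exercise the full path end to end, send a synthetic alert into the Slack channel. A sketch assuming a standard Slack incoming webhook (`SLACK_WEBHOOK_URL` is an assumption, not a documented variable):

```javascript
// Build the Slack message payload for an alert.
function formatAlert(name, severity, detail) {
  return { text: `[${severity.toUpperCase()}] ${name}: ${detail}` };
}

// Post a clearly-labeled test alert to a Slack incoming webhook.
async function sendTestAlert(webhookUrl) {
  const payload = formatAlert('alert_pipeline_check', 'info', 'test alert, safe to ignore');
  const res = await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return res.ok;
}

if (process.env.SLACK_WEBHOOK_URL) {
  sendTestAlert(process.env.SLACK_WEBHOOK_URL).then((ok) => console.log('delivered:', ok));
}
```

Label test alerts unmistakably so on-call operators can ignore them without checking dashboards.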
Procedure 4: Analyze Logs
Step 1: Access Service Logs
Step 2: Common Log Patterns
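One recurring pattern, estimating an error rate from raw log lines, can be sketched as follows (the same check you would otherwise do by piping `railway logs` through `grep`; the level names are illustrative):

```javascript
// Fraction of log lines at ERROR or FATAL level.
function errorRate(lines) {
  if (lines.length === 0) return 0;
  const errors = lines.filter((l) => /\b(ERROR|FATAL)\b/.test(l)).length;
  return errors / lines.length;
}
```

Compare the result against your alert threshold over a fixed window; a spot check on one minute of logs is a quick sanity test that alerting thresholds match reality.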
Error Rate Analysis:

Step 3: Set Up Log Aggregation
For production systems, use centralized logging (for example, a Railway log drain into your log aggregator):

Procedure 5: Create Monitoring Dashboard
Step 1: Define Dashboard Metrics
Key metrics to display:

Service Health
- Uptime percentage
- Health check status
- Error rate
- Response time (p50, p95, p99)
Agent Performance
- Executions per minute
- Success rate
- Average duration
- Failed agents breakdown
Infrastructure
- CPU usage
- Memory usage
- Database connections
- Network throughput
Business Metrics
- Workflows created
- Active users
- API requests
- Feature usage
Step 2: Create Dashboard Configuration
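A sketch of a dashboard definition grouping the metrics above (panel names, metric keys, and threshold values are illustrative; translate into your Grafana or DataDog provisioning format):

```javascript
// Declarative dashboard layout: one panel per key metric, with
// thresholds attached where color-coding matters.
const dashboard = {
  title: 'SO1 Platform Health',
  refreshSeconds: 60,
  panels: [
    { title: 'Uptime %', metric: 'service.uptime', type: 'stat' },
    { title: 'Error rate', metric: 'service.error_rate', type: 'timeseries',
      thresholds: { warning: 0.01, critical: 0.05 } },
    { title: 'Latency p50/p95/p99', metric: 'service.latency_ms', type: 'timeseries' },
    { title: 'Agent success rate', metric: 'agent.success_rate', type: 'timeseries' },
    { title: 'Memory usage', metric: 'infra.memory_bytes', type: 'timeseries' },
  ],
};
```

Keeping the layout declarative makes it easy to review in code review and to keep staging and production dashboards in sync.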
Step 3: Access Dashboard
Verification Checklist
After setting up monitoring and alerting, verify:

- `/health` endpoints return 200 for every service
- Railway marks all services healthy
- Metrics collection workflow is active and data reaches the dashboard
- A test alert fires and is delivered to the Slack alert channel
- Service logs are visible via `railway logs`
- Dashboard loads and displays current data

Troubleshooting
| Issue | Symptoms | Root Cause | Resolution |
|---|---|---|---|
| Health Check Failing | Railway marks service unhealthy | Service not responding within the health check timeout | Increase timeout, verify /health endpoint works, check service logs |
| Missing Metrics | Dashboard shows no data | Metrics collection workflow disabled | Check n8n workflow active status, verify metrics endpoint |
| Alerts Not Firing | No alerts despite high error rate | Alert threshold too high, rule disabled | Review alert configuration, test with lower threshold |
| Duplicate Alerts | Same alert firing repeatedly | No alert deduplication | Implement alert grouping, add cooldown period |
| Logs Not Showing | railway logs returns nothing | Service not writing to stdout | Update app logging to console.log, check Railway log drain |
| High Memory Usage | Service restarting frequently | Memory leak, insufficient resources | Analyze heap dumps, increase memory allocation |
| Slow Dashboard | Dashboard takes >10s to load | Too many metrics queries | Reduce query frequency, add caching, optimize queries |
Detailed Troubleshooting: Health Check Failing
If Railway marks a service unhealthy:
- Hit the endpoint directly: curl the service's /health URL and inspect the status code and body
- Check service logs for startup errors or crashed dependencies
- If the service is simply slow to boot, increase the health check timeout
- If a dependency (database, cache) is down, fix it first; a component-level health response will show which one is degraded
Related Resources
Incident Response Runbook
Incident detection and response procedures
Deployment Runbook
Deployment procedures and health checks
DevOps Runbook
Railway operations and infrastructure
Backup & Recovery Runbook
Data backup and disaster recovery
Best Practices
Health Checks
- Keep checks fast: Health checks should complete in <1s
- Check critical dependencies: Database, cache, external APIs
- Return meaningful status: Include component-level status
- Use proper status codes: 200 (healthy), 503 (degraded)
- Version your health checks: Include app version in response
Metrics Collection
- Collect at service boundaries: API requests, agent executions, DB queries
- Use percentiles, not averages: p50, p95, p99 for latency
- Tag metrics appropriately: service, environment, agent_id
- Don’t over-collect: Balance visibility with storage costs
- Aggregate over time: Reduce granularity for historical data
Alerting
- Alert on symptoms, not causes: High error rate (symptom) vs. disk full (cause)
- Set appropriate thresholds: Avoid alert fatigue from too many false positives
- Use alert severity levels: critical, warning, info
- Include actionable information: Link to runbooks, dashboards
- Test alerts regularly: Monthly test of critical alert paths
Dashboard Design
- Start with service health: Show overall system status prominently
- Group related metrics: Service health, infrastructure, business metrics
- Use consistent colors: Green (good), yellow (warning), red (critical)
- Add context: Thresholds, baselines, trends
- Make it actionable: Link to logs, traces, related dashboards