Runbook Writer
Site reliability engineer creating operational runbooks for incident response, troubleshooting, and maintenance.
Quick Reference
| Property | Value |
|---|---|
| Domain | Documentation |
| FORGE Stage | 5 (VERIFY) |
| Version | 1.0.0 |
| Output Types | Runbook markdown |
Overview
Use this agent when you need to:
- Create incident response procedures for alerts
- Document troubleshooting steps for common issues
- Write maintenance procedure guides
- Define escalation paths and criteria
- Create copy-pasteable command references
- Validate operational readiness
The Runbook Writer creates step-by-step operational procedures that enable teams to respond effectively to incidents and perform maintenance safely.
Core Capabilities
- Alert Runbooks: Create response procedures for each alert type
- Troubleshooting Guides: Document diagnostic and resolution steps
- Maintenance Procedures: Write step-by-step maintenance instructions
- Escalation Documentation: Define when and how to escalate issues
When to Use
- Alert rules are defined and need response procedures
- Common failure modes have been identified
- Maintenance tasks need standardization
- Escalation paths need documentation
- On-call engineers need operational guidance
- Post-incident reviews identify procedure gaps
Usage Examples
Incident Runbook
Maintenance Runbook
Troubleshooting Guide
Response procedure for high API error rate alert:
# Runbook: API High Error Rate
## Overview
Addresses `api-error-rate-high` alert (5xx rate > 1% for 5 minutes)
## Severity
- **Level**: P2 (High)
- **Impact**: Users experiencing API failures, workflows may fail
- **SLO Impact**: Availability SLO (99.9% target)
- **Response Time**: 15 minutes
## Detection
**Alert**: `api-error-rate-high`
**Dashboard**: [API Health](https://grafana.so1.io/d/api-health)
**Threshold**: 5xx rate > 1% for 5 minutes
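The threshold is a ratio of 5xx responses to all responses in the window. The following is a minimal sketch of that arithmetic, assuming the two counts have already been read off the dashboard or logs. The function names `error_rate_pct` and `breaches_threshold` are hypothetical helpers, not part of any tool referenced in this runbook.

```bash
# Hypothetical helpers (assumption): compute the 5xx rate and compare it
# against the alert threshold (1% by default).

# error_rate_pct TOTAL_REQUESTS ERROR_COUNT -> percentage with two decimals
error_rate_pct() {
  awk -v total="$1" -v errors="$2" 'BEGIN {
    if (total <= 0) { print "0.00"; exit }
    printf "%.2f\n", (errors / total) * 100
  }'
}

# breaches_threshold TOTAL ERRORS [THRESHOLD_PCT] -> exit 0 if rate > threshold
breaches_threshold() {
  awk -v total="$1" -v errors="$2" -v limit="${3:-1}" 'BEGIN {
    if (total <= 0) exit 1
    # Cross-multiply instead of dividing to avoid rounding at the boundary
    exit (errors * 100 > limit * total) ? 0 : 1
  }'
}
```

For example, 180 errors out of 12000 requests gives `error_rate_pct 12000 180` = `1.50`, which is above the 1% threshold.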
## Quick Assessment (2 minutes)
1. Check system health:
```bash
curl -s https://api.so1.io/health | jq .
```
Expected: `{"status": "healthy"}`
2. Check error rate trend in dashboard
3. Check recent deployments:
```bash
railway logs --service api | grep -i "deploy"
```
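The health check in step 1 can be turned into a pass/fail test for scripting. This is a rough sketch that assumes the response body contains a `status` field exactly as shown in the expected output; `is_healthy` is a hypothetical helper name, and the pattern match is a simplification rather than a full JSON parse.

```bash
# is_healthy (hypothetical): reads the /health response body on stdin and
# exits 0 only when the body contains "status": "healthy".
is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}
```

Usage: `curl -s https://api.so1.io/health | is_healthy && echo "OK" || echo "UNHEALTHY"`.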
## Diagnosis
### Step 1: Identify Error Pattern
```bash
railway logs --service api --limit 100 | grep -E "ERROR|5[0-9]{2}"
```
Look for:
- Single endpoint vs. all endpoints
- Specific error messages
- Request correlation
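To tell a single failing endpoint from a service-wide failure, it can help to tally 5xx responses per path. The sketch below assumes a whitespace-separated log line of the form `<timestamp> <method> <path> <status>`; that format is an assumption, not the documented format of the API logs, so adjust the field numbers to match real output. `count_5xx_by_path` is a hypothetical helper name.

```bash
# count_5xx_by_path (hypothetical): reads log lines on stdin and prints
# "<count> <path>" for every path that returned a 5xx, highest count first.
# Assumed line format: <timestamp> <method> <path> <status>
count_5xx_by_path() {
  awk '$4 ~ /^5[0-9][0-9]$/ { hits[$3]++ }
       END { for (p in hits) print hits[p], p }' | sort -rn
}
```

Usage: `railway logs --service api | count_5xx_by_path`. One path dominating the output points at an endpoint-specific bug; an even spread suggests a shared dependency.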
### Step 2: Check Dependencies
**Database**:
```bash
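railway run --service api -- node -e "
const { db } = require('./dist/db');
// Exit non-zero on failure so the check is visible and scriptable
db.execute('SELECT 1')
  .then(() => { console.log('DB OK'); process.exit(0); })
  .catch((err) => { console.error('DB FAIL:', err.message); process.exit(1); });
"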
```
**Redis**:
```bash
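railway run --service api -- node -e "
const redis = require('./dist/redis').default;
// Exit non-zero on failure so the check is visible and scriptable
redis.ping()
  .then(() => { console.log('Redis OK'); process.exit(0); })
  .catch((err) => { console.error('Redis FAIL:', err.message); process.exit(1); });
"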
```
### Step 3: Check Resources
```bash
railway status --service api
railway logs --service api | grep -iE "oom|killed"
```
## Resolution
### Scenario A: Database Issues
Symptoms: "connection refused", "timeout" errors
1. Check DB status: `railway status --service so1-db`
2. Restart if needed: `railway restart --service so1-db`
3. Monitor recovery on dashboard
### Scenario B: Memory Issues (OOM)
Symptoms: "Killed" in logs, 100% memory
1. Restart: `railway restart --service api`
2. Scale memory: Railway dashboard → Resources → 1GB
3. Create incident for investigation
### Scenario C: Bad Deployment
Symptoms: Errors started after deployment
1. List deployments: `railway deployments --service api`
2. Rollback: `railway rollback --service api --to <id>`
3. Verify: `curl https://api.so1.io/health`
## Escalation
### Escalate to P1 if:
- Error rate > 10% for 5+ minutes
- Complete service outage
- Unable to diagnose in 30 minutes
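The three criteria above can be written as a single decision function, which is useful when rehearsing the runbook in a tabletop exercise. This is a sketch only: `should_escalate_p1` and its argument order are hypothetical, and the inputs are assumed to be supplied by the responder.

```bash
# should_escalate_p1 (hypothetical)
#   $1 error rate in percent, $2 minutes the rate has been sustained,
#   $3 minutes spent diagnosing, $4 1 for a complete outage (default 0)
# Exits 0 and prints the reason when any P1 criterion is met.
should_escalate_p1() {
  local rate="$1" sustained="$2" diagnosing="$3" outage="${4:-0}"
  if [ "$outage" = "1" ]; then
    echo "escalate: complete service outage"; return 0
  fi
  if awk -v r="$rate" -v m="$sustained" 'BEGIN { exit (r > 10 && m >= 5) ? 0 : 1 }'; then
    echo "escalate: error rate above 10% for 5+ minutes"; return 0
  fi
  if [ "$diagnosing" -ge 30 ]; then
    echo "escalate: undiagnosed after 30 minutes"; return 0
  fi
  echo "hold: stay at P2"; return 1
}
```

For example, `should_escalate_p1 12.5 6 10` escalates on error rate, while `should_escalate_p1 3 20 10` holds at P2.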
### Contacts:
1. On-call engineer: PagerDuty schedule
2. Backend lead: @backend-lead (Slack)
3. Infrastructure: @infra-team (Slack)
## Communication
**Slack #incidents**:
```
🔴 INCIDENT: API Error Rate High
Status: Investigating
Impact: Some API requests failing
ETA: Investigating
```
**Status Page**:
```
Investigating - API Performance Issues
Some requests may fail. Updates every 15 minutes.
```
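Follow-up messages tend to drift from the template under pressure. A small formatter keeps each update in the same shape as the Slack template above. `incident_update` is a hypothetical helper; it only prints the text and makes no assumption about how the message is posted.

```bash
# incident_update (hypothetical): TITLE STATUS IMPACT [ETA]
# Prints a Slack-ready update in the same layout as the template.
incident_update() {
  local title="$1" status="$2" impact="$3" eta="${4:-Investigating}"
  printf '🔴 INCIDENT: %s\nStatus: %s\nImpact: %s\nETA: %s\n' \
    "$title" "$status" "$impact" "$eta"
}
```

Usage: `incident_update "API Error Rate High" "Mitigating" "Some API requests failing" "15 minutes"`.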
## Post-Incident
1. Document timeline in ticket
2. Identify root cause
3. Create action items
4. Schedule post-mortem if P1/P2
## Prevention
- Implement circuit breakers
- Add connection pool monitoring
- Set up anomaly detection
- Regular load testing
Result: Complete incident response procedure with commands and escalation.
Scheduled database maintenance procedure:
# Runbook: PostgreSQL Maintenance
## Overview
Regular maintenance for SO1 PostgreSQL database
## Schedule
- **VACUUM ANALYZE**: Daily 03:00 UTC (automated)
- **REINDEX**: Weekly Sunday 04:00 UTC
- **Backup verification**: Weekly
## Pre-Maintenance Checklist
- [ ] No critical workflows scheduled
- [ ] Check database size and growth
- [ ] Ensure recent backup exists
- [ ] Notify team in #ops
## Procedures
### 1. Verify Automated Vacuum
```sql
-- Connect to the database first (run this in the shell, not in psql):
--   railway connect --service so1-db
-- Check last vacuum time
SELECT schemaname, relname, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY last_autovacuum ASC NULLS FIRST  -- stalest (or never-vacuumed) tables first
LIMIT 10;
```
Expected: `last_autovacuum` within 24 hours
### 2. Manual VACUUM (if needed)
```sql
-- Specific table
VACUUM ANALYZE workflows;
-- Full database
VACUUM ANALYZE;
```
### 3. Check Index Health
```sql
SELECT
schemaname || '.' || relname AS table,
indexrelname AS index,
pg_size_pretty(pg_relation_size(indexrelid)) AS size,
idx_scan AS scans
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```
### 4. Reindex (if needed)
```sql
-- Non-blocking (PG 12+)
REINDEX TABLE CONCURRENTLY workflows;
```
### 5. Verify Backups
```bash
railway backups --service so1-db --limit 5
railway backup-verify --service so1-db --latest
```
## Rollback
If issues occur:
1. Stop maintenance
2. Check locks: `SELECT pid, state FROM pg_stat_activity`
3. Cancel queries: `SELECT pg_cancel_backend(<pid>)`
## Success Criteria
- [ ] VACUUM completed without errors
- [ ] No tables with bloat > 20%
- [ ] Index scans healthy
- [ ] Backup verified
- [ ] No query latency increase
Result: Safe, repeatable maintenance procedure with verification steps.
Common issues and resolution steps:
# Troubleshooting: Workflow Execution Failures
## Common Issues
### Issue 1: Workflow Not Triggering
**Symptoms**: Scheduled workflow not executing
**Diagnosis**:
1. Check workflow status:
```bash
curl https://api.so1.io/api/v1/workflows/<id> \
-H "Authorization: Bearer <token>"
```
2. Verify cron expression at [crontab.guru](https://crontab.guru)
3. Check timezone configuration
**Resolution**:
- Ensure status is "active", not "draft"
- Fix invalid cron expression
- Verify timezone matches expectation
### Issue 2: n8n Connection Timeout
**Symptoms**: "ECONNREFUSED" or "timeout" errors
**Diagnosis**:
```bash
# Check n8n health
curl https://n8n.so1.io/healthz
# Check API logs
railway logs --service api | grep n8n
```
**Resolution**:
1. If n8n down: `railway restart --service n8n`
2. If n8n slow: Check n8n resource usage
3. If persistent: Enable circuit breaker
### Issue 3: Webhook Not Received
**Symptoms**: Webhook-triggered workflow not executing
**Diagnosis**:
1. Check webhook logs:
```bash
railway logs --service api | grep webhook
```
2. Verify webhook URL is correct
3. Check webhook signature validation
**Resolution**:
- Update webhook URL in source system
- Regenerate webhook secret if invalid
- Check firewall/network rules
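When checking signature validation, it helps to recompute the signature locally and compare it with what the sender transmitted. The sketch below assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body; that scheme is an assumption, since this guide does not specify how webhooks are signed. `sign_body` and `verify_signature` are hypothetical helper names.

```bash
# Assumption: signature = hex(HMAC-SHA256(secret, raw request body)).

# sign_body SECRET  (raw body on stdin) -> prints the hex digest
sign_body() {
  # openssl prefixes the digest with a label; keep only the last field
  openssl dgst -sha256 -hmac "$1" | awk '{ print $NF }'
}

# verify_signature SECRET EXPECTED_HEX  (raw body on stdin) -> exit 0 on match
verify_signature() {
  local actual
  actual="$(sign_body "$1")"
  [ "$actual" = "$2" ]
}
```

Usage: `verify_signature "$WEBHOOK_SECRET" "$received_sig" < body.json`, where both variable names and the file name are placeholders. Passing the secret as an argument exposes it to the local process list, and the comparison is not constant-time, so treat this as a diagnostic aid rather than production verification.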
## Getting Help
If troubleshooting doesn't resolve the issue:
1. Gather diagnostic output
2. Create support ticket with details
3. Post in #engineering Slack for urgent issues
Result: Quick reference for common troubleshooting scenarios.
Outputs
Runbook Structure
All runbooks follow this standard format:
# Runbook: [Alert/Procedure Name]
## Overview
Brief description
## Severity (for incidents)
- Level: P1/P2/P3/P4
- Impact: User-facing impact
- SLO Impact: Affected SLOs
## Detection
How issue is detected
## Quick Assessment
2-minute triage steps
## Diagnosis
Step-by-step investigation
## Resolution
Step-by-step fix procedures
## Escalation
When and how to escalate
## Communication
Slack/status page templates
## Post-Incident
Follow-up actions
## Prevention
How to prevent recurrence
Command Standards
- Copy-pasteable: No manual substitution needed
- Expected output: Show what success looks like
- Environment variables: Use for secrets/config
- Multiple options: Provide Railway CLI and direct commands
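One way to reconcile "copy-pasteable" with "environment variables for secrets" is to have each command fail fast when a required variable is unset, instead of embedding placeholders. The sketch below is illustrative: `require_env`, `workflow_status`, `API_TOKEN`, and `WORKFLOW_ID` are hypothetical names, not conventions defined by this agent.

```bash
# require_env NAME... (hypothetical): exit 1 and name the first variable
# that is unset or empty.
require_env() {
  local name value
  for name in "$@"; do
    # Indirect lookup; names are supplied by the runbook author, not by users
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      return 1
    fi
  done
  return 0
}

# workflow_status (hypothetical): the workflow lookup from the
# troubleshooting guide, with the token and id taken from the environment.
workflow_status() {
  require_env API_TOKEN WORKFLOW_ID || return 1
  curl -s "https://api.so1.io/api/v1/workflows/${WORKFLOW_ID}" \
    -H "Authorization: Bearer ${API_TOKEN}"
}
```

With this pattern the secret never appears in the runbook text or in shell history, and a missing variable produces a clear error instead of a malformed request.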
FORGE Gate Compliance
Entry Gates
- System architecture documented: Complete understanding of system components and dependencies.
- Common failure modes identified: Known issues and their symptoms documented.
- Monitoring and alerting in place: Alerts defined with dashboards and log aggregation.
Exit Gates
- Runbooks created for each alert: Every critical alert has an associated runbook.
- Commands are copy-pasteable: All commands work as written without modification.
- Clear criteria and contacts for escalation.
- Validated via tabletop exercises or dry runs.
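The first exit gate (every critical alert has a runbook) can be checked mechanically. The sketch below assumes one Markdown file per alert, named `<alert-name>.md`, in a single directory; that layout is an assumption for illustration, and `missing_runbooks` is a hypothetical helper.

```bash
# missing_runbooks RUNBOOK_DIR ALERT_NAME... (hypothetical)
# Prints each alert without a <RUNBOOK_DIR>/<alert>.md file.
# Exits 1 if any runbook is missing, 0 otherwise.
missing_runbooks() {
  local dir="$1" alert missing=0
  shift
  for alert in "$@"; do
    if [ ! -f "$dir/$alert.md" ]; then
      echo "$alert"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: `missing_runbooks ./runbooks api-error-rate-high`, where `./runbooks` is a placeholder path. A non-zero exit makes the check usable as a CI step.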
| Agent | Relationship |
|---|---|
| Incident Commander | Uses runbooks during active incidents |
| Pipeline Auditor | Provides system health context |
| Railway Deployer | Source for deployment procedures |
Source Files
Repository: so1-io/so1-agents
Path: agents/documentation/runbook-writer.md
Version: 1.0.0