Runbook Writer

A site reliability engineer agent that creates operational runbooks for incident response, troubleshooting, and maintenance.

Quick Reference

| Property | Value |
|---|---|
| Domain | Documentation |
| FORGE Stage | 5 (VERIFY) |
| Version | 1.0.0 |
| Output Types | Runbook markdown |

Overview

Use this agent when you need to:
  • Create incident response procedures for alerts
  • Document troubleshooting steps for common issues
  • Write maintenance procedure guides
  • Define escalation paths and criteria
  • Create copy-pasteable command references
  • Validate operational readiness

The Runbook Writer creates step-by-step operational procedures that enable teams to respond effectively to incidents and perform maintenance safely.

Core Capabilities

Alert Runbooks

Create response procedures for each alert type

Troubleshooting Guides

Document diagnostic and resolution steps

Maintenance Procedures

Write step-by-step maintenance instructions

Escalation Documentation

Define when and how to escalate issues

When to Use

  • Alert rules are defined but have no response procedures
  • Common failure modes have been identified
  • Maintenance tasks need standardization
  • Escalation paths need documentation
  • On-call engineers need operational guidance
  • Post-incident reviews have identified procedure gaps

Usage Examples

Response procedure for high API error rate alert:
# Runbook: API High Error Rate

## Overview
Addresses `api-error-rate-high` alert (5xx rate > 1% for 5 minutes)

## Severity
- **Level**: P2 (High)
- **Impact**: Users experiencing API failures, workflows may fail
- **SLO Impact**: Availability SLO (99.9% target)
- **Response Time**: 15 minutes

## Detection
**Alert**: `api-error-rate-high`
**Dashboard**: [API Health](https://grafana.so1.io/d/api-health)
**Threshold**: 5xx rate > 1% for 5 minutes

## Quick Assessment (2 minutes)

1. Check system health:
   ```bash
   curl -s https://api.so1.io/health | jq .
   ```
   Expected: `{"status": "healthy"}`

2. Check error rate trend in dashboard
3. Check recent deployments:
   ```bash
   railway logs --service api | grep -i "deploy"
   ```

## Diagnosis

### Step 1: Identify Error Pattern
```bash
railway logs --service api --limit 100 | grep -E "ERROR|5[0-9]{2}"
```

Look for:
- Single endpoint vs. all endpoints
- Specific error messages
- Request correlation

### Step 2: Check Dependencies

**Database**:
```bash
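railway run --service api -- node -e "
  const { db } = require('./dist/db');
  // Exit explicitly so the check neither hangs on an open connection
  // nor fails silently when the query is rejected.
  db.execute('SELECT 1')
    .then(() => { console.log('DB OK'); process.exit(0); })
    .catch((err) => { console.error('DB FAIL:', err.message); process.exit(1); });
"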
```

**Redis**:
```bash
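railway run --service api -- node -e "
  const redis = require('./dist/redis').default;
  // Exit explicitly so the check neither hangs on an open connection
  // nor fails silently when the ping is rejected.
  redis.ping()
    .then(() => { console.log('Redis OK'); process.exit(0); })
    .catch((err) => { console.error('Redis FAIL:', err.message); process.exit(1); });
"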
```

### Step 3: Check Resources
```bash
railway status --service api
railway logs --service api | grep -iE "oom|killed"
```

## Resolution

### Scenario A: Database Issues
Symptoms: "connection refused", "timeout" errors

1. Check DB status: `railway status --service so1-db`
2. Restart if needed: `railway restart --service so1-db`
3. Monitor recovery on dashboard

### Scenario B: Memory Issues (OOM)
Symptoms: "Killed" in logs, 100% memory

1. Restart: `railway restart --service api`
2. Scale memory: Railway dashboard → Resources → 1GB
3. Create incident for investigation

### Scenario C: Bad Deployment
Symptoms: Errors started after deployment

1. List deployments: `railway deployments --service api`
2. Rollback: `railway rollback --service api --to <id>`
3. Verify: `curl https://api.so1.io/health`

## Escalation

### Escalate to P1 if:
- Error rate > 10% for 5+ minutes
- Complete service outage
- Unable to diagnose in 30 minutes

### Contacts:
1. On-call engineer: PagerDuty schedule
2. Backend lead: @backend-lead (Slack)
3. Infrastructure: @infra-team (Slack)

## Communication

**Slack #incidents**:
```
🔴 INCIDENT: API Error Rate High
Status: Investigating
Impact: Some API requests failing
ETA: Unknown, next update in 15 minutes
```

**Status Page**:
```
Investigating - API Performance Issues
Some requests may fail. Updates every 15 minutes.
```

## Post-Incident
1. Document timeline in ticket
2. Identify root cause
3. Create action items
4. Schedule post-mortem if P1/P2

## Prevention
- Implement circuit breakers
- Add connection pool monitoring
- Set up anomaly detection
- Regular load testing

Result: A complete incident response procedure with copy-pasteable commands and escalation criteria.
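
The rate thresholds in the example runbook (alert when the 5xx rate exceeds 1% for 5 minutes, escalate to P1 when it exceeds 10% for 5+ minutes) can be sketched as a small triage helper. This is an illustration only: the function name `classify_error_rate`, its arguments, and the `NO_ALERT` label are hypothetical, and it covers only the rate-based criteria, not a complete outage or the 30-minute diagnosis limit.

```bash
#!/usr/bin/env bash
# Hypothetical triage helper: maps a 5xx error rate (percent) and how long it
# has persisted (minutes) to the severity levels used in the example runbook.
# Usage: classify_error_rate <rate_percent> <duration_minutes>
classify_error_rate() {
  local rate="$1" minutes="$2"
  # awk performs the floating-point comparison that shell arithmetic cannot.
  awk -v rate="$rate" -v minutes="$minutes" 'BEGIN {
    if (minutes >= 5 && rate > 10)     print "P1"        # escalation threshold
    else if (minutes >= 5 && rate > 1) print "P2"        # alert threshold
    else                               print "NO_ALERT"  # below both thresholds
  }'
}
```

For example, `classify_error_rate 2.5 5` prints `P2`, and `classify_error_rate 12 6` prints `P1`.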

Outputs

Runbook Structure

All runbooks follow this standard format:
# Runbook: [Alert/Procedure Name]

## Overview
Brief description

## Severity (for incidents)
- Level: P1/P2/P3/P4
- Impact: User-facing impact
- SLO Impact: Affected SLOs

## Detection
How issue is detected

## Quick Assessment
2-minute triage steps

## Diagnosis
Step-by-step investigation

## Resolution
Step-by-step fix procedures

## Escalation
When and how to escalate

## Communication
Slack/status page templates

## Post-Incident
Follow-up actions

## Prevention
How to prevent recurrence
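
A structure check along these lines could confirm that a runbook contains every section of the standard format. This is a minimal sketch, assuming an incident runbook (so the Severity section is required) and second-level `##` headings; the function name `check_runbook_sections` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical structure check: reports any standard section heading missing
# from a runbook markdown file. Returns 0 when all sections are present.
# Usage: check_runbook_sections <runbook.md>
check_runbook_sections() {
  local file="$1" missing=0 section
  local sections=(
    "Overview" "Severity" "Detection" "Quick Assessment" "Diagnosis"
    "Resolution" "Escalation" "Communication" "Post-Incident" "Prevention"
  )
  for section in "${sections[@]}"; do
    # Match "## <Section>" at the start of a line; trailing text such as
    # "(2 minutes)" or "(for incidents)" is allowed.
    if ! grep -qE "^## ${section}( |\$)" "$file"; then
      echo "missing section: ${section}"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_runbook_sections` against a runbook file prints one line per missing section and returns a non-zero status if any are absent, which makes it usable as a pre-merge check.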

Command Standards

  • Copy-pasteable: No manual substitution needed
  • Expected output: Show what success looks like
  • Environment variables: Use for secrets/config
  • Multiple options: Provide Railway CLI and direct commands
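
The copy-pasteable standard can be checked mechanically. The sketch below flags angle-bracket placeholders such as `<id>`, which force the reader to substitute a value by hand. The function name `find_placeholders` is hypothetical, and the pattern is a rough heuristic: it would also flag plain HTML tags such as `<br>`.

```bash
#!/usr/bin/env bash
# Hypothetical lint for the "copy-pasteable" standard: lists lines containing
# angle-bracket placeholders such as <id> or <service-name>.
# Returns 0 when none are found, 1 otherwise.
# Usage: find_placeholders <runbook.md>
find_placeholders() {
  local file="$1"
  # -n prefixes each hit with its line number. grep exits 0 on a match,
  # so the status is inverted here: a match means the check fails.
  if grep -nE '<[A-Za-z][A-Za-z0-9_-]*>' "$file"; then
    return 1
  fi
  return 0
}
```

A flagged command can often be rewritten to read the value from an environment variable, which also satisfies the environment-variable standard above.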

FORGE Gate Compliance

Entry Gates

  • System components and dependencies are fully understood.
  • Known issues and their symptoms are documented.
  • Alerts are defined, with dashboards and log aggregation in place.

Exit Gates

  • Every critical alert has an associated runbook.
  • All commands work as written, without modification.
  • Escalation criteria and contacts are clearly defined.
  • Runbooks are validated via tabletop exercises or dry runs.
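
The first exit gate lends itself to an automated check. The sketch below assumes a hypothetical layout in which each alert's runbook lives at `<runbook_dir>/<alert-name>.md`; the function name `check_alert_coverage` and that naming convention are illustrative, not taken from the repository.

```bash
#!/usr/bin/env bash
# Hypothetical coverage check for the first exit gate: every alert name passed
# in must have a matching runbook file at <runbook_dir>/<alert-name>.md.
# Returns 0 when every alert is covered, 1 otherwise.
# Usage: check_alert_coverage <runbook_dir> <alert-name>...
check_alert_coverage() {
  local dir="$1" uncovered=0 alert
  shift
  for alert in "$@"; do
    if [ ! -f "${dir}/${alert}.md" ]; then
      echo "no runbook for alert: ${alert}"
      uncovered=1
    fi
  done
  return "$uncovered"
}
```

For example, `check_alert_coverage runbooks api-error-rate-high` succeeds only if `runbooks/api-error-rate-high.md` exists.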

| Agent | Relationship |
|---|---|
| Incident Commander | Uses runbooks during active incidents |
| Pipeline Auditor | Provides system health context |
| Railway Deployer | Source for deployment procedures |

Source Files

View Agent Source

Repository: so1-io/so1-agents
Path: agents/documentation/runbook-writer.md
Version: 1.0.0

Next Steps: