Runbook Writer

A site reliability engineering agent that creates operational runbooks for incident response, troubleshooting, and maintenance.

Quick Reference

Property	Value
Domain	Documentation
FORGE Stage	5 (VERIFY)
Version	1.0.0
Output Types	Runbook markdown

Overview

Use this agent when you need to:

Create incident response procedures for alerts
Document troubleshooting steps for common issues
Write maintenance procedure guides
Define escalation paths and criteria
Create copy-pasteable command references
Validate operational readiness

The Runbook Writer creates step-by-step operational procedures that enable teams to respond effectively to incidents and perform maintenance safely.

Core Capabilities

Alert Runbooks

Create response procedures for each alert type

Troubleshooting Guides

Document diagnostic and resolution steps

Maintenance Procedures

Write step-by-step maintenance instructions

Escalation Documentation

Define when and how to escalate issues

When to Use

Alert rules defined, need response procedures

Common failure modes identified

Maintenance tasks need standardization

Escalation paths need documentation

On-call engineers need operational guidance

Post-incident reviews identify procedure gaps

Usage Examples

Incident Runbook

Response procedure for high API error rate alert:

# Runbook: API High Error Rate

## Overview
Addresses `api-error-rate-high` alert (5xx rate > 1% for 5 minutes)

## Severity
- **Level**: P2 (High)
- **Impact**: Users experiencing API failures, workflows may fail
- **SLO Impact**: Availability SLO (99.9% target)
- **Response Time**: 15 minutes

## Detection
**Alert**: `api-error-rate-high`
**Dashboard**: [API Health](https://grafana.so1.io/d/api-health)
**Threshold**: 5xx rate > 1% for 5 minutes
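
One way to read the threshold is as a sustained-breach check: the alert fires only when every sample in the last five minutes is above 1%. A minimal Python sketch of that reading follows. The function name and the one-sample-per-minute interval are assumptions for illustration, not the actual alert rule.

```python
def is_alert_firing(error_rates, threshold=0.01, window=5):
    """Return True if the last `window` samples all exceed `threshold`.

    Assumes one 5xx-rate sample per minute, newest last (an assumption;
    the real alert rule may aggregate differently).
    """
    if len(error_rates) < window:
        return False  # not enough data to cover the full window
    return all(rate > threshold for rate in error_rates[-window:])
```

Under this reading, a single sample that dips below the threshold resets the five-minute window, so brief spikes do not page anyone.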

## Quick Assessment (2 minutes)

1. Check system health:
   ```bash
   curl -s https://api.so1.io/health | jq .
   ```
   Expected: `{"status": "healthy"}`

2. Check error rate trend in dashboard
3. Check recent deployments:
   ```bash
   railway logs --service api | grep -i "deploy"
   ```

## Diagnosis

### Step 1: Identify Error Pattern
```bash
railway logs --service api --limit 100 | grep -E "ERROR|5[0-9]{2}"
```

Look for:
- Single endpoint vs. all endpoints
- Specific error messages
- Request correlation
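
To tell a single failing endpoint from a service-wide problem, it can help to count 5xx responses per endpoint. The sketch below assumes a hypothetical log line format of `<timestamp> ERROR <METHOD> <path> <status>`; the real API log format may differ, and the regular expression would need adjusting to match it.

```python
import re
from collections import Counter

# Hypothetical log format: "<timestamp> ERROR <METHOD> <path> <status>"
LINE_RE = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE)\s+(\S+)\s+(5\d{2})\b")


def count_5xx_by_endpoint(lines):
    """Count 5xx responses per (method, path) to show whether errors cluster."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            method, path, _status = match.groups()
            counts[(method, path)] += 1
    return counts
```

If one `(method, path)` pair dominates `counts.most_common()`, the problem is likely local to that endpoint; an even spread points at a shared dependency.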

### Step 2: Check Dependencies

**Database**:
```bash
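railway run --service api -- node -e "
  const { db } = require('./dist/db');
  db.execute('SELECT 1')
    .then(() => { console.log('DB OK'); process.exit(0); })
    .catch((err) => { console.error('DB FAIL:', err.message); process.exit(1); });
"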
```

**Redis**:
```bash
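railway run --service api -- node -e "
  const redis = require('./dist/redis').default;
  redis.ping()
    .then(() => { console.log('Redis OK'); process.exit(0); })
    .catch((err) => { console.error('Redis FAIL:', err.message); process.exit(1); });
"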
```

### Step 3: Check Resources
```bash
railway status --service api
railway logs --service api | grep -iE "oom|killed"
```

## Resolution

### Scenario A: Database Issues
Symptoms: "connection refused", "timeout" errors

1. Check DB status: `railway status --service so1-db`
2. Restart if needed: `railway restart --service so1-db`
3. Monitor recovery on dashboard

### Scenario B: Memory Issues (OOM)
Symptoms: "Killed" in logs, 100% memory

1. Restart: `railway restart --service api`
2. Scale memory: Railway dashboard → Resources → 1GB
3. Create incident for investigation

### Scenario C: Bad Deployment
Symptoms: Errors started after deployment

1. List deployments: `railway deployments --service api`
2. Rollback: `railway rollback --service api --to <id>`
3. Verify: `curl https://api.so1.io/health`

## Escalation

### Escalate to P1 if:
- Error rate > 10% for 5+ minutes
- Complete service outage
- Unable to diagnose in 30 minutes
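
The three criteria are independent, so any one of them is enough to escalate. A small Python sketch of that decision is below; the function and parameter names are illustrative assumptions, and the inputs would come from the dashboard and the incident timeline.

```python
def should_escalate_to_p1(minutes_above_10pct, full_outage, minutes_investigating, diagnosed):
    """Apply the three P1 criteria; any single one is enough to escalate."""
    sustained_high_errors = minutes_above_10pct >= 5  # error rate > 10% for 5+ minutes
    stuck = (not diagnosed) and minutes_investigating >= 30  # no diagnosis after 30 minutes
    return bool(sustained_high_errors or full_outage or stuck)
```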

### Contacts:
1. On-call engineer: PagerDuty schedule
2. Backend lead: @backend-lead (Slack)
3. Infrastructure: @infra-team (Slack)

## Communication

**Slack #incidents**:
```
🔴 INCIDENT: API Error Rate High
Status: Investigating
Impact: Some API requests failing
ETA: Investigating
```

**Status Page**:
```
Investigating - API Performance Issues
Some requests may fail. Updates every 15 minutes.
```

## Post-Incident
1. Document timeline in ticket
2. Identify root cause
3. Create action items
4. Schedule post-mortem if P1/P2

## Prevention
- Implement circuit breakers
- Add connection pool monitoring
- Set up anomaly detection
- Regular load testing

Result: Complete incident response procedure with commands and escalation.

Maintenance Runbook

Scheduled database maintenance procedure:

# Runbook: PostgreSQL Maintenance

## Overview
Regular maintenance for SO1 PostgreSQL database

## Schedule
- **VACUUM ANALYZE**: Daily 03:00 UTC (automated)
- **REINDEX**: Weekly Sunday 04:00 UTC
- **Backup verification**: Weekly

## Pre-Maintenance Checklist
- [ ] No critical workflows scheduled
- [ ] Check database size and growth
- [ ] Ensure recent backup exists
- [ ] Notify team in #ops

## Procedures

### 1. Verify Automated Vacuum

```sql
-- Run from a shell first to open a database session:
--   railway connect --service so1-db

-- Check last vacuum time
SELECT schemaname, relname, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY last_autovacuum DESC NULLS LAST
LIMIT 10;
```

Expected: `last_autovacuum` within 24 hours
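
The 24-hour expectation can also be checked programmatically once the rows are fetched. The sketch below is an illustration under assumptions: rows are `(relname, last_vacuum, last_autovacuum)` tuples with timezone-aware timestamps, and either a manual or an automatic vacuum counts as recent. Tighten it to `last_autovacuum` only if that is the intended rule.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(rows, now, max_age=timedelta(hours=24)):
    """Return names of tables whose latest vacuum is older than max_age or missing.

    `rows` are (relname, last_vacuum, last_autovacuum) tuples; timestamps are
    timezone-aware datetimes or None (assumed shape, not a fixed API).
    """
    stale = []
    for relname, last_vacuum, last_autovacuum in rows:
        runs = [t for t in (last_vacuum, last_autovacuum) if t is not None]
        if not runs or now - max(runs) > max_age:
            stale.append(relname)
    return stale
```

Any table returned here is a candidate for the manual VACUUM step.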

### 2. Manual VACUUM (if needed)

```sql
-- Specific table
VACUUM ANALYZE workflows;

-- Full database
VACUUM ANALYZE;
```

### 3. Check Index Health

```sql
-- Indexes with zero scans since statistics were last reset, largest first
SELECT
  schemaname || '.' || relname AS table,
  indexrelname AS index,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size,
  idx_scan AS scans
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

### 4. Reindex (if needed)

```sql
-- Non-blocking (PG 12+)
REINDEX TABLE CONCURRENTLY workflows;
```

### 5. Verify Backups

```bash
railway backups --service so1-db --limit 5
railway backup-verify --service so1-db --latest
```

## Rollback

If issues occur:
1. Stop maintenance
2. Check for blocked or long-running sessions: `SELECT pid, state, wait_event_type, query FROM pg_stat_activity WHERE state <> 'idle'`
3. Cancel queries: `SELECT pg_cancel_backend(<pid>)`

## Success Criteria
- [ ] VACUUM completed without errors
- [ ] No tables with bloat > 20%
- [ ] Index scans healthy
- [ ] Backup verified
- [ ] No query latency increase

Result: Safe, repeatable maintenance procedure with verification steps.

Troubleshooting Guide

Common issues and resolution steps:

# Troubleshooting: Workflow Execution Failures

## Common Issues

### Issue 1: Workflow Not Triggering

**Symptoms**: Scheduled workflow not executing

**Diagnosis**:
1. Check workflow status:
   ```bash
   # WORKFLOW_ID and API_TOKEN are placeholder variable names; export them first
   curl -s "https://api.so1.io/api/v1/workflows/${WORKFLOW_ID}" \
     -H "Authorization: Bearer ${API_TOKEN}"
   ```
2. Verify cron expression at [crontab.guru](https://crontab.guru)
3. Check timezone configuration

**Resolution**:
- Ensure status is "active", not "draft"
- Fix invalid cron expression
- Verify timezone matches expectation
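
A quick local check can catch an obviously invalid cron expression before it reaches the scheduler. The sketch below assumes standard five-field, numeric-only cron syntax (minute, hour, day of month, month, day of week). The scheduler in use may accept extensions such as a seconds field or names like `MON`, which this validator would reject.

```python
# (min, max) per field: minute, hour, day of month, month, day of week
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]


def _valid_part(part, lo, hi):
    """Validate one comma-separated part: '*', 'n', 'a-b', each with optional '/step'."""
    base, _, step = part.partition("/")
    if "/" in part and not (step.isdecimal() and int(step) > 0):
        return False
    if base == "*":
        return True
    start, dash, end = base.partition("-")
    if not start.isdecimal() or (dash and not end.isdecimal()):
        return False
    values = [int(start)] + ([int(end)] if dash else [])
    return all(lo <= v <= hi for v in values) and values == sorted(values)


def is_valid_cron(expr):
    """Return True for a well-formed five-field numeric cron expression."""
    fields = expr.split()
    if len(fields) != 5:
        return False
    return all(
        _valid_part(part, lo, hi)
        for field, (lo, hi) in zip(fields, FIELD_RANGES)
        for part in field.split(",")
    )
```

A passing result only means the expression is syntactically plausible; the timezone check in the list above is still needed to confirm when it fires.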

### Issue 2: n8n Connection Timeout

**Symptoms**: "ECONNREFUSED" or "timeout" errors

**Diagnosis**:
```bash
# Check n8n health
curl https://n8n.so1.io/healthz

# Check API logs
railway logs --service api | grep n8n
```

**Resolution**:
1. If n8n down: `railway restart --service n8n`
2. If n8n slow: Check n8n resource usage
3. If persistent: Enable circuit breaker
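
The runbook does not say how the circuit breaker is implemented, so the following is only a generic sketch of the pattern in Python: after a number of consecutive failures the breaker opens and rejects calls immediately, then allows a trial call once a cooldown has passed. The class name, thresholds, and the injectable `clock` are all illustrative assumptions.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the cooldown can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: skipping call")
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping outbound n8n requests in `breaker.call(...)` would let the API fail fast while n8n is unhealthy instead of waiting on timeouts.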

### Issue 3: Webhook Not Received

**Symptoms**: Webhook-triggered workflow not executing

**Diagnosis**:
1. Check webhook logs:
   ```bash
   railway logs --service api | grep webhook
   ```
2. Verify webhook URL is correct
3. Check webhook signature validation

**Resolution**:
- Update webhook URL in source system
- Regenerate webhook secret if invalid
- Check firewall/network rules
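
When checking signature validation, it helps to recompute the signature by hand and compare. The sketch below assumes the common scheme of a hex-encoded HMAC-SHA256 over the raw request body; the actual header name, encoding, and algorithm used by the SO1 API are not stated here and may differ.

```python
import hashlib
import hmac


def is_valid_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Compare the received hex signature with an HMAC-SHA256 of the raw body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest is a constant-time comparison, which avoids timing leaks
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

A mismatch with a known-good secret usually means the body was re-serialized before verification or the source system is signing with an old secret.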

## Getting Help

If troubleshooting doesn't resolve:
1. Gather diagnostic output
2. Create support ticket with details
3. Post in #engineering Slack for urgent issues

Result: Quick reference for common troubleshooting scenarios.

Outputs

Runbook Structure

All runbooks follow this standard format:

# Runbook: [Alert/Procedure Name]

## Overview
Brief description

## Severity (for incidents)
- Level: P1/P2/P3/P4
- Impact: User-facing impact
- SLO Impact: Affected SLOs

## Detection
How issue is detected

## Quick Assessment
2-minute triage steps

## Diagnosis
Step-by-step investigation

## Resolution
Step-by-step fix procedures

## Escalation
When and how to escalate

## Communication
Slack/status page templates

## Post-Incident
Follow-up actions

## Prevention
How to prevent recurrence
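
Conformance to this format can be checked mechanically. The following Python sketch reports which required `##` sections a runbook is missing. It treats Severity as optional because the template marks it as incident-only, and it matches headings by prefix so that a title like "Quick Assessment (2 minutes)" still counts. Both choices are assumptions.

```python
import re

REQUIRED_SECTIONS = [
    "Overview", "Detection", "Quick Assessment", "Diagnosis", "Resolution",
    "Escalation", "Communication", "Post-Incident", "Prevention",
]


def missing_sections(markdown: str):
    """Return required level-2 headings that the runbook text does not contain."""
    headings = [
        m.group(1).strip()
        for m in re.finditer(r"^##[ \t]+(.+)$", markdown, re.MULTILINE)
    ]
    return [
        section for section in REQUIRED_SECTIONS
        if not any(h.startswith(section) for h in headings)
    ]
```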

Command Standards

Copy-pasteable: No manual substitution needed
Expected output: Show what success looks like
Environment variables: Use for secrets/config
Multiple options: Provide Railway CLI and direct commands
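
The copy-pasteable rule can be partly enforced with a simple lint that flags angle-bracket placeholders such as `<id>` in commands. The sketch below is a heuristic in Python, not an existing tool; it can produce false positives on shell redirections written without spaces.

```python
import re

# Matches tokens like <id>, <pid>, <token>
PLACEHOLDER_RE = re.compile(r"<[A-Za-z][A-Za-z0-9_-]*>")


def find_placeholders(command: str):
    """Return angle-bracket placeholders that would need manual substitution."""
    return PLACEHOLDER_RE.findall(command)
```

Commands that reference environment variables instead (for example `${API_TOKEN}`, a hypothetical name) pass this check and satisfy the secrets guideline at the same time.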

FORGE Gate Compliance

Entry Gates

System architecture documented

Complete understanding of system components and dependencies.

Common failure modes identified

Known issues and their symptoms documented.

Monitoring and alerting in place

Alerts defined with dashboards and log aggregation.

Exit Gates

Runbooks created for each alert

Every critical alert has an associated runbook.
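
This gate reduces to a set difference between defined alerts and the alerts that runbooks cover. A minimal Python sketch, assuming both lists are available as alert names (how they are collected is left open):

```python
def alerts_without_runbooks(alert_names, runbook_alerts):
    """Return alert names, sorted, that no runbook claims to address."""
    return sorted(set(alert_names) - set(runbook_alerts))
```

An empty result means the gate passes; anything else is the list of runbooks still to be written.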

Commands are copy-pasteable

All commands work as written without modification.

Escalation paths defined

Clear criteria and contacts for escalation.

Runbooks tested

Validated via tabletop exercises or dry runs.

Related Agents

Agent	Relationship
Incident Commander	Uses runbooks during active incidents
Pipeline Auditor	Provides system health context
Railway Deployer	Source for deployment procedures

Source Files

View Agent Source

Repository: so1-io/so1-agents
Path: agents/documentation/runbook-writer.md
Version: 1.0.0
