# Database Corruption Recovery

**P0 - Critical Incident:** This scenario typically results in complete service unavailability. Engage immediately.
## Trigger Patterns

```shell
# Common error signatures that indicate this scenario
grep -E "(checksum mismatch|invalid page|PANIC.*checkpoint|recovery mode)" /var/log/postgresql/*.log
```
| Pattern | Severity | Likely Cause |
|---|---|---|
| `checksum mismatch` | Critical | Storage corruption, hardware failure |
| `invalid page in block` | Critical | Disk failure, incomplete write |
| `PANIC: checkpoint` | Critical | Catastrophic crash during write |
| `recovery mode` | High | Unclean shutdown; may self-recover |
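These patterns can feed the Phase 1 detection webhook automatically. A minimal sketch, assuming your own log path and service/database names (the names shown and the webhook URL in the usage note are placeholders, not real endpoints):

```shell
#!/usr/bin/env bash
# Scan PostgreSQL logs for the trigger patterns above and, on a match,
# build the JSON payload expected by the Phase 1 detection webhook.
PATTERN='(checksum mismatch|invalid page|PANIC.*checkpoint|recovery mode)'

scan_and_build_payload() {
  local log_glob="$1" matches
  # glob must expand, so $log_glob is intentionally unquoted
  matches=$(grep -hE "$PATTERN" $log_glob 2>/dev/null | tail -n 100)
  if [ -z "$matches" ]; then return 0; fi
  MATCHES="$matches" python3 - <<'PY'
import datetime, json, os
print(json.dumps({
    "service_name": "payments-api",   # assumption: your service name
    "database_name": "payments",      # assumption: your database name
    "error_log": os.environ["MATCHES"],
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}))
PY
}

# Usage (webhook URL is an assumption):
# payload=$(scan_and_build_payload '/var/log/postgresql/*.log')
# [ -n "$payload" ] && curl -sf -X POST -H 'Content-Type: application/json' \
#     -d "$payload" https://n8n.example.internal/webhook/dr-database-corruption/detect
```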
## Phase 1: Detect & Alert

### AI Prompt: Initial Assessment
```text
You are an on-call database reliability engineer responding to a PostgreSQL corruption alert.

**Context:**
- Service: {{SERVICE_NAME}}
- Database: {{DATABASE_NAME}}
- Error Log (last 100 lines):
{{ERROR_LOG}}

**Your task:**
1. Classify the corruption type (logical vs physical, recoverable vs catastrophic)
2. Identify the specific tables/indexes affected, if determinable
3. Assess the blast radius - which services depend on this database?
4. Recommend immediate containment actions (read-only mode, failover, etc.)

**Output format:**
- **Classification:** [type]
- **Affected Objects:** [list]
- **Blast Radius:** [services]
- **Immediate Actions:** [numbered list]
- **Confidence Level:** [high/medium/low] with reasoning
```
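The `{{...}}` placeholders used throughout this runbook can be filled programmatically before a prompt is sent to the model. A sketch (the variable names match the template above; the values you pass are your own):

```shell
# Render a prompt template by substituting {{NAME}} placeholders from
# environment variables; unknown placeholders are left untouched.
render_prompt() {
  python3 - "$1" <<'PY'
import os, re, sys
text = open(sys.argv[1]).read()
print(re.sub(r"\{\{(\w+)\}\}",
             lambda m: os.environ.get(m.group(1), m.group(0)),
             text), end="")
PY
}

# Usage (values are examples):
# SERVICE_NAME=payments-api DATABASE_NAME=payments \
#   render_prompt phase1_prompt.txt
```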
### n8n Automation Hook

**Webhook Trigger**

```json
{
  "name": "DR-DB-Corruption-Detect",
  "webhook": {
    "path": "/dr-database-corruption/detect",
    "method": "POST",
    "responseMode": "lastNode"
  },
  "inputs": {
    "service_name": "string",
    "database_name": "string",
    "error_log": "string",
    "timestamp": "ISO8601"
  }
}
```
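To exercise the webhook by hand, a payload matching this input schema can be built and validated before sending. A sketch (the n8n host in the commented `curl` is an assumption; the error text is an example):

```shell
# Build a payload matching the webhook's documented inputs
payload='{
  "service_name": "payments-api",
  "database_name": "payments",
  "error_log": "PANIC: could not locate a valid checkpoint record",
  "timestamp": "2025-06-01T12:00:00Z"
}'

# Fail fast on malformed JSON before hitting the webhook
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"

# curl -sf -X POST https://n8n.example.internal/webhook/dr-database-corruption/detect \
#   -H 'Content-Type: application/json' -d "$payload"
```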
**n8n Workflow**

```json
{
  "name": "DR-DB-Corruption-Detect",
  "nodes": [
    {
      "id": "webhook-trigger",
      "type": "n8n-nodes-base.webhook",
      "parameters": {
        "path": "dr-database-corruption/detect",
        "responseMode": "lastNode"
      }
    },
    {
      "id": "claude-assess",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "https://api.anthropic.com/v1/messages",
        "method": "POST",
        "headers": {
          "x-api-key": "={{$env.ANTHROPIC_API_KEY}}",
          "content-type": "application/json",
          "anthropic-version": "2023-06-01"
        },
        "body": {
          "model": "claude-sonnet-4-20250514",
          "max_tokens": 2000,
          "messages": [{ "role": "user", "content": "..." }]
        }
      }
    },
    {
      "id": "slack-alert",
      "type": "n8n-nodes-base.slack",
      "parameters": {
        "channel": "#incidents",
        "text": "DATABASE CORRUPTION DETECTED\n\nSeverity: {{$json.severity}}\nIncident ID: {{$json.incident_id}}"
      }
    }
  ]
}
```
## Phase 2: Diagnose

### AI Prompt: Root Cause Analysis
```text
You are a PostgreSQL internals expert performing root cause analysis on database corruption.

**Incident Context:**
- Initial Assessment: {{PHASE1_OUTPUT}}
- PostgreSQL Version: {{PG_VERSION}}
- Storage Type: {{STORAGE_TYPE}} (e.g., EBS, local NVMe, network storage)
- Recent Events:
  - Deployments: {{RECENT_DEPLOYMENTS}}
  - Config Changes: {{CONFIG_CHANGES}}
  - Infrastructure Events: {{INFRA_EVENTS}}

**Available Diagnostics:**
{{PG_CONTROLDATA_OUTPUT}}
{{DMESG_STORAGE_ERRORS}}

**Your task:**
1. Determine the root cause with a confidence percentage
2. Identify the corruption timeline (when did it start?)
3. Assess the data loss scope (which transactions may be lost?)
4. Map the dependency chain that led to this failure

**Output as structured analysis:**
{
  "root_cause": {
    "primary": "string",
    "contributing_factors": ["string"],
    "confidence": 0.0-1.0
  },
  "timeline": {
    "corruption_started": "ISO8601",
    "detection_time": "ISO8601",
    "estimated_window": "duration"
  },
  "data_loss": {
    "scope": "none|minimal|moderate|severe",
    "affected_tables": ["string"],
    "transaction_window": "duration"
  }
}
```
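The `{{PG_CONTROLDATA_OUTPUT}}` and `{{DMESG_STORAGE_ERRORS}}` inputs can be collected with a sketch like this (the data directory path is an assumption, and the `dmesg` filter is a starting point, not an exhaustive list of storage-error signatures):

```shell
PGDATA=/var/lib/postgresql/14/main   # assumption: adjust to your cluster

# Control-file state (requires read access to the data directory)
if command -v pg_controldata > /dev/null; then
  PG_CONTROLDATA_OUTPUT=$(pg_controldata "$PGDATA" 2>/dev/null)
fi

# Filter kernel messages for storage-level errors
storage_errors() {
  grep -iE '(I/O error|blk_update_request|medium error|nvme.*timeout|EXT4-fs error|XFS.*corruption)'
}
DMESG_STORAGE_ERRORS=$(dmesg -T 2>/dev/null | storage_errors)
```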
## Phase 3: Triage

### AI Prompt: Recovery Path Selection
```text
You are an incident commander deciding the recovery strategy for database corruption.

**Situation:**
- Root Cause Analysis: {{PHASE2_OUTPUT}}
- Available Backups:
  - Last full backup: {{LAST_FULL_BACKUP}} (age: {{BACKUP_AGE}})
  - WAL archives available: {{WAL_AVAILABLE}}
  - Standby replica status: {{REPLICA_STATUS}}
- Business Context:
  - RPO (acceptable data loss): {{RPO}}
  - RTO (acceptable downtime): {{RTO}}
  - Current downtime: {{CURRENT_DOWNTIME}}

**Recovery Options:**
1. **PITR (Point-in-Time Recovery)** - Restore to the moment before corruption
2. **Failover to Replica** - Promote the standby (if healthy)
3. **Selective Table Recovery** - Restore only the affected tables
4. **pg_resetwal** - Force recovery (last resort; data loss likely)

**Your task:**
1. Rank the recovery options by likelihood of success
2. Estimate time and data loss for each option
3. Recommend primary and fallback strategies
4. Identify go/no-go decision points
```
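The go/no-go decision points above reduce to simple budget arithmetic: an option is viable only if its estimated recovery time fits the remaining RTO budget and its expected data loss fits within the RPO. A sketch (all numbers are illustrative):

```shell
# Remaining downtime budget, in minutes
rto_remaining() {  # args: rto_minutes elapsed_downtime_minutes
  echo $(( $1 - $2 ))
}

# GO if estimated recovery time and estimated data loss both fit the budgets
option_viable() {  # args: est_recovery_min est_loss_min rto_left_min rpo_min
  if [ "$1" -le "$3" ] && [ "$2" -le "$4" ]; then echo GO; else echo NO-GO; fi
}

# Example: RTO 60 min with 20 min already elapsed leaves 40 min.
left=$(rto_remaining 60 20)
option_viable 45 5 "$left" 15    # PITR estimated at 45 min -> prints NO-GO
option_viable 10 0 "$left" 15    # replica failover at 10 min -> prints GO
```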
### Decision Matrix

| Option | Best When | Data Loss Risk |
|---|---|---|
| PITR | WAL archives are intact back to a point before the corruption | Transactions after the recovery target |
| Failover to Replica | Standby is healthy and the corruption has not replicated | Low to none |
| Selective Table Recovery | Corruption is confined to known tables | Limited to affected tables |
| pg_resetwal | All other options are exhausted | High; data loss likely |
## Phase 4: Execute

### AI Prompt: Recovery Execution Guide
```text
You are guiding an engineer through database recovery execution.

**Selected Strategy:** {{SELECTED_STRATEGY}}

**Environment:**
- Primary Host: {{PRIMARY_HOST}}
- Backup Location: {{BACKUP_LOCATION}}
- Target Recovery Point: {{TARGET_RPO}}

**Generate step-by-step execution commands for {{SELECTED_STRATEGY}}.**

Requirements:
1. Each step must be atomic and reversible where possible
2. Include verification commands after each step
3. Include rollback instructions if a step fails
4. Estimate the time for each step
5. Flag any steps requiring manual approval
```
### Recovery Commands Quick Reference
#### PITR Recovery

```shell
# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Back up the current (corrupted) data directory
sudo mv /var/lib/postgresql/14/main /var/lib/postgresql/14/main.corrupted

# 3. Restore the base backup and fix ownership
sudo tar -xzf /backups/base_backup_latest.tar.gz -C /var/lib/postgresql/14/
sudo chown -R postgres:postgres /var/lib/postgresql/14/main

# 4. Configure recovery (append, so existing settings survive)
sudo -u postgres tee -a /var/lib/postgresql/14/main/postgresql.auto.conf > /dev/null << EOF
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '{{TARGET_TIME}}'
recovery_target_action = 'promote'
EOF

# 5. Create the recovery signal file
sudo -u postgres touch /var/lib/postgresql/14/main/recovery.signal

# 6. Start PostgreSQL
sudo systemctl start postgresql

# 7. Monitor recovery
tail -f /var/log/postgresql/postgresql-14-main.log | grep -E "(recovery|restored|checkpoint)"
```
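Before starting PITR it is worth spot-checking the archive for gaps, since a single missing segment halts replay. A sketch, assuming the default 16 MB segment size and a single timeline (multi-timeline archives need more care):

```shell
# Report gaps in a WAL archive directory. Segment filenames are 24 hex
# digits: timeline (8) + log (8) + segment (8); with 16 MB segments the
# combined log*256+segment counter increases by exactly 1 per file.
check_wal_gaps() {
  local dir="$1" prev=-1 name n
  for name in $(ls "$dir" | grep -E '^[0-9A-F]{24}$' | sort); do
    n=$(( 16#${name:8:8} * 256 + 16#${name:16:8} ))
    if [ "$prev" -ge 0 ] && [ "$n" -ne $(( prev + 1 )) ]; then
      echo "gap before $name"
    fi
    prev=$n
  done
}

# Usage: check_wal_gaps /wal_archive
```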
#### Replica Failover

```shell
# 1. Verify replica health
psql -h {{REPLICA_HOST}} -c "SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# 2. Stop writes to the primary (if accessible); exclude our own session
psql -h {{PRIMARY_HOST}} -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state != 'idle' AND pid <> pg_backend_pid();"

# 3. Promote the replica
psql -h {{REPLICA_HOST}} -c "SELECT pg_promote();"

# 4. Verify promotion
psql -h {{REPLICA_HOST}} -c "SELECT pg_is_in_recovery();"  # should return 'f'

# 5. Update connection strings
# (application-specific: update DNS, connection pools, etc.)

# 6. Verify application connectivity
curl -f https://api.yourapp.com/health
```
#### Selective Table Restore

```shell
# 1. Identify affected tables
psql -c "SELECT relname FROM pg_class WHERE relfilenode IN ({{CORRUPTED_FILENODE_LIST}});"

# 2. Export table structure
pg_dump -s -t {{TABLE_NAME}} > table_schema.sql

# 3. Restore the specific table from backup
pg_restore -d {{DATABASE}} -t {{TABLE_NAME}} {{BACKUP_FILE}}

# 4. Verify row counts match
psql -c "SELECT COUNT(*) FROM {{TABLE_NAME}};"
```
#### pg_resetwal (Last Resort)

**DANGER:** This command can cause data loss. Only use it when all other options are exhausted.

```shell
# 1. STOP - ensure PostgreSQL is down and you have taken every possible backup
sudo systemctl stop postgresql
sudo cp -r /var/lib/postgresql/14/main /var/lib/postgresql/14/main.backup.$(date +%Y%m%d)

# 2. Run pg_resetwal as the data directory owner
sudo -u postgres pg_resetwal -f /var/lib/postgresql/14/main

# 3. Start PostgreSQL
sudo systemctl start postgresql

# 4. Immediately run integrity checks
pg_amcheck --all

# 5. Document ALL missing transactions for business reconciliation
```
## Phase 5: Verify

### AI Prompt: Recovery Verification
```text
Generate verification steps for database recovery.

**Recovery Strategy Used:** {{STRATEGY}}
**Target Recovery Point:** {{TARGET_RPO}}
**Affected Services:** {{SERVICES}}

**Create a verification checklist for:**
1. Data integrity (checksums pass)
2. No missing transactions within the RPO
3. All application queries working
4. Replication health (if applicable)
5. Backup chain restored
```
### Verification Commands

```shell
# Check data integrity
pg_amcheck --all --verbose

# Confirm data checksums are enabled cluster-wide
psql -c "SHOW data_checksums;"

# Confirm replication status
psql -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"

# Run application health checks
for service in api auth worker; do
  curl -f "https://${service}.internal/health"
done
```
## Phase 6: Prevention

### Prevention Checklist
```yaml
immediate_actions:
  - action: "Enable data checksums"
    command: "initdb --data-checksums (requires reinit or pg_checksums)"
  - action: "Verify backup integrity"
    frequency: "Daily"
    command: "pg_restore --list {{BACKUP_FILE}} && pg_verifybackup"

monitoring_improvements:
  - action: "Add corruption detection alerts"
    metrics:
      - "PostgreSQL checksum failures"
      - "Storage I/O errors"
      - "Replication lag"
  - action: "Implement backup verification"
    frequency: "Weekly restore test to staging"

infrastructure_hardening:
  - action: "Enable synchronous replication"
    config: "synchronous_commit = on (with synchronous_standby_names set)"
  - action: "Use checksummed storage"
    options: ["ZFS", "Btrfs", "EBS with checksums"]
```
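The weekly restore test can start from a small sketch that picks the newest base backup (the date-stamped naming scheme, paths, and staging host are assumptions; the commented restore and check steps need a real staging instance to run):

```shell
# Pick the newest base backup by date-stamped filename
latest_backup() {
  ls -1 "$1"/base_backup_*.tar.gz 2>/dev/null | sort | tail -n 1
}

BACKUP=$(latest_backup /backups)
echo "restore-testing: $BACKUP"

# sudo tar -xzf "$BACKUP" -C /var/lib/postgresql/14/staging/
# pg_verifybackup /var/lib/postgresql/14/staging/main
# pg_amcheck --all --host staging-db.internal
```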
## Related Scenarios
- Certificate Expiry - TLS/SSL incidents
- Security Breach - When database compromise is suspected
Always test your backup restoration process monthly. The worst time to discover your backups don’t work is during an actual incident.