
Database Corruption Recovery

P0 - Critical Incident: This scenario typically results in complete service unavailability. Engage immediately.

Trigger Patterns

# Common error signatures that indicate this scenario
grep -E "(checksum mismatch|invalid page|PANIC.*checkpoint|recovery mode)" /var/log/postgresql/*.log
Pattern                  Severity   Likely Cause
checksum mismatch        Critical   Storage corruption, hardware failure
invalid page in block    Critical   Disk failure, incomplete write
PANIC: checkpoint        Critical   Catastrophic crash during write
recovery mode            High       Unclean shutdown, may self-recover
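The mapping above can be scripted so alerting picks the right severity automatically. A minimal sketch (the helper name is ours, not part of the runbook):

```shell
# Map a PostgreSQL log line to a severity bucket using the trigger
# patterns from the table above.
classify_corruption() {
  case "$1" in
    *"checksum mismatch"*)     echo "critical" ;;
    *"invalid page in block"*) echo "critical" ;;
    *PANIC*checkpoint*)        echo "critical" ;;
    *"recovery mode"*)         echo "high" ;;
    *)                         echo "none" ;;
  esac
}
```

Feed each line of the grep output above through it to tag matches before paging.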

Phase 1: Detect & Alert

AI Prompt: Initial Assessment

You are an on-call database reliability engineer responding to a PostgreSQL corruption alert.

**Context:**
- Service: {{SERVICE_NAME}}
- Database: {{DATABASE_NAME}}
- Error Log (last 100 lines):

{{ERROR_LOG}}

**Your task:**
1. Classify the corruption type (logical vs physical, recoverable vs catastrophic)
2. Identify the specific tables/indexes affected if determinable
3. Assess blast radius - what services depend on this database?
4. Recommend immediate containment actions (read-only mode, failover, etc.)

**Output format:**
- **Classification:** [type]
- **Affected Objects:** [list]
- **Blast Radius:** [services]
- **Immediate Actions:** [numbered list]
- **Confidence Level:** [high/medium/low] with reasoning

n8n Automation Hook

{
  "name": "DR-DB-Corruption-Detect",
  "webhook": {
    "path": "/dr-database-corruption/detect",
    "method": "POST",
    "responseMode": "lastNode"
  },
  "inputs": {
    "service_name": "string",
    "database_name": "string",
    "error_log": "string",
    "timestamp": "ISO8601"
  }
}
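Invoking the webhook looks roughly like this (the n8n host is an assumption; printf-based JSON breaks on quotes and newlines in the log, so use a real JSON encoder such as jq in production):

```shell
# Build the detection payload matching the inputs schema above.
# WARNING: printf does not escape JSON; illustrative only.
build_detect_payload() {
  printf '{"service_name":"%s","database_name":"%s","error_log":"%s","timestamp":"%s"}' \
    "$1" "$2" "$3" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# Hypothetical host; adjust to your n8n deployment:
# build_detect_payload orders-api orders "checksum mismatch in block 42" |
#   curl -sf -X POST "https://n8n.internal/dr-database-corruption/detect" \
#        -H 'Content-Type: application/json' -d @-
```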

Phase 2: Diagnose

AI Prompt: Root Cause Analysis

You are a PostgreSQL internals expert performing root cause analysis on database corruption.

**Incident Context:**
- Initial Assessment: {{PHASE1_OUTPUT}}
- PostgreSQL Version: {{PG_VERSION}}
- Storage Type: {{STORAGE_TYPE}} (e.g., EBS, local NVMe, network storage)
- Recent Events:
  - Deployments: {{RECENT_DEPLOYMENTS}}
  - Config Changes: {{CONFIG_CHANGES}}
  - Infrastructure Events: {{INFRA_EVENTS}}

**Available Diagnostics:**
{{PG_CONTROLDATA_OUTPUT}}

{{DMESG_STORAGE_ERRORS}}

**Your task:**
1. Determine root cause with confidence percentage
2. Identify the corruption timeline (when did it start?)
3. Assess data loss scope (which transactions may be lost?)
4. Map the dependency chain that led to this failure

**Output as structured analysis:**
{
  "root_cause": {
    "primary": "string",
    "contributing_factors": ["string"],
    "confidence": 0.0-1.0
  },
  "timeline": {
    "corruption_started": "ISO8601",
    "detection_time": "ISO8601",
    "estimated_window": "duration"
  },
  "data_loss": {
    "scope": "none|minimal|moderate|severe",
    "affected_tables": ["string"],
    "transaction_window": "duration"
  }
}
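Gathering {{DMESG_STORAGE_ERRORS}} for this prompt usually means filtering kernel messages for storage-failure signatures. A sketch (the pattern list is an assumption; extend it for your hardware):

```shell
# Filter stdin for common kernel storage-error signatures.
storage_error_filter() {
  grep -iE "(I/O error|blk_update_request|critical medium error|nvme.*(timeout|reset)|ata[0-9]+.*(error|failed))"
}

# Typical use (requires root for dmesg on many systems):
# dmesg --ctime | storage_error_filter > /tmp/dmesg_storage.txt
```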

Phase 3: Triage

AI Prompt: Recovery Path Selection

You are an incident commander deciding the recovery strategy for database corruption.

**Situation:**
- Root Cause Analysis: {{PHASE2_OUTPUT}}
- Available Backups:
  - Last full backup: {{LAST_FULL_BACKUP}} (age: {{BACKUP_AGE}})
  - WAL archives available: {{WAL_AVAILABLE}}
  - Standby replica status: {{REPLICA_STATUS}}
- Business Context:
  - RPO (acceptable data loss): {{RPO}}
  - RTO (acceptable downtime): {{RTO}}
  - Current downtime: {{CURRENT_DOWNTIME}}

**Recovery Options:**
1. **PITR (Point-in-Time Recovery)** - Restore to moment before corruption
2. **Failover to Replica** - Promote standby (if healthy)
3. **Selective Table Recovery** - Restore only affected tables
4. **pg_resetwal** - Force recovery (last resort, data loss likely)

**Your task:**
1. Rank recovery options by likelihood of success
2. Estimate time and data loss for each option
3. Recommend primary and fallback strategies
4. Identify go/no-go decision points
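For the failover option, one concrete go/no-go input is replica lag versus your RPO. A small helper (the 16 MB default, one WAL segment, is an arbitrary assumption):

```shell
# Return success if replica lag in bytes is within the threshold.
# Usage: replica_lag_ok LAG_BYTES [MAX_BYTES]
replica_lag_ok() {
  [ "$1" -le "${2:-16777216}" ]
}

# On the primary (sketch):
# lag=$(psql -Atc "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
#                  FROM pg_stat_replication;")
# replica_lag_ok "$lag" && echo "GO: promote standby" || echo "NO-GO: lag too high"
```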

Decision Matrix

Option                    Expected Data Loss           Typical Downtime            Preconditions
Failover to Replica       Up to replication lag        Minutes                     Healthy, uncorrupted standby
PITR                      Up to corruption point       Restore + WAL replay time   Base backup and WAL archive intact
Selective Table Recovery  Limited to affected tables   Varies by table size        Restorable copy of those tables
pg_resetwal               Likely and unbounded         Minutes                     Last resort only

Phase 4: Execute

AI Prompt: Recovery Execution Guide

You are guiding an engineer through database recovery execution.

**Selected Strategy:** {{SELECTED_STRATEGY}}
**Environment:**
- Primary Host: {{PRIMARY_HOST}}
- Backup Location: {{BACKUP_LOCATION}}
- Target Recovery Point: {{TARGET_RPO}}

**Generate step-by-step execution commands for {{SELECTED_STRATEGY}}.**

Requirements:
1. Each step must be atomic and reversible where possible
2. Include verification commands after each step
3. Include rollback instructions if step fails
4. Estimate time for each step
5. Flag any steps requiring manual approval
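One way to implement the atomic-step-with-verification pattern these requirements describe (the function and its arguments are illustrative, not part of the runbook):

```shell
# Run a step, verify it, and roll back on failure.
# Usage: run_step NAME CMD VERIFY_CMD ROLLBACK_CMD
run_step() {
  name="$1"; cmd="$2"; verify="$3"; rollback="$4"
  echo "== $name"
  if sh -c "$cmd" && sh -c "$verify"; then
    echo "OK: $name"
  else
    echo "FAILED: $name, rolling back" >&2
    sh -c "$rollback"
    return 1
  fi
}

# Example (sketch):
# run_step "stop postgres" "sudo systemctl stop postgresql" \
#          "! systemctl is-active --quiet postgresql" ":"
```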

Recovery Commands Quick Reference

# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Preserve the corrupted data directory for forensics
sudo mv /var/lib/postgresql/14/main /var/lib/postgresql/14/main.corrupted

# 3. Restore the base backup, then fix ownership and permissions
sudo tar -xzf /backups/base_backup_latest.tar.gz -C /var/lib/postgresql/14/
sudo chown -R postgres:postgres /var/lib/postgresql/14/main
sudo chmod 700 /var/lib/postgresql/14/main

# 4. Configure point-in-time recovery
sudo -u postgres tee /var/lib/postgresql/14/main/postgresql.auto.conf > /dev/null << EOF
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '{{TARGET_TIME}}'
recovery_target_action = 'promote'
EOF

# 5. Create the recovery signal file
sudo -u postgres touch /var/lib/postgresql/14/main/recovery.signal

# 6. Start PostgreSQL
sudo systemctl start postgresql

# 7. Monitor recovery
tail -f /var/log/postgresql/postgresql-14-main.log | grep -E "(recovery|restored|checkpoint)"
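Once the log shows recovery finishing, confirm the server actually promoted: pg_is_in_recovery() should return f. A tiny helper so the check is scriptable (the function name is ours):

```shell
# pg_is_in_recovery() returns 't' while replaying WAL, 'f' after promotion.
promotion_complete() { [ "$1" = "f" ]; }

# status=$(psql -Atc "SELECT pg_is_in_recovery();")
# promotion_complete "$status" && echo "recovery complete, server promoted"
```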

Phase 5: Verify

AI Prompt: Recovery Verification

Generate verification steps for database recovery.

**Recovery Strategy Used:** {{STRATEGY}}
**Target Recovery Point:** {{TARGET_RPO}}
**Affected Services:** {{SERVICES}}

**Create verification checklist for:**
1. Data integrity (checksums pass)
2. No missing transactions within RPO
3. All application queries working
4. Replication health (if applicable)
5. Backup chain restored

Verification Commands

# Check data integrity
pg_amcheck --all --verbose

# Spot-check that relation files resolve on disk (a full checksum pass needs
# pg_checksums --check, which requires the server to be stopped)
psql -c "SELECT relname, pg_relation_filepath(oid) FROM pg_class WHERE relkind = 'r' LIMIT 10;"

# Confirm replication status
psql -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"

# Run application health checks
for service in api auth worker; do
  curl -f "https://${service}.internal/health"
done
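Services may take a moment to reconnect after recovery, so a retrying variant of the loop above is often more reliable (retry count and delay are arbitrary defaults):

```shell
# Poll a health endpooint until it responds or retries run out.
# Usage: check_health URL [TRIES]
check_health() {
  url="$1"; tries="${2:-5}"
  i=1
  while [ "$i" -le "$tries" ]; do
    curl -sf "$url" >/dev/null 2>&1 && return 0
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# for service in api auth worker; do
#   check_health "https://${service}.internal/health" || echo "FAIL: $service"
# done
```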

Phase 6: Prevention

Prevention Checklist

immediate_actions:
  - action: "Enable data checksums"
    command: "initdb --data-checksums (requires reinit or pg_checksums)"
    
  - action: "Verify backup integrity"
    frequency: "Daily"
    command: "pg_restore --list {{BACKUP_FILE}} && pg_verifybackup {{BACKUP_DIR}}"

monitoring_improvements:
  - action: "Add corruption detection alerts"
    metrics:
      - "PostgreSQL checksum failures"
      - "Storage I/O errors"
      - "Replication lag"
    
  - action: "Implement backup verification"
    frequency: "Weekly restore test to staging"

infrastructure_hardening:
  - action: "Enable synchronous replication"
    config: "synchronous_commit = on, synchronous_standby_names = '{{STANDBY_NAMES}}'"
    
  - action: "Use checksummed storage"
    options: ["ZFS", "Btrfs", "EBS with checksums"]

Always test your backup restoration process monthly. The worst time to discover your backups don’t work is during an actual incident.