# Postmortem: API Gateway Connection Pool Exhaustion

**Incident ID**: INC-20240115-0042
**Date**: 2024-01-15
**Severity**: SEV2
**Duration**: 25 minutes
**Authors**: @oncall-backend, @postmortem-analyst

## Summary

On January 15, 2024, the SO1 Control Plane API experienced elevated error rates (12% 5xx) for 25 minutes due to database connection pool exhaustion. Approximately 150 users were affected, experiencing failed workflow executions and API timeouts.

### Impact

- 150 users affected
- ~200 workflow executions failed
- API error rate peaked at 12%
- No data loss

## Timeline

| Time (UTC) | Event | Significance |
|------------|-------|--------------|
| 14:02 | Connection pool hits 100% utilization | First sign |
| 14:05 | PagerDuty alert fires | Detection |
| 14:08 | Incident declared | Response initiated |
| 14:15 | Root cause identified | Investigation complete |
| 14:20 | Pool size increased 20→50 | Mitigation applied |
| 14:25 | Error rate normal | Impact ends |

## Root Cause Analysis

### Primary Root Cause

The database connection pool was configured with a maximum of 20 connections, which was insufficient for the concurrent load generated by the new bulk workflow execution feature deployed earlier that day.

### 5 Whys Analysis

1. **Why did the API return 5xx errors?** → Because database queries were timing out
2. **Why were queries timing out?** → Because no database connections were available
3. **Why were no connections available?** → Because the connection pool was exhausted (20/20 in use)
4. **Why was the pool exhausted?** → Because bulk execution opened many concurrent connections
5. **Why wasn't the pool sized for this load?** → Because connection requirements weren't load tested before deployment

### Contributing Factors

| Factor | Category | Preventable |
|--------|----------|-------------|
| No load testing for bulk execution | process | ✓ |
| Pool size not documented | process | ✓ |
| No alert on pool utilization | technology | ✓ |
| Deployed during peak hours | process | ✓ |

## What Went Well

- Alert fired within 2 minutes
- Quick incident declaration (3 min)
- Root cause identified in 10 minutes
- Non-disruptive mitigation (no restart)
- Clear communication throughout

## What Went Poorly

- No proactive pool monitoring
- Load testing missed bulk operations
- Pool configuration undocumented
- Peak-hour deployment without gradual rollout

## Where We Got Lucky

- Fix worked immediately
- No data inconsistencies
- Experienced engineer on-call

## Action Items

| ID | Type | Description | Owner | Priority | Due |
|----|------|-------------|-------|----------|-----|
| AI-001 | detect | Add monitor for pool utilization >80% | @platform | P1 | 2024-01-17 |
| AI-002 | prevent | Document pool sizing in runbook | @backend | P2 | 2024-01-19 |
| AI-003 | prevent | Add load testing for bulk ops to CI | @qa | P1 | 2024-01-26 |
| AI-004 | process | Require gradual rollout for features | @eng-mgr | P2 | 2024-01-31 |
| AI-005 | prevent | Review all service connection pools | @platform | P2 | 2024-02-15 |

## Lessons Learned

1. **Connection pools need explicit sizing** - Defaults are often insufficient
2. **Load test new features** - Especially those changing concurrency
3. **Monitor before you need it** - Pool metrics should be day-one alerts
4. **Deploy during low-traffic** - Or use gradual rollout

## Metrics

| Metric | Value |
|--------|-------|
| Time to Detect | 5 min |
| Time to Mitigate | 15 min |
| Time to Resolve | 25 min |
| Responders | 2 |

## Tags

`database` `connection-pool` `capacity` `deployment` `bulk-operations`
**Key Insights**:

- Root cause: Insufficient connection pool sizing
- Process gap: No load testing for concurrency changes
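The failure mode and the AI-001 monitor are easy to reproduce in miniature. Below is a stdlib-only Python sketch of a fixed-size pool — an illustration, not the gateway's actual pool implementation, which the report does not show:

```python
import queue


class ConnectionPool:
    """Minimal fixed-size pool: checkout blocks up to `timeout` seconds,
    then fails -- the exhaustion mode behind this incident (20/20 in use)."""

    def __init__(self, factory, size=50, timeout=5.0):
        self._size = size
        self._timeout = timeout
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    def checkout(self):
        try:
            return self._idle.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("connection pool exhausted")

    def checkin(self, conn):
        self._idle.put(conn)

    def utilization(self):
        """Fraction of connections in use; AI-001 alerts above 0.80."""
        return (self._size - self._idle.qsize()) / self._size
```

With `size=20` and sustained bulk load, `checkout()` starts raising `TimeoutError` exactly as the 5 Whys chain describes; alerting when `utilization()` exceeds 0.80 is the shape of the AI-001 monitor.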
# Postmortem: API Response Time Degradation

**Incident ID**: INC-20240118-0023
**Severity**: SEV3
**Duration**: 2 hours (slow mitigation)

## Root Cause

N+1 query pattern introduced in the workflow listing endpoint during a pagination refactor. Each workflow loaded its execution history in a separate query, resulting in hundreds of queries per page load.

## Pattern Recognition

This is the **third N+1 query incident in 6 months**:

1. INC-20231015-0008: User listing endpoint
2. INC-20231201-0019: Organization dashboard
3. INC-20240118-0023: Workflow listing (current)

**Common Pattern**: Code reviews missed performance implications of ORM queries during pagination refactors.

## Root Cause (Organizational)

While the technical root cause is the N+1 query, the **organizational root cause** is the absence of query performance review in our code review checklist and lack of automated detection.

## Action Items

| ID | Type | Description | Owner | Priority |
|----|------|-------------|-------|----------|
| AI-001 | detect | Add query count monitoring per endpoint | @platform | P1 |
| AI-002 | prevent | Add ORM query checklist to PR template | @eng-mgr | P1 |
| AI-003 | prevent | Implement query count lint rule in CI | @platform | P0 |
| AI-004 | prevent | Review all listing endpoints for N+1 | @backend | P2 |
| AI-005 | process | Schedule quarterly ORM training | @eng-mgr | P2 |

## Lessons Learned

1. **Recurring patterns need systemic fixes** - Individual fixes aren't enough
2. **Automated detection > manual review** - Humans miss things under pressure
3. **Query performance isn't obvious** - ORM abstraction hides performance
4. **Training matters** - Team needs ongoing education on performance patterns

## Related Incidents

- INC-20231015-0008: N+1 in user listing (SEV3)
- INC-20231201-0019: N+1 in org dashboard (SEV3)

**Recommendation**: Create runbook for "N+1 Query Detection and Prevention"
**Key Insights**:

- Identified recurring pattern (3rd occurrence)
- Distinguished technical vs organizational root cause
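The N+1 shape, and the batched fix, can be sketched with stdlib `sqlite3` and a simple query counter; the table and column names here are illustrative, not the service's real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workflow (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE execution (id INTEGER PRIMARY KEY, workflow_id INTEGER);
    INSERT INTO workflow VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO execution VALUES (1, 1), (2, 1), (3, 2);
""")

queries = 0  # what AI-001's per-endpoint query count monitoring would observe

def q(sql, args=()):
    global queries
    queries += 1
    return conn.execute(sql, args).fetchall()

# N+1: one query for the page, then one more per workflow for its history.
workflows = q("SELECT id, name FROM workflow")
for wf_id, _ in workflows:
    q("SELECT id FROM execution WHERE workflow_id = ?", (wf_id,))
n_plus_1 = queries  # 1 + N -- grows with page size

# Batched fix: fetch every history for the page in a single IN query.
queries = 0
workflows = q("SELECT id, name FROM workflow")
ids = [wf_id for wf_id, _ in workflows]
placeholders = ",".join("?" * len(ids))
q(f"SELECT workflow_id, id FROM execution WHERE workflow_id IN ({placeholders})", ids)
batched = queries  # always 2, regardless of page size
```

A CI lint rule in the spirit of AI-003 would run an endpoint against a fixture page and fail the build when the observed query count scales with row count rather than staying constant.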
1. Why did X happen? → Because Y
2. Why did Y happen? → Because Z
3. Why did Z happen? → Because W
4. Why did W happen? → Because V
5. Why did V happen? → [Root cause]
Example: Connection pool exhaustion (see Usage Examples)
- **Incident resolved**: Status set to "closed" by Incident Commander
- **Record available**: Complete incident record with timeline
- **Responders identified**: Key participants available for input
- **Data accessible**: Logs, metrics, code changes retrievable

**Verification**: Factory Orchestrator confirms incident closure and data availability
## Exit Gates (Post-conditions)

This agent completes successfully when:

- **Root cause identified**: Primary cause documented with high confidence
- **Factors documented**: All contributing factors categorized
- **Timeline complete**: Full chronology with significance annotations
- **Action items created**: Specific tasks with owners and due dates
- **Report published**: Postmortem document available in knowledge base
- **Decision record logged**: Analysis rationale in ADR format

**Verification**: Gatekeeper validates postmortem completeness and action item clarity
## Decision Record Format

All significant postmortem conclusions are logged as:
```
date:2024-01-15T16:00:00Z|context:Analyzing connection pool exhaustion incident|decision:Root cause is undersized pool vs application bug|rationale:5 Whys leads to deployment without load testing, pool size was consequence|consequences:Action items focus on testing and monitoring, not just pool tuning|status:accepted
```
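A record in this format can be split back into its fields in a few lines. A minimal Python sketch, assuming field values never contain `|` and that only the first `:` in each field is the separator (the ISO timestamp's later colons stay in the value):

```python
def parse_decision_record(line: str) -> dict:
    """Split a pipe-delimited decision record into a field->value dict."""
    return dict(field.split(":", 1) for field in line.strip().split("|"))


record = parse_decision_record(
    "date:2024-01-15T16:00:00Z"
    "|context:Analyzing connection pool exhaustion incident"
    "|decision:Root cause is undersized pool vs application bug"
    "|rationale:5 Whys leads to deployment without load testing, pool size was consequence"
    "|consequences:Action items focus on testing and monitoring, not just pool tuning"
    "|status:accepted"
)
```

After parsing, `record["status"]` is `"accepted"` and `record["date"]` is the full timestamp, so downstream tooling can filter accepted decisions or sort records chronologically.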
**Insufficient Data for RCA**

- Cause: Logs rotated, metrics not retained, responders unavailable
- Resolution: Document data gaps, use available evidence, note confidence level <70%

**Multiple Plausible Root Causes**

- Cause: Complex incident with unclear causality
- Resolution: Use Fishbone diagram, document all contributing factors, assign confidence levels

**Recurrence of Known Issues**

- Cause: Previous action items not completed
- Resolution: Highlight pattern, escalate incomplete action items, recommend process changes