
Quick Reference

| Property | Value |
| --- | --- |
| Domain | Incident |
| FORGE Stage | Cross-cutting (first responder) |
| Version | 1.0.0 |
| Primary Output | Triage reports with severity classification and routing |
Use this agent when you need to:
  • Evaluate incoming alerts for validity and urgency
  • Classify incident severity using SEV0-SEV4 scale
  • Assess impact scope and affected user count
  • Route incidents to appropriate responder teams

Core Capabilities

Alert Assessment

Evaluates incoming alerts for validity, urgency, and correlation with existing issues

Severity Classification

Assigns appropriate SEV0-SEV4 level based on impact, scope, and urgency

Impact Analysis

Determines blast radius, affected services, and estimated user impact

Routing Decision

Directs incidents to appropriate responders with urgency guidance

When to Use

Ideal Use Cases

New alerts from monitoring systems (Datadog, PagerDuty, Sentry)
Manual incident reports from users or support team
Anomaly detection triggers requiring classification
Multi-symptom scenarios needing correlation and assessment
Unknown incidents requiring initial investigation guidance

Usage Examples

SEV2: Connection Pool Exhaustion

Alert: API error rate >5%, response latency spike

Triage Report:
{
  "type": "triage-report",
  "timestamp": "2024-01-15T14:07:00Z",
  "content": {
    "alert_id": "pd-alert-789xyz",
    "alert_source": "pagerduty",
    "received_at": "2024-01-15T14:05:00Z",
    "triage_completed_at": "2024-01-15T14:07:00Z",
    "classification": {
      "is_incident": true,
      "severity": "SEV2",
      "confidence": 0.85,
      "category": "availability"
    },
    "symptoms": [
      {
        "symptom": "HTTP 5xx error rate elevated",
        "source": "Datadog APM",
        "started_at": "2024-01-15T14:03:00Z",
        "current_value": "12%",
        "threshold": "< 1%"
      },
      {
        "symptom": "API response latency p99 spike",
        "source": "Datadog APM",
        "current_value": "4500ms",
        "threshold": "< 500ms"
      },
      {
        "symptom": "Database connection pool utilization",
        "source": "Railway metrics",
        "current_value": "100%",
        "threshold": "< 80%"
      }
    ],
    "impact": {
      "users_affected": "many",
      "user_count_estimate": 150,
      "services_affected": ["so1-control-plane-api"],
      "regions_affected": ["us-east-1"],
      "functionality_impact": "Workflow executions failing, API requests timing out"
    },
    "initial_hypothesis": [
      {
        "hypothesis": "Database connection pool exhaustion due to traffic spike",
        "confidence": 0.75,
        "supporting_evidence": [
          "Connection pool at 100%",
          "Errors correlate with pool exhaustion",
          "Traffic 2x normal levels"
        ],
        "investigation_steps": [
          "Check for long-running queries",
          "Review connection pool configuration",
          "Check for connection leaks"
        ]
      }
    ],
    "routing": {
      "escalate_to_incident": true,
      "suggested_commander": "oncall-backend",
      "teams_to_involve": ["backend", "platform"],
      "urgency": "immediate"
    },
    "context": {
      "recent_deployments": [
        {
          "service": "so1-control-plane-api",
          "deployed_at": "2024-01-15T10:30:00Z",
          "commit": "abc123"
        }
      ],
      "recent_changes": ["New workflow bulk execution feature deployed"]
    }
  }
}
Triage Decision:
  • Severity: SEV2 (significant degradation, many users affected)
  • Confidence: 85% (clear symptoms, known pattern)
  • Routing: Immediate escalation to backend on-call
  • Hypothesis: Connection pool exhaustion (75% confidence)
  • Investigation: Check connection pool config, long-running queries

Output Format

Triage Report Schema

interface TriageReport {
  type: "triage-report";
  version: "1.0.0";
  generated_by: "triage-responder";
  timestamp: string; // ISO8601
  
  content: {
    alert_id: string;
    alert_source: "pagerduty" | "datadog" | "sentry" | "manual" | "n8n";
    received_at: string;
    triage_completed_at: string;
    
    classification: {
      is_incident: boolean;
      severity: "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4" | "none";
      confidence: number; // 0.0 - 1.0
      category: "availability" | "performance" | "security" | "data" | "functional";
    };
    
    symptoms: Array<{
      symptom: string;
      source: string;
      started_at: string;
      current_value: string;
      threshold: string;
    }>;
    
    impact: {
      users_affected: "none" | "some" | "many" | "all";
      user_count_estimate: number | string;
      services_affected: string[];
      regions_affected: string[];
      functionality_impact: string;
    };
    
    initial_hypothesis: Array<{
      hypothesis: string;
      confidence: number;
      supporting_evidence: string[];
      investigation_steps: string[];
    }>;
    
    routing: {
      escalate_to_incident: boolean;
      suggested_commander: string;
      teams_to_involve: string[];
      urgency: "immediate" | "urgent" | "standard" | "none";
    };
    
    context: {
      recent_deployments: Array<object>;
      recent_changes: string[];
      related_alerts: string[];
      known_issues: string[];
    };
  };
}
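Downstream consumers may want to validate reports against this schema at runtime before acting on them. Below is a minimal sketch of such a guard; `isTriageReport` is an illustrative helper, not part of the agent, and it only spot-checks the classification fields rather than the full schema.

```typescript
// Minimal runtime guard for incoming triage reports (illustrative; checks
// only the classification block, not every field of the schema above).
const SEVERITIES = ["SEV0", "SEV1", "SEV2", "SEV3", "SEV4", "none"] as const;

function isTriageReport(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, any>;
  if (v.type !== "triage-report") return false;
  const c = v.content?.classification;
  return (
    typeof c === "object" &&
    c !== null &&
    typeof c.is_incident === "boolean" &&
    SEVERITIES.includes(c.severity) &&
    typeof c.confidence === "number" &&
    c.confidence >= 0 &&
    c.confidence <= 1
  );
}
```

A guard like this lets a handoff step reject malformed reports early instead of failing mid-orchestration.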

Severity Classification Matrix

SEV0 - Critical (Immediate Response)

| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Complete outage | All services down |
| User Impact | All users affected | 100% error rate |
| Data Impact | Data loss or corruption | Database failure |
| Security | Active breach | Unauthorized access detected |
| Revenue | Direct revenue loss | Payment processing down |
Response: All-hands, executive notification, customer communication

SEV1 - Major (15 min response)

| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Major feature unavailable | Auth service down |
| User Impact | >50% users affected | Critical API failing |
| Data Impact | Data integrity concerns | Sync failures |
| Security | Active vulnerability | Exploit in progress |
| Revenue | Significant revenue impact | Checkout broken |
Response: Senior engineer + management, customer updates

SEV2 - Significant (1 hour response)

| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Degraded performance | High latency |
| User Impact | 10-50% users affected | Regional outage |
| Data Impact | Potential data issues | Validation errors |
| Security | Potential vulnerability | Unpatched CVE |
| Revenue | Indirect revenue impact | Feature unavailable |
Response: On-call engineer, internal notifications

SEV3 - Minor (4 hour response)

| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Partial degradation | Slow endpoint |
| User Impact | <10% users affected | Edge case failures |
| Data Impact | No data impact | Display issues only |
| Security | Low-risk issue | Info disclosure |
| Revenue | Minimal impact | Analytics delayed |
Response: Standard queue, on-call engineer

SEV4 - Cosmetic (Best effort)

| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Cosmetic issues | UI alignment off |
| User Impact | Individual users | Single user report |
| Data Impact | None | N/A |
| Security | Informational | Security scan finding |
| Revenue | None | N/A |
Response: Backlog, any engineer
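The matrix above can be sketched as a decision tree. The function below is illustrative only: it covers two of the five indicators (availability and user impact) with thresholds copied from the tables, whereas real triage weighs all five plus confidence.

```typescript
// Illustrative decision tree over two of the matrix's indicators.
// Thresholds mirror the severity tables; other indicators are omitted.
type Severity = "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface Indicators {
  completeOutage: boolean;      // SEV0 availability criterion
  majorFeatureDown: boolean;    // SEV1 availability criterion
  usersAffectedFraction: number; // 0.0 - 1.0
}

function classifySeverity(i: Indicators): Severity {
  if (i.completeOutage || i.usersAffectedFraction >= 1.0) return "SEV0";
  if (i.majorFeatureDown || i.usersAffectedFraction > 0.5) return "SEV1";
  if (i.usersAffectedFraction >= 0.1) return "SEV2";
  if (i.usersAffectedFraction > 0) return "SEV3";
  return "SEV4";
}
```

For the SEV2 example earlier (150 users affected, service degraded but up), a fraction in the 10-50% band lands on SEV2, matching the logged decision record.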

FORGE Gate Compliance

Before invoking this agent, ensure:
  • Alert detected: Automated alert or manual report received
  • Metrics accessible: System metrics and health endpoints available
  • Health endpoints responding: Or confirmed down with error messages
Verification: Factory Orchestrator confirms alert data completeness
This agent completes successfully when:
  • Severity assigned: SEV0-4 or non-incident classification
  • Impact assessed: User and service impact estimated
  • Services identified: Affected services and regions listed
  • Routing complete: Incident Commander notified if warranted (SEV0-3)
  • Triage documented: Full triage report produced
  • Decision record logged: Classification rationale in ADR format
Verification: Gatekeeper validates triage completeness before handoff
All severity classification decisions are logged as:
date:2024-01-15T14:07:00Z|context:API error rate 12%, connection pool 100%|decision:Classify as SEV2 vs SEV1|rationale:Service degraded but not fully down, 150 users affected (<50%)|consequences:Immediate on-call escalation but not executive notification|status:accepted
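The pipe-delimited decision record above can be parsed back into fields for auditing. A minimal sketch (assumes each segment is `key:value` and splits on the first colon, so ISO timestamps survive intact):

```typescript
// Parse a pipe-delimited ADR-style decision record into key/value fields.
// Splits each segment on its FIRST colon only, preserving ISO8601 timestamps.
function parseDecisionRecord(record: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const segment of record.split("|")) {
    const idx = segment.indexOf(":");
    if (idx === -1) continue; // skip malformed segments
    fields[segment.slice(0, idx)] = segment.slice(idx + 1);
  }
  return fields;
}
```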

Integration Points

Control Plane API

Used Endpoints:
  • GET /api/v1/health - Overall service health status
  • GET /api/v1/health/dependencies - Dependency health (database, Redis, etc.)
  • GET /api/v1/workflows/stats - Workflow execution metrics
  • GET /api/v1/deployments/recent - Recent deployment history
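A triage run might poll the two health endpoints above as its first step. The sketch below injects the transport so the logic stays testable; the fetcher shape and "healthy iff HTTP 200" rule are assumptions, not documented API behavior.

```typescript
// Poll Control Plane health endpoints via an injected fetcher (illustrative).
type Fetcher = (path: string) => { status: number };

const HEALTH_PATHS = [
  "/api/v1/health",
  "/api/v1/health/dependencies",
] as const;

function pollHealth(fetcher: Fetcher): Record<string, boolean> {
  const results: Record<string, boolean> = {};
  for (const path of HEALTH_PATHS) {
    try {
      results[path] = fetcher(path).status === 200; // healthy iff 200
    } catch {
      results[path] = false; // transport failure: treat as confirmed down
    }
  }
  return results;
}
```

Treating a transport error as "confirmed down" (rather than "unknown") matches the gate requirement that health endpoints either respond or are confirmed down with error messages.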

Veritas Prompt Library

Consumes:
  • vrt-triage01: Triage decision trees and severity criteria
  • vrt-symptoms01: Symptom-to-cause mapping patterns
  • vrt-impact01: Impact assessment frameworks
Produces:
  • Novel triage patterns in veritas/agent-prompts/incident/
  • Status: draft (requires review)
| Agent | Relationship | Integration Point |
| --- | --- | --- |
| Incident Commander | Downstream | Hands off classified incidents for orchestration |
| Factory Orchestrator | Upstream | Receives alerts, routes to Triage Responder |
| Postmortem Analyst | Peer | May provide historical incident patterns for classification |

Workflow Process

1. Alert Reception

Receive and validate incoming alert
  • Validate alert is not duplicate
  • Check for stale alerts
  • Capture initial symptoms
  • Record timestamp

2. Symptom Collection

Gather additional symptoms and context
  • Collect related metrics
  • Check service health
  • Identify recent changes
  • Correlate related alerts

3. Severity Classification

Classify incident severity
  • Apply decision tree
  • Calculate confidence score
  • Assign category
  • Document rationale

4. Impact Assessment

Determine scope and blast radius
  • Estimate user impact
  • List affected services
  • Describe functionality impact
  • Identify regions affected

5. Routing Decision

Decide routing and escalation
  • Determine if incident warranted
  • Assign commander
  • Notify teams
  • Generate triage report

Error Handling

Common Issues

Low Confidence Classification (<70%)
Cause: Ambiguous symptoms, conflicting signals
Resolution: Escalate to Incident Commander with "investigation required" status, provide multiple hypotheses

Missing Context Data
Cause: Health endpoints unavailable, metrics not accessible
Resolution: Use available data, document gaps, escalate if critical service affected

Alert Storm (Multiple Simultaneous)
Cause: Cascading failure triggering many alerts
Resolution: Correlate alerts, identify root service, classify as single incident
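The alert-storm resolution above hinges on correlating bursts of alerts into one incident. A minimal sketch of time-window grouping (the 5-minute window and the `Alert` shape are illustrative; production correlation would also weigh service dependencies):

```typescript
// Group alerts that fire within a rolling time window, so a cascading
// failure is triaged as one incident. Window size is illustrative.
interface Alert {
  id: string;
  service: string;
  firedAt: number; // epoch ms
}

function correlateAlerts(alerts: Alert[], windowMs = 5 * 60_000): Alert[][] {
  const sorted = [...alerts].sort((a, b) => a.firedAt - b.firedAt);
  const groups: Alert[][] = [];
  for (const alert of sorted) {
    // Join an existing group if this alert fired soon after its latest member.
    const group = groups.find(
      (g) => alert.firedAt - g[g.length - 1].firedAt <= windowMs
    );
    if (group) group.push(alert);
    else groups.push([alert]);
  }
  return groups;
}
```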

Escalation Path

If Triage Responder cannot complete assessment:
  1. Escalate to Incident Commander with partial triage
  2. Document confidence gaps and missing data
  3. Default to higher severity if ambiguous (better safe than sorry)
  4. Log decision record with classification uncertainty

Success Metrics

| Metric | Target | Critical Threshold |
| --- | --- | --- |
| Triage Latency | <2 minutes | >5 minutes |
| Classification Accuracy | >95% | <90% |
| False Positive Rate | <5% | >10% |
| Confidence Score | >80% | <70% |
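Note that the comparison direction flips per metric: latency and false positives should stay below target, accuracy and confidence above it. A tiny sketch of turning a table row into an automated check (the `MetricCheck` shape is illustrative):

```typescript
// Evaluate one metric row: some metrics are better high, others better low.
interface MetricCheck {
  value: number;
  target: number;
  higherIsBetter: boolean;
}

function meetsTarget(m: MetricCheck): boolean {
  return m.higherIsBetter ? m.value >= m.target : m.value <= m.target;
}
```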

Source Files

View Agent Source

Maintained in so1-agents repository under agents/incident/triage-responder.md