
Quick Reference

| Property | Value |
| --- | --- |
| Domain | Incident |
| FORGE Stage | Cross-cutting (first responder) |
| Version | 1.0.0 |
| Primary Output | Triage reports with severity classification and routing |
Use this agent when you need to:
  • Evaluate incoming alerts for validity and urgency
  • Classify incident severity using SEV0-SEV4 scale
  • Assess impact scope and affected user count
  • Route incidents to appropriate responder teams

Core Capabilities

Alert Assessment

Evaluates incoming alerts for validity, urgency, and correlation with existing issues

Severity Classification

Assigns appropriate SEV0-SEV4 level based on impact, scope, and urgency

Impact Analysis

Determines blast radius, affected services, and estimated user impact

Routing Decision

Directs incidents to appropriate responders with urgency guidance

When to Use

Ideal Use Cases

New alerts from monitoring systems (Datadog, PagerDuty, Sentry)
Manual incident reports from users or support team
Anomaly detection triggers requiring classification
Multi-symptom scenarios needing correlation and assessment
Unknown incidents requiring initial investigation guidance

Usage Examples

SEV2: Connection Pool Exhaustion

Alert: API error rate >5%, response latency spike

Triage Report:
{
  "type": "triage-report",
  "timestamp": "2024-01-15T14:07:00Z",
  "content": {
    "alert_id": "pd-alert-789xyz",
    "alert_source": "pagerduty",
    "received_at": "2024-01-15T14:05:00Z",
    "triage_completed_at": "2024-01-15T14:07:00Z",
    "classification": {
      "is_incident": true,
      "severity": "SEV2",
      "confidence": 0.85,
      "category": "availability"
    },
    "symptoms": [
      {
        "symptom": "HTTP 5xx error rate elevated",
        "source": "Datadog APM",
        "started_at": "2024-01-15T14:03:00Z",
        "current_value": "12%",
        "threshold": "< 1%"
      },
      {
        "symptom": "API response latency p99 spike",
        "source": "Datadog APM",
        "current_value": "4500ms",
        "threshold": "< 500ms"
      },
      {
        "symptom": "Database connection pool utilization",
        "source": "Railway metrics",
        "current_value": "100%",
        "threshold": "< 80%"
      }
    ],
    "impact": {
      "users_affected": "many",
      "user_count_estimate": 150,
      "services_affected": ["so1-control-plane-api"],
      "regions_affected": ["us-east-1"],
      "functionality_impact": "Workflow executions failing, API requests timing out"
    },
    "initial_hypothesis": [
      {
        "hypothesis": "Database connection pool exhaustion due to traffic spike",
        "confidence": 0.75,
        "supporting_evidence": [
          "Connection pool at 100%",
          "Errors correlate with pool exhaustion",
          "Traffic 2x normal levels"
        ],
        "investigation_steps": [
          "Check for long-running queries",
          "Review connection pool configuration",
          "Check for connection leaks"
        ]
      }
    ],
    "routing": {
      "escalate_to_incident": true,
      "suggested_commander": "oncall-backend",
      "teams_to_involve": ["backend", "platform"],
      "urgency": "immediate"
    },
    "context": {
      "recent_deployments": [
        {
          "service": "so1-control-plane-api",
          "deployed_at": "2024-01-15T10:30:00Z",
          "commit": "abc123"
        }
      ],
      "recent_changes": ["New workflow bulk execution feature deployed"]
    }
  }
}
Triage Decision:
  • Severity: SEV2 (significant degradation, many users affected)
  • Confidence: 85% (clear symptoms, known pattern)
  • Routing: Immediate escalation to backend on-call
  • Hypothesis: Connection pool exhaustion (75% confidence)
  • Investigation: Check connection pool config, long-running queries

Output Format

Triage Report Schema

interface TriageReport {
  type: "triage-report";
  version: "1.0.0";
  generated_by: "triage-responder";
  timestamp: string; // ISO8601
  
  content: {
    alert_id: string;
    alert_source: "pagerduty" | "datadog" | "sentry" | "manual" | "n8n";
    received_at: string;
    triage_completed_at: string;
    
    classification: {
      is_incident: boolean;
      severity: "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4" | "none";
      confidence: number; // 0.0 - 1.0
      category: "availability" | "performance" | "security" | "data" | "functional";
    };
    
    symptoms: Array<{
      symptom: string;
      source: string;
      started_at: string;
      current_value: string;
      threshold: string;
    }>;
    
    impact: {
      users_affected: "none" | "some" | "many" | "all";
      user_count_estimate: number | string;
      services_affected: string[];
      regions_affected: string[];
      functionality_impact: string;
    };
    
    initial_hypothesis: Array<{
      hypothesis: string;
      confidence: number;
      supporting_evidence: string[];
      investigation_steps: string[];
    }>;
    
    routing: {
      escalate_to_incident: boolean;
      suggested_commander: string;
      teams_to_involve: string[];
      urgency: "immediate" | "urgent" | "standard" | "none";
    };
    
    context: {
      recent_deployments: Array<object>;
      recent_changes: string[];
      related_alerts: string[];
      known_issues: string[];
    };
  };
}
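Downstream consumers may want to validate reports against this schema at runtime before acting on them. Below is a minimal sketch of such a guard; `isTriageReport` is an illustrative helper, not part of the agent, and it only spot-checks the classification fields rather than the full schema.

```typescript
// Minimal runtime guard for incoming triage reports (illustrative; checks
// only the classification block, not every field of the schema above).
const SEVERITIES = ["SEV0", "SEV1", "SEV2", "SEV3", "SEV4", "none"] as const;

function isTriageReport(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, any>;
  if (v.type !== "triage-report") return false;
  const c = v.content?.classification;
  return (
    typeof c === "object" &&
    c !== null &&
    typeof c.is_incident === "boolean" &&
    SEVERITIES.includes(c.severity) &&
    typeof c.confidence === "number" &&
    c.confidence >= 0 &&
    c.confidence <= 1
  );
}
```

A guard like this lets a handoff step reject malformed reports early instead of failing mid-orchestration.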

Severity Classification Matrix

SEV0 - Critical (Immediate Response)

| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Complete outage | All services down |
| User Impact | All users affected | 100% error rate |
| Data Impact | Data loss or corruption | Database failure |
| Security | Active breach | Unauthorized access detected |
| Revenue | Direct revenue loss | Payment processing down |
Response: All-hands, executive notification, customer communication

SEV1 - Major (15 min response)

| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Major feature unavailable | Auth service down |
| User Impact | >50% users affected | Critical API failing |
| Data Impact | Data integrity concerns | Sync failures |
| Security | Active vulnerability | Exploit in progress |
| Revenue | Significant revenue impact | Checkout broken |
Response: Senior engineer + management, customer updates

SEV2 - Significant (1 hour response)

| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Degraded performance | High latency |
| User Impact | 10-50% users affected | Regional outage |
| Data Impact | Potential data issues | Validation errors |
| Security | Potential vulnerability | Unpatched CVE |
| Revenue | Indirect revenue impact | Feature unavailable |
Response: On-call engineer, internal notifications

SEV3 - Minor (4 hour response)

| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Partial degradation | Slow endpoint |
| User Impact | <10% users affected | Edge case failures |
| Data Impact | No data impact | Display issues only |
| Security | Low-risk issue | Info disclosure |
| Revenue | Minimal impact | Analytics delayed |
Response: Standard queue, on-call engineer

SEV4 - Cosmetic (Best effort)

| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Cosmetic issues | UI alignment off |
| User Impact | Individual users | Single user report |
| Data Impact | None | N/A |
| Security | Informational | Security scan finding |
| Revenue | None | N/A |
Response: Backlog, any engineer
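The matrix above can be sketched as a decision tree. The function below is illustrative only: it covers two of the five indicators (availability and user impact) with thresholds copied from the tables, whereas real triage weighs all five plus confidence.

```typescript
// Illustrative decision tree over two of the matrix's indicators.
// Thresholds mirror the severity tables; other indicators are omitted.
type Severity = "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface Indicators {
  completeOutage: boolean;      // SEV0 availability criterion
  majorFeatureDown: boolean;    // SEV1 availability criterion
  usersAffectedFraction: number; // 0.0 - 1.0
}

function classifySeverity(i: Indicators): Severity {
  if (i.completeOutage || i.usersAffectedFraction >= 1.0) return "SEV0";
  if (i.majorFeatureDown || i.usersAffectedFraction > 0.5) return "SEV1";
  if (i.usersAffectedFraction >= 0.1) return "SEV2";
  if (i.usersAffectedFraction > 0) return "SEV3";
  return "SEV4";
}
```

For the SEV2 example earlier (150 users affected, service degraded but up), a fraction in the 10-50% band lands on SEV2, matching the logged decision record.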

FORGE Gate Compliance

Before invoking this agent, ensure:
  • Alert detected: Automated alert or manual report received
  • Metrics accessible: System metrics and health endpoints available
  • Health endpoints responding: Or confirmed down with error messages
Verification: Factory Orchestrator confirms alert data completeness
This agent completes successfully when:
  • Severity assigned: SEV0-4 or non-incident classification
  • Impact assessed: User and service impact estimated
  • Services identified: Affected services and regions listed
  • Routing complete: Incident Commander notified if warranted (SEV0-3)
  • Triage documented: Full triage report produced
  • Decision record logged: Classification rationale in ADR format
Verification: Gatekeeper validates triage completeness before handoff
All severity classification decisions are logged as:
date:2024-01-15T14:07:00Z|context:API error rate 12%, connection pool 100%|decision:Classify as SEV2 vs SEV1|rationale:Service degraded but not fully down, 150 users affected (<50%)|consequences:Immediate on-call escalation but not executive notification|status:accepted
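The pipe-delimited decision record above can be parsed back into fields for auditing. A minimal sketch (assumes each segment is `key:value` and splits on the first colon, so ISO timestamps survive intact):

```typescript
// Parse a pipe-delimited ADR-style decision record into key/value fields.
// Splits each segment on its FIRST colon only, preserving ISO8601 timestamps.
function parseDecisionRecord(record: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const segment of record.split("|")) {
    const idx = segment.indexOf(":");
    if (idx === -1) continue; // skip malformed segments
    fields[segment.slice(0, idx)] = segment.slice(idx + 1);
  }
  return fields;
}
```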

Integration Points

Control Plane API

Used Endpoints:
  • GET /api/v1/health - Overall service health status
  • GET /api/v1/health/dependencies - Dependency health (database, Redis, etc.)
  • GET /api/v1/workflows/stats - Workflow execution metrics
  • GET /api/v1/deployments/recent - Recent deployment history
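A triage run might poll the two health endpoints above as its first step. The sketch below injects the transport so the logic stays testable; the fetcher shape and "healthy iff HTTP 200" rule are assumptions, not documented API behavior.

```typescript
// Poll Control Plane health endpoints via an injected fetcher (illustrative).
type Fetcher = (path: string) => { status: number };

const HEALTH_PATHS = [
  "/api/v1/health",
  "/api/v1/health/dependencies",
] as const;

function pollHealth(fetcher: Fetcher): Record<string, boolean> {
  const results: Record<string, boolean> = {};
  for (const path of HEALTH_PATHS) {
    try {
      results[path] = fetcher(path).status === 200; // healthy iff 200
    } catch {
      results[path] = false; // transport failure: treat as confirmed down
    }
  }
  return results;
}
```

Treating a transport error as "confirmed down" (rather than "unknown") matches the gate requirement that health endpoints either respond or are confirmed down with error messages.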

Veritas Prompt Library

Consumes:
  • vrt-triage01: Triage decision trees and severity criteria
  • vrt-symptoms01: Symptom-to-cause mapping patterns
  • vrt-impact01: Impact assessment frameworks
Produces:
  • Novel triage patterns in veritas/agent-prompts/incident/
  • Status: draft (requires review)
| Agent | Relationship | Integration Point |
| --- | --- | --- |
| Incident Commander | Downstream | Hands off classified incidents for orchestration |
| Factory Orchestrator | Upstream | Receives alerts, routes to Triage Responder |
| Postmortem Analyst | Peer | May provide historical incident patterns for classification |

Workflow Process

1. Alert Reception

Receive and validate incoming alert
  • Validate alert is not duplicate
  • Check for stale alerts
  • Capture initial symptoms
  • Record timestamp

2. Symptom Collection

Gather additional symptoms and context
  • Collect related metrics
  • Check service health
  • Identify recent changes
  • Correlate related alerts

3. Severity Classification

Classify incident severity
  • Apply decision tree
  • Calculate confidence score
  • Assign category
  • Document rationale

4. Impact Assessment

Determine scope and blast radius
  • Estimate user impact
  • List affected services
  • Describe functionality impact
  • Identify regions affected

5. Routing Decision

Decide routing and escalation
  • Determine if incident warranted
  • Assign commander
  • Notify teams
  • Generate triage report

Error Handling

Common Issues

Low Confidence Classification (<70%)
Cause: Ambiguous symptoms, conflicting signals
Resolution: Escalate to Incident Commander with "investigation required" status, provide multiple hypotheses

Missing Context Data
Cause: Health endpoints unavailable, metrics not accessible
Resolution: Use available data, document gaps, escalate if critical service affected

Alert Storm (Multiple Simultaneous)
Cause: Cascading failure triggering many alerts
Resolution: Correlate alerts, identify root service, classify as single incident
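The alert-storm resolution above hinges on correlating bursts of alerts into one incident. A minimal sketch of time-window grouping (the 5-minute window and the `Alert` shape are illustrative; production correlation would also weigh service dependencies):

```typescript
// Group alerts that fire within a rolling time window, so a cascading
// failure is triaged as one incident. Window size is illustrative.
interface Alert {
  id: string;
  service: string;
  firedAt: number; // epoch ms
}

function correlateAlerts(alerts: Alert[], windowMs = 5 * 60_000): Alert[][] {
  const sorted = [...alerts].sort((a, b) => a.firedAt - b.firedAt);
  const groups: Alert[][] = [];
  for (const alert of sorted) {
    // Join an existing group if this alert fired soon after its latest member.
    const group = groups.find(
      (g) => alert.firedAt - g[g.length - 1].firedAt <= windowMs
    );
    if (group) group.push(alert);
    else groups.push([alert]);
  }
  return groups;
}
```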

Escalation Path

If Triage Responder cannot complete assessment:
  1. Escalate to Incident Commander with partial triage
  2. Document confidence gaps and missing data
  3. Default to higher severity if ambiguous (better safe than sorry)
  4. Log decision record with classification uncertainty

Success Metrics

| Metric | Target | Critical Threshold |
| --- | --- | --- |
| Triage Latency | <2 minutes | >5 minutes |
| Classification Accuracy | >95% | <90% |
| False Positive Rate | <5% | >10% |
| Confidence Score | >80% | <70% |
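Note that the comparison direction flips per metric: latency and false positives should stay below target, accuracy and confidence above it. A tiny sketch of turning a table row into an automated check (the `MetricCheck` shape is illustrative):

```typescript
// Evaluate one metric row: some metrics are better high, others better low.
interface MetricCheck {
  value: number;
  target: number;
  higherIsBetter: boolean;
}

function meetsTarget(m: MetricCheck): boolean {
  return m.higherIsBetter ? m.value >= m.target : m.value <= m.target;
}
```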

Source Files

View Agent Source

Maintained in so1-agents repository under agents/incident/triage-responder.md