Quick Reference
| Property | Value |
| --- | --- |
| Domain | Incident |
| FORGE Stage | Cross-cutting (first responder) |
| Version | 1.0.0 |
| Primary Output | Triage reports with severity classification and routing |
Use this agent when you need to:
Evaluate incoming alerts for validity and urgency
Classify incident severity using SEV0-SEV4 scale
Assess impact scope and affected user count
Route incidents to appropriate responder teams
Core Capabilities
Alert Assessment: Evaluates incoming alerts for validity, urgency, and correlation with existing issues
Severity Classification: Assigns the appropriate SEV0-SEV4 level based on impact, scope, and urgency
Impact Analysis: Determines blast radius, affected services, and estimated user impact
Routing Decision: Directs incidents to the appropriate responders with urgency guidance
When to Use
Ideal Use Cases
New alerts from monitoring systems (Datadog, PagerDuty, Sentry)
Manual incident reports from users or support team
Anomaly detection triggers requiring classification
Multi-symptom scenarios needing correlation and assessment
Unknown incidents requiring initial investigation guidance
Not Recommended For
Usage Examples
Database Connection Pool
Complete Outage
False Positive
Performance Degradation
SEV2: Connection Pool Exhaustion
Alert: API error rate >5%, response latency spike
Triage Report
{
  "type": "triage-report",
  "timestamp": "2024-01-15T14:07:00Z",
  "content": {
    "alert_id": "pd-alert-789xyz",
    "alert_source": "pagerduty",
    "received_at": "2024-01-15T14:05:00Z",
    "triage_completed_at": "2024-01-15T14:07:00Z",
    "classification": {
      "is_incident": true,
      "severity": "SEV2",
      "confidence": 0.85,
      "category": "availability"
    },
    "symptoms": [
      {
        "symptom": "HTTP 5xx error rate elevated",
        "source": "Datadog APM",
        "started_at": "2024-01-15T14:03:00Z",
        "current_value": "12%",
        "threshold": "< 1%"
      },
      {
        "symptom": "API response latency p99 spike",
        "source": "Datadog APM",
        "current_value": "4500ms",
        "threshold": "< 500ms"
      },
      {
        "symptom": "Database connection pool utilization",
        "source": "Railway metrics",
        "current_value": "100%",
        "threshold": "< 80%"
      }
    ],
    "impact": {
      "users_affected": "many",
      "user_count_estimate": 150,
      "services_affected": ["so1-control-plane-api"],
      "regions_affected": ["us-east-1"],
      "functionality_impact": "Workflow executions failing, API requests timing out"
    },
    "initial_hypothesis": [
      {
        "hypothesis": "Database connection pool exhaustion due to traffic spike",
        "confidence": 0.75,
        "supporting_evidence": [
          "Connection pool at 100%",
          "Errors correlate with pool exhaustion",
          "Traffic 2x normal levels"
        ],
        "investigation_steps": [
          "Check for long-running queries",
          "Review connection pool configuration",
          "Check for connection leaks"
        ]
      }
    ],
    "routing": {
      "escalate_to_incident": true,
      "suggested_commander": "oncall-backend",
      "teams_to_involve": ["backend", "platform"],
      "urgency": "immediate"
    },
    "context": {
      "recent_deployments": [
        {
          "service": "so1-control-plane-api",
          "deployed_at": "2024-01-15T10:30:00Z",
          "commit": "abc123"
        }
      ],
      "recent_changes": ["New workflow bulk execution feature deployed"]
    }
  }
}
Triage Decision:
Severity: SEV2 (significant degradation, many users affected)
Confidence: 85% (clear symptoms, known pattern)
Routing: Immediate escalation to backend on-call
Hypothesis: Connection pool exhaustion (75% confidence)
Investigation: Check connection pool config, long-running queries
SEV0: Database Cluster Failure
Alert: All services reporting database connection failures
Triage Report
{
  "type": "triage-report",
  "timestamp": "2024-01-20T09:16:00Z",
  "content": {
    "alert_id": "dd-critical-001",
    "alert_source": "datadog",
    "received_at": "2024-01-20T09:15:00Z",
    "triage_completed_at": "2024-01-20T09:16:00Z",
    "classification": {
      "is_incident": true,
      "severity": "SEV0",
      "confidence": 0.99,
      "category": "availability"
    },
    "symptoms": [
      {
        "symptom": "All services reporting database connection failures",
        "source": "Datadog APM",
        "started_at": "2024-01-20T09:15:00Z",
        "current_value": "100% error rate",
        "threshold": "< 0.1%"
      },
      {
        "symptom": "Database cluster health check failing",
        "source": "Railway monitoring",
        "started_at": "2024-01-20T09:15:00Z",
        "current_value": "0/3 nodes responding",
        "threshold": "3/3 nodes"
      }
    ],
    "impact": {
      "users_affected": "all",
      "user_count_estimate": "all active users",
      "services_affected": ["all"],
      "regions_affected": ["all"],
      "functionality_impact": "Complete service outage - all functionality unavailable"
    },
    "initial_hypothesis": [
      {
        "hypothesis": "Database cluster failure - all nodes unresponsive",
        "confidence": 0.95,
        "supporting_evidence": [
          "All nodes failing health checks",
          "No database connections successful",
          "All services affected simultaneously"
        ],
        "investigation_steps": [
          "Check Railway dashboard for cluster status",
          "Verify network connectivity",
          "Prepare failover to secondary cluster"
        ]
      }
    ],
    "routing": {
      "escalate_to_incident": true,
      "suggested_commander": "engineering-lead",
      "teams_to_involve": ["platform", "backend", "devops", "executive"],
      "urgency": "immediate"
    },
    "context": {
      "recent_deployments": [],
      "recent_changes": [],
      "related_alerts": ["database-cpu-high (30 minutes ago)"],
      "known_issues": []
    }
  }
}
Triage Decision:
Severity: SEV0 (complete outage, all users affected)
Confidence: 99% (unambiguous symptoms)
Routing: Immediate escalation to engineering leadership + executive team
Hypothesis: Database cluster failure (95% confidence)
Action: All-hands response, prepare failover
Non-Incident: Transient Spike
Alert: Error rate >2% for 30 seconds
Triage Report
{
  "type": "triage-report",
  "timestamp": "2024-01-18T11:32:00Z",
  "content": {
    "alert_id": "pd-alert-456abc",
    "alert_source": "pagerduty",
    "received_at": "2024-01-18T11:30:00Z",
    "triage_completed_at": "2024-01-18T11:32:00Z",
    "classification": {
      "is_incident": false,
      "severity": "none",
      "confidence": 0.90,
      "category": "transient"
    },
    "symptoms": [
      {
        "symptom": "Error rate spike",
        "source": "Datadog",
        "started_at": "2024-01-18T11:29:45Z",
        "current_value": "0.1% (resolved)",
        "threshold": "< 2%"
      }
    ],
    "impact": {
      "users_affected": "none",
      "user_count_estimate": 0,
      "services_affected": [],
      "functionality_impact": "No sustained impact observed"
    },
    "initial_hypothesis": [
      {
        "hypothesis": "Transient network blip or retry storm",
        "confidence": 0.85,
        "supporting_evidence": [
          "Spike lasted <30 seconds",
          "Error rate returned to baseline",
          "No related symptoms",
          "No user reports"
        ],
        "investigation_steps": [
          "Monitor for recurrence",
          "Review logs if pattern repeats"
        ]
      }
    ],
    "routing": {
      "escalate_to_incident": false,
      "suggested_commander": "none",
      "teams_to_involve": [],
      "urgency": "none"
    },
    "context": {
      "recent_deployments": [],
      "recent_changes": [],
      "related_alerts": [],
      "known_issues": []
    }
  }
}
Triage Decision:
Severity: None (not an incident)
Confidence: 90% (transient, self-resolved)
Routing: No escalation, close alert
Monitoring: Watch for recurrence
SEV3: Slow Query Pattern
Alert: API p95 latency >2s
Triage Report
{
  "type": "triage-report",
  "timestamp": "2024-01-19T14:22:00Z",
  "content": {
    "alert_id": "dd-perf-123",
    "alert_source": "datadog",
    "received_at": "2024-01-19T14:20:00Z",
    "triage_completed_at": "2024-01-19T14:22:00Z",
    "classification": {
      "is_incident": true,
      "severity": "SEV3",
      "confidence": 0.80,
      "category": "performance"
    },
    "symptoms": [
      {
        "symptom": "Workflow listing endpoint slow",
        "source": "Datadog APM",
        "started_at": "2024-01-19T14:15:00Z",
        "current_value": "2.8s p95",
        "threshold": "< 1.5s"
      },
      {
        "symptom": "Database query duration elevated",
        "source": "Database monitoring",
        "current_value": "2.5s avg",
        "threshold": "< 500ms"
      }
    ],
    "impact": {
      "users_affected": "some",
      "user_count_estimate": 25,
      "services_affected": ["workflow-api"],
      "functionality_impact": "Slow workflow loading, but functional"
    },
    "initial_hypothesis": [
      {
        "hypothesis": "N+1 query pattern or missing database index",
        "confidence": 0.70,
        "supporting_evidence": [
          "Query duration matches latency increase",
          "Specific endpoint affected",
          "Recent feature added pagination"
        ],
        "investigation_steps": [
          "Review query patterns for N+1",
          "Check database indexes",
          "Analyze slow query logs"
        ]
      }
    ],
    "routing": {
      "escalate_to_incident": true,
      "suggested_commander": "oncall-backend",
      "teams_to_involve": ["backend"],
      "urgency": "standard"
    }
  }
}
Triage Decision:
Severity: SEV3 (degraded but functional, limited users)
Confidence: 80% (clear performance pattern)
Routing: Standard escalation to backend on-call
Hypothesis: Query optimization needed (70% confidence)
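The short "Triage Decision" summaries that follow each example report can be derived from the report fields themselves. The sketch below is one possible way to do that, not the agent's actual implementation: `summarizeTriage` and `toPercent` are hypothetical helper names, and the input type covers only the fields the summary needs. The parenthetical rationale in the hand-written summaries is not reproduced, since it is not a field in the report.

```typescript
// Minimal shape needed for the summary; field names follow the example reports.
interface TriageSummaryInput {
  classification: { is_incident: boolean; severity: string; confidence: number };
  routing: { escalate_to_incident: boolean; suggested_commander: string; urgency: string };
  initial_hypothesis: Array<{ hypothesis: string; confidence: number }>;
}

// Formats a 0.0-1.0 confidence as a whole-number percentage.
function toPercent(confidence: number): string {
  return `${Math.round(confidence * 100)}%`;
}

// Hypothetical helper: renders a short decision summary for a report's content.
function summarizeTriage(content: TriageSummaryInput): string[] {
  const { classification, routing, initial_hypothesis } = content;
  const lines: string[] = [];

  lines.push(
    classification.is_incident
      ? `Severity: ${classification.severity}`
      : "Severity: None (not an incident)"
  );
  lines.push(`Confidence: ${toPercent(classification.confidence)}`);
  lines.push(
    routing.escalate_to_incident
      ? `Routing: ${routing.urgency} escalation to ${routing.suggested_commander}`
      : "Routing: No escalation"
  );

  // Report the highest-confidence hypothesis, if any were recorded.
  const ranked = initial_hypothesis.slice().sort((a, b) => b.confidence - a.confidence);
  if (ranked.length > 0) {
    const best = ranked[0];
    lines.push(`Hypothesis: ${best.hypothesis} (${toPercent(best.confidence)} confidence)`);
  }
  return lines;
}
```

Passing the `content` object of the SEV2 example yields four lines, starting with `Severity: SEV2` and `Confidence: 85%`.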
Triage Report Schema
interface TriageReport {
  type: "triage-report";
  version: "1.0.0";
  generated_by: "triage-responder";
  timestamp: string; // ISO8601
  content: {
    alert_id: string;
    alert_source: "pagerduty" | "datadog" | "sentry" | "manual" | "n8n";
    received_at: string;
    triage_completed_at: string;
    classification: {
      is_incident: boolean;
      severity: "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4" | "none";
      confidence: number; // 0.0 - 1.0
      category: "availability" | "performance" | "security" | "data" | "functional";
    };
    symptoms: Array<{
      symptom: string;
      source: string;
      started_at: string;
      current_value: string;
      threshold: string;
    }>;
    impact: {
      users_affected: "none" | "some" | "many" | "all";
      user_count_estimate: number | string;
      services_affected: string[];
      regions_affected: string[];
      functionality_impact: string;
    };
    initial_hypothesis: Array<{
      hypothesis: string;
      confidence: number;
      supporting_evidence: string[];
      investigation_steps: string[];
    }>;
    routing: {
      escalate_to_incident: boolean;
      suggested_commander: string;
      teams_to_involve: string[];
      urgency: "immediate" | "urgent" | "standard" | "none";
    };
    context: {
      recent_deployments: Array<object>;
      recent_changes: string[];
      related_alerts: string[];
      known_issues: string[];
    };
  };
}
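A TypeScript interface is erased at compile time, so a report arriving as JSON still needs a runtime check. The following is a minimal sketch for the `classification` block only. `checkClassification` is a hypothetical name, and the rule that non-incidents carry severity `"none"` is an assumption drawn from the non-incident example, not something the schema states.

```typescript
// Allowed severity values, mirroring the union in the schema above.
const SEVERITIES: string[] = ["SEV0", "SEV1", "SEV2", "SEV3", "SEV4", "none"];

// Hypothetical helper: returns a list of problems; an empty list means the
// classification block passed these basic checks.
function checkClassification(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["classification must be an object"];
  }
  const c = value as Record<string, unknown>;
  const problems: string[] = [];

  if (typeof c.is_incident !== "boolean") {
    problems.push("is_incident must be a boolean");
  }
  if (typeof c.severity !== "string" || SEVERITIES.indexOf(c.severity) === -1) {
    problems.push('severity must be SEV0-SEV4 or "none"');
  }
  if (typeof c.confidence !== "number" || !(c.confidence >= 0 && c.confidence <= 1)) {
    problems.push("confidence must be a number between 0.0 and 1.0");
  }
  // Assumption (not stated in the schema): non-incidents use severity "none".
  if (c.is_incident === false && c.severity !== "none") {
    problems.push('a non-incident should use severity "none"');
  }
  return problems;
}
```

Returning a list of problems rather than a boolean lets the caller record every gap in the triage report instead of stopping at the first one.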
Severity Classification Matrix
SEV0
| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Complete outage | All services down |
| User Impact | All users affected | 100% error rate |
| Data Impact | Data loss or corruption | Database failure |
| Security | Active breach | Unauthorized access detected |
| Revenue | Direct revenue loss | Payment processing down |
Response: All-hands, executive notification, customer communication
SEV1 - Major (15 min response)
| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Major feature unavailable | Auth service down |
| User Impact | >50% users affected | Critical API failing |
| Data Impact | Data integrity concerns | Sync failures |
| Security | Active vulnerability | Exploit in progress |
| Revenue | Significant revenue impact | Checkout broken |
Response: Senior engineer + management, customer updates
SEV2 - Significant (1 hour response)
| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Degraded performance | High latency |
| User Impact | 10-50% users affected | Regional outage |
| Data Impact | Potential data issues | Validation errors |
| Security | Potential vulnerability | Unpatched CVE |
| Revenue | Indirect revenue impact | Feature unavailable |
Response: On-call engineer, internal notifications
SEV3 - Minor (4 hour response)
| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Partial degradation | Slow endpoint |
| User Impact | <10% users affected | Edge case failures |
| Data Impact | No data impact | Display issues only |
| Security | Low-risk issue | Info disclosure |
| Revenue | Minimal impact | Analytics delayed |
Response: Standard queue, on-call engineer
SEV4 - Cosmetic (Best effort)
| Indicator | Threshold | Example |
| --- | --- | --- |
| Service Availability | Cosmetic issues | UI alignment off |
| User Impact | Individual users | Single user report |
| Data Impact | None | N/A |
| Security | Informational | Security scan finding |
| Revenue | None | N/A |
Response: Backlog, any engineer
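The User Impact thresholds in the matrix can be expressed as a small lookup, with the overall severity taken as the most severe level suggested by any single indicator. This is an illustrative sketch under stated assumptions: `severityFromUserImpact` and `mostSevere` are hypothetical names, treating exactly one affected user as SEV4 is one reading of "Individual users", and taking the worst indicator as the overall level is an assumption rather than a documented rule.

```typescript
type Severity = "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Most severe first, so a lower index means a higher severity.
const SEVERITY_ORDER: Severity[] = ["SEV0", "SEV1", "SEV2", "SEV3", "SEV4"];

// Response targets in minutes, taken from the SEV1-SEV3 headings.
// SEV4 is "best effort" and this section gives no SEV0 figure, so both are omitted.
const RESPONSE_TARGET_MINUTES: { [level: string]: number } = {
  SEV1: 15,
  SEV2: 60,
  SEV3: 240
};

// Maps the User Impact row of each table to a severity level.
function severityFromUserImpact(affectedUsers: number, totalUsers: number): Severity | "none" {
  if (totalUsers <= 0 || affectedUsers <= 0) return "none";
  if (affectedUsers >= totalUsers) return "SEV0"; // all users affected
  if (affectedUsers === 1) return "SEV4";         // assumption: single user report
  const share = affectedUsers / totalUsers;
  if (share > 0.5) return "SEV1";                 // >50% users affected
  if (share >= 0.1) return "SEV2";                // 10-50% users affected
  return "SEV3";                                  // <10% users affected
}

// Assumption: the overall severity is the worst level any indicator suggests.
function mostSevere(levels: Severity[]): Severity | "none" {
  let best = -1;
  for (const level of levels) {
    const rank = SEVERITY_ORDER.indexOf(level);
    if (best === -1 || rank < best) best = rank;
  }
  return best === -1 ? "none" : SEVERITY_ORDER[best];
}
```

For instance, with a hypothetical base of 1,000 active users, 150 affected users is a 15% share and maps to SEV2.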
FORGE Gate Compliance
Entry Gates (Pre-conditions)
Before invoking this agent, ensure:
Alert detected: Automated alert or manual report received
Metrics accessible: System metrics and health endpoints available
Health endpoints responding: Or confirmed down with error messages
Verification: Factory Orchestrator confirms alert data completeness
Exit Gates (Post-conditions)
This agent completes successfully when:
Severity assigned: SEV0-4 or non-incident classification
Impact assessed: User and service impact estimated
Services identified: Affected services and regions listed
Routing complete: Incident Commander notified if warranted (SEV0-3)
Triage documented: Full triage report produced
Decision record logged: Classification rationale in ADR format
Verification: Gatekeeper validates triage completeness before handoff
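Several of the exit gates above can be checked directly against the fields of a triage report. The sketch below shows one way such a check might look. `unmetExitGates` is a hypothetical name, the field-to-gate mapping is an assumption, and the "Triage documented" and "Decision record logged" gates are left out because they depend on systems outside the report itself.

```typescript
// Only the report fields these checks need; names follow the triage report schema.
interface ExitGateInput {
  classification: { is_incident: boolean; severity: string };
  impact: { users_affected: string; services_affected: string[]; functionality_impact: string };
  routing: { escalate_to_incident: boolean; suggested_commander: string };
}

// Hypothetical helper: returns the names of exit gates the report does not yet satisfy.
function unmetExitGates(content: ExitGateInput): string[] {
  const { classification, impact, routing } = content;
  const unmet: string[] = [];

  const knownLevels = ["SEV0", "SEV1", "SEV2", "SEV3", "SEV4", "none"];
  if (knownLevels.indexOf(classification.severity) === -1) {
    unmet.push("Severity assigned");
  }
  if (impact.users_affected === "" || impact.functionality_impact === "") {
    unmet.push("Impact assessed");
  }
  // Assumption: a real incident must name at least one affected service.
  if (classification.is_incident && impact.services_affected.length === 0) {
    unmet.push("Services identified");
  }
  // SEV0-3 warrant an Incident Commander; SEV4 and non-incidents do not.
  const needsCommander = ["SEV0", "SEV1", "SEV2", "SEV3"].indexOf(classification.severity) !== -1;
  const hasCommander = routing.suggested_commander !== "" && routing.suggested_commander !== "none";
  if (needsCommander && !(routing.escalate_to_incident && hasCommander)) {
    unmet.push("Routing complete");
  }
  return unmet;
}
```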
Integration Points
Control Plane API
Used Endpoints:
GET /api/v1/health - Overall service health status
GET /api/v1/health/dependencies - Dependency health (database, Redis, etc.)
GET /api/v1/workflows/stats - Workflow execution metrics
GET /api/v1/deployments/recent - Recent deployment history
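As a rough illustration, the four endpoints above can be kept in one table so the Symptom Collection step requests them consistently. Only the paths and descriptions come from this page. `contextRequestUrls` is a hypothetical helper, the base URL in the usage note is a placeholder, and response shapes are not documented here, so the sketch stops at building the request URLs.

```typescript
// Paths and descriptions as listed under "Used Endpoints".
const CONTEXT_ENDPOINTS = [
  { path: "/api/v1/health", purpose: "Overall service health status" },
  { path: "/api/v1/health/dependencies", purpose: "Dependency health (database, Redis, etc.)" },
  { path: "/api/v1/workflows/stats", purpose: "Workflow execution metrics" },
  { path: "/api/v1/deployments/recent", purpose: "Recent deployment history" }
];

// Hypothetical helper: builds the full GET URLs for a given Control Plane base URL.
function contextRequestUrls(baseUrl: string): string[] {
  // Drop trailing slashes so the base and path join with exactly one "/".
  const root = baseUrl.replace(/\/+$/, "");
  return CONTEXT_ENDPOINTS.map((endpoint) => root + endpoint.path);
}
```

Calling `contextRequestUrls("https://control-plane.example.com/")` returns four URLs, the first being `https://control-plane.example.com/api/v1/health`.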
Veritas Prompt Library
Consumes:
vrt-triage01: Triage decision trees and severity criteria
vrt-symptoms01: Symptom-to-cause mapping patterns
vrt-impact01: Impact assessment frameworks
Produces:
Novel triage patterns in veritas/agent-prompts/incident/
Status: draft (requires review)
| Agent | Relationship | Integration Point |
| --- | --- | --- |
| Incident Commander | Downstream | Hands off classified incidents for orchestration |
| Factory Orchestrator | Upstream | Receives alerts, routes to Triage Responder |
| Postmortem Analyst | Peer | May provide historical incident patterns for classification |
Workflow Process
Alert Reception
Receive and validate incoming alert
Validate alert is not duplicate
Check for stale alerts
Capture initial symptoms
Record timestamp
Symptom Collection
Gather additional symptoms and context
Collect related metrics
Check service health
Identify recent changes
Correlate related alerts
Severity Classification
Classify incident severity
Apply decision tree
Calculate confidence score
Assign category
Document rationale
Impact Assessment
Determine scope and blast radius
Estimate user impact
List affected services
Describe functionality impact
Identify regions affected
Routing Decision
Decide routing and escalation
Determine if incident warranted
Assign commander
Notify teams
Generate triage report
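The Alert Reception checks at the start of this process (duplicate and stale alerts) lend themselves to a short screening function. The sketch below is an assumption-laden illustration: `screenAlert` is a hypothetical name, duplicates are detected by `alert_id` alone, and the 15-minute staleness cutoff is invented for the example because this page does not specify one.

```typescript
interface IncomingAlert {
  alert_id: string;
  received_at: string; // ISO8601, as in the triage report schema
}

type ScreeningResult = "accept" | "duplicate" | "stale" | "invalid";

// Assumption: the source gives no staleness figure; 15 minutes is a placeholder.
const STALE_AFTER_MS = 15 * 60 * 1000;

// Hypothetical helper: decides whether an alert should proceed to symptom collection.
function screenAlert(alert: IncomingAlert, seenAlertIds: string[], nowMs: number): ScreeningResult {
  const receivedMs = Date.parse(alert.received_at);
  if (isNaN(receivedMs)) return "invalid"; // unparseable timestamp
  if (seenAlertIds.indexOf(alert.alert_id) !== -1) return "duplicate";
  if (nowMs - receivedMs > STALE_AFTER_MS) return "stale";
  return "accept";
}
```

Passing the current time in as `nowMs` rather than reading the clock inside the function keeps the check deterministic and easy to test.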
Error Handling
Common Issues
Low Confidence Classification (<70%)
Cause: Ambiguous symptoms, conflicting signals
Resolution: Escalate to Incident Commander with “investigation required” status, provide multiple hypotheses
Missing Context Data
Cause: Health endpoints unavailable, metrics not accessible
Resolution: Use available data, document gaps, escalate if critical service affected
Alert Storm (Multiple Simultaneous)
Cause: Cascading failure triggering many alerts
Resolution: Correlate alerts, identify root service, classify as single incident
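For the alert storm case, correlation can be approximated by grouping alerts that fire close together in time and treating each group as one incident. This is a simple heuristic sketch, not the agent's documented algorithm: `correlateAlerts` is a hypothetical name, the 5-minute window is an assumed value, and taking the earliest alert's service as the root service is a guess that a real implementation would refine with dependency data.

```typescript
interface StormAlert {
  alert_id: string;
  service: string;
  fired_at: string; // ISO8601
}

interface CorrelatedIncident {
  root_service: string;
  alert_ids: string[];
}

// Assumption: alerts within 5 minutes of the previous alert belong to the same storm.
const CORRELATION_WINDOW_MS = 5 * 60 * 1000;

// Hypothetical helper: groups alerts into incidents by time proximity.
function correlateAlerts(alerts: StormAlert[]): CorrelatedIncident[] {
  const sorted = alerts
    .slice()
    .sort((a, b) => Date.parse(a.fired_at) - Date.parse(b.fired_at));

  const incidents: CorrelatedIncident[] = [];
  let previousMs = 0;
  for (const alert of sorted) {
    const firedMs = Date.parse(alert.fired_at);
    const withinWindow = incidents.length > 0 && firedMs - previousMs <= CORRELATION_WINDOW_MS;
    if (withinWindow) {
      // Same storm: attach the alert to the most recent incident.
      incidents[incidents.length - 1].alert_ids.push(alert.alert_id);
    } else {
      // New storm: the earliest alert's service is taken as the root (heuristic).
      incidents.push({ root_service: alert.service, alert_ids: [alert.alert_id] });
    }
    previousMs = firedMs;
  }
  return incidents;
}
```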
Escalation Path
If Triage Responder cannot complete assessment:
Escalate to Incident Commander with partial triage
Document confidence gaps and missing data
Default to higher severity if ambiguous (better safe than sorry)
Log decision record with classification uncertainty
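The low-confidence rule and the "default to higher severity if ambiguous" step can be combined into one decision. The sketch below assumes the triage step produces several candidate classifications with confidence scores. `resolveSeverity` is a hypothetical name, and reusing the 70% figure from the Low Confidence Classification entry as the ambiguity cutoff is an interpretation rather than a stated rule.

```typescript
type Severity = "SEV0" | "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface SeverityCandidate {
  severity: Severity;
  confidence: number; // 0.0 - 1.0
}

interface Resolution {
  severity: Severity;
  status: "classified" | "investigation required";
}

// Most severe first, so a lower index means a higher severity.
const SEVERITY_ORDER: Severity[] = ["SEV0", "SEV1", "SEV2", "SEV3", "SEV4"];

// Assumption: below 70% confidence the classification counts as ambiguous.
const MIN_CONFIDENCE = 0.7;

// Hypothetical helper: picks the confident candidate, or the most severe one when none is confident.
function resolveSeverity(candidates: SeverityCandidate[]): Resolution {
  if (candidates.length === 0) {
    throw new Error("at least one candidate classification is required");
  }
  let mostConfident = candidates[0];
  let mostSevere = candidates[0];
  for (const candidate of candidates) {
    if (candidate.confidence > mostConfident.confidence) mostConfident = candidate;
    if (SEVERITY_ORDER.indexOf(candidate.severity) < SEVERITY_ORDER.indexOf(mostSevere.severity)) {
      mostSevere = candidate;
    }
  }
  if (mostConfident.confidence >= MIN_CONFIDENCE) {
    return { severity: mostConfident.severity, status: "classified" };
  }
  // Ambiguous: default to the higher severity and flag for investigation.
  return { severity: mostSevere.severity, status: "investigation required" };
}
```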
Success Metrics
| Metric | Target | Critical Threshold |
| --- | --- | --- |
| Triage Latency | <2 minutes | >5 minutes |
| Classification Accuracy | >95% | <90% |
| False Positive Rate | <5% | >10% |
| Confidence Score | >80% | <70% |
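Triage latency can be measured from the `received_at` and `triage_completed_at` timestamps already present in each report. The sketch below compares that interval with the target and critical thresholds listed above. `triageLatencyStatus` and the `"warning"` label for the band between the two thresholds are illustrative choices, not part of the documented metrics.

```typescript
type LatencyStatus = "on_target" | "warning" | "critical";

const TARGET_MS = 2 * 60 * 1000;   // target: under 2 minutes
const CRITICAL_MS = 5 * 60 * 1000; // critical threshold: over 5 minutes

// Hypothetical helper: classifies the time between alert receipt and triage completion.
function triageLatencyStatus(receivedAt: string, completedAt: string): LatencyStatus {
  const elapsedMs = Date.parse(completedAt) - Date.parse(receivedAt);
  if (isNaN(elapsedMs) || elapsedMs < 0) {
    throw new Error("timestamps must be valid ISO8601, with completion at or after receipt");
  }
  if (elapsedMs < TARGET_MS) return "on_target";
  if (elapsedMs > CRITICAL_MS) return "critical";
  return "warning"; // between the target and the critical threshold
}
```

A report received at `2024-01-20T09:15:00Z` and completed at `2024-01-20T09:16:00Z` took one minute, so it is classified as `on_target`.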
Source Files
Agent source: maintained in the so1-agents repository under agents/incident/triage-responder.md