TekTree Observability Plan

Version: 1.0.0
Last Updated: 2025-12-16
Status: Foundation (Pre-Implementation)

Observability Pillars

  1. Logs: What happened
  2. Metrics: How much/how often
  3. Traces: Where time was spent

Logging

Structured Logging (Zap)

Format: JSON
Required Fields:
  • timestamp (ISO 8601)
  • level (debug, info, warn, error, fatal)
  • service (service name)
  • trace_id (distributed tracing)
  • message (log message)
Example:
{
  "timestamp": "2025-12-16T14:30:00Z",
  "level": "info",
  "service": "knowledge-service",
  "trace_id": "abc123xyz",
  "user_id": "usr_123",
  "endpoint": "/api/v1/questions",
  "method": "POST",
  "status": 201,
  "duration_ms": 45,
  "message": "Question created successfully"
}
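
The required fields above can be enforced mechanically, for example in a CI check or a log-pipeline test. The following is a minimal sketch using only the Go standard library; `ValidateLogLine` is a hypothetical helper, not part of Zap, and it assumes every required field is emitted as a non-empty string.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// requiredFields lists the keys every log entry must carry, per the plan.
var requiredFields = []string{"timestamp", "level", "service", "trace_id", "message"}

// validLevels are the accepted values for the "level" field.
var validLevels = map[string]bool{"debug": true, "info": true, "warn": true, "error": true, "fatal": true}

// ValidateLogLine (hypothetical helper) reports the first problem found in a JSON log line.
func ValidateLogLine(line []byte) error {
	var entry map[string]any
	if err := json.Unmarshal(line, &entry); err != nil {
		return fmt.Errorf("not valid JSON: %w", err)
	}
	for _, key := range requiredFields {
		s, ok := entry[key].(string)
		if !ok || s == "" {
			return fmt.Errorf("missing or non-string field %q", key)
		}
	}
	// RFC 3339 is the ISO 8601 profile used by the example entry.
	if _, err := time.Parse(time.RFC3339, entry["timestamp"].(string)); err != nil {
		return fmt.Errorf("timestamp is not ISO 8601 (RFC 3339): %w", err)
	}
	if level := entry["level"].(string); !validLevels[level] {
		return fmt.Errorf("unknown level %q", level)
	}
	return nil
}

func main() {
	line := []byte(`{"timestamp":"2025-12-16T14:30:00Z","level":"info","service":"knowledge-service","trace_id":"abc123xyz","message":"Question created successfully"}`)
	fmt.Println(ValidateLogLine(line)) // prints <nil> for a well-formed entry
}
```

Optional fields such as `user_id` or `duration_ms` pass through unchecked, which matches the example entry.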

Log Levels

  • DEBUG: Detailed debug info (disabled in prod)
  • INFO: General informational messages
  • WARN: Warning messages (non-critical issues)
  • ERROR: Error messages (handled errors)
  • FATAL: Fatal errors (unrecoverable)

Log Retention

  • INFO: 7 days
  • WARN/ERROR/FATAL: 30 days
  • Archived logs: 90 days (cold storage)
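
As an illustration, the hot-storage retention rules above could be encoded as a lookup. This is a sketch only: `RetentionFor` and `Expired` are hypothetical names, and since the plan gives no retention period for DEBUG, the sketch reports that level as not covered rather than guessing a value.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const day = 24 * time.Hour

// RetentionFor (hypothetical helper) returns how long a log entry of the given
// level is kept before archival. ok is false for levels the plan does not cover.
func RetentionFor(level string) (d time.Duration, ok bool) {
	switch strings.ToLower(level) {
	case "info":
		return 7 * day, true
	case "warn", "error", "fatal":
		return 30 * day, true
	default:
		return 0, false
	}
}

// Expired reports whether an entry written at `written` has outlived its retention at `now`.
func Expired(level string, written, now time.Time) bool {
	d, ok := RetentionFor(level)
	return ok && now.Sub(written) > d
}

func main() {
	written := time.Date(2025, 12, 1, 0, 0, 0, 0, time.UTC)
	now := time.Date(2025, 12, 16, 0, 0, 0, 0, time.UTC)
	fmt.Println(Expired("info", written, now))  // true: 15 days > 7 days
	fmt.Println(Expired("error", written, now)) // false: 15 days < 30 days
}
```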

Metrics

Prometheus Metrics

RED Metrics (Rate, Errors, Duration):
# Request rate
http_requests_total{service="api-gateway",endpoint="/api/v1/questions",method="POST",status="201"}

# Error rate (the status label holds the numeric code, so 5xx responses are selected with a regex matcher)
http_requests_total{service="api-gateway",endpoint="/api/v1/questions",method="POST",status=~"5.."}

# Duration (histogram; queried through its _bucket, _sum, and _count series)
http_request_duration_seconds{service="api-gateway",endpoint="/api/v1/questions"}
USE Metrics (Utilization, Saturation, Errors):
# CPU utilization
process_cpu_seconds_total{service="user-service"}

# Memory utilization
process_resident_memory_bytes{service="user-service"}

# Disk I/O
node_disk_io_time_seconds_total{device="sda"}
Business Metrics:
# XP earned
gamification_xp_earned_total{source="question_posted"}

# Subscriptions
payment_subscriptions_total{tier="pro",status="active"}

# Content created
knowledge_content_created_total{type="question"}

Metrics Endpoints

All services expose: GET :8080/metrics
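
To make the shape of that endpoint concrete, here is a minimal sketch that serves one counter in the Prometheus text exposition format using only the Go standard library. It is a stand-in for illustration: a real service would more likely use a Prometheus client library, and `renderMetrics`, `newMux`, and the hand-rolled counter are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requestsTotal is a hand-rolled counter used only for this sketch.
var requestsTotal atomic.Int64

// renderMetrics formats one counter sample in the Prometheus text exposition format.
func renderMetrics(service string, count int64) string {
	return fmt.Sprintf("# TYPE http_requests_total counter\nhttp_requests_total{service=%q} %d\n", service, count)
}

// newMux wires the /metrics route; pass it to http.ListenAndServe(":8080", ...) in a service.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, renderMetrics("api-gateway", requestsTotal.Load()))
	})
	return mux
}

func main() {
	requestsTotal.Add(3)
	// Exercise the handler in-process instead of binding a port.
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```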

Distributed Tracing

OpenTelemetry

Trace Context Propagation:
traceparent: 00-abc123xyz...-def456...-01
tracestate: ...
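
For reference, a version-00 `traceparent` value has four dash-separated lowercase hex fields: version (2 characters), trace ID (32), parent span ID (16), and trace flags (2). The sketch below parses that layout with the standard library. `ParseTraceParent` is a hypothetical helper, not an OpenTelemetry API; it handles only version 00 and does not reject all-zero IDs, which a full implementation should. The sample value in `main` is an arbitrary illustrative one.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the fields of a version-00 traceparent header.
type TraceParent struct {
	Version  string
	TraceID  string // 32 hex characters
	ParentID string // 16 hex characters
	Sampled  bool   // lowest bit of the trace-flags byte
}

// ParseTraceParent (hypothetical helper) validates and splits a traceparent value.
func ParseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent must have 4 dash-separated fields")
	}
	if parts[0] != "00" {
		return TraceParent{}, fmt.Errorf("unsupported version %q", parts[0])
	}
	wantLen := []int{2, 32, 16, 2}
	for i, p := range parts {
		if len(p) != wantLen[i] {
			return TraceParent{}, fmt.Errorf("field %d has length %d, want %d", i, len(p), wantLen[i])
		}
		if _, err := hex.DecodeString(p); err != nil || p != strings.ToLower(p) {
			return TraceParent{}, fmt.Errorf("field %d is not lowercase hex", i)
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // validated in the loop above
	return TraceParent{
		Version:  parts[0],
		TraceID:  parts[1],
		ParentID: parts[2],
		Sampled:  flags[0]&0x01 == 1,
	}, nil
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp.Sampled, err) // true <nil>
}
```
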
Span Creation:
ctx, span := tracer.Start(ctx, "CreateQuestion")
defer span.End()

span.SetAttributes(
    attribute.String("user_id", userID),
    attribute.String("question_id", questionID),
)

Tracing Targets

  • HTTP requests (all endpoints)
  • Database queries (MongoDB, Redis)
  • Event publishing/consumption
  • External API calls (Polar)

Sampling

  • Production: 1% sampling
  • Staging: 10% sampling
  • Development: 100% sampling
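
One common way to apply these rates is a deterministic decision derived from the trace ID, so that every service reaches the same keep/drop verdict for a given trace. The following is a simplified sketch of that idea using only the standard library; `ShouldSample` and the environment names used as map keys are assumptions, and in practice the sampler provided by the tracing SDK would normally be configured instead.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sampleRates maps an (assumed) environment name to the fraction of traces kept.
var sampleRates = map[string]float64{
	"production":  0.01,
	"staging":     0.10,
	"development": 1.00,
}

// ShouldSample (hypothetical helper) keeps a trace when the low 8 bytes of its
// ID, reduced to 63 bits, fall below rate * 2^63.
func ShouldSample(traceID [16]byte, rate float64) bool {
	bound := uint64(rate * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:]) >> 1 // always below 1<<63
	return x < bound
}

func main() {
	var id [16]byte
	binary.BigEndian.PutUint64(id[8:], 1<<62) // sits at the 25% mark of the ID space
	fmt.Println(ShouldSample(id, sampleRates["production"]))  // false: 25% mark is above the 1% bound
	fmt.Println(ShouldSample(id, sampleRates["development"])) // true: a rate of 1.0 keeps everything
}
```

The result depends only on the trace ID and the rate, so no coordination between services is needed, provided trace IDs are uniformly random.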

Dashboards

1. System Health Dashboard

Panels:
  • CPU usage (all services)
  • Memory usage (all services)
  • Disk usage
  • Network I/O
  • Service uptime

2. API Performance Dashboard

Panels:
  • Request rate (RPS) by endpoint
  • Error rate (%) by endpoint
  • p50/p95/p99 latency by endpoint
  • Request duration histogram
  • Top slow endpoints (>200ms)

3. Database Dashboard

Panels:
  • MongoDB connection pool
  • Query latency (p95/p99)
  • Cache hit rate (Redis)
  • Slow queries (>100ms)
  • Document read/write rate

4. Gamification Dashboard

Panels:
  • XP earned per minute
  • Achievements unlocked per hour
  • Level-up rate
  • Leaderboard update rate
  • Streak retention rate

5. Business Metrics Dashboard

Panels:
  • DAU/MAU
  • Content creation rate (questions, insights)
  • Subscription conversions (free → paid)
  • MRR/ARR
  • Churn rate

Alerting

Critical Alerts (P0 - Page On-Call)

- alert: ServiceDown
  expr: up{job="api-gateway"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.job }} is down"

- alert: HighErrorRate
  expr: |
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error rate > 5% on {{ $labels.service }}"

- alert: HighLatency
  expr: histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "p95 latency > 500ms on {{ $labels.service }}"

Warning Alerts (P1 - Slack)

- alert: ModerateLatency
  expr: histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "p95 latency > 300ms on {{ $labels.service }}"

- alert: LowCacheHitRate
  expr: |
    rate(redis_keyspace_hits_total[5m])
      / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) < 0.7
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Cache hit rate < 70%"
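
The cache hit rate is only meaningful over a window, because the underlying hit and miss values are ever-growing counters. The arithmetic can be sketched in Go as a ratio of counter deltas between two scrapes; `snapshot` and `HitRatio` are hypothetical names, and counter resets are simply reported as "no data" here rather than compensated for.

```go
package main

import "fmt"

// snapshot is one scrape of the cumulative hit and miss counters.
type snapshot struct {
	Hits, Misses uint64
}

// HitRatio (hypothetical helper) returns hits / (hits + misses) for the interval
// between two snapshots. ok is false when there was no traffic or a counter reset.
func HitRatio(prev, cur snapshot) (ratio float64, ok bool) {
	if cur.Hits < prev.Hits || cur.Misses < prev.Misses {
		return 0, false // counter reset, e.g. after a restart
	}
	hits := cur.Hits - prev.Hits
	misses := cur.Misses - prev.Misses
	if hits+misses == 0 {
		return 0, false // no lookups in the window
	}
	return float64(hits) / float64(hits+misses), true
}

func main() {
	prev := snapshot{Hits: 1000, Misses: 400}
	cur := snapshot{Hits: 1600, Misses: 700}
	ratio, ok := HitRatio(prev, cur)
	// 600 hits out of 900 lookups is about 0.667, below the 0.7 threshold.
	fmt.Printf("%.3f %v alert=%v\n", ratio, ok, ok && ratio < 0.7)
}
```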

Alert Routing

route:
  receiver: 'slack-default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack-warnings'
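
The routing tree above resolves as "first matching child route wins, otherwise fall back to the root receiver". A small Go sketch of that decision follows; `ReceiverFor` is a hypothetical function written for illustration, not Alertmanager code, and `DiskFilling` is a made-up alert name.

```go
package main

import "fmt"

// route pairs a severity label value with the receiver that handles it.
type route struct {
	severity string
	receiver string
}

// childRoutes mirrors the two child routes in the config, in order.
var childRoutes = []route{
	{severity: "critical", receiver: "pagerduty"},
	{severity: "warning", receiver: "slack-warnings"},
}

// defaultReceiver mirrors the root route's receiver.
const defaultReceiver = "slack-default"

// ReceiverFor returns the receiver of the first child route whose severity
// matches the alert's labels, or the root receiver when none match.
func ReceiverFor(labels map[string]string) string {
	for _, r := range childRoutes {
		if labels["severity"] == r.severity {
			return r.receiver
		}
	}
	return defaultReceiver
}

func main() {
	fmt.Println(ReceiverFor(map[string]string{"alertname": "ServiceDown", "severity": "critical"})) // pagerduty
	fmt.Println(ReceiverFor(map[string]string{"alertname": "DiskFilling"}))                          // slack-default
}
```

An alert without a `severity` label falls through to `slack-default`, so each alert rule needs that label for the P0/P1 split to take effect.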

SLO/SLI Definitions

SLO (Service Level Objectives)

Service       SLO                  Measurement
API Gateway   99.9% availability   Uptime
API Gateway   p95 < 200ms          Request latency
Event Bus     99.5% delivery       Event delivery success rate
Database      99.9% availability   Query success rate
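
An availability target translates directly into an error budget: the fraction of the window that may be spent unavailable. The sketch below works this out for the 99.9% target. The 30-day window is an assumption, since the plan does not state one, and `ErrorBudget` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// ErrorBudget returns the downtime an availability SLO tolerates over a window:
// budget = (1 - slo) * window.
func ErrorBudget(slo float64, window time.Duration) time.Duration {
	return time.Duration((1 - slo) * float64(window))
}

func main() {
	window := 30 * 24 * time.Hour // assumed 30-day SLO window
	fmt.Println(ErrorBudget(0.999, window).Round(time.Second)) // 43m12s
}
```

At 99.9% over 30 days the budget is roughly 43 minutes, which gives a sense of how little room the ServiceDown runbook's 15-minute escalation step leaves.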

SLI (Service Level Indicators)

# Availability SLI
(
  sum(rate(http_requests_total{status!~"5.."}[5m])) /
  sum(rate(http_requests_total[5m]))
) * 100

# Latency SLI (p95, computed from the histogram's bucket series)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
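
The latency SLI is an estimate, not an exact percentile: it is interpolated from cumulative bucket counts. The following Go sketch shows the linear-interpolation idea on a small hand-made histogram. It is a simplification for illustration (finite buckets only, no `+Inf` handling), with hypothetical names and made-up bucket values, and is not Prometheus's implementation.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: Count observations were <= UpperBound.
type bucket struct {
	UpperBound float64 // seconds
	Count      float64 // cumulative count
}

// Quantile estimates the q-quantile by linear interpolation inside the bucket
// that contains the target rank. Buckets must be sorted by UpperBound.
func Quantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return 0
	}
	rank := q * buckets[len(buckets)-1].Count
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.Count >= rank {
			if b.Count == lowerCount {
				return b.UpperBound // empty bucket: nothing to interpolate
			}
			return lowerBound + (b.UpperBound-lowerBound)*(rank-lowerCount)/(b.Count-lowerCount)
		}
		lowerBound, lowerCount = b.UpperBound, b.Count
	}
	return buckets[len(buckets)-1].UpperBound
}

func main() {
	// Made-up data: 50 requests <= 100ms, 90 <= 200ms, 100 <= 500ms.
	buckets := []bucket{{0.1, 50}, {0.2, 90}, {0.5, 100}}
	// Rank 95 falls halfway through the (0.2, 0.5] bucket: 0.2 + 0.3*5/10.
	fmt.Printf("%.3f\n", Quantile(0.95, buckets)) // 0.350
}
```

Because the estimate can land anywhere inside a bucket, bucket boundaries near the 200ms, 300ms, and 500ms thresholds matter for how sharply the SLI and the latency alerts can resolve.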

Incident Management

On-Call Rotation

  • Primary: 24/7 coverage
  • Secondary: Backup escalation
  • Rotation: Weekly

Runbooks

Service Down:
  1. Check health endpoint: GET /health
  2. Check logs for errors
  3. Check recent deployments
  4. Restart service if necessary
  5. Escalate if no resolution in 15 min
High Error Rate:
  1. Identify failing endpoint in logs
  2. Check recent code changes
  3. Check external dependencies (Polar, email)
  4. Rollback if recent deploy
  5. Apply hotfix if known issue
Database Connection Issues:
  1. Check MongoDB connection pool
  2. Check network connectivity
  3. Check resource limits (RAM, CPU)
  4. Restart MongoDB if necessary
  5. Scale up if resource constrained
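
Step 1 of the Service Down runbook can be scripted. The sketch below performs the `GET /health` probe with a short timeout using the Go standard library. Treating any status other than 200 as unhealthy is an assumption, as are the helper names `CheckHealth` and `newStubServer`; the stub server exists only so the example runs without a real service.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// CheckHealth (hypothetical helper) performs GET <baseURL>/health and treats
// any transport error or non-200 status as unhealthy.
func CheckHealth(baseURL string) error {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(baseURL + "/health")
	if err != nil {
		return fmt.Errorf("health request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unhealthy: status %d", resp.StatusCode)
	}
	return nil
}

// newStubServer starts a local server whose /health endpoint returns the given status.
func newStubServer(status int) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	})
	return httptest.NewServer(mux)
}

func main() {
	healthy := newStubServer(http.StatusOK)
	defer healthy.Close()
	fmt.Println(CheckHealth(healthy.URL)) // <nil>

	broken := newStubServer(http.StatusServiceUnavailable)
	defer broken.Close()
	fmt.Println(CheckHealth(broken.URL)) // unhealthy: status 503
}
```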

Tools Stack

Tool           Purpose
Zap            Structured logging
Prometheus     Metrics collection
Grafana        Dashboards and visualization
Jaeger         Distributed tracing
Alertmanager   Alert routing and silencing
PagerDuty      On-call management
Railway Logs   Log aggregation

Document Status: ✅ Complete
Related Documents: NON_FUNCTIONAL_REQUIREMENTS.md, ARCHITECTURE_OVERVIEW.md