TekTree Observability Plan
Version: 1.0.0
Last Updated: 2025-12-16
Status: Foundation (Pre-Implementation)
Observability Pillars
- Logs: What happened
- Metrics: How much/how often
- Traces: Where time was spent
Logging
Structured Logging (Zap)
Format: JSON
Required Fields:
- timestamp (ISO 8601)
- level (debug, info, warn, error, fatal)
- service (service name)
- trace_id (distributed tracing)
- message (log message)
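As an illustration of the required fields, the following is a minimal sketch of one conforming log line, built with the Go standard library rather than Zap. The type and function names (`LogEntry`, `NewLogEntry`, `Validate`, `Encode`) are hypothetical. Note that Zap's default production encoder uses the keys `ts` and `msg`, so a real Zap setup would need its encoder keys renamed to match the list above.

```go
package main

import (
	"encoding/json"
	"errors"
	"time"
)

// LogEntry mirrors the five required fields; the JSON keys match the names listed above.
type LogEntry struct {
	Timestamp string `json:"timestamp"` // ISO 8601, written here in the RFC 3339 profile
	Level     string `json:"level"`     // debug, info, warn, error, or fatal
	Service   string `json:"service"`
	TraceID   string `json:"trace_id"`
	Message   string `json:"message"`
}

// NewLogEntry stamps the entry with t converted to UTC.
func NewLogEntry(t time.Time, level, service, traceID, message string) LogEntry {
	return LogEntry{
		Timestamp: t.UTC().Format(time.RFC3339),
		Level:     level,
		Service:   service,
		TraceID:   traceID,
		Message:   message,
	}
}

// Validate reports an error if the level is unknown or a required field is empty.
func (e LogEntry) Validate() error {
	switch e.Level {
	case "debug", "info", "warn", "error", "fatal":
	default:
		return errors.New("unknown level: " + e.Level)
	}
	if e.Timestamp == "" || e.Service == "" || e.TraceID == "" || e.Message == "" {
		return errors.New("missing required field")
	}
	return nil
}

// Encode validates the entry and renders it as a single JSON line.
func (e LogEntry) Encode() (string, error) {
	if err := e.Validate(); err != nil {
		return "", err
	}
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

Calling `NewLogEntry(time.Now(), "info", "api-gateway", traceID, "request completed").Encode()` would yield one JSON object per line; the service name here is only an example.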
Log Levels
- DEBUG: Detailed diagnostic output (disabled in production)
- INFO: Routine operational events
- WARN: Non-critical issues that may need attention
- ERROR: Errors that were handled
- FATAL: Unrecoverable errors
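The level ordering and the "debug disabled in production" rule can be sketched as a small threshold check. This is an assumption-laden illustration: the environment name `production` and the choice to keep debug enabled everywhere else are not stated in the plan, and `MinLevel` and `ShouldLog` are hypothetical helpers.

```go
package main

// levelRank orders the five levels from least to most severe.
var levelRank = map[string]int{"debug": 0, "info": 1, "warn": 2, "error": 3, "fatal": 4}

// MinLevel returns the lowest level emitted in env.
// Only "debug is disabled in production" comes from the plan;
// keeping debug enabled in every other environment is an assumption.
func MinLevel(env string) string {
	if env == "production" {
		return "info"
	}
	return "debug"
}

// ShouldLog reports whether a message at level is emitted in env.
// Unknown levels are dropped.
func ShouldLog(env, level string) bool {
	rank, ok := levelRank[level]
	if !ok {
		return false
	}
	return rank >= levelRank[MinLevel(env)]
}
```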
Log Retention
- INFO: 7 days
- WARN/ERROR/FATAL: 30 days
- Archived logs: 90 days (cold storage)
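The retention tiers above can be expressed as a lookup. The sketch below is illustrative only: `RetentionDays` and `Expired` are hypothetical names, and because the plan lists no retention for DEBUG, the lookup reports "no policy" for it rather than guessing.

```go
package main

// ArchiveRetentionDays is the cold-storage retention for archived logs.
const ArchiveRetentionDays = 90

// RetentionDays returns the retention period in days for a log level.
// ok is false for levels the plan does not cover (for example debug).
func RetentionDays(level string) (days int, ok bool) {
	switch level {
	case "info":
		return 7, true
	case "warn", "error", "fatal":
		return 30, true
	default:
		return 0, false
	}
}

// Expired reports whether a log line of the given level and age (in days)
// is past its retention period. Levels without a policy never report expired.
func Expired(level string, ageDays int) bool {
	days, ok := RetentionDays(level)
	return ok && ageDays > days
}
```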
Metrics
Prometheus Metrics
RED Metrics (Rate, Errors, Duration):
Metrics Endpoints
All services expose:
GET :8080/metrics
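To make the RED idea concrete, here is a standard-library-only sketch that counts requests, errors, and time spent, and renders them in the Prometheus text exposition format. It is not the Prometheus Go client, which a real service would more likely use. The metric names, the `RED` type, and the rule that status codes of 500 and above count as errors are all assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// RED accumulates request count, error count, and total duration for one service.
type RED struct {
	mu       sync.Mutex
	requests uint64
	errors   uint64
	seconds  float64
}

// Observe records one finished request. Status codes >= 500 count as errors (an assumption).
func (r *RED) Observe(status int, seconds float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.requests++
	if status >= 500 {
		r.errors++
	}
	r.seconds += seconds
}

// Render returns the counters in the Prometheus text exposition format.
func (r *RED) Render() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return fmt.Sprintf(
		"# TYPE http_requests_total counter\nhttp_requests_total %d\n"+
			"# TYPE http_request_errors_total counter\nhttp_request_errors_total %d\n"+
			"# TYPE http_request_duration_seconds_total counter\nhttp_request_duration_seconds_total %g\n",
		r.requests, r.errors, r.seconds)
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// Middleware times every request handled by next and records its outcome.
func (r *RED) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, req)
		r.Observe(rec.status, time.Since(start).Seconds())
	})
}

// ServeHTTP lets *RED be mounted directly as the /metrics handler.
func (r *RED) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
	fmt.Fprint(w, r.Render())
}
```

A service could mount this with `mux.Handle("/metrics", red)` and serve `red.Middleware(mux)` on `:8080`. Request rate and error rate are then derived at query time from the two counters, and mean duration from the duration total divided by the request count.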
Distributed Tracing
OpenTelemetry
Trace Context Propagation:
Tracing Targets
- HTTP requests (all endpoints)
- Database queries (MongoDB, Redis)
- Event publishing/consumption
- External API calls (Polar)
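For the trace context propagation mentioned above, the plan does not say which header format is used. The sketch below assumes the W3C Trace Context `traceparent` header, which is what OpenTelemetry propagates by default, and shows how the header could be parsed and re-emitted between the targets listed. `TraceParent` and `ParseTraceParent` are hypothetical names; in practice the OpenTelemetry SDK's propagators would do this work.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// TraceParent holds the fields of a version-00 W3C traceparent header.
type TraceParent struct {
	TraceID string // 32 lowercase hex characters, not all zeros
	SpanID  string // 16 lowercase hex characters, not all zeros
	Sampled bool   // lowest bit of the trace-flags byte
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// ParseTraceParent parses "00-<trace-id>-<parent-id>-<flags>".
// Only version 00 is handled; other versions are rejected.
func ParseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceParent{}, errors.New("traceparent: want 00-<trace-id>-<parent-id>-<flags>")
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if !isLowerHex(traceID, 32) || traceID == strings.Repeat("0", 32) {
		return TraceParent{}, errors.New("traceparent: invalid trace-id")
	}
	if !isLowerHex(spanID, 16) || spanID == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("traceparent: invalid parent-id")
	}
	if !isLowerHex(flags, 2) {
		return TraceParent{}, errors.New("traceparent: invalid trace-flags")
	}
	f, _ := strconv.ParseUint(flags, 16, 8) // cannot fail: flags is two hex digits
	return TraceParent{TraceID: traceID, SpanID: spanID, Sampled: f&1 == 1}, nil
}

// String renders the header value to forward on outgoing calls.
// Only the sampled bit of the flags is preserved.
func (tp TraceParent) String() string {
	flags := "00"
	if tp.Sampled {
		flags = "01"
	}
	return "00-" + tp.TraceID + "-" + tp.SpanID + "-" + flags
}
```

The parsed `TraceID` is the value that would populate the `trace_id` log field, tying logs and traces together.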
Sampling
- Production: 1% sampling
- Staging: 10% sampling
- Development: 100% sampling
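One way to apply these rates is a deterministic decision derived from the trace ID, so that every service makes the same keep-or-drop choice for a given trace. The following is a sketch of that idea, similar in spirit to the trace-ID ratio samplers shipped with OpenTelemetry SDKs but not a copy of any of them. The environment names and the functions `SampleRate` and `ShouldSample` are assumptions.

```go
package main

import "strconv"

// SampleRate returns the fraction of traces kept in env.
// The environment names are assumed spellings of the three tiers in the plan.
func SampleRate(env string) (rate float64, ok bool) {
	switch env {
	case "production":
		return 0.01, true
	case "staging":
		return 0.10, true
	case "development":
		return 1.0, true
	}
	return 0, false
}

// ShouldSample decides from the low 64 bits of a 32-character hex trace ID.
// The same trace ID and rate always produce the same answer.
func ShouldSample(traceID string, rate float64) bool {
	if rate >= 1 {
		return true
	}
	if rate <= 0 || len(traceID) != 32 {
		return false
	}
	low, err := strconv.ParseUint(traceID[16:], 16, 64)
	if err != nil {
		return false
	}
	// Compare a 63-bit value against rate * 2^63 so the bound fits in a uint64.
	bound := uint64(rate * float64(1<<63))
	return low>>1 < bound
}
```

Because the decision depends only on the trace ID, a trace kept by the first service is kept by every downstream service using the same rate. This relies on trace IDs being randomly generated.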
Dashboards
1. System Health Dashboard
Panels:
- CPU usage (all services)
- Memory usage (all services)
- Disk usage
- Network I/O
- Service uptime
2. API Performance Dashboard
Panels:
- Request rate (RPS) by endpoint
- Error rate (%) by endpoint
- p50/p95/p99 latency by endpoint
- Request duration histogram
- Top slow endpoints (>200ms)
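To clarify what the p50/p95/p99 and "slow endpoint" panels measure, here is a small sketch that computes nearest-rank percentiles from raw latency samples. It is an illustration only: Prometheus normally estimates quantiles from histogram buckets rather than raw samples, and treating "slow" as "p95 above the threshold" is an assumption. `Percentile` and `SlowEndpoints` are hypothetical names.

```go
package main

import "sort"

// Percentile returns the p-th percentile (1 to 100) of samples
// using the nearest-rank method. ok is false for empty input or invalid p.
func Percentile(samples []float64, p int) (value float64, ok bool) {
	if len(samples) == 0 || p < 1 || p > 100 {
		return 0, false
	}
	sorted := append([]float64(nil), samples...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n) in integer arithmetic
	return sorted[rank-1], true
}

// SlowEndpoints returns, in sorted order, the endpoints whose p95 latency
// exceeds thresholdMs. Latencies are in milliseconds.
func SlowEndpoints(latenciesMs map[string][]float64, thresholdMs float64) []string {
	var slow []string
	for endpoint, samples := range latenciesMs {
		if p95, ok := Percentile(samples, 95); ok && p95 > thresholdMs {
			slow = append(slow, endpoint)
		}
	}
	sort.Strings(slow)
	return slow
}
```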
3. Database Dashboard
Panels:
- MongoDB connection pool
- Query latency (p95/p99)
- Cache hit rate (Redis)
- Slow queries (>100ms)
- Document read/write rate
4. Gamification Dashboard
Panels:
- XP earned per minute
- Achievements unlocked per hour
- Level-up rate
- Leaderboard update rate
- Streak retention rate
5. Business Metrics Dashboard
Panels:
- DAU/MAU
- Content creation rate (questions, insights)
- Subscription conversions (free → paid)
- MRR/ARR
- Churn rate
Alerting
Critical Alerts (P0 - Page On-Call)
Warning Alerts (P1 - Slack)
Alert Routing
SLO/SLI Definitions
SLO (Service Level Objectives)
| Service | SLO | Measurement |
|---|---|---|
| API Gateway | 99.9% availability | Uptime |
| API Gateway | p95 < 200ms | Request latency |
| Event Bus | 99.5% delivery | Event delivery success rate |
| Database | 99.9% availability | Query success rate |
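
The availability targets in the table translate into an error budget: the amount of failure the SLO tolerates over a measurement window. The plan does not state a window, so the sketch below takes it as a parameter and the 30-day figure in the note is an assumption. The function names are hypothetical.

```go
package main

// ErrorBudgetMinutes returns how many minutes of unavailability an
// availability SLO (in percent) allows over a window of windowDays days.
func ErrorBudgetMinutes(sloPercent float64, windowDays int) float64 {
	return (100 - sloPercent) / 100 * float64(windowDays) * 24 * 60
}

// SuccessRatePercent computes a success-rate indicator as good/total*100.
// With no traffic it reports 100, since nothing has failed.
func SuccessRatePercent(good, total uint64) float64 {
	if total == 0 {
		return 100
	}
	return float64(good) / float64(total) * 100
}

// MeetsSLO reports whether the measured success rate satisfies the target.
func MeetsSLO(good, total uint64, sloPercent float64) bool {
	return SuccessRatePercent(good, total) >= sloPercent
}
```

Over an assumed 30-day window, 99.9% availability allows roughly 43.2 minutes of downtime (0.001 × 43,200 minutes), and 99.5% delivery allows roughly 216 minutes of failed delivery.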
SLI (Service Level Indicators)
Incident Management
On-Call Rotation
- Primary: 24/7 coverage
- Secondary: Backup escalation
- Rotation: Weekly
Runbooks
Service Down:
- Check health endpoint: GET /health
- Check logs for errors
- Check recent deployments
- Restart service if necessary
- Escalate if no resolution in 15 min
High Error Rate:
- Identify failing endpoint in logs
- Check recent code changes
- Check external dependencies (Polar, email)
- Rollback if recent deploy
- Apply hotfix if known issue
Database Connection Issues:
- Check MongoDB connection pool
- Check network connectivity
- Check resource limits (RAM, CPU)
- Restart MongoDB if necessary
- Scale up if resource constrained
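The first step of the service-down runbook (GET /health) and its 15-minute escalation rule can be scripted. The sketch below is one possible shape, assuming the health endpoint answers with a 2xx status when the service is healthy; the plan does not specify the response contract or the port. `CheckHealth` and `ShouldEscalate` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// EscalateAfter is the escalation deadline from the runbook.
const EscalateAfter = 15 * time.Minute

// CheckHealth issues GET <baseURL>/health and returns nil on a 2xx response.
// A timeout of 0 means no client-side timeout.
func CheckHealth(baseURL string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(strings.TrimRight(baseURL, "/") + "/health")
	if err != nil {
		return fmt.Errorf("health check failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("health check returned %s", resp.Status)
	}
	return nil
}

// ShouldEscalate reports whether an incident that has been open for the
// given duration has reached the escalation deadline.
func ShouldEscalate(open time.Duration) bool {
	return open >= EscalateAfter
}
```

An on-call script might call `CheckHealth(serviceURL, 5*time.Second)` in a loop and page the secondary once `ShouldEscalate(time.Since(incidentStart))` becomes true; the five-second timeout is an arbitrary example value.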
Tools Stack
| Tool | Purpose |
|---|---|
| Zap | Structured logging |
| Prometheus | Metrics collection |
| Grafana | Dashboards and visualization |
| Jaeger | Distributed tracing |
| Alertmanager | Alert routing and silencing |
| PagerDuty | On-call management |
| Railway Logs | Log aggregation |
Document Status: ✅ Complete
Related Documents: NON_FUNCTIONAL_REQUIREMENTS.md, ARCHITECTURE_OVERVIEW.md