FNP - Observability - Prometheus Metrics & Grafana Dashboards
Summary (Explain Like I’m 5)
Running a system is like driving a car:
- Without observability: Eyes closed, can’t see the road ✗
- With metrics: Speedometer, fuel gauge, temperature ✓
- With traces: Black box flight recorder showing everything that happened ✓
- With logs: Driver’s journal explaining decisions ✓
Technical Deep Dive
Observability Stack (Mermaid diagram)
Key Terms
- Metric → Numerical measurement (latency, throughput, errors)
- Histogram → Metric showing distribution of values (quantiles)
- Gauge → Metric that can go up or down (current value)
- Counter → Metric that only increases (total operations)
- P50/P95/P99 → Percentile latencies (50th/95th/99th)
- SLA → Service Level Agreement (99.99% uptime target)
- Trace → Complete request flow (all spans, timestamps)
- Span → Single operation within trace (e.g., “Halo2 proof generation”)
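The metric types and percentiles defined above can be illustrated with a minimal, dependency-free sketch. It is not the Prometheus client library: the class names mirror the terms in this list, the bucket bounds are illustrative assumptions, and the percentile uses the simple nearest-rank method on raw samples rather than the bucket interpolation a real histogram query would perform.

```python
import math
from bisect import bisect_left


class Counter:
    """Monotonic total: only ever increases (e.g. total operations)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Current value: may go up or down (e.g. in-flight requests)."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket; bounds are upper limits in ms.

    Counts here are per bucket (non-cumulative) to keep the sketch short;
    Prometheus itself exposes cumulative bucket counts.
    """

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value
        self.counts[bisect_left(self.bounds, value)] += 1


def percentile(samples, p):
    """Nearest-rank percentile of raw samples, with p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

With 100 latency samples of 1 ms to 100 ms, `percentile(samples, 95)` returns the 95th smallest value, which is what a P95 panel approximates.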
Q/A
Q: What does a Halo2 proof generation spike mean?
A: A spike above 2ms usually indicates rising load or resource contention. If P99 stays above 2ms, scale up server replicas or enable proof caching. Monitor CPU and memory at the same time.
Q: How does proof cache improve performance?
A: Identical public inputs produce identical proofs, so proofs can be cached. A cache hit returns the proof in <1ms, versus 1.2ms for generation. Target: >70% hit ratio. The cache is keyed by a deterministic hash of the public input.
Q: What if M²-ORE comparison time suddenly increases?
A: Likely causes: (1) a load spike causing CPU contention, (2) node throttling (cost optimization), (3) cache misses on comparisons. Check CPU % and network I/O together.
Q: How are SLA violations triggered?
A: If P95 latency exceeds 500ms for 5 minutes straight, an alert fires. The on-call engineer receives a page, checks the dashboard, and identifies the root cause. Auto-scaling rules usually resolve the issue within 2-3 minutes.
Q: Can I drill down from a Grafana dashboard to traces?
A: Yes. Click a latency spike in the dashboard, drill into the Jaeger traces, and inspect the exact timing of each span (Kyber, M²-ORE, Halo2, CRDT) to identify the bottleneck.
Example / Analogy
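As a concrete example of the proof cache answer in the Q/A, the sketch below keys a cache by a deterministic hash of the public input and tracks the hit ratio. Everything here is an assumption for illustration: `ProofCache`, `generate_proof`, the use of SHA-256 over canonical JSON, and the unbounded in-memory dictionary are hypothetical and not taken from the actual system.

```python
import hashlib
import json


class ProofCache:
    """In-memory proof cache keyed by a hash of the public input (sketch)."""

    def __init__(self, generate_proof):
        self._generate = generate_proof  # hypothetical prover callable
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(public_input):
        # Canonical JSON (sorted keys, fixed separators) so that equal
        # inputs always serialize, and therefore hash, identically.
        canonical = json.dumps(public_input, sort_keys=True,
                               separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get_proof(self, public_input):
        key = self.key_for(public_input)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        proof = self._generate(public_input)
        self._store[key] = proof
        return proof

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Exporting `hit_ratio()` as a gauge would let a dashboard compare it against the >70% target mentioned above. A production cache would also need eviction and a size bound, which this sketch omits.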
Hospital Monitoring Analogy:
- Metrics: Patient vital signs (heart rate, blood pressure, oxygen) - numbers
- Traces: Complete surgery timeline (intubated 9:05, incision 9:10, bleeding detected 9:15) - events
- Logs: Doctor’s notes explaining decisions - details
- Dashboards: Real-time display of all vitals - visualization
- Alerts: Alarm when heart rate drops - proactive
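The alerting behaviour described in the Q/A (P95 latency above 500ms for 5 minutes straight) can be sketched as a small evaluator. This is an illustration under stated assumptions, not the system's actual rule: `SustainedP95Alert` and its parameters are hypothetical, timestamps are plain seconds supplied by the caller, and P95 is computed by nearest rank over the raw samples in the current window. The reset-on-recovery behaviour is similar in spirit to the `for` clause of a Prometheus alerting rule.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


class SustainedP95Alert:
    """Fires only after P95 has exceeded the threshold continuously."""

    def __init__(self, threshold_ms=500.0, hold_seconds=300.0):
        self.threshold_ms = threshold_ms
        self.hold_seconds = hold_seconds
        self._breach_started = None  # time the current breach began

    def evaluate(self, now, window_samples_ms):
        """Return True when the alert should fire at time `now` (seconds)."""
        if not window_samples_ms or p95(window_samples_ms) <= self.threshold_ms:
            self._breach_started = None  # condition cleared: reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds
```

Requiring the breach to persist for the full hold period keeps a single slow scrape from paging the on-call engineer, at the cost of a delayed alert for real incidents.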
Cross-References: Deployment Architecture, Cost Optimization, Security Monitoring
Category: Operations | Monitoring | DevOps | Observability
Difficulty: Intermediate ⭐⭐⭐
Updated: 2025-11-28