FNP - Observability - Prometheus Metrics & Grafana Dashboards

Summary (Explain Like I’m 5)

Running a system is like driving a car:
  • Without observability: Eyes closed, can’t see the road ✗
  • With metrics: Speedometer, fuel gauge, temperature ✓
  • With traces: Black box flight recorder showing everything that happened ✓
  • With logs: Driver’s journal explaining decisions ✓
FNP observability collects all three: metrics (numbers), traces (events), and logs (details) to understand system health in real time.

Technical Deep Dive

Observability Stack:
┌─────────────────────────────────────────────────────────┐
│                    Grafana Dashboards                   │
│  (Visualize 50+ metrics, 20+ panels, alerting)         │
└──────────────┬──────────────────────────────────────────┘

┌──────────────▼──────────────────────────────────────────┐
│                  Prometheus Scraper                      │
│  (Pull metrics from /metrics endpoint every 15s)       │
└──────────────┬──────────────────────────────────────────┘

┌──────────────▼──────────────────────────────────────────┐
│              FNP Service /metrics Endpoint               │
│  (Expose metrics in Prometheus text format)             │
└─────────────────────────────────────────────────────────┘
Key Metrics Exported:
# Operation Metrics
fnp_operations_inserted_total{replica_id="1"}        # Counter
fnp_operations_deleted_total{replica_id="2"}         # Counter
fnp_operations_halo2_proof_time_ms{quantile="0.99"}  # Histogram
fnp_operations_m2ore_comparison_time_ms               # Histogram

# Proof Generation (Server-Side)
fnp_halo2_proof_generation_time_ms{type="insert", quantile="0.95"}  # ms
fnp_halo2_proof_verification_time_ms{quantile="0.99"}               # ms
fnp_halo2_proof_cache_hits_total                                    # Counter
fnp_halo2_proof_cache_misses_total                                  # Counter

# Cryptography Metrics
fnp_kyber_encapsulation_time_ms                  # Histogram
fnp_m2ore_encryption_time_ms                     # Histogram
fnp_dilithium_signature_time_ms                  # Histogram
fnp_m2ore_comparison_time_us                     # Histogram (microseconds)

# Protocol Metrics
fnp_operation_latency_ms{type="insert", quantile="0.50"}  # P50
fnp_operation_latency_ms{type="insert", quantile="0.99"}  # P99
fnp_operation_latency_ms{type="delete", quantile="0.99"}  # P99
fnp_operation_failure_rate{reason="invalid_proof"}        # Counter
fnp_operation_failure_rate{reason="invalid_signature"}    # Counter

# Document Metrics
fnp_document_character_count{document_id="doc-123"}  # Gauge
fnp_document_replica_count{document_id="doc-123"}    # Gauge
fnp_document_merge_conflicts_total                   # Counter (CRDT)

# Network Metrics
fnp_server_request_duration_ms{path="/verify/insert", quantile="0.95"}
fnp_server_request_total{path="/verify/insert", status="200"}
fnp_server_request_total{path="/verify/insert", status="400"}
fnp_server_bytes_sent_total
fnp_server_bytes_received_total

# Infrastructure Metrics (from Karpenter)
karpenter_nodes_count{capacity_type="on-demand"}
karpenter_nodes_count{capacity_type="spot"}
karpenter_node_cost_per_hour
karpenter_pod_evictions_total
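These series reach Prometheus through the service's /metrics endpoint in the Prometheus text exposition format. A minimal hand-rolled sketch of that rendering (a real service would normally use a client library such as the `prometheus` crate; the metric values here are made up):

```rust
// Render one counter sample in the Prometheus text exposition format:
//   name{label="value",...} value
// Hand-rolled for illustration only; the sample values are hypothetical.
fn render_counter(name: &str, labels: &[(&str, &str)], value: u64) -> String {
    let label_str: Vec<String> = labels
        .iter()
        .map(|(k, v)| format!("{}=\"{}\"", k, v))
        .collect();
    format!("{}{{{}}} {}", name, label_str.join(","), value)
}

fn main() {
    let line = render_counter(
        "fnp_operations_inserted_total",
        &[("replica_id", "1")],
        42,
    );
    // One line per sample; Prometheus parses this on each 15s scrape.
    assert_eq!(line, "fnp_operations_inserted_total{replica_id=\"1\"} 42");
    println!("{}", line);
}
```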
Prometheus Scrape Config:
prometheus.yml:
    scrape_configs:
        - job_name: "fnp-service"
          metrics_path: "/metrics"
          scrape_interval: 15s
          scrape_timeout: 10s

          kubernetes_sd_configs:
              - role: pod
                namespaces:
                    names:
                        - fnp-production
                        - fnp-staging

          relabel_configs:
              - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                action: keep
                regex: "true"

              - source_labels: [__meta_kubernetes_pod_name]
                action: replace
                target_label: pod_name

alert_rules:
    - alert: HighHalo2ProofTime
      expr: histogram_quantile(0.99, rate(fnp_halo2_proof_generation_time_ms_bucket[5m])) > 5
      for: 5m
      annotations:
          summary: "P99 Halo2 proof generation > 5ms (target: 1.2ms)"

    - alert: HighOperationFailureRate
      expr: rate(fnp_operation_failure_rate[5m]) > 0.01
      for: 5m
      annotations:
          summary: "Operation failure rate > 1% ({{ $value | humanizePercentage }})"

    - alert: HighNetworkLatency
      expr: histogram_quantile(0.95, rate(fnp_server_request_duration_ms_bucket[5m])) > 500
      for: 5m
      annotations:
          summary: "P95 request latency > 500ms"
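The alerts above rely on histogram_quantile, which estimates a percentile from cumulative histogram buckets: find the bucket whose cumulative count first reaches the target rank, then interpolate linearly within it. A self-contained sketch of that estimate (bucket boundaries and counts are made up):

```rust
// Estimate a quantile from cumulative histogram buckets, the way
// Prometheus's histogram_quantile() does: locate the bucket containing
// the target rank, then interpolate linearly inside it.
// `buckets`: (upper_bound_ms, cumulative_count), sorted by bound.
fn histogram_quantile(q: f64, buckets: &[(f64, u64)]) -> f64 {
    let total = buckets.last().unwrap().1 as f64;
    let rank = q * total;
    let (mut prev_bound, mut prev_count) = (0.0, 0.0);
    for &(bound, count) in buckets {
        let count = count as f64;
        if count >= rank {
            // Linear interpolation within this bucket.
            return prev_bound
                + (bound - prev_bound) * (rank - prev_count) / (count - prev_count);
        }
        prev_bound = bound;
        prev_count = count;
    }
    buckets.last().unwrap().0
}

fn main() {
    // 100 observations: 50 under 1ms, 90 under 2ms, all 100 under 5ms.
    let buckets = [(1.0, 50), (2.0, 90), (5.0, 100)];
    let p99 = histogram_quantile(0.99, &buckets);
    // Rank 99 falls in the (2, 5] bucket: 2 + 3 * (99-90)/(100-90) = 4.7
    assert!((p99 - 4.7).abs() < 1e-9);
    println!("P99 ≈ {:.2} ms", p99);
}
```

This is why the estimate is only as precise as the bucket layout: a P99 target of 1.2ms needs bucket bounds near 1.2ms.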
Grafana Dashboards (12+ Panels):
Dashboard: FNP Protocol Metrics
  layout: 4 rows × 3 columns

Row 1: Operations Overview
  Panel 1: Operations per Second (line chart)
    - fnp_operations_inserted_total (rate)
    - fnp_operations_deleted_total (rate)
    - threshold: 1000 ops/sec (green), >1000 (yellow), >2000 (red)

  Panel 2: Operation Latency (line chart, P50/P95/P99)
    - histogram_quantile(0.50, rate(fnp_operation_latency_ms_bucket[5m]))
    - histogram_quantile(0.95, rate(fnp_operation_latency_ms_bucket[5m]))
    - histogram_quantile(0.99, rate(fnp_operation_latency_ms_bucket[5m]))
    - SLA line: 500ms

  Panel 3: Operation Success Rate (gauge)
    - rate(fnp_operations_inserted_total[5m]) / (rate(fnp_operations_inserted_total[5m]) + rate(fnp_operation_failure_rate[5m]))
    - Threshold: 99.99% (green), <99% (red)

Row 2: Cryptography Performance
  Panel 4: Halo2 Proof Time Distribution (histogram)
    - fnp_halo2_proof_generation_time_ms (buckets)
    - fnp_halo2_proof_verification_time_ms (buckets)

  Panel 5: Proof Cache Hit Ratio (stat)
    - fnp_halo2_proof_cache_hits_total / (hits + misses)
    - Target: >70% hit ratio

  Panel 6: M²-ORE Comparison Speed (heatmap)
    - fnp_m2ore_comparison_time_us over time
    - Threshold: <100 microseconds
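Panel 5's hit ratio is simple arithmetic over the two cache counters from the metrics list above; a sketch of the computation and the >70% target check (counter values are made up):

```rust
// Cache hit ratio = hits / (hits + misses), as plotted in Panel 5.
fn hit_ratio(hits: u64, misses: u64) -> f64 {
    if hits + misses == 0 {
        return 0.0; // avoid division by zero before any lookups
    }
    hits as f64 / (hits + misses) as f64
}

fn main() {
    // Hypothetical values of fnp_halo2_proof_cache_{hits,misses}_total.
    let ratio = hit_ratio(850, 150);
    assert!((ratio - 0.85).abs() < 1e-9);
    assert!(ratio > 0.70, "below the 70% hit-ratio target");
    println!("cache hit ratio: {:.0}%", ratio * 100.0);
}
```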

Row 3: Resource Utilization
  Panel 7: CPU Usage by Pod (stacked area)
    - rate(container_cpu_usage_seconds_total{pod=~"fnp-.*"}[5m])
    - Threshold: 80% (warning), 95% (alert)

  Panel 8: Memory Usage by Pod (stacked area)
    - container_memory_usage_bytes{pod=~"fnp-.*"}
    - Threshold: 1Gi (warning), 1.5Gi (alert)

  Panel 9: Network I/O (stacked area)
    - fnp_server_bytes_sent_total (rate)
    - fnp_server_bytes_received_total (rate)

Row 4: Cost & Scaling
  Panel 10: Node Cost Breakdown (pie chart)
    - karpenter_nodes_count{capacity_type="on-demand"} × $2/hr
    - karpenter_nodes_count{capacity_type="spot"} × $0.40/hr

  Panel 11: Pod Scaling (line chart)
    - karpenter_nodes_count (total)
    - karpenter_pod_count (total)
    - CPU threshold markers

  Panel 12: Failure Rate by Reason (bar chart)
    - fnp_operation_failure_rate{reason="invalid_proof"}
    - fnp_operation_failure_rate{reason="invalid_signature"}
    - fnp_operation_failure_rate{reason="replay_attack"}
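Panel 10's cost breakdown is the Karpenter node counts weighted by the per-hour prices shown in the panel definition; a sketch of the arithmetic (node counts are made up, prices as quoted above):

```rust
// Hourly fleet cost = on_demand_nodes * $2.00 + spot_nodes * $0.40,
// matching Panel 10's weighting of karpenter_nodes_count by capacity_type.
fn hourly_cost(on_demand: u32, spot: u32) -> f64 {
    on_demand as f64 * 2.00 + spot as f64 * 0.40
}

fn main() {
    // Hypothetical node counts: 3 on-demand, 10 spot.
    let cost = hourly_cost(3, 10);
    assert!((cost - 10.0).abs() < 1e-9); // 3 * $2.00 + 10 * $0.40 = $10.00
    println!("fleet cost: ${:.2}/hr", cost);
}
```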
Distributed Tracing (Jaeger + OpenTelemetry):
// In FNP server code (spans via the `tracing` crate, exported to Jaeger
// through an OpenTelemetry layer configured elsewhere)
use tracing::{info_span, instrument};

#[instrument(skip(op))]
async fn insert_operation(op: Operation) -> Result<Proof> {
    // Span automatically created: "insert_operation"

    let _kyber_span = info_span!("kyber_encapsulation").entered();
    let (_ss, ct) = kyber_encaps()?;  // shared secret unused in this path
    drop(_kyber_span);

    let _m2ore_span = info_span!("m2ore_encryption").entered();
    let enc_id = m2ore_encrypt(&op.position)?;
    drop(_m2ore_span);

    let _halo2_span = info_span!("halo2_prove",
        proof_type = "insert",
        num_constraints = 51350
    ).entered();
    let proof = halo2_prove(witness)?;
    drop(_halo2_span);

    let _merge_span = info_span!("crdt_merge").entered();
    merge_document(enc_id, ct, proof)?;
    // Span automatically ends when _merge_span is dropped at end of scope

    Ok(proof)
}

Mermaid Diagrams

Key Terms

  • Metric → Numerical measurement (latency, throughput, errors)
  • Histogram → Metric showing distribution of values (quantiles)
  • Gauge → Metric that can go up or down (current value)
  • Counter → Metric that only increases (total operations)
  • P50/P95/P99 → Percentile latencies (50th/95th/99th)
  • SLA → Service Level Agreement (99.99% uptime target)
  • Trace → Complete request flow (all spans, timestamps)
  • Span → Single operation within trace (e.g., “Halo2 proof generation”)

Q/A

Q: What does a Halo2 proof generation spike mean?
A: A spike above 2ms indicates increasing load or resource contention. If P99 is consistently >2ms, scale up server replicas or enable proof caching. Monitor CPU and memory simultaneously.

Q: How does the proof cache improve performance?
A: Identical public inputs produce identical proofs. A cache hit returns a proof in <1ms versus 1.2ms for generation. Target: >70% hit ratio. The cache is keyed by a deterministic hash of the public input.

Q: What if M²-ORE comparison time suddenly increases?
A: Likely causes: (1) a load spike causing CPU contention, (2) node throttling (cost optimization), (3) cache misses on comparisons. Check CPU % and network I/O simultaneously.

Q: How are SLA violations triggered?
A: If P95 latency exceeds 500ms for 5 minutes straight, the alert fires. The on-call engineer receives a page, checks the dashboard, and identifies the root cause. Auto-scaling rules usually resolve the issue within 2-3 minutes.

Q: Can I drill down from a Grafana dashboard to traces?
A: Yes. Click a latency spike in the dashboard, drill into the Jaeger traces, and see the exact timing of each span (Kyber, M²-ORE, Halo2, CRDT) to identify the bottleneck.
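The Q/A above notes that the proof cache is keyed by a deterministic hash of the public input. A minimal sketch using the standard library's hasher (the `PublicInput` struct and its fields are hypothetical, not FNP's real types):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical public input for an insert proof; FNP's real struct differs.
#[derive(Hash)]
struct PublicInput {
    document_id: String,
    position: u64,
    op_type: u8,
}

// Deterministic cache key: equal public inputs always hash to the same key,
// so a repeated operation can be answered from the proof cache (<1ms)
// instead of regenerating the proof (~1.2ms).
fn cache_key(input: &PublicInput) -> u64 {
    // DefaultHasher::new() uses fixed keys, so the key is stable in-process.
    let mut hasher = DefaultHasher::new();
    input.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let a = PublicInput { document_id: "doc-123".into(), position: 7, op_type: 0 };
    let b = PublicInput { document_id: "doc-123".into(), position: 7, op_type: 0 };
    let c = PublicInput { document_id: "doc-123".into(), position: 8, op_type: 0 };
    assert_eq!(cache_key(&a), cache_key(&b)); // identical inputs -> cache hit
    assert_ne!(cache_key(&a), cache_key(&c)); // different position -> miss
}
```

A production cache would use a collision-resistant hash (e.g. a cryptographic digest) rather than SipHash, but the keying idea is the same.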

Example / Analogy

Hospital Monitoring Analogy:
  • Metrics: Patient vital signs (heart rate, blood pressure, oxygen) - numbers
  • Traces: Complete surgery timeline (intubated 9:05, incision 9:10, bleeding detected 9:15) - events
  • Logs: Doctor’s notes explaining decisions - details
  • Dashboards: Real-time display of all vitals - visualization
  • Alerts: Alarm when heart rate drops - proactive
FNP observability works identically: continuously monitor health, trace requests, log details, visualize on dashboards, alert on anomalies.
Cross-References: Deployment Architecture, Cost Optimization, Security Monitoring
Category: Operations | Monitoring | DevOps | Observability
Difficulty: Intermediate ⭐⭐⭐
Updated: 2025-11-28