Summary
Traceo’s BLOCK 12 deliverable proved that a production-grade observability stack (structured logging, metrics, error tracking, dashboards) can be implemented without modifying a single line of application business logic. The entire system was deployed via middleware — FastAPI’s native abstraction layer — in six tasksets across 64 files.What this means for Traceo operationally
Request tracing is now end-to-end
Every HTTP request gets a unique correlation ID (UUID) that propagates through:- All structured JSON logs (indexed by correlation_id field)
- Prometheus metrics (tagged with method, path, status)
- Sentry error events (as a tag for grouping)
- Response headers (X-Correlation-ID) for client-side debugging
grep-ing through logs; search by correlation_id in Grafana or Sentry instead.
Metrics cardinality is controlled and safe
Rather than collecting unlimited metrics (user IDs, request bodies, etc.), the system uses high-cardinality labels with fixed sets: method (~10 values), path (~50 values), status (~15 values). This yields ~7,500 theoretical maximum metrics, with actual production usage at 1-2k. Why this matters: Prometheus cardinality explosions have crashed thousands of systems. By designing labels conservatively from day one, Traceo is immune to this entire class of incident.Observability setup is now a reusable template
The implementation follows a consistent pattern across both services (MCP Server + Engine):- Logging configuration module (structlog with correlation ID processor)
- Three middleware classes: logging → sentry context → metrics
- Environment-aware Sentry sampling (dev: 100%, prod: 0.1%)
- Prometheus + Grafana docker-compose stack for local development
Business impact
- Support team gains real-time visibility: Instead of asking users for vague timestamps, support can pull Grafana dashboards by correlation_id. Error rates, latency, and affected users become data-driven questions.
- On-call incident response improves: SREs now have 13 Grafana panels showing request rate, error rate, latency p95, endpoint heat maps, and error type distributions. Mean time to detect drops from hours to minutes.
- Development feedback loops accelerate: Developers can see how their changes affect production metrics in real-time. A/B tests, performance experiments, and rollback decisions are now data-backed.
- Compliance & audit trails strengthen: Every database write, external API call, and error is tagged with correlation_id for traceability. Regulatory audits no longer require manual log analysis.
What changed architecturally
Middleware-first approach replaced library instrumentation
Old pattern (problematic):Context variables enable async-safe propagation
Python’scontextvars.ContextVar is the only primitive that works with asyncio. Unlike thread-local storage (which breaks under concurrent requests), context variables are copied per task, enabling correlation IDs to flow through:
- Async database queries
- Concurrent external API calls
- Background task queues
- Nested asyncio.create_task() calls
Lessons learned
- Middleware is the observability entry point: Don’t instrument individual functions. Instrument the HTTP layer once, then read context variables everywhere else.
- Cardinality discipline is non-negotiable: The difference between a safe observability system and a crashed one is often a single decision: “Don’t add unbounded labels.” This project made that decision explicitly in the design phase.
- Environment-aware sampling prevents alert fatigue: dev (100%), staging (5%), prod (0.1%). Developers see all errors locally; production only samples errors to prevent spam. This can be toggled per-environment with environment variables.
- Graceful degradation builds resilience: If SENTRY_DSN is not set, the system doesn’t break — it just doesn’t send to Sentry. Logging and metrics continue working. This enables local development without external credentials.
-
Docker Compose local development saves weeks of troubleshooting: Engineers can run
docker compose upand have Prometheus + Grafana running locally, allowing them to test queries, alerts, and dashboards before production. By the time code ships, the monitoring strategy is proven.
Related insights
- Correlation IDs extend beyond HTTP: Databases, message queues, external APIs, and background tasks can all propagate correlation_id via headers or parameters. The foundation is now in place.
- Alerting rules layer on top: Prometheus already has the metrics; alerting thresholds (e.g., error rate > 5% for 5 min) are rule configurations, not code.
- Custom business metrics follow the same pattern: Revenue metrics, conversion funnels, or domain-specific events can be recorded using the same Prometheus registry. The framework is extensible.
- This is industry best practice: AWS X-Ray, Google Cloud Trace, Datadog, and Honeycomb all use the same architecture (correlation IDs + structured logging + metrics + sampling). Traceo is now aligned with standard practices.