
What happened

Two pages on rover.so1.io (MCP Registry and Agents) were returning 404 errors — completely broken for any user visiting them. This was a production outage affecting platform visibility into MCP tooling.

Root cause

The backend API (BFF) was being deployed from an old, standalone repository instead of the current monorepo. New features landed in the monorepo, but the deployed version never received them: the pages worked locally and failed in production because Railway (our hosting provider) was pointed at stale code.

Resolution: switched Railway to deploy from the monorepo (so1-io/so1-console), targeting the correct subdirectory. Both pages load correctly after redeploy. Also added an automated check that logs warnings on startup if expected API routes are missing, giving early detection of future deployment misconfigurations.

Operational takeaway

This class of bug — “works locally, broken in production” — is caused by deployment source drift. When infrastructure points at a different repo than where development happens, changes silently diverge. The fix is configuration, not code. Added a startup self-check so this category of failure announces itself in logs rather than silently serving 404s.
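The startup self-check can be sketched roughly as follows. This is a minimal illustration, not the actual so1-console code: the route paths, function names, and the assumption that the server can enumerate its registered routes as strings are all placeholders.

```typescript
// Routes the deployed build is expected to serve (illustrative names).
const EXPECTED_ROUTES = ["/api/mcp-registry", "/api/agents"];

// Compare the routes the server actually registered against the
// routes the frontend depends on; return the gaps.
export function findMissingRoutes(
  registered: Iterable<string>,
  expected: string[] = EXPECTED_ROUTES,
): string[] {
  const present = new Set(registered);
  return expected.filter((route) => !present.has(route));
}

// Called once at boot: deployment drift shows up in logs
// instead of as silent 404s in production.
export function validateRoutesOnStartup(registered: Iterable<string>): void {
  for (const route of findMissingRoutes(registered)) {
    console.warn(`[startup-check] expected route not registered: ${route}`);
  }
}
```

The check deliberately warns rather than crashes: a stale deploy that serves most of the product is still better than no deploy, but the log line makes the drift visible immediately.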

Platform backlog: structured into 10 blocks

Audited all 34 open GitHub issues against the actual codebase. Closed 4 that were already resolved or stale. Organised the remaining 30 into 10 prioritised work blocks:
| Block | Focus | Business impact |
| --- | --- | --- |
| T1 | Shared types sync + CI | Prevents type drift that breaks both frontend and backend simultaneously |
| T2 | Build verification in CI | Catches broken deployments before they ship |
| T3 | CI efficiency | Reduces feedback time and compute costs |
| T4 | Developer guardrails | Pre-commit hooks, dependency automation, secrets scanning |
| T5 | API test coverage + error quality | Better error messages for users, confidence in integrations |
| T6 | Landing page CI + cleanup | Validates the marketing site doesn't break silently |
| T7 | Landing polish + favicon | Animation consistency, missing browser tab icon |
| T8 | Extended CI (MCP servers, standalone) | Safety net for secondary services |
| T9 | Cross-browser testing + releases | Safari/Firefox coverage, automated changelogs |
| T10 | Deployment pipeline | Staging gates, rollback capability, audit trail |
T1-T3 are the critical path — they address the kind of architectural debt that caused today’s outage. T4-T5 build confidence for feature velocity. T6-T10 are maturity investments.

Key discovery: shared types problem

The platform has three separate copies of its type definitions that drift independently. This is the single biggest technical risk — a type mismatch between frontend and backend can break the product for users without any test catching it. This is prioritised as T1.
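One cheap guard while the copies still exist is a CI step that fails when the duplicated type files diverge. A minimal sketch, assuming the copies are plain files that should be byte-identical; the three paths below are illustrative, not the real repo layout.

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Hypothetical locations of the duplicated type definitions.
const TYPE_COPIES = [
  "apps/console/src/shared-types.ts",
  "apps/bff/src/shared-types.ts",
  "packages/shared/src/types.ts",
];

// True when every file has identical contents. The reader is
// injectable so the check is testable without a real filesystem.
export function contentsMatch(
  files: string[],
  read: (path: string) => string = (p) => readFileSync(p, "utf8"),
): boolean {
  const digests = files.map((f) =>
    createHash("sha256").update(read(f)).digest("hex"),
  );
  return new Set(digests).size <= 1;
}

// In CI you would run something like:
//   if (!contentsMatch(TYPE_COPIES)) process.exit(1);
// so a drifted copy blocks the merge instead of shipping.
```

This only detects drift; the durable fix is a single source of truth (one shared package) so there is nothing left to diverge.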

Action items

  1. Push pending commit (df2e6781 — startup route validation) to main
  2. Begin T1: resolve shared types divergence
  3. Consider adding the /archive protocol to CLAUDE.md to reduce overhead on future session archiving
See also: so1-content/findings/2026-03-15-mcp-404-root-cause-issue-triage