
What happened

Two pages on rover.so1.io (MCP Registry and Agents) were returning 404 errors — completely broken for any user visiting them. This was a production outage affecting platform visibility into MCP tooling.

Root cause

The backend API (BFF) was being deployed from an old, standalone repository instead of the current monorepo. New features landed in the monorepo, but the deployed version never received them: the pages worked locally and failed in production because Railway (our hosting provider) was pointed at stale code.

Resolution: switched Railway to deploy from the monorepo (so1-io/so1-console), targeting the correct subdirectory. Both pages load correctly after redeploy. Also added an automated check that logs warnings on startup if expected API routes are missing, giving early detection of future deployment misconfigurations.

Operational takeaway

This class of bug — “works locally, broken in production” — is caused by deployment source drift. When infrastructure points at a different repo than where development happens, changes silently diverge. The fix is configuration, not code. Added a startup self-check so this category of failure announces itself in logs rather than silently serving 404s.
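The startup self-check can be sketched roughly as follows. This is a minimal illustration, not the actual so1-console code: the route paths, function names, and the assumption that the server can enumerate its registered routes as strings are all placeholders.

```typescript
// Routes the deployed build is expected to serve (illustrative names).
const EXPECTED_ROUTES = ["/api/mcp-registry", "/api/agents"];

// Compare the routes the server actually registered against the
// routes the frontend depends on; return the gaps.
export function findMissingRoutes(
  registered: Iterable<string>,
  expected: string[] = EXPECTED_ROUTES,
): string[] {
  const present = new Set(registered);
  return expected.filter((route) => !present.has(route));
}

// Called once at boot: deployment drift shows up in logs
// instead of as silent 404s in production.
export function validateRoutesOnStartup(registered: Iterable<string>): void {
  for (const route of findMissingRoutes(registered)) {
    console.warn(`[startup-check] expected route not registered: ${route}`);
  }
}
```

The check deliberately warns rather than crashes: a stale deploy that serves most of the product is still better than no deploy, but the log line makes the drift visible immediately.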

Platform backlog: structured into 10 blocks

Audited all 34 open GitHub issues against the actual codebase. Closed 4 that were already resolved or stale. Organised the remaining 30 into 10 prioritised work blocks:
| Block | Focus | Business impact |
| --- | --- | --- |
| T1 | Shared types sync + CI | Prevents type drift that breaks both frontend and backend simultaneously |
| T2 | Build verification in CI | Catches broken deployments before they ship |
| T3 | CI efficiency | Reduces feedback time and compute costs |
| T4 | Developer guardrails | Pre-commit hooks, dependency automation, secrets scanning |
| T5 | API test coverage + error quality | Better error messages for users, confidence in integrations |
| T6 | Landing page CI + cleanup | Validates the marketing site doesn't break silently |
| T7 | Landing polish + favicon | Animation consistency, missing browser tab icon |
| T8 | Extended CI (MCP servers, standalone) | Safety net for secondary services |
| T9 | Cross-browser testing + releases | Safari/Firefox coverage, automated changelogs |
| T10 | Deployment pipeline | Staging gates, rollback capability, audit trail |
T1-T3 are the critical path — they address the kind of architectural debt that caused today’s outage. T4-T5 build confidence for feature velocity. T6-T10 are maturity investments.

Key discovery: shared types problem

The platform has three separate copies of its type definitions that drift independently. This is the single biggest technical risk — a type mismatch between frontend and backend can break the product for users without any test catching it. This is prioritised as T1.
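One cheap guard while the copies still exist is a CI step that fails when the duplicated type files diverge. A minimal sketch, assuming the copies are plain files that should be byte-identical; the three paths below are illustrative, not the real repo layout.

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Hypothetical locations of the duplicated type definitions.
const TYPE_COPIES = [
  "apps/console/src/shared-types.ts",
  "apps/bff/src/shared-types.ts",
  "packages/shared/src/types.ts",
];

// True when every file has identical contents. The reader is
// injectable so the check is testable without a real filesystem.
export function contentsMatch(
  files: string[],
  read: (path: string) => string = (p) => readFileSync(p, "utf8"),
): boolean {
  const digests = files.map((f) =>
    createHash("sha256").update(read(f)).digest("hex"),
  );
  return new Set(digests).size <= 1;
}

// In CI you would run something like:
//   if (!contentsMatch(TYPE_COPIES)) process.exit(1);
// so a drifted copy blocks the merge instead of shipping.
```

This only detects drift; the durable fix is a single source of truth (one shared package) so there is nothing left to diverge.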

Action items

  1. Push pending commit (df2e6781 — startup route validation) to main
  2. Begin T1: resolve shared types divergence
  3. Consider adding the /archive protocol to CLAUDE.md to reduce overhead on future session archiving
See also: so1-content/findings/2026-03-15-mcp-404-root-cause-issue-triage