What happened
Two pages on rover.so1.io (MCP Registry and Agents) were returning 404 errors, leaving them completely broken for every user who visited them. This was a production outage that cut off platform visibility into MCP tooling.
Root cause
The backend API (BFF) was being deployed from an old, standalone repository instead of the current monorepo, so features added to the monorepo never reached the deployed version. The pages worked locally but failed in production because Railway (our hosting provider) was pointed at stale code.
Resolution: Switched Railway to deploy from the monorepo (so1-io/so1-console), targeting the correct subdirectory. Both pages load correctly after the redeploy. Also added an automated check that logs warnings on startup if expected API routes are missing, for early detection of future deployment misconfigurations.
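The startup route check can be sketched as follows. This is a minimal TypeScript illustration under stated assumptions, not the actual implementation: the report does not name the BFF's framework or its real route paths, so `EXPECTED_ROUTES` is a hypothetical placeholder and the registered routes are passed in as a plain list rather than read from a router.

```typescript
// Minimal sketch of a startup route self-check.
// ASSUMPTION: these paths are hypothetical placeholders; the report does not
// list the BFF's real routes for the MCP Registry and Agents pages.
const EXPECTED_ROUTES: readonly string[] = ["/api/mcp-registry", "/api/agents"];

// Returns every expected route that is absent from the registered set.
function findMissingRoutes(
  expected: readonly string[],
  registered: Iterable<string>,
): string[] {
  const present = new Set(registered);
  return expected.filter((route) => !present.has(route));
}

// Logs one warning per missing route instead of failing startup,
// so a misconfigured deploy still boots but announces itself in the logs.
function checkRoutesOnStartup(
  registered: Iterable<string>,
  warn: (message: string) => void = console.warn,
): string[] {
  const missing = findMissingRoutes(EXPECTED_ROUTES, registered);
  for (const route of missing) {
    warn(`[startup-check] expected API route is not registered: ${route}`);
  }
  return missing;
}

// Example: a stale deploy that only registered /api/agents and /health.
const missing = checkRoutesOnStartup(["/api/agents", "/health"]);
```

Warning rather than throwing matches the behaviour described above: a stale deploy keeps serving whatever routes it does have, but the gap shows up in the startup logs instead of surfacing only as user-facing 404s.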
Operational takeaway
This class of bug (“works locally, broken in production”) was caused here by deployment source drift: when infrastructure points at a different repository from the one where development happens, the deployed code silently diverges from the code being written. The fix was configuration, not code. The startup self-check makes this category of failure announce itself in the logs instead of silently serving 404s.
Platform backlog: structured into 10 blocks
Audited all 34 open GitHub issues against the actual codebase. Closed 4 that were already resolved or stale. Organised the remaining 30 into 10 prioritised work blocks:

| Block | Focus | Business impact |
|---|---|---|
| T1 | Shared types sync + CI | Prevents type drift that breaks both frontend and backend simultaneously |
| T2 | Build verification in CI | Catches broken deployments before they ship |
| T3 | CI efficiency | Reduces feedback time and compute costs |
| T4 | Developer guardrails | Pre-commit hooks, dependency automation, secrets scanning |
| T5 | API test coverage + error quality | Better error messages for users, confidence in integrations |
| T6 | Landing page CI + cleanup | Validates the marketing site doesn’t break silently |
| T7 | Landing polish + favicon | Animation consistency, missing browser tab icon |
| T8 | Extended CI (MCP servers, standalone) | Safety net for secondary services |
| T9 | Cross-browser testing + releases | Safari/Firefox coverage, automated changelogs |
| T10 | Deployment pipeline | Staging gates, rollback capability, audit trail |
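T1 pairs shared types sync with CI. While separate copies of the type definitions still exist, one hedged way to give CI a tripwire is a plain textual comparison. The TypeScript sketch below (the language and all file paths are assumptions; in real CI the contents would be read from disk) normalises each copy and groups identical ones, so more than one group signals drift.

```typescript
// Minimal sketch of a CI drift check across copies of shared type definitions.
// ASSUMPTION: the file paths below are hypothetical; the report only says
// there are three copies. In CI the contents would be read from disk.
type TypeCopies = Record<string, string>;

// Ignores line-ending and trailing-whitespace differences so only real edits count.
function normalize(source: string): string {
  return source
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.trimEnd())
    .join("\n")
    .trim();
}

// Groups copy locations by normalized content; more than one group means drift.
function groupByContent(copies: TypeCopies): string[][] {
  const groups = new Map<string, string[]>();
  for (const [location, source] of Object.entries(copies)) {
    const key = normalize(source);
    const bucket = groups.get(key);
    if (bucket) {
      bucket.push(location);
    } else {
      groups.set(key, [location]);
    }
  }
  return Array.from(groups.values());
}

// Hypothetical example: two copies agree, the third has gained a field.
const copies: TypeCopies = {
  "frontend/src/types.ts": "export interface Agent {\n  id: string;\n}\n",
  "bff/src/types.ts": "export interface Agent {\r\n  id: string;   \r\n}",
  "packages/shared/types.ts": "export interface Agent {\n  id: string;\n  name: string;\n}\n",
};

const groups = groupByContent(copies);
if (groups.length > 1) {
  // In CI, exiting with a non-zero code at this point would fail the job.
  console.error("Shared type definitions have drifted:", groups);
}
```

A check like this only detects divergence. It cannot tell which copy is correct, and it treats harmless reorderings as drift, so it complements resolving the duplication rather than replacing it.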
Key discovery: shared types problem
The platform has three separate copies of its type definitions, and they drift independently. This is the single biggest technical risk: a type mismatch between frontend and backend can break the product for users without any test catching it. That is why it is prioritised as T1.
Action items
- Push pending commit `df2e6781` (startup route validation) to main
- Begin T1: resolve shared types divergence
- Consider adding the `/archive` protocol to CLAUDE.md to reduce overhead when archiving future sessions