TASKSET 6 (RELAY INTEGRATION) - STRATEGIC BUILD PLAN
Component Overview
RELAY (Real-time Event Layer Architecture ⚡ for You) is the Unified Orchestration & Message Routing system - responsible for coordinating all subsystem operations, managing real-time WebSocket connections, and providing the central hub for inter-subsystem communication.
Previous Layer: SPAWN (Hydration & Metadata Extraction) ✅ Complete
Current Layer: RELAY (Orchestration & Event Routing) - Starting Now
Next Layer: Complete Integration & Deployment Testing
Architecture Overview
4-Stage Build Strategy
STAGE 1: Core Event Bus & Subsystem Registry (30% Effort)
Deliverables: 350 lines, 10+ tests
- Event bus infrastructure (channels, routing)
- Subsystem registry (registration, discovery)
- Health check coordination
- Lifecycle management (init, ready, shutdown)
Key functions:
- RegisterSubsystem() - Register a new subsystem
- PublishEvent() - Broadcast event to all subscribers
- SubscribeToEvent() - Listen for specific events
- HealthCheckAll() - Check all subsystems
- GracefulShutdown() - Coordinated shutdown
Key types:
- RelayEvent - Event envelope
- SubsystemInfo - Registry entry
- HealthStatus - Health information
- EventSubscription - Subscription handle
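The Stage 1 event bus functions and types listed above could fit together roughly as follows. This is a minimal sketch, not the planned implementation: the `RelayEvent` fields, the `buffer` parameter, the `Unsubscribe` method and the drop-on-full-buffer policy are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// RelayEvent is the event envelope. The field set is an assumption.
type RelayEvent struct {
	Type      string // topic, e.g. "quality.scored"
	Source    string // publishing subsystem
	Payload   any
	Timestamp time.Time
}

// EventSubscription is the handle a subscriber holds.
type EventSubscription struct {
	ID        int
	EventType string
	C         <-chan RelayEvent
}

// EventBus fans events out to one buffered channel per subscriber.
type EventBus struct {
	mu     sync.RWMutex
	nextID int
	subs   map[string]map[int]chan RelayEvent
}

func NewEventBus() *EventBus {
	return &EventBus{subs: make(map[string]map[int]chan RelayEvent)}
}

// SubscribeToEvent registers a listener for a single event type.
func (b *EventBus) SubscribeToEvent(eventType string, buffer int) EventSubscription {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.nextID++
	ch := make(chan RelayEvent, buffer)
	if b.subs[eventType] == nil {
		b.subs[eventType] = make(map[int]chan RelayEvent)
	}
	b.subs[eventType][b.nextID] = ch
	return EventSubscription{ID: b.nextID, EventType: eventType, C: ch}
}

// PublishEvent delivers ev to every subscriber of ev.Type and returns the
// number of deliveries. A full subscriber buffer causes a drop (assumption)
// so that one slow consumer cannot block the publisher.
func (b *EventBus) PublishEvent(ev RelayEvent) int {
	b.mu.RLock()
	defer b.mu.RUnlock()
	delivered := 0
	for _, ch := range b.subs[ev.Type] {
		select {
		case ch <- ev:
			delivered++
		default: // buffer full: drop for this subscriber
		}
	}
	return delivered
}

// Unsubscribe removes the subscription and closes its channel.
func (b *EventBus) Unsubscribe(sub EventSubscription) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if ch, ok := b.subs[sub.EventType][sub.ID]; ok {
		delete(b.subs[sub.EventType], sub.ID)
		close(ch)
	}
}

func main() {
	bus := NewEventBus()
	sub := bus.SubscribeToEvent("quality.scored", 8)
	n := bus.PublishEvent(RelayEvent{
		Type: "quality.scored", Source: "sift", Payload: 0.92, Timestamp: time.Now(),
	})
	ev := <-sub.C
	fmt.Println("delivered:", n, "from:", ev.Source)
	bus.Unsubscribe(sub)
}
```

Dropping on a full buffer keeps the publisher non-blocking, but it is weaker than at-least-once delivery; an acknowledgment or retry layer would have to sit on top of a bus like this one.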
STAGE 2: WebSocket Real-Time Layer (25% Effort)
Deliverables: 320 lines, 8+ tests
- WebSocket server setup
- Connection pooling
- Message broadcasting
- Presence management
- Connection metrics
Key functions:
- HandleWebSocketConnection() - Accept WS connections
- BroadcastToConnections() - Send to active connections
- ManagePresence() - Track active users/sessions
- GetConnectionMetrics() - Connection statistics
- CloseConnection() - Graceful disconnect
Key types:
- Connection - WebSocket connection wrapper
- PresenceInfo - User/session presence
- ConnectionMetrics - Statistics
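Broadcasting and presence tracking from the Stage 2 list might look like the sketch below. To keep it runnable without a network, the WebSocket write side is replaced by a hypothetical `Sender` interface; the `Hub` type, its `Add` method and the field layouts are likewise assumptions, and a real `Connection` would wrap a gorilla/websocket connection instead.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Sender is a hypothetical stand-in for the write side of a WebSocket
// connection, so the sketch runs without a network or third-party package.
type Sender interface {
	Send(msg []byte) error
}

// Connection wraps a Sender with the identity needed for presence.
type Connection struct {
	ID     string
	UserID string
	Conn   Sender
}

// PresenceInfo reports how many live sessions a user has.
type PresenceInfo struct {
	UserID   string
	Sessions int
}

// Hub tracks active connections. The layout is an assumption.
type Hub struct {
	mu    sync.Mutex
	conns map[string]*Connection
}

func NewHub() *Hub { return &Hub{conns: make(map[string]*Connection)} }

// Add registers a connection under its ID.
func (h *Hub) Add(c *Connection) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.conns[c.ID] = c
}

// BroadcastToConnections sends msg to every connection and returns how many
// sends succeeded. A connection whose send fails is dropped from the hub.
func (h *Hub) BroadcastToConnections(msg []byte) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	sent := 0
	for id, c := range h.conns {
		if err := c.Conn.Send(msg); err != nil {
			delete(h.conns, id) // deleting during range is safe in Go
			continue
		}
		sent++
	}
	return sent
}

// ManagePresence derives per-user presence from the live connections.
func (h *Hub) ManagePresence() map[string]PresenceInfo {
	h.mu.Lock()
	defer h.mu.Unlock()
	out := make(map[string]PresenceInfo)
	for _, c := range h.conns {
		p := out[c.UserID]
		p.UserID = c.UserID
		p.Sessions++
		out[c.UserID] = p
	}
	return out
}

// fakeSender simulates a socket; fail mimics a dropped connection.
type fakeSender struct{ fail bool }

func (f fakeSender) Send([]byte) error {
	if f.fail {
		return errors.New("connection closed")
	}
	return nil
}

func main() {
	hub := NewHub()
	hub.Add(&Connection{ID: "c1", UserID: "ana", Conn: fakeSender{}})
	hub.Add(&Connection{ID: "c2", UserID: "ana", Conn: fakeSender{}})
	hub.Add(&Connection{ID: "c3", UserID: "ben", Conn: fakeSender{fail: true}})
	fmt.Println("delivered:", hub.BroadcastToConnections([]byte("hello")))
	fmt.Println("ana sessions:", hub.ManagePresence()["ana"].Sessions)
}
```

Holding the mutex while sending keeps the sketch short; with real sockets a per-connection write queue would likely be needed so that one slow client does not stall the broadcast.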
STAGE 3: Request Routing & Load Balancing (25% Effort)
Deliverables: 300 lines, 8+ tests
- Request routing to correct subsystem
- Load balancing across instances
- Circuit breaker pattern
- Request tracing/correlation IDs
- Rate limiting preparation
Key functions:
- RouteRequest() - Route to appropriate subsystem
- GetHealthySubsystem() - Pick healthy instance
- RecordRequest() - Track request metrics
- IsCircuitOpen() - Check circuit breaker
- ResetCircuitBreaker() - Recovery
Key types:
- RoutingRule - Routing configuration
- CircuitBreakerState - CB state
- RequestTrace - Tracing info
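One way the circuit breaker functions above could interact is sketched below, assuming a consecutive-failure threshold and a fixed cooldown followed by a half-open probe. The threshold, the `demoCooldown` value, the injected clock and the `fakeClock` helper are all hypothetical; here `RecordRequest` only feeds the breaker and does not collect the wider request metrics the plan mentions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoCooldown is an illustrative value, not one taken from the plan.
const demoCooldown = 30 * time.Second

// fakeClock makes the cooldown observable without sleeping.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// CircuitBreakerState trips after `threshold` consecutive failures.
type CircuitBreakerState struct {
	mu        sync.Mutex
	threshold int
	cooldown  time.Duration
	now       func() time.Time
	failures  int
	open      bool
	openedAt  time.Time
}

func NewCircuitBreaker(threshold int, cooldown time.Duration, now func() time.Time) *CircuitBreakerState {
	if threshold < 1 {
		threshold = 1
	}
	return &CircuitBreakerState{threshold: threshold, cooldown: cooldown, now: now}
}

// RecordRequest updates the breaker with the outcome of one request.
func (cb *CircuitBreakerState) RecordRequest(success bool) {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if success {
		cb.failures = 0
		cb.open = false
		return
	}
	cb.failures++
	if cb.failures >= cb.threshold {
		cb.open = true
		cb.openedAt = cb.now()
	}
}

// IsCircuitOpen reports whether requests should be rejected. Once the
// cooldown has elapsed the breaker goes half-open: traffic is allowed again,
// but a single further failure re-opens the circuit.
func (cb *CircuitBreakerState) IsCircuitOpen() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if cb.open && cb.now().Sub(cb.openedAt) >= cb.cooldown {
		cb.open = false
		cb.failures = cb.threshold - 1
	}
	return cb.open
}

// ResetCircuitBreaker closes the circuit and clears the failure count.
func (cb *CircuitBreakerState) ResetCircuitBreaker() {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	cb.failures = 0
	cb.open = false
}

func main() {
	clk := &fakeClock{}
	cb := NewCircuitBreaker(3, demoCooldown, clk.Now)
	for i := 0; i < 3; i++ {
		cb.RecordRequest(false)
	}
	fmt.Println("open after 3 failures:", cb.IsCircuitOpen())
	clk.Advance(demoCooldown)
	fmt.Println("open after cooldown:", cb.IsCircuitOpen())
}
```

Injecting the clock as a function keeps the timing logic deterministic in tests; production code would pass `time.Now`.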
STAGE 4: Integration & API Endpoints (20% Effort)
Deliverables: 280 lines, 8+ tests
- REST API endpoints for relay operations
- Health/readiness probes
- Metrics/statistics endpoints
- Service discovery endpoints
- Event stream endpoints
Endpoints:
- GET /api/v1/relay/health - Overall health
- GET /api/v1/relay/subsystems - List subsystems
- GET /api/v1/relay/events/stream - SSE event stream
- POST /api/v1/relay/events - Publish event
- GET /api/v1/relay/metrics - Metrics dump
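As an illustration of the first endpoint, the sketch below serves `GET /api/v1/relay/health` from a caller-supplied check function using only the standard library. The JSON shape, the `NewRelayMux` and `probe` names, and the choice to answer 503 when any subsystem is not "healthy" are assumptions; the plan does not specify them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// HealthStatus is the response body. The JSON shape is an assumption.
type HealthStatus struct {
	Status     string            `json:"status"`
	Subsystems map[string]string `json:"subsystems"`
}

// NewRelayMux wires the health endpoint. check returns one status string
// per subsystem name and stands in for a registry-backed health check.
func NewRelayMux(check func() map[string]string) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/relay/health", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		subs := check()
		status, code := "healthy", http.StatusOK
		for _, s := range subs {
			if s != "healthy" {
				// Assumption: any unhealthy subsystem marks the relay degraded.
				status, code = "degraded", http.StatusServiceUnavailable
				break
			}
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(HealthStatus{Status: status, Subsystems: subs})
	})
	return mux
}

// probe issues an in-memory request and returns the status code and body.
func probe(h http.Handler, method, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	mux := NewRelayMux(func() map[string]string {
		return map[string]string{"sift": "healthy", "cast": "healthy"}
	})
	code, body := probe(mux, http.MethodGet, "/api/v1/relay/health")
	fmt.Println(code, body)
}
```

In a running service the same mux would be passed to `http.ListenAndServe`; `probe` exists only so the handler can be exercised without opening a port.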
Implementation Schedule
| Phase | Duration | Files | Tests | LOC |
|---|---|---|---|---|
| Stage 1 | 2-3h | event_bus.go, registry.go | 10+ | 350 |
| Stage 2 | 2-3h | websocket.go, presence.go | 8+ | 320 |
| Stage 3 | 2-3h | router.go, balancer.go | 8+ | 300 |
| Stage 4 | 1-2h | handlers.go, service.go | 8+ | 280 |
| Testing | 1-2h | relay_test.go | 34+ | 400 |
| Total | 8-13h | ~8 files | 34+ tests | 1,650 lines |
Technical Decisions
1. Event Model
Event Structure:
2. Subsystem Registry
Registry Storage:
3. WebSocket Design
- Use gorilla/websocket for robust implementation
- Connection pooling with bounded concurrency (1000 concurrent default)
- Message batching for efficiency
- Automatic reconnection support
- Ping/pong for connection keep-alive
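The bounded-concurrency point above can be enforced with a counting semaphore on a buffered channel. This is a minimal sketch: `ConnLimiter`, `ErrPoolFull` and the reject-when-full policy are assumptions, and wiring `Acquire`/`Release` into the WebSocket upgrade handler is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// DefaultMaxConnections mirrors the 1000-connection default in the plan.
const DefaultMaxConnections = 1000

// ErrPoolFull is returned when the connection limit is reached.
var ErrPoolFull = errors.New("relay: connection limit reached")

// ConnLimiter is a counting semaphore built on a buffered channel.
type ConnLimiter struct{ slots chan struct{} }

func NewConnLimiter(max int) *ConnLimiter {
	return &ConnLimiter{slots: make(chan struct{}, max)}
}

// Acquire claims a slot without blocking, or reports that the pool is full.
func (l *ConnLimiter) Acquire() error {
	select {
	case l.slots <- struct{}{}:
		return nil
	default:
		return ErrPoolFull
	}
}

// Release frees a slot. Releasing with no slot held is ignored.
func (l *ConnLimiter) Release() {
	select {
	case <-l.slots:
	default:
	}
}

// Active returns the number of slots currently held.
func (l *ConnLimiter) Active() int { return len(l.slots) }

func main() {
	lim := NewConnLimiter(DefaultMaxConnections)
	if err := lim.Acquire(); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	defer lim.Release()
	fmt.Println("active connections:", lim.Active())
}
```

Rejecting immediately when full, rather than queueing, makes overload visible to the client, which can then retry against another instance.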
4. Routing Strategy
Request Flow:
- Parse request path to determine subsystem
- Select healthy instance from registry
- Forward request with tracing headers
- Record metrics (latency, success/failure)
- Update circuit breaker state
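The first three steps of the request flow above can be sketched as follows, assuming paths of the form `/api/v1/<subsystem>/...`, round-robin selection among healthy instances, and a random hex correlation ID generated with `crypto/rand` in place of `github.com/google/uuid`. The `Instance` type, `AddInstance`, `SubsystemFromPath` and the error values are hypothetical; metrics and circuit-breaker updates are omitted.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
	"sync"
)

var (
	ErrNoRoute   = errors.New("relay: path does not name a subsystem")
	ErrNoHealthy = errors.New("relay: no healthy instance")
)

// Instance is one registered copy of a subsystem (hypothetical type).
type Instance struct {
	Addr    string
	Healthy bool
}

// RequestTrace records where a request was sent and under which ID.
type RequestTrace struct {
	CorrelationID, Subsystem, Addr string
}

type Router struct {
	mu        sync.Mutex
	instances map[string][]Instance
	next      map[string]int // round-robin cursor per subsystem
}

func NewRouter() *Router {
	return &Router{instances: map[string][]Instance{}, next: map[string]int{}}
}

func (r *Router) AddInstance(name string, inst Instance) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.instances[name] = append(r.instances[name], inst)
}

// SubsystemFromPath extracts <subsystem> from /api/v1/<subsystem>/... .
func SubsystemFromPath(path string) (string, error) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) < 3 || parts[0] != "api" || parts[1] != "v1" || parts[2] == "" {
		return "", ErrNoRoute
	}
	return parts[2], nil
}

// GetHealthySubsystem picks the next healthy instance in round-robin order.
func (r *Router) GetHealthySubsystem(name string) (Instance, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	list := r.instances[name]
	for i := 0; i < len(list); i++ {
		idx := (r.next[name] + i) % len(list)
		if list[idx].Healthy {
			r.next[name] = idx + 1
			return list[idx], nil
		}
	}
	return Instance{}, ErrNoHealthy
}

// RouteRequest resolves a path to an instance, reusing the caller's
// correlation ID when present and generating one otherwise.
func (r *Router) RouteRequest(path, correlationID string) (RequestTrace, error) {
	name, err := SubsystemFromPath(path)
	if err != nil {
		return RequestTrace{}, err
	}
	inst, err := r.GetHealthySubsystem(name)
	if err != nil {
		return RequestTrace{}, fmt.Errorf("%s: %w", name, err)
	}
	if correlationID == "" {
		buf := make([]byte, 16)
		if _, err := rand.Read(buf); err != nil {
			return RequestTrace{}, err
		}
		correlationID = hex.EncodeToString(buf)
	}
	return RequestTrace{CorrelationID: correlationID, Subsystem: name, Addr: inst.Addr}, nil
}

func main() {
	r := NewRouter()
	r.AddInstance("sift", Instance{Addr: "sift-1:9000", Healthy: true})
	trace, err := r.RouteRequest("/api/v1/sift/score", "")
	fmt.Println(trace.Subsystem, trace.Addr, len(trace.CorrelationID), err)
}
```

The returned `RequestTrace` carries what a forwarding step would need: the target address and the correlation ID to propagate in a tracing header.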
Dependencies
Internal
- pkg/sift - Quality assessment
- pkg/cast - Citation & tagging
- pkg/spawn - Hydration & metadata
- pkg/stitch - Content coordination
- pkg/gateway - API Gateway integration
- pkg/config - Configuration management
External
- github.com/gorilla/websocket - WebSocket handling
- github.com/google/uuid - UUID generation
- golang.org/x/time/rate - Rate limiting
- Standard library: sync, time, context, net/http
Success Criteria
| Criterion | Target | Definition |
|---|---|---|
| Test Coverage | 30+ tests | Unit + integration tests for all components |
| Pass Rate | 100% | All tests passing, 0 failures |
| Performance | <50ms | Event routing <10ms, WS broadcast <50ms |
| Compilation | 0 errors | Clean Go build, type-safe throughout |
| Subsystems | All 6 active | SIFT, CAST, SPAWN, STITCH, Gateway, Relay |
| Uptime | 99.9% | Graceful degradation on subsystem failure |
| Throughput | 10K eps | 10,000 events per second minimum |
Risk Mitigation
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| WebSocket scaling issues | Medium | High | Connection pooling, load balancing |
| Event loop deadlocks | Low | High | Timeouts, goroutine watchdog |
| Subsystem crashes | Low | Medium | Health checks, automatic circuit breaker |
| Message loss | Low | High | At-least-once delivery, message acknowledgment |
| Performance bottleneck | Medium | Medium | Event batching, async processing |
Go/No-Go Checkpoints
Before Starting
- All prior SPAWN subsystems working (60/60 tests)
- Gateway package compiles
- go.mod dependencies available
- WebSocket library ready
After Stage 1
- Event bus tests pass
- Registry operations verified
- Health check system working
- Subsystem can self-register
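The Stage 1 checkpoints above could be exercised against a registry as small as the following. It is a sketch under stated assumptions: the `Check` hook, the `checkTimeout` value, the `ErrUnhealthy` sentinel and the sequential check order are illustrative and not part of the plan.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// checkTimeout bounds each health check; the value is an assumption.
const checkTimeout = 2 * time.Second

// ErrUnhealthy is a hypothetical sentinel a check can return.
var ErrUnhealthy = errors.New("relay: subsystem unhealthy")

// HealthStatus is the result of one health check.
type HealthStatus struct {
	Name    string
	Healthy bool
	Detail  string
}

// SubsystemInfo is the registry entry. A nil Check means "always healthy".
type SubsystemInfo struct {
	Name  string
	Check func() error
}

type Registry struct {
	mu     sync.Mutex
	byName map[string]SubsystemInfo
	order  []string // registration order, used for stable output
}

func NewRegistry() *Registry {
	return &Registry{byName: make(map[string]SubsystemInfo)}
}

// RegisterSubsystem adds an entry; empty and duplicate names are rejected.
func (r *Registry) RegisterSubsystem(info SubsystemInfo) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, dup := r.byName[info.Name]; dup || info.Name == "" {
		return fmt.Errorf("relay: cannot register subsystem %q", info.Name)
	}
	r.byName[info.Name] = info
	r.order = append(r.order, info.Name)
	return nil
}

// runCheck runs one check with a timeout. A check that never returns leaks
// its goroutine, which a real implementation would need to address.
func runCheck(info SubsystemInfo) HealthStatus {
	if info.Check == nil {
		return HealthStatus{Name: info.Name, Healthy: true}
	}
	done := make(chan error, 1) // buffered so a late result never blocks
	go func() { done <- info.Check() }()
	select {
	case err := <-done:
		if err != nil {
			return HealthStatus{Name: info.Name, Detail: err.Error()}
		}
		return HealthStatus{Name: info.Name, Healthy: true}
	case <-time.After(checkTimeout):
		return HealthStatus{Name: info.Name, Detail: "health check timed out"}
	}
}

// HealthCheckAll checks every subsystem in registration order.
func (r *Registry) HealthCheckAll() []HealthStatus {
	r.mu.Lock()
	infos := make([]SubsystemInfo, 0, len(r.order))
	for _, name := range r.order {
		infos = append(infos, r.byName[name])
	}
	r.mu.Unlock() // do not hold the lock while checks run

	results := make([]HealthStatus, 0, len(infos))
	for _, info := range infos {
		results = append(results, runCheck(info))
	}
	return results
}

func main() {
	reg := NewRegistry()
	_ = reg.RegisterSubsystem(SubsystemInfo{Name: "sift"})
	_ = reg.RegisterSubsystem(SubsystemInfo{
		Name:  "cast",
		Check: func() error { return ErrUnhealthy },
	})
	for _, h := range reg.HealthCheckAll() {
		fmt.Println(h.Name, h.Healthy, h.Detail)
	}
}
```

Running the checks one after another keeps the sketch simple, but the worst case is the number of subsystems times `checkTimeout`; running them concurrently would cap the total at a single timeout.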
After Stage 2
- WebSocket server accepts connections
- Broadcast to multiple clients working
- Presence tracking accurate
- Connection metrics collecting
After Stage 3
- Requests routed to correct subsystem
- Load balancing distributes evenly
- Circuit breaker trips on failures
- Tracing headers propagated
After Stage 4
- All API endpoints responding
- Health probes working
- Event stream accessible
- Service discovery functional
Final Verification
- 30+ tests passing
- 100% pass rate confirmed
- Performance targets met (<50ms)
- Zero compilation errors
- All subsystems coordinated
File Structure
Architecture Patterns
✅ Pub/Sub Pattern: Decoupled event publishing and subscribing
✅ Service Registry Pattern: Dynamic service discovery
✅ Circuit Breaker Pattern: Fault tolerance
✅ Load Balancing: Distribute requests evenly
✅ Health Checks: Continuous subsystem monitoring
✅ Graceful Degradation: Continue if subsystems fail
✅ Correlation IDs: Request tracing across subsystems
✅ Rate Limiting: Prevent resource exhaustion
Integration Points
With SIFT
- Subscribe to quality.scored events
- Route quality check requests
- Health checks for SIFT service
With CAST
- Subscribe to links.resolved events
- Route link analysis requests
- Health checks for CAST service
With SPAWN
- Subscribe to content.enriched events
- Route enrichment requests
- Health checks for SPAWN service
With STITCH
- Subscribe to sync.completed events
- Route synchronization requests
- Health checks for STITCH service
With Gateway
- Coordinate routing decisions
- Share connection state
- Unified health endpoint
Deployment Considerations
- Port: 8004 (default RELAY port)
- Memory: ~256MB for event buffers
- Concurrency: Configurable, default 1000 concurrent WS connections
- Scaling: Horizontal scaling via event propagation
- Monitoring: Prometheus metrics export ready
Expected Outcomes
After TASKSET 6 Completion
✅ Unified event-driven architecture
✅ Real-time WebSocket communication layer
✅ Intelligent request routing & load balancing
✅ Subsystem health and lifecycle management
✅ Complete inter-subsystem coordination
✅ Foundation for distributed tracing
✅ Production-ready orchestration layer
Verification
- 30+ tests passing (100% pass rate)
- All subsystems coordinated
- WebSocket connections stable
- Event routing verified
- Load balancing effective
- Health checks responsive
- Performance targets met
Next Steps After TASKSET 6
- TASKSET 7: End-to-End Integration Testing
  - Full workflow tests
  - Performance benchmarks
  - Failure scenario testing
- TASKSET 8: Production Deployment
  - Docker containerization
  - Kubernetes manifests
  - CI/CD pipeline setup
  - Monitoring & alerting
- TASKSET 9: Operations & Observability
  - Prometheus metrics
  - Distributed tracing (Jaeger)
  - Structured logging
  - Performance tuning
Status: Ready for Implementation
Authority: CTO Approval
Recommended: Proceed with Stage 1 immediately
Document Metadata
- Created: 2025-12-05
- Version: 1.0
- Owner: Clari Systems Architecture Team
- Status: APPROVED FOR EXECUTION