TASKSET 6 (RELAY INTEGRATION) - STRATEGIC BUILD PLAN

Component Overview

RELAY (Real-time Event Layer Architecture for You) is the unified orchestration and message-routing system: it coordinates all subsystem operations, manages real-time WebSocket connections, and provides the central hub for inter-subsystem communication.

Previous Layer: SPAWN (Hydration & Metadata Extraction) ✅ Complete
Current Layer: RELAY (Orchestration & Event Routing) - Starting Now
Next Layer: Complete Integration & Deployment Testing

Architecture Overview

┌──────────────────────────────────────────────────────────┐
│                    RELAY SERVICE                         │
│                (Central Orchestrator)                    │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌─────────────────────────────────────────────────┐   │
│  │      Message Bus & Event Router                 │   │
│  │  • Pub/Sub system                               │   │
│  │  • Event type routing                           │   │
│  │  • Cross-subsystem communication                │   │
│  └─────────────────────────────────────────────────┘   │
│                           ▲                              │
│    ┌─────────┬────────┬──┴───┬──────────┬──────────┐   │
│    │         │        │      │          │          │   │
│  ┌─▼──┐ ┌───▼──┐ ┌───▼─┐ ┌──▼───┐ ┌─────▼─┐ ┌────▼─┐ │
│  │SIFT│ │ CAST │ │SPAWN│ │STITCH│ │GATEWAY│ │NOTIF │ │
│  └────┘ └──────┘ └─────┘ └──────┘ └───────┘ └──────┘ │
│                                                          │
│  [Health Check] [Metrics] [Lifecycle] [Coordination]  │
│                                                          │
└──────────────────────────────────────────────────────────┘

4-Stage Build Strategy

STAGE 1: Core Event Bus & Subsystem Registry (30% Effort)

Deliverables: 350 lines, 10+ tests
  • Event bus infrastructure (channels, routing)
  • Subsystem registry (registration, discovery)
  • Health check coordination
  • Lifecycle management (init, ready, shutdown)
Key Methods:
  • RegisterSubsystem() - Register a new subsystem
  • PublishEvent() - Broadcast event to all subscribers
  • SubscribeToEvent() - Listen for specific events
  • HealthCheckAll() - Check all subsystems
  • GracefulShutdown() - Coordinated shutdown
Key Types:
  • RelayEvent - Event envelope
  • SubsystemInfo - Registry entry
  • HealthStatus - Health information
  • EventSubscription - Subscription handle
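
As a sketch of how Stage 1's PublishEvent/SubscribeToEvent pair might fit together, here is a minimal channel-based pub/sub bus using only the standard library. The field set on RelayEvent is trimmed for brevity, and the non-blocking send policy (drop on full buffer) is an illustrative choice, not a committed design:

```go
package main

import (
	"fmt"
	"sync"
)

// EventType identifies an event category, e.g. "quality.scored".
type EventType string

// RelayEvent is a trimmed event envelope; the full struct is
// specified in the Event Model section below.
type RelayEvent struct {
	Type    EventType
	Source  string
	Payload map[string]interface{}
}

// EventBus fans events out to per-type subscriber channels.
type EventBus struct {
	mu   sync.RWMutex
	subs map[EventType][]chan RelayEvent
}

func NewEventBus() *EventBus {
	return &EventBus{subs: make(map[EventType][]chan RelayEvent)}
}

// SubscribeToEvent returns a buffered channel that receives events of the
// given type. A real implementation would also return an unsubscribe handle.
func (b *EventBus) SubscribeToEvent(t EventType) <-chan RelayEvent {
	ch := make(chan RelayEvent, 16)
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs[t] = append(b.subs[t], ch)
	return ch
}

// PublishEvent delivers the event to every subscriber of its type.
// Delivery is non-blocking: subscribers with full buffers are skipped.
func (b *EventBus) PublishEvent(e RelayEvent) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subs[e.Type] {
		select {
		case ch <- e:
		default: // drop rather than block the publisher
		}
	}
}

func main() {
	bus := NewEventBus()
	ch := bus.SubscribeToEvent("quality.scored")
	bus.PublishEvent(RelayEvent{Type: "quality.scored", Source: "SIFT"})
	fmt.Println((<-ch).Source) // SIFT
}
```

The drop-on-full policy trades delivery guarantees for publisher latency; the at-least-once delivery named under Risk Mitigation would require acknowledgments on top of this.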

STAGE 2: WebSocket Real-Time Layer (25% Effort)

Deliverables: 320 lines, 8+ tests
  • WebSocket server setup
  • Connection pooling
  • Message broadcasting
  • Presence management
  • Connection metrics
Key Methods:
  • HandleWebSocketConnection() - Accept WS connections
  • BroadcastToConnections() - Send to active connections
  • ManagePresence() - Track active users/sessions
  • GetConnectionMetrics() - Connection statistics
  • CloseConnection() - Graceful disconnect
Key Types:
  • Connection - WebSocket connection wrapper
  • PresenceInfo - User/session presence
  • ConnectionMetrics - Statistics
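
Stage 2's presence management can be sketched as a mutex-guarded session map; the method names here (Touch, Remove, ActiveCount) are illustrative stand-ins for whatever ManagePresence decomposes into:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// PresenceInfo records when a user/session was last seen.
type PresenceInfo struct {
	SessionID string
	LastSeen  time.Time
}

// PresenceTracker keeps the active session set behind a mutex.
type PresenceTracker struct {
	mu       sync.RWMutex
	sessions map[string]PresenceInfo
}

func NewPresenceTracker() *PresenceTracker {
	return &PresenceTracker{sessions: make(map[string]PresenceInfo)}
}

// Touch marks a session as active now (called on connect and on each ping).
func (p *PresenceTracker) Touch(sessionID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.sessions[sessionID] = PresenceInfo{SessionID: sessionID, LastSeen: time.Now()}
}

// Remove drops a session on disconnect.
func (p *PresenceTracker) Remove(sessionID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.sessions, sessionID)
}

// ActiveCount reports how many sessions are currently tracked.
func (p *PresenceTracker) ActiveCount() int {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return len(p.sessions)
}

func main() {
	pt := NewPresenceTracker()
	pt.Touch("sess-1")
	pt.Touch("sess-2")
	pt.Remove("sess-1")
	fmt.Println(pt.ActiveCount()) // 1
}
```

A production version would additionally reap sessions whose LastSeen exceeds the ping/pong keep-alive window.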

STAGE 3: Request Routing & Load Balancing (25% Effort)

Deliverables: 300 lines, 8+ tests
  • Request routing to correct subsystem
  • Load balancing across instances
  • Circuit breaker pattern
  • Request tracing/correlation IDs
  • Rate limiting preparation
Key Methods:
  • RouteRequest() - Route to appropriate subsystem
  • GetHealthySubsystem() - Pick healthy instance
  • RecordRequest() - Track request metrics
  • IsCircuitOpen() - Check circuit breaker
  • ResetCircuitBreaker() - Recovery
Key Types:
  • RoutingRule - Routing configuration
  • CircuitBreakerState - CB state
  • RequestTrace - Tracing info

STAGE 4: Integration & API Endpoints (20% Effort)

Deliverables: 280 lines, 8+ tests
  • REST API endpoints for relay operations
  • Health/readiness probes
  • Metrics/statistics endpoints
  • Service discovery endpoints
  • Event stream endpoints
API Endpoints:
  • GET /api/v1/relay/health - Overall health
  • GET /api/v1/relay/subsystems - List subsystems
  • GET /api/v1/relay/events/stream - SSE event stream
  • POST /api/v1/relay/events - Publish event
  • GET /api/v1/relay/metrics - Metrics dump

Implementation Schedule

Phase     Duration  Files                       Tests      LOC
Stage 1   2-3h      event_bus.go, registry.go   10+        350
Stage 2   2-3h      websocket.go, presence.go   8+         320
Stage 3   2-3h      router.go, balancer.go      8+         300
Stage 4   1-2h      handlers.go, service.go     8+         280
Testing   1-2h      relay_test.go               34+        400
Total     8-13h     ~8 files                    34+ tests  1,650 lines

Technical Decisions

1. Event Model

Event Structure:
type Event struct {
    ID            string                 `json:"id"`
    Type          EventType              `json:"type"`
    Source        string                 `json:"source"`      // subsystem name
    Destination   string                 `json:"destination"` // subsystem name or "*"
    Payload       map[string]interface{} `json:"payload"`
    Timestamp     time.Time              `json:"timestamp"`
    TraceID       string                 `json:"trace_id"`
    CorrelationID string                 `json:"correlation_id"`
}

type EventType string

const (
    EventTypeDocumentCreated   EventType = "document.created"
    EventTypeDocumentUpdated   EventType = "document.updated"
    EventTypeQualityScored     EventType = "quality.scored"
    EventTypeLinksResolved     EventType = "links.resolved"
    EventTypeContentEnriched   EventType = "content.enriched"
    EventTypeSyncCompleted     EventType = "sync.completed"
    EventTypeHealthCheckFailed EventType = "health.check_failed"
)

2. Subsystem Registry

Registry Storage:
type SubsystemRegistry struct {
    subsystems map[string]*SubsystemInfo
    healthy    map[string]bool
    mu         sync.RWMutex
}

type SubsystemInfo struct {
    Name            string
    Version         string
    HealthURL       string
    BaseURL         string
    LastHealthCheck time.Time
    Status          string // "healthy", "degraded", "unhealthy"
}
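
A minimal sketch of the registry's core operations, assuming the structs above (fields are trimmed to what the sketch uses, and the MarkHealth/Healthy method names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// SubsystemInfo mirrors the registry entry above, trimmed to the
// fields this sketch exercises.
type SubsystemInfo struct {
	Name            string
	BaseURL         string
	Status          string // "healthy", "degraded", "unhealthy"
	LastHealthCheck time.Time
}

type SubsystemRegistry struct {
	mu         sync.RWMutex
	subsystems map[string]*SubsystemInfo
}

func NewSubsystemRegistry() *SubsystemRegistry {
	return &SubsystemRegistry{subsystems: make(map[string]*SubsystemInfo)}
}

// RegisterSubsystem adds or replaces a registry entry.
func (r *SubsystemRegistry) RegisterSubsystem(info *SubsystemInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.subsystems[info.Name] = info
}

// MarkHealth records the latest health-check result for a subsystem.
func (r *SubsystemRegistry) MarkHealth(name, status string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if s, ok := r.subsystems[name]; ok {
		s.Status = status
		s.LastHealthCheck = time.Now()
	}
}

// Healthy lists the names of subsystems currently reporting "healthy".
func (r *SubsystemRegistry) Healthy() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []string
	for name, s := range r.subsystems {
		if s.Status == "healthy" {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	reg := NewSubsystemRegistry()
	reg.RegisterSubsystem(&SubsystemInfo{Name: "SIFT", BaseURL: "http://localhost:8001"})
	reg.MarkHealth("SIFT", "healthy")
	fmt.Println(reg.Healthy()) // [SIFT]
}
```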

3. WebSocket Design

  • Use gorilla/websocket for robust implementation
  • Connection pooling with bounded concurrency (1000 concurrent default)
  • Message batching for efficiency
  • Automatic reconnection support
  • Ping/pong for connection keep-alive

4. Routing Strategy

Request Flow:
  1. Parse request path to determine subsystem
  2. Select healthy instance from registry
  3. Forward request with tracing headers
  4. Record metrics (latency, success/failure)
  5. Update circuit breaker state
Routing Rules:
/api/v1/sift/*     → SIFT subsystem
/api/v1/cast/*     → CAST subsystem
/api/v1/spawn/*    → SPAWN subsystem
/api/v1/stitch/*   → STITCH subsystem
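
Step 1 of the request flow (path → subsystem) can be sketched as a simple prefix match over the rules above; returning "" for unknown prefixes lets the caller respond 404:

```go
package main

import (
	"fmt"
	"strings"
)

// routeFor maps a request path to a subsystem name using the prefix
// rules above. Unknown prefixes return "".
func routeFor(path string) string {
	rules := map[string]string{
		"/api/v1/sift/":   "SIFT",
		"/api/v1/cast/":   "CAST",
		"/api/v1/spawn/":  "SPAWN",
		"/api/v1/stitch/": "STITCH",
	}
	for prefix, subsystem := range rules {
		if strings.HasPrefix(path, prefix) {
			return subsystem
		}
	}
	return ""
}

func main() {
	fmt.Println(routeFor("/api/v1/sift/score"))  // SIFT
	fmt.Println(routeFor("/api/v1/stitch/sync")) // STITCH
	fmt.Println(routeFor("/health"))             // (empty: not routed)
}
```

Steps 2-5 (instance selection, tracing headers, metrics, circuit breaker) would wrap this lookup in the actual RouteRequest implementation.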

Dependencies

Internal

  • pkg/sift - Quality assessment
  • pkg/cast - Citation & tagging
  • pkg/spawn - Hydration & metadata
  • pkg/stitch - Content coordination
  • pkg/gateway - API Gateway integration
  • pkg/config - Configuration management

External

  • github.com/gorilla/websocket - WebSocket handling
  • github.com/google/uuid - UUID generation
  • golang.org/x/time/rate - Rate limiting
  • Standard library: sync, time, context, net/http

Success Criteria

Criterion      Target        Definition
Test Coverage  30+ tests     Unit + integration tests for all components
Pass Rate      100%          All tests passing, 0 failures
Performance    <50ms         Event routing <10ms, WS broadcast <50ms
Compilation    0 errors      Clean Go build, type-safe throughout
Subsystems     All 6 active  SIFT, CAST, SPAWN, STITCH, Gateway, Relay
Uptime         99.9%         Graceful degradation on subsystem failure
Throughput     10K eps       10,000 events per second minimum

Risk Mitigation

Risk                      Probability  Impact  Mitigation
WebSocket scaling issues  Medium       High    Connection pooling, load balancing
Event loop deadlocks      Low          High    Timeouts, goroutine watchdog
Subsystem crashes         Low          Medium  Health checks, automatic circuit breaker
Message loss              Low          High    At-least-once delivery, message acknowledgment
Performance bottleneck    Medium       Medium  Event batching, async processing

Go/No-Go Checkpoints

Before Starting

  • All prior SPAWN subsystems working (60/60 tests)
  • Gateway package compiles
  • go.mod dependencies available
  • WebSocket library ready

After Stage 1

  • Event bus tests pass
  • Registry operations verified
  • Health check system working
  • Subsystem can self-register

After Stage 2

  • WebSocket server accepts connections
  • Broadcast to multiple clients working
  • Presence tracking accurate
  • Connection metrics collecting

After Stage 3

  • Requests routed to correct subsystem
  • Load balancing distributes evenly
  • Circuit breaker trips on failures
  • Tracing headers propagated

After Stage 4

  • All API endpoints responding
  • Health probes working
  • Event stream accessible
  • Service discovery functional

Final Verification

  • 30+ tests passing
  • 100% pass rate confirmed
  • Performance targets met (<50ms)
  • Zero compilation errors
  • All subsystems coordinated

File Structure

/Users/alexarno/materi/clari/backend/pkg/relay/
├── types.go              # Core types & enums (120 lines)
├── event_bus.go          # Event routing engine (200 lines)
├── registry.go           # Subsystem registry (180 lines)
├── websocket.go          # WebSocket server (240 lines)
├── presence.go           # Presence tracking (140 lines)
├── router.go             # Request router (180 lines)
├── balancer.go           # Load balancer (160 lines)
├── service.go            # Main relay service (200 lines)
├── handlers.go           # HTTP handlers (150 lines)
├── relay_test.go         # Unit tests (400 lines)
└── integration_test.go   # Integration tests (300 lines)
Total: ~2,270 lines of new code across 11 files

Architecture Patterns

Pub/Sub Pattern: Decoupled event publishing and subscribing
Service Registry Pattern: Dynamic service discovery
Circuit Breaker Pattern: Fault tolerance
Load Balancing: Distribute requests evenly
Health Checks: Continuous subsystem monitoring
Graceful Degradation: Continue if subsystems fail
Correlation IDs: Request tracing across subsystems
Rate Limiting: Prevent resource exhaustion

Integration Points

With SIFT

  • Subscribe to quality.scored events
  • Route quality check requests
  • Health checks for SIFT service

With CAST

  • Subscribe to links.resolved events
  • Route link analysis requests
  • Health checks for CAST service

With SPAWN

  • Subscribe to content.enriched events
  • Route enrichment requests
  • Health checks for SPAWN service

With STITCH

  • Subscribe to sync.completed events
  • Route synchronization requests
  • Health checks for STITCH service

With Gateway

  • Coordinate routing decisions
  • Share connection state
  • Unified health endpoint

Deployment Considerations

  • Port: 8004 (default RELAY port)
  • Memory: ~256MB for event buffers
  • Concurrency: Configurable, default 1000 concurrent WS connections
  • Scaling: Horizontal scaling via event propagation
  • Monitoring: Prometheus metrics export ready

Expected Outcomes

After TASKSET 6 Completion

✅ Unified event-driven architecture
✅ Real-time WebSocket communication layer
✅ Intelligent request routing & load balancing
✅ Subsystem health and lifecycle management
✅ Complete inter-subsystem coordination
✅ Foundation for distributed tracing
✅ Production-ready orchestration layer

Verification

  • 30+ tests passing (100% pass rate)
  • All subsystems coordinated
  • WebSocket connections stable
  • Event routing verified
  • Load balancing effective
  • Health checks responsive
  • Performance targets met

Next Steps After TASKSET 6

  1. TASKSET 7: End-to-End Integration Testing
    • Full workflow tests
    • Performance benchmarks
    • Failure scenario testing
  2. TASKSET 8: Production Deployment
    • Docker containerization
    • Kubernetes manifests
    • CI/CD pipeline setup
    • Monitoring & alerting
  3. TASKSET 9: Operations & Observability
    • Prometheus metrics
    • Distributed tracing (Jaeger)
    • Structured logging
    • Performance tuning

Status: Ready for Implementation
Authority: CTO Approval
Recommended: Proceed with Stage 1 immediately

Document Metadata

  • Created: 2025-12-05
  • Version: 1.0
  • Owner: Clari Systems Architecture Team
  • Status: APPROVED FOR EXECUTION