RabbitMQ Deployment Checklist

This checklist tracks the implementation of RabbitMQ message queue infrastructure for sparki.tools, following the 8-taskset execution strategy defined in Block 10.

Pre-Deployment Verification

  • AWS Secrets Manager secret sparki/rabbitmq/credentials created with username and password keys
  • EKS cluster IAM role has permission to read the secret
  • Certificate for rabbitmq.sparki.tools configured in AWS Certificate Manager (or cert-manager will provision)
  • DNS record for rabbitmq.sparki.tools ready (Route53 or external DNS)

TASKSET 1: Finalize Contracts + Deployment Parameters

Status: IN PROGRESS
Owner: Platform Team

Deliverables

  • Update TASKSET_EXECUTION_STRATEGY.md Block 10 spec
    • Add deployments.failed DLQ definition
    • Update queue types to mixed strategy (classic+HA for builds, quorum for others)
    • Add dead_letter_routing_key to deployments queue
    • Clarify DLX bindings
  • Create platform/platform-docs/system/rabbitmq-contract.yaml
    • Connection parameters
    • Exchange definitions
    • Queue definitions with arguments
    • Message schemas (build_job, deployment_notification)
    • Producer requirements
    • Consumer requirements
    • Monitoring/alerting thresholds
    • Operational procedures
  • Create platform/platform-docs/runbooks/rabbitmq-deployment-checklist.md (this file)
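
As a starting point for the message-schema work, a build_job message might look like the sketch below. All field names here are assumptions to be pinned down in rabbitmq-contract.yaml; only project_id and commit_sha are taken from the verification examples later in this checklist.

```json
{
  "message_id": "b1946ac9-2f77-4c52-9f0e-8a3a1f0d2c11",
  "type": "build_job",
  "schema_version": 1,
  "project_id": "test",
  "commit_sha": "abc123",
  "priority": 5,
  "created_at": "2025-01-01T00:00:00Z"
}
```

The message_id doubles as the idempotency key referenced in TASKSET 5, so producers should generate it once per logical job, not per publish attempt.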

Sign-off

  • Platform lead review
  • Contract approved by api-engine team
  • Contract approved by deploy-loco team

TASKSET 2: RabbitMQ Core (Helm via Kustomize) + Secrets + Definitions

Status: NOT STARTED
Owner: Platform Team
Depends On: TASKSET 1

Deliverables

  • Create infra/kubernetes-manifests/base/rabbitmq/ directory structure
    rabbitmq/
    ├── kustomization.yaml        # HelmChart generator
    ├── values.yaml               # Helm values override
    ├── namespace.yaml            # rabbitmq namespace
    ├── secrets/
    │   ├── secret-store.yaml     # SecretStore for rabbitmq namespace
    │   ├── service-account.yaml  # IRSA ServiceAccount
    │   └── external-secret.yaml  # ExternalSecret pulling credentials
    └── definitions/
        └── definitions.json      # Exchanges, queues, bindings, policies
    
  • kustomization.yaml with Bitnami RabbitMQ Helm chart
    • Chart version: latest stable chart release (pin exact version; RabbitMQ server 3.x)
    • Replicas: 3
    • Persistence: enabled
    • Plugins: rabbitmq_management, rabbitmq_prometheus
  • values.yaml overrides
    • Auth from ExternalSecret
    • Clustering enabled
    • Resource requests/limits
    • Prometheus metrics enabled
    • Load definitions from ConfigMap
  • ExternalSecrets pattern
    • SecretStore pointing to AWS Secrets Manager
    • ServiceAccount with IRSA annotation
    • ExternalSecret for rabbitmq-credentials
  • definitions.json with
    • Exchanges: dlx (topic)
    • Queues: builds, deployments, notifications, builds.failed, deployments.failed
    • Bindings: DLX to DLQ bindings
    • Policies: ha-all for classic queues
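
An abridged definitions.json sketch for the bullets above is shown below. Only the builds queue, the deployments queue, one DLQ, one binding, and the ha-all policy are included; the vhost, routing keys, x-max-priority, and DLQ queue types are assumptions to be confirmed against the contract.

```json
{
  "exchanges": [
    { "name": "dlx", "vhost": "/", "type": "topic",
      "durable": true, "auto_delete": false, "internal": false, "arguments": {} }
  ],
  "queues": [
    { "name": "builds", "vhost": "/", "durable": true, "auto_delete": false,
      "arguments": { "x-queue-type": "classic",
                     "x-max-priority": 10,
                     "x-dead-letter-exchange": "dlx" } },
    { "name": "deployments", "vhost": "/", "durable": true, "auto_delete": false,
      "arguments": { "x-queue-type": "quorum",
                     "x-dead-letter-exchange": "dlx",
                     "x-dead-letter-routing-key": "deployments.failed" } },
    { "name": "builds.failed", "vhost": "/", "durable": true, "auto_delete": false,
      "arguments": { "x-queue-type": "quorum" } }
  ],
  "bindings": [
    { "source": "dlx", "vhost": "/", "destination": "builds.failed",
      "destination_type": "queue", "routing_key": "builds.#", "arguments": {} }
  ],
  "policies": [
    { "name": "ha-all", "vhost": "/", "pattern": "^builds$",
      "apply-to": "queues", "priority": 0,
      "definition": { "ha-mode": "all" } }
  ]
}
```

Note the ha-all policy pattern is scoped to the classic builds queue only; quorum queues replicate on their own and must not be matched by mirroring policies.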

Verification

# Verify pods are running
kubectl get pods -n rabbitmq -l app.kubernetes.io/name=rabbitmq

# Verify cluster formed
kubectl exec -n rabbitmq rabbitmq-0 -- rabbitmqctl cluster_status

# Verify queues created
kubectl exec -n rabbitmq rabbitmq-0 -- rabbitmqctl list_queues name type durable

# Verify exchanges
kubectl exec -n rabbitmq rabbitmq-0 -- rabbitmqctl list_exchanges name type

Sign-off

  • 3 replicas running
  • Cluster formed successfully
  • All queues created
  • Secrets synced from AWS

TASKSET 3: Management UI Ingress + TLS (Kong)

Status: NOT STARTED
Owner: Platform Team
Depends On: TASKSET 2

Deliverables

  • Create infra/kubernetes-manifests/base/rabbitmq/ingress/ directory
    ingress/
    ├── certificate.yaml          # cert-manager Certificate
    └── rabbitmq-ingress.yaml     # Kong Ingress
    
  • Certificate resource
    • Domain: rabbitmq.sparki.tools
    • Issuer: letsencrypt-prod (or cluster issuer)
    • Secret: rabbitmq-tls
  • Ingress resource
    • Host: rabbitmq.sparki.tools
    • Backend: rabbitmq.rabbitmq.svc.cluster.local:15672
    • TLS enabled
    • Kong annotations for rate limiting (optional)
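
The two manifests above might be sketched as follows. The issuer name and TLS secret name come from this checklist; the ingressClassName, service name, and Kong annotation are assumptions to verify against the cluster's Kong installation.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: rabbitmq-tls
  namespace: rabbitmq
spec:
  secretName: rabbitmq-tls
  dnsNames:
    - rabbitmq.sparki.tools
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rabbitmq-management
  namespace: rabbitmq
  annotations:
    konghq.com/protocols: "https"   # assumption: Kong ingress controller annotation
spec:
  ingressClassName: kong
  tls:
    - hosts:
        - rabbitmq.sparki.tools
      secretName: rabbitmq-tls
  rules:
    - host: rabbitmq.sparki.tools
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rabbitmq          # management UI service on port 15672
                port:
                  number: 15672
```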

Verification

# Check certificate issued
kubectl get certificate -n rabbitmq

# Check ingress
kubectl get ingress -n rabbitmq

# Test access
curl -I https://rabbitmq.sparki.tools

Sign-off

  • TLS certificate issued
  • Management UI accessible at https://rabbitmq.sparki.tools
  • Authentication working

TASKSET 4: NetworkPolicy Allow Rules

Status: NOT STARTED
Owner: Platform Team
Depends On: TASKSET 2

Deliverables

  • Create infra/kubernetes-manifests/base/rabbitmq/networkpolicy.yaml
  • NetworkPolicy rules
    • Allow intra-cluster communication (port 25672 for clustering)
    • Allow api-engine namespace (port 5672)
    • Allow deploy-loco namespace (port 5672)
    • Allow Kong namespace (port 15672 for management UI)
    • Allow Prometheus namespace (port 15692 for metrics)
    • Deny all other ingress by default
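
A sketch of the NetworkPolicy implementing the rules above. The sparki-engine and monitoring namespace names appear elsewhere in this checklist; the deploy-loco and kong namespace names, and the pod label selector, are assumptions to confirm. An empty-from "deny" rule is unnecessary: once a NetworkPolicy selects the pods, all ingress not explicitly allowed is denied.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rabbitmq-allow
  namespace: rabbitmq
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: rabbitmq
  policyTypes:
    - Ingress
  ingress:
    # Intra-cluster traffic between broker pods
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: rabbitmq
      ports:
        - port: 25672   # clustering
        - port: 4369    # epmd peer discovery (assumption: needed for cluster formation)
    # AMQP from producer/consumer namespaces
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: sparki-engine
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: deploy-loco
      ports:
        - port: 5672
    # Management UI via Kong
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kong
      ports:
        - port: 15672
    # Metrics scraping by Prometheus
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - port: 15692
```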

Verification

# From api-engine pod
kubectl exec -n sparki-engine <pod> -- nc -zv rabbitmq.rabbitmq.svc.cluster.local 5672

# From unauthorized namespace (should fail)
kubectl exec -n default <pod> -- nc -zv rabbitmq.rabbitmq.svc.cluster.local 5672

Sign-off

  • api-engine can connect
  • deploy-loco can connect
  • Unauthorized namespaces blocked

TASKSET 5: api-engine Producer (Build Publish)

Status: NOT STARTED
Owner: Engine Team
Depends On: TASKSET 2, TASKSET 4

Deliverables

  • Create services/api-engine/internal/mq/ package
    mq/
    ├── connection.go      # Connection pool management
    ├── producer.go        # Publish with confirms, retry logic
    ├── config.go          # Configuration from env vars
    └── mq_test.go         # Unit tests
    
  • Connection pool
    • Pool size: 10 connections
    • Heartbeat: 60s
    • Auto-reconnect on failure
  • Producer implementation
    • Persistent delivery mode
    • Publisher confirms enabled
    • Priority support (0-10)
    • Message ID for idempotency
    • Exponential backoff retry (3 attempts)
  • Integration with build API
    • Replace in-memory queue (internal/executor/queue.go)
    • Publish build job to builds queue on API request

Verification

# Run tests
cd services/api-engine && go test ./internal/mq/...

# Integration test
curl -X POST https://api.sparki.tools/builds -d '{"project_id":"test"}'
# Check message in RabbitMQ management UI

Sign-off

  • Unit tests passing
  • Build jobs published to queue
  • Publisher confirms working
  • Retry logic tested

TASKSET 6: api-engine Consumer Workers

Status: NOT STARTED
Owner: Engine Team
Depends On: TASKSET 5

Deliverables

  • Create services/api-engine/internal/mq/consumer.go
    • Manual acknowledgment
    • Prefetch count: 10
    • Graceful shutdown (drain on SIGTERM)
    • Error classification (transient vs permanent)
  • Create services/api-engine/cmd/build-worker/main.go
    • Separate binary for worker processes
    • Configurable concurrency
    • Health check endpoint
  • Kubernetes manifests for workers
    • Deployment with HPA
    • Service for health checks
    • Resource limits
  • Error handling
    • Transient errors: nack + requeue
    • Permanent errors: nack without requeue (DLQ)
    • Max retry tracking via message headers
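
The error-handling rules above reduce to a small decision function: ack on success, requeue transient failures, and send permanent failures (or deliveries that have exhausted their retries, tracked via a message header) to the DLQ. The names below are illustrative, not the actual api-engine code.

```go
package main

import "fmt"

// ackAction is what the worker does with a delivery after handling it.
type ackAction string

const (
	ack         ackAction = "ack"
	nackRequeue ackAction = "nack+requeue" // transient: retry on the queue
	nackToDLQ   ackAction = "nack"         // no requeue: routed to the DLQ
)

// decide applies the classification rules: permanent errors and
// retry-exhausted deliveries are dead-lettered, everything else
// that failed is requeued.
func decide(failed, permanent bool, retryCount, maxRetries int) ackAction {
	switch {
	case !failed:
		return ack
	case permanent || retryCount >= maxRetries:
		return nackToDLQ
	default:
		return nackRequeue
	}
}

func main() {
	fmt.Println(decide(false, false, 0, 3)) // ack
	fmt.Println(decide(true, true, 0, 3))   // nack
	fmt.Println(decide(true, false, 3, 3))  // nack
	fmt.Println(decide(true, false, 1, 3))  // nack+requeue
}
```

Keeping this rule in one pure function makes the "invalid message lands in builds.failed" verification step straightforward to unit-test before touching a live broker.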

Verification

# Deploy workers
kubectl apply -k infra/kubernetes-manifests/overlays/dev/api-engine/

# Check workers consuming
kubectl logs -n sparki-engine -l app=build-worker

# Verify DLQ handling
# Publish invalid message, verify it lands in builds.failed

Sign-off

  • Workers processing messages
  • Graceful shutdown working
  • Failed messages go to DLQ
  • HPA scaling based on queue depth (requires an external/custom metric source, e.g. prometheus-adapter or KEDA; HPA cannot read queue depth natively)

TASKSET 7: deploy-loco Publisher

Status: NOT STARTED
Owner: Platform Team
Depends On: TASKSET 2, TASKSET 4

Deliverables

  • Create services/deploy-loco/src/mq/ module
    mq/
    ├── mod.rs           # Module exports
    ├── connection.rs    # Connection pool (lapin crate)
    ├── publisher.rs     # Publish with confirms
    └── config.rs        # Configuration
    
  • Publisher implementation
    • Using lapin crate for AMQP
    • Publisher confirms
    • Retry with backoff
  • Integration with deployment flow
    • Replace/augment PostgreSQL queue (src/worker/queue.rs)
    • Publish deployment notifications to deployments queue

Verification

# Run tests
cd services/deploy-loco && cargo test mq::

# Integration test
# Trigger deployment, verify message in deployments queue

Sign-off

  • Unit tests passing
  • Deployment notifications published
  • Publisher confirms working

TASKSET 8: DLQ Ops, Monitoring, Runbooks

Status: NOT STARTED
Owner: Platform Team
Depends On: TASKSET 6, TASKSET 7

Deliverables

  • Prometheus ServiceMonitor for RabbitMQ
    • Scrape metrics from port 15692
    • Labels for Grafana dashboards
  • Grafana dashboard
    • Queue depths
    • Message rates (publish/consume)
    • DLQ message counts
    • Connection counts
    • Consumer utilization
  • AlertManager rules
    • DLQ threshold (>10 messages)
    • Queue backup (>1000 messages)
    • Connection loss
    • Consumer starvation
  • Create services/observability-storm/runbooks/rabbitmq-dlq.md
    • DLQ inspection procedure
    • Message replay commands
    • Common failure patterns
    • Escalation paths
  • CLI tooling (optional)
    • sparki-admin mq inspect <queue>
    • sparki-admin mq replay <queue> --count N
    • sparki-admin mq purge <queue> (with confirmation)
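
The two alert thresholds above might be expressed as a PrometheusRule like the sketch below. Caveat: rabbitmq_queue_messages carries a per-queue label only when per-object metrics are enabled in the rabbitmq_prometheus plugin (e.g. return_per_object_metrics, or scraping the detailed endpoint); confirm the metric shape against the cluster's scrape config before relying on these expressions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rabbitmq-alerts
  namespace: monitoring
spec:
  groups:
    - name: rabbitmq
      rules:
        - alert: RabbitMQDLQBacklog
          expr: rabbitmq_queue_messages{queue=~".*\\.failed"} > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "DLQ {{ $labels.queue }} has {{ $value }} messages"
        - alert: RabbitMQQueueBackup
          expr: rabbitmq_queue_messages{queue!~".*\\.failed"} > 1000
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Queue {{ $labels.queue }} is backed up"
```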

Verification

# Check metrics scraping (run from a pod inside the cluster; the svc address is not reachable externally)
curl http://rabbitmq.rabbitmq.svc.cluster.local:15692/metrics | head

# Verify alerts configured
kubectl get prometheusrule -n monitoring

# Test DLQ alert
# Publish 15 messages to builds.failed, verify alert fires

Sign-off

  • Metrics visible in Grafana
  • Alerts configured and tested
  • Runbook reviewed by on-call team
  • DLQ replay procedure validated

Post-Deployment Verification

Functional Tests

# End-to-end: API to Worker to DLQ
# 1. Submit build via API
curl -X POST https://api.sparki.tools/builds -d '{"project_id":"test","commit_sha":"abc123"}'

# 2. Verify worker picked up job
kubectl logs -n sparki-engine -l app=build-worker --tail=100 | grep "Processing build"

# 3. Simulate failure to test DLQ
# (Implementation specific)

Performance Tests

# Publish throughput test (target: 100 msg/s)
# Consumer processing test (target: 10 concurrent)
# Queue backup recovery test

Rollback Procedure

If RabbitMQ deployment causes issues:
  1. Do not delete the PVCs; persisted messages would be lost
  2. Scale down consumers to stop processing
  3. Revert to previous queue implementation (in-memory/PostgreSQL)
  4. Debug RabbitMQ issues offline
  5. Replay any stuck messages once fixed

References