# RabbitMQ Deployment Checklist

This checklist tracks the implementation of RabbitMQ message queue infrastructure for sparki.tools, following the 8-taskset execution strategy defined in Block 10.

## Pre-Deployment Verification

- [ ] AWS Secrets Manager secret `sparki/rabbitmq/credentials` created with `username` and `password` keys
- [ ] EKS cluster IAM role has permission to read the secret
- [ ] Certificate for `rabbitmq.sparki.tools` configured in AWS Certificate Manager (or cert-manager will provision one)
- [ ] DNS record for `rabbitmq.sparki.tools` ready (Route53 or external DNS)
## TASKSET 1: Finalize Contracts + Deployment Parameters

**Status:** IN PROGRESS
**Owner:** Platform Team

### Deliverables

- [ ] Update `TASKSET_EXECUTION_STRATEGY.md` Block 10 spec
  - Add `deployments.failed` DLQ definition
  - Update queue types to a mixed strategy (classic+HA for `builds`, quorum for the others)
  - Add `dead_letter_routing_key` to the `deployments` queue
  - Clarify DLX bindings
- [ ] Create `platform/platform-docs/system/rabbitmq-contract.yaml`
  - Connection parameters
  - Exchange definitions
  - Queue definitions with arguments
  - Message schemas (build_job, deployment_notification)
  - Producer requirements
  - Consumer requirements
  - Monitoring/alerting thresholds
  - Operational procedures
- [ ] Create `platform/platform-docs/runbooks/rabbitmq-deployment-checklist.md` (this file)
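A minimal skeleton for the contract file might look like the sketch below. All field names and values here are illustrative placeholders drawn from this checklist, not the agreed contract; the real structure is settled during TASKSET 1 review.

```yaml
# Sketch of platform/platform-docs/system/rabbitmq-contract.yaml (illustrative only)
connection:
  host: rabbitmq.rabbitmq.svc.cluster.local
  port: 5672
  heartbeat_seconds: 60
exchanges:
  - name: dlx
    type: topic
queues:
  - name: builds
    type: classic              # HA via the ha-all policy
    arguments:
      x-dead-letter-exchange: dlx
  - name: deployments
    type: quorum
    arguments:
      x-dead-letter-exchange: dlx
      x-dead-letter-routing-key: deployments.failed
schemas:
  build_job: {}                # defined during contract review
  deployment_notification: {}
monitoring:
  dlq_alert_threshold: 10
  queue_backup_threshold: 1000
```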
### Sign-off

- [ ] Platform lead review
- [ ] Contract approved by the api-engine team
- [ ] Contract approved by the deploy-loco team
## TASKSET 2: RabbitMQ Core (Helm via Kustomize) + Secrets + Definitions

**Status:** NOT STARTED
**Owner:** Platform Team
**Depends On:** TASKSET 1

### Deliverables

- [ ] Create `infra/kubernetes-manifests/base/rabbitmq/` directory structure
- [ ] `kustomization.yaml` with the Bitnami RabbitMQ Helm chart
  - Chart version: latest stable (3.x)
  - Replicas: 3
  - Persistence: enabled
  - Plugins: `rabbitmq_management`, `rabbitmq_prometheus`
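One possible shape for the kustomization, using kustomize's built-in `helmCharts` generator (which requires `kustomize build --enable-helm`). The chart version pin and the extra resource filenames are placeholders, not agreed names:

```yaml
# infra/kubernetes-manifests/base/rabbitmq/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: rabbitmq
helmCharts:
  - name: rabbitmq
    repo: https://charts.bitnami.com/bitnami
    version: "x.y.z"           # pin the latest stable chart version here
    releaseName: rabbitmq
    valuesFile: values.yaml
resources:
  - external-secret.yaml       # hypothetical filenames for the
  - definitions-configmap.yaml # ExternalSecret and definitions.json ConfigMap
```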
- [ ] `values.yaml` overrides
  - Auth from ExternalSecret
  - Clustering enabled
  - Resource requests/limits
  - Prometheus metrics enabled
  - Load definitions from ConfigMap
- [ ] ExternalSecrets pattern
  - SecretStore pointing to AWS Secrets Manager
  - ServiceAccount with IRSA annotation
  - ExternalSecret for `rabbitmq-credentials`
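The ExternalSecret could be sketched as follows, assuming the External Secrets Operator `v1beta1` API and a SecretStore named `aws-secrets-manager` (the store name and target key names are assumptions):

```yaml
# ExternalSecret syncing sparki/rabbitmq/credentials into the cluster (sketch)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: rabbitmq-credentials
  namespace: rabbitmq
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager    # SecretStore backed by the IRSA ServiceAccount
    kind: SecretStore
  target:
    name: rabbitmq-credentials   # Secret consumed by the Helm chart's auth config
  data:
    - secretKey: rabbitmq-username
      remoteRef:
        key: sparki/rabbitmq/credentials
        property: username
    - secretKey: rabbitmq-password
      remoteRef:
        key: sparki/rabbitmq/credentials
        property: password
```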
- [ ] `definitions.json` with
  - Exchanges: `dlx` (topic)
  - Queues: `builds`, `deployments`, `notifications`, `builds.failed`, `deployments.failed`
  - Bindings: DLX-to-DLQ bindings
  - Policies: `ha-all` for classic queues
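A condensed sketch of what `definitions.json` could contain, using RabbitMQ's definitions-import format. Routing keys and the `ha-all` pattern are illustrative and should match whatever the TASKSET 1 contract settles on:

```json
{
  "exchanges": [
    { "name": "dlx", "vhost": "/", "type": "topic", "durable": true }
  ],
  "queues": [
    { "name": "builds", "vhost": "/", "durable": true,
      "arguments": { "x-dead-letter-exchange": "dlx" } },
    { "name": "deployments", "vhost": "/", "durable": true,
      "arguments": { "x-queue-type": "quorum",
                     "x-dead-letter-exchange": "dlx",
                     "x-dead-letter-routing-key": "deployments.failed" } },
    { "name": "notifications", "vhost": "/", "durable": true,
      "arguments": { "x-queue-type": "quorum" } },
    { "name": "builds.failed", "vhost": "/", "durable": true, "arguments": {} },
    { "name": "deployments.failed", "vhost": "/", "durable": true, "arguments": {} }
  ],
  "bindings": [
    { "source": "dlx", "vhost": "/", "destination": "builds.failed",
      "destination_type": "queue", "routing_key": "builds.failed" },
    { "source": "dlx", "vhost": "/", "destination": "deployments.failed",
      "destination_type": "queue", "routing_key": "deployments.failed" }
  ],
  "policies": [
    { "name": "ha-all", "vhost": "/", "pattern": "^builds$",
      "apply-to": "queues", "definition": { "ha-mode": "all" } }
  ]
}
```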
### Verification

### Sign-off

- [ ] 3 replicas running
- [ ] Cluster formed successfully
- [ ] All queues created
- [ ] Secrets synced from AWS
## TASKSET 3: Management UI Ingress + TLS (Kong)

**Status:** NOT STARTED
**Owner:** Platform Team
**Depends On:** TASKSET 2

### Deliverables

- [ ] Create `infra/kubernetes-manifests/base/rabbitmq/ingress/` directory
- [ ] Certificate resource
  - Domain: `rabbitmq.sparki.tools`
  - Issuer: letsencrypt-prod (or cluster issuer)
  - Secret: `rabbitmq-tls`
- [ ] Ingress resource
  - Host: `rabbitmq.sparki.tools`
  - Backend: `rabbitmq.rabbitmq.svc.cluster.local:15672`
  - TLS enabled
  - Kong annotations for rate limiting (optional)
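The two resources above might be sketched as follows, assuming cert-manager's `cert-manager.io/v1` API and a Kong ingress class (the Service name `rabbitmq` and ingress class name are assumptions):

```yaml
# Certificate + Ingress for the management UI (sketch)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: rabbitmq-tls
  namespace: rabbitmq
spec:
  secretName: rabbitmq-tls
  dnsNames:
    - rabbitmq.sparki.tools
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rabbitmq-management
  namespace: rabbitmq
spec:
  ingressClassName: kong
  tls:
    - hosts: [rabbitmq.sparki.tools]
      secretName: rabbitmq-tls
  rules:
    - host: rabbitmq.sparki.tools
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rabbitmq        # the chart's management Service
                port:
                  number: 15672
```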
### Verification

### Sign-off

- [ ] TLS certificate issued
- [ ] Management UI accessible at `https://rabbitmq.sparki.tools`
- [ ] Authentication working
## TASKSET 4: NetworkPolicy Allow Rules

**Status:** NOT STARTED
**Owner:** Platform Team
**Depends On:** TASKSET 2

### Deliverables

- [ ] Create `infra/kubernetes-manifests/base/rabbitmq/networkpolicy.yaml`
- [ ] NetworkPolicy rules
  - Allow intra-cluster communication (port 25672 for clustering)
  - Allow the api-engine namespace (port 5672)
  - Allow the deploy-loco namespace (port 5672)
  - Allow the Kong namespace (port 15672 for the management UI)
  - Allow the Prometheus namespace (port 15692 for metrics)
  - Deny all other ingress by default
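The rules above could take roughly this shape. The namespace names (`kong`, `monitoring`) and the use of the standard `kubernetes.io/metadata.name` label are assumptions to adjust to the actual cluster layout:

```yaml
# infra/kubernetes-manifests/base/rabbitmq/networkpolicy.yaml (sketch)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rabbitmq-allow
  namespace: rabbitmq
spec:
  podSelector: {}              # applies to all RabbitMQ pods
  policyTypes: [Ingress]       # anything not matched below is denied
  ingress:
    - from:                    # clustering traffic between brokers
        - podSelector: {}
      ports:
        - port: 25672
    - from:                    # AMQP producers/consumers
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: api-engine }
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: deploy-loco }
      ports:
        - port: 5672
    - from:                    # management UI via Kong
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: kong }
      ports:
        - port: 15672
    - from:                    # Prometheus scraping
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: monitoring }
      ports:
        - port: 15692
```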
### Verification

### Sign-off

- [ ] api-engine can connect
- [ ] deploy-loco can connect
- [ ] Unauthorized namespaces blocked
## TASKSET 5: api-engine Producer (Build Publish)

**Status:** NOT STARTED
**Owner:** Engine Team
**Depends On:** TASKSET 2, TASKSET 4

### Deliverables

- [ ] Create `services/api-engine/internal/mq/` package
- [ ] Connection pool
  - Pool size: 10 connections
  - Heartbeat: 60s
  - Auto-reconnect on failure
- [ ] Producer implementation
  - Persistent delivery mode
  - Publisher confirms enabled
  - Priority support (0-10)
  - Message ID for idempotency
  - Exponential backoff retry (3 attempts)
- [ ] Integration with the build API
  - Replace the in-memory queue (`internal/executor/queue.go`)
  - Publish build jobs to the `builds` queue on API request
### Verification

### Sign-off

- [ ] Unit tests passing
- [ ] Build jobs published to queue
- [ ] Publisher confirms working
- [ ] Retry logic tested
## TASKSET 6: api-engine Consumer Workers

**Status:** NOT STARTED
**Owner:** Engine Team
**Depends On:** TASKSET 5

### Deliverables

- [ ] Create `services/api-engine/internal/mq/consumer.go`
  - Manual acknowledgment
  - Prefetch count: 10
  - Graceful shutdown (drain on SIGTERM)
  - Error classification (transient vs. permanent)
- [ ] Create `services/api-engine/cmd/build-worker/main.go`
  - Separate binary for worker processes
  - Configurable concurrency
  - Health check endpoint
- [ ] Kubernetes manifests for workers
  - Deployment with HPA
  - Service for health checks
  - Resource limits
- [ ] Error handling
  - Transient errors: nack + requeue
  - Permanent errors: nack without requeue (routes to the DLQ)
  - Max retry tracking via message headers
### Verification

### Sign-off

- [ ] Workers processing messages
- [ ] Graceful shutdown working
- [ ] Failed messages go to the DLQ
- [ ] HPA scaling based on queue depth
## TASKSET 7: deploy-loco Publisher

**Status:** NOT STARTED
**Owner:** Platform Team
**Depends On:** TASKSET 2, TASKSET 4

### Deliverables

- [ ] Create `services/deploy-loco/src/mq/` module
- [ ] Publisher implementation
  - Using the `lapin` crate for AMQP
  - Publisher confirms
  - Retry with backoff
- [ ] Integration with the deployment flow
  - Replace/augment the PostgreSQL queue (`src/worker/queue.rs`)
  - Publish deployment notifications to the `deployments` queue
### Verification

### Sign-off

- [ ] Unit tests passing
- [ ] Deployment notifications published
- [ ] Publisher confirms working
## TASKSET 8: DLQ Ops, Monitoring, Runbooks

**Status:** NOT STARTED
**Owner:** Platform Team
**Depends On:** TASKSET 6, TASKSET 7

### Deliverables

- [ ] Prometheus ServiceMonitor for RabbitMQ
  - Scrape metrics from port 15692
  - Labels for Grafana dashboards
- [ ] Grafana dashboard
  - Queue depths
  - Message rates (publish/consume)
  - DLQ message counts
  - Connection counts
  - Consumer utilization
- [ ] AlertManager rules
  - DLQ threshold (>10 messages)
  - Queue backup (>1000 messages)
  - Connection loss
  - Consumer starvation
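The first two alert rules might be sketched as a PrometheusRule like the one below. This assumes per-queue metrics are exposed (the `rabbitmq_prometheus` plugin aggregates by default; per-queue labels require `prometheus.return_per_object_metrics = true` or the detailed endpoint), and that the DLQ names end in `.failed` as defined in TASKSET 2:

```yaml
# PrometheusRule sketch for the DLQ and backup thresholds (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rabbitmq-alerts
  namespace: monitoring
spec:
  groups:
    - name: rabbitmq
      rules:
        - alert: RabbitMQDLQNotEmpty
          expr: rabbitmq_queue_messages{queue=~".*\\.failed"} > 10
          for: 5m
          labels: { severity: warning }
        - alert: RabbitMQQueueBackup
          expr: rabbitmq_queue_messages{queue!~".*\\.failed"} > 1000
          for: 10m
          labels: { severity: critical }
```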
- [ ] Create `services/observability-storm/runbooks/rabbitmq-dlq.md`
  - DLQ inspection procedure
  - Message replay commands
  - Common failure patterns
  - Escalation paths
- [ ] CLI tooling (optional)
  - `sparki-admin mq inspect <queue>`
  - `sparki-admin mq replay <queue> --count N`
  - `sparki-admin mq purge <queue>` (with confirmation)
### Verification

### Sign-off

- [ ] Metrics visible in Grafana
- [ ] Alerts configured and tested
- [ ] Runbook reviewed by the on-call team
- [ ] DLQ replay procedure validated
## Post-Deployment Verification

### Functional Tests

### Performance Tests

## Rollback Procedure

If the RabbitMQ deployment causes issues:

- Do not delete the PVCs (persisted messages would be lost)
- Scale down consumers to stop processing
- Revert to the previous queue implementation (in-memory/PostgreSQL)
- Debug the RabbitMQ issues offline
- Replay any stuck messages once fixed
## References

- Contract: `platform/platform-docs/system/rabbitmq-contract.yaml`
- Block 10 Spec: `platform/platform-docs/tasksets/TASKSET_EXECUTION_STRATEGY.md` (lines 1700-1910)
- Bitnami RabbitMQ Chart: https://github.com/bitnami/charts/tree/main/bitnami/rabbitmq
- RabbitMQ Quorum Queues: https://www.rabbitmq.com/quorum-queues.html