# RabbitMQ Deployment Checklist

This checklist tracks the implementation of RabbitMQ message queue infrastructure for sparki.tools, following the 8-taskset execution strategy defined in Block 10.

## Pre-Deployment Verification

- [ ] AWS Secrets Manager secret `sparki/rabbitmq/credentials` created with `username` and `password` keys
- [ ] EKS cluster IAM role has permission to read the secret
- [ ] Certificate for `rabbitmq.sparki.tools` configured in AWS Certificate Manager (or cert-manager will provision one)
- [ ] DNS record for `rabbitmq.sparki.tools` ready (Route53 or external DNS)
## TASKSET 1: Finalize Contracts + Deployment Parameters

**Status:** IN PROGRESS
**Owner:** Platform Team

### Deliverables

- [ ] Update `TASKSET_EXECUTION_STRATEGY.md` Block 10 spec
  - Add `deployments.failed` DLQ definition
  - Update queue types to a mixed strategy (classic+HA for `builds`, quorum for the others)
  - Add `dead_letter_routing_key` to the `deployments` queue
  - Clarify DLX bindings
- [ ] Create `platform/platform-docs/system/rabbitmq-contract.yaml`
  - Connection parameters
  - Exchange definitions
  - Queue definitions with arguments
  - Message schemas (build_job, deployment_notification)
  - Producer requirements
  - Consumer requirements
  - Monitoring/alerting thresholds
  - Operational procedures
- [ ] Create `platform/platform-docs/runbooks/rabbitmq-deployment-checklist.md` (this file)
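A minimal skeleton for the contract file might look like the sketch below. All field names and values here are illustrative placeholders drawn from this checklist, not the agreed contract; the real structure is settled during TASKSET 1 review.

```yaml
# Sketch of platform/platform-docs/system/rabbitmq-contract.yaml (illustrative only)
connection:
  host: rabbitmq.rabbitmq.svc.cluster.local
  port: 5672
  heartbeat_seconds: 60
exchanges:
  - name: dlx
    type: topic
queues:
  - name: builds
    type: classic              # HA via the ha-all policy
    arguments:
      x-dead-letter-exchange: dlx
  - name: deployments
    type: quorum
    arguments:
      x-dead-letter-exchange: dlx
      x-dead-letter-routing-key: deployments.failed
schemas:
  build_job: {}                # defined during contract review
  deployment_notification: {}
monitoring:
  dlq_alert_threshold: 10
  queue_backup_threshold: 1000
```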
### Sign-off

- [ ] Platform lead review
- [ ] Contract approved by the api-engine team
- [ ] Contract approved by the deploy-loco team
## TASKSET 2: RabbitMQ Core (Helm via Kustomize) + Secrets + Definitions

**Status:** NOT STARTED
**Owner:** Platform Team
**Depends On:** TASKSET 1

### Deliverables

- [ ] Create `infra/kubernetes-manifests/base/rabbitmq/` directory structure
- [ ] `kustomization.yaml` with the Bitnami RabbitMQ Helm chart
  - Chart version: latest stable (3.x)
  - Replicas: 3
  - Persistence: enabled
  - Plugins: `rabbitmq_management`, `rabbitmq_prometheus`
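One possible shape for the kustomization, using kustomize's built-in `helmCharts` generator (which requires `kustomize build --enable-helm`). The chart version pin and the extra resource filenames are placeholders, not agreed names:

```yaml
# infra/kubernetes-manifests/base/rabbitmq/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: rabbitmq
helmCharts:
  - name: rabbitmq
    repo: https://charts.bitnami.com/bitnami
    version: "x.y.z"           # pin the latest stable chart version here
    releaseName: rabbitmq
    valuesFile: values.yaml
resources:
  - external-secret.yaml       # hypothetical filenames for the
  - definitions-configmap.yaml # ExternalSecret and definitions.json ConfigMap
```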
- [ ] `values.yaml` overrides
  - Auth from ExternalSecret
  - Clustering enabled
  - Resource requests/limits
  - Prometheus metrics enabled
  - Load definitions from ConfigMap
- [ ] ExternalSecrets pattern
  - SecretStore pointing to AWS Secrets Manager
  - ServiceAccount with IRSA annotation
  - ExternalSecret for `rabbitmq-credentials`
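The ExternalSecret could be sketched as follows, assuming the External Secrets Operator `v1beta1` API and a SecretStore named `aws-secrets-manager` (the store name and target key names are assumptions):

```yaml
# ExternalSecret syncing sparki/rabbitmq/credentials into the cluster (sketch)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: rabbitmq-credentials
  namespace: rabbitmq
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager    # SecretStore backed by the IRSA ServiceAccount
    kind: SecretStore
  target:
    name: rabbitmq-credentials   # Secret consumed by the Helm chart's auth config
  data:
    - secretKey: rabbitmq-username
      remoteRef:
        key: sparki/rabbitmq/credentials
        property: username
    - secretKey: rabbitmq-password
      remoteRef:
        key: sparki/rabbitmq/credentials
        property: password
```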
- [ ] `definitions.json` with
  - Exchanges: `dlx` (topic)
  - Queues: `builds`, `deployments`, `notifications`, `builds.failed`, `deployments.failed`
  - Bindings: DLX-to-DLQ bindings
  - Policies: `ha-all` for classic queues
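A condensed sketch of what `definitions.json` could contain, using RabbitMQ's definitions-import format. Routing keys and the `ha-all` pattern are illustrative and should match whatever the TASKSET 1 contract settles on:

```json
{
  "exchanges": [
    { "name": "dlx", "vhost": "/", "type": "topic", "durable": true }
  ],
  "queues": [
    { "name": "builds", "vhost": "/", "durable": true,
      "arguments": { "x-dead-letter-exchange": "dlx" } },
    { "name": "deployments", "vhost": "/", "durable": true,
      "arguments": { "x-queue-type": "quorum",
                     "x-dead-letter-exchange": "dlx",
                     "x-dead-letter-routing-key": "deployments.failed" } },
    { "name": "notifications", "vhost": "/", "durable": true,
      "arguments": { "x-queue-type": "quorum" } },
    { "name": "builds.failed", "vhost": "/", "durable": true, "arguments": {} },
    { "name": "deployments.failed", "vhost": "/", "durable": true, "arguments": {} }
  ],
  "bindings": [
    { "source": "dlx", "vhost": "/", "destination": "builds.failed",
      "destination_type": "queue", "routing_key": "builds.failed" },
    { "source": "dlx", "vhost": "/", "destination": "deployments.failed",
      "destination_type": "queue", "routing_key": "deployments.failed" }
  ],
  "policies": [
    { "name": "ha-all", "vhost": "/", "pattern": "^builds$",
      "apply-to": "queues", "definition": { "ha-mode": "all" } }
  ]
}
```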
### Verification

### Sign-off

- [ ] 3 replicas running
- [ ] Cluster formed successfully
- [ ] All queues created
- [ ] Secrets synced from AWS
## TASKSET 3: Management UI Ingress + TLS (Kong)

**Status:** NOT STARTED
**Owner:** Platform Team
**Depends On:** TASKSET 2

### Deliverables

- [ ] Create `infra/kubernetes-manifests/base/rabbitmq/ingress/` directory
- [ ] Certificate resource
  - Domain: `rabbitmq.sparki.tools`
  - Issuer: letsencrypt-prod (or cluster issuer)
  - Secret: `rabbitmq-tls`
- [ ] Ingress resource
  - Host: `rabbitmq.sparki.tools`
  - Backend: `rabbitmq.rabbitmq.svc.cluster.local:15672`
  - TLS enabled
  - Kong annotations for rate limiting (optional)
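The two resources above might be sketched as follows, assuming cert-manager's `cert-manager.io/v1` API and a Kong ingress class (the Service name `rabbitmq` and ingress class name are assumptions):

```yaml
# Certificate + Ingress for the management UI (sketch)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: rabbitmq-tls
  namespace: rabbitmq
spec:
  secretName: rabbitmq-tls
  dnsNames:
    - rabbitmq.sparki.tools
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rabbitmq-management
  namespace: rabbitmq
spec:
  ingressClassName: kong
  tls:
    - hosts: [rabbitmq.sparki.tools]
      secretName: rabbitmq-tls
  rules:
    - host: rabbitmq.sparki.tools
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rabbitmq        # the chart's management Service
                port:
                  number: 15672
```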
### Verification

### Sign-off

- [ ] TLS certificate issued
- [ ] Management UI accessible at `https://rabbitmq.sparki.tools`
- [ ] Authentication working
## TASKSET 4: NetworkPolicy Allow Rules

**Status:** NOT STARTED
**Owner:** Platform Team
**Depends On:** TASKSET 2

### Deliverables

- [ ] Create `infra/kubernetes-manifests/base/rabbitmq/networkpolicy.yaml`
- [ ] NetworkPolicy rules
  - Allow intra-cluster communication (port 25672 for clustering)
  - Allow the api-engine namespace (port 5672)
  - Allow the deploy-loco namespace (port 5672)
  - Allow the Kong namespace (port 15672 for the management UI)
  - Allow the Prometheus namespace (port 15692 for metrics)
  - Deny all other ingress by default
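The rules above could take roughly this shape. The namespace names (`kong`, `monitoring`) and the use of the standard `kubernetes.io/metadata.name` label are assumptions to adjust to the actual cluster layout:

```yaml
# infra/kubernetes-manifests/base/rabbitmq/networkpolicy.yaml (sketch)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rabbitmq-allow
  namespace: rabbitmq
spec:
  podSelector: {}              # applies to all RabbitMQ pods
  policyTypes: [Ingress]       # anything not matched below is denied
  ingress:
    - from:                    # clustering traffic between brokers
        - podSelector: {}
      ports:
        - port: 25672
    - from:                    # AMQP producers/consumers
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: api-engine }
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: deploy-loco }
      ports:
        - port: 5672
    - from:                    # management UI via Kong
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: kong }
      ports:
        - port: 15672
    - from:                    # Prometheus scraping
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: monitoring }
      ports:
        - port: 15692
```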
### Verification

### Sign-off

- [ ] api-engine can connect
- [ ] deploy-loco can connect
- [ ] Unauthorized namespaces blocked
## TASKSET 5: api-engine Producer (Build Publish)

**Status:** NOT STARTED
**Owner:** Engine Team
**Depends On:** TASKSET 2, TASKSET 4

### Deliverables

- [ ] Create `services/api-engine/internal/mq/` package
- [ ] Connection pool
  - Pool size: 10 connections
  - Heartbeat: 60s
  - Auto-reconnect on failure
- [ ] Producer implementation
  - Persistent delivery mode
  - Publisher confirms enabled
  - Priority support (0-10)
  - Message ID for idempotency
  - Exponential backoff retry (3 attempts)
- [ ] Integration with the build API
  - Replace the in-memory queue (`internal/executor/queue.go`)
  - Publish build jobs to the `builds` queue on API request
### Verification

### Sign-off

- [ ] Unit tests passing
- [ ] Build jobs published to queue
- [ ] Publisher confirms working
- [ ] Retry logic tested
## TASKSET 6: api-engine Consumer Workers

**Status:** NOT STARTED
**Owner:** Engine Team
**Depends On:** TASKSET 5

### Deliverables

- [ ] Create `services/api-engine/internal/mq/consumer.go`
  - Manual acknowledgment
  - Prefetch count: 10
  - Graceful shutdown (drain on SIGTERM)
  - Error classification (transient vs. permanent)
- [ ] Create `services/api-engine/cmd/build-worker/main.go`
  - Separate binary for worker processes
  - Configurable concurrency
  - Health check endpoint
- [ ] Kubernetes manifests for workers
  - Deployment with HPA
  - Service for health checks
  - Resource limits
- [ ] Error handling
  - Transient errors: nack + requeue
  - Permanent errors: nack without requeue (routes to the DLQ)
  - Max retry tracking via message headers
### Verification

### Sign-off

- [ ] Workers processing messages
- [ ] Graceful shutdown working
- [ ] Failed messages go to the DLQ
- [ ] HPA scaling based on queue depth
## TASKSET 7: deploy-loco Publisher

**Status:** NOT STARTED
**Owner:** Platform Team
**Depends On:** TASKSET 2, TASKSET 4

### Deliverables

- [ ] Create `services/deploy-loco/src/mq/` module
- [ ] Publisher implementation
  - Using the `lapin` crate for AMQP
  - Publisher confirms
  - Retry with backoff
- [ ] Integration with the deployment flow
  - Replace/augment the PostgreSQL queue (`src/worker/queue.rs`)
  - Publish deployment notifications to the `deployments` queue
### Verification

### Sign-off

- [ ] Unit tests passing
- [ ] Deployment notifications published
- [ ] Publisher confirms working
## TASKSET 8: DLQ Ops, Monitoring, Runbooks

**Status:** NOT STARTED
**Owner:** Platform Team
**Depends On:** TASKSET 6, TASKSET 7

### Deliverables

- [ ] Prometheus ServiceMonitor for RabbitMQ
  - Scrape metrics from port 15692
  - Labels for Grafana dashboards
- [ ] Grafana dashboard
  - Queue depths
  - Message rates (publish/consume)
  - DLQ message counts
  - Connection counts
  - Consumer utilization
- [ ] AlertManager rules
  - DLQ threshold (>10 messages)
  - Queue backup (>1000 messages)
  - Connection loss
  - Consumer starvation
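The first two alert rules might be sketched as a PrometheusRule like the one below. This assumes per-queue metrics are exposed (the `rabbitmq_prometheus` plugin aggregates by default; per-queue labels require `prometheus.return_per_object_metrics = true` or the detailed endpoint), and that the DLQ names end in `.failed` as defined in TASKSET 2:

```yaml
# PrometheusRule sketch for the DLQ and backup thresholds (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rabbitmq-alerts
  namespace: monitoring
spec:
  groups:
    - name: rabbitmq
      rules:
        - alert: RabbitMQDLQNotEmpty
          expr: rabbitmq_queue_messages{queue=~".*\\.failed"} > 10
          for: 5m
          labels: { severity: warning }
        - alert: RabbitMQQueueBackup
          expr: rabbitmq_queue_messages{queue!~".*\\.failed"} > 1000
          for: 10m
          labels: { severity: critical }
```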
- [ ] Create `services/observability-storm/runbooks/rabbitmq-dlq.md`
  - DLQ inspection procedure
  - Message replay commands
  - Common failure patterns
  - Escalation paths
- [ ] CLI tooling (optional)
  - `sparki-admin mq inspect <queue>`
  - `sparki-admin mq replay <queue> --count N`
  - `sparki-admin mq purge <queue>` (with confirmation)
### Verification

### Sign-off

- [ ] Metrics visible in Grafana
- [ ] Alerts configured and tested
- [ ] Runbook reviewed by the on-call team
- [ ] DLQ replay procedure validated
## Post-Deployment Verification

### Functional Tests

### Performance Tests

## Rollback Procedure

If the RabbitMQ deployment causes issues:

- Do not delete the PVCs (persisted messages would be lost)
- Scale down consumers to stop processing
- Revert to the previous queue implementation (in-memory/PostgreSQL)
- Debug the RabbitMQ issues offline
- Replay any stuck messages once fixed
## References

- Contract: `platform/platform-docs/system/rabbitmq-contract.yaml`
- Block 10 Spec: `platform/platform-docs/tasksets/TASKSET_EXECUTION_STRATEGY.md` (lines 1700-1910)
- Bitnami RabbitMQ Chart: https://github.com/bitnami/charts/tree/main/bitnami/rabbitmq
- RabbitMQ Quorum Queues: https://www.rabbitmq.com/quorum-queues.html