Architecture Decision Records (ADRs)

Index: Sparki Architectural Decisions
Format: MADR 3.0

  1. ADR-001: Kubernetes for Container Orchestration
  2. ADR-002: PostgreSQL for Primary Data Store
  3. ADR-003: Redis for Caching Layer
  4. ADR-004: Terraform for Infrastructure as Code
  5. ADR-005: Multi-Region High Availability Architecture
  6. ADR-006: GitHub Actions for CI/CD
  7. ADR-007: gRPC + REST API Gateway
  8. ADR-008: Microservices Architecture Pattern

ADR-001: Kubernetes for Container Orchestration

Date: January 2025
Status: Accepted
Deciders: Architecture Team, DevOps Lead

Context

Sparki needed a container orchestration platform supporting:
  • Auto-scaling and load balancing
  • Rolling deployments and rollbacks
  • Multi-region deployment
  • Self-healing and high availability
  • Cost optimization for cloud infrastructure

Decision

Adopt Kubernetes (EKS on AWS) as the primary container orchestration platform.

Rationale

Advantages:
  • Industry Standard: Kubernetes is the de facto standard with massive community support
  • Scalability: Handles 100+ services across multiple regions seamlessly
  • Self-Healing: Automatic pod restart, health checking, and recovery
  • Rolling Updates: Zero-downtime deployments with automatic rollback
  • Multi-Cloud Ready: Can migrate to GKE or AKS without application changes
  • Cost Optimization: Spot instances, autoscaling, resource optimization tools
  • Declarative Configuration: Infrastructure as Code via manifests/Helm
Why Not Alternatives:
  • Docker Swarm: Impractical beyond a single datacenter, much smaller ecosystem
  • ECS: AWS-only, vendor lock-in, less flexible than Kubernetes
  • Nomad: Good but smaller ecosystem, fewer third-party tools
  • Cloud Run: Serverless-only, no multi-region control, cold-start latency

Consequences

Positive:
  • ✅ Attracts talent familiar with Kubernetes
  • ✅ Rich ecosystem of tools (Helm, ArgoCD, Istio, etc.)
  • ✅ Strong vendor support (AWS EKS, managed control plane)
  • ✅ Excellent multi-region and HA capabilities
Negative:
  • ❌ Operational complexity (learning curve, troubleshooting)
  • ❌ Additional cost for managed Kubernetes vs. simpler alternatives
  • ❌ Requires skilled DevOps/SRE team

Implementation Notes

# Cluster Configuration
- AWS EKS (Kubernetes 1.27+)
- 10 worker nodes, t3.large instances
- Multi-AZ deployment (3 availability zones)
- Auto-scaling: 10-50 nodes based on load
- Pod Security Standards: restricted profile enforced
- Network Policy: Enabled, default deny egress/ingress
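
The 10-50 node autoscaling range above follows the same proportional rule the Kubernetes Horizontal Pod Autoscaler documents for pods: desired = ceil(current × currentMetric / targetMetric), clamped to the configured bounds. A minimal illustration of that rule (plain Python, not the actual controller):

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 10, max_r: int = 50) -> int:
    """Proportional scaling rule, clamped to the 10-50 range above."""
    desired = math.ceil(current * metric / target)
    return max(min_r, min(max_r, desired))

# CPU at 80% against a 50% target: scale 10 -> 16
print(desired_replicas(10, metric=0.80, target=0.50))  # -> 16
```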

ADR-002: PostgreSQL for Primary Data Store

Date: January 2025
Status: Accepted
Deciders: Data Lead, Backend Architects

Context

Sparki handles user accounts, project data, audit logs, and operational metadata requiring:
  • ACID compliance for data integrity
  • Complex queries and reporting
  • Strong consistency guarantees
  • Audit trail capabilities
  • Schema evolution support

Decision

Use PostgreSQL as the primary relational database.

Rationale

Advantages:
  • ACID Compliance: Strong consistency guarantees for financial/audit data
  • Advanced Features: JSON/JSONB, arrays, range types, full-text search
  • Performance: Excellent query performance with proper indexing
  • Reliability: Used in production at scale by Fortune 500 companies
  • Open Source: Community-driven, no vendor lock-in
  • Security: Row-level security, encryption at rest/in transit supported
  • Replication: Native streaming replication for HA and read scaling
Why Not Alternatives:
  • MySQL: Less advanced features, weaker JSON support
  • MongoDB: CAP theorem trade-offs (eventual consistency), schema-less issues at scale
  • DynamoDB: Serverless advantage offset by high costs and eventual consistency
  • Cassandra: Eventual consistency, complex operational model

Consequences

Positive:
  • ✅ Excellent for operational data and analytics
  • ✅ Strong consistency prevents data corruption
  • ✅ Mature backup and recovery tools
  • ✅ Good cost/performance ratio at scale
Negative:
  • ❌ Vertical scaling limits (need read replicas for horizontal)
  • ❌ Schema changes require careful planning at large scale
  • ❌ Replication lag on read replicas (eventual consistency for reads)

Implementation Notes

# PostgreSQL Configuration
Primary:
    - RDS Multi-AZ (us-east-1 and us-west-2)
    - Version: PostgreSQL 15
    - Instance: db.r6i.2xlarge
    - Storage: 500GB gp3
    - Backup: Daily automated, 30-day retention
    - Encryption: At-rest (AWS KMS), in-transit (SSL)

Read Replicas:
    - 2 replicas per region for read scaling
    - Async replication (monitoring for lag)
    - Cross-region replicas for disaster recovery

Connection Pooling:
    - PgBouncer (transaction pooling)
    - Max connections: 200 per replica
    - Pool size: 50 per application pod
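
The replica-plus-pooling layout above implies read/write splitting at the application layer: writes must always reach the primary, while reads can fan out across replicas. A sketch of that routing decision (class name and the naive SQL classification are illustrative, not Sparki's actual code):

```python
import itertools

class ReadWriteRouter:
    """Route writes to the primary and spread reads across replicas
    round-robin, mirroring the replica layout described above."""

    def __init__(self, primary_dsn: str, replica_dsns: list[str]):
        self.primary_dsn = primary_dsn
        self._replicas = itertools.cycle(replica_dsns)

    def dsn_for(self, sql: str) -> str:
        # Naive classification: anything that is not a plain SELECT
        # goes to the primary (writes must never hit a replica).
        if sql.lstrip().lower().startswith("select"):
            return next(self._replicas)
        return self.primary_dsn

router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.dsn_for("SELECT * FROM users"))       # -> replica-1
print(router.dsn_for("UPDATE users SET name='x'"))  # -> primary
```

Real deployments usually delegate this to PgBouncer-aware drivers or a proxy; the point is only that read traffic and replication lag (noted under Consequences) are handled deliberately.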

ADR-003: Redis for Caching Layer

Date: January 2025
Status: Accepted
Deciders: Platform Architecture, Performance Team

Context

Sparki experiences cache-heavy access patterns:
  • Session/authentication tokens
  • Project metadata (frequently accessed)
  • Rate limit counters
  • Real-time metrics aggregation
  • Temporary computation results

Decision

Use Redis in cluster mode, with automatic failover, for high-availability caching.

Rationale

Advantages:
  • Performance: Sub-millisecond response times (in-memory)
  • Data Structures: Native support for strings, lists, sets, sorted sets, streams
  • High Availability: Automatic failover (built into cluster mode; Sentinel covers non-clustered deployments)
  • Replication: Primary-replica replication across AZs
  • Persistence: Optional persistence (RDB snapshots, AOF logs)
  • Pub/Sub: Real-time messaging for WebSocket support
Why Not Alternatives:
  • Memcached: No persistence, no HA without external tooling
  • DynamoDB: Much slower for cache use case, unnecessary overhead
  • Elasticsearch: Overkill for simple caching, write-heavy
  • Local Memory: Single-node failure = data loss, no sharing across pods

Consequences

Positive:
  • ✅ Dramatic performance improvement (10-100x faster than DB)
  • ✅ Reduces database load and costs
  • ✅ Enables real-time features (Pub/Sub, WebSockets)
Negative:
  • ❌ Additional operational complexity (failover and replication monitoring)
  • ❌ Increased infrastructure cost
  • ❌ Memory limits (can’t cache entire dataset)

Implementation Notes

# Redis Configuration
Cluster Setup:
    - Redis Cluster (6 nodes, 3 replicas)
    - ElastiCache (Redis 7.0)
    - Node Type: cache.r6g.xlarge
    - Automatic failover enabled
    - Multi-AZ deployment

Caching Strategy:
    - Cache-aside pattern (app checks Redis first)
    - TTL: 1 hour (configurable per key type)
    - Session: 24 hours with refresh on activity
    - Metrics: 5 minutes
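
The cache-aside pattern above can be sketched in a few lines: check the cache, fall back to the loader (the database) on a miss, then populate the cache with a TTL. An in-memory stand-in for Redis (illustrative only):

```python
import time

class CacheAside:
    """Cache-aside with per-key TTL: check the cache first, fall back
    to the loader (the database), then populate the cache."""

    def __init__(self, clock=time.monotonic):
        self._store: dict[str, tuple[float, object]] = {}
        self._clock = clock

    def get(self, key: str, loader, ttl: float = 3600.0):
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]                        # cache hit
        value = loader(key)                        # cache miss: hit the DB
        self._store[key] = (self._clock() + ttl, value)
        return value

cache = CacheAside()
calls = []
load = lambda k: calls.append(k) or f"row:{k}"
cache.get("user:1", load)   # miss -> loader runs
cache.get("user:1", load)   # hit  -> loader skipped
print(len(calls))           # -> 1
```

With Redis the same flow is GET, then on miss a DB query followed by SET with EX (the TTL), which is what the 1-hour default above configures.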

Monitoring:
    - CPU and memory utilization
    - Key eviction rate
    - Network bytes in/out
    - Cache hit/miss ratio (tracked in application)

ADR-004: Terraform for Infrastructure as Code

Date: January 2025
Status: Accepted
Deciders: Infrastructure Lead, CloudOps Team

Context

Sparki requires infrastructure management across:
  • Multiple AWS regions and availability zones
  • Kubernetes clusters, networking, IAM, databases
  • Consistent reproducible deployments
  • Change tracking and auditability
  • Disaster recovery and environment parity

Decision

Use Terraform as the primary Infrastructure as Code (IaC) tool.

Rationale

Advantages:
  • Multi-Cloud: Supports AWS, GCP, Azure, Kubernetes with same syntax
  • Declarative: Define desired state, Terraform handles implementation
  • Plan Before Apply: Review changes before execution (terraform plan)
  • State Management: Track resource relationships and state
  • Modules: Reusable components across projects
  • Community: Largest provider ecosystem (1000+ providers)
Why Not Alternatives:
  • CloudFormation: AWS-only, limited multi-cloud support
  • Pulumi: Good but requires programming language expertise
  • CDK: AWS-only, language-specific (TypeScript/Python)
  • Ansible: Imperative (how), not declarative (what)

Consequences

Positive:
  • ✅ All infrastructure in version control
  • ✅ Consistent deployments across environments
  • ✅ Easy to scale to multiple regions
  • ✅ Clear audit trail of infrastructure changes
Negative:
  • ❌ Terraform state management complexity
  • ❌ Learning curve for team members
  • ❌ Terraform bugs can cause cascading failures

Implementation Notes

# Project Structure
terraform/
  ├── environments/
  │   ├── dev/
  │   ├── staging/
  │   └── production/
  ├── modules/
  │   ├── vpc/
  │   ├── eks/
  │   ├── rds/
  │   ├── networking/
  │   ├── iam/
  │   └── security/
  └── shared/
      └── backend.tf (state management)

# Key Modules Implemented
- pod-security-standards: Kubernetes PSS enforcement
- encryption-at-rest: KMS keys, vault, secrets encryption
- encryption-in-transit: cert-manager, TLS, mTLS
- compliance-automation: Auditing, monitoring, scanning

# State Management
- Remote state in S3 with DynamoDB locking
- Encrypted at-rest (AWS KMS)
- Versioning enabled for rollback
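
The DynamoDB locking above serializes concurrent applies: acquiring the lock is a conditional write that fails if a lock item for that state path already exists. The semantics can be sketched with an in-memory stand-in (not the actual Terraform backend):

```python
class StateLock:
    """In-memory stand-in for the DynamoDB lock table: acquiring is a
    conditional put that fails if the item already exists."""

    def __init__(self):
        self._locks: dict[str, str] = {}

    def acquire(self, state_path: str, holder: str) -> bool:
        if state_path in self._locks:   # condition: item must not exist
            return False
        self._locks[state_path] = holder
        return True

    def release(self, state_path: str, holder: str) -> bool:
        if self._locks.get(state_path) != holder:
            return False                # only the holder may unlock
        del self._locks[state_path]
        return True

lock = StateLock()
print(lock.acquire("envs/production/terraform.tfstate", "ci-run-1"))  # -> True
print(lock.acquire("envs/production/terraform.tfstate", "ci-run-2"))  # -> False
```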

ADR-005: Multi-Region High Availability Architecture

Date: February 2025
Status: Accepted
Deciders: Architecture Committee, Reliability Lead

Context

Sparki targets global users and must support:
  • 99.99% uptime SLA (four nines)
  • Sub-second latency globally
  • Automatic failover during regional outages
  • Data consistency across regions
  • Compliance with data residency requirements

Decision

Deploy active-passive multi-region architecture with automatic failover.

Rationale

Design:
Primary Region (us-east-1):
  - Primary EKS cluster (active)
  - Primary RDS (read-write)
  - Redis primary
  - Application services

Secondary Region (us-west-2):
  - Warm-standby EKS cluster (serves read-only traffic)
  - RDS read replica (async replication)
  - Redis replica
  - Read-only service instances

Global Layer:
  - Route53 for DNS failover
  - CloudFront for static content CDN
  - Inter-region networking (VPC peering/Transit Gateway)
Advantages:
  • High Availability: Tolerates full regional failure
  • Fast Recovery: RTO ~5 minutes via automated failover
  • Data Protection: Replication provides backup
  • Reduced Latency: Read-only access from nearest region
  • Compliance: Can maintain data residency requirements
Why Not Active-Active:
  • Complexity: Distributed transactions, eventual consistency issues
  • Cost: Double infrastructure cost
  • Data Consistency: CAP theorem trade-offs

Consequences

Positive:
  • ✅ Meets 99.99% uptime requirement
  • ✅ Single region failure doesn’t impact users
  • ✅ Read scaling to secondary region
Negative:
  • ❌ Significant infrastructure cost (2x)
  • ❌ Complex disaster recovery procedures
  • ❌ Data consistency challenges for writes

Implementation Notes

Failover Mechanism:
    - Health checks every 30 seconds
    - Route53 automatic failover DNS
    - RDS read replica promotion to primary
    - Redis replication monitoring
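
A failover trigger typically requires several consecutive failed health checks before flipping DNS, so a single dropped 30-second probe does not cause a regional swing. A sketch of that decision logic (the threshold of 3 is an assumption, not Sparki's configured value):

```python
class FailoverMonitor:
    """Fail over only after `threshold` consecutive failed health
    checks, so one dropped probe does not flip DNS."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._failures = 0
        self.active = "us-east-1"
        self.standby = "us-west-2"

    def observe(self, healthy: bool) -> str:
        if healthy:
            self._failures = 0          # any success resets the counter
        else:
            self._failures += 1
            if self._failures >= self.threshold:
                self.active, self.standby = self.standby, self.active
                self._failures = 0
        return self.active

m = FailoverMonitor()
for ok in [True, False, False, False]:
    region = m.observe(ok)
print(region)  # -> us-west-2
```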

Data Replication:
    - RDS: Async cross-region replication
    - Redis: Master-slave replication
    - Application: Event streaming to secondary (Kafka)
    - Backups: Stored in both regions

ADR-006: GitHub Actions for CI/CD

Date: February 2025
Status: Accepted
Deciders: DevOps, Engineering Leads

Context

Sparki requires automated:
  • Testing on every commit (unit, integration, e2e)
  • Building and pushing Docker images
  • Deploying to Kubernetes environments
  • Security scanning and compliance checks
  • Release management and versioning

Decision

Use GitHub Actions as the primary CI/CD platform.

Rationale

Advantages:
  • GitHub Native: Deep integration with repositories, PRs, releases
  • Free: Generous free tier for public/private repos
  • Self-Hosted: Can use self-hosted runners for custom environments
  • Ecosystem: 15,000+ pre-built actions available
  • Scalability: Matrix builds across multiple OS/versions
  • Secrets Management: Integrated secrets storage and rotation
Why Not Alternatives:
  • Jenkins: Powerful but requires managing servers
  • GitLab CI: Good but requires GitLab (different ecosystem)
  • CircleCI: Good but additional monthly cost
  • Travis CI: Declining market share, higher pricing

Consequences

Positive:
  • ✅ No separate infrastructure to manage
  • ✅ Fast builds with caching
  • ✅ Easy integration with GitHub (PRs, releases)
  • ✅ Cost-effective for team
Negative:
  • ❌ Vendor lock-in to GitHub
  • ❌ Limited customization vs self-hosted
  • ❌ Minute limits on free tier

Implementation Notes

# Workflow Structure
.github/workflows/
  ├── test.yml           # Run tests on PR
  ├── build.yml          # Build images on merge
  ├── deploy.yml         # Deploy to staging/prod
  ├── security-scan.yml  # SAST/DAST scanning
  └── release.yml        # Tag and release

# Pipeline Stages
1. Lint & Format Check (golangci-lint, prettier)
2. Unit Tests (Go, TypeScript, Python)
3. Integration Tests (Docker Compose)
4. E2E Tests (Playwright)
5. Security Scans (SonarQube, Trivy, Snyk)
6. Build & Push Images (ECR)
7. Deploy to Kubernetes (Terraform, ArgoCD)
8. Smoke Tests (verify deployment)
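
Stage 8 is usually a poll-with-retry loop against the service health endpoint: the deployment passes only if the endpoint reports healthy within a bounded number of probes. A generic sketch (the check function is injected, so any HTTP or gRPC probe fits):

```python
import time

def wait_for_healthy(check, attempts: int = 5, delay: float = 0.0) -> bool:
    """Poll a health check until it passes or attempts run out.
    `check` is any zero-argument callable returning True when healthy."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)  # back off between probes
    return False

# Simulate a deployment that becomes healthy on the third probe:
state = iter([False, False, True])
print(wait_for_healthy(lambda: next(state)))  # -> True
```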

ADR-007: gRPC + REST API Gateway

Date: February 2025
Status: Accepted
Deciders: API Design, Backend Architecture

Context

Sparki services communicate through:
  • External REST APIs (web, mobile, integrations)
  • Internal service-to-service communication
  • Real-time APIs (WebSockets, Server-Sent Events)
  • Mobile push notifications and webhooks

Decision

Use gRPC for internal service communication with REST gateway for external APIs.

Rationale

┌─────────────────────────────────────────────┐
│              External Clients               │
│   (Web, Mobile, Third-party integrations)   │
└──────────────────────┬──────────────────────┘
                       │ REST/gRPC-Web
                       ▼
            ┌─────────────────────┐
            │     API Gateway     │
            │    (Envoy/Kong)     │
            └──────────┬──────────┘
                       │ gRPC
           ┌───────────┼───────────┐
           ▼           ▼           ▼
       ┌───────┐   ┌───────┐   ┌───────┐
       │ User  │   │Project│   │Service│
       │  Svc  │   │  Svc  │   │  Svc  │
       └───────┘   └───────┘   └───────┘
Advantages:
  • gRPC Internal: Markedly faster than REST/JSON for service-to-service calls (binary encoding, HTTP/2), strongly typed, multiplexed
  • Protocol Buffers: Language-agnostic, excellent for polyglot services
  • REST External: Broad compatibility, familiar to integrations
  • Gateway Pattern: Decouples internal and external APIs
  • Streaming: Native server-, client-, and bidirectional streaming (backs SSE/WebSocket endpoints at the gateway)
Why Not REST Everywhere:
  • Overhead: JSON encoding, HTTP/1.1 overhead
  • Performance: Less suitable for high-frequency service calls
  • Type Safety: No schema enforcement without additional tooling
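
The encoding-overhead point can be made concrete: a JSON body repeats field names and text-encodes numbers on every call, while a fixed binary layout does not. A rough comparison using Python's struct as a stand-in for the protobuf wire format (not actual protobuf encoding):

```python
import json
import struct

payload = {"user_id": 12345, "active": True}

# JSON carries the field names and digits as text...
as_json = json.dumps(payload).encode()

# ...while a binary layout packs a uint32 and a bool into 5 bytes.
as_binary = struct.pack("<I?", payload["user_id"], payload["active"])

print(len(as_json), len(as_binary))  # -> 34 5
```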

Consequences

Positive:
  • ✅ Excellent internal performance and efficiency
  • ✅ Language-agnostic service definitions
  • ✅ Streaming capabilities (real-time updates)
  • ✅ Better resource utilization
Negative:
  • ❌ Learning curve for gRPC and Protocol Buffers
  • ❌ Debugging gRPC (binary protocol)
  • ❌ Gateway adds latency and complexity

Implementation Notes

// Example service definition (user.proto)
syntax = "proto3";

import "google/protobuf/empty.proto";

service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListUsers(ListUsersRequest) returns (ListUsersResponse);
  rpc CreateUser(CreateUserRequest) returns (User);
  rpc StreamUserUpdates(google.protobuf.Empty) returns (stream UserEvent);
}

message User {
  string id = 1;
  string email = 2;
  string name = 3;
}

// Request/response messages referenced above
message GetUserRequest { string id = 1; }
message CreateUserRequest { string email = 1; string name = 2; }
message ListUsersRequest { int32 page_size = 1; string page_token = 2; }
message ListUsersResponse { repeated User users = 1; string next_page_token = 2; }
message UserEvent { string type = 1; User user = 2; }

ADR-008: Microservices Architecture Pattern

Date: February 2025
Status: Accepted
Deciders: Architecture Team, CTO

Context

Sparki has diverse workloads:
  • User management and authentication
  • Project and service management
  • Infrastructure provisioning and monitoring
  • Real-time notifications and events
  • Analytics and reporting
Each requires different scaling, technology, and deployment patterns.

Decision

Adopt microservices architecture with service mesh (Istio).

Rationale

Service Decomposition:
┌──────────────────────────────────────────────────────┐
│                 Sparki Microservices                 │
├──────────────────────────────────────────────────────┤
│                                                      │
│  API Gateway    Auth Service    User Service         │
│                                                      │
│  Project Service    Infrastructure Service           │
│                                                      │
│  Notification Service    Analytics Service           │
│                                                      │
│  Webhook Service    Event Bus                        │
│                                                      │
├──────────────────────────────────────────────────────┤
│      Istio Service Mesh (mTLS, load balancing)       │
├──────────────────────────────────────────────────────┤
│        Kubernetes (orchestration, networking)        │
└──────────────────────────────────────────────────────┘
Advantages:
  • Independent Scaling: Scale services based on load
  • Technology Diversity: Use best tool per service
  • Fault Isolation: One service failure doesn’t cascade
  • Deployment Independence: Deploy services separately
  • Team Autonomy: Teams own their services
Why Not Monolith:
  • Scaling Inflexibility: Scale entire app or nothing
  • Technology Lock-in: All services use same stack
  • Deployment Coupling: Minute changes require full redeploy
  • Team Bottlenecks: All teams modify shared codebase

Consequences

Positive:
  • ✅ High scalability and performance
  • ✅ Technology flexibility per service
  • ✅ Team independence and faster development
  • ✅ Fault tolerance and resilience
Negative:
  • ❌ Operational complexity (debugging distributed systems)
  • ❌ Network latency between services
  • ❌ Data consistency challenges
  • ❌ Requires strong DevOps culture

Implementation Notes

Service Catalog:
    - api-gateway: Envoy, handles routing
    - auth-service: JWT/OAuth, session management
    - user-service: User CRUD, profiles
    - project-service: Projects, teams, permissions
    - infrastructure-service: Kubernetes resources, provisioning
    - notification-service: Email, Slack, push
    - analytics-service: Metrics, dashboards, reports

Communication:
    - Synchronous: gRPC (internal), REST (external)
    - Asynchronous: Kafka event bus
    - Real-time: WebSockets via notification service
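
The asynchronous path above decouples producers from consumers: a service publishes an event to a topic, and every subscribed service receives it independently. An in-memory stand-in for the Kafka event bus (illustrative only; real consumers are separate processes with offsets and retries):

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for the Kafka event bus: services subscribe
    to topics and publishers fan events out to every subscriber."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict):
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
received = []
bus.subscribe("user.created", received.append)                      # e.g. notification-service
bus.subscribe("user.created", lambda e: received.append({"audit": e["id"]}))  # e.g. analytics-service
bus.publish("user.created", {"id": "u-1"})
print(len(received))  # -> 2
```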

Observability:
    - Traces: Jaeger (distributed tracing)
    - Metrics: Prometheus (time-series)
    - Logs: ELK stack (centralized logging)
    - Health: Istio + Kubernetes health probes
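
Distributed tracing only joins spans across services if every hop forwards the same trace id. A minimal sketch of that propagation (the `x-trace-id` header name is illustrative; real deployments use W3C `traceparent` or Jaeger's propagation headers):

```python
import uuid

def handle_request(headers: dict) -> dict:
    """Reuse an incoming trace id or start a new trace, then forward
    the same id on downstream calls so the tracer can join the spans."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    downstream_headers = {"x-trace-id": trace_id}
    return downstream_headers

incoming = {"x-trace-id": "abc123"}
print(handle_request(incoming))  # -> {'x-trace-id': 'abc123'}
```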

Decision Log

ADR  Title                                   Status    Date        Impact
001  Kubernetes for Container Orchestration  Accepted  2025-01-15  High
002  PostgreSQL for Primary Data Store       Accepted  2025-01-15  High
003  Redis for Caching Layer                 Accepted  2025-01-15  Medium
004  Terraform for Infrastructure as Code    Accepted  2025-01-15  High
005  Multi-Region HA Architecture            Accepted  2025-02-01  High
006  GitHub Actions for CI/CD                Accepted  2025-02-01  Medium
007  gRPC + REST API Gateway                 Accepted  2025-02-01  High
008  Microservices Architecture Pattern      Accepted  2025-02-01  High

Contributing to ADRs

Process for New ADRs

  1. Identify Decision: Architecture change affecting multiple services
  2. Create ADR: Use MADR 3.0 template
  3. Discussion: Present to architecture committee
  4. Acceptance: Requires consensus from affected leads
  5. Documentation: Add to this index, version control

Template

# ADR-NNN: Title

**Date:** YYYY-MM-DD  
**Status:** Proposed | Accepted | Deprecated  
**Deciders:** Names of decision makers

## Context

## Decision

## Rationale

## Consequences

## Implementation Notes

## References