Architecture Decision Records (ADRs)

Index: Sparki Architectural Decisions
Format: MADR 3.0

  1. ADR-001: Kubernetes for Container Orchestration
  2. ADR-002: PostgreSQL for Primary Data Store
  3. ADR-003: Redis for Caching Layer
  4. ADR-004: Terraform for Infrastructure as Code
  5. ADR-005: Multi-Region High Availability Architecture
  6. ADR-006: GitHub Actions for CI/CD
  7. ADR-007: gRPC + REST API Gateway
  8. ADR-008: Microservices Architecture Pattern

ADR-001: Kubernetes for Container Orchestration

Date: January 2025
Status: Accepted
Deciders: Architecture Team, DevOps Lead

Context

Sparki needed a container orchestration platform supporting:
  • Auto-scaling and load balancing
  • Rolling deployments and rollbacks
  • Multi-region deployment
  • Self-healing and high availability
  • Cost optimization for cloud infrastructure

Decision

Adopt Kubernetes (EKS on AWS) as the primary container orchestration platform.

Rationale

Advantages:
  • Industry Standard: Kubernetes is the de facto standard with massive community support
  • Scalability: Handles 100+ services across multiple regions seamlessly
  • Self-Healing: Automatic pod restart, health checking, and recovery
  • Rolling Updates: Zero-downtime deployments with automatic rollback
  • Multi-Cloud Ready: Can migrate to GKE or AKS without application changes
  • Cost Optimization: Spot instances, autoscaling, resource optimization tools
  • Declarative Configuration: Infrastructure as Code via manifests/Helm
Why Not Alternatives:
  • Docker Swarm: Impractical beyond a single datacenter, much smaller ecosystem
  • ECS: AWS-only, vendor lock-in, less flexible than Kubernetes
  • Nomad: Good but smaller ecosystem, fewer third-party tools
  • Cloud Run: Serverless-only, no multi-region control, cold-start latency

Consequences

Positive:
  • ✅ Attracts talent familiar with Kubernetes
  • ✅ Rich ecosystem of tools (Helm, ArgoCD, Istio, etc.)
  • ✅ Strong vendor support (AWS EKS, managed control plane)
  • ✅ Excellent multi-region and HA capabilities
Negative:
  • ❌ Operational complexity (learning curve, troubleshooting)
  • ❌ Additional cost for managed Kubernetes vs. simpler alternatives
  • ❌ Requires skilled DevOps/SRE team

Implementation Notes

# Cluster Configuration
- AWS EKS (Kubernetes 1.27+)
- 10 worker nodes, t3.large instances
- Multi-AZ deployment (3 availability zones)
- Auto-scaling: 10-50 nodes based on load
- Pod Security Standards: restricted profile enforced
- Network Policy: Enabled, default deny egress/ingress
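
The 10-50 node autoscaling range above follows the same proportional rule the Kubernetes Horizontal Pod Autoscaler documents for pods: desired = ceil(current × currentMetric / targetMetric), clamped to the configured bounds. A minimal illustration of that rule (plain Python, not the actual controller):

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 10, max_r: int = 50) -> int:
    """Proportional scaling rule, clamped to the 10-50 range above."""
    desired = math.ceil(current * metric / target)
    return max(min_r, min(max_r, desired))

# CPU at 80% against a 50% target: scale 10 -> 16
print(desired_replicas(10, metric=0.80, target=0.50))  # -> 16
```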

ADR-002: PostgreSQL for Primary Data Store

Date: January 2025
Status: Accepted
Deciders: Data Lead, Backend Architects

Context

Sparki handles user accounts, project data, audit logs, and operational metadata requiring:
  • ACID compliance for data integrity
  • Complex queries and reporting
  • Strong consistency guarantees
  • Audit trail capabilities
  • Schema evolution support

Decision

Use PostgreSQL as the primary relational database.

Rationale

Advantages:
  • ACID Compliance: Strong consistency guarantees for financial/audit data
  • Advanced Features: JSON/JSONB, arrays, range types, full-text search
  • Performance: Excellent query performance with proper indexing
  • Reliability: Used in production at scale by Fortune 500 companies
  • Open Source: Community-driven, no vendor lock-in
  • Security: Row-level security, encryption at rest/in transit supported
  • Replication: Native streaming replication for HA and read scaling
Why Not Alternatives:
  • MySQL: Less advanced features, weaker JSON support
  • MongoDB: CAP theorem trade-offs (eventual consistency), schema-less issues at scale
  • DynamoDB: Serverless advantage offset by high costs and eventual consistency
  • Cassandra: Eventual consistency, complex operational model

Consequences

Positive:
  • ✅ Excellent for operational data and analytics
  • ✅ Strong consistency prevents data corruption
  • ✅ Mature backup and recovery tools
  • ✅ Good cost/performance ratio at scale
Negative:
  • ❌ Vertical scaling limits (need read replicas for horizontal)
  • ❌ Schema changes require careful planning at large scale
  • ❌ Replication lag on read replicas (eventual consistency for reads)

Implementation Notes

# PostgreSQL Configuration
Primary:
    - RDS Multi-AZ (us-east-1 and us-west-2)
    - Version: PostgreSQL 15
    - Instance: db.r6i.2xlarge
    - Storage: 500GB gp3
    - Backup: Daily automated, 30-day retention
    - Encryption: At-rest (AWS KMS), in-transit (SSL)

Read Replicas:
    - 2 replicas per region for read scaling
    - Async replication (monitoring for lag)
    - Cross-region replicas for disaster recovery

Connection Pooling:
    - PgBouncer (transaction pooling)
    - Max connections: 200 per replica
    - Pool size: 50 per application pod
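
The replica-plus-pooling layout above implies read/write splitting at the application layer: writes must always reach the primary, while reads can fan out across replicas. A sketch of that routing decision (class name and the naive SQL classification are illustrative, not Sparki's actual code):

```python
import itertools

class ReadWriteRouter:
    """Route writes to the primary and spread reads across replicas
    round-robin, mirroring the replica layout described above."""

    def __init__(self, primary_dsn: str, replica_dsns: list[str]):
        self.primary_dsn = primary_dsn
        self._replicas = itertools.cycle(replica_dsns)

    def dsn_for(self, sql: str) -> str:
        # Naive classification: anything that is not a plain SELECT
        # goes to the primary (writes must never hit a replica).
        if sql.lstrip().lower().startswith("select"):
            return next(self._replicas)
        return self.primary_dsn

router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.dsn_for("SELECT * FROM users"))       # -> replica-1
print(router.dsn_for("UPDATE users SET name='x'"))  # -> primary
```

Real deployments usually delegate this to PgBouncer-aware drivers or a proxy; the point is only that read traffic and replication lag (noted under Consequences) are handled deliberately.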

ADR-003: Redis for Caching Layer

Date: January 2025
Status: Accepted
Deciders: Platform Architecture, Performance Team

Context

Sparki experiences cache-heavy access patterns:
  • Session/authentication tokens
  • Project metadata (frequently accessed)
  • Rate limit counters
  • Real-time metrics aggregation
  • Temporary computation results

Decision

Use Redis in cluster mode, with automatic failover, for high-availability caching.

Rationale

Advantages:
  • Performance: Sub-millisecond response times (in-memory)
  • Data Structures: Native support for strings, lists, sets, sorted sets, streams
  • High Availability: Automatic failover (built into cluster mode; Sentinel covers non-clustered deployments)
  • Replication: Primary-replica replication across AZs
  • Persistence: Optional persistence (RDB snapshots, AOF logs)
  • Pub/Sub: Real-time messaging for WebSocket support
Why Not Alternatives:
  • Memcached: No persistence, no HA without external tooling
  • DynamoDB: Much slower for cache use case, unnecessary overhead
  • Elasticsearch: Overkill for simple caching, write-heavy
  • Local Memory: Single-node failure = data loss, no sharing across pods

Consequences

Positive:
  • ✅ Dramatic performance improvement (10-100x faster than DB)
  • ✅ Reduces database load and costs
  • ✅ Enables real-time features (Pub/Sub, WebSockets)
Negative:
  • ❌ Additional operational complexity (failover and replication monitoring)
  • ❌ Increased infrastructure cost
  • ❌ Memory limits (can’t cache entire dataset)

Implementation Notes

# Redis Configuration
Cluster Setup:
    - Redis Cluster (6 nodes, 3 replicas)
    - ElastiCache (Redis 7.0)
    - Node Type: cache.r6g.xlarge
    - Automatic failover enabled
    - Multi-AZ deployment

Caching Strategy:
    - Cache-aside pattern (app checks Redis first)
    - TTL: 1 hour (configurable per key type)
    - Session: 24 hours with refresh on activity
    - Metrics: 5 minutes
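
The cache-aside pattern above can be sketched in a few lines: check the cache, fall back to the loader (the database) on a miss, then populate the cache with a TTL. An in-memory stand-in for Redis (illustrative only):

```python
import time

class CacheAside:
    """Cache-aside with per-key TTL: check the cache first, fall back
    to the loader (the database), then populate the cache."""

    def __init__(self, clock=time.monotonic):
        self._store: dict[str, tuple[float, object]] = {}
        self._clock = clock

    def get(self, key: str, loader, ttl: float = 3600.0):
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]                        # cache hit
        value = loader(key)                        # cache miss: hit the DB
        self._store[key] = (self._clock() + ttl, value)
        return value

cache = CacheAside()
calls = []
load = lambda k: calls.append(k) or f"row:{k}"
cache.get("user:1", load)   # miss -> loader runs
cache.get("user:1", load)   # hit  -> loader skipped
print(len(calls))           # -> 1
```

With Redis the same flow is GET, then on miss a DB query followed by SET with EX (the TTL), which is what the 1-hour default above configures.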

Monitoring:
    - CPU and memory utilization
    - Key eviction rate
    - Network bytes in/out
    - Cache hit/miss ratio (tracked in application)

ADR-004: Terraform for Infrastructure as Code

Date: January 2025
Status: Accepted
Deciders: Infrastructure Lead, CloudOps Team

Context

Sparki requires infrastructure management across:
  • Multiple AWS regions and availability zones
  • Kubernetes clusters, networking, IAM, databases
  • Consistent reproducible deployments
  • Change tracking and auditability
  • Disaster recovery and environment parity

Decision

Use Terraform as the primary Infrastructure as Code (IaC) tool.

Rationale

Advantages:
  • Multi-Cloud: Supports AWS, GCP, Azure, Kubernetes with same syntax
  • Declarative: Define desired state, Terraform handles implementation
  • Plan Before Apply: Review changes before execution (terraform plan)
  • State Management: Track resource relationships and state
  • Modules: Reusable components across projects
  • Community: Largest provider ecosystem (1000+ providers)
Why Not Alternatives:
  • CloudFormation: AWS-only, limited multi-cloud support
  • Pulumi: Good but requires programming language expertise
  • CDK: AWS-only, language-specific (TypeScript/Python)
  • Ansible: Imperative (how), not declarative (what)

Consequences

Positive:
  • ✅ All infrastructure in version control
  • ✅ Consistent deployments across environments
  • ✅ Easy to scale to multiple regions
  • ✅ Clear audit trail of infrastructure changes
Negative:
  • ❌ Terraform state management complexity
  • ❌ Learning curve for team members
  • ❌ Terraform bugs can cause cascading failures

Implementation Notes

# Project Structure
terraform/
  ├── environments/
  │   ├── dev/
  │   ├── staging/
  │   └── production/
  ├── modules/
  │   ├── vpc/
  │   ├── eks/
  │   ├── rds/
  │   ├── networking/
  │   ├── iam/
  │   └── security/
  └── shared/
      └── backend.tf (state management)

# Key Modules Implemented
- pod-security-standards: Kubernetes PSS enforcement
- encryption-at-rest: KMS keys, vault, secrets encryption
- encryption-in-transit: cert-manager, TLS, mTLS
- compliance-automation: Auditing, monitoring, scanning

# State Management
- Remote state in S3 with DynamoDB locking
- Encrypted at-rest (AWS KMS)
- Versioning enabled for rollback
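
The DynamoDB locking above serializes concurrent applies: acquiring the lock is a conditional write that fails if a lock item for that state path already exists. The semantics can be sketched with an in-memory stand-in (not the actual Terraform backend):

```python
class StateLock:
    """In-memory stand-in for the DynamoDB lock table: acquiring is a
    conditional put that fails if the item already exists."""

    def __init__(self):
        self._locks: dict[str, str] = {}

    def acquire(self, state_path: str, holder: str) -> bool:
        if state_path in self._locks:   # condition: item must not exist
            return False
        self._locks[state_path] = holder
        return True

    def release(self, state_path: str, holder: str) -> bool:
        if self._locks.get(state_path) != holder:
            return False                # only the holder may unlock
        del self._locks[state_path]
        return True

lock = StateLock()
print(lock.acquire("envs/production/terraform.tfstate", "ci-run-1"))  # -> True
print(lock.acquire("envs/production/terraform.tfstate", "ci-run-2"))  # -> False
```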

ADR-005: Multi-Region High Availability Architecture

Date: February 2025
Status: Accepted
Deciders: Architecture Committee, Reliability Lead

Context

Sparki targets global users and must support:
  • 99.99% uptime SLA (four nines)
  • Sub-second latency globally
  • Automatic failover during regional outages
  • Data consistency across regions
  • Compliance with data residency requirements

Decision

Deploy active-passive multi-region architecture with automatic failover.

Rationale

Design:
Primary Region (us-east-1):
  - Primary EKS cluster (active)
  - Primary RDS (read-write)
  - Redis primary
  - Application services

Secondary Region (us-west-2):
  - Warm-standby EKS cluster (serves read-only traffic)
  - RDS read replica (async replication)
  - Redis replica
  - Read-only service instances

Global Layer:
  - Route53 for DNS failover
  - CloudFront for static content CDN
  - Inter-region networking (VPC peering/Transit Gateway)
Advantages:
  • High Availability: Tolerates full regional failure
  • Fast Recovery: RTO ~5 minutes via automated failover
  • Data Protection: Replication provides backup
  • Reduced Latency: Read-only access from nearest region
  • Compliance: Can maintain data residency requirements
Why Not Active-Active:
  • Complexity: Distributed transactions, eventual consistency issues
  • Cost: Double infrastructure cost
  • Data Consistency: CAP theorem trade-offs

Consequences

Positive:
  • ✅ Meets 99.99% uptime requirement
  • ✅ Single region failure doesn’t impact users
  • ✅ Read scaling to secondary region
Negative:
  • ❌ Significant infrastructure cost (2x)
  • ❌ Complex disaster recovery procedures
  • ❌ Data consistency challenges for writes

Implementation Notes

Failover Mechanism:
    - Health checks every 30 seconds
    - Route53 automatic failover DNS
    - RDS read replica promotion to primary
    - Redis replication monitoring
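
A failover trigger typically requires several consecutive failed health checks before flipping DNS, so a single dropped 30-second probe does not cause a regional swing. A sketch of that decision logic (the threshold of 3 is an assumption, not Sparki's configured value):

```python
class FailoverMonitor:
    """Fail over only after `threshold` consecutive failed health
    checks, so one dropped probe does not flip DNS."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._failures = 0
        self.active = "us-east-1"
        self.standby = "us-west-2"

    def observe(self, healthy: bool) -> str:
        if healthy:
            self._failures = 0          # any success resets the counter
        else:
            self._failures += 1
            if self._failures >= self.threshold:
                self.active, self.standby = self.standby, self.active
                self._failures = 0
        return self.active

m = FailoverMonitor()
for ok in [True, False, False, False]:
    region = m.observe(ok)
print(region)  # -> us-west-2
```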

Data Replication:
    - RDS: Async cross-region replication
    - Redis: Master-slave replication
    - Application: Event streaming to secondary (Kafka)
    - Backups: Stored in both regions

ADR-006: GitHub Actions for CI/CD

Date: February 2025
Status: Accepted
Deciders: DevOps, Engineering Leads

Context

Sparki requires automated:
  • Testing on every commit (unit, integration, e2e)
  • Building and pushing Docker images
  • Deploying to Kubernetes environments
  • Security scanning and compliance checks
  • Release management and versioning

Decision

Use GitHub Actions as the primary CI/CD platform.

Rationale

Advantages:
  • GitHub Native: Deep integration with repositories, PRs, releases
  • Free: Generous free tier for public/private repos
  • Self-Hosted: Can use self-hosted runners for custom environments
  • Ecosystem: 15,000+ pre-built actions available
  • Scalability: Matrix builds across multiple OS/versions
  • Secrets Management: Integrated secrets storage and rotation
Why Not Alternatives:
  • Jenkins: Powerful but requires managing servers
  • GitLab CI: Good but requires GitLab (different ecosystem)
  • CircleCI: Good but additional monthly cost
  • Travis CI: Declining market share, higher pricing

Consequences

Positive:
  • ✅ No separate infrastructure to manage
  • ✅ Fast builds with caching
  • ✅ Easy integration with GitHub (PRs, releases)
  • ✅ Cost-effective for team
Negative:
  • ❌ Vendor lock-in to GitHub
  • ❌ Limited customization vs self-hosted
  • ❌ Minute limits on free tier

Implementation Notes

# Workflow Structure
.github/workflows/
  ├── test.yml           # Run tests on PR
  ├── build.yml          # Build images on merge
  ├── deploy.yml         # Deploy to staging/prod
  ├── security-scan.yml  # SAST/DAST scanning
  └── release.yml        # Tag and release

# Pipeline Stages
1. Lint & Format Check (golangci-lint, prettier)
2. Unit Tests (Go, TypeScript, Python)
3. Integration Tests (Docker Compose)
4. E2E Tests (Playwright)
5. Security Scans (SonarQube, Trivy, Snyk)
6. Build & Push Images (ECR)
7. Deploy to Kubernetes (Terraform, ArgoCD)
8. Smoke Tests (verify deployment)
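
Stage 8 is usually a poll-with-retry loop against the service health endpoint: the deployment passes only if the endpoint reports healthy within a bounded number of probes. A generic sketch (the check function is injected, so any HTTP or gRPC probe fits):

```python
import time

def wait_for_healthy(check, attempts: int = 5, delay: float = 0.0) -> bool:
    """Poll a health check until it passes or attempts run out.
    `check` is any zero-argument callable returning True when healthy."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)  # back off between probes
    return False

# Simulate a deployment that becomes healthy on the third probe:
state = iter([False, False, True])
print(wait_for_healthy(lambda: next(state)))  # -> True
```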

ADR-007: gRPC + REST API Gateway

Date: February 2025
Status: Accepted
Deciders: API Design, Backend Architecture

Context

Sparki services communicate through:
  • External REST APIs (web, mobile, integrations)
  • Internal service-to-service communication
  • Real-time APIs (WebSockets, Server-Sent Events)
  • Mobile push notifications and webhooks

Decision

Use gRPC for internal service communication with REST gateway for external APIs.

Rationale

┌─────────────────────────────────────────────┐
│              External Clients               │
│   (Web, Mobile, Third-party integrations)   │
└──────────────────────┬──────────────────────┘
                       │ REST/gRPC-Web
                       ▼
            ┌─────────────────────┐
            │     API Gateway     │
            │    (Envoy/Kong)     │
            └──────────┬──────────┘
                       │ gRPC
           ┌───────────┼───────────┐
           ▼           ▼           ▼
       ┌───────┐   ┌───────┐   ┌───────┐
       │ User  │   │Project│   │Service│
       │  Svc  │   │  Svc  │   │  Svc  │
       └───────┘   └───────┘   └───────┘
Advantages:
  • gRPC Internal: Markedly faster than REST/JSON for service-to-service calls (binary encoding, HTTP/2), strongly typed, multiplexed
  • Protocol Buffers: Language-agnostic, excellent for polyglot services
  • REST External: Broad compatibility, familiar to integrations
  • Gateway Pattern: Decouples internal and external APIs
  • Streaming: Native server-, client-, and bidirectional streaming (backs SSE/WebSocket endpoints at the gateway)
Why Not REST Everywhere:
  • Overhead: JSON encoding, HTTP/1.1 overhead
  • Performance: Less suitable for high-frequency service calls
  • Type Safety: No schema enforcement without additional tooling
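
The encoding-overhead point can be made concrete: a JSON body repeats field names and text-encodes numbers on every call, while a fixed binary layout does not. A rough comparison using Python's struct as a stand-in for the protobuf wire format (not actual protobuf encoding):

```python
import json
import struct

payload = {"user_id": 12345, "active": True}

# JSON carries the field names and digits as text...
as_json = json.dumps(payload).encode()

# ...while a binary layout packs a uint32 and a bool into 5 bytes.
as_binary = struct.pack("<I?", payload["user_id"], payload["active"])

print(len(as_json), len(as_binary))  # -> 34 5
```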

Consequences

Positive:
  • ✅ Excellent internal performance and efficiency
  • ✅ Language-agnostic service definitions
  • ✅ Streaming capabilities (real-time updates)
  • ✅ Better resource utilization
Negative:
  • ❌ Learning curve for gRPC and Protocol Buffers
  • ❌ Debugging gRPC (binary protocol)
  • ❌ Gateway adds latency and complexity

Implementation Notes

// Example service definition (user.proto)
syntax = "proto3";

import "google/protobuf/empty.proto";

service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListUsers(ListUsersRequest) returns (ListUsersResponse);
  rpc CreateUser(CreateUserRequest) returns (User);
  rpc StreamUserUpdates(google.protobuf.Empty) returns (stream UserEvent);
}

message User {
  string id = 1;
  string email = 2;
  string name = 3;
}

// Request/response messages referenced above
message GetUserRequest { string id = 1; }
message CreateUserRequest { string email = 1; string name = 2; }
message ListUsersRequest { int32 page_size = 1; string page_token = 2; }
message ListUsersResponse { repeated User users = 1; string next_page_token = 2; }
message UserEvent { string type = 1; User user = 2; }

ADR-008: Microservices Architecture Pattern

Date: February 2025
Status: Accepted
Deciders: Architecture Team, CTO

Context

Sparki has diverse workloads:
  • User management and authentication
  • Project and service management
  • Infrastructure provisioning and monitoring
  • Real-time notifications and events
  • Analytics and reporting
Each requires different scaling, technology, and deployment patterns.

Decision

Adopt microservices architecture with service mesh (Istio).

Rationale

Service Decomposition:
┌──────────────────────────────────────────────────────┐
│                 Sparki Microservices                 │
├──────────────────────────────────────────────────────┤
│                                                      │
│  API Gateway    Auth Service    User Service         │
│                                                      │
│  Project Service    Infrastructure Service           │
│                                                      │
│  Notification Service    Analytics Service           │
│                                                      │
│  Webhook Service    Event Bus                        │
│                                                      │
├──────────────────────────────────────────────────────┤
│      Istio Service Mesh (mTLS, load balancing)       │
├──────────────────────────────────────────────────────┤
│        Kubernetes (orchestration, networking)        │
└──────────────────────────────────────────────────────┘
Advantages:
  • Independent Scaling: Scale services based on load
  • Technology Diversity: Use best tool per service
  • Fault Isolation: One service failure doesn’t cascade
  • Deployment Independence: Deploy services separately
  • Team Autonomy: Teams own their services
Why Not Monolith:
  • Scaling Inflexibility: Scale entire app or nothing
  • Technology Lock-in: All services use same stack
  • Deployment Coupling: Minute changes require full redeploy
  • Team Bottlenecks: All teams modify shared codebase

Consequences

Positive:
  • ✅ High scalability and performance
  • ✅ Technology flexibility per service
  • ✅ Team independence and faster development
  • ✅ Fault tolerance and resilience
Negative:
  • ❌ Operational complexity (debugging distributed systems)
  • ❌ Network latency between services
  • ❌ Data consistency challenges
  • ❌ Requires strong DevOps culture

Implementation Notes

Service Catalog:
    - api-gateway: Envoy, handles routing
    - auth-service: JWT/OAuth, session management
    - user-service: User CRUD, profiles
    - project-service: Projects, teams, permissions
    - infrastructure-service: Kubernetes resources, provisioning
    - notification-service: Email, Slack, push
    - analytics-service: Metrics, dashboards, reports

Communication:
    - Synchronous: gRPC (internal), REST (external)
    - Asynchronous: Kafka event bus
    - Real-time: WebSockets via notification service
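
The asynchronous path above decouples producers from consumers: a service publishes an event to a topic, and every subscribed service receives it independently. An in-memory stand-in for the Kafka event bus (illustrative only; real consumers are separate processes with offsets and retries):

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for the Kafka event bus: services subscribe
    to topics and publishers fan events out to every subscriber."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict):
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
received = []
bus.subscribe("user.created", received.append)                      # e.g. notification-service
bus.subscribe("user.created", lambda e: received.append({"audit": e["id"]}))  # e.g. analytics-service
bus.publish("user.created", {"id": "u-1"})
print(len(received))  # -> 2
```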

Observability:
    - Traces: Jaeger (distributed tracing)
    - Metrics: Prometheus (time-series)
    - Logs: ELK stack (centralized logging)
    - Health: Istio + Kubernetes health probes
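
Distributed tracing only joins spans across services if every hop forwards the same trace id. A minimal sketch of that propagation (the `x-trace-id` header name is illustrative; real deployments use W3C `traceparent` or Jaeger's propagation headers):

```python
import uuid

def handle_request(headers: dict) -> dict:
    """Reuse an incoming trace id or start a new trace, then forward
    the same id on downstream calls so the tracer can join the spans."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    downstream_headers = {"x-trace-id": trace_id}
    return downstream_headers

incoming = {"x-trace-id": "abc123"}
print(handle_request(incoming))  # -> {'x-trace-id': 'abc123'}
```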

Decision Log

ADR  Title                                   Status    Date        Impact
001  Kubernetes for Container Orchestration  Accepted  2025-01-15  High
002  PostgreSQL for Primary Data Store       Accepted  2025-01-15  High
003  Redis for Caching Layer                 Accepted  2025-01-15  Medium
004  Terraform for Infrastructure as Code    Accepted  2025-01-15  High
005  Multi-Region HA Architecture            Accepted  2025-02-01  High
006  GitHub Actions for CI/CD                Accepted  2025-02-01  Medium
007  gRPC + REST API Gateway                 Accepted  2025-02-01  High
008  Microservices Architecture Pattern      Accepted  2025-02-01  High

Contributing to ADRs

Process for New ADRs

  1. Identify Decision: Architecture change affecting multiple services
  2. Create ADR: Use MADR 3.0 template
  3. Discussion: Present to architecture committee
  4. Acceptance: Requires consensus from affected leads
  5. Documentation: Add to this index, version control

Template

# ADR-NNN: Title

**Date:** YYYY-MM-DD  
**Status:** Proposed | Accepted | Deprecated  
**Deciders:** Names of decision makers

## Context

## Decision

## Rationale

## Consequences

## Implementation Notes

## References