FNP - Deployment - Kubernetes Multi-Region Architecture

Summary (Explain Like I’m 5)

Imagine you have a restaurant, and customers are spread worldwide. You can’t handle all orders from a single kitchen. Solution: Multiple kitchens (replicas) in different cities:
  • New York kitchen handles US customers
  • London kitchen handles EU customers
  • Tokyo kitchen handles Asia customers
Each kitchen:
  • Follows same recipe (identical system)
  • Syncs inventory (replication)
  • Serves customers locally (low latency)
  • Handles failures automatically (redundancy)
FNP deployment works the same way, with Kubernetes orchestrating multiple replicas globally.

Technical Deep Dive

Multi-Region Kubernetes Architecture:
┌─────────────────────────────────────────────────────────────┐
│                    DNS & Global Load Balancer               │
└────────────┬────────────────┬────────────────┬──────────────┘
             │                │                │
    ┌────────▼───────┐  ┌─────▼────────┐  ┌────▼─────────────┐
    │  AWS EKS       │  │  GCP GKE     │  │  Azure AKS       │
    │  us-east-1     │  │  europe-west │  │  japaneast       │
    │                │  │              │  │                  │
    │ ┌────────────┐ │  │ ┌──────────┐ │  │ ┌──────────────┐ │
    │ │ FNP Pod 1  │ │  │ │ FNP Pod  │ │  │ │ FNP Pod      │ │
    │ │ FNP Pod 2  │ │  │ │(replicas)│ │  │ │ (replicas)   │ │
    │ │ FNP Pod 3  │ │  │ └──────────┘ │  │ └──────────────┘ │
    │ │(3 replicas)│ │  │ ┌──────────┐ │  │ ┌──────────────┐ │
    │ │            │ │  │ │Prometheus│ │  │ │ Postgres     │ │
    │ │ PostgreSQL │ │  │ │Monitoring│ │  │ │ Replication  │ │
    │ │ + Repl.    │ │  │ └──────────┘ │  │ └──────────────┘ │
    │ └────────────┘ │  │              │  │                  │
    └────────┬───────┘  └─────┬────────┘  └────┬─────────────┘
             │                │                │
    ┌────────▼────────────────▼────────────────▼──────┐
    │  Multi-Region Replication                       │
    │  (PostgreSQL WAL Streaming / DynamoDB Streams)  │
    │                                                 │
    │  Async Replication: RPO ~5 seconds              │
    │  RTO: Automatic failover ~30 seconds            │
    └─────────────────────────────────────────────────┘
Kubernetes Base Configuration:
# Base deployment (all regions)
- Namespace: fnp
- Database: PostgreSQL with 10GB PersistentVolume
- Service: LoadBalancer exposing on :8000
- HPA: Horizontal Pod Autoscaler (3-10 replicas based on CPU)
- ConfigMap: Shared application config
- Secrets: DB credentials, TLS certs
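The base items above could be expressed as manifests roughly like the following (a minimal sketch; the image tag, labels, ConfigMap/Secret names, and the 80% CPU target are assumptions not stated in the original):

```yaml
# base/deployment.yaml — sketch of the base Deployment (image and names assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fnp
  namespace: fnp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fnp
  template:
    metadata:
      labels:
        app: fnp
    spec:
      containers:
        - name: fnp
          image: fnp:latest            # placeholder image tag
          ports:
            - containerPort: 8000      # exposed via the LoadBalancer Service
          envFrom:
            - configMapRef:
                name: fnp-config       # shared application config
            - secretRef:
                name: fnp-secrets      # DB credentials, TLS certs
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
---
# base/hpa.yaml — HPA scaling 3-10 replicas on CPU, as listed above
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fnp
  namespace: fnp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fnp
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80       # assumed threshold
```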
Production Overlay (Regional):
fnp-production:
    namespace: fnp-production
    replicas: 5 # Increased from base 3
    resources:
        cpu: 500m (increased from 250m)
        memory: 1Gi (increased from 512Mi)
    affinity:
        podAntiAffinity: preferred # Spread across nodes
    rateLimit:
        rpm: 10,000
        rph: 100,000
    logging: structured JSON
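If the regional overlays are managed with Kustomize (an assumption; the original does not name the tool), the production overlay above would correspond to a patch roughly like this. File paths and patch names are illustrative:

```yaml
# overlays/production/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: fnp-production
resources:
  - ../../base
patches:
  - path: patch-deployment.yaml
---
# overlays/production/patch-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fnp
spec:
  replicas: 5                          # increased from base 3
  template:
    spec:
      affinity:
        podAntiAffinity:               # preferred: spread pods across nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: fnp
                topologyKey: kubernetes.io/hostname
      containers:
        - name: fnp
          resources:
            requests:
              cpu: 500m                # increased from 250m
              memory: 1Gi              # increased from 512Mi
```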
Service Mesh Integration (Istio):
VirtualService:
    hosts: [fnp.example.com]
    http:
        - match:
              - uri:
                    prefix: "/api/"
          route:
              - destination:
                    host: fnp-service
                    port:
                        number: 8000
                weight: 90 # 90% prod
              - destination:
                    host: fnp-service-canary
                    port:
                        number: 8000
                weight: 10 # 10% canary (Flagger)

DestinationRule:
    host: fnp-service
    trafficPolicy:
        connectionPool:
            tcp:
                maxConnections: 100
            http:
                http1MaxPendingRequests: 50
        outlierDetection:
            consecutive5xxErrors: 5
            interval: 10s
            baseEjectionTime: 30s
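As real Istio manifests, the two pseudo-configs above map onto something like the following sketch (hostnames, service names, and values taken from the blocks above; metadata names are assumptions):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fnp
spec:
  hosts:
    - fnp.example.com
  http:
    - match:
        - uri:
            prefix: "/api/"
      route:
        - destination:
            host: fnp-service
            port:
              number: 8000
          weight: 90                   # 90% stable
        - destination:
            host: fnp-service-canary
            port:
              number: 8000
          weight: 10                   # 10% canary (managed by Flagger)
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: fnp
spec:
  host: fnp-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
    outlierDetection:                  # eject hosts returning repeated 5xx
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
```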
Progressive Deployment (Flagger):
Canary:
    targetRef:
        name: fnp
    progressDeadlineSeconds: 300
    service:
        port: 8000
    analysis:
        interval: 30s
        threshold: 5
        maxWeight: 50
        stepWeight: 5 # 5% → 10% → 15% → … → 50%
        metrics:
            - name: request-success-rate
              thresholdRange:
                  min: 99
            - name: request-duration
              thresholdRange:
                  max: 500 # milliseconds
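The analysis spec above corresponds to a Flagger Canary resource along these lines (namespace and metric intervals are assumptions; `request-success-rate` and `request-duration` are Flagger's built-in metrics):

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: fnp
  namespace: fnp-production            # assumed namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fnp
  progressDeadlineSeconds: 300
  service:
    port: 8000
  analysis:
    interval: 30s
    threshold: 5                       # roll back after 5 failed checks
    maxWeight: 50
    stepWeight: 5                      # 5% → 10% → 15% → … → 50%
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99                      # require >99% success
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500                     # milliseconds
        interval: 1m
```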

Key Terms

  • Multi-Region → Deployments across AWS, GCP, Azure in different geographic regions
  • Replication → PostgreSQL WAL streaming; async replication RPO ~5 seconds
  • RTO/RPO → Recovery Time Objective (minutes), Recovery Point Objective (data loss window)
  • HPA → Horizontal Pod Autoscaler; automatic scaling 3-10 replicas based on metrics
  • Canary Deployment → Progressive rollout with metrics validation using Flagger
  • Service Mesh → Istio handling: mTLS, traffic management, observability
  • Progressive Delivery → traffic weight shifts in 5% steps (5% → 10% → 15% → … → 50%)
  • Observability → Prometheus (metrics), Jaeger (tracing), Grafana (dashboards)

Q/A

Q: How does a user connect to the nearest FNP server?
A: A global load balancer (AWS Route 53, Cloudflare, or GeoDNS) routes traffic based on geographic location: US users hit AWS EKS, EU users hit GCP GKE, Asia users hit Azure AKS. Geographic proximity cuts latency by roughly 100ms.

Q: What happens if the AWS region goes down?
A: Health checks detect the outage (pod heartbeats fail). The load balancer stops routing new traffic to us-east-1, existing connections are closed, and traffic reroutes to GCP/Azure (~30 seconds RTO). Data is recovered from PostgreSQL multi-region replication (RPO ~5 seconds).

Q: How are database writes replicated across regions?
A: The primary PostgreSQL in AWS streams its Write-Ahead Log (WAL) asynchronously to GCP and Azure replicas, which apply the changes in order. A slight lag is acceptable (eventual consistency). On primary failure, one replica is promoted to primary.

Q: Why use StatefulSets for PostgreSQL in Kubernetes?
A: StatefulSets ensure: (1) persistent volumes re-bind to the same pod, (2) stable DNS names for replication, (3) ordered scaling (no data loss), (4) a headless service for replication connections between pods.

Q: How does Flagger decide when to roll back?
A: Canary analysis runs every 30 seconds, polling success rate (target >99%) and latency (target <500ms). If either metric breaches its threshold, the canary is marked Failed, Istio traffic reverts to the stable version, and the canary pods are rolled back.

Q: What's the cost savings from Karpenter and spot instances?
A: Karpenter provisions spot instances (~80% cheaper) for non-critical workloads, with automatic fallback to on-demand if spot capacity is exhausted. A production on-demand NodePool plus a spot NodePool yields roughly 40-50% overall cost reduction.
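The StatefulSet points in the Q/A above can be sketched as a headless Service plus a StatefulSet (names, image, and replica count are assumptions for illustration):

```yaml
# Headless Service: gives each replica a stable DNS name, e.g.
# fnp-postgres-0.fnp-postgres.fnp.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: fnp-postgres
  namespace: fnp
spec:
  clusterIP: None                      # headless: used for replication connections
  selector:
    app: fnp-postgres
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: fnp-postgres
  namespace: fnp
spec:
  serviceName: fnp-postgres            # binds pods to the headless Service
  replicas: 2                          # primary + streaming replica (assumed)
  selector:
    matchLabels:
      app: fnp-postgres
  template:
    metadata:
      labels:
        app: fnp-postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16           # placeholder image
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:                # PV re-binds to the same pod on reschedule
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi              # the 10GB PersistentVolume from the base config
```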

Example / Analogy

Airline Network Analogy: FNP multi-region is like a global airline:
  • Hubs (Regions): New York (AWS), London (GCP), Tokyo (Azure) hubs
  • Flights (Replication): Nightly cargo flights sync passenger data
  • Progressive Boarding (Canary): New airplane model tested on 5% of routes before full deployment
  • Failover: If NYC hub goes down, flights reroute through London/Tokyo
  • Scaling: Add more flights during peak hours (HPA scaling)
  • Monitoring: Real-time dashboard tracks all flights, alerts on delays (Prometheus/Grafana)

Cross-References: System Overview, Cost Optimization, Observability, Service Mesh
Category: Infrastructure | Deployment | Kubernetes | Operations
Difficulty: Advanced ⭐⭐⭐⭐
Updated: 2025-11-28