FNP - Operations - Cost Optimization & Karpenter Spot Instances

Summary (Explain Like I’m 5)

Running servers costs money. You pay per hour for compute:
  • Premium VM (always on): $2/hour
  • Spot VM (spare capacity): $0.40/hour (80% cheaper!)
Catch: spot VMs can be reclaimed when demand spikes. Karpenter handles this by running critical work on premium VMs and non-critical work on cheap spot VMs. If spot capacity runs out, work moves to premium VMs automatically. Result: roughly 40-50% savings on compute!

Technical Deep Dive

Cost Structure Analysis:
Monthly Infrastructure Cost (1000 concurrent users):

Traditional Static Deployment:
  5 pods × $2/hour × 730 hours = $7,300/month
  PostgreSQL managed: $1,500/month
  Networking/Storage: $500/month
  Total: ~$9,300/month

With Karpenter Optimization:
  Production NodePool: 2 pods × $2/hour × 730 hours = $2,920/month (40% on-demand)
  Spot NodePool: 3 pods × $0.40/hour × 730 hours = $876/month (60% spot)
  PostgreSQL managed: $1,500/month
  Networking/Storage: $500/month
  Total: ~$5,796/month
  Savings: $3,504/month (38% of total, 48% of compute)
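
The arithmetic above can be reproduced with a short script. This is a minimal sketch that assumes the prices and pod counts stated in this section; the function and parameter names are illustrative and do not come from any FNP tooling.

```python
HOURS_PER_MONTH = 730  # same monthly hour count used in the figures above


def monthly_compute(count: int, hourly_rate: float) -> float:
    """Monthly compute cost for `count` instances at a flat hourly rate."""
    return count * hourly_rate * HOURS_PER_MONTH


def monthly_total(on_demand: int, spot: int,
                  on_demand_rate: float = 2.00, spot_rate: float = 0.40,
                  database: float = 1500.0, network_storage: float = 500.0) -> float:
    """Total monthly bill: compute plus the fixed database and network/storage lines."""
    compute = monthly_compute(on_demand, on_demand_rate) + monthly_compute(spot, spot_rate)
    return compute + database + network_storage


static_total = monthly_total(on_demand=5, spot=0)
mixed_total = monthly_total(on_demand=2, spot=3)
savings = static_total - mixed_total

print(f"Static: ${static_total:,.0f}/month")
print(f"Mixed:  ${mixed_total:,.0f}/month")
print(f"Saved:  ${savings:,.0f}/month ({savings / static_total:.0%} of total)")
```

Changing the `on_demand` and `spot` arguments shows how the savings move as the mix changes, with the fixed database and networking lines diluting the percentage.
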
Karpenter Architecture:
# Schematic configuration: field names are simplified and are not the literal
# Karpenter NodePool schema (which uses spec.weight, spec.limits, spec.disruption).

# Production NodePool (always available, on-demand)
ProductionNodePool:
    instances: [c5.xlarge, c5.2xlarge, m5.xlarge] # c5: compute-optimized, m5: general-purpose
    min_size: 2
    max_size: 5
    capacity_type: on-demand
    priority: 100 # High priority; capacity is not interruptible

# Spot NodePool (cost-optimized, can be interrupted)
SpotNodePool:
    instances: [
            c5.xlarge,
            c5.2xlarge,
            m5.xlarge, # Prefer these
            c6i.xlarge,
            m6i.xlarge, # Fallback
            t3.xlarge,
            t4g.xlarge, # Last resort (t4g is arm64; requires multi-arch images)
        ]
    min_size: 0
    max_size: 20
    capacity_type: spot # 80% cheaper at the prices assumed in this document
    consolidation:
        enabled: true # Repack pods and remove underused nodes
        ttl_seconds: 3600 # Rebalance hourly

PodScheduling:
    critical_workloads: # Insert proofs, user requests
        node_affinity: production_pool
        tolerations: []

    batch_workloads: # Backup, analytics, replication
        node_affinity: spot_pool
        tolerations: # Needed only if the spot pool is tainted with this key
            - key: "karpenter.sh/capacity-type"
              value: "spot"
              effect: NoSchedule
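
To make the PodScheduling rules concrete, the sketch below builds the scheduling part of a pod spec for each workload class. It assumes nodes carry the `karpenter.sh/capacity-type` label with values `spot` or `on-demand`, and that the spot pool may be tainted with the same key, as the configuration above implies. The function name and the workload class names are hypothetical.

```python
CAPACITY_LABEL = "karpenter.sh/capacity-type"


def scheduling_fragment(workload_class: str, spot_pool_tainted: bool = True) -> dict:
    """Return nodeSelector/tolerations for a workload class (illustrative only)."""
    if workload_class == "critical":
        # User-facing work is pinned to on-demand nodes and tolerates no spot taint.
        return {"nodeSelector": {CAPACITY_LABEL: "on-demand"}, "tolerations": []}
    if workload_class == "batch":
        tolerations = []
        if spot_pool_tainted:
            # A toleration only matters when the spot nodes actually carry the taint.
            tolerations.append({
                "key": CAPACITY_LABEL,
                "operator": "Equal",
                "value": "spot",
                "effect": "NoSchedule",
            })
        return {"nodeSelector": {CAPACITY_LABEL: "spot"}, "tolerations": tolerations}
    raise ValueError(f"unknown workload class: {workload_class!r}")


critical = scheduling_fragment("critical")
batch = scheduling_fragment("batch")
print(critical)
print(batch)
```

The returned dictionaries can be merged into a pod template's `spec`; a node selector is the simplest form of the node affinity named above.
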
Consolidation & Bin-Packing:
Before Consolidation:
  Node 1: 2 pods, 35% CPU utilization
  Node 2: 1 pod, 15% CPU utilization
  Node 3: 2 pods, 40% CPU utilization
  Total: 3 nodes × $2/hour = $6/hour

After Consolidation (evaluated hourly):
  Node 1: 2 pods, 35% utilization
  Node 2: 0 pods (DELETED)
  Node 3: 3 pods, 55% utilization
  Total: 2 nodes × $2/hour = $4/hour
  Savings: $2/hour!
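
The idea behind consolidation can be sketched as a greedy bin-packing pass. This is not Karpenter's actual algorithm, only a simplified illustration. It assumes equally sized nodes, a hypothetical 80% utilization ceiling, and per-pod CPU shares chosen to match the "before" utilizations above.

```python
def consolidate(nodes: dict[str, list[int]], ceiling: int = 80) -> dict[str, list[int]]:
    """Try to empty the least-loaded nodes by moving their pods elsewhere.

    `nodes` maps a node name to the CPU share (percent of one node) of each pod.
    A node is removed only if all of its pods fit on the remaining nodes.
    """
    result = {name: list(pods) for name, pods in nodes.items()}
    for name in sorted(result, key=lambda n: sum(result[n])):
        others = {n: list(p) for n, p in result.items() if n != name}
        fits = bool(others)
        for pod in sorted(result[name], reverse=True):
            # Best fit: the fullest node that still stays under the ceiling.
            targets = [n for n in others if sum(others[n]) + pod <= ceiling]
            if not targets:
                fits = False
                break
            best = max(targets, key=lambda n: sum(others[n]))
            others[best].append(pod)
        if fits:
            result = others  # commit the move and drop the emptied node
    return result


before = {"node-1": [20, 15], "node-2": [15], "node-3": [20, 20]}
after = consolidate(before)
for name, pods in after.items():
    print(f"{name}: {len(pods)} pods, {sum(pods)}% CPU")
```

With these inputs the single pod on node-2 is moved and node-2 is removed, leaving two nodes. Lowering `ceiling` makes the pass more conservative.
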
Cost Monitoring & Alerting:
# Metric names are as given in this document; verify them against the metrics
# your Karpenter version actually exports.
PrometheusAlert:
    - alert: HighSpotEvictionRate
      expr: rate(karpenter_pod_evictions_total[5m]) > 0.1
      for: 5m
      annotations:
          summary: "Spot instances are being evicted; consider scaling up"

    - alert: HighNodeCost
      expr: sum(karpenter_nodes_count) * 2 > 100 # node count × $2/hour; PromQL takes plain numbers, not "$"
      for: 1h
      annotations:
          summary: "Hourly node cost exceeds $100; review allocation"

GrafanaDashboard:
    - panel: On-Demand vs Spot Cost
      graph: stacked area chart
      metrics:
          - karpenter_cost_per_hour_on_demand
          - karpenter_cost_per_hour_spot

    - panel: Pod Scheduling Success Rate
      graph: line chart
      target: 100 * pods_scheduled / (pods_scheduled + pods_pending)
      threshold: 95%
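
The two quantities behind these alerts and panels can be checked offline. The sketch below assumes the $2.00 and $0.40 hourly rates from the summary and treats the scheduling success rate as scheduled pods over all pods that needed scheduling. All function names are hypothetical.

```python
ON_DEMAND_RATE = 2.00  # $/hour, from the summary
SPOT_RATE = 0.40       # $/hour, from the summary


def hourly_node_cost(on_demand_nodes: int, spot_nodes: int) -> float:
    """Hourly compute cost for the given node counts."""
    return on_demand_nodes * ON_DEMAND_RATE + spot_nodes * SPOT_RATE


def cost_alert(on_demand_nodes: int, spot_nodes: int, threshold: float = 100.0) -> bool:
    """True when hourly node cost exceeds the alert threshold ($100 above)."""
    return hourly_node_cost(on_demand_nodes, spot_nodes) > threshold


def scheduling_success_rate(scheduled: int, pending: int) -> float:
    """Percentage of pods that were scheduled; 100 when there is nothing to schedule."""
    total = scheduled + pending
    if total == 0:
        return 100.0
    return 100.0 * scheduled / total


print(hourly_node_cost(2, 3))
print(cost_alert(50, 5))
print(scheduling_success_rate(95, 5))
```

Dividing by the total rather than by the pending count keeps the rate between 0 and 100 and avoids a division by zero when no pods are pending.
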
Scaling Policy (HPA + Karpenter):
# 1. User load increases
# 2. HPA creates new pods
# 3. Karpenter detects pending pods
# 4. Karpenter decides: spot or on-demand?
#    - If 0-30% node utilization: add to spot
#    - If 30-60% utilization: prefer spot
#    - If >60% utilization: add on-demand (guaranteed availability)
# 5. Karpenter launches cheapest option
# 6. Pod scheduled, traffic flowing

ScalingSequence:
  1. Metric: requests/sec = 1000 (threshold: 800)
  2. HPA: scale from 5 to 8 pods
  3. Karpenter: detects 3 pending pods
  4. Karpenter: cost optimizer runs
     - Node utilization: 45%
     - Recommendation: add spot instance
  5. Karpenter: launches c5.xlarge spot
  6. Pods scheduled, system stabilizes
  7. 1 hour later: traffic drops
  8. Consolidation: merge pods, delete node
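
The utilization thresholds in step 4 can be written as a small decision function. This encodes the policy as stated in this section, not a Karpenter API. The function names and return values are illustrative, and values exactly at 30% and 60% are assigned to the lower band as an assumption.

```python
def choose_capacity(node_utilization_pct: float) -> str:
    """Map current node utilization to the capacity preference described above."""
    if not 0 <= node_utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    if node_utilization_pct <= 30:
        return "spot"
    if node_utilization_pct <= 60:
        return "prefer-spot"
    return "on-demand"  # guaranteed availability under high load


def pending_pods(current_replicas: int, desired_replicas: int) -> int:
    """Pods the HPA has requested that do not yet have a node."""
    return max(desired_replicas - current_replicas, 0)


# The ScalingSequence example: HPA scales 5 -> 8 pods at 45% node utilization.
print(pending_pods(5, 8))
print(choose_capacity(45))
```

For the sequence above this yields three pending pods and a spot preference, which matches the recommendation to add a spot instance.
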

Key Terms

  • Spot Instance → AWS spare capacity; about 80% cheaper at the prices used here, but can be interrupted with a 2-minute notice
  • On-Demand Instance → Capacity that is not reclaimed once running; standard pricing
  • Consolidation → Karpenter automatically repacks pods onto fewer nodes and deletes idle or underused nodes
  • Node Utilization → Percentage of a node's CPU/memory in use
  • Bin-Packing → Placing pods so that the fewest nodes are needed
  • TTL (Time-to-Live) → Interval between consolidation checks; 1 hour (3600 seconds) in this configuration
  • Capacity Type → on-demand vs spot
  • Cost Optimization → Automated tradeoff between cost and reliability

Q/A

Q: What happens if a spot instance gets interrupted?
A: Karpenter handles the interruption as follows: (1) AWS sends a 2-minute notice, (2) Karpenter cordons and drains the node so its pods move to other nodes, (3) pods are rescheduled on on-demand capacity if needed, (4) the operation succeeds with a brief latency spike. Typical RTO is ~30 seconds.

Q: Can I lose data if spot instances are terminated?
A: No. Pods are stateless and the database is separate: PostgreSQL runs on managed RDS (guaranteed availability). Pods are replaceable, so if a pod is terminated its work moves to another node. No data is lost.

Q: How does Karpenter decide between spot and on-demand?
A: The choice is configured per workload, based on three considerations: (1) expected pod duration (long-running pods are riskier on spot), (2) cost versus reliability needs, (3) spot for batch/non-critical work, on-demand for user-facing work.

Q: Is there a “tipping point” where spot becomes too risky?
A: Yes. If the interruption rate exceeds 5% or the availability SLA is stricter than 99.5%, prefer on-demand. FNP targets 99.99% availability, so at most 60% spot is recommended. Use on-demand for critical user requests.

Q: How are costs tracked and attributed?
A: Karpenter exports metrics: karpenter_nodes_cost_per_hour, karpenter_pod_cost. Grafana dashboards visualize them. AWS billing integration tags resources by workload, so Finance can track cost per feature/customer.

Q: What’s the maximum savings possible?
A: 80% reduction on compute with 100% spot (risky). Realistic: 40-50% with a mix (60% spot, 40% on-demand for critical work). Further savings come from consolidation (-20%), efficient scheduling (-10%), and reserved instances (-15% additional).
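
The relationship between spot share and compute savings is linear, which a few lines of code make explicit. This is a minimal sketch that assumes the 80% spot discount used throughout this document and ignores interruptions, consolidation and reserved instances. The function name is hypothetical.

```python
def compute_savings(spot_fraction: float, spot_discount: float = 0.80) -> float:
    """Fraction of compute cost saved when `spot_fraction` of capacity runs on spot."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    return spot_fraction * spot_discount


for fraction in (0.0, 0.4, 0.6, 1.0):
    print(f"{fraction:.0%} spot -> {compute_savings(fraction):.0%} compute savings")
```

At the 60% spot ceiling mentioned above, the saving on compute is 48%, and only a fully spot fleet reaches the 80% maximum.
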

Example / Analogy

Ride-Share Cost Analogy:

Traditional Deployment (Always Premium):
  • Take Uber Black (premium car) every day
  • Cost: $25/trip × 50 trips/month = $1,250
  • Always available, never wait
With Karpenter (Mix):
  • UberX (spot/cheap): 70% of trips = $6 × 35 trips = $210
  • Uber Black (on-demand): 30% of trips = $25 × 15 trips = $375
  • Total: $585/month
  • Savings: $665/month (53%)
  • Trade-off: Sometimes UberX temporarily unavailable, reroute to Uber Black

Cross-References: Deployment Architecture, Observability, Scaling Strategy
Category: Operations | Cost Optimization | Infrastructure | DevOps
Difficulty: Intermediate ⭐⭐⭐
Updated: 2025-11-28