# SKILL: Kubernetes & Helm Infrastructure Validation Framework

**Version**: 1.0\
**Domain**: Infrastructure / Kubernetes / Helm\
**Complexity**: Advanced\
**Time to Proficiency**: 4-6 hours\
**Success Criteria**: Ability to validate, document, and deploy K8s infrastructure with HA/scaling configurations

***

## Overview

This skill captures the complete framework for validating Kubernetes manifests and Helm charts through structured testing, documentation, and verification. It was used to implement BLOCK 15 (Kubernetes & Helm Production Deployment) for the Traceo MCP Server.

### Key Concepts

**Dual Deployment Systems**: Maintain both raw Kubernetes manifests (via Kustomize) and Helm charts for flexibility and redundancy.

**Structured Validation**: Use Python YAML validation, Kustomize builds, manifest consistency checks, and acceptance criteria verification.

**High Availability Pattern**: Configure HPA (CPU/memory-based), PDB (graceful disruptions), proper resource requests/limits, and health checks.

**Documentation-Driven Development**: Create validation reports and scaling guides as first-class artifacts, not afterthoughts.

***

## Task Breakdown

### 1. Manifest Validation Foundation

**Objective**: Ensure all YAML files are syntactically correct and structurally sound.

**Steps**:

1. Use Python's `yaml.safe_load_all()` to parse all manifest files
2. Check for required Kubernetes fields (kind, metadata, spec)
3. Validate resource references (labels, selectors, ports)
4. Verify namespace consistency across overlays
5. Check replica counts and resource requests/limits

**Code Pattern**:

```python theme={null}
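import yaml
from pathlib import Path

manifest = Path("manifest.yaml")  # path to the manifest under test

with manifest.open() as handle:
    # safe_load_all is lazy, so iterate while the file is still open
    for doc in yaml.safe_load_all(handle):
        if doc is None:  # empty document between '---' separators
            continue
        assert doc.get('kind'), f"{manifest}: missing kind"
        assert doc.get('metadata', {}).get('name'), f"{manifest}: missing metadata.name"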
```

**Validation Checklist**:

* [ ] No YAML syntax errors (parseable)
* [ ] All resources have kind + metadata.name
* [ ] Label selectors match pod labels
* [ ] Service ports match container ports
* [ ] Resource limits ≥ requests
* [ ] Namespace consistent in overlays
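
The structural items in this checklist can be automated once a manifest has been parsed into a dictionary. The following is a minimal sketch that works on plain dictionaries (for example, the output of `yaml.safe_load_all`); the function names `selector_matches` and `check_deployment` and the sample Deployment are illustrative assumptions, not part of any library or of the original framework.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Return True if every selector key/value pair appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def check_deployment(doc: dict) -> list[str]:
    """Collect structural problems for one parsed Deployment manifest."""
    problems = []
    if not doc.get("kind"):
        problems.append("missing kind")
    if not doc.get("metadata", {}).get("name"):
        problems.append("missing metadata.name")
    spec = doc.get("spec", {})
    selector = spec.get("selector", {}).get("matchLabels", {})
    pod_labels = spec.get("template", {}).get("metadata", {}).get("labels", {})
    if not selector:
        problems.append("missing spec.selector.matchLabels")
    elif not selector_matches(selector, pod_labels):
        problems.append("selector does not match pod template labels")
    return problems


# Hypothetical Deployment, already parsed from YAML into a dict.
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "selector": {"matchLabels": {"app": "web"}},
        "template": {"metadata": {"labels": {"app": "web", "tier": "frontend"}}},
    },
}
print(check_deployment(deployment))
```

Returning a list of problems rather than asserting lets one run report every issue in a manifest instead of stopping at the first.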

***

### 2. Service Interconnection Mapping

**Objective**: Verify all service-to-service dependencies are properly configured.

**Pattern**:

```python theme={null}
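import yaml

# Map all services (directory names under base/, taken from the matrix below)
services = ['web', 'engine', 'mcp-server']
services_config = {}
for svc in services:
    service_file = f"base/{svc}/service.yaml"
    with open(service_file) as handle:
        for doc in yaml.safe_load_all(handle):
            if doc and doc.get('kind') == 'Service':
                name = doc['metadata']['name']
                services_config[name] = {
                    'ports': doc['spec']['ports'],
                    'selector': doc['spec']['selector']
                }

# Verify interconnections: every target must name a known Service and port
connectivity_matrix = {
    'web': ['mcp-server:8000', 'engine:8001'],
    'engine': ['mcp-server:8000'],
}
for source, targets in connectivity_matrix.items():
    for target in targets:
        target_name, target_port = target.split(':')
        assert target_name in services_config, f"{source} -> {target}: unknown service"
        ports = [p['port'] for p in services_config[target_name]['ports']]
        assert int(target_port) in ports, f"{source} -> {target}: port not exposed"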
```

**Validation Checklist**:

* [ ] All Deployments have matching Services
* [ ] Service selectors match pod labels
* [ ] Port numbers consistent across deployment/service
* [ ] Connectivity matrix verified
* [ ] No orphaned services or deployments
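
The "no orphaned services or deployments" item can be checked by comparing Service selectors with pod template labels. This is a sketch under the assumption that both have already been extracted into simple name-to-labels dictionaries; `find_orphans` and the sample inventory (including its deliberate selector typo) are hypothetical.

```python
def find_orphans(deployments: dict, services: dict) -> dict:
    """Compare pod labels against Service selectors.

    deployments: Deployment name -> pod template labels
    services:    Service name -> Service selector
    """
    def selects(selector: dict, labels: dict) -> bool:
        # An empty selector is treated as selecting nothing in this sketch.
        return bool(selector) and all(labels.get(k) == v for k, v in selector.items())

    orphaned_services = [
        svc for svc, selector in services.items()
        if not any(selects(selector, labels) for labels in deployments.values())
    ]
    unexposed_deployments = [
        dep for dep, labels in deployments.items()
        if not any(selects(selector, labels) for selector in services.values())
    ]
    return {"services": orphaned_services, "deployments": unexposed_deployments}


# Hypothetical inventory using the service names from the connectivity matrix.
deployments = {
    "web": {"app": "web"},
    "engine": {"app": "engine"},
    "mcp-server": {"app": "mcp-server"},
}
services = {
    "web": {"app": "web"},
    "engine": {"app": "engine"},
    "mcp-server": {"app": "mcp"},  # deliberate typo: selects no pods
}
print(find_orphans(deployments, services))
```

A single selector typo shows up on both sides: the Service selects no pods, and the Deployment it was meant to expose is unreachable.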

***

### 3. High Availability Configuration

**Objective**: Configure HPA and PDB for production resilience.

**HPA Pattern**:

```yaml theme={null}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: service-hpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: service
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
```

**PDB Pattern**:

```yaml theme={null}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: service-pdb
spec:
  minAvailable: 2  # HPA minReplicas - 1
  selector:
    matchLabels:
      app: service-name
```

**Validation Rules**:

* [ ] minAvailable \< minReplicas (allows disruptions)
* [ ] maxReplicas is 2-5x minReplicas
* [ ] CPU target 70% (conservative, safe margin)
* [ ] Memory target 80% (higher than CPU)
* [ ] Scale-up faster than scale-down (aggressive vs. conservative)
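
The first two rules are simple numeric comparisons and can be checked mechanically. The sketch below assumes the HPA and PDB have been parsed into dictionaries and that `minAvailable` is given as an integer (the percentage form is skipped); `check_ha_rules` is an illustrative name.

```python
def check_ha_rules(hpa: dict, pdb: dict) -> list[str]:
    """Check an HPA/PDB pair against the replica-count rules above."""
    hpa_spec = hpa["spec"]
    min_replicas = hpa_spec.get("minReplicas", 1)  # Kubernetes defaults minReplicas to 1
    max_replicas = hpa_spec["maxReplicas"]
    min_available = pdb["spec"].get("minAvailable")
    problems = []
    # Only the integer form of minAvailable is handled; percentages are skipped.
    if isinstance(min_available, int) and min_available >= min_replicas:
        problems.append("PDB minAvailable must be below HPA minReplicas")
    if not 2 * min_replicas <= max_replicas <= 5 * min_replicas:
        problems.append("maxReplicas should be 2-5x minReplicas")
    return problems


# Values taken from the HPA and PDB patterns in this section.
hpa = {"spec": {"minReplicas": 3, "maxReplicas": 10}}
pdb = {"spec": {"minAvailable": 2}}
print(check_ha_rules(hpa, pdb))
```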

***

### 4. Resource Request Alignment

**Objective**: Ensure resource requests match HPA scaling triggers.

**Calculation Formula**:

```
Scale-up triggers when average usage across the target's pods exceeds:
  Avg Pod CPU Usage > (Request × Target%)
  Avg Pod Memory Usage > (Request × Target%)

Example:
  Request: 100m CPU, 256Mi memory
  CPU Target: 70%
  Memory Target: 80%
  
  Triggers at: 70m CPU OR ~205Mi memory (256Mi × 0.8 = 204.8Mi)
```
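
Once a threshold is crossed, the replica count the HPA aims for follows the formula in the Kubernetes HPA documentation: `desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)`. The sketch below applies that formula and clamps the result to the replica bounds; the function name is illustrative, and the controller's tolerance band and stabilization windows are deliberately left out.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Apply the documented HPA formula, then clamp to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# 3 pods averaging 100% of their CPU request against a 70% target.
print(desired_replicas(3, 100, 70, min_replicas=3, max_replicas=10))
```

With the pattern's bounds (3 to 10 replicas), 3 pods at 100% average CPU against a 70% target give `ceil(4.29) = 5` replicas.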

**Validation Pattern**:

```python theme={null}
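def parse_value(quantity):
    """Convert a Kubernetes quantity to cores (CPU) or bytes (memory); common suffixes only."""
    units = {'m': 0.001, 'Ki': 1024, 'Mi': 1024 ** 2, 'Gi': 1024 ** 3}
    quantity = str(quantity)
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[:-len(suffix)]) * factor
    return float(quantity)


# deployments: list of parsed Deployment manifests (loaded elsewhere)
# hpa_config: parsed HPA manifests keyed by Deployment name (loaded elsewhere)
for deployment in deployments:
    deployment_name = deployment['metadata']['name']
    containers = deployment['spec']['template']['spec']['containers']
    requests = containers[0]['resources']['requests']  # first container only

    hpa = hpa_config[deployment_name]
    for metric in hpa['spec']['metrics']:
        if metric['type'] == 'Resource':
            resource_name = metric['resource']['name']
            target_util = metric['resource']['target']['averageUtilization']

            # Usage level at which the HPA starts adding pods
            threshold = parse_value(requests[resource_name]) * target_util / 100
            assert threshold > 0, f"Request too small for {resource_name}"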
```

**Validation Checklist**:

* [ ] All containers have resource requests
* [ ] All containers have resource limits
* [ ] Limits >= Requests (always)
* [ ] Requests are realistic (based on actual usage)
* [ ] HPA thresholds reasonable for request size
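
Checking "Limits >= Requests" requires comparing Kubernetes quantity strings such as `100m` and `256Mi`. The sketch below parses the suffixes most often seen in manifests and reports missing or inverted values; `parse_quantity`, `limits_cover_requests`, and the sample limits are assumptions for illustration, and less common suffixes (for example `Ti` or exponent notation) are not handled.

```python
from decimal import Decimal

# Multipliers for common suffixes; other valid Kubernetes suffixes are omitted.
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(1000), "M": Decimal(1000) ** 2, "G": Decimal(1000) ** 3,
    "Ki": Decimal(1024), "Mi": Decimal(1024) ** 2, "Gi": Decimal(1024) ** 3,
}


def parse_quantity(value) -> Decimal:
    """Convert a quantity string to cores (CPU) or bytes (memory)."""
    text = str(value)
    # Try two-character suffixes before one-character ones ("Mi" before "M").
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def limits_cover_requests(resources: dict) -> list[str]:
    """Report missing requests/limits and limits that fall below requests."""
    problems = []
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for name in ("cpu", "memory"):
        if name not in requests:
            problems.append(f"{name}: missing request")
        elif name not in limits:
            problems.append(f"{name}: missing limit")
        elif parse_quantity(limits[name]) < parse_quantity(requests[name]):
            problems.append(f"{name}: limit below request")
    return problems


# Requests from the example above; the limits are hypothetical.
resources = {
    "requests": {"cpu": "100m", "memory": "256Mi"},
    "limits": {"cpu": "500m", "memory": "512Mi"},
}
print(limits_cover_requests(resources))
```

`Decimal` is used instead of `float` so that values like `100m` compare exactly.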

***

### 5. Health Check Configuration

**Objective**: Ensure proper liveness and readiness probes.

**Pattern**:

```yaml theme={null}
livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

**Timing Guidelines**:

* **initialDelaySeconds**: Long enough for service startup (10-15s)
* **periodSeconds**: Balance responsiveness against probe load (10-30s); a shorter period detects failures sooner but probes the service more often
* **timeoutSeconds**: Should be ≤ periodSeconds / 2
* **failureThreshold**: 2-3 failures before action

**Validation Checklist**:

* [ ] All Deployments have liveness probe
* [ ] All Deployments have readiness probe
* [ ] Probes check actual health (not just connectivity)
* [ ] Timing values are reasonable
* [ ] Port names match container port definitions
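
The "timing values are reasonable" item can be turned into a small check against the timing guidelines. In this sketch a timeout of exactly half the period is treated as acceptable, which is an assumption chosen so that the readiness probe in the pattern above passes; `check_probe` is an illustrative name, and omitted fields fall back to the Kubernetes probe defaults.

```python
def check_probe(name: str, probe: dict) -> list[str]:
    """Flag probe timings that fall outside the timing guidelines."""
    # Fall back to the Kubernetes defaults when a field is omitted.
    period = probe.get("periodSeconds", 10)
    timeout = probe.get("timeoutSeconds", 1)
    threshold = probe.get("failureThreshold", 3)
    problems = []
    if timeout * 2 > period:
        problems.append(f"{name}: timeoutSeconds is more than half of periodSeconds")
    if not 10 <= period <= 30:
        problems.append(f"{name}: periodSeconds outside 10-30s")
    if not 2 <= threshold <= 3:
        problems.append(f"{name}: failureThreshold outside 2-3")
    return problems


# Probe settings copied from the pattern in this section.
liveness = {"initialDelaySeconds": 10, "periodSeconds": 30,
            "timeoutSeconds": 10, "failureThreshold": 3}
readiness = {"initialDelaySeconds": 5, "periodSeconds": 10,
             "timeoutSeconds": 5, "failureThreshold": 3}
print(check_probe("liveness", liveness) + check_probe("readiness", readiness))
```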

***

### 6. Documentation Generation

**Objective**: Create production-ready deployment documentation.

**Required Documents**:

1. **Scaling & HA Guide**
   * Per-service HPA configurations
   * Scaling timelines (scale-up/down duration)
   * Real-world scenarios (traffic spike, batch job, maintenance)
   * Troubleshooting section

2. **Deployment Readiness Report**
   * Validation results matrix
   * Resource inventory
   * Deployment instructions
   * Post-deployment verification steps
   * Known issues and recommendations

3. **README Updates**
   * Quick-start deployment commands
   * Verification commands
   * Environment-specific differences

**Template Structure**:

````markdown theme={null}
# Title

## Validation Results
- [ ] Component 1: Status
- [ ] Component 2: Status

## Per-Service Configuration
### Service Name
| Metric | Value |
|--------|-------|
| Min Replicas | 3 |
| Max Replicas | 10 |

## Deployment Instructions
```bash
kubectl apply -k overlays/prod
```

## Verification
```bash
kubectl get pods -n namespace
```
````

***

### 7. Kustomize Integration

**Objective**: Validate Kustomize builds and patch application.

**Pattern**:

```bash theme={null}
# Build base
kustomize build base

# Build with overlay
kustomize build overlays/prod

# Verify output
kustomize build overlays/prod | grep "kind:" | sort | uniq -c
```

**Validation**:

* [ ] Build completes without errors
* [ ] All expected resources present
* [ ] Namespace patching applied
* [ ] Label patching applied
* [ ] Resource patching applied
* [ ] No naming conflicts
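
The "all expected resources present" item can be checked by counting resource kinds in the rendered output and comparing them with an expected inventory. The sketch below works on the text produced by `kustomize build` and counts only unindented `kind:` lines, so nested fields such as `scaleTargetRef.kind` are not counted; the function names, the sample output, and the expected inventory are all hypothetical.

```python
import re
from collections import Counter


def count_kinds(rendered: str) -> Counter:
    """Count top-level `kind:` lines in a multi-document YAML stream."""
    return Counter(re.findall(r"^kind:\s*(\S+)", rendered, flags=re.MULTILINE))


def missing_kinds(rendered: str, expected: dict) -> dict:
    """Return how many resources of each expected kind are missing."""
    found = count_kinds(rendered)
    return {kind: n - found[kind] for kind, n in expected.items() if found[kind] < n}


# Hypothetical excerpt of `kustomize build overlays/prod` output.
rendered = """\
apiVersion: v1
kind: Service
metadata:
  name: web
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: web
"""
expected = {"Service": 1, "Deployment": 1,
            "HorizontalPodAutoscaler": 1, "PodDisruptionBudget": 1}
print(missing_kinds(rendered, expected))
```

In practice the `rendered` string could come from `subprocess.run(["kustomize", "build", "overlays/prod"], capture_output=True, text=True).stdout`.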

***

## Production Checklist

Before deploying to production:

**Infrastructure**:

* [ ] Kubernetes cluster v1.28+
* [ ] Metrics server installed (required for HPA)
* [ ] 3+ nodes for HA
* [ ] Adequate CPU/memory capacity
* [ ] Persistent volume provisioner available

**Configuration**:

* [ ] All placeholder secrets replaced
* [ ] Database URLs configured
* [ ] Environment variables set
* [ ] Image tags verified
* [ ] Resource quotas configured (if multi-tenant)

**Networking**:

* [ ] Ingress controller installed
* [ ] DNS records configured
* [ ] TLS certificates provisioned
* [ ] Network policies configured (optional)

**Monitoring**:

* [ ] Metrics server running
* [ ] Prometheus scraping enabled
* [ ] Alerts configured
* [ ] Logging aggregation setup

***

## Common Pitfalls

| Issue                         | Cause                          | Solution                     |
| ----------------------------- | ------------------------------ | ---------------------------- |
| HPA not scaling               | Metrics server missing         | Install metrics-server       |
| PDB blocks operations         | minAvailable too high          | Set to minReplicas - 1       |
| Pods pending                  | Resource requests too high     | Reduce requests or add nodes |
| Pods restarted during startup | initialDelaySeconds too short  | Increase to 15-20s           |
| Frequent restarts             | livenessProbe too aggressive   | Increase threshold or period |
| OOMKilled pods                | Memory limit too low           | Increase limit by 50%        |

***

## Scaling Scenarios

### Scenario 1: Traffic Spike

```
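Initial: 3 pods at 50% CPU
Spike: +100% traffic

Timeline:
t=0s:     Spike begins, CPU → ~100%
t=15-30s: HPA sees the new metrics; desired = ceil(3 × 100 / 70) = 5 pods
          (allowed in one step: policies permit +100% or +4 pods per 15s)
t=30-60s: New pods pass readiness; average CPU settles near 60%
Result: Scaling completes in roughly 45-60s; a larger spike is capped at maxReplicas (10)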
```
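
How far the HPA may grow in one 15-second period follows from the `scaleUp` policies in the HPA pattern (`Percent: 100` and `Pods: 4`). When several policies are listed, the default `selectPolicy` is `Max`, so the policy permitting the larger change applies. The sketch below computes that per-period ceiling and steps towards a desired replica count; the function names are illustrative, and metric delays and pod startup time are ignored.

```python
import math


def scale_up_cap(current: int, percent: int, pods: int, max_replicas: int) -> int:
    """Largest replica count allowed after one policy period (selectPolicy: Max)."""
    by_percent = math.ceil(current * (1 + percent / 100))
    by_pods = current + pods
    return min(max_replicas, max(by_percent, by_pods))


def scale_up_steps(current: int, desired: int, percent: int = 100,
                   pods: int = 4, max_replicas: int = 10) -> list[int]:
    """Replica count after each period until the (capped) desired count is reached."""
    desired = min(desired, max_replicas)
    steps = []
    while current < desired:
        current = min(desired, scale_up_cap(current, percent, pods, max_replicas))
        steps.append(current)
    return steps


print(scale_up_steps(current=3, desired=5))   # moderate spike
print(scale_up_steps(current=3, desired=12))  # spike beyond maxReplicas
```

The policies are ceilings, not targets: from 3 pods the HPA may add up to 4 in one period, but it only adds as many as the utilization formula calls for.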

### Scenario 2: Batch Processing

```
Initial: 3 pods
Batch starts: 50GB data ingestion

Timeline:
t=0s:     Batch starts, CPU/Memory spike
t=30s:    Memory triggers scale-up
t=60s:    Continue scaling to target
t=300s:   Sustained high utilization
t=600s:   Batch completes, utilization drops
t=900s+:  Scale-down begins after the default 300s stabilization window, then proceeds gradually over 5+ minutes
```

### Scenario 3: Rolling Update

```
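Current: 5 pods running
Update triggered:

Step 1: Replace 1 pod (at least 4 pods running)
  - Pace is set by the Deployment's rollingUpdate strategy (maxUnavailable/maxSurge)
  - The PDB (minAvailable=2) guards evictions such as node drains; rolling updates do not consult it
  - New pod starts and must pass its readiness probe
Step 2: Repeat for each pod
  - Service maintains availability throughout
Timeline: ~2-3 min per pod = 10-15 min total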
```
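
The pod counts in this scenario can be reproduced with a little arithmetic. The sketch below assumes the Deployment uses the Kubernetes default rolling-update strategy (`maxUnavailable: 25%`, rounded down, and `maxSurge: 25%`, rounded up), which the source does not state, and also shows how many voluntary evictions a PDB with `minAvailable=2` would permit; both function names are illustrative.

```python
import math


def rollout_bounds(replicas: int, max_unavailable_pct: int = 25,
                   max_surge_pct: int = 25) -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    max_unavailable = math.floor(replicas * max_unavailable_pct / 100)  # rounds down
    max_surge = math.ceil(replicas * max_surge_pct / 100)               # rounds up
    return replicas - max_unavailable, replicas + max_surge


def pdb_allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be evicted voluntarily under a minAvailable PDB."""
    return max(0, healthy_pods - min_available)


print(rollout_bounds(5))              # bounds for the 5-pod scenario
print(pdb_allowed_disruptions(5, 2))  # evictions permitted by minAvailable=2
```

For 5 replicas this gives at least 4 available pods during the update, and a PDB with `minAvailable=2` would allow up to 3 simultaneous evictions during a node drain.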

***

## Testing & Validation

### Unit Tests

* Manifest syntax validation
* Label/selector matching
* Port consistency
* Resource request validation

### Integration Tests

* Service discovery
* Cross-service connectivity
* Health check endpoints
* ConfigMap/Secret mounting

### Load Tests

* HPA scaling behavior
* PDB disruption handling
* Resource utilization patterns
* Rolling update stability

### Chaos Tests

* Pod failure recovery
* Node drain simulation
* Network latency injection
* Memory pressure simulation

***

## Tools & Commands

### Manifest Validation

```bash theme={null}
# YAML syntax check
python3 -c "import yaml; list(yaml.safe_load_all(open('manifest.yaml')))"

# Kustomize build
kustomize build overlays/prod

# Dry-run deployment
kubectl apply -k overlays/prod --dry-run=server
```

### Inspection

```bash theme={null}
# List all resources
kubectl get all -n namespace

# Describe deployment
kubectl describe deployment name -n namespace

# Check HPA status
kubectl get hpa -n namespace
kubectl describe hpa name -n namespace

# Check PDB status
kubectl get pdb -n namespace

# View pod logs
kubectl logs -f deployment/name -n namespace
```

### Scaling

```bash theme={null}
# Manual scale
kubectl scale deployment name --replicas=5 -n namespace

# Watch HPA in action
kubectl get hpa -n namespace -w

# Check metrics
kubectl top pods -n namespace
kubectl top nodes
```

***

## References

* [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
* [Pod Disruption Budgets](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
* [Kustomize Docs](https://kubectl.docs.kubernetes.io/guides/introduction/kustomize/)
* [Helm Docs](https://helm.sh/docs/)

***

## Success Indicators

You've mastered this skill when you can:

1. ✅ Validate complete Kubernetes infrastructure in \< 30 minutes
2. ✅ Design HPA/PDB configurations matching your resource constraints
3. ✅ Troubleshoot scaling and availability issues independently
4. ✅ Create production-ready deployment documentation
5. ✅ Explain tradeoffs between different HA strategies
6. ✅ Optimize resource requests based on actual metrics

***

**Last Updated**: 2026-03-31\
**Next Review**: 2026-06-30
