gRPC Error Propagation Strategies: StatusCode Routing for Intelligent Retries

Summary

gRPC status codes tell clients whether to retry, fail fast, or escalate. Proper error propagation maps validation failures to INVALID_ARGUMENT, missing resources to NOT_FOUND, and server errors to INTERNAL. The IRIS MERIDIAN adapter demonstrates this in the GenerateCapabilityToken RPC, where it prevents thundering-herd retries and improves reliability.

The Problem

Without explicit status code routing, all errors look identical. A malformed request (typo in sprite_id) and a genuine server failure both become StatusCode.UNKNOWN. Clients treat both as transient failures and retry with exponential backoff, wasting bandwidth and delaying failure detection. Monitoring systems can’t distinguish retryable from permanent errors.

The Solution: StatusCode Routing

Map exception types to gRPC status codes at the RPC boundary:

INVALID_ARGUMENT (3): Client provided malformed input. Don’t retry. Examples: missing sprite_id, TTL out of range, requested capabilities not a subset of those available.
NOT_FOUND (5): Resource doesn’t exist (permanent). Fail fast. Examples: sprite not found, config missing.
INTERNAL (13): Server error (transient). Retry with backoff. Examples: database timeout, external service unavailable.

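from grpc import StatusCode

def GenerateCapabilityToken(self, request, context):
    try:
        # Validate input first so bad requests never reach the store.
        if not request.sprite_id:
            raise ValueError("sprite_id required")
        if request.ttl < 1 or request.ttl > 86400:
            raise ValueError("ttl out of range")

        sprite = store.get(request.sprite_id)
        if not sprite:
            raise KeyError("sprite not found")

        # ... generate token
    except ValueError as e:
        # Malformed input: permanent, the client should not retry.
        context.abort(StatusCode.INVALID_ARGUMENT, str(e))
    except KeyError:
        # str(KeyError) wraps the message in quotes, so pass a plain string.
        context.abort(StatusCode.NOT_FOUND, "sprite not found")
    except Exception as e:
        # Anything else is treated as a server-side failure.
        context.abort(StatusCode.INTERNAL, f"Server error: {e}")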
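
The same routing can also be written as a lookup table, which keeps the mapping in one place and lets it be unit-tested without a running server. The sketch below is illustrative only: Code is a local stand-in for grpc.StatusCode (using the numeric values from the gRPC spec), and EXCEPTION_ROUTES and status_for are hypothetical names, not part of the adapter.

```python
from enum import Enum


class Code(Enum):
    """Stand-in for grpc.StatusCode, using the numeric values from the gRPC spec."""
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    NOT_FOUND = 5
    INTERNAL = 13


# Ordered from most to least specific; the first matching type wins.
EXCEPTION_ROUTES = [
    (ValueError, Code.INVALID_ARGUMENT),  # malformed input
    (KeyError, Code.NOT_FOUND),           # missing resource
    (Exception, Code.INTERNAL),           # anything else is a server error
]


def status_for(exc: BaseException) -> Code:
    """Return the status code the RPC boundary should report for exc."""
    for exc_type, code in EXCEPTION_ROUTES:
        if isinstance(exc, exc_type):
            return code
    # Only reachable for BaseExceptions that are not Exceptions.
    return Code.UNKNOWN
```

Because the catch-all Exception entry sits last, adding a more specific mapping only requires inserting a new row above it.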

Client Retry Logic

Clients should implement retry strategies based on status:

INVALID_ARGUMENT → Fail immediately (no retry)
NOT_FOUND → Fail immediately (no retry)
INTERNAL → Retry with exponential backoff (1s, 2s, 4s, capped at 10s)
DEADLINE_EXCEEDED → Retry with longer timeout
UNKNOWN → Fail fast (indicates implementation bug)
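
The policy above can be sketched as a small retry wrapper. This is a minimal illustration under stated assumptions, not the adapter's client: Code stands in for grpc.StatusCode, RpcFailure and call_with_retry are hypothetical names, and the default timeout, the timeout doubling on DEADLINE_EXCEEDED, and applying backoff to that code as well are choices made for the example.

```python
import time
from enum import Enum


class Code(Enum):
    """Stand-in for grpc.StatusCode (numeric values from the gRPC spec)."""
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    NOT_FOUND = 5
    INTERNAL = 13


class RpcFailure(Exception):
    """Hypothetical client-side error carrying the status code of a failed call."""

    def __init__(self, code: Code):
        super().__init__(code.name)
        self.code = code


# Everything else (INVALID_ARGUMENT, NOT_FOUND, UNKNOWN) fails immediately.
RETRYABLE = {Code.INTERNAL, Code.DEADLINE_EXCEEDED}


def backoff_delays(retries: int, base: float = 1.0, cap: float = 10.0):
    """Delay before each retry: base, 2*base, 4*base, ... capped at cap."""
    return [min(base * 2 ** i, cap) for i in range(retries)]


def call_with_retry(rpc, retries: int = 3, timeout: float = 5.0, sleep=time.sleep):
    """Call rpc(timeout), retrying only on retryable status codes."""
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            return rpc(timeout)
        except RpcFailure as err:
            if err.code not in RETRYABLE or attempt == retries:
                raise  # permanent error, or retry budget exhausted
            if err.code is Code.DEADLINE_EXCEEDED:
                timeout *= 2  # assumption: double the timeout on each retry
            sleep(delays[attempt])
```

Passing sleep as a parameter lets tests record the delays instead of actually waiting.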

Operational Impact

Resilience: Clients don’t waste resources retrying non-retryable errors.
Speed: Validation errors fail fast instead of surfacing only after the backoff schedule is exhausted.
Load: Permanent errors no longer cause thundering-herd retries; only genuine server errors trigger backoff.
Observability: Monitoring systems distinguish error categories and alert appropriately.

Related

Implementation: See iris-meridian-adapter/SKILL.md (section 5, gRPC error propagation) for code examples.
File reference: src/adapter/server.py:194-231 (GenerateCapabilityToken error routing)
Integration test: tests/test_token.py (error code validation tests)
gRPC spec: https://github.com/grpc/grpc/blob/master/doc/statuscodes.md