Summary
gRPC status codes tell clients whether to retry, fail fast, or escalate. Proper error propagation maps validation failures to INVALID_ARGUMENT, missing resources to NOT_FOUND, and server errors to INTERNAL. The IRIS MERIDIAN adapter demonstrates this in the GenerateCapabilityToken RPC, which prevents thundering-herd retries and improves reliability.
The Problem
Without explicit status code routing, all errors look identical. A malformed request (for example, a typo in sprite_id) and a genuine server failure both surface as StatusCode.UNKNOWN. Clients treat both as transient and retry with exponential backoff, which wastes bandwidth and delays failure detection. Monitoring systems can’t distinguish retryable from permanent errors.
The Solution: StatusCode Routing
Map exception types to gRPC status codes at the RPC boundary:
- INVALID_ARGUMENT (3): Client provided malformed input. Don’t retry. Examples: missing sprite_id, TTL out of range, requested capabilities that are not a subset of the available ones.
- NOT_FOUND (5): Resource doesn’t exist (permanent). Fail fast. Examples: sprite not found, config missing.
- INTERNAL (13): Server-side failure, treated here as transient. Retry with backoff. Examples: database timeout, external service unavailable.
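The mapping above can be sketched as a small lookup from exception type to status code. This is a minimal illustration, not the adapter's actual code: the exception classes (ValidationError, SpriteNotFoundError, BackendError) and the status_for helper are hypothetical names, and the StatusCode enum is a stdlib stand-in that only mirrors the numeric values from the gRPC spec so the example runs without grpcio installed.

```python
from enum import IntEnum


class StatusCode(IntEnum):
    """Stand-in for grpc.StatusCode, using the numeric codes from the gRPC spec."""
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    NOT_FOUND = 5
    INTERNAL = 13


class ValidationError(Exception):
    """Hypothetical: raised for malformed client input (e.g. missing sprite_id)."""


class SpriteNotFoundError(Exception):
    """Hypothetical: raised when the requested sprite does not exist."""


class BackendError(Exception):
    """Hypothetical: raised for server-side failures such as a database timeout."""


# Ordered (exception type, status code) pairs; the first match wins.
_STATUS_BY_EXCEPTION = (
    (ValidationError, StatusCode.INVALID_ARGUMENT),
    (SpriteNotFoundError, StatusCode.NOT_FOUND),
    (BackendError, StatusCode.INTERNAL),
)


def status_for(exc: Exception) -> StatusCode:
    """Return the status code for an exception, or UNKNOWN if it is unmapped."""
    for exc_type, code in _STATUS_BY_EXCEPTION:
        if isinstance(exc, exc_type):
            return code
    return StatusCode.UNKNOWN
```

In a real grpcio servicer, the handler would typically catch the exception at the RPC boundary and pass the corresponding grpc.StatusCode member to context.abort(code, details). Keeping the mapping in one table means an unmapped exception is easy to spot, because it is the only path that still yields UNKNOWN.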
Client Retry Logic
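A status-aware retry loop might look like the sketch below. It assumes the routing described in this document (only INTERNAL is retried; INVALID_ARGUMENT and NOT_FOUND fail immediately). RpcFailure, RETRYABLE, and call_with_retry are hypothetical names, RpcFailure stands in for grpc.RpcError, and the delay values are illustrative defaults rather than settings taken from the adapter.

```python
import time
from enum import IntEnum


class StatusCode(IntEnum):
    """Stand-in for grpc.StatusCode, using the numeric codes from the gRPC spec."""
    INVALID_ARGUMENT = 3
    NOT_FOUND = 5
    INTERNAL = 13


class RpcFailure(Exception):
    """Hypothetical stand-in for grpc.RpcError that carries a status code."""

    def __init__(self, code, details=""):
        super().__init__(f"{code.name}: {details}")
        self.code = code


# Only codes listed here are retried; everything else fails fast.
RETRYABLE = frozenset({StatusCode.INTERNAL})


def call_with_retry(rpc, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call rpc(), retrying retryable failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return rpc()
        except RpcFailure as err:
            last_attempt = attempt == max_attempts - 1
            if err.code not in RETRYABLE or last_attempt:
                raise  # permanent error, or retries exhausted
            sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
```

The sleep parameter is injectable so the backoff schedule can be checked in tests without waiting. A production client would usually add random jitter to the delay so that many clients do not retry in lockstep.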
Clients should choose a retry strategy based on the status code: fail fast on INVALID_ARGUMENT and NOT_FOUND, and retry with backoff only on INTERNAL.
Operational Impact
- Resilience: Clients don’t waste resources retrying non-retryable errors.
- Speed: Validation errors fail fast instead of surfacing only after a full exponential-backoff cycle.
- Load: No thundering herd retries; only genuine server errors trigger backoff.
- Observability: Monitoring systems distinguish error categories and alert appropriately.
Related
- Implementation: See iris-meridian-adapter/SKILL.md (section 5, gRPC error propagation) for code examples
- File reference: src/adapter/server.py:194-231 (GenerateCapabilityToken error routing)
- Integration test: tests/test_token.py (error code validation tests)
- gRPC spec: https://github.com/grpc/grpc/blob/master/doc/statuscodes.md