Production Readiness
Baseline Configuration
- Enable metrics and tracing.
- Use Postgres with a tuned connection pool size.
- Set worker count according to CPU and downstream limits.
- Configure Redis with replication and an appropriate memory (eviction) policy.
- For larger worker fleets, prefer Redis Cluster plus shard-aware worker polling.
Running Redis Cluster in Production
Use Redis Cluster when one global ready queue becomes a hotspot. Trama spreads queue traffic across virtual shards and uses rendezvous hashing so each worker pod only polls the shards it owns.
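To make the ownership idea concrete, here is a minimal sketch of rendezvous (highest-random-weight) hashing over virtual shards. It is an illustration only: the function names, the SHA-256 scoring, and the `worker-N` pod IDs are assumptions and not Trama's actual implementation.

```python
import hashlib


def score(pod_id: str, shard: int) -> int:
    """Deterministic weight for a (pod, shard) pair (hypothetical scoring)."""
    digest = hashlib.sha256(f"{pod_id}:{shard}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(shard: int, pods: list[str]) -> str:
    """The pod with the highest score for this shard owns it."""
    return max(pods, key=lambda pod: score(pod, shard))


def owned_shards(pod_id: str, pods: list[str], shard_count: int) -> list[int]:
    """Shards that pod_id would poll, given the current membership list."""
    return [s for s in range(shard_count) if owner(s, pods) == pod_id]
```

A useful property of this scheme is that when a pod leaves, only the shards it owned are reassigned; every other shard keeps its owner, because the highest-scoring pod for that shard is still in the list.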
- Set `redis.topology` to `CLUSTER` on every worker.
- Provide at least three seed nodes in `redis.cluster.nodes`.
- Use a stable shared `redis.queue.keyPrefix` and `redis.sharding.membershipKey` across the fleet.
- Set `redis.sharding.podId` to a unique pod identity such as `${HOSTNAME}`.
- Keep `redis.sharding.membershipTtlMillis` comfortably above the heartbeat interval.
- Scale by adding worker replicas, not by manually partitioning queue keys.
```yaml
redis:
  topology: "CLUSTER"
  cluster:
    nodes:
      - "redis://redis-cluster-0.redis:6379"
      - "redis://redis-cluster-1.redis:6379"
      - "redis://redis-cluster-2.redis:6379"
  sharding:
    podId: "${HOSTNAME}"
    virtualShardCount: 1024
    membershipKey: "saga:runtime:pods"
    membershipTtlMillis: 10000
    heartbeatIntervalMillis: 3000
    refreshIntervalMillis: 2000
    claimerCount: 4
```

During shutdown, a worker marks itself as not ready, stops claiming new Redis queue work, unregisters from shard membership, and drains already-claimed executions before closing Redis. That lets another worker take ownership without abandoning in-flight items.
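The shutdown order can be sketched as follows. This is a toy model under stated assumptions: the `Worker` class and all of its fields are hypothetical stand-ins for the real subsystems (readiness probe, claimers, Redis membership, Redis connection), chosen only to show the sequence.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker; each field stands in for a real subsystem."""
    pod_id: str
    ready: bool = True
    claiming: bool = True
    members: set[str] = field(default_factory=set)        # shard membership stand-in
    in_flight: list[str] = field(default_factory=list)    # already-claimed executions
    completed: list[str] = field(default_factory=list)
    redis_open: bool = True

    def shutdown(self) -> None:
        self.ready = False                     # 1. report not ready
        self.claiming = False                  # 2. stop claiming new queue work
        self.members.discard(self.pod_id)      # 3. unregister from shard membership
        while self.in_flight:                  # 4. drain what was already claimed
            self.completed.append(self.in_flight.pop(0))
        self.redis_open = False                # 5. close Redis last
```

Closing Redis only after the drain matters in this model: the in-flight executions may still need the connection to finish, while the earlier unregister step is what allows another worker to pick up the shards.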
Scaling Guidelines
- Scale horizontally by adding runtime instances.
- Tune `runtime.workerCount` and queue batch size together.
- When running Redis Cluster, tune `redis.sharding.virtualShardCount` and `redis.sharding.claimerCount` with replica count.
- Track enqueue/dequeue and p95 saga latency as primary signals.
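As a rough sizing aid, the arithmetic below estimates how many virtual shards each pod and each claimer would handle. It assumes shards spread evenly across pods and that a pod's claimers split its owned shards evenly; neither assumption is confirmed by this document, and the helper names are hypothetical.

```python
def shards_per_pod(virtual_shard_count: int, replicas: int) -> float:
    """Average number of virtual shards owned by one pod (even spread assumed)."""
    return virtual_shard_count / replicas


def shards_per_claimer(virtual_shard_count: int, replicas: int, claimer_count: int) -> float:
    """Average shards handled by one claimer (even split within a pod assumed)."""
    return shards_per_pod(virtual_shard_count, replicas) / claimer_count
```

With `virtualShardCount: 1024`, 8 replicas, and `claimerCount: 4`, this gives 128 shards per pod and 32 per claimer. If the per-pod figure drops close to 1 as replicas grow, extra replicas can no longer share queue work evenly.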
Resilience Playbook
- Alert on failure ratio and queue pressure.
- Use the retry endpoint to re-run recoverable failed executions.
- Keep clear runbooks for Redis/Postgres outages.
- Document what happens when Redis membership becomes stale or a worker loses shard ownership.
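For the stale-membership case, a small model of TTL-based liveness can help when writing the runbook. This is a sketch under assumptions: it treats membership as a map of pod ID to last heartbeat time in milliseconds, which may not match how Trama stores it, and both function names are hypothetical.

```python
def live_pods(heartbeats: dict[str, int], now_ms: int, ttl_ms: int) -> set[str]:
    """Pods whose last heartbeat is still inside the TTL window."""
    return {pod for pod, last in heartbeats.items() if now_ms - last <= ttl_ms}


def heartbeats_per_ttl(ttl_ms: int, heartbeat_interval_ms: int) -> int:
    """How many consecutive heartbeats can be missed before the entry expires."""
    return ttl_ms // heartbeat_interval_ms
```

With `membershipTtlMillis: 10000` and `heartbeatIntervalMillis: 3000`, a pod can miss three heartbeats in a row before it counts as gone. A TTL too close to the heartbeat interval would drop healthy pods after a single delayed heartbeat and cause needless shard reassignment.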
Security Checklist
- Use TLS for ingress and internal service calls.
- Store credentials in a secret manager, not in plain environment variables that can leak into CI logs.
- Restrict network access to Redis/Postgres.
- Harden callback endpoints with authentication/authorization.
Operational Checklist
| Area | Check |
|---|---|
| Availability | /healthz and /readyz monitored. |
| Latency | saga_duration_seconds p95 tracked per definition. |
| Failures | saga_failed_total alert with reason dimension. |
| Queue | saga_enqueue_total vs saga_dequeue_total. |
| Redis Cluster | saga_redis_active_pods, saga_redis_owned_shards, and saga_redis_membership_refresh_age_ms tracked. |
| Tracing | OTEL exporter healthy and sampled correctly. |
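One way to read the Queue row is to compare how much each counter grew between two scrapes. The sketch below assumes both metrics are monotonically increasing counters sampled at the same two instants; the function name and the sample dictionaries are hypothetical.

```python
def backlog_growth(prev: dict[str, int], curr: dict[str, int]) -> int:
    """Net items added to the queue between two samples of the counters."""
    enqueued = curr["saga_enqueue_total"] - prev["saga_enqueue_total"]
    dequeued = curr["saga_dequeue_total"] - prev["saga_dequeue_total"]
    return enqueued - dequeued
```

A value that stays positive across several intervals suggests workers are falling behind and queue pressure is building; a value near zero means dequeue is keeping pace with enqueue.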