Production Readiness
Use this guide to deploy Trama reliably in production with Redis Cluster, worker scaling, resilience controls, and strong operational visibility. This page is for teams moving from evaluation into real operational ownership.
Baseline Configuration
- Enable metrics and tracing.
- Use Postgres with a tuned connection pool size.
- Set worker count according to CPU and downstream limits.
- Configure Redis with replication and proper memory policy.
- For larger worker fleets, prefer Redis Cluster plus shard-aware worker polling.
Running Redis Cluster in Production
Use Redis Cluster when one global ready queue becomes a hotspot. Trama spreads queue traffic across virtual shards and uses rendezvous hashing so each worker pod only polls the shards it owns.
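Trama's internal hash function is not documented here, but the rendezvous idea itself is easy to sketch: each pod scores every virtual shard with a hash of `(pod, shard)` and owns the shards where it scores highest. A minimal Python illustration (function names are ours, not Trama's API):

```python
import hashlib

def rendezvous_owner(shard: int, pods: list[str]) -> str:
    """Pick the owning pod for a virtual shard: highest hash(pod, shard) wins."""
    def score(pod: str) -> int:
        digest = hashlib.sha256(f"{pod}:{shard}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(pods, key=score)

def owned_shards(pod: str, pods: list[str], shard_count: int) -> list[int]:
    """All shards this pod should poll, given the current membership set."""
    return [s for s in range(shard_count) if rendezvous_owner(s, pods) == pod]

pods = ["worker-0", "worker-1", "worker-2"]
assignment = {p: owned_shards(p, pods, 64) for p in pods}
# Every shard has exactly one owner, and when a pod leaves, only the shards
# it owned are reassigned; the other pods keep their existing shards.
```

That last property is why rendezvous hashing suits worker fleets: scaling up or down reshuffles only the departing or arriving pod's shards, not the whole queue space.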
- Set `redis.topology` to `CLUSTER` on every worker.
- Provide at least three seed nodes in `redis.cluster.nodes`.
- Use a stable shared `redis.queue.keyPrefix` and `redis.sharding.membershipKey` across the fleet.
- Set `redis.sharding.podId` to a unique pod identity such as `${HOSTNAME}`.
- Keep `redis.sharding.membershipTtlMillis` comfortably above the heartbeat interval.
- Scale by adding worker replicas, not by manually partitioning queue keys.
```yaml
redis:
  topology: "CLUSTER"
  cluster:
    nodes:
      - "redis://redis-cluster-0.redis:6379"
      - "redis://redis-cluster-1.redis:6379"
      - "redis://redis-cluster-2.redis:6379"
  sharding:
    podId: "${HOSTNAME}"
    virtualShardCount: 1024
    membershipKey: "saga:runtime:pods"
    membershipTtlMillis: 10000
    heartbeatIntervalMillis: 3000
    refreshIntervalMillis: 2000
    claimerCount: 4
```

During shutdown, a worker marks itself as not ready, stops claiming new Redis queue work, unregisters from shard membership, and drains already-claimed executions before closing Redis. That lets another worker take ownership without abandoning in-flight items.
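The ordering of those steps is what makes the drain safe. A sketch of the sequence in Python, where the method names are purely illustrative stand-ins, not Trama's real API:

```python
# Illustrative only: these method names are assumptions, not Trama's API.
# The point is the ordering: stop accepting work before releasing shard
# ownership, and drain in-flight executions before closing connections.

class ShutdownRecorder:
    """Stand-in runtime that records the order in which steps run."""
    def __init__(self):
        self.steps = []

    def mark_not_ready(self):   self.steps.append("mark_not_ready")
    def stop_claiming(self):    self.steps.append("stop_claiming")
    def leave_membership(self): self.steps.append("leave_membership")
    def drain_in_flight(self):  self.steps.append("drain_in_flight")
    def close_redis(self):      self.steps.append("close_redis")

def drain_shutdown(runtime):
    runtime.mark_not_ready()    # readiness probe fails; no new traffic routed
    runtime.stop_claiming()     # stop pulling new items from owned shards
    runtime.leave_membership()  # other pods re-hash and take over the shards
    runtime.drain_in_flight()   # finish executions this pod already claimed
    runtime.close_redis()       # tear down connections last

rt = ShutdownRecorder()
drain_shutdown(rt)
```

Reversing any two of these steps risks either abandoning claimed executions or claiming work the pod can no longer finish.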
Scaling Guidelines
- Scale horizontally by adding runtime instances.
- Tune `runtime.workerCount` and queue batch size together.
- When running Redis Cluster, tune `redis.sharding.virtualShardCount` and `redis.sharding.claimerCount` together with replica count.
- Track enqueue/dequeue rates and p95 saga latency as primary scaling signals.
Resilience Playbook
- Alert on failure ratio and queue pressure.
- Use the retry endpoint to recover failed executions that are safe to re-run.
- Keep clear runbooks for Redis/Postgres outages.
- Document what happens when Redis membership becomes stale or a worker loses shard ownership.
Security Checklist
- Use TLS for ingress and internal service calls.
- Store credentials in a secret manager; never expose them as plain environment variables in CI logs.
- Restrict network access to Redis/Postgres.
- Harden callback endpoints with authentication/authorization.
Operational Checklist
| Area | Check |
|---|---|
| Availability | /healthz and /readyz monitored. |
| Latency | saga_duration_seconds p95 tracked per definition. |
| Failures | saga_failed_total alert with reason dimension. |
| Queue | saga_enqueue_total vs saga_dequeue_total. |
| Redis Cluster | saga_redis_active_pods, saga_redis_owned_shards, and saga_redis_membership_refresh_age_ms tracked. |
| Tracing | OTEL exporter healthy and sampled correctly. |
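The failure alert from the table could be expressed, for example, as a Prometheus alerting rule. The rule name, threshold, and durations below are placeholders to adapt, not recommended values; only the metric names come from the table above:

```yaml
groups:
  - name: trama-saga
    rules:
      - alert: SagaFailureRatioHigh
        expr: |
          sum(rate(saga_failed_total[5m]))
            /
          sum(rate(saga_enqueue_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Saga failure ratio above 5% for 10 minutes"
```

Pair a ratio alert like this with a queue-pressure alert comparing `saga_enqueue_total` and `saga_dequeue_total` rates so sustained backlog growth pages before latency does.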