Trama

Production Readiness

Baseline Configuration

Running Redis Cluster in Production

Use Redis Cluster when a single global ready queue becomes a hotspot. Trama spreads queue traffic across virtual shards and uses rendezvous hashing to assign those shards to worker pods, so each pod polls only the shards it owns.

redis:
  topology: "CLUSTER"
  cluster:
    nodes:
      - "redis://redis-cluster-0.redis:6379"
      - "redis://redis-cluster-1.redis:6379"
      - "redis://redis-cluster-2.redis:6379"
  sharding:
    podId: "${HOSTNAME}"
    virtualShardCount: 1024
    membershipKey: "saga:runtime:pods"
    membershipTtlMillis: 10000
    heartbeatIntervalMillis: 3000
    refreshIntervalMillis: 2000
    claimerCount: 4
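
To make the shard ownership idea concrete, here is a minimal sketch of rendezvous (highest-random-weight) hashing over 1024 virtual shards. It illustrates the general technique only: the SHA-256 scoring, the `pod:shard` key format, and the helper names `score`, `owner`, and `owned_shards` are assumptions for this example, not Trama's actual implementation.

```python
import hashlib


def score(pod_id: str, shard: int) -> int:
    # Assumed scoring function: hash the (pod, shard) pair to a 64-bit integer.
    digest = hashlib.sha256(f"{pod_id}:{shard}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(shard: int, pods: list[str]) -> str:
    # The pod with the highest score owns the shard.
    # The pod id is a tie-breaker so every pod computes the same answer.
    return max(pods, key=lambda pod: (score(pod, shard), pod))


def owned_shards(pod_id: str, pods: list[str], shard_count: int = 1024) -> list[int]:
    # Each pod evaluates the same function locally and keeps only its own shards.
    return [s for s in range(shard_count) if owner(s, pods) == pod_id]


if __name__ == "__main__":
    pods = ["worker-0", "worker-1", "worker-2"]  # hypothetical pod ids
    for pod in pods:
        print(pod, len(owned_shards(pod, pods)))
```

Because every pod scores every shard independently from the same membership list, no coordination is needed beyond agreeing on that list. When a pod leaves, only the shards it owned are reassigned; shards owned by the surviving pods stay where they are.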

During shutdown, a worker marks itself as not ready, stops claiming new Redis queue work, unregisters from shard membership, and drains already-claimed executions before closing its Redis connection. Another worker can then take ownership of the released shards while the in-flight items still run to completion.
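
The ordering of those shutdown steps is what matters, so a small in-memory sketch may help. The `Worker` class, its fields, and the step names below are hypothetical stand-ins for illustration; they do not reflect Trama's API and do not talk to a real Redis.

```python
class Worker:
    """Hypothetical worker that records the order of its shutdown steps."""

    def __init__(self, in_flight: int) -> None:
        self.ready = True          # what a readiness probe would report
        self.claiming = True       # whether claimers still pull new work
        self.registered = True     # whether the pod is in shard membership
        self.redis_open = True     # stand-in for the Redis connection
        self.in_flight = in_flight # executions already claimed
        self.steps: list[str] = []

    def shutdown(self) -> None:
        # 1. Fail readiness first so no new traffic is routed here.
        self.ready = False
        self.steps.append("not_ready")
        # 2. Stop claiming new queue work.
        self.claiming = False
        self.steps.append("stop_claiming")
        # 3. Leave shard membership so other pods can take over the shards.
        self.registered = False
        self.steps.append("unregister")
        # 4. Drain what was already claimed; Redis must still be open here.
        while self.in_flight > 0:
            assert self.redis_open, "drain needs a live Redis connection"
            self.in_flight -= 1
        self.steps.append("drain")
        # 5. Only now release the connection.
        self.redis_open = False
        self.steps.append("close_redis")


if __name__ == "__main__":
    worker = Worker(in_flight=3)
    worker.shutdown()
    print(worker.steps)
```

Closing Redis last is the key design choice in this sketch: draining executions may still need to write state, so the connection has to outlive the drain step.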

Scaling Guidelines

Resilience Playbook

Security Checklist

Operational Checklist

Area            Check
Availability    /healthz and /readyz monitored.
Latency         saga_duration_seconds p95 tracked per definition.
Failures        saga_failed_total alert with reason dimension.
Queue           saga_enqueue_total vs saga_dequeue_total.
Redis Cluster   saga_redis_active_pods, saga_redis_owned_shards, and saga_redis_membership_refresh_age_ms tracked.
Tracing         OTEL exporter healthy and sampled correctly.