Trama

Production Readiness

Use this guide to run Trama reliably in production: Redis Cluster, worker scaling, resilience controls, and operational visibility. It is written for teams moving from evaluation to operational ownership.

Best for operators · Deployment guidance · Scale and resilience

Baseline Configuration

Running Redis Cluster in Production

Use Redis Cluster when a single global ready queue becomes a hotspot. Trama spreads queue traffic across virtual shards and assigns those shards to worker pods with rendezvous hashing, so each pod polls only the shards it owns.

redis:
  topology: "CLUSTER"
  cluster:
    nodes:
      - "redis://redis-cluster-0.redis:6379"
      - "redis://redis-cluster-1.redis:6379"
      - "redis://redis-cluster-2.redis:6379"
  sharding:
    podId: "${HOSTNAME}"
    virtualShardCount: 1024
    membershipKey: "saga:runtime:pods"
    membershipTtlMillis: 10000
    heartbeatIntervalMillis: 3000
    refreshIntervalMillis: 2000
    claimerCount: 4
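
The shard ownership idea can be illustrated with a minimal rendezvous hashing sketch. This is not Trama's implementation: the hash function (SHA-256), the `pod:shard` key format, and the helper names `score`, `owner`, and `owned_shards` are all assumptions chosen for illustration. Only the shard count of 1024 comes from the configuration above.

```python
import hashlib


def score(pod_id: str, shard: int) -> int:
    """Pseudo-random weight for a (pod, shard) pair. SHA-256 is an assumption."""
    digest = hashlib.sha256(f"{pod_id}:{shard}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(shard: int, pods: list[str]) -> str:
    """Rendezvous hashing: the live pod with the highest score owns the shard."""
    # The pod id breaks ties so every pod computes the same answer.
    return max(pods, key=lambda pod: (score(pod, shard), pod))


def owned_shards(pod_id: str, pods: list[str], shard_count: int = 1024) -> list[int]:
    """Shards this pod should poll, given the current live membership."""
    return [s for s in range(shard_count) if owner(s, pods) == pod_id]


if __name__ == "__main__":
    pods = ["worker-0", "worker-1", "worker-2"]  # hypothetical pod ids
    for pod in pods:
        print(pod, "owns", len(owned_shards(pod, pods)), "shards")

    # When worker-2 leaves, only the shards it owned change hands.
    survivors = pods[:2]
    moved = [s for s in range(1024) if owner(s, pods) != owner(s, survivors)]
    print(len(moved), "shards reassigned after worker-2 left")
```

Each pod can compute its own shard list from the membership set alone, with no coordinator. When a pod joins or leaves, only the shards that pod wins or loses are reassigned, which keeps rebalancing cheap.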

During shutdown, a worker marks itself as not ready, stops claiming new work from the Redis queues, unregisters from shard membership, and drains already-claimed executions before closing its Redis connections. Because the worker leaves membership before it drains, another worker can take over its shards without in-flight items being abandoned.
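
The ordering of the shutdown steps matters, so here is a toy sketch of the sequence. Everything in it is hypothetical: the `Worker` class, its fields, and the `drain_timeout_s` parameter are illustrative stand-ins, and a plain Python `set` takes the place of the Redis membership key. It shows only the order of operations described above, not Trama's actual code.

```python
import threading
import time


class Worker:
    """Toy worker. All names here are hypothetical, not Trama's API."""

    def __init__(self, pod_id: str, membership: set):
        self.pod_id = pod_id
        self.membership = membership  # stand-in for the Redis membership key
        self.ready = True             # what a readiness probe would report
        self.claiming = True          # claim loops would check this flag
        self.redis_open = True        # stand-in for the Redis client
        self.in_flight: list[threading.Thread] = []
        self.steps: list[str] = []
        membership.add(pod_id)

    def run(self, task) -> None:
        """Start an already-claimed execution on its own thread."""
        if not self.claiming:
            raise RuntimeError("worker is shutting down")
        thread = threading.Thread(target=task)
        self.in_flight.append(thread)
        thread.start()

    def shutdown(self, drain_timeout_s: float = 30.0) -> None:
        # 1. Fail readiness so no new traffic is routed here.
        self.ready = False
        self.steps.append("not_ready")
        # 2. Stop claiming new queue work.
        self.claiming = False
        self.steps.append("stopped_claiming")
        # 3. Leave shard membership so other pods can take the shards.
        self.membership.discard(self.pod_id)
        self.steps.append("unregistered")
        # 4. Drain executions that were already claimed (bounded wait is an assumption).
        deadline = time.monotonic() + drain_timeout_s
        for thread in self.in_flight:
            thread.join(timeout=max(0.0, deadline - time.monotonic()))
        self.steps.append("drained")
        # 5. Only now close the Redis client.
        self.redis_open = False
        self.steps.append("redis_closed")
```

In a Kubernetes deployment, a sequence like this would typically run from a SIGTERM handler, with the pod's termination grace period set longer than the drain window.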

Scaling Guidelines

Resilience Playbook

Security Checklist

Operational Checklist

| Area | Check |
| --- | --- |
| Availability | /healthz and /readyz monitored. |
| Latency | saga_duration_seconds p95 tracked per definition. |
| Failures | saga_failed_total alert with reason dimension. |
| Queue | saga_enqueue_total vs saga_dequeue_total. |
| Redis Cluster | saga_redis_active_pods, saga_redis_owned_shards, and saga_redis_membership_refresh_age_ms tracked. |
| Tracing | OTEL exporter healthy and sampled correctly. |
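
For the Queue check, one way to compare the two counters is to look at their net difference over a time window. The sketch below assumes that saga_enqueue_total and saga_dequeue_total are monotonically increasing counters sampled at the start and end of the window. The function name and parameters are hypothetical, and counter resets are not handled.

```python
def backlog_growth_per_second(
    enqueued_start: float,
    enqueued_end: float,
    dequeued_start: float,
    dequeued_end: float,
    window_seconds: float,
) -> float:
    """Net queue growth in items per second over a sampling window.

    Positive means items arrive faster than workers dequeue them.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    enqueued = enqueued_end - enqueued_start
    dequeued = dequeued_end - dequeued_start
    return (enqueued - dequeued) / window_seconds


if __name__ == "__main__":
    # 600 items enqueued and 300 dequeued over 60 seconds.
    print(backlog_growth_per_second(1000, 1600, 990, 1290, 60))
```

A value that stays positive across several windows suggests workers are falling behind and the backlog is growing. A value near zero means enqueue and dequeue rates are balanced.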