Production Readiness
Use this guide to deploy Trama reliably in production with Redis Cluster, worker scaling, resilience controls, and strong operational visibility. This page is for teams moving from evaluation into real operational ownership.
Baseline Configuration
- Enable metrics and tracing.
- Use Postgres with a tuned connection pool size.
- Set worker count according to CPU and downstream limits.
- Configure Redis with replication and proper memory policy.
- For larger worker fleets, prefer Redis Cluster plus shard-aware worker polling.
Running Redis Cluster in Production
Use Redis Cluster when one global ready queue becomes a hotspot. Trama spreads queue traffic across virtual shards and uses rendezvous hashing so each worker pod only polls the shards it owns.
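Trama's internal hash function is not documented here, but the rendezvous idea itself is easy to sketch: each pod scores every virtual shard with a hash of `(pod, shard)` and owns the shards where it scores highest. A minimal Python illustration (function names are ours, not Trama's API):

```python
import hashlib

def rendezvous_owner(shard: int, pods: list[str]) -> str:
    """Pick the owning pod for a virtual shard: highest hash(pod, shard) wins."""
    def score(pod: str) -> int:
        digest = hashlib.sha256(f"{pod}:{shard}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(pods, key=score)

def owned_shards(pod: str, pods: list[str], shard_count: int) -> list[int]:
    """All shards this pod should poll, given the current membership set."""
    return [s for s in range(shard_count) if rendezvous_owner(s, pods) == pod]

pods = ["worker-0", "worker-1", "worker-2"]
assignment = {p: owned_shards(p, pods, 64) for p in pods}
# Every shard has exactly one owner, and when a pod leaves, only the shards
# it owned are reassigned; the other pods keep their existing shards.
```

That last property is why rendezvous hashing suits worker fleets: scaling up or down reshuffles only the departing or arriving pod's shards, not the whole queue space.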
- Set `redis.topology` to `CLUSTER` on every worker.
- Provide at least three seed nodes in `redis.cluster.nodes`.
- Use a stable shared `redis.queue.keyPrefix` and `redis.sharding.membershipKey` across the fleet.
- Set `redis.sharding.podId` to a unique pod identity such as `${HOSTNAME}`.
- Keep `redis.sharding.membershipTtlMillis` comfortably above the heartbeat interval.
- Scale by adding worker replicas, not by manually partitioning queue keys.
```yaml
redis:
  topology: "CLUSTER"
  cluster:
    nodes:
      - "redis://redis-cluster-0.redis:6379"
      - "redis://redis-cluster-1.redis:6379"
      - "redis://redis-cluster-2.redis:6379"
  sharding:
    podId: "${HOSTNAME}"
    virtualShardCount: 1024
    membershipKey: "saga:runtime:pods"
    membershipTtlMillis: 10000
    heartbeatIntervalMillis: 3000
    refreshIntervalMillis: 2000
    claimerCount: 4
```

During shutdown, a worker marks itself as not ready, stops claiming new Redis queue work, unregisters from shard membership, and drains already-claimed executions before closing Redis. That lets another worker take ownership without abandoning in-flight items.
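The ordering of those steps is what makes the drain safe. A sketch of the sequence in Python, where the method names are purely illustrative stand-ins, not Trama's real API:

```python
# Illustrative only: these method names are assumptions, not Trama's API.
# The point is the ordering: stop accepting work before releasing shard
# ownership, and drain in-flight executions before closing connections.

class ShutdownRecorder:
    """Stand-in runtime that records the order in which steps run."""
    def __init__(self):
        self.steps = []

    def mark_not_ready(self):   self.steps.append("mark_not_ready")
    def stop_claiming(self):    self.steps.append("stop_claiming")
    def leave_membership(self): self.steps.append("leave_membership")
    def drain_in_flight(self):  self.steps.append("drain_in_flight")
    def close_redis(self):      self.steps.append("close_redis")

def drain_shutdown(runtime):
    runtime.mark_not_ready()    # readiness probe fails; no new traffic routed
    runtime.stop_claiming()     # stop pulling new items from owned shards
    runtime.leave_membership()  # other pods re-hash and take over the shards
    runtime.drain_in_flight()   # finish executions this pod already claimed
    runtime.close_redis()       # tear down connections last

rt = ShutdownRecorder()
drain_shutdown(rt)
```

Reversing any two of these steps risks either abandoning claimed executions or claiming work the pod can no longer finish.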
Scaling Guidelines
- Scale horizontally by adding runtime instances.
- Tune `runtime.workerCount` and queue batch size together.
- When running Redis Cluster, tune `redis.sharding.virtualShardCount` and `redis.sharding.claimerCount` together with replica count.
- Track enqueue/dequeue rates and p95 saga latency as primary scaling signals.
Resilience Playbook
- Alert on failure ratio and queue pressure.
- Use the retry endpoint to recover failed executions that are safe to re-run.
- Keep clear runbooks for Redis/Postgres outages.
- Document what happens when Redis membership becomes stale or a worker loses shard ownership.
Security Checklist
- Use TLS for ingress and internal service calls.
- Store credentials in a secret manager; never expose them as plain environment variables in CI logs.
- Restrict network access to Redis/Postgres.
- Harden callback endpoints with authentication/authorization.
Operational Checklist
| Area | Check |
|---|---|
| Availability | /healthz and /readyz monitored. |
| Latency | saga_duration_seconds p95 tracked per definition. |
| Failures | saga_failed_total alert with reason dimension. |
| Queue | saga_enqueue_total vs saga_dequeue_total. |
| Redis Cluster | saga_redis_active_pods, saga_redis_owned_shards, and saga_redis_membership_refresh_age_ms tracked. |
| Tracing | OTEL exporter healthy and sampled correctly. |
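The failure alert from the table could be expressed, for example, as a Prometheus alerting rule. The rule name, threshold, and durations below are placeholders to adapt, not recommended values; only the metric names come from the table above:

```yaml
groups:
  - name: trama-saga
    rules:
      - alert: SagaFailureRatioHigh
        expr: |
          sum(rate(saga_failed_total[5m]))
            /
          sum(rate(saga_enqueue_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Saga failure ratio above 5% for 10 minutes"
```

Pair a ratio alert like this with a queue-pressure alert comparing `saga_enqueue_total` and `saga_dequeue_total` rates so sustained backlog growth pages before latency does.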