Observability
Track Trama runtime health with Prometheus metrics, Grafana dashboards, and OpenTelemetry traces across queue claims, failures, latency, and Redis shard ownership. Use this page when you need to answer “is the orchestrator healthy?” and “where are workflows getting stuck?”
What you can answer with this data
Use Trama observability to answer whether workflows are flowing through the queue, whether failures are climbing, whether Redis shard ownership is healthy, and whether end-to-end latency is staying within expectations.
Metrics Catalog
| Metric | Type | Description |
|---|---|---|
| `saga_enqueue_total` | Counter | Queue ingress |
| `saga_dequeue_total` | Counter | Queue claims |
| `saga_processed_total` | Counter | Processed outcomes |
| `saga_failed_total` | Counter | Failures by reason |
| `saga_retried_total` | Counter | Retry scheduling rate |
| `saga_rate_limited_total` | Counter | Rate-limited executions |
| `saga_redis_claim_scans_total` | Counter | Redis shard claim scans performed by queue claimers |
| `saga_redis_active_pods` | Gauge | Healthy worker pods seen in Redis membership |
| `saga_redis_owned_shards` | Gauge | Virtual shards currently assigned to this worker |
| `saga_redis_membership_refresh_age_ms` | Gauge | Age of the last successful membership refresh |
| `saga_duration_seconds` | Histogram | End-to-end duration |
| `saga_step_duration_success_seconds` | Histogram | Successful step duration (v1) |
| `saga_node_duration_seconds` | Histogram | Per-node execution duration (v2) |
| `saga_callback_received_total` | Counter | Async callbacks delivered to the orchestrator |
| `saga_callback_rejected_total` | Counter | Async callbacks rejected (expired, bad signature, replay, attempt mismatch) |
| `saga_callback_timeout_total` | Counter | Async tasks that timed out waiting for a callback |
| `saga_switch_evaluations_total` | Counter | Switch node evaluations |
The Redis cluster rollout adds worker-coordination metrics so you can observe shard ownership, claim pressure, and membership freshness without inspecting Redis directly.
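As a rough illustration of how the three worker-coordination gauges can be read together, the sketch below checks one worker's scraped values against expectations. The `CoordinationSnapshot` type, the `coordination_warnings` helper, and the `expected_pods` and `membership_ttl_ms` parameters are hypothetical names introduced here; they are not part of Trama.

```python
from dataclasses import dataclass


@dataclass
class CoordinationSnapshot:
    """Gauge values scraped from a single worker (hypothetical container)."""
    active_pods: float      # saga_redis_active_pods
    owned_shards: float     # saga_redis_owned_shards
    refresh_age_ms: float   # saga_redis_membership_refresh_age_ms


def coordination_warnings(snap, expected_pods, membership_ttl_ms):
    """Return human-readable warnings; an empty list means nothing looks off.

    expected_pods and membership_ttl_ms are assumptions supplied by the
    operator, not values exposed by the metrics themselves.
    """
    warnings = []
    if snap.refresh_age_ms > membership_ttl_ms:
        warnings.append("membership refresh is older than the membership TTL")
    if snap.active_pods < expected_pods:
        warnings.append("fewer active pods than expected")
    if snap.owned_shards == 0:
        warnings.append("worker owns no shards")
    return warnings


if __name__ == "__main__":
    snap = CoordinationSnapshot(active_pods=3, owned_shards=0, refresh_age_ms=20_000)
    for warning in coordination_warnings(snap, expected_pods=3, membership_ttl_ms=15_000):
        print(warning)
```

The example values (three pods, a 15-second TTL) are placeholders; substitute the values from your own deployment.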
Definition-Level Labels
- `saga_name`, `saga_version` on counters and histograms.
- `phase` on queue and processing counters.
- `outcome` on processed, `reason` on failed and callback-rejected counters.
- `node_kind` (`task|switch`) and `mode` (`sync|async`) on `saga_node_duration_seconds`.
- `result` (`case|default`) on `saga_switch_evaluations_total`.
- Redis membership gauges are process-level metrics and do not add definition labels.
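To make the label shapes concrete, here is a minimal sketch that parses single sample lines in the Prometheus text exposition format into a name, a label dictionary, and a value. The sample lines and label values such as `checkout` are made up for illustration, and `parse_sample` is a hypothetical helper that handles only simple one-line samples (no timestamps or comment lines).

```python
import re

# metric_name{label="value",...} value   (the label block is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition-format sample into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


if __name__ == "__main__":
    # Illustrative lines only; real label values depend on your definitions.
    lines = [
        'saga_switch_evaluations_total{saga_name="checkout",saga_version="3",result="default"} 7',
        'saga_redis_owned_shards 16',
    ]
    for line in lines:
        print(parse_sample(line))
```

The second line shows a process-level gauge, which parses to an empty label dictionary.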
Grafana Dashboard
Import `grafana/trama-saga-dashboard.json`.
Suggested Alerts
- Failure ratio above threshold per definition.
- p95 saga duration SLO breach.
- Sustained enqueue/dequeue imbalance.
- Unexpected rate-limit growth.
- `saga_redis_membership_refresh_age_ms` above membership TTL.
- `saga_redis_active_pods` dropping unexpectedly during rollout.
- `saga_redis_owned_shards` equal to `0` on a worker pod that should be active.
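The first four alerts could be expressed as PromQL along the following lines. This is a sketch, not a shipped rule set: the helper names and the window lengths are assumptions, the failure ratio assumes `saga_processed_total` counts every processed saga including failed ones, and the p95 query relies on the standard `_bucket` series that Prometheus histograms expose. The Redis gauge alerts are plain threshold comparisons and are omitted.

```python
def failure_ratio_expr(window="5m"):
    """Failures divided by processed sagas, per definition."""
    failed = f"sum by (saga_name, saga_version) (rate(saga_failed_total[{window}]))"
    processed = f"sum by (saga_name, saga_version) (rate(saga_processed_total[{window}]))"
    return f"{failed} / {processed}"


def p95_duration_expr(window="5m"):
    """95th percentile of end-to-end saga duration, per saga name."""
    return (
        "histogram_quantile(0.95, sum by (le, saga_name) "
        f"(rate(saga_duration_seconds_bucket[{window}])))"
    )


def queue_imbalance_expr(window="15m"):
    """Positive when sagas are enqueued faster than they are claimed."""
    return (
        f"sum(rate(saga_enqueue_total[{window}])) - "
        f"sum(rate(saga_dequeue_total[{window}]))"
    )


def rate_limit_growth_expr(window="5m"):
    """Rate of rate-limited executions, per saga name."""
    return f"sum by (saga_name) (rate(saga_rate_limited_total[{window}]))"


if __name__ == "__main__":
    for build in (failure_ratio_expr, p95_duration_expr,
                  queue_imbalance_expr, rate_limit_growth_expr):
        print(build())
```

Each expression still needs a comparison and a `for` duration in the alert rule; pick thresholds from your own baseline rather than fixed numbers.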
Tracing
OpenTelemetry spans cover request handling and saga processing when telemetry is enabled.