Observability

Track Trama runtime health with Prometheus metrics, Grafana dashboards, and OpenTelemetry traces across queue claims, failures, latency, and Redis shard ownership. Use this page when you need to answer “is the orchestrator healthy?” and “where are workflows getting stuck?”

Best for SREs | Metrics and tracing | Production operations

Production: See where these signals fit into rollout and scaling decisions.
Configuration: Enable the telemetry and runtime settings behind the metrics.
API: Map runtime behavior back to workflow requests and retries.

What you can answer with this data

Use Trama observability to answer whether workflows are flowing through the queue, whether failures are climbing, whether Redis shard ownership is healthy, and whether end-to-end latency is staying within expectations.

Metrics Catalog

Metric	Type	Description
`saga_enqueue_total`	Counter	Queue ingress
`saga_dequeue_total`	Counter	Queue claims
`saga_processed_total`	Counter	Processed outcomes
`saga_failed_total`	Counter	Failures by reason
`saga_retried_total`	Counter	Retry scheduling rate
`saga_rate_limited_total`	Counter	Rate-limited executions
`saga_redis_claim_scans_total`	Counter	Redis shard claim scans performed by queue claimers
`saga_redis_active_pods`	Gauge	Healthy worker pods seen in Redis membership
`saga_redis_owned_shards`	Gauge	Virtual shards currently assigned to this worker
`saga_redis_membership_refresh_age_ms`	Gauge	Age of the last successful membership refresh, in milliseconds
`saga_duration_seconds`	Histogram	End-to-end duration
`saga_step_duration_success_seconds`	Histogram	Successful step duration (v1)
`saga_node_duration_seconds`	Histogram	Per-node execution duration (v2)
`saga_callback_received_total`	Counter	Async callbacks delivered to the orchestrator
`saga_callback_rejected_total`	Counter	Async callbacks rejected (expired, bad signature, replay, attempt mismatch)
`saga_callback_timeout_total`	Counter	Async tasks that timed out waiting for a callback
`saga_switch_evaluations_total`	Counter	Switch node evaluations

The Redis cluster rollout adds worker-coordination metrics so you can observe shard ownership, claim pressure, and membership freshness without inspecting Redis directly.
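As a rough illustration of how the three Redis gauges can be read together, the sketch below turns one worker's gauge readings into a list of warnings. The `RedisGauges` type, the `coordination_warnings` function, and the `membership_ttl_ms` and `expected_pods` parameters are hypothetical names introduced for this example; they are not part of Trama, and the thresholds should come from your own configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RedisGauges:
    """One worker's readings of the Redis coordination gauges (hypothetical container)."""
    active_pods: float                # saga_redis_active_pods
    owned_shards: float               # saga_redis_owned_shards
    membership_refresh_age_ms: float  # saga_redis_membership_refresh_age_ms


def coordination_warnings(
    gauges: RedisGauges,
    membership_ttl_ms: float,  # assumption: your configured membership TTL
    expected_pods: int,        # assumption: how many workers you expect to be healthy
) -> List[str]:
    """Return human-readable warnings for one worker's gauge readings."""
    warnings: List[str] = []
    if gauges.membership_refresh_age_ms > membership_ttl_ms:
        warnings.append("membership refresh is older than the membership TTL")
    if gauges.active_pods < expected_pods:
        warnings.append("fewer active pods than expected")
    if gauges.owned_shards == 0:
        warnings.append("worker owns no virtual shards")
    return warnings


if __name__ == "__main__":
    reading = RedisGauges(active_pods=2, owned_shards=0, membership_refresh_age_ms=20_000)
    for warning in coordination_warnings(reading, membership_ttl_ms=15_000, expected_pods=3):
        print(warning)
```

A worker that owns zero shards is not necessarily unhealthy (for example, it may have just started), so treat the last check as a prompt to investigate rather than a verdict.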

Definition-Level Labels

`saga_name` and `saga_version` on counters and histograms.
`phase` on queue and processing counters.
`outcome` on processed counters; `reason` on failed and callback-rejected counters.
`node_kind` (task | switch) and `mode` (sync | async) on `saga_node_duration_seconds`.
`result` (case | default) on `saga_switch_evaluations_total`.
Redis membership gauges are process-level metrics and carry no definition labels.
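To show how the definition-level labels can be used, here is a minimal sketch that reads lines in the Prometheus text exposition format and groups `saga_failed_total` by `saga_name`, `saga_version`, and `reason`. The parser is deliberately simplified (it is not a full exposition-format parser), and the sample label values such as `checkout` and `timeout` are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Simplified patterns: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one sample line into (metric name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def failures_by_definition(
    lines: Iterable[str],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Group saga_failed_total by (saga_name, saga_version), then by reason."""
    totals: Dict[Tuple[str, str], Dict[str, float]] = defaultdict(dict)
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, labels, value = parse_sample(stripped)
        if name != "saga_failed_total":
            continue
        key = (labels.get("saga_name", ""), labels.get("saga_version", ""))
        reason = labels.get("reason", "")
        totals[key][reason] = totals[key].get(reason, 0.0) + value
    return dict(totals)


if __name__ == "__main__":
    scrape = [
        '# TYPE saga_failed_total counter',
        'saga_failed_total{saga_name="checkout",saga_version="2",reason="timeout"} 7',
        'saga_failed_total{saga_name="checkout",saga_version="2",reason="http_5xx"} 3',
        'saga_redis_active_pods 3',
    ]
    print(failures_by_definition(scrape))
```

In practice you would usually do this grouping in PromQL or Grafana; the point here is only that every failure sample can be attributed to a definition and a reason, while the Redis gauges carry no such labels.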

Grafana Dashboard

Import grafana/trama-saga-dashboard.json.

Suggested Alerts

Failure ratio above threshold per definition.
p95 saga duration SLO breach.
Sustained enqueue/dequeue imbalance.
Unexpected rate-limit growth.
`saga_redis_membership_refresh_age_ms` above membership TTL.
`saga_redis_active_pods` dropping unexpectedly during rollout.
`saga_redis_owned_shards` equal to 0 on a worker pod that should be active.
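The alerts above could be expressed as PromQL along the following lines. This sketch builds candidate expressions as strings so the thresholds are easy to adjust. The alert names, the default thresholds, and the window are all assumptions chosen for illustration, and the failure ratio assumes that `saga_processed_total` counts every processed saga including failed ones, which this page does not confirm. The `_bucket` suffix is the standard Prometheus naming for histogram bucket series.

```python
from typing import Dict


def suggested_alert_exprs(
    window: str = "5m",                    # assumption: rate window
    failure_ratio_threshold: float = 0.05,  # assumption: 5% failures
    p95_slo_seconds: float = 30.0,          # assumption: your latency SLO
    imbalance_threshold: float = 1.0,       # assumption: sagas/second of excess ingress
    rate_limit_threshold: float = 1.0,      # assumption: rate-limited executions/second
    membership_ttl_ms: int = 15000,         # assumption: your membership TTL
) -> Dict[str, str]:
    """Return hypothetical alert names mapped to candidate PromQL expressions."""
    by_def = "sum by (saga_name, saga_version)"
    return {
        # Failure ratio per definition (assumes processed includes failures).
        "SagaFailureRatioHigh": (
            f"{by_def} (rate(saga_failed_total[{window}])) / "
            f"{by_def} (rate(saga_processed_total[{window}])) "
            f"> {failure_ratio_threshold}"
        ),
        # p95 end-to-end duration per definition.
        "SagaDurationP95SloBreach": (
            "histogram_quantile(0.95, sum by (le, saga_name, saga_version) "
            f"(rate(saga_duration_seconds_bucket[{window}]))) > {p95_slo_seconds}"
        ),
        # Ingress outpacing claims; pair with a `for:` clause so it must be sustained.
        "SagaQueueImbalance": (
            f"sum(rate(saga_enqueue_total[{window}])) - "
            f"sum(rate(saga_dequeue_total[{window}])) > {imbalance_threshold}"
        ),
        "SagaRateLimitGrowth": (
            f"sum(rate(saga_rate_limited_total[{window}])) > {rate_limit_threshold}"
        ),
        "SagaRedisMembershipStale": (
            f"saga_redis_membership_refresh_age_ms > {membership_ttl_ms}"
        ),
        "SagaRedisNoOwnedShards": "saga_redis_owned_shards == 0",
    }


if __name__ == "__main__":
    for name, expr in suggested_alert_exprs().items():
        print(f"{name}: {expr}")
```

The drop in `saga_redis_active_pods` during rollout is left out on purpose: what counts as "unexpected" depends on your replica count and rollout strategy, so that expression is best written against your own deployment.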

Tracing

OpenTelemetry spans cover request handling and saga processing when telemetry is enabled.