Observability
Metrics Catalog
| Metric | Type | Description |
|---|---|---|
| `saga_enqueue_total` | Counter | Queue ingress |
| `saga_dequeue_total` | Counter | Queue claims |
| `saga_processed_total` | Counter | Processed outcomes |
| `saga_failed_total` | Counter | Failures by reason |
| `saga_retried_total` | Counter | Retry scheduling rate |
| `saga_rate_limited_total` | Counter | Rate-limited executions |
| `saga_redis_claim_scans_total` | Counter | Redis shard claim scans performed by queue claimers. |
| `saga_redis_active_pods` | Gauge | Healthy worker pods seen in Redis membership. |
| `saga_redis_owned_shards` | Gauge | Virtual shards currently assigned to this worker. |
| `saga_redis_membership_refresh_age_ms` | Gauge | Age of the last successful membership refresh. |
| `saga_duration_seconds` | Histogram | End-to-end duration |
| `saga_step_duration_success_seconds` | Histogram | Successful step duration |
The Redis cluster rollout adds worker-coordination metrics so you can observe shard ownership, claim pressure, and membership freshness without inspecting Redis directly.
Definition-Level Labels
- `saga_name`, `saga_version` on counters and histograms.
- `phase` on queue and processing counters.
- `outcome` on processed, `reason` on failed counters.
- Redis membership gauges are process-level metrics and do not add definition labels.
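To illustrate how the definition-level labels can be used, the following is a minimal sketch that groups counter samples by `(saga_name, saga_version)` from Prometheus-style text output. The sample scrape, the label values (`checkout`, `execute`, `succeeded`, `timeout`), and the helper names `parse_samples` and `sum_by_definition` are assumptions made for illustration; they are not taken from the project. The parser is deliberately simplified and ignores timestamps and label values that contain `}`.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; label values and counts are illustrative only.
SAMPLE_SCRAPE = """\
saga_processed_total{saga_name="checkout",saga_version="2",phase="execute",outcome="succeeded"} 95
saga_processed_total{saga_name="checkout",saga_version="2",phase="execute",outcome="failed"} 5
saga_failed_total{saga_name="checkout",saga_version="2",phase="execute",reason="timeout"} 5
saga_redis_owned_shards 8
"""

# Simplified matchers: metric name, optional {labels}, then a numeric value.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def sum_by_definition(text, metric):
    """Sum one metric's samples per (saga_name, saga_version) pair."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == metric:
            key = (labels.get("saga_name"), labels.get("saga_version"))
            totals[key] += value
    return dict(totals)
```

Process-level gauges such as `saga_redis_owned_shards` carry no definition labels, so in this sketch they group under the key `(None, None)`.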
Grafana Dashboard
Import `grafana/trama-saga-dashboard.json` into Grafana.
Suggested Alerts
- Failure ratio above threshold per definition.
- p95 saga duration SLO breach.
- Sustained enqueue/dequeue imbalance.
- Unexpected rate-limit growth.
- `saga_redis_membership_refresh_age_ms` above membership TTL.
- `saga_redis_active_pods` dropping unexpectedly during rollout.
- `saga_redis_owned_shards` equal to `0` on a worker pod that should be active.
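The alert conditions above can be sketched as plain checks over already-collected metric values. This is a hedged illustration, not the project's alerting rules: the function names, the `WorkerSnapshot` type, the alert identifiers, the definition of failure ratio as failed over processed, and the `membership_ttl_ms` and `expected_pods` parameters are all assumptions. The quantile estimate uses linear interpolation inside a cumulative histogram bucket, a common approach for Prometheus-style histograms.

```python
from dataclasses import dataclass


def failure_ratio(failed, processed):
    """Assumed definition: failed count divided by processed count."""
    return failed / processed if processed > 0 else 0.0


def estimate_quantile(q, buckets):
    """Estimate a quantile from sorted (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if bound == float("inf") or in_bucket == 0:
                return prev_bound  # cannot interpolate; fall back to lower edge
            # Linear interpolation between the bucket's lower and upper bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


@dataclass
class WorkerSnapshot:
    """Hypothetical per-worker reading of the Redis coordination gauges."""
    membership_refresh_age_ms: float
    active_pods: int
    owned_shards: int


def redis_alerts(snapshot, membership_ttl_ms, expected_pods, should_be_active=True):
    """Return the names of Redis coordination alerts that would fire."""
    alerts = []
    if snapshot.membership_refresh_age_ms > membership_ttl_ms:
        alerts.append("membership_stale")
    if snapshot.active_pods < expected_pods:
        alerts.append("active_pods_low")
    if should_be_active and snapshot.owned_shards == 0:
        alerts.append("no_owned_shards")
    return alerts
```

For example, with hypothetical duration buckets `[(1.0, 50), (5.0, 90), (10.0, 100), (float("inf"), 100)]`, `estimate_quantile(0.95, ...)` interpolates inside the 5 to 10 second bucket. In practice these thresholds would normally live in the alerting system rather than in application code.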
Tracing
OpenTelemetry spans cover request handling and saga processing when telemetry is enabled.