Level 4 · 35 min

Sagas

The Saga pattern manages distributed transactions across microservices. Since microservices have separate databases, ACID transactions spanning multiple services are impossible without 2PC (which is impractical at scale). A Saga is a sequence of local transactions, each publishing events to trigger the next step, with compensating transactions for rollback.

Saga Definition and Motivation

In a monolith, a business transaction (place order → reserve stock → charge payment → ship) can run inside a single ACID database transaction. In microservices, each service owns its database, so no cross-service transaction is available. Two-Phase Commit (2PC) restores atomicity at the cost of availability: participants hold their locks from PREPARE until the coordinator's decision arrives, so if the coordinator crashes after PREPARE but before COMMIT, every participant stays blocked with its locks held until the coordinator recovers.

Sagas replace 2PC: each step is a local transaction that commits immediately. If step N fails, steps 1..N-1 are compensated, that is, undone through business operations rather than a database rollback.

Saga steps fall into three categories that determine rollback behavior (the terminology comes from Chris Richardson's Microservices Patterns):
(1) Compensatable steps: can be semantically undone (ReserveInventory → CancelReservation). These are the steps before the pivot.
(2) Pivot transaction: the point of no return. Once it commits, the saga can no longer be rolled back and must run to completion. Charging a credit card is often the pivot.
(3) Retriable steps: steps after the pivot that must eventually succeed (e.g., sending a confirmation email). They are retried with backoff until they succeed and are never compensated.

Identifying the pivot correctly is one of the most important design decisions in a saga.
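
The compensation rule above (if step N fails, undo steps 1..N-1 in reverse order) can be sketched in plain Java. This is a minimal in-memory illustration, not a production orchestrator: `SimpleSaga` and its methods are hypothetical names, there is no persistence or retry, and it covers only the compensatable steps before the pivot.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Minimal in-memory saga runner (illustrative only: no persistence, no retries). */
final class SimpleSaga {
    /** One compensatable step: a forward action plus its business-level undo. */
    private static final class Step {
        final String name;
        final Runnable action;
        final Runnable compensation;

        Step(String name, Runnable action, Runnable compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    private final List<Step> steps = new ArrayList<>();
    private String failedStep; // name of the step that failed, or null

    SimpleSaga step(String name, Runnable action, Runnable compensation) {
        steps.add(new Step(name, action, compensation));
        return this;
    }

    String failedStep() {
        return failedStep;
    }

    /** Runs steps in order; on failure, compensates completed steps in reverse. */
    boolean run() {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step s : steps) {
            try {
                s.action.run();
                completed.push(s); // most recently completed step ends up on top
            } catch (RuntimeException e) {
                failedStep = s.name;
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run(); // undo N-1, N-2, ..., 1
                }
                return false;
            }
        }
        return true;
    }
}
```

The failing step itself is not compensated, because its local transaction never committed. A real orchestrator would also persist `completed` so that compensation can resume after a crash.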

Choreography vs Orchestration

Choreography: each service listens for events and publishes new events when its step completes. There is no central coordinator. The OrderService publishes OrderPlaced → InventoryService listens, reserves stock, publishes StockReserved → PaymentService listens, charges the card, publishes PaymentCharged. Advantages: services are loosely coupled and no orchestrator can become a bottleneck. Disadvantages: the flow is implicit, so understanding the full saga requires reading every participating service. Testing is hard because it requires integrating all services. Compensations are also event-driven, so rollback logic is spread across services with no single view of saga state, and debugging a stuck saga means correlating events across multiple service logs.

Orchestration: a central Saga Orchestrator (a durable, persistent state machine) sends commands to services and waits for success or failure replies. Temporal.io is a widely used implementation: the orchestrator is a Workflow function that runs durably. It can pause for days waiting for an asynchronous response, survive process restarts, and replay its history to reconstruct state after a crash without losing its position. Temporal's Event History records every Activity input and result, and workflow code must be deterministic so that replay reproduces the same decisions. Activity functions are the individual steps; each Activity has a retry policy, timeouts, and optional heartbeating. The orchestrator calls Activities, handles failures, and runs compensations in ordinary code (no state machine DSL needed).

Prefer orchestration for sagas with more than 3 steps, complex compensation logic, or strict observability requirements. Choreography is fine for simple 2-step reactive flows where coupling to an orchestrator is genuinely undesirable.

Key insight from Designing Data-Intensive Applications: the saga pattern avoids 2PC's availability problem by replacing distributed locking with local transactions and compensating operations. Kleppmann notes that 2PC's coordinator is a single point of failure: if it crashes after sending PREPARE but before sending the commit or abort decision, the participants are stuck waiting with their locks held, a state called in doubt or uncertain. Sagas trade atomicity for availability: each local transaction commits immediately and is visible to other services, so compensation is a business-level undo (RefundPayment) rather than a database ROLLBACK. The practical implication is that saga state must be persisted durably around every command. Because persisting state and dispatching a command are not one atomic action, an orchestrator that crashes between the two will dispatch the command again on recovery, so all saga participants must be idempotent.
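
To make the choreographed flow concrete, here is a small sketch of the OrderPlaced → StockReserved → PaymentCharged chain, including an event-driven compensation. It rests on several assumptions: `EventBus` is a hypothetical in-process stand-in for a real broker, `ChoreographyDemo.wire` is an invented helper, and the `PaymentFailed` and `StockReleased` event names are illustrative additions that the text above does not define.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical in-process event bus standing in for a real message broker. */
final class EventBus {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();
    final List<String> published = new ArrayList<>(); // event types in publish order

    void subscribe(String eventType, Consumer<String> handler) {
        subscribers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(handler);
    }

    void publish(String eventType, String orderId) {
        published.add(eventType);
        for (Consumer<String> handler : subscribers.getOrDefault(eventType, List.of())) {
            handler.accept(orderId); // synchronous delivery keeps the sketch simple
        }
    }
}

final class ChoreographyDemo {
    /** Wires the participating "services"; no component knows the whole flow. */
    static EventBus wire(boolean paymentSucceeds) {
        EventBus bus = new EventBus();
        // InventoryService: reserve stock when an order is placed
        bus.subscribe("OrderPlaced", id -> bus.publish("StockReserved", id));
        // PaymentService: charge the card once stock is reserved
        bus.subscribe("StockReserved",
            id -> bus.publish(paymentSucceeds ? "PaymentCharged" : "PaymentFailed", id));
        // InventoryService again: compensation is just another event handler
        bus.subscribe("PaymentFailed", id -> bus.publish("StockReleased", id));
        return bus;
    }
}
```

Note that the saga's shape exists only as the sum of the three subscriptions. That is the implicit-flow drawback described above: nothing in this code names the saga or tracks its state.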

Compensating Transactions and Failure Handling

Compensating transactions are business-level undos, not database ROLLBACKs. Every step with side effects needs one: ReserveInventory → ReleaseInventory, ChargePayment → RefundPayment. Compensations are not exact reversals: a charged credit card can be refunded, but the charge still appeared briefly on the statement.

Idempotency is mandatory because the orchestrator retries both forward steps and compensations on timeout or transient failure. Each step handler should implement it, for example by recording `saga_id + step_name` under a unique key in a processed_saga_steps table, in the same local transaction as the step's effects, and skipping the effects when the key already exists.

Failure scenarios and handling:
(1) Forward step fails (retryable error) → retry with exponential backoff + jitter, up to `maxAttempts`.
(2) Forward step fails (non-retryable, before the pivot) → begin the compensation sequence from the current step backward.
(3) Compensation fails → retry the compensation; compensations must eventually succeed. If it still fails after N retries, the saga enters a terminal `COMPENSATION_FAILED` state. Page on-call immediately, since this requires human intervention (manual data fix + resume).
(4) Orchestrator crashes mid-saga → on restart, reload saga state from the DB and continue from the last confirmed step. This is where Temporal.io is valuable: it handles recovery automatically via workflow history replay.

Timeout hierarchy in Temporal: `scheduleToCloseTimeout` (total deadline for the Activity, including queue time and all retries), `startToCloseTimeout` (maximum duration of a single attempt, from start to completion), and `heartbeatTimeout` (for long-running activities: the Activity must heartbeat periodically, in the Java SDK via `Activity.getExecutionContext().heartbeat(details)`, or the attempt is considered dead).
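
The interplay between retries and idempotency can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: `IdempotentStepHandler` and `Retry` are hypothetical names, an in-memory set stands in for the processed_saga_steps table, and the jitter strategy shown ("full jitter", a uniform delay between 0 and the exponential ceiling) is one common choice that the text above does not prescribe.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

/** Applies a step's effect at most once per (sagaId, stepName). */
final class IdempotentStepHandler {
    // Stands in for a processed_saga_steps table with a unique key on
    // saga_id + step_name; in a real service the key insert and the effect
    // would share one local database transaction.
    private final Set<String> processed = ConcurrentHashMap.newKeySet();
    final AtomicInteger effectsApplied = new AtomicInteger();

    /** Returns true if the effect ran now, false for a duplicate delivery. */
    boolean handle(String sagaId, String stepName) {
        if (!processed.add(sagaId + ":" + stepName)) {
            return false; // already processed: skip the side effect
        }
        effectsApplied.incrementAndGet(); // placeholder for the real side effect
        return true;
    }
}

final class Retry {
    /** Full-jitter delay: uniform in [0, min(cap, base * 2^attempt)], attempt from 0. */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long exponential = baseMillis << Math.min(attempt, 20); // clamp the shift
        long ceiling = Math.min(capMillis, exponential);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    /** Runs the step until it succeeds; returns attempts used, or -1 if exhausted. */
    static int run(Runnable step, int maxAttempts, long baseMillis, long capMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                step.run();
                return attempt + 1;
            } catch (RuntimeException transientFailure) {
                if (attempt + 1 == maxAttempts) {
                    break; // out of attempts: the caller starts compensation
                }
                try {
                    Thread.sleep(delayMillis(attempt, baseMillis, capMillis));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

Because `handle` ignores duplicates, the caller can safely wrap it in `Retry.run` even when it cannot tell whether a timed-out attempt actually took effect. A return value of -1 corresponds to scenario (2) for a forward step and to `COMPENSATION_FAILED` for a compensation.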

Key Takeaways

Saga is the alternative to 2PC — local transactions + compensating transactions instead of distributed locking.
Prefer orchestration for multi-step sagas — the flow is explicit, testable, and observable in a single place.
All saga steps must be idempotent — the orchestrator will retry on failure, and the same step may execute multiple times.

Code example

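// Temporal.io orchestrated saga (Java SDK): durable and resumable.
// OrderCommand, OrderResult, and the *Activity interfaces (annotated with
// @ActivityInterface) are assumed to be defined elsewhere in the application.
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.failure.ActivityFailure;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@WorkflowInterface
interface OrderSagaWorkflow {
  @WorkflowMethod
  OrderResult processOrder(OrderCommand cmd);
}

class OrderSagaWorkflowImpl implements OrderSagaWorkflow {
  // Steps up to the pivot: bounded retries, then the failure reaches the workflow
  private static final ActivityOptions BOUNDED = ActivityOptions.newBuilder()
      .setStartToCloseTimeout(Duration.ofSeconds(30))
      .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(5).build())
      .build();
  // Steps after the pivot: Temporal's default retry policy has no attempt limit
  private static final ActivityOptions UNBOUNDED = ActivityOptions.newBuilder()
      .setStartToCloseTimeout(Duration.ofSeconds(30))
      .build();

  private final InventoryActivity inventory =
      Workflow.newActivityStub(InventoryActivity.class, BOUNDED);
  private final PaymentActivity payment =
      Workflow.newActivityStub(PaymentActivity.class, BOUNDED);
  private final NotificationActivity notification =
      Workflow.newActivityStub(NotificationActivity.class, UNBOUNDED);

  @Override
  public OrderResult processOrder(OrderCommand cmd) {
    // Compensatable step 1 (nothing to undo if this one fails)
    inventory.reserve(cmd.getOrderId(), cmd.getItems());
    try {
      // PIVOT: once charged, the saga cannot roll back
      payment.charge(cmd.getOrderId(), cmd.getTotalAmount());
    } catch (ActivityFailure e) {
      // Activity errors reach workflow code wrapped in ActivityFailure.
      // Compensate step 1 because the pivot did not commit.
      inventory.release(cmd.getOrderId());  // retried per the stub's policy
      return OrderResult.failed(e.getMessage());
    }
    // Retriable step after the pivot: retried until it succeeds, never compensated
    notification.sendConfirmation(cmd.getOrderId());
    return OrderResult.completed(cmd.getOrderId());
  }
}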