Saga Pattern
Introduction
The Saga Pattern is a distributed transaction management strategy that breaks a long-lived transaction into a sequence of smaller, local transactions, each with a corresponding compensating action to undo its effects if a later step fails. In microservices architectures where traditional two-phase commit (2PC) is impractical due to service autonomy and network unreliability, sagas provide a way to maintain data consistency across service boundaries without distributed locks. Understanding this pattern is essential for anyone designing resilient, eventually consistent distributed systems.
Core Concepts
The Problem with Distributed Transactions
In a monolithic application, a single database transaction can atomically update multiple tables. In a microservices architecture, each service owns its own database, making ACID transactions across services impossible without coordination protocols like 2PC — which introduce tight coupling, blocking, and a single point of failure.
The Saga Pattern solves this by replacing a single distributed transaction with a chain of local transactions. Each local transaction updates data within one service and publishes an event or sends a command to trigger the next step.
Saga Guarantees
Sagas provide ACD guarantees (no Isolation):
- Atomicity: All steps complete or all compensations run
- Consistency: The system moves from one valid state to another
- Durability: Each local transaction is durable
The lack of Isolation means intermediate states are visible to other transactions, so a concurrent saga can read or overwrite data that a later compensation will undo. This requires careful handling through countermeasures such as semantic locks, commutative updates, and rereading a value before updating it.
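To make one of these countermeasures concrete, here is a minimal sketch of commutative updates. The `Account` class and its `debit`/`credit` methods are hypothetical names introduced for illustration, not part of any library. The idea is that when every update is a pure addition, a compensation (a credit that undoes a debit) gives the same result regardless of what another saga did in between.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CommutativeUpdateDemo {

    // Hypothetical account whose only updates are atomic additions
    static final class Account {
        private final AtomicLong balanceCents;

        Account(long initialCents) {
            this.balanceCents = new AtomicLong(initialCents);
        }

        // debit and credit are both additions, so they can be applied in any order
        void debit(long cents)  { balanceCents.addAndGet(-cents); }
        void credit(long cents) { balanceCents.addAndGet(cents); }

        long balance() { return balanceCents.get(); }
    }

    // Saga A debits 30, saga B credits 50 in between, then saga A compensates
    static long compensateAfterInterleaving() {
        Account account = new Account(100);
        account.debit(30);   // saga A, forward step
        account.credit(50);  // saga B runs while saga A is still in flight
        account.credit(30);  // saga A compensates its own debit
        return account.balance();
    }

    // The same three operations in a different order
    static long compensateBeforeInterleaving() {
        Account account = new Account(100);
        account.debit(30);
        account.credit(30);
        account.credit(50);
        return account.balance();
    }

    public static void main(String[] args) {
        System.out.println(compensateAfterInterleaving());   // 150
        System.out.println(compensateBeforeInterleaving());  // 150
    }
}
```

By contrast, a compensation that restored a saved snapshot (setting the balance back to 100) would silently discard saga B's credit.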
Two Coordination Strategies
There are two primary approaches to coordinating a saga:
- Choreography: Each service listens for events from other services and decides locally whether to act. There is no central controller.
- Orchestration: A central orchestrator (saga execution coordinator) tells each participant what to do and when, managing the overall flow.
Forward Recovery vs. Backward Recovery
When a step in a saga fails, the system has two options:
- Backward Recovery: Execute compensating transactions for all previously completed steps (rollback semantics)
- Forward Recovery: Retry the failed step until it succeeds (requires idempotent operations)
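Forward recovery can be sketched as a bounded retry loop. This is a minimal illustration, not a production retry policy: `retryUntilSuccess` and the flaky step in `main` are hypothetical, and the attempt limit is an assumption added so the loop cannot spin forever.

```java
import java.util.function.BooleanSupplier;

final class ForwardRecoveryDemo {

    // Retries the step until it reports success or maxAttempts is reached.
    // The step must be idempotent: an attempt may have taken effect even
    // though the caller observed a failure.
    static boolean retryUntilSuccess(BooleanSupplier step, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            boolean ok;
            try {
                ok = step.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // treat exceptions as transient failures
            }
            System.out.printf("attempt %d: %s%n", attempt, ok ? "success" : "failed");
            if (ok) {
                return true;
            }
        }
        return false; // caller falls back to backward recovery or escalates
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical step that fails twice and succeeds on the third call
        boolean done = retryUntilSuccess(() -> ++calls[0] >= 3, 5);
        System.out.println("completed: " + done + " after " + calls[0] + " calls");
    }
}
```

When the retry budget runs out, the only remaining options are backward recovery (if the completed steps are still compensatable) or manual intervention.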
Step Classification
Saga steps fall into three categories:
- Compensatable Transactions: Steps that can be undone by a compensating transaction (e.g., reserve inventory → cancel reservation)
- Pivot Transaction: The go/no-go decision point. If it succeeds, the saga will run to completion. If it fails, compensation begins.
- Retriable Transactions: Steps after the pivot that are guaranteed to eventually succeed (e.g., sending a confirmation email)
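The ordering implied by these three categories (compensatable steps first, then the pivot, then retriable steps) can be checked mechanically. The sketch below is illustrative only: `StepType` and `isWellFormed` are hypothetical names, and treating the payment charge as the pivot of the order flow is an assumption made for the example.

```java
import java.util.List;

final class SagaStepOrderDemo {

    enum StepType { COMPENSATABLE, PIVOT, RETRIABLE }

    // Accepts sequences of the form: COMPENSATABLE*, at most one PIVOT, RETRIABLE*
    static boolean isWellFormed(List<StepType> steps) {
        boolean pastCompensatable = false; // true once a pivot or retriable step was seen
        for (StepType type : steps) {
            switch (type) {
                case COMPENSATABLE:
                    if (pastCompensatable) return false; // nothing after the pivot may need undoing
                    break;
                case PIVOT:
                    if (pastCompensatable) return false; // only one pivot, and never after a retriable step
                    pastCompensatable = true;
                    break;
                case RETRIABLE:
                    pastCompensatable = true;
                    break;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // reserve inventory -> charge payment (assumed pivot) -> send confirmation email
        List<StepType> orderFlow =
            List.of(StepType.COMPENSATABLE, StepType.PIVOT, StepType.RETRIABLE);
        // A compensatable step placed after the pivot cannot be rolled back safely
        List<StepType> broken =
            List.of(StepType.PIVOT, StepType.COMPENSATABLE);

        System.out.println(isWellFormed(orderFlow)); // true
        System.out.println(isWellFormed(broken));    // false
    }
}
```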
Implementation: Choreography-Based Saga
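In the choreography approach, each service publishes domain events that other services subscribe to. Let's model an e-commerce order flow. On the happy path the events chain forward: OrderCreated, then StockReserved, then PaymentProcessed, then OrderShipped, at which point the order is confirmed.
When payment fails, compensation flows backward: the payment service publishes PaymentFailed, the inventory service releases the reserved stock and publishes StockReleased, and the order service marks the order as cancelled.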
Java Implementation — Choreography
```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Consumer;

// Simple in-memory event bus for demonstration
public class EventBus {
    private final Map<String, List<Consumer<Map<String, Object>>>> subscribers = new ConcurrentHashMap<>();

    public void subscribe(String eventType, Consumer<Map<String, Object>> handler) {
        subscribers.computeIfAbsent(eventType, k -> new CopyOnWriteArrayList<>()).add(handler);
    }

    public void publish(String eventType, Map<String, Object> payload) {
        System.out.printf("[EventBus] Publishing: %s -> %s%n", eventType, payload);
        List<Consumer<Map<String, Object>>> handlers = subscribers.getOrDefault(eventType, List.of());
        for (Consumer<Map<String, Object>> handler : handlers) {
            // Deliver asynchronously so a publisher never blocks on its subscribers
            CompletableFuture.runAsync(() -> handler.accept(payload));
        }
    }
}
```

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class OrderService {
    private final EventBus eventBus;
    private final Map<String, String> orders = new ConcurrentHashMap<>();

    public OrderService(EventBus eventBus) {
        this.eventBus = eventBus;
        // Listen for saga completion/failure events
        eventBus.subscribe("OrderShipped", this::onOrderShipped);
        eventBus.subscribe("StockReleased", this::onStockReleased);
        eventBus.subscribe("StockReservationFailed", this::onReservationFailed);
    }

    // First local transaction of the saga: record the order, then announce it
    public String createOrder(String customerId, String productId, int quantity) {
        String orderId = UUID.randomUUID().toString().substring(0, 8);
        orders.put(orderId, "PENDING");
        System.out.printf("[OrderService] Order %s created (PENDING)%n", orderId);
        Map<String, Object> event = Map.of(
            "orderId", orderId,
            "customerId", customerId,
            "productId", productId,
            "quantity", quantity
        );
        eventBus.publish("OrderCreated", event);
        return orderId;
    }

    private void onOrderShipped(Map<String, Object> event) {
        String orderId = (String) event.get("orderId");
        orders.put(orderId, "CONFIRMED");
        System.out.printf("[OrderService] Order %s CONFIRMED%n", orderId);
    }

    private void onStockReleased(Map<String, Object> event) {
        String orderId = (String) event.get("orderId");
        orders.put(orderId, "CANCELLED");
        System.out.printf("[OrderService] Order %s CANCELLED (compensation complete)%n", orderId);
    }

    private void onReservationFailed(Map<String, Object> event) {
        String orderId = (String) event.get("orderId");
        orders.put(orderId, "CANCELLED");
        System.out.printf("[OrderService] Order %s CANCELLED (no stock)%n", orderId);
    }

    public String getStatus(String orderId) {
        return orders.getOrDefault(orderId, "NOT_FOUND");
    }
}
```

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class InventoryService {
    private final EventBus eventBus;
    private final Map<String, Integer> stock = new ConcurrentHashMap<>();
    // orderId -> (productId, quantity), so compensation restores exactly what was reserved
    private final Map<String, Map.Entry<String, Integer>> reservations = new ConcurrentHashMap<>();

    public InventoryService(EventBus eventBus) {
        this.eventBus = eventBus;
        stock.put("PROD-001", 10);
        eventBus.subscribe("OrderCreated", this::onOrderCreated);
        eventBus.subscribe("PaymentFailed", this::onPaymentFailed);
    }

    // synchronized: handlers run on pool threads, and check-then-reserve must be atomic
    private synchronized void onOrderCreated(Map<String, Object> event) {
        String orderId = (String) event.get("orderId");
        String productId = (String) event.get("productId");
        int quantity = (int) event.get("quantity");
        int available = stock.getOrDefault(productId, 0);
        if (available >= quantity) {
            stock.put(productId, available - quantity);
            reservations.put(orderId, Map.entry(productId, quantity));
            System.out.printf("[InventoryService] Stock reserved for order %s%n", orderId);
            eventBus.publish("StockReserved", Map.of(
                "orderId", orderId,
                "customerId", event.get("customerId"),
                "amount", quantity * 25.0 // fixed demo price of 25.0 per unit
            ));
        } else {
            System.out.printf("[InventoryService] Insufficient stock for order %s%n", orderId);
            eventBus.publish("StockReservationFailed", Map.of("orderId", orderId));
        }
    }

    // Compensating transaction: give back the reserved quantity.
    // remove() returns null on a duplicate PaymentFailed, which makes this idempotent.
    private synchronized void onPaymentFailed(Map<String, Object> event) {
        String orderId = (String) event.get("orderId");
        Map.Entry<String, Integer> reserved = reservations.remove(orderId);
        if (reserved != null) {
            stock.merge(reserved.getKey(), reserved.getValue(), Integer::sum);
            System.out.printf("[InventoryService] COMPENSATING: Stock released for order %s%n", orderId);
            eventBus.publish("StockReleased", Map.of("orderId", orderId));
        }
    }
}
```

```java
import java.util.Map;

public class PaymentService {
    private final EventBus eventBus;
    private final boolean simulateFailure;

    public PaymentService(EventBus eventBus, boolean simulateFailure) {
        this.eventBus = eventBus;
        this.simulateFailure = simulateFailure;
        eventBus.subscribe("StockReserved", this::onStockReserved);
    }

    private void onStockReserved(Map<String, Object> event) {
        String orderId = (String) event.get("orderId");
        double amount = (double) event.get("amount");
        if (simulateFailure) {
            System.out.printf("[PaymentService] Payment FAILED for order %s ($%.2f)%n", orderId, amount);
            eventBus.publish("PaymentFailed", Map.of("orderId", orderId));
        } else {
            System.out.printf("[PaymentService] Payment processed for order %s ($%.2f)%n", orderId, amount);
            eventBus.publish("PaymentProcessed", Map.of("orderId", orderId));
        }
    }
}
```

```java
public class ChoreographySagaDemo {
    public static void main(String[] args) throws InterruptedException {
        System.out.println("=== Happy Path ===");
        EventBus bus1 = new EventBus();
        OrderService orderService1 = new OrderService(bus1);
        new InventoryService(bus1);
        new PaymentService(bus1, false);
        // Simplified shipping handler
        bus1.subscribe("PaymentProcessed", event -> {
            System.out.printf("[ShippingService] Shipment scheduled for order %s%n", event.get("orderId"));
            bus1.publish("OrderShipped", event);
        });
        String orderId1 = orderService1.createOrder("CUST-1", "PROD-001", 2);
        Thread.sleep(1000); // crude wait for the asynchronous handlers to finish
        System.out.printf("Final status: %s%n", orderService1.getStatus(orderId1));

        System.out.println("\n=== Failure Path (Payment Fails) ===");
        EventBus bus2 = new EventBus();
        OrderService orderService2 = new OrderService(bus2);
        new InventoryService(bus2);
        new PaymentService(bus2, true); // simulate failure
        String orderId2 = orderService2.createOrder("CUST-2", "PROD-001", 1);
        Thread.sleep(1000);
        System.out.printf("Final status: %s%n", orderService2.getStatus(orderId2));
    }
}
```

Implementation: Orchestration-Based Saga
The orchestrator approach centralizes saga logic in a single component that drives the workflow.
```java
import java.util.*;
import java.util.function.Function;

public class SagaOrchestrator {
    public enum StepStatus { SUCCESS, FAILURE }

    // A forward action paired with the compensation that undoes it
    public static class SagaStep {
        private final String name;
        private final Function<Map<String, Object>, StepStatus> action;
        private final Function<Map<String, Object>, StepStatus> compensation;

        public SagaStep(String name,
                        Function<Map<String, Object>, StepStatus> action,
                        Function<Map<String, Object>, StepStatus> compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    private final List<SagaStep> steps = new ArrayList<>();
    private final List<SagaStep> completedSteps = new ArrayList<>();

    public SagaOrchestrator addStep(String name,
                                    Function<Map<String, Object>, StepStatus> action,
                                    Function<Map<String, Object>, StepStatus> compensation) {
        steps.add(new SagaStep(name, action, compensation));
        return this;
    }

    public boolean execute(Map<String, Object> context) {
        System.out.println("\n[Orchestrator] Starting saga execution...");
        completedSteps.clear(); // allow the same orchestrator to be executed again
        for (SagaStep step : steps) {
            System.out.printf("[Orchestrator] Executing step: %s%n", step.name);
            StepStatus result = step.action.apply(context);
            if (result == StepStatus.SUCCESS) {
                System.out.printf("[Orchestrator] Step '%s' succeeded%n", step.name);
                completedSteps.add(step);
            } else {
                System.out.printf("[Orchestrator] Step '%s' FAILED, starting compensation%n", step.name);
                compensate(context);
                return false;
            }
        }
        System.out.println("[Orchestrator] Saga completed successfully!");
        return true;
    }

    private void compensate(Map<String, Object> context) {
        System.out.println("[Orchestrator] Running compensating transactions...");
        boolean allSucceeded = true;
        // Compensate in reverse order; the failed step itself made no change to undo
        ListIterator<SagaStep> it = completedSteps.listIterator(completedSteps.size());
        while (it.hasPrevious()) {
            SagaStep step = it.previous();
            System.out.printf("[Orchestrator] Compensating: %s%n", step.name);
            if (step.compensation.apply(context) != StepStatus.SUCCESS) {
                allSucceeded = false;
                System.out.printf("[Orchestrator] Compensation for '%s' FAILED, needs retry or manual intervention%n", step.name);
            }
        }
        System.out.println(allSucceeded
            ? "[Orchestrator] All compensations complete."
            : "[Orchestrator] Compensation finished with failures.");
    }
}
```

```java
import java.util.*;

public class OrchestrationSagaDemo {
    static SagaOrchestrator.StepStatus createOrder(Map<String, Object> ctx) {
        String orderId = UUID.randomUUID().toString().substring(0, 8);
        ctx.put("orderId", orderId);
        ctx.put("orderStatus", "CREATED");
        System.out.printf(" [OrderService] Order %s created%n", orderId);
        return SagaOrchestrator.StepStatus.SUCCESS;
    }

    static SagaOrchestrator.StepStatus cancelOrder(Map<String, Object> ctx) {
        ctx.put("orderStatus", "CANCELLED");
        System.out.printf(" [OrderService] Order %s cancelled%n", ctx.get("orderId"));
        return SagaOrchestrator.StepStatus.SUCCESS;
    }

    static SagaOrchestrator.StepStatus reserveStock(Map<String, Object> ctx) {
        System.out.printf(" [InventoryService] Stock reserved for order %s%n", ctx.get("orderId"));
        ctx.put("stockReserved", true);
        return SagaOrchestrator.StepStatus.SUCCESS;
    }

    static SagaOrchestrator.StepStatus releaseStock(Map<String, Object> ctx) {
        System.out.printf(" [InventoryService] Stock released for order %s%n", ctx.get("orderId"));
        ctx.put("stockReserved", false);
        return SagaOrchestrator.StepStatus.SUCCESS;
    }

    static SagaOrchestrator.StepStatus processPayment(Map<String, Object> ctx) {
        boolean shouldFail = (boolean) ctx.getOrDefault("simulatePaymentFailure", false);
        if (shouldFail) {
            System.out.printf(" [PaymentService] Payment DECLINED for order %s%n", ctx.get("orderId"));
            return SagaOrchestrator.StepStatus.FAILURE;
        }
        System.out.printf(" [PaymentService] Payment charged for order %s%n", ctx.get("orderId"));
        ctx.put("paymentProcessed", true);
        return SagaOrchestrator.StepStatus.SUCCESS;
    }

    static SagaOrchestrator.StepStatus refundPayment(Map<String, Object> ctx) {
        System.out.printf(" [PaymentService] Payment refunded for order %s%n", ctx.get("orderId"));
        ctx.put("paymentProcessed", false);
        return SagaOrchestrator.StepStatus.SUCCESS;
    }

    static SagaOrchestrator.StepStatus scheduleShipment(Map<String, Object> ctx) {
        System.out.printf(" [ShippingService] Shipment scheduled for order %s%n", ctx.get("orderId"));
        return SagaOrchestrator.StepStatus.SUCCESS;
    }

    static SagaOrchestrator.StepStatus cancelShipment(Map<String, Object> ctx) {
        System.out.printf(" [ShippingService] Shipment cancelled for order %s%n", ctx.get("orderId"));
        return SagaOrchestrator.StepStatus.SUCCESS;
    }

    // Both demo runs use the same four-step saga definition
    static SagaOrchestrator newOrderSaga() {
        return new SagaOrchestrator()
            .addStep("Create Order", OrchestrationSagaDemo::createOrder, OrchestrationSagaDemo::cancelOrder)
            .addStep("Reserve Stock", OrchestrationSagaDemo::reserveStock, OrchestrationSagaDemo::releaseStock)
            .addStep("Process Payment", OrchestrationSagaDemo::processPayment, OrchestrationSagaDemo::refundPayment)
            .addStep("Schedule Shipment", OrchestrationSagaDemo::scheduleShipment, OrchestrationSagaDemo::cancelShipment);
    }

    public static void main(String[] args) {
        // Happy path
        System.out.println("========== HAPPY PATH ==========");
        Map<String, Object> ctx1 = new HashMap<>();
        newOrderSaga().execute(ctx1);
        System.out.printf("Final order status: %s%n", ctx1.get("orderStatus"));

        // Failure path
        System.out.println("\n========== FAILURE PATH (Payment Fails) ==========");
        Map<String, Object> ctx2 = new HashMap<>();
        ctx2.put("simulatePaymentFailure", true);
        newOrderSaga().execute(ctx2);
        System.out.printf("Final order status: %s%n", ctx2.get("orderStatus"));
    }
}
```

Choreography vs. Orchestration
Choreography has no central controller, so the overall flow is spread across every participant's event handlers. Orchestration concentrates the flow in one coordinator, which gives a single place to read the whole saga and becomes easier to reason about once a saga grows beyond three or four steps.
Handling the Isolation Problem
Since sagas lack isolation, concurrent sagas can interfere with each other. Several countermeasures exist, such as semantic locks and commutative updates. The example that follows shows a semantic lock.
Semantic Lock Example
```java
public class OrderWithSemanticLock {
    public enum Status {
        PENDING,          // Semantic lock: saga in progress
        APPROVAL_PENDING, // Semantic lock: awaiting approval
        CONFIRMED,        // Final state
        CANCELLED         // Final state
    }

    private final String orderId;
    private Status status;
    private int version;

    public OrderWithSemanticLock(String orderId) {
        this.orderId = orderId;
        this.status = Status.PENDING; // Lock acquired
        this.version = 0;
    }

    // Compare-and-set on the status: the transition only happens from the expected state
    public synchronized boolean tryTransition(Status expected, Status next) {
        if (this.status != expected) {
            System.out.printf(" Transition rejected: expected %s but was %s%n", expected, this.status);
            return false;
        }
        this.status = next;
        this.version++;
        System.out.printf(" Order %s: %s -> %s (v%d)%n", orderId, expected, next, version);
        return true;
    }

    // synchronized so readers see the status written by tryTransition
    public synchronized boolean isLocked() {
        return status == Status.PENDING || status == Status.APPROVAL_PENDING;
    }

    public static void main(String[] args) {
        OrderWithSemanticLock order = new OrderWithSemanticLock("ORD-100");
        System.out.println("Order locked (in saga): " + order.isLocked());
        // Saga step completes
        order.tryTransition(Status.PENDING, Status.APPROVAL_PENDING);
        System.out.println("Still locked: " + order.isLocked());
        // Final saga step: lock released
        order.tryTransition(Status.APPROVAL_PENDING, Status.CONFIRMED);
        System.out.println("Locked: " + order.isLocked());
    }
}
```

Saga State Machine
A robust orchestrator persists the saga's state after every step so that it can resume from the last recorded position after a crash.
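A minimal sketch of this idea follows. It is illustrative only: the `SagaRecord` shape, the `RUNNING`/`COMPLETED` states, and the in-memory `store` map (standing in for a durable database table) are all assumptions made for the example, and the crash is simulated with a `crashAfter` parameter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PersistentSagaDemo {

    enum SagaState { RUNNING, COMPLETED }

    // Snapshot written after every step; a real system would store this durably
    static final class SagaRecord {
        final SagaState state;
        final int nextStep; // index of the first step that has not completed yet

        SagaRecord(SagaState state, int nextStep) {
            this.state = state;
            this.nextStep = nextStep;
        }
    }

    // In-memory stand-in for a durable store, keyed by saga ID (assumption)
    static final Map<String, SagaRecord> store = new HashMap<>();

    // Runs steps from the persisted position and returns the steps executed by
    // this call. Stops early once crashAfter steps have run (simulated crash).
    static List<String> run(String sagaId, List<String> steps, int crashAfter) {
        SagaRecord record = store.getOrDefault(sagaId, new SagaRecord(SagaState.RUNNING, 0));
        List<String> executed = new ArrayList<>();
        for (int i = record.nextStep; i < steps.size(); i++) {
            if (executed.size() == crashAfter) {
                return executed; // orchestrator "crashes" here
            }
            executed.add(steps.get(i)); // stands in for invoking the participant
            // Persist progress. A crash between the line above and this write would
            // re-run the step on resume, which is why steps must be idempotent.
            store.put(sagaId, new SagaRecord(SagaState.RUNNING, i + 1));
        }
        store.put(sagaId, new SagaRecord(SagaState.COMPLETED, steps.size()));
        return executed;
    }

    public static void main(String[] args) {
        List<String> steps = List.of("createOrder", "reserveStock", "processPayment", "scheduleShipment");
        System.out.println(run("saga-1", steps, 2));                 // [createOrder, reserveStock]
        System.out.println(run("saga-1", steps, Integer.MAX_VALUE)); // [processPayment, scheduleShipment]
        System.out.println(store.get("saga-1").state);               // COMPLETED
    }
}
```

The second call does not repeat the first two steps because it starts from the persisted `nextStep` rather than from the beginning.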
AWS Integration Example
AWS Step Functions is a natural fit for the orchestration-based saga pattern. Each Task state in the state machine can invoke a Lambda function, and a Catch field on that state routes errors to the states that run the compensating actions.
Best Practices
- Design compensating transactions carefully: Compensations must be semantically correct undo operations, not necessarily database rollbacks. A payment compensation is a refund, not deleting the payment record.
- Make all steps idempotent: Network retries and at-least-once delivery mean steps may execute multiple times. Use idempotency keys to prevent duplicate side effects.
- Persist saga state: Store the saga's current step and context in a durable store (database or Step Functions) so it can resume after orchestrator crashes.
- Use correlation IDs: Every event and command in a saga should carry a unique saga ID (correlation ID) to trace and debug the entire transaction flow.
- Prefer orchestration for complex sagas: When you have more than three or four steps, choreography becomes difficult to reason about. Orchestration provides a single place to understand the complete flow.
- Implement timeouts on every step: Without timeouts, a saga can hang indefinitely if a participant service becomes unresponsive. Use deadlines to trigger compensation.
- Handle partial failures in compensation: Compensating transactions themselves can fail. Implement retry logic with exponential backoff for compensations, and have a dead letter queue for manual intervention.
- Apply semantic locks for isolation: Flag records that are part of an in-progress saga so concurrent sagas can detect and handle conflicts appropriately.
- Order steps to minimize compensation cost: Place the steps most likely to fail as early as possible, so that a failure leaves little completed work to compensate. Place steps that are expensive or impossible to compensate as late as possible, at or after the pivot transaction.
- Test failure scenarios extensively: Happy path testing is insufficient. Test every possible failure point and verify that compensations restore the system to a consistent state.
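Two of the practices above, idempotency keys and retrying compensations with exponential backoff before falling back to a dead letter queue, can be sketched together. Everything here is hypothetical and simplified: the `refund` method, the in-memory key set, and the `deadLetters` list stand in for a real payment call, a persistent key store, and a real queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

final class ResilientCompensationDemo {

    // Idempotency keys already processed; a real system would persist these
    private final Set<String> processedKeys = ConcurrentHashMap.newKeySet();
    // Stand-in for a dead letter queue that operators inspect
    final List<String> deadLetters = new ArrayList<>();
    int refundsIssued = 0;

    // Idempotent refund: a repeated key is acknowledged without a second refund
    boolean refund(String idempotencyKey) {
        if (!processedKeys.add(idempotencyKey)) {
            return true; // duplicate delivery, already handled
        }
        refundsIssued++;
        return true;
    }

    // Retries a compensation with exponential backoff; on exhaustion the
    // compensation is parked in the dead letter list for manual intervention
    boolean compensateWithRetry(String name, BooleanSupplier compensation,
                                int maxAttempts, long baseDelayMillis) {
        long delay = baseDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (compensation.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying when interrupted
                }
                delay *= 2; // the wait doubles after every failed attempt
            }
        }
        deadLetters.add(name);
        return false;
    }

    public static void main(String[] args) {
        ResilientCompensationDemo demo = new ResilientCompensationDemo();

        // The same compensation delivered twice results in one refund
        demo.compensateWithRetry("refund ORD-1", () -> demo.refund("ORD-1-refund"), 3, 10);
        demo.compensateWithRetry("refund ORD-1", () -> demo.refund("ORD-1-refund"), 3, 10);
        System.out.println("refunds issued: " + demo.refundsIssued); // refunds issued: 1

        // A compensation that never succeeds ends up in the dead letter list
        demo.compensateWithRetry("release stock ORD-2", () -> false, 3, 10);
        System.out.println("dead letters: " + demo.deadLetters); // dead letters: [release stock ORD-2]
    }
}
```

Retrying is only safe here because `refund` is idempotent; without the key check, each retry after an ambiguous failure could issue another refund.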
Related Concepts
- Eventual Consistency: Sagas are a mechanism for achieving eventual consistency across distributed services.
- Asynchronous Programming: Choreography-based sagas rely heavily on asynchronous event-driven communication.
- Serverless and Container Workloads: AWS Step Functions and Lambda provide managed infrastructure for orchestration-based sagas.