Resilience Patterns

Introduction

Resilience patterns are architectural strategies that enable distributed systems to gracefully handle failures, maintain partial availability, and recover automatically from transient or permanent faults. In modern microservice architectures — where a single user request may traverse dozens of network boundaries — a failure in one component can cascade and bring down an entire platform. Understanding and implementing resilience patterns is essential for building production-grade systems that meet SLA requirements and deliver a reliable user experience.

Core Concepts

What Makes a System Resilient?

A resilient system doesn't avoid failures — it expects them. The fundamental shift is from failure prevention to failure management. Resilience is achieved through a combination of patterns that address different failure modes:

  • Transient failures: Temporary network blips, brief service unavailability
  • Overload failures: Services overwhelmed by traffic spikes
  • Dependency failures: Downstream services crashing or degrading
  • Cascading failures: One failure propagating across the entire system

The Five Pillars of Resilience

Retry Pattern

The retry pattern re-attempts a failed operation on the assumption that the failure is transient. The key is to use exponential backoff with jitter to avoid thundering herd problems where all clients retry simultaneously.

Retry Strategy Comparison

  • Fixed interval: Simple, but clients that failed together retry in lockstep and can overwhelm a recovering service
  • Exponential backoff: Spreads retries out over time, but synchronized clients still retry in waves
  • Exponential backoff with jitter: Randomizes retry timing, breaking up synchronized retry waves

Implementation: Retry with Exponential Backoff and Jitter

java
import java.time.Duration;
import java.util.Random;
import java.util.function.Supplier;

public class RetryPolicy<T> {

    private final int maxRetries;
    private final Duration baseDelay;
    private final Duration maxDelay;
    private final Random random = new Random();

    public RetryPolicy(int maxRetries, Duration baseDelay, Duration maxDelay) {
        this.maxRetries = maxRetries;
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
    }

    public T execute(Supplier<T> operation) throws Exception {
        Exception lastException = null;

        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return operation.get();
            } catch (Exception e) {
                lastException = e;
                if (attempt == maxRetries) {
                    break;
                }

                long delayMs = calculateDelay(attempt);
                System.out.printf("Attempt %d failed: %s. Retrying in %dms...%n",
                        attempt + 1, e.getMessage(), delayMs);

                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Retry interrupted", ie);
                }
            }
        }

        throw new RuntimeException(
                "Operation failed after " + (maxRetries + 1) + " attempts", lastException);
    }

    private long calculateDelay(int attempt) {
        // Exponential backoff: baseDelay * 2^attempt
        long exponentialDelay = baseDelay.toMillis() * (1L << attempt);
        // Cap at maxDelay
        long cappedDelay = Math.min(exponentialDelay, maxDelay.toMillis());
        // Add jitter: random value between 0 and cappedDelay
        return (long) (cappedDelay * random.nextDouble());
    }

    public static void main(String[] args) throws Exception {
        RetryPolicy<String> retry = new RetryPolicy<>(3, Duration.ofMillis(500), Duration.ofSeconds(10));

        // Simulating a flaky service
        final int[] callCount = {0};

        String result = retry.execute(() -> {
            callCount[0]++;
            if (callCount[0] < 3) {
                throw new RuntimeException("Service temporarily unavailable");
            }
            return "Success on attempt " + callCount[0];
        });

        System.out.println("Result: " + result);
    }
}

Circuit Breaker Pattern

The circuit breaker pattern prevents a system from repeatedly calling a service that is likely to fail. Like an electrical circuit breaker, it "trips" when failures exceed a threshold, fast-failing subsequent requests without consuming resources.

Circuit Breaker State Machine

  • CLOSED: Requests flow normally and failures are counted. When failures reach the threshold, the circuit trips to OPEN
  • OPEN: Requests fail fast (or use the fallback) without calling the service. After a cool-down period, the circuit moves to HALF_OPEN
  • HALF_OPEN: A probe request is allowed through. Success closes the circuit; failure reopens it

Implementation: Circuit Breaker

java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class CircuitBreaker<T> {

    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openStateDuration;
    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    private final AtomicInteger failureCount = new AtomicInteger(0);
    private final AtomicInteger successCount = new AtomicInteger(0);
    private volatile Instant openedAt;

    public CircuitBreaker(int failureThreshold, Duration openStateDuration) {
        this.failureThreshold = failureThreshold;
        this.openStateDuration = openStateDuration;
    }

    public T execute(Supplier<T> operation, Supplier<T> fallback) {
        State currentState = getEffectiveState();

        switch (currentState) {
            case OPEN:
                System.out.println("[CIRCUIT BREAKER] OPEN — fast failing");
                return fallback.get();

            case HALF_OPEN:
                System.out.println("[CIRCUIT BREAKER] HALF_OPEN — allowing probe");
                return executeProbe(operation, fallback);

            case CLOSED:
            default:
                return executeNormally(operation, fallback);
        }
    }

    private State getEffectiveState() {
        if (state.get() == State.OPEN && openedAt != null) {
            if (Instant.now().isAfter(openedAt.plus(openStateDuration))) {
                state.compareAndSet(State.OPEN, State.HALF_OPEN);
            }
        }
        return state.get();
    }

    private T executeNormally(Supplier<T> operation, Supplier<T> fallback) {
        try {
            T result = operation.get();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            System.out.printf("[CIRCUIT BREAKER] Failure %d/%d: %s%n",
                    failureCount.get(), failureThreshold, e.getMessage());
            return fallback.get();
        }
    }

    private T executeProbe(Supplier<T> operation, Supplier<T> fallback) {
        try {
            T result = operation.get();
            reset();
            System.out.println("[CIRCUIT BREAKER] Probe succeeded — closing circuit");
            return result;
        } catch (Exception e) {
            tripBreaker();
            System.out.println("[CIRCUIT BREAKER] Probe failed — reopening circuit");
            return fallback.get();
        }
    }

    private void onSuccess() {
        failureCount.set(0);
        successCount.incrementAndGet();
    }

    private void onFailure() {
        int failures = failureCount.incrementAndGet();
        if (failures >= failureThreshold) {
            tripBreaker();
        }
    }

    private void tripBreaker() {
        state.set(State.OPEN);
        openedAt = Instant.now();
        System.out.println("[CIRCUIT BREAKER] Circuit OPENED");
    }

    private void reset() {
        state.set(State.CLOSED);
        failureCount.set(0);
        successCount.set(0);
    }

    public State getState() {
        return getEffectiveState();
    }

    public static void main(String[] args) throws InterruptedException {
        CircuitBreaker<String> breaker = new CircuitBreaker<>(3, Duration.ofSeconds(5));

        // Simulate failures to trip the breaker
        for (int i = 0; i < 5; i++) {
            String result = breaker.execute(
                () -> { throw new RuntimeException("Service down"); },
                () -> "[Fallback] Cached result"
            );
            System.out.println("Got: " + result + " | State: " + breaker.getState());
        }

        // Wait for timeout, then circuit should move to HALF_OPEN
        System.out.println("\nWaiting for open state timeout...");
        Thread.sleep(6000);

        // Probe with a successful call
        String recovered = breaker.execute(
            () -> "Live service response",
            () -> "[Fallback]"
        );
        System.out.println("Got: " + recovered + " | State: " + breaker.getState());
    }
}

Bulkhead Pattern

The bulkhead pattern isolates different parts of a system so that a failure in one area doesn't exhaust the resources of another. Named after the watertight compartments in a ship's hull, this pattern uses separate thread pools or semaphores to contain failures.

Implementation: Semaphore-Based Bulkhead

java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class Bulkhead<T> {

    private final String name;
    private final int maxConcurrent;
    private final Semaphore semaphore;
    private final long timeoutMs;

    public Bulkhead(String name, int maxConcurrent, long timeoutMs) {
        this.name = name;
        this.maxConcurrent = maxConcurrent;
        this.semaphore = new Semaphore(maxConcurrent);
        this.timeoutMs = timeoutMs;
    }

    public T execute(Supplier<T> operation, Supplier<T> fallback) {
        boolean acquired = false;
        try {
            acquired = semaphore.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
            if (!acquired) {
                System.out.printf("[BULKHEAD:%s] Rejected — no permits available (%d waiting)%n",
                        name, semaphore.getQueueLength());
                return fallback.get();
            }
            System.out.printf("[BULKHEAD:%s] Executing — %d/%d permits in use%n",
                    name, maxConcurrent - semaphore.availablePermits(), maxConcurrent);
            return operation.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        } catch (Exception e) {
            System.out.printf("[BULKHEAD:%s] Operation failed: %s%n", name, e.getMessage());
            return fallback.get();
        } finally {
            if (acquired) {
                semaphore.release();
            }
        }
    }

    public int availablePermits() {
        return semaphore.availablePermits();
    }

    public static void main(String[] args) throws InterruptedException {
        Bulkhead<String> paymentBulkhead = new Bulkhead<>("PaymentService", 3, 1000);
        Bulkhead<String> inventoryBulkhead = new Bulkhead<>("InventoryService", 5, 500);

        // Simulate concurrent requests
        Runnable task = () -> {
            String result = paymentBulkhead.execute(
                () -> {
                    try { Thread.sleep(2000); } catch (InterruptedException e) { }
                    return "Payment processed";
                },
                () -> "Payment queued for retry"
            );
            System.out.println(Thread.currentThread().getName() + ": " + result);
        };

        // Launch 6 concurrent requests (only 3 can execute at once)
        for (int i = 0; i < 6; i++) {
            new Thread(task, "Request-" + i).start();
        }

        Thread.sleep(5000);
        System.out.println("Remaining permits: " + paymentBulkhead.availablePermits());
    }
}

Timeout Pattern

Timeouts define the maximum duration a caller is willing to wait for a response. Without timeouts, threads can be held indefinitely by unresponsive services, eventually starving the entire system.

Implementation: Timeout Wrapper with CompletableFuture

java
import java.util.concurrent.*;
import java.util.function.Supplier;

public class TimeoutPolicy<T> {

    private final long timeoutMs;
    private final ExecutorService executor;

    public TimeoutPolicy(long timeoutMs) {
        this.timeoutMs = timeoutMs;
        this.executor = Executors.newCachedThreadPool();
    }

    public T execute(Supplier<T> operation, Supplier<T> fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(operation, executor);

        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Note: CompletableFuture.cancel() marks the future cancelled but does
            // not interrupt the thread already running the task; the operation
            // should honor interruption or enforce its own internal deadline.
            future.cancel(true);
            System.out.printf("[TIMEOUT] Operation exceeded %dms — using fallback%n", timeoutMs);
            return fallback.get();
        } catch (ExecutionException e) {
            System.out.printf("[TIMEOUT] Operation failed: %s%n", e.getCause().getMessage());
            return fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        }
    }

    public void shutdown() {
        executor.shutdown();
    }

    public static void main(String[] args) {
        TimeoutPolicy<String> timeout = new TimeoutPolicy<>(2000);

        // Fast operation — succeeds
        String fast = timeout.execute(
            () -> { sleep(500); return "Fast result"; },
            () -> "Fallback"
        );
        System.out.println("Fast: " + fast);

        // Slow operation — times out
        String slow = timeout.execute(
            () -> { sleep(5000); return "Slow result"; },
            () -> "Timeout fallback"
        );
        System.out.println("Slow: " + slow);

        timeout.shutdown();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

Fallback Pattern

The fallback pattern provides an alternative response when the primary operation fails. Fallbacks can return cached data, default values, or responses from secondary services.
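These alternatives are often tried in order: live service, then cache, then a static default. A minimal sketch of that idea follows — the `FallbackChain` class and its method names are illustrative, not from any particular library:

java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// A hypothetical fallback chain: tries each supplier in order and
// returns the first result produced without an exception.
public class FallbackChain<T> {

    private final List<Supplier<T>> suppliers = new ArrayList<>();

    public FallbackChain<T> then(Supplier<T> supplier) {
        suppliers.add(supplier);
        return this;
    }

    public T get() {
        RuntimeException last = null;
        for (Supplier<T> supplier : suppliers) {
            try {
                return supplier.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure, fall through to the next source
            }
        }
        throw last != null ? last : new IllegalStateException("No suppliers configured");
    }

    public static void main(String[] args) {
        FallbackChain<String> chain = new FallbackChain<String>()
                .then(() -> { throw new RuntimeException("primary service down"); })
                .then(() -> { throw new RuntimeException("cache miss"); })
                .then(() -> "default recommendation list");

        // Both earlier sources fail, so the static default is returned
        System.out.println(chain.get());
    }
}

Ordering the suppliers from freshest to most degraded keeps the user-visible quality as high as the current failure allows.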

Composing Patterns: Resilience Pipeline

Real-world systems combine multiple resilience patterns into a pipeline. The order matters — a common layering, from the outermost guard inward, is Bulkhead → Circuit Breaker → Retry → Timeout, with a Fallback available at each layer.

Implementation: Composable Resilience Pipeline

java
import java.time.Duration;
import java.util.function.Supplier;

public class ResiliencePipeline<T> {

    private final RetryPolicy<T> retryPolicy;
    private final CircuitBreaker<T> circuitBreaker;
    private final Bulkhead<T> bulkhead;
    private final TimeoutPolicy<T> timeoutPolicy;
    private final Supplier<T> fallback;

    private ResiliencePipeline(Builder<T> builder) {
        this.retryPolicy = builder.retryPolicy;
        this.circuitBreaker = builder.circuitBreaker;
        this.bulkhead = builder.bulkhead;
        this.timeoutPolicy = builder.timeoutPolicy;
        this.fallback = builder.fallback;
    }

    public T execute(Supplier<T> operation) {
        // Layer 1: Bulkhead
        return bulkhead.execute(
            () -> {
                // Layer 2: Circuit Breaker
                return circuitBreaker.execute(
                    () -> {
                        // Layer 3: Retry
                        try {
                            return retryPolicy.execute(
                                // Layer 4: Timeout — the inner fallback rethrows so a
                                // timeout surfaces as a failure the retry layer can act on
                                () -> timeoutPolicy.execute(operation, () -> {
                                    throw new RuntimeException("Operation timed out");
                                })
                            );
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    },
                    fallback
                );
            },
            fallback
        );
    }

    public static <T> Builder<T> builder() {
        return new Builder<>();
    }

    public static class Builder<T> {
        private RetryPolicy<T> retryPolicy;
        private CircuitBreaker<T> circuitBreaker;
        private Bulkhead<T> bulkhead;
        private TimeoutPolicy<T> timeoutPolicy;
        private Supplier<T> fallback = () -> null;

        public Builder<T> retry(int maxRetries, Duration baseDelay) {
            this.retryPolicy = new RetryPolicy<>(maxRetries, baseDelay, baseDelay.multipliedBy(16));
            return this;
        }

        public Builder<T> circuitBreaker(int failureThreshold, Duration openDuration) {
            this.circuitBreaker = new CircuitBreaker<>(failureThreshold, openDuration);
            return this;
        }

        public Builder<T> bulkhead(String name, int maxConcurrent) {
            this.bulkhead = new Bulkhead<>(name, maxConcurrent, 5000);
            return this;
        }

        public Builder<T> timeout(Duration timeout) {
            this.timeoutPolicy = new TimeoutPolicy<>(timeout.toMillis());
            return this;
        }

        public Builder<T> fallback(Supplier<T> fallback) {
            this.fallback = fallback;
            return this;
        }

        public ResiliencePipeline<T> build() {
            return new ResiliencePipeline<>(this);
        }
    }

    public static void main(String[] args) {
        ResiliencePipeline<String> pipeline = ResiliencePipeline.<String>builder()
                .bulkhead("OrderService", 10)
                .circuitBreaker(5, Duration.ofSeconds(30))
                .retry(3, Duration.ofMillis(200))
                .timeout(Duration.ofSeconds(3))
                .fallback(() -> "Order queued for async processing")
                .build();

        // Execute with full resilience
        String result = pipeline.execute(() -> {
            // Simulate calling a remote order service
            System.out.println("Calling order service...");
            return "Order #12345 confirmed";
        });

        System.out.println("Final result: " + result);
    }
}

Rate Limiting

Rate limiting protects services from being overwhelmed by too many requests. It is both a resilience pattern (protecting the server) and a fairness mechanism (preventing one client from starving others).

Implementation: Token Bucket Rate Limiter

java
import java.time.Instant;

public class TokenBucketRateLimiter {

    private final long capacity;
    private final double refillRatePerSecond;
    private long tokens;
    private Instant lastRefill;

    public TokenBucketRateLimiter(long capacity, double refillRatePerSecond) {
        this.capacity = capacity;
        this.refillRatePerSecond = refillRatePerSecond;
        this.tokens = capacity;
        this.lastRefill = Instant.now();
    }

    // All mutable state is guarded by the intrinsic lock, so plain fields suffice.
    public synchronized boolean tryAcquire() {
        refill();
        if (tokens > 0) {
            tokens--;
            return true;
        }
        return false;
    }

    private void refill() {
        Instant now = Instant.now();
        double elapsedSeconds = (now.toEpochMilli() - lastRefill.toEpochMilli()) / 1000.0;
        long newTokens = (long) (elapsedSeconds * refillRatePerSecond);

        if (newTokens > 0) {
            tokens = Math.min(capacity, tokens + newTokens);
            lastRefill = now;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // 5 capacity, refills 2 tokens/second
        TokenBucketRateLimiter limiter = new TokenBucketRateLimiter(5, 2.0);

        // Burst of 8 requests
        for (int i = 1; i <= 8; i++) {
            boolean allowed = limiter.tryAcquire();
            System.out.printf("Request %d: %s%n", i, allowed ? "✅ ALLOWED" : "❌ REJECTED");
        }

        // Wait for refill
        System.out.println("\nWaiting 2 seconds for refill...");
        Thread.sleep(2000);

        for (int i = 9; i <= 12; i++) {
            boolean allowed = limiter.tryAcquire();
            System.out.printf("Request %d: %s%n", i, allowed ? "✅ ALLOWED" : "❌ REJECTED");
        }
    }
}

Pattern Decision Matrix

Choosing the right pattern depends on the failure mode you're defending against:

  • Transient failures: Retry with exponential backoff and jitter
  • Overload failures: Rate limiting and bulkheads
  • Dependency failures: Timeouts, circuit breakers, and fallbacks
  • Cascading failures: Circuit breakers and bulkheads to contain the blast radius

Monitoring Resilience Patterns

Patterns are only effective when you can observe their behavior in production. Key metrics to track:

  • Retry counts and retry success rates per operation
  • Circuit breaker state transitions and time spent open
  • Bulkhead rejections and permit utilization
  • Timeout rates alongside latency percentiles
  • Rate limiter rejections per client

Best Practices

  1. Always set timeouts: Every external call must have a timeout. An absent timeout is an unbounded liability that can lock resources indefinitely.

  2. Use exponential backoff with jitter for retries: Fixed-interval retries create thundering herds. Jitter spreads retries out randomly over time, reducing contention on a recovering service.

  3. Only retry idempotent operations: Retrying a non-idempotent POST can create duplicate orders. Ensure operations are safe to repeat or use idempotency keys.

  4. Tune circuit breaker thresholds based on SLAs: A threshold too low causes false trips; too high allows cascading failures. Base thresholds on historical error rates.

  5. Size bulkheads based on capacity planning: Each bulkhead partition needs enough resources to handle its expected load, plus headroom for bursts.

  6. Implement fallbacks with business context: A fallback returning empty data might be worse than an error. Design fallbacks that preserve the user's intent — cache recent data, queue for later, or show meaningful degraded state.

  7. Monitor pattern activations as signals: A spike in circuit breaker trips or retry rates is an early warning. Alert on pattern activation rates, not just downstream errors.

  8. Test resilience with chaos engineering: Inject failures in staging and production to verify that patterns work as designed. Tools like Chaos Monkey validate real-world behavior.

  9. Compose patterns in the correct order: The standard layering is Bulkhead → Circuit Breaker → Retry → Timeout. Incorrect ordering can negate the benefits.

  10. Avoid retry amplification: If Service A retries 3× calling Service B which retries 3× calling Service C, one failure produces 9 calls. Set a retry budget across the call chain.
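The retry budget from practice 10 can be made concrete with a counter that caps retries at a fraction of total request volume. This is a minimal sketch — the `RetryBudget` name and the ratio rule are illustrative, loosely modeled on budgets found in service meshes, not a specific library API:

java
import java.util.concurrent.atomic.AtomicLong;

// A hypothetical retry budget: a retry is allowed only while total retries
// stay below a fixed fraction of the requests observed so far.
public class RetryBudget {

    private final double maxRetryRatio;            // e.g. 0.1 = retries add at most 10% load
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    public RetryBudget(double maxRetryRatio) {
        this.maxRetryRatio = maxRetryRatio;
    }

    public void recordRequest() {
        requests.incrementAndGet();
    }

    // Returns true if a retry fits in the budget; callers must skip the retry otherwise.
    public boolean tryRetry() {
        long reqs = requests.get();
        long used = retries.get();
        if (reqs == 0 || (double) (used + 1) / reqs > maxRetryRatio) {
            return false;
        }
        retries.incrementAndGet();
        return true;
    }

    public static void main(String[] args) {
        RetryBudget budget = new RetryBudget(0.1); // retries capped at 10% of requests
        for (int i = 0; i < 100; i++) {
            budget.recordRequest();
        }
        int allowed = 0;
        for (int i = 0; i < 20; i++) {
            if (budget.tryRetry()) allowed++;
        }
        System.out.println("Retries allowed: " + allowed); // prints "Retries allowed: 10"
    }
}

Sharing one budget per downstream dependency (rather than one per call site) is what prevents the 3× × 3× amplification described above: however many layers retry, the extra load stays bounded.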