Circuit Breakers and Bulkheads

Both patterns contain a failure so that one unhealthy dependency cannot take down its callers:

Circuit Breaker Pattern:

  • Monitors calls to a dependency for failures
  • Trips (opens) when the failure threshold is reached
  • Fails fast while open, which prevents cascading failures
  • Allows periodic trial calls to detect recovery
  • Provides a hook for fallback mechanisms
  • Improves system stability
  • Enables graceful degradation

Circuit Breaker States:

┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│               │     │               │     │               │
│    Closed     │────▶│     Open      │────▶│  Half-Open    │
│  (Normal)     │     │  (Failing)    │     │  (Testing)    │
│               │     │               │     │               │
└───────────────┘     └───────────────┘     └───────────────┘
       ▲                                            │
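       └────────────────────────────────────────────┘

  • Closed → Open: the failure rate reaches the threshold
  • Open → Half-Open: the wait duration elapses
  • Half-Open → Closed: the trial calls succeed
  • Half-Open → Open: a trial call fails (this edge is not drawn above)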
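
The state transitions can be sketched without a library. The following is a minimal, single-threaded Python sketch, not how Resilience4j works internally: the names SimpleCircuitBreaker, CircuitOpenError, failure_threshold, and wait_seconds are illustrative assumptions, and it trips on consecutive failures rather than on a failure rate over a window.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    """Illustrative breaker that counts consecutive failures; not thread-safe."""

    def __init__(self, failure_threshold=3, wait_seconds=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures that trip the breaker
        self.wait_seconds = wait_seconds            # time to stay open before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.wait_seconds:
                raise CircuitOpenError("circuit is open; call rejected")
            self.state = "HALF_OPEN"  # wait elapsed: let one trial call through

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success closes the circuit and resets the failure count
        self.failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures in a row, opens the circuit
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat CircuitOpenError as the signal to use a fallback.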

Example Circuit Breaker Implementation (Java):

// Resilience4j Circuit Breaker example
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.vavr.control.Try;

import java.time.Duration;
import java.util.function.Supplier;

public class OrderService {
    private final PaymentService paymentService;
    private final CircuitBreaker circuitBreaker;
    
    public OrderService(PaymentService paymentService) {
        this.paymentService = paymentService;
        
        // Configure the circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                 // 50% failure rate to trip
            .waitDurationInOpenState(Duration.ofSeconds(10)) // Wait 10s before testing
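            // Resilience4j 1.x+ names; releases before 1.0 called these
            // ringBufferSizeInHalfOpenState and ringBufferSizeInClosedState
            .permittedNumberOfCallsInHalfOpenState(5) // Trial calls allowed in half-open state
            .slidingWindowSize(10)                    // Calls used to compute the failure rate in closed state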
            .automaticTransitionFromOpenToHalfOpenEnabled(true)
            .build();
        
        this.circuitBreaker = CircuitBreaker.of("paymentService", config);
    }
    
    public PaymentResult processPayment(Order order) {
        // Decorate the payment service call with circuit breaker
        Supplier<PaymentResult> decoratedSupplier = CircuitBreaker
            .decorateSupplier(circuitBreaker, () -> paymentService.processPayment(order));
        
        // Execute the call with fallback
        return Try.ofSupplier(decoratedSupplier)
            .recover(e -> fallbackPaymentMethod(order))
            .get();
    }
    
    private PaymentResult fallbackPaymentMethod(Order order) {
        // Fallback logic when payment service is unavailable
        return new PaymentResult(
            PaymentStatus.PENDING,
            "Payment queued for processing",
            order.getId()
        );
    }
}

Bulkhead Pattern:

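  • Isolates components so that a failure stays within its own compartment
  • Prevents one dependency from exhausting shared resources such as threads or connections
  • Limits concurrent calls per dependency
  • Improves fault tolerance
  • Enables partial availability: unaffected features keep working
  • Protects critical services by reserving capacity for them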
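
The core mechanism, a cap on concurrent calls per dependency, can be sketched with a semaphore. This is a minimal Python sketch under stated assumptions: SemaphoreBulkhead, BulkheadFullError, max_concurrent_calls, and max_wait_seconds are hypothetical names chosen for illustration and do not mirror any particular library.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when no permit becomes available within the wait time."""


class SemaphoreBulkhead:
    """Illustrative bulkhead: at most max_concurrent_calls run at once."""

    def __init__(self, max_concurrent_calls, max_wait_seconds=0.0):
        self._permits = threading.BoundedSemaphore(max_concurrent_calls)
        self._max_wait_seconds = max_wait_seconds

    def call(self, func, *args, **kwargs):
        # Wait up to max_wait_seconds for a permit; with 0, reject immediately
        if self._max_wait_seconds > 0:
            acquired = self._permits.acquire(timeout=self._max_wait_seconds)
        else:
            acquired = self._permits.acquire(blocking=False)
        if not acquired:
            raise BulkheadFullError("bulkhead is full; call rejected")
        try:
            return func(*args, **kwargs)
        finally:
            # Always return the permit, even if the call raised
            self._permits.release()
```

Giving each dependency its own instance means a slow dependency can only tie up its own permits; calls to other dependencies keep flowing.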

Example Bulkhead Implementation (Java):

// Resilience4j Bulkhead example
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.vavr.control.Try;

import java.time.Duration;
import java.util.function.Supplier;

public class ApiGateway {
    private final UserService userService;
    private final OrderService orderService;
    private final InventoryService inventoryService;
    
    private final Bulkhead userServiceBulkhead;
    private final Bulkhead orderServiceBulkhead;
    private final Bulkhead inventoryServiceBulkhead;
    
    public ApiGateway(
            UserService userService,
            OrderService orderService,
            InventoryService inventoryService) {
        this.userService = userService;
        this.orderService = orderService;
        this.inventoryService = inventoryService;
        
        // Configure bulkheads with different capacities based on criticality
        BulkheadConfig userConfig = BulkheadConfig.custom()
            .maxConcurrentCalls(20)
            .maxWaitDuration(Duration.ofMillis(500))
            .build();
        
        BulkheadConfig orderConfig = BulkheadConfig.custom()
            .maxConcurrentCalls(30)
            .maxWaitDuration(Duration.ofMillis(1000))
            .build();
        
        BulkheadConfig inventoryConfig = BulkheadConfig.custom()
            .maxConcurrentCalls(10)
            .maxWaitDuration(Duration.ofMillis(200))
            .build();
        
        this.userServiceBulkhead = Bulkhead.of("userService", userConfig);
        this.orderServiceBulkhead = Bulkhead.of("orderService", orderConfig);
        this.inventoryServiceBulkhead = Bulkhead.of("inventoryService", inventoryConfig);
    }
    
    public UserProfile getUserProfile(String userId) {
        Supplier<UserProfile> decoratedSupplier = Bulkhead
            .decorateSupplier(userServiceBulkhead, () -> userService.getProfile(userId));
        
        return Try.ofSupplier(decoratedSupplier)
            .recover(e -> new UserProfile(userId, "Unknown", "Guest"))
            .get();
    }
    
    public OrderDetails getOrderDetails(String orderId) {
        Supplier<OrderDetails> decoratedSupplier = Bulkhead
            .decorateSupplier(orderServiceBulkhead, () -> orderService.getDetails(orderId));
        
        return Try.ofSupplier(decoratedSupplier)
            .recover(e -> new OrderDetails(orderId, OrderStatus.UNKNOWN))
            .get();
    }
}

Retry and Backoff Strategies

Handling transient failures, such as brief network errors or timeouts, that often succeed on a later attempt:

Retry Pattern:

  • Automatically retry failed operations
  • Handle transient failures
  • Improve success probability
  • Implement retry limits
  • Use appropriate backoff strategies
  • Consider idempotency requirements
  • Monitor retry metrics

Backoff Strategies:

  • Constant backoff
  • Linear backoff
  • Exponential backoff
  • Exponential backoff with jitter
  • Decorrelated jitter
  • Random backoff
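
The strategies differ only in how the delay before the next attempt is computed. The functions below are a rough Python sketch of one common formulation of each; the function names, the millisecond units, and the default base and cap values are assumptions made for illustration. The decorrelated variant follows the widely cited form min(cap, uniform(base, previous * 3)).

```python
import random


def constant_backoff(attempt, base_ms=100):
    """Same delay before every attempt."""
    return base_ms


def linear_backoff(attempt, base_ms=100, cap_ms=30_000):
    """Delay grows by base_ms per attempt (attempt starts at 1)."""
    return min(base_ms * attempt, cap_ms)


def exponential_backoff(attempt, base_ms=100, cap_ms=30_000):
    """Delay doubles with each attempt, up to cap_ms."""
    return min(base_ms * 2 ** (attempt - 1), cap_ms)


def exponential_backoff_full_jitter(attempt, base_ms=100, cap_ms=30_000):
    """Random delay between 0 and the exponential delay."""
    return random.uniform(0, exponential_backoff(attempt, base_ms, cap_ms))


def decorrelated_jitter(previous_ms, base_ms=100, cap_ms=30_000):
    """Random delay derived from the previous delay rather than the attempt number."""
    return min(cap_ms, random.uniform(base_ms, previous_ms * 3))


def random_backoff(min_ms=100, max_ms=1_000):
    """Uniformly random delay with no growth."""
    return random.uniform(min_ms, max_ms)
```

With these defaults, exponential_backoff returns 100, 200, 400, and 800 ms for attempts 1 through 4. The jittered variants spread clients out over time, so many clients that failed together do not all retry at the same instant.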

Example Retry with Exponential Backoff (Python):

# Python retry with exponential backoff
import random
import time
from functools import wraps

import requests

def retry_with_exponential_backoff(
    max_retries=5,
    base_delay_ms=100,
    max_delay_ms=30000,
    jitter=True,
    retry_on=(ConnectionError, TimeoutError),
):
    """Retry decorator with exponential backoff.

    Only the exception types listed in retry_on trigger a retry.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except retry_on as e:
                    retries += 1
                    if retries > max_retries:
                        raise RuntimeError(f"Failed after {max_retries} retries") from e

                    # Calculate delay with exponential backoff, capped at max_delay_ms
                    delay_ms = min(base_delay_ms * (2 ** (retries - 1)), max_delay_ms)

                    # Add full jitter (random delay in [0, delay_ms]) to prevent
                    # thundering herd; the result never exceeds max_delay_ms
                    if jitter:
                        delay_ms = random.uniform(0, delay_ms)

                    print(f"Retry {retries}/{max_retries} after {delay_ms:.2f}ms")
                    time.sleep(delay_ms / 1000)
        return wrapper
    return decorator

# requests raises its own exception types, which do not subclass the built-in
# ConnectionError and TimeoutError, so they are passed explicitly.
@retry_with_exponential_backoff(
    retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout)
)
def fetch_data_from_api(url):
    """Fetch data from an API with retry capability."""
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # HTTP error statuses raise HTTPError and are not retried
    return response.json()

Retry Considerations:

  • Idempotency of operations
  • Retry budget and limits
  • Timeout configurations
  • Failure categorization
  • Retry storm prevention
  • Circuit breaker integration
  • Monitoring and alerting
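
A retry budget is one way to act on the "retry budget and limits" and "retry storm prevention" items: retries are allowed only while they remain a small fraction of overall traffic. The sketch below is a simplified illustration under stated assumptions. RetryBudget, ratio, min_retries, and the method names are hypothetical, and the counters never expire, whereas a production version would typically track a sliding time window.

```python
class RetryBudget:
    """Illustrative budget: retries may not exceed min_retries + ratio * requests."""

    def __init__(self, ratio=0.1, min_retries=3):
        self.ratio = ratio              # retries allowed per original request
        self.min_retries = min_retries  # floor so low-traffic clients can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_acquire_retry(self):
        """Return True and consume budget if a retry is currently allowed."""
        allowed = self.min_retries + self.ratio * self.requests
        if self.retries < allowed:
            self.retries += 1
            return True
        return False
```

A retry loop would call record_request once per operation and check try_acquire_retry before each retry; when the budget is exhausted, the operation fails immediately instead of adding load to a dependency that is already struggling.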