Circuit Breakers and Resilience

When external services fail, they often fail spectacularly. Without protection, one failing API can bring down your entire async application as requests pile up waiting for timeouts. Circuit breakers work like their electrical namesakes: they trip when things go wrong, protecting your system from cascading failures, and they can be reset once the fault clears.

Basic Circuit Breaker

Build a simple but effective circuit breaker:

import asyncio
import time
from enum import Enum
from typing import Callable, Any

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Blocking requests
    HALF_OPEN = "half_open"  # Testing if service recovered

A circuit breaker has three states: closed (normal), open (blocking requests), and half-open (testing recovery). The state transitions based on success/failure patterns.
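To make the transitions concrete, here is a minimal sketch of them as a lookup table. The event names (`threshold_reached`, `timeout_elapsed`, `success`, `failure`) are hypothetical labels chosen for illustration, and the enum is repeated so the snippet runs on its own.

```python
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

# Hypothetical event names that label the conditions described in the text
TRANSITIONS = {
    (CircuitState.CLOSED, "threshold_reached"): CircuitState.OPEN,
    (CircuitState.OPEN, "timeout_elapsed"): CircuitState.HALF_OPEN,
    (CircuitState.HALF_OPEN, "success"): CircuitState.CLOSED,
    (CircuitState.HALF_OPEN, "failure"): CircuitState.OPEN,
}

def next_state(state: CircuitState, event: str) -> CircuitState:
    """Return the next state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)

print(next_state(CircuitState.OPEN, "timeout_elapsed"))
```

Any pair that is not in the table, such as a success while closed, leaves the state unchanged.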

Here’s the core circuit breaker logic:

class CircuitBreaker:
    def __init__(self, 
                 failure_threshold: int = 5,
                 timeout: float = 60.0,
                 expected_exception: type = Exception):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.expected_exception = expected_exception
        
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

The circuit breaker counts consecutive failures and opens as soon as the count reaches the threshold. Once the timeout period has passed, the next call moves it to the half-open state to test whether the service has recovered.

The main execution method handles state transitions:

    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """Execute function with circuit breaker protection"""
        
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                # Timeout elapsed: let this call through as a trial
                self.state = CircuitState.HALF_OPEN
            else:
                # Fail fast without touching the failing service
                raise RuntimeError("Circuit breaker is OPEN")
        
        try:
            result = await func(*args, **kwargs)
            self._on_success()
            return result
            
        except self.expected_exception:
            self._on_failure()
            raise  # Propagate the original error to the caller

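When the circuit is open, calls fail immediately until the timeout has elapsed. The next call after that moves the breaker to half-open and is let through as a trial. Note that this simple version does not limit the number of concurrent trial calls, so several coroutines may reach the service while the breaker is half-open.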

The helper methods manage state transitions:

    def _should_attempt_reset(self) -> bool:
        """Check if enough time has passed to attempt reset"""
        return (time.monotonic() - self.last_failure_time) >= self.timeout
    
    def _on_success(self):
        """Handle successful call"""
        self.failure_count = 0
        self.state = CircuitState.CLOSED
    
    def _on_failure(self):
        """Handle failed call"""
        self.failure_count += 1
        # Monotonic clock: unaffected by system clock adjustments
        self.last_failure_time = time.monotonic()
        
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

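# Usage
import random
circuit_breaker = CircuitBreaker(failure_threshold=3, timeout=30.0)

async def unreliable_service():
    """Simulate an unreliable external service"""
    if random.random() < 0.7:  # 70% failure rate
        raise Exception("Service unavailable")
    return "Success"

async def main():
    for i in range(10):
        try:
            print(i, await circuit_breaker.call(unreliable_service))
        except Exception as e:
            print(i, "failed:", e)

asyncio.run(main())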

Advanced Circuit Breaker with Metrics

Add monitoring and metrics:

import asyncio
import time
from dataclasses import dataclass

@dataclass
class CircuitBreakerMetrics:
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0

class AdvancedCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED
        self.metrics = CircuitBreakerMetrics()
    
    async def call(self, func, *args, **kwargs):
        """Execute function with advanced circuit breaker protection"""
        self.metrics.total_requests += 1
        
        if self.state == CircuitState.OPEN:
            # Monotonic clock: unaffected by system clock adjustments
            if (time.monotonic() - self.last_failure_time) >= self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                # Fail fast without touching the failing service
                raise RuntimeError("Circuit breaker is OPEN")
        
        try:
            result = await func(*args, **kwargs)
            self.metrics.successful_requests += 1
            self.failure_count = 0
            self.state = CircuitState.CLOSED
            return result
            
        except Exception:
            self.metrics.failed_requests += 1
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            
            raise  # Propagate the original error to the caller
    
    def get_metrics(self):
        """Get circuit breaker metrics"""
        return {
            "state": self.state.value,
            "total_requests": self.metrics.total_requests,
            "success_rate": (
                self.metrics.successful_requests / self.metrics.total_requests
                if self.metrics.total_requests > 0 else 0
            )
        }
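One thing to watch: `call` increments `total_requests` before the open-circuit check, so calls rejected by an open breaker pull the success rate down even though they never reached the service. A possible refinement, sketched here with a hypothetical `rejected_requests` counter and a stand-alone `BreakerMetrics` class, is to measure the rate against attempted calls only.

```python
from dataclasses import dataclass

@dataclass
class BreakerMetrics:
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    rejected_requests: int = 0  # Hypothetical: calls blocked while OPEN

    @property
    def attempted_requests(self) -> int:
        """Calls that actually reached the protected service."""
        return self.total_requests - self.rejected_requests

    @property
    def success_rate(self) -> float:
        """Share of attempted calls that succeeded (0.0 if none attempted)."""
        attempted = self.attempted_requests
        return self.successful_requests / attempted if attempted else 0.0

metrics = BreakerMetrics(total_requests=10, successful_requests=4,
                         failed_requests=4, rejected_requests=2)
print(metrics.success_rate)  # 0.5 (4 successes out of 8 attempted calls)
```

Whether rejected calls should count against the success rate depends on what you want the number to tell you, so treat this as one option rather than the correct definition.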

Retry with Exponential Backoff

Combine circuit breakers with retry logic:

import asyncio
import random

class RetryableCircuitBreaker:
    def __init__(self, 
                 max_retries: int = 3,
                 base_delay: float = 1.0,
                 failure_threshold: int = 5):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.circuit_breaker = AdvancedCircuitBreaker(failure_threshold=failure_threshold)
    
    async def call_with_retry(self, func, *args, **kwargs):
        """Execute function with retry and circuit breaker protection"""
        
        for attempt in range(self.max_retries + 1):
            try:
                return await self.circuit_breaker.call(func, *args, **kwargs)
                
            except Exception:
                # Give up after the last attempt, or right away if the breaker
                # is open: it would reject retries until its timeout elapses
                breaker_open = self.circuit_breaker.state == CircuitState.OPEN
                if attempt == self.max_retries or breaker_open:
                    raise
                
                # Exponential backoff with jitter
                delay = self.base_delay * (2 ** attempt)
                jitter = random.uniform(0, delay * 0.1)
                
                await asyncio.sleep(delay + jitter)

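# Usage
retryable_breaker = RetryableCircuitBreaker(max_retries=3, base_delay=1.0)

async def flaky_service():
    if random.random() < 0.5:
        raise Exception("Temporary failure")
    return "Success"

async def main():
    try:
        print(await retryable_breaker.call_with_retry(flaky_service))
    except Exception as e:
        print(f"Gave up: {e}")

asyncio.run(main())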
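To see what the backoff arithmetic produces, here is a small sketch that computes the sleep schedule on its own. The helper name `backoff_delays` and the `jitter_fraction` parameter are made up for this illustration; the formula mirrors the one in `call_with_retry`.

```python
import random

def backoff_delays(max_retries: int, base_delay: float,
                   jitter_fraction: float = 0.1) -> list:
    """Return the sleep duration before each retry (one entry per retry)."""
    delays = []
    for attempt in range(max_retries):
        delay = base_delay * (2 ** attempt)
        jitter = random.uniform(0, delay * jitter_fraction)
        delays.append(delay + jitter)
    return delays

# With jitter disabled, 3 retries and a 1-second base give 1s, 2s, 4s
print(backoff_delays(3, 1.0, jitter_fraction=0.0))  # [1.0, 2.0, 4.0]
```

With the default 10% jitter, each delay lands somewhere between the base value and 1.1 times that value, which keeps many clients from retrying at the same instant.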

Making Circuit Breakers Work in Real Systems

Lessons learned from implementing resilience patterns:

Circuit Breaker Design:

  • Set appropriate failure thresholds based on service characteristics
  • Use different timeouts for different services
  • Monitor circuit breaker state and metrics
  • Implement graceful degradation when circuits are open
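Graceful degradation can be as simple as serving a fallback value when the breaker rejects a call. The sketch below assumes the breaker raises a dedicated exception; `CircuitOpenError`, `call_with_fallback`, `live_prices`, and `cached_prices` are all hypothetical names used for illustration.

```python
import asyncio

class CircuitOpenError(Exception):
    """Hypothetical error a breaker could raise when it rejects a call."""

async def call_with_fallback(primary, fallback):
    """Await primary(); if the circuit is open, return fallback() instead."""
    try:
        return await primary()
    except CircuitOpenError:
        return fallback()

async def live_prices():
    # Stand-in for a breaker-protected call while the circuit is open
    raise CircuitOpenError("Circuit breaker is OPEN")

def cached_prices():
    # Stale but usable data served while the dependency is down
    return {"source": "cache", "prices": []}

result = asyncio.run(call_with_fallback(live_prices, cached_prices))
print(result["source"])
```

Only the open-circuit error triggers the fallback here. Other exceptions still propagate, so genuine bugs are not hidden behind cached data.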

Retry Strategy:

  • Use exponential backoff with jitter
  • Set maximum retry limits
  • Don’t retry on certain error types (4xx HTTP errors)
  • Combine with circuit breakers for better protection
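Here is a rough sketch of the "don't retry 4xx" rule. `HTTPStatusError`, `is_retryable`, and `retry` are hypothetical names for this example; a real HTTP client will have its own error types. The sketch treats every 4xx status as permanent, although in practice some of them, such as 429 Too Many Requests, are often retried.

```python
import asyncio

class HTTPStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def is_retryable(exc: Exception) -> bool:
    """Treat 4xx responses as permanent; retry everything else."""
    if isinstance(exc, HTTPStatusError):
        return not (400 <= exc.status < 500)
    return True

async def retry(func, max_retries: int = 3, base_delay: float = 0.01):
    """Retry func with exponential backoff, skipping non-retryable errors."""
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except Exception as exc:
            # Stop on the last attempt or when a retry cannot help
            if attempt == max_retries or not is_retryable(exc):
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))
```

A 404 raised by `func` surfaces after a single attempt, while a 503 is retried until `max_retries` is exhausted.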

Resource Isolation:

  • Use bulkheads to isolate different resource types
  • Set appropriate concurrency limits
  • Monitor resource utilization
  • Use separate thread pools (executors) for different kinds of blocking work
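In async code, a bulkhead is often just a semaphore per dependency. The following is a minimal sketch; the `Bulkhead` class and the `slow_query` stand-in are hypothetical, and the `peak` counter exists only to show that the limit holds.

```python
import asyncio

class Bulkhead:
    """Caps the number of concurrent calls to one dependency."""
    def __init__(self, max_concurrent: int):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.active = 0
        self.peak = 0  # Highest concurrency observed so far

    async def call(self, func, *args, **kwargs):
        async with self._semaphore:  # Waits here while the bulkhead is full
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await func(*args, **kwargs)
            finally:
                self.active -= 1

async def slow_query(i: int) -> int:
    await asyncio.sleep(0.01)  # Stand-in for a database round trip
    return i

async def main():
    db_bulkhead = Bulkhead(max_concurrent=2)
    calls = (db_bulkhead.call(slow_query, i) for i in range(6))
    results = await asyncio.gather(*calls)
    return results, db_bulkhead.peak

results, peak = asyncio.run(main())
print(results, peak)
```

Giving each dependency its own `Bulkhead` means a slow database cannot use up the concurrency budget that calls to other services rely on.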

Summary

Resilience pattern essentials:

  • Implement circuit breakers to prevent cascading failures
  • Use retry with exponential backoff for transient failures
  • Apply bulkhead pattern to isolate resources
  • Set appropriate timeouts for all operations
  • Monitor metrics and adjust thresholds based on behavior
  • Combine patterns for comprehensive resilience
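For the timeout point, `asyncio.wait_for` is usually enough. This sketch wraps a call so that a slow dependency fails fast; `call_with_timeout` and `slow_service` are hypothetical names, and the 50 ms limit is an arbitrary value for the demo.

```python
import asyncio

async def call_with_timeout(func, timeout: float, *args, **kwargs):
    """Fail fast instead of waiting indefinitely on a slow dependency."""
    return await asyncio.wait_for(func(*args, **kwargs), timeout=timeout)

async def slow_service():
    await asyncio.sleep(1.0)  # Simulates a dependency that hangs
    return "late"

async def main():
    try:
        return await call_with_timeout(slow_service, 0.05)
    except asyncio.TimeoutError:
        return "timed out"

outcome = asyncio.run(main())
print(outcome)
```

Because a timeout surfaces as an exception, a circuit breaker wrapped around `call_with_timeout` would count it as a failure like any other.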

These patterns ensure your async applications remain stable and responsive even when dependencies fail.

In Part 14, we’ll explore state machines and observer patterns.