Chaos Engineering

Proactively testing system resilience:

Chaos Engineering Principles:

  • Build a hypothesis around steady state
  • Vary real-world events
  • Run experiments in production
  • Minimize blast radius
  • Automate experiments
  • Learn and improve
  • Share results
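The first principle can be made concrete in code: a steady-state hypothesis is simply a predicate over observed metrics. A minimal sketch, where the choice of metrics (error rate, p99 latency) and the thresholds are illustrative assumptions, not part of any specific tool:

```java
// Illustrative steady-state hypothesis: the system is "healthy" while the
// error rate and p99 latency stay within agreed tolerances.
// The thresholds are made up for the example.
public class SteadyStateHypothesis {
    private final double maxErrorRate;   // e.g. 0.01 = 1%
    private final long maxP99LatencyMs;  // e.g. 500 ms

    public SteadyStateHypothesis(double maxErrorRate, long maxP99LatencyMs) {
        this.maxErrorRate = maxErrorRate;
        this.maxP99LatencyMs = maxP99LatencyMs;
    }

    // True when the observed metrics satisfy the hypothesis.
    public boolean holds(double observedErrorRate, long observedP99LatencyMs) {
        return observedErrorRate <= maxErrorRate
            && observedP99LatencyMs <= maxP99LatencyMs;
    }
}
```

An experiment verifies the hypothesis before injecting a fault, injects it, and verifies again; a violated hypothesis means the system did not absorb the failure.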

Common Chaos Experiments:

  • Service instance failures
  • Dependency failures
  • Network latency injection
  • Network partition simulation
  • Resource exhaustion
  • Clock skew
  • Process termination
  • Region or zone outages
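Latency injection, for example, can be approximated in-process by wrapping a call with an artificial delay. This is only a sketch: real tools such as ToxiProxy or Chaos Mesh inject latency at the network layer, which exercises timeouts and connection pools more realistically.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Wraps a call and injects an artificial delay with a given probability.
// An in-process approximation of network latency injection.
public class LatencyInjector {
    private final long delayMs;
    private final double probability;

    public LatencyInjector(long delayMs, double probability) {
        this.delayMs = delayMs;
        this.probability = probability;
    }

    public <T> T call(Callable<T> target) throws Exception {
        if (ThreadLocalRandom.current().nextDouble() < probability) {
            Thread.sleep(delayMs); // simulated network latency
        }
        return target.call();
    }
}
```

Wrapping a client call with `new LatencyInjector(500, 0.1)` would delay roughly 10% of calls by 500 ms, which is enough to surface missing timeout configuration in callers.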

Example Chaos Experiment (Chaos Toolkit):

{
  "version": "1.0.0",
  "title": "Database connection failure resilience",
  "description": "Verify that the application can handle database connection failures gracefully",
  "tags": ["database", "resilience", "connection-pool"],
  "steady-state-hypothesis": {
    "title": "Application is healthy",
    "probes": [
      {
        "name": "api-health-check",
        "type": "probe",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "https://api.example.com/health",
          "method": "GET",
          "timeout": 3
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "block-database-connection",
      "provider": {
        "type": "process",
        "path": "scripts/block-db-connection.sh"
      },
      "pauses": {
        "after": 10
      }
    },
    {
      "type": "probe",
      "name": "verify-fallback-mechanism",
      "provider": {
        "type": "http",
        "url": "https://api.example.com/orders/create",
        "method": "POST",
        "headers": {
          "Content-Type": "application/json"
        },
        "body": {
          "customerId": "customer-123",
          "items": [{"productId": "product-456", "quantity": 1}]
        },
        "timeout": 3
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "restore-database-connection",
      "provider": {
        "type": "process",
        "path": "scripts/restore-db-connection.sh"
      }
    }
  ]
}

Chaos Engineering Tools:

  • Chaos Monkey
  • Gremlin
  • Chaos Toolkit
  • Litmus
  • ChaosBlade
  • PowerfulSeal
  • Chaos Mesh
  • ToxiProxy

Resilience Testing Strategies

Verifying system behavior under failure:

Testing Approaches:

  • Unit testing with fault injection
  • Integration testing with simulated failures
  • Load testing with degraded resources
  • Fault injection testing
  • Recovery testing
  • Game days and disaster recovery drills
  • Continuous chaos testing

Example Resilience Unit Test (Java):

// JUnit 5 + Mockito test for circuit breaker behavior
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.*;

import org.junit.jupiter.api.Test;
@Test
public void testCircuitBreakerTripsAfterFailures() {
    // Create a mock payment service that fails
    PaymentService mockPaymentService = mock(PaymentService.class);
    when(mockPaymentService.processPayment(any(Order.class)))
        .thenThrow(new ServiceUnavailableException("Payment service unavailable"));
    
    // Create order service with circuit breaker
    OrderService orderService = new OrderService(mockPaymentService);
    Order testOrder = new Order("123", "customer-456", 99.99);
    
    // First few calls should attempt to call the service and then use fallback
    for (int i = 0; i < 5; i++) {
        PaymentResult result = orderService.processPayment(testOrder);
        assertEquals(PaymentStatus.PENDING, result.getStatus());
        assertEquals("Payment queued for processing", result.getMessage());
    }
    
    // Verify the mock was called the expected number of times
    verify(mockPaymentService, times(5)).processPayment(any(Order.class));
    
    // Additional calls should trip the circuit breaker and go straight to fallback
    // without calling the service
    for (int i = 0; i < 5; i++) {
        PaymentResult result = orderService.processPayment(testOrder);
        assertEquals(PaymentStatus.PENDING, result.getStatus());
    }
    
    // Verify no additional calls were made to the service
    verify(mockPaymentService, times(5)).processPayment(any(Order.class));
}
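The test above treats the circuit breaker inside OrderService as a black box. The breaker itself can be sketched in a few dozen lines; this is a simplified CLOSED/OPEN model with a failure threshold and a fallback (a production breaker would also add a half-open state and a time-based reset):

```java
import java.util.function.Supplier;

// Minimal circuit breaker: CLOSED until `threshold` consecutive failures,
// then OPEN, after which calls go straight to the fallback without
// touching the failing dependency.
public class SimpleCircuitBreaker<T> {
    private final int threshold;
    private final Supplier<T> fallback;
    private int consecutiveFailures = 0;

    public SimpleCircuitBreaker(int threshold, Supplier<T> fallback) {
        this.threshold = threshold;
        this.fallback = fallback;
    }

    public boolean isOpen() {
        return consecutiveFailures >= threshold;
    }

    public T execute(Supplier<T> call) {
        if (isOpen()) {
            return fallback.get(); // short-circuit: dependency is not called
        }
        try {
            T result = call.get();
            consecutiveFailures = 0; // success closes the breaker again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            return fallback.get();
        }
    }
}
```

This mirrors the behavior the test asserts: five failing calls reach the dependency, after which the breaker opens and subsequent calls go straight to the fallback.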

Operational Resilience

Observability for Resilience

Gaining visibility into distributed systems:

Observability Components:

  • Distributed tracing
  • Metrics collection
  • Structured logging
  • Health checks
  • Dependency monitoring
  • Error tracking
  • Performance monitoring
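Health checks, for instance, are most useful when they aggregate the status of critical dependencies rather than only reporting process liveness. A sketch, where the dependency names and the boolean-probe interface are illustrative assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Aggregates per-dependency checks into one overall health status.
// Each check is a simple boolean probe (e.g. "can I reach the database?").
public class CompositeHealthCheck {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    public void register(String dependency, BooleanSupplier check) {
        checks.put(dependency, check);
    }

    // Per-dependency results; a probe that throws counts as unhealthy.
    public Map<String, Boolean> check() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, BooleanSupplier> e : checks.entrySet()) {
            boolean healthy;
            try {
                healthy = e.getValue().getAsBoolean();
            } catch (RuntimeException ex) {
                healthy = false;
            }
            results.put(e.getKey(), healthy);
        }
        return results;
    }

    public boolean isHealthy() {
        return check().values().stream().allMatch(Boolean::booleanValue);
    }
}
```

The per-dependency breakdown is what makes such an endpoint useful during an incident: it distinguishes "the service is down" from "the service is up but its payment dependency is failing".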

Key Resilience Metrics:

  • Error rates
  • Latency percentiles
  • Circuit breaker status
  • Retry counts
  • Fallback usage
  • Resource utilization
  • Dependency health
  • Recovery time
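Error rate, the first metric above, is typically computed over a sliding window of recent calls. A hand-rolled sketch to make the metric concrete (a real system would use a metrics library such as Micrometer or a Prometheus client instead):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Tracks the error rate over the last `windowSize` recorded calls.
// Shown only to make the metric concrete; use a metrics library in practice.
public class SlidingWindowErrorRate {
    private final int windowSize;
    private final Deque<Boolean> outcomes = new ArrayDeque<>();
    private int failures = 0;

    public SlidingWindowErrorRate(int windowSize) {
        this.windowSize = windowSize;
    }

    public void record(boolean success) {
        outcomes.addLast(success);
        if (!success) failures++;
        if (outcomes.size() > windowSize) {
            // Evict the oldest outcome to keep the window bounded.
            if (!outcomes.removeFirst()) failures--;
        }
    }

    // Fraction of failed calls in the current window (0.0 when empty).
    public double errorRate() {
        return outcomes.isEmpty() ? 0.0 : (double) failures / outcomes.size();
    }
}
```

A windowed rate like this is also the usual input to a circuit breaker's trip decision, tying this metric directly back to the resilience patterns above.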

Example Distributed Tracing (OpenTelemetry):

// OpenTelemetry distributed tracing example
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;

public class OrderController {
    private final OrderService orderService;
    private final OpenTelemetry openTelemetry;
    private final Tracer tracer;
    
    public OrderController(OrderService orderService, OpenTelemetry openTelemetry) {
        this.orderService = orderService;
        this.openTelemetry = openTelemetry;
        this.tracer = openTelemetry.getTracer("com.example.orders");
    }
    
    public OrderResponse createOrder(HttpRequest request) {
        // Extract the upstream trace context from the incoming request
        // (HttpRequestGetter is a custom TextMapGetter for this request type)
        Context context = openTelemetry.getPropagators().getTextMapPropagator()
            .extract(Context.current(), request, new HttpRequestGetter());
        
        // Start a new span
        Span span = tracer.spanBuilder("createOrder")
            .setParent(context)
            .setSpanKind(SpanKind.SERVER)
            .startSpan();
        
        // Add attributes to the span
        span.setAttribute("http.method", request.getMethod());
        span.setAttribute("http.url", request.getUrl());
        
        try (Scope scope = span.makeCurrent()) {
            // Parse the request
            OrderRequest orderRequest = parseRequest(request);
            span.setAttribute("order.customerId", orderRequest.getCustomerId());
            
            // Create the order
            try {
                Order order = orderService.createOrder(orderRequest);
                span.setAttribute("order.id", order.getId());
                span.setAttribute("order.status", order.getStatus().toString());
                
                // Return success response
                return new OrderResponse(order.getId(), order.getStatus(), null);
            } catch (Exception e) {
                // Record the error
                span.recordException(e);
                span.setStatus(StatusCode.ERROR, e.getMessage());
                
                // Return error response
                return new OrderResponse(null, OrderStatus.FAILED, e.getMessage());
            }
        } finally {
            span.end();
        }
    }
}

Resilience in Practice

Real-world implementation strategies:

Resilience Implementation Checklist:

  • Identify critical paths and dependencies
  • Apply appropriate resilience patterns
  • Implement comprehensive monitoring
  • Establish failure detection mechanisms
  • Define recovery procedures
  • Test resilience regularly
  • Document resilience strategies
  • Train teams on failure response

Resilience Maturity Model:

  1. Reactive: Respond to failures after they occur
  2. Proactive: Implement basic resilience patterns
  3. Preventative: Systematically identify and mitigate risks
  4. Anticipatory: Proactively test and improve resilience
  5. Adaptive: Self-healing systems that learn from failures

Example Resilience Architecture:

┌───────────────────────────────────────────────────────────┐
│                                                           │
│                    API Gateway                            │
│                                                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │             │  │             │  │             │        │
│  │  Rate       │  │  Auth       │  │  Request    │        │
│  │  Limiting   │  │  Service    │  │  Routing    │        │
│  │             │  │             │  │             │        │
│  └─────────────┘  └─────────────┘  └─────────────┘        │
│                                                           │
└───────────────────────────────────────────────────────────┘
                 ▲                        ▲
                 │                        │
    ┌────────────┴─────────┐    ┌─────────┴────────────┐
    │                      │    │                      │
    ▼                      ▼    ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│                 │    │                 │    │                 │
│  User Service   │    │  Order Service  │    │  Product Service│
│                 │    │                 │    │                 │
│  Circuit Breaker│    │  Circuit Breaker│    │  Circuit Breaker│
│  Retry          │    │  Retry          │    │  Retry          │
│  Cache          │    │  Bulkhead       │    │  Cache          │
│                 │    │                 │    │                 │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          ▼                      ▼                      ▼
┌───────────────────────────────────────────────────────────┐
│                                                           │
│                    Data Layer                             │
│                                                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │             │  │             │  │             │        │
│  │  User DB    │  │  Order DB   │  │  Product DB │        │
│  │  (Replica)  │  │  (Sharded)  │  │  (Cached)   │        │
│  │             │  │             │  │             │        │
│  └─────────────┘  └─────────────┘  └─────────────┘        │
│                                                           │
└───────────────────────────────────────────────────────────┘

Conclusion: Building Resilient Distributed Systems

Distributed systems failures are inevitable, but their impact on users doesn’t have to be. By understanding failure modes, implementing appropriate resilience patterns, testing systematically, and establishing operational practices that embrace failure, organizations can build systems that maintain availability and correctness despite adverse conditions.

Key takeaways from this guide include:

  1. Understand Failure Modes: Recognize the many ways distributed systems can fail and design accordingly
  2. Apply Resilience Patterns: Implement circuit breakers, bulkheads, retries, and other patterns to handle failures gracefully
  3. Test Proactively: Use chaos engineering and resilience testing to verify system behavior under failure conditions
  4. Embrace Observability: Implement comprehensive monitoring to detect and diagnose failures quickly
  5. Design for Graceful Degradation: Ensure systems can continue providing value even when components fail

By applying these principles and leveraging the techniques discussed in this guide, you can build distributed systems that not only survive failures but continue to deliver value under adverse conditions, turning the complexity of distributed systems into an opportunity for greater reliability and a better user experience.