Production Deployment and Best Practices
Deploying circuit breakers in production requires careful consideration of configuration, testing, and operational practices.
Configuration Best Practices
Circuit breakers should be configurable to adapt to different environments:
package config
import (
"encoding/json"
"os"
"time"
"example.com/circuitbreaker"
)
// CircuitBreakerConfig represents configuration for a circuit breaker
type CircuitBreakerConfig struct {
FailureThreshold int `json:"failureThreshold"`
ResetTimeout string `json:"resetTimeout"`
FailureRateThreshold float64 `json:"failureRateThreshold"`
MinimumRequests int `json:"minimumRequests"`
WindowSize string `json:"windowSize"`
Timeout string `json:"timeout"`
}
// ResilienceConfig represents configuration for resilience patterns
type ResilienceConfig struct {
CircuitBreakers map[string]CircuitBreakerConfig `json:"circuitBreakers"`
Bulkheads map[string]int `json:"bulkheads"`
}
// LoadConfig loads configuration from a file
func LoadConfig(filename string) (*ResilienceConfig, error) {
file, err := os.Open(filename)
if err != nil {
return nil, err
}
defer file.Close()
config := &ResilienceConfig{}
if err := json.NewDecoder(file).Decode(config); err != nil {
return nil, err
}
return config, nil
}
// CreateCircuitBreaker creates a circuit breaker from configuration
func CreateCircuitBreaker(config CircuitBreakerConfig) (circuitbreaker.CircuitBreaker, error) {
resetTimeout, err := time.ParseDuration(config.ResetTimeout)
if err != nil {
return nil, err
}
// The per-call timeout is only validated here, because neither constructor
// below accepts it
if _, err := time.ParseDuration(config.Timeout); err != nil {
return nil, err
}
// Choose the appropriate circuit breaker implementation based on configuration
if config.FailureRateThreshold > 0 {
// windowSize applies only to the sliding-window breaker, so parse it here;
// count-based entries such as userService omit it
windowSize, err := time.ParseDuration(config.WindowSize)
if err != nil {
return nil, err
}
return circuitbreaker.NewSlidingWindowCircuitBreaker(
windowSize,
config.FailureRateThreshold,
config.MinimumRequests,
resetTimeout,
), nil
}
return circuitbreaker.NewSimpleCircuitBreaker(
config.FailureThreshold,
resetTimeout,
), nil
}
Example configuration file:
{
"circuitBreakers": {
"userService": {
"failureThreshold": 5,
"resetTimeout": "30s",
"timeout": "2s"
},
"paymentService": {
"failureRateThreshold": 0.25,
"minimumRequests": 10,
"windowSize": "1m",
"resetTimeout": "1m",
"timeout": "5s"
},
"inventoryService": {
"failureThreshold": 3,
"resetTimeout": "15s",
"timeout": "1s"
}
},
"bulkheads": {
"userService": 50,
"paymentService": 20,
"inventoryService": 30
}
}
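Since a single malformed entry can prevent a service from starting, it can help to validate every circuit breaker entry up front and report all problems at once. The following is a minimal sketch, not part of the circuitbreaker package: BreakerConfig mirrors the CircuitBreakerConfig fields above, and the Validate function, its rules (positive durations, a failure rate between 0 and 1, a positive minimum request count), and the use of errors.Join (Go 1.20 or later) are assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// BreakerConfig mirrors the fields of CircuitBreakerConfig from the guide.
type BreakerConfig struct {
	FailureThreshold     int
	ResetTimeout         string
	FailureRateThreshold float64
	MinimumRequests      int
	WindowSize           string
	Timeout              string
}

// Validate collects every problem it finds instead of stopping at the first.
// The rules are illustrative assumptions, not requirements of any library.
func Validate(name string, c BreakerConfig) error {
	var errs []error
	checkDuration := func(field, value string) {
		d, err := time.ParseDuration(value)
		switch {
		case err != nil:
			errs = append(errs, fmt.Errorf("%s.%s: %w", name, field, err))
		case d <= 0:
			errs = append(errs, fmt.Errorf("%s.%s: must be positive, got %v", name, field, d))
		}
	}
	checkDuration("resetTimeout", c.ResetTimeout)
	checkDuration("timeout", c.Timeout)

	switch {
	case c.FailureRateThreshold < 0 || c.FailureRateThreshold > 1:
		errs = append(errs, fmt.Errorf("%s.failureRateThreshold: must be between 0 and 1", name))
	case c.FailureRateThreshold > 0:
		// Rate-based breaker: the window and minimum request count are required.
		checkDuration("windowSize", c.WindowSize)
		if c.MinimumRequests <= 0 {
			errs = append(errs, fmt.Errorf("%s.minimumRequests: must be positive", name))
		}
	case c.FailureThreshold <= 0:
		errs = append(errs, fmt.Errorf("%s: set failureThreshold or failureRateThreshold", name))
	}
	// errors.Join returns nil when no errors were collected.
	return errors.Join(errs...)
}

func main() {
	good := BreakerConfig{FailureThreshold: 5, ResetTimeout: "30s", Timeout: "2s"}
	bad := BreakerConfig{FailureRateThreshold: 0.25, ResetTimeout: "soon", Timeout: "5s"}
	fmt.Println(Validate("userService", good))
	fmt.Println(Validate("paymentService", bad))
}
```

Running a check like this over every entry right after loading the file lets a deployment fail fast with a complete list of problems rather than one error per restart.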
Testing Circuit Breakers
Testing circuit breakers requires simulating failure scenarios:
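// An external test package avoids naming the package after the standard
// library's testing package
package circuitbreaker_test
import (
"errors"
"math/rand"
"testing"
"time"
"example.com/circuitbreaker"
)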
// TestSimpleCircuitBreaker tests the basic functionality of a circuit breaker
func TestSimpleCircuitBreaker(t *testing.T) {
// Create a circuit breaker that trips after 3 failures and resets after 100ms
cb := circuitbreaker.NewSimpleCircuitBreaker(3, 100*time.Millisecond)
// Function that always fails
failingFunc := func() error {
return errors.New("simulated failure")
}
// Function that always succeeds
successFunc := func() error {
return nil
}
// Test initial state
if cb.State() != circuitbreaker.Closed {
t.Errorf("Initial state should be Closed, got %v", cb.State())
}
// Test that circuit opens after threshold failures
for i := 0; i < 3; i++ {
err := cb.Execute(failingFunc)
if err == nil {
t.Errorf("Expected error, got nil")
}
}
// Circuit should now be open
if cb.State() != circuitbreaker.Open {
t.Errorf("State should be Open after failures, got %v", cb.State())
}
// Test that requests are rejected when circuit is open
err := cb.Execute(successFunc)
if !errors.Is(err, circuitbreaker.ErrCircuitOpen) {
t.Errorf("Expected ErrCircuitOpen, got %v", err)
}
// Wait for reset timeout
time.Sleep(150 * time.Millisecond)
// Circuit should now be half-open
if !cb.AllowRequest() {
t.Errorf("Circuit should allow a request after reset timeout")
}
// Test that circuit closes after success in half-open state
err = cb.Execute(successFunc)
if err != nil {
t.Errorf("Expected success, got %v", err)
}
// Circuit should now be closed
if cb.State() != circuitbreaker.Closed {
t.Errorf("State should be Closed after success, got %v", cb.State())
}
}
// TestCircuitBreakerWithChaos tests circuit breaker behavior with chaos testing
func TestCircuitBreakerWithChaos(t *testing.T) {
if testing.Short() {
t.Skip("Skipping chaos test in short mode")
}
// Create a sliding window circuit breaker
cb := circuitbreaker.NewSlidingWindowCircuitBreaker(
1*time.Second, // 1 second window
0.5, // 50% failure threshold
10, // Minimum 10 requests
500*time.Millisecond, // 500ms reset timeout
)
// Create a function with variable failure rate
var failureRate float64 = 0.0
testFunc := func() error {
if rand.Float64() < failureRate {
return errors.New("simulated failure")
}
return nil
}
// Run test for 10 seconds
start := time.Now()
end := start.Add(10 * time.Second)
// Track statistics
requests := 0
failures := 0
rejections := 0
for time.Now().Before(end) {
// Adjust failure rate over time to simulate service degradation and recovery
elapsed := time.Since(start).Seconds()
switch {
case elapsed < 2:
failureRate = 0.1 // 10% failures initially
case elapsed < 4:
failureRate = 0.6 // 60% failures - should trip circuit
case elapsed < 6:
failureRate = 0.7 // 70% failures - circuit should stay open
case elapsed < 8:
failureRate = 0.2 // 20% failures - circuit should recover
default:
failureRate = 0.1 // 10% failures - circuit should stay closed
}
requests++
err := cb.Execute(testFunc)
if err != nil {
if errors.Is(err, circuitbreaker.ErrCircuitOpen) {
rejections++
} else {
failures++
}
}
// Small delay between requests
time.Sleep(10 * time.Millisecond)
}
// Log statistics
t.Logf("Total requests: %d", requests)
t.Logf("Failures: %d (%.2f%%)", failures, float64(failures)/float64(requests)*100)
t.Logf("Rejections: %d (%.2f%%)", rejections, float64(rejections)/float64(requests)*100)
// Verify circuit breaker protected the system during high failure rates
if rejections == 0 {
t.Errorf("Circuit breaker should have rejected some requests during high failure rates")
}
}
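Tests that rely on time.Sleep tend to be slow and can become flaky on loaded CI machines. One common alternative is to inject the clock so a test can advance time instantly. The sketch below illustrates the idea with a deliberately tiny, single-goroutine breaker; miniBreaker, fakeClock, errOpen, and runScenario are hypothetical names for this example, and whether the circuitbreaker package in this guide accepts an injected clock is not something the example assumes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errOpen = errors.New("circuit open")

// miniBreaker is a minimal consecutive-failure breaker used only to
// demonstrate clock injection. It is not safe for concurrent use.
type miniBreaker struct {
	now          func() time.Time // injectable clock
	threshold    int
	resetTimeout time.Duration
	failures     int
	open         bool
	openedAt     time.Time
}

func (b *miniBreaker) Execute(fn func() error) error {
	if b.open && b.now().Sub(b.openedAt) < b.resetTimeout {
		return errOpen
	}
	// Either the circuit is closed, or the reset timeout has elapsed and
	// this call acts as the half-open probe.
	if err := fn(); err != nil {
		b.failures++
		if b.open || b.failures >= b.threshold {
			b.open = true
			b.openedAt = b.now()
		}
		return err
	}
	b.failures = 0
	b.open = false
	return nil
}

// fakeClock only moves when the test tells it to.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// runScenario replays the open, reject, recover sequence without sleeping.
func runScenario() (rejectedWhileOpen, recovered bool) {
	clock := &fakeClock{t: time.Unix(0, 0)}
	b := &miniBreaker{now: clock.Now, threshold: 3, resetTimeout: 100 * time.Millisecond}
	fail := func() error { return errors.New("simulated failure") }
	ok := func() error { return nil }

	for i := 0; i < 3; i++ {
		_ = b.Execute(fail) // three failures trip the breaker
	}
	rejectedWhileOpen = errors.Is(b.Execute(ok), errOpen)

	clock.Advance(150 * time.Millisecond) // skip past the reset timeout
	recovered = b.Execute(ok) == nil && !b.open
	return rejectedWhileOpen, recovered
}

func main() {
	rejected, recovered := runScenario()
	fmt.Println("rejected while open:", rejected)
	fmt.Println("recovered after reset timeout:", recovered)
}
```

The same scenario that needs a 150ms sleep in TestSimpleCircuitBreaker completes immediately here, which matters most for longer scenarios such as the ten-second chaos test.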
Deployment Strategies
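A production entry point wires these pieces together: it loads the configuration, builds and instruments the circuit breakers, registers health checks, and shuts the HTTP server down gracefully: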
package main
import (
"context"
"log"
"net/http"
"os"
"os/signal"
"syscall"
"time"
"example.com/circuitbreaker"
"example.com/config"
"example.com/health"
"example.com/metrics"
)
func main() {
// Load configuration
cfg, err := config.LoadConfig("resilience.json")
if err != nil {
log.Fatalf("Failed to load configuration: %v", err)
}
// Create circuit breakers
circuitBreakers := make(map[string]circuitbreaker.CircuitBreaker)
for name, cbConfig := range cfg.CircuitBreakers {
cb, err := config.CreateCircuitBreaker(cbConfig)
if err != nil {
log.Fatalf("Failed to create circuit breaker %s: %v", name, err)
}
circuitBreakers[name] = cb
}
// Set up metrics
promCollector := metrics.NewPrometheusCircuitBreakerCollector()
// Instrument circuit breakers
instrumentedBreakers := make(map[string]circuitbreaker.CircuitBreaker)
for name, cb := range circuitBreakers {
// Name the variable cbMetrics so it does not shadow the metrics package
cbMetrics := metrics.NewCircuitBreakerMetrics(name)
promCollector.RegisterMetrics(name, cbMetrics)
instrumentedBreakers[name] = metrics.NewInstrumentedCircuitBreaker(cb, cbMetrics)
}
// Set up health checks
healthCheck := health.NewCircuitBreakerHealth()
for name, cb := range instrumentedBreakers {
healthCheck.RegisterCircuitBreaker(name, cb)
}
// Set up HTTP handlers
metrics.SetupPrometheusHandler()
health.SetupHealthCheckHandler(healthCheck)
// Start metrics updater
go func() {
ticker := time.NewTicker(15 * time.Second)
defer ticker.Stop()
for range ticker.C {
promCollector.UpdateMetrics()
}
}()
// Start HTTP server
server := &http.Server{
Addr: ":8080",
}
// Graceful shutdown
shutdownDone := make(chan struct{})
go func() {
defer close(shutdownDone)
signals := make(chan os.Signal, 1)
signal.Notify(signals, syscall.SIGINT, syscall.SIGTERM)
<-signals
log.Println("Shutting down...")
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
if err := server.Shutdown(ctx); err != nil {
log.Printf("HTTP server shutdown error: %v", err)
}
}()
// Start server
log.Println("Starting server on :8080")
if err := server.ListenAndServe(); err != http.ErrServerClosed {
log.Fatalf("HTTP server error: %v", err)
}
// ListenAndServe returns as soon as Shutdown is called, so wait for
// Shutdown to finish draining in-flight requests before exiting
<-shutdownDone
}
Production Best Practices
Here are key best practices for using circuit breakers in production:
- Start Conservative: Begin with higher failure thresholds and shorter reset timeouts, then adjust based on observed behavior.
- Granular Circuit Breakers: Use separate circuit breakers for different dependencies and even different operations on the same dependency.
- Fallbacks: Always implement fallbacks for when the circuit is open:
// Example fallback strategy
func getUserWithFallback(userID string) (*User, error) {
// Try primary data source with circuit breaker
var user *User
err := userServiceCB.Execute(func() error {
var err error
user, err = userService.GetUser(userID)
return err
})
// If circuit is open or request failed, try fallback
if err != nil {
// Try cache
cachedUser, cacheErr := userCache.Get(userID)
if cacheErr == nil {
return cachedUser, nil
}
// Try default
return &User{
ID: userID,
Name: "Unknown User",
IsGuest: true,
}, nil
}
return user, nil
}
- Monitor and Alert: Set up alerts for circuit breaker state changes and high rejection rates:
# Example Prometheus alert rule
groups:
  - name: CircuitBreakerAlerts
    rules:
      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state{name=~".*"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is open"
          description: "The circuit breaker {{ $labels.name }} has been open for 5 minutes."
      - alert: HighRejectionRate
        expr: rate(circuit_breaker_rejected_total{name=~".*"}[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rejection rate for {{ $labels.name }}"
          description: "Circuit breaker {{ $labels.name }} is rejecting more than 10 requests per second."
- Tune Parameters: Adjust circuit breaker parameters based on observed behavior:
// Example adaptive parameter tuning
// Assumes CurrentThresholds returns (int, time.Duration)
func tuneCircuitBreakerParameters(cb *AdaptiveCircuitBreaker, metrics *CircuitBreakerMetrics) {
	// Check metrics every minute
	ticker := time.NewTicker(1 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		total := metrics.successCount + metrics.failureCount + metrics.rejectedCount
		if total == 0 {
			continue // No traffic in this interval, so there is nothing to tune
		}
		errorRate := metrics.CalculateErrorRate()
		rejectionRate := float64(metrics.rejectedCount) / float64(total)
		threshold, timeout := cb.CurrentThresholds()
		switch {
		case errorRate < 0.01 && rejectionRate > 0.1:
			// Error rate is very low but rejection rate is high: make the circuit breaker more lenient
			threshold++
			timeout = time.Duration(float64(timeout) * 0.9)
			log.Printf("Tuned circuit breaker to be more lenient: threshold=%d, timeout=%v", threshold, timeout)
		case errorRate > 0.3:
			// Error rate is very high: make the circuit breaker more strict,
			// but never let the threshold drop below 1
			if threshold > 1 {
				threshold--
			}
			timeout = time.Duration(float64(timeout) * 1.1)
			log.Printf("Tuned circuit breaker to be more strict: threshold=%d, timeout=%v", threshold, timeout)
		default:
			continue
		}
		cb.SetThresholds(threshold, timeout)
	}
}
- Graceful Degradation: Design systems to function with reduced capabilities when dependencies are unavailable:
// Example service with graceful degradation
type RecommendationService struct {
productService *ProductService
productServiceCB circuitbreaker.CircuitBreaker
userService *UserService
userServiceCB circuitbreaker.CircuitBreaker
analyticsService *AnalyticsService
analyticsServiceCB circuitbreaker.CircuitBreaker
}
// GetRecommendations returns product recommendations for a user
func (s *RecommendationService) GetRecommendations(userID string) ([]Product, error) {
// Try to get personalized recommendations using all services
if s.userServiceCB.AllowRequest() && s.analyticsServiceCB.AllowRequest() {
var user *User
var userHistory []PurchaseHistory
// Get user data
userErr := s.userServiceCB.Execute(func() error {
var err error
user, err = s.userService.GetUser(userID)
return err
})
// Get user history
historyErr := s.analyticsServiceCB.Execute(func() error {
var err error
userHistory, err = s.analyticsService.GetUserHistory(userID)
return err
})
// If both succeeded, generate personalized recommendations
if userErr == nil && historyErr == nil {
return s.generatePersonalizedRecommendations(user, userHistory)
}
}
// Fallback to category-based recommendations
if s.productServiceCB.AllowRequest() {
var categories []string
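// Try to get user's preferred categories if user service is available
if s.userServiceCB.AllowRequest() {
// The error is deliberately ignored: on failure, categories stays empty
// and the popular categories below are used instead
_ = s.userServiceCB.Execute(func() error {
var err error
categories, err = s.userService.GetUserPreferredCategories(userID)
return err
})
}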
// If we have categories, use them; otherwise use popular categories
if len(categories) == 0 {
categories = []string{"popular", "trending"}
}
// Get recommendations by category
var products []Product
err := s.productServiceCB.Execute(func() error {
var err error
products, err = s.productService.GetProductsByCategories(categories, 10)
return err
})
if err == nil {
return products, nil
}
}
// Ultimate fallback: return hardcoded popular products
return s.getHardcodedPopularProducts(), nil
}
- Avoid Cascading Circuit Breakers: Be careful with circuit breakers that depend on each other:
// Example of cascading circuit breakers: Service A calls Service B, which calls Service C

// Problematic design: every level uses the same threshold and timeout.
// If C fails, B's circuit breaker opens, then A's circuit breaker opens,
// so the failure propagates upward.
func problematicDesign() (a, b, c circuitbreaker.CircuitBreaker) {
	c = circuitbreaker.NewSimpleCircuitBreaker(5, 30*time.Second)
	b = circuitbreaker.NewSimpleCircuitBreaker(5, 30*time.Second)
	a = circuitbreaker.NewSimpleCircuitBreaker(5, 30*time.Second)
	return a, b, c
}

// Better design: use different thresholds and timeouts at different levels,
// so the breaker closest to the failure trips first and recovers first.
func betterDesign() (a, b, c circuitbreaker.CircuitBreaker) {
	c = circuitbreaker.NewSimpleCircuitBreaker(3, 10*time.Second)
	b = circuitbreaker.NewSimpleCircuitBreaker(5, 30*time.Second)
	a = circuitbreaker.NewSimpleCircuitBreaker(7, 60*time.Second)
	return a, b, c
}

// Even better: implement fallbacks at each level
- Circuit Breaker Patterns by Dependency Type: Tailor circuit breaker configurations to the type of dependency:
| Dependency Type | Failure Threshold | Reset Timeout | Notes |
|---|---|---|---|
| Critical database | Higher (5-10) | Shorter (10-30s) | Essential service, try more aggressively to reconnect |
| External API | Lower (3-5) | Longer (30-60s) | Less control, be more conservative |
| Cache service | Very low (1-2) | Very short (5-10s) | Non-critical, fail fast and recover quickly |
| Background job | Higher (8-10) | Medium (20-40s) | Can tolerate more failures before breaking |
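The granular-breaker practice and the per-dependency-type table above can be combined in a small registry that hands out one breaker per operation on each dependency. This is a rough sketch under stated assumptions: Settings, Breaker, Registry, and the preset keys are hypothetical names, Breaker is a stand-in rather than the circuitbreaker package's type, and each preset value is simply one point picked from the corresponding range in the table.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Settings holds the two knobs from the table above.
type Settings struct {
	FailureThreshold int
	ResetTimeout     time.Duration
}

// presets picks one value from each range in the table. The exact numbers
// are illustrative assumptions, not recommendations.
var presets = map[string]Settings{
	"database":      {FailureThreshold: 8, ResetTimeout: 20 * time.Second},
	"externalAPI":   {FailureThreshold: 4, ResetTimeout: 45 * time.Second},
	"cache":         {FailureThreshold: 2, ResetTimeout: 5 * time.Second},
	"backgroundJob": {FailureThreshold: 9, ResetTimeout: 30 * time.Second},
}

// Breaker is a stand-in for a real circuit breaker.
type Breaker struct {
	Name     string
	Settings Settings
}

// Registry lazily creates one breaker per (dependency, operation) pair.
type Registry struct {
	mu       sync.Mutex
	breakers map[string]*Breaker
}

func NewRegistry() *Registry {
	return &Registry{breakers: make(map[string]*Breaker)}
}

// Get returns the breaker for one operation on one dependency, creating it
// on first use with the preset for the given dependency type.
func (r *Registry) Get(kind, dependency, operation string) (*Breaker, error) {
	settings, ok := presets[kind]
	if !ok {
		return nil, fmt.Errorf("unknown dependency type %q", kind)
	}
	key := dependency + "/" + operation

	r.mu.Lock()
	defer r.mu.Unlock()
	if b, found := r.breakers[key]; found {
		return b, nil
	}
	b := &Breaker{Name: key, Settings: settings}
	r.breakers[key] = b
	return b, nil
}

func main() {
	r := NewRegistry()
	for _, op := range []string{"Charge", "Refund"} {
		b, err := r.Get("externalAPI", "paymentService", op)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %+v\n", b.Name, b.Settings)
	}
}
```

Keying by operation means a slow or failing Refund call cannot open the circuit for Charge, at the cost of more breakers to monitor.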
What This Means
Circuit breakers are a critical component in the resilience toolkit for Go developers building distributed systems. By implementing these patterns, you can protect your applications from cascading failures, improve user experience during partial outages, and give failing dependencies time to recover.
In this guide, we’ve explored the fundamentals of circuit breaker patterns, implemented various circuit breaker strategies in Go, and examined how to integrate them with common microservice components. We’ve also covered advanced topics like monitoring, configuration, and production best practices.
Remember that circuit breakers are most effective when combined with other resilience patterns like timeouts, retries, bulkheads, and fallbacks. Together, these patterns form a comprehensive approach to building truly resilient Go applications that can withstand the unpredictable nature of distributed environments.
As you implement circuit breakers in your own applications, start with simple implementations and gradually add more sophisticated features as you observe their behavior in production. Monitor their performance, tune their parameters, and continuously refine your approach to achieve the optimal balance between reliability and resource utilization.
By mastering these advanced fault tolerance patterns, you’ll be well-equipped to build Go applications that not only survive in the face of failures but continue to provide value to users even when parts of the system are degraded.