Understanding Distributed Systems Failures

Failure Modes and Models

Designing for failure starts with recognizing what can go wrong:

Common Failure Modes:

  • Hardware failures
  • Network partitions
  • Service dependency failures
  • Resource exhaustion
  • Data corruption
  • Clock skew
  • Configuration errors
  • Deployment failures
  • Cascading failures
  • Thundering herd problems
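
A thundering herd typically forms when many clients retry at the same moment after a shared failure. One common mitigation is exponential backoff with random jitter, which spreads the retries out over time. The sketch below is illustrative only: the function name backoff_delays and the base and cap defaults are assumptions, not values taken from any particular system.

```python
import random


def backoff_delays(attempts, base=0.1, cap=5.0, rng=None):
    """Return one randomized retry delay (in seconds) per attempt.

    Hypothetical helper using "full jitter": each delay is drawn
    uniformly from [0, min(cap, base * 2**attempt)], so clients that
    failed together are unlikely to retry together.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles on every attempt until it hits the cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Each run prints different delays because of the jitter.
    for i, delay in enumerate(backoff_delays(5)):
        print(f"retry {i}: wait {delay:.3f}s")
```

The cap keeps the worst-case wait bounded, while the jitter keeps retries from different clients from lining up again.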

The Fallacies of Distributed Computing (assumptions that are false in practice):

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn’t change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Failure Models:

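  • Fail-stop: Components fail by halting permanently, and other components can detect the halt
  • Crash-recovery: Components crash and may later restart, possibly having lost volatile state
  • Omission: Components fail to send or receive some messages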
  • Byzantine: Components behave arbitrarily or maliciously
  • Timing: Components respond too early or too late
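
From the outside, several of these models look the same: a crashed component, a dropped message, and a very slow reply all show up as "no response before the deadline". For that reason, practical detectors usually only suspect a failure rather than prove one. The following is a minimal sketch of a timeout-based heartbeat detector; the class name HeartbeatDetector, its methods, and the timeout value are assumptions made for illustration.

```python
class HeartbeatDetector:
    """Hypothetical timeout-based failure detector (illustrative sketch).

    A node is suspected if no heartbeat has arrived within `timeout`
    seconds. The detector cannot tell a crash from an omission or a
    timing failure; it only reports suspicion.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        # Record that `node` was alive at time `now`.
        self.last_seen[node] = now

    def suspected(self, node, now):
        # Nodes that have never sent a heartbeat are suspected too.
        last = self.last_seen.get(node)
        return last is None or (now - last) > self.timeout


if __name__ == "__main__":
    detector = HeartbeatDetector(timeout=3.0)
    detector.heartbeat("service-b", now=0.0)
    print(detector.suspected("service-b", now=2.0))  # within the timeout
    print(detector.suspected("service-b", now=4.0))  # past the timeout
    # Crash-recovery: a fresh heartbeat lifts the suspicion again.
    detector.heartbeat("service-b", now=5.0)
    print(detector.suspected("service-b", now=6.0))
```

Passing the current time in explicitly keeps the sketch deterministic; a real detector would more likely read a monotonic clock.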

Example Network Partition Scenario:

Before Partition:
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│               │     │               │     │               │
│  Service A    │────▶│  Service B    │────▶│  Service C    │
│  (Region 1)   │     │  (Region 2)   │     │  (Region 1)   │
│               │     │               │     │               │
└───────────────┘     └───────────────┘     └───────────────┘

After Partition:
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│               │     │               │     │               │
│  Service A    │  ╳  │  Service B    │  ╳  │  Service C    │
│  (Region 1)   │     │  (Region 2)   │     │  (Region 1)   │
│               │     │               │     │               │
└───────────────┘     └───────────────┘     └───────────────┘
                           ↑
                           │
                      Network Partition
                      Between Regions
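
The scenario above can be modeled in a few lines. This is a sketch under stated assumptions: the region names, the CALLS list, and the failed_calls helper are hypothetical and simply mirror the diagram. Services A and C share a region, but because the call chain routes through Service B in the other region, both hops fail once the regions are partitioned.

```python
# Service placement and call chain taken from the diagram above.
REGIONS = {"A": "region-1", "B": "region-2", "C": "region-1"}
CALLS = [("A", "B"), ("B", "C")]


def failed_calls(calls, regions, partitioned):
    """Return the calls whose endpoints sit in regions cut off from each other.

    `partitioned` is a set of frozensets, each holding two region names
    that can no longer reach one another.
    """
    broken = []
    for src, dst in calls:
        pair = frozenset((regions[src], regions[dst]))
        if pair in partitioned:
            broken.append((src, dst))
    return broken


if __name__ == "__main__":
    no_partition = set()
    region_split = {frozenset(("region-1", "region-2"))}
    print("before:", failed_calls(CALLS, REGIONS, no_partition))
    print("after: ", failed_calls(CALLS, REGIONS, region_split))
```

Calls within a single region are never reported as broken here, since a same-region pair collapses to a one-element set that cannot match a two-region partition entry.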