Understanding Distributed Systems Failures
Failure Modes and Models
Recognizing what can go wrong:
Common Failure Modes:
- Hardware failures
- Network partitions
- Service dependency failures
- Resource exhaustion
- Data corruption
- Clock skew
- Configuration errors
- Deployment failures
- Cascading failures
- Thundering herd problems
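As an illustration of the last item, the sketch below simulates a thundering herd: many clients retrying at the same instant versus the same clients with a random offset added to each retry. The client count, retry delay, spread, and the helper names (`peak_load`, `synchronized_retries`, `jittered_retries`) are assumptions made for this example, not values taken from a real system.

```python
import random
from collections import Counter


def peak_load(arrival_times, bucket_seconds=1.0):
    """Return the largest number of requests landing in any one time bucket."""
    buckets = Counter(int(t // bucket_seconds) for t in arrival_times)
    return max(buckets.values())


def synchronized_retries(num_clients, retry_delay):
    """Every client retries after exactly the same delay (the herd)."""
    return [retry_delay for _ in range(num_clients)]


def jittered_retries(num_clients, retry_delay, spread, rng):
    """Each client adds a random offset between 0 and spread seconds."""
    return [retry_delay + rng.uniform(0, spread) for _ in range(num_clients)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
herd = synchronized_retries(num_clients=1000, retry_delay=5.0)
spread_out = jittered_retries(num_clients=1000, retry_delay=5.0, spread=10.0, rng=rng)

print("peak requests/second, synchronized:", peak_load(herd))  # 1000
print("peak requests/second, jittered:    ", peak_load(spread_out))
```

In the synchronized case all 1000 requests land in the same one-second bucket. With jitter, the same requests are spread over roughly ten seconds, so the peak is far lower.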
The Fallacies of Distributed Computing:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Failure Models:
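- Fail-stop: Components fail by halting permanently, and other components can detect the halt
- Crash-recovery: Components halt and may later restart, possibly without their in-memory state
- Omission: Components fail to send or receive some messages
- Byzantine: Components behave arbitrarily or maliciously
- Timing: Components respond outside the expected time window, either too early or too late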
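To make the distinctions between these models concrete, here is a minimal sketch of a toy component that changes behavior depending on which failure model is injected. `FailureModel`, `SimulatedComponent`, and the 30-second delay are hypothetical names and values chosen for illustration. The delay is returned as a number instead of being slept, so the example runs instantly.

```python
import random
from enum import Enum


class FailureModel(Enum):
    FAIL_STOP = "fail-stop"
    CRASH_RECOVERY = "crash-recovery"
    OMISSION = "omission"
    BYZANTINE = "byzantine"
    TIMING = "timing"


class SimulatedComponent:
    """Toy component that echoes requests until a failure model is injected."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random(0)
        self.failure = None
        self.halted = False

    def inject(self, failure):
        self.failure = failure
        if failure in (FailureModel.FAIL_STOP, FailureModel.CRASH_RECOVERY):
            self.halted = True

    def restart(self):
        # Only a crash-recovery component may come back after halting.
        if self.failure is FailureModel.CRASH_RECOVERY:
            self.halted = False
            self.failure = None

    def handle(self, request):
        """Return (response, delay_seconds); a response of None means no reply."""
        if self.halted:
            return None, 0.0
        if self.failure is FailureModel.OMISSION:
            return None, 0.0  # message silently dropped
        if self.failure is FailureModel.BYZANTINE:
            return f"garbage-{self.rng.randint(0, 999)}", 0.0  # arbitrary reply
        if self.failure is FailureModel.TIMING:
            return request, 30.0  # correct reply, but far too late
        return request, 0.0


component = SimulatedComponent()
component.inject(FailureModel.CRASH_RECOVERY)
print(component.handle("ping"))  # (None, 0.0) while halted
component.restart()
print(component.handle("ping"))  # ('ping', 0.0) after recovery
```

Note that from the caller's side a halted component and an omission failure look identical here: both produce no reply. This is one reason failure detection in real systems usually relies on timeouts and repeated probes.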
Example Network Partition Scenario:
Before Partition:
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│               │     │               │     │               │
│   Service A   │────▶│   Service B   │────▶│   Service C   │
│  (Region 1)   │     │  (Region 2)   │     │  (Region 1)   │
│               │     │               │     │               │
└───────────────┘     └───────────────┘     └───────────────┘
After Partition:
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│               │     │               │     │               │
│   Service A   │  ╳  │   Service B   │  ╳  │   Service C   │
│  (Region 1)   │     │  (Region 2)   │     │  (Region 1)   │
│               │     │               │     │               │
└───────────────┘     └───────────────┘     └───────────────┘
                              ↑
                              │
                      Network Partition
                      Between Regions
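The scenario in the diagram can be modeled in a few lines. The following is a rough sketch under simple assumptions: each service lives in exactly one region, the only calls are A to B and B to C as drawn, and a partition blocks every call that crosses the affected region pair. The names `Service`, `link_is_up`, and `working_calls` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    region: str


# Topology from the diagram: A calls B, and B calls C.
SERVICES = {
    "A": Service("A", "Region 1"),
    "B": Service("B", "Region 2"),
    "C": Service("C", "Region 1"),
}
CALLS = [("A", "B"), ("B", "C")]


def link_is_up(src, dst, partitioned_regions):
    """A call succeeds unless it crosses a partitioned pair of regions."""
    pair = frozenset((SERVICES[src].region, SERVICES[dst].region))
    return pair not in partitioned_regions


def working_calls(partitioned_regions=frozenset()):
    """List the calls in CALLS that can still succeed."""
    return [(s, d) for s, d in CALLS if link_is_up(s, d, partitioned_regions)]


partition = {frozenset(("Region 1", "Region 2"))}
print("before:", working_calls())           # [('A', 'B'), ('B', 'C')]
print("after: ", working_calls(partition))  # []
```

Services A and C share Region 1, yet the whole chain breaks because every hop passes through Service B in Region 2. Placing a dependency in another region makes the request path only as available as the link between the regions.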