Introduction | Andrew Odendaal

Understanding Distributed Systems Failures

Failure Modes and Models

Recognizing what can go wrong:

Common Failure Modes:

Hardware failures
Network partitions
Service dependency failures
Resource exhaustion
Data corruption
Clock skew
Configuration errors
Deployment failures
Cascading failures
Thundering herd problems

The Fallacies of Distributed Computing:

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn’t change
There is one administrator
Transport cost is zero
The network is homogeneous
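
One practical consequence of the first two fallacies is that every remote call needs an explicit timeout and a bounded retry policy. The following is a minimal sketch of that idea, not a definitive implementation. The function `call_with_retries`, its parameters, and the convention that the wrapped operation accepts a timeout in seconds are all assumptions made for illustration. The random jitter on the delay is there so that many clients do not retry in lockstep, which would otherwise feed the thundering herd problem listed earlier.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[float], T],
    *,
    attempts: int = 3,
    timeout_s: float = 1.0,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call operation(timeout_s), retrying on timeouts and connection errors.

    Hypothetical helper: `operation` is assumed to honour the timeout it is
    given and to raise TimeoutError or ConnectionError on network trouble.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return operation(timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff with full jitter: wait a random time
                # in [0, base_delay_s * 2**attempt) before the next attempt.
                sleep(rng() * base_delay_s * 2 ** attempt)
    assert last_error is not None
    raise last_error
```

The `sleep` and `rng` parameters exist only so the behaviour can be exercised in tests without real waiting or randomness. In production code the defaults would normally be left alone.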

Failure Models:

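Fail-stop: Components fail by halting permanently, and other components can detect the halt
Crash-recovery: Components crash and may later restart, typically losing in-memory state
Omission: Components fail to send or receive some messages, so requests go unanswered
Byzantine: Components behave arbitrarily or maliciously, including sending incorrect or conflicting responses
Timing: Components respond too early or too late, outside the expected time bounds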
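
To make the differences between these models concrete, here is a toy simulation of what a caller might observe under each one. It is a sketch under stated assumptions: the `respond` function, the `Reply` type, the rule that the correct reply to `x` is `x + 1`, and the tick numbers at which faults start and end are all invented for illustration and do not come from any real system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureModel(Enum):
    FAIL_STOP = "fail-stop"
    CRASH_RECOVERY = "crash-recovery"
    OMISSION = "omission"
    BYZANTINE = "byzantine"
    TIMING = "timing"


@dataclass
class Reply:
    value: int       # in this toy model the correct reply to x is x + 1
    latency_ms: int


def respond(model: Optional[FailureModel], request: int, tick: int) -> Optional[Reply]:
    """Return the reply a toy component gives at logical time `tick`.

    None means the caller gets no reply at all. The fault is injected from
    tick 3 onwards; before that (or with model=None) the component is healthy.
    """
    healthy = Reply(value=request + 1, latency_ms=10)
    if model is None or tick < 3:
        return healthy
    if model is FailureModel.FAIL_STOP:
        return None                            # halted and never comes back
    if model is FailureModel.CRASH_RECOVERY:
        return None if tick < 6 else healthy   # down for ticks 3-5, then restarted
    if model is FailureModel.OMISSION:
        return None if tick % 2 else healthy   # silently drops replies on odd ticks
    if model is FailureModel.BYZANTINE:
        return Reply(value=-request, latency_ms=10)   # on time, but wrong
    return Reply(value=request + 1, latency_ms=5000)  # TIMING: correct, but far too late
```

Note that from the caller's side fail-stop, crash-recovery, and omission all look identical on a single missed reply. Only repeated observation over time separates them, which is one reason failure detection in practice relies on timeouts and heartbeats rather than a single probe.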

Example Network Partition Scenario:

Before Partition:
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│               │     │               │     │               │
│  Service A    │────▶│  Service B    │────▶│  Service C    │
│  (Region 1)   │     │  (Region 2)   │     │  (Region 1)   │
│               │     │               │     │               │
└───────────────┘     └───────────────┘     └───────────────┘

After Partition:
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│               │     │               │     │               │
│  Service A    │  ╳  │  Service B    │  ╳  │  Service C    │
│  (Region 1)   │     │  (Region 2)   │     │  (Region 1)   │
│               │     │               │     │               │
└───────────────┘     └───────────────┘     └───────────────┘
                           ↑
                           │
                      Network Partition
                      Between Regions
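
The scenario in the diagram can be sketched in a few lines of code. The service names, regions, and call chain below are taken from the diagram. The helper names `can_reach` and `broken_calls`, and the representation of a partition as a set of region pairs, are assumptions made for this illustration.

```python
from typing import FrozenSet, List, Set, Tuple

# Placement and call chain as drawn in the diagram.
REGION = {
    "Service A": "Region 1",
    "Service B": "Region 2",
    "Service C": "Region 1",
}
CALLS: List[Tuple[str, str]] = [
    ("Service A", "Service B"),
    ("Service B", "Service C"),
]


def can_reach(src: str, dst: str, partitioned: Set[FrozenSet[str]]) -> bool:
    """True if src can talk to dst given the set of partitioned region pairs."""
    a, b = REGION[src], REGION[dst]
    # Traffic inside one region is unaffected by a cross-region partition.
    return a == b or frozenset((a, b)) not in partitioned


def broken_calls(partitioned: Set[FrozenSet[str]]) -> List[Tuple[str, str]]:
    """Return the calls in the chain that fail under the given partition."""
    return [(s, d) for s, d in CALLS if not can_reach(s, d, partitioned)]


if __name__ == "__main__":
    partition = {frozenset(("Region 1", "Region 2"))}
    print(broken_calls(set()))       # no partition: every call succeeds
    print(broken_calls(partition))   # both hops cross regions, so both fail
```

Service A and Service C sit in the same region and could still reach each other directly, yet the request path fails because it runs through Service B in the other region. A single cross-region partition can therefore break a chain whose endpoints are colocated.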

Continue Your Learning

This is part 1 of 5 in the comprehensive guide.
