2. Discovery
- The new leader synchronizes its state with followers
- Followers acknowledge the leader and confirm synchronization
3. Broadcast
- The leader receives client requests and creates proposals
- Each proposal is assigned a monotonically increasing identifier (zxid)
- Proposals are broadcast to followers
- When a majority of followers acknowledge a proposal, the leader commits it
- The commit decision is broadcast to followers
ZAB Implementation Example
Here’s a simplified implementation of ZAB’s core components (helper types such as Transaction and Vote, and the networking methods, are elided):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ZabServer {

    enum ServerState {
        LOOKING, FOLLOWING, LEADING
    }

    private ServerState state;
    private long currentEpoch;
    private long lastZxid;
    private final List<Transaction> history = new ArrayList<>();
    private final Map<Long, Transaction> pendingTransactions = new HashMap<>();
    private final Set<ZabServer> quorum = new HashSet<>(); // the full ensemble, including this server

    // Leader election
    public void startLeaderElection() {
        state = ServerState.LOOKING;

        // Send out a vote carrying our last zxid
        Vote myVote = new Vote(getServerId(), currentEpoch, lastZxid);
        broadcastVote(myVote);

        // Collect votes and determine the winner
        // (Simplified - the actual implementation uses multiple rounds)
        Map<Integer, Vote> receivedVotes = collectVotes();
        Vote electedLeader = selectLeader(receivedVotes);

        if (electedLeader.getServerId() == getServerId()) {
            becomeLeader();
        } else {
            becomeFollower(electedLeader.getServerId());
        }
    }

    // Leader functionality
    private void becomeLeader() {
        state = ServerState.LEADING;
        currentEpoch++;

        // Discovery phase: learn each follower's state
        List<FollowerInfo> followers = collectFollowerInfo();
        long newEpochZxid = createNewEpochZxid(); // new epoch in high bits, counter reset to 0

        // Find the highest zxid among followers
        long highestZxid = lastZxid;
        for (FollowerInfo info : followers) {
            if (info.getLastZxid() > highestZxid) {
                highestZxid = info.getLastZxid();
            }
        }

        // Synchronization phase: bring every follower up to the highest zxid
        for (FollowerInfo follower : followers) {
            synchronizeFollower(follower, highestZxid);
        }

        // Begin the new epoch's zxid sequence, then start the broadcast phase
        lastZxid = newEpochZxid;
        startProcessingRequests();
    }

    // Process a client request as leader
    public void processRequest(Request request) {
        // Create a transaction with the next zxid
        Transaction txn = new Transaction(
            createZxid(),
            request.getOperation(),
            request.getData()
        );

        // Hold the transaction until it commits
        pendingTransactions.put(txn.getZxid(), txn);

        // Broadcast to followers (simplified: blocks until acks arrive)
        List<ZabServer> acks = broadcastTransaction(txn);

        // Commit once a majority acknowledges; the +1 counts the leader's own implicit ack
        if (acks.size() + 1 > quorum.size() / 2) {
            commit(txn);
            broadcastCommit(txn);
        }
    }

    // Follower functionality
    private void becomeFollower(int leaderId) {
        state = ServerState.FOLLOWING;

        // Connect to the leader and report our last zxid
        connectToLeader(leaderId);
        sendFollowerInfo(lastZxid);

        // Process messages from the leader
        while (state == ServerState.FOLLOWING) {
            Message message = receiveFromLeader();

            if (message instanceof Transaction) {
                Transaction txn = (Transaction) message;
                if (isValidTransaction(txn)) {
                    // Log the proposal, hold it until the commit arrives, and acknowledge
                    logTransaction(txn);
                    pendingTransactions.put(txn.getZxid(), txn);
                    sendAck(txn.getZxid());
                }
            } else if (message instanceof Commit) {
                // Apply the committed transaction
                Commit commit = (Commit) message;
                applyTransaction(commit.getZxid());
            }
        }
    }

    // Create a new zxid
    private long createZxid() {
        // A zxid is a 64-bit number:
        // - high 32 bits: epoch
        // - low 32 bits: counter
        long counter = (lastZxid & 0xFFFFFFFFL) + 1;
        lastZxid = (currentEpoch << 32) | counter;
        return lastZxid;
    }

    // Apply a transaction to the state machine
    private void applyTransaction(long zxid) {
        Transaction txn = pendingTransactions.get(zxid);
        if (txn != null) {
            applyToStateMachine(txn); // apply to the replicated state machine
            history.add(txn);         // record in committed history
            pendingTransactions.remove(zxid);
        }
    }
}
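As a rough sketch of how this class might be driven (the constructor and cluster wiring are elided above), each server simply starts an election at boot and ends up either leading or following:

    ZabServer server = new ZabServer();
    server.startLeaderElection(); // leaves the server in the LEADING or FOLLOWING state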
When to Use ZAB
ZAB is appropriate when:
- You’re using ZooKeeper as your coordination service
- You need a protocol optimized for primary-backup replication
- You require strong ordering guarantees for operations
Challenges with ZAB
- Designed specifically for ZooKeeper, so it is less general-purpose
- Less documentation and fewer implementations compared to Paxos and Raft
- Leader-based design can be a bottleneck
Byzantine Fault Tolerance (BFT) Algorithms
The algorithms discussed so far assume the crash-failure model: nodes either work correctly or stop working entirely. Byzantine fault tolerance addresses a more challenging scenario in which nodes can behave arbitrarily, including sending conflicting information to different parts of the system.
Practical Byzantine Fault Tolerance (PBFT)
PBFT, introduced by Castro and Liskov in 1999, was the first practical Byzantine consensus algorithm. It can tolerate up to f Byzantine failures with 3f+1 total nodes.
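For example, tolerating a single Byzantine node (f = 1) requires 3(1) + 1 = 4 replicas, and tolerating two (f = 2) requires 7.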
How PBFT Works
PBFT operates in three phases:
- Pre-prepare: The primary (leader) assigns a sequence number to a client request and multicasts a pre-prepare message to all replicas
- Prepare: Replicas validate the pre-prepare and broadcast prepare messages to one another
- Commit: Once a replica has collected 2f prepare messages matching the pre-prepare, it broadcasts a commit message
A replica executes the request once it has received 2f+1 matching commit messages.
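To make these quorum thresholds concrete, here is a minimal sketch of the counting logic a single replica might run for one request. This is not the actual PBFT wire protocol; the class, its handler methods, and the elided broadcast/execute calls are hypothetical simplifications:

import java.util.HashSet;
import java.util.Set;

// One replica's phase tracking for a single request. Assumes all
// messages are authenticated and refer to the same view, sequence
// number, and request digest.
public class PbftReplicaSketch {
    private final int f; // maximum number of Byzantine faults tolerated
    private final Set<Integer> prepareSenders = new HashSet<>();
    private final Set<Integer> commitSenders = new HashSet<>();
    private boolean prePrepared;
    private boolean prepared;
    private boolean executed;

    public PbftReplicaSketch(int f) {
        this.f = f;
    }

    // Pre-prepare received from the primary
    public void onPrePrepare() {
        prePrepared = true;
    }

    // Prepare received from another replica
    public void onPrepare(int senderId) {
        prepareSenders.add(senderId);
        // Prepared: the pre-prepare plus 2f matching prepares
        if (prePrepared && prepareSenders.size() >= 2 * f && !prepared) {
            prepared = true;
            broadcastCommit(); // network send, elided
        }
    }

    // Commit received from any replica (including this one)
    public void onCommit(int senderId) {
        commitSenders.add(senderId);
        // Committed: 2f + 1 matching commits
        if (prepared && commitSenders.size() >= 2 * f + 1 && !executed) {
            executed = true;
            executeRequest(); // apply to the state machine, elided
        }
    }

    private void broadcastCommit() { /* elided */ }
    private void executeRequest()  { /* elided */ }
}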
When to Use BFT Algorithms
BFT algorithms are necessary when:
- You cannot trust all nodes in your system
- Nodes might be compromised or behave maliciously
- You’re building systems like blockchains or critical infrastructure
Challenges with BFT
- Higher message complexity (O(n²) where n is the number of nodes)
- Requires more nodes to tolerate the same number of failures
- Significantly more complex to implement correctly
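The quadratic message complexity comes from the prepare and commit phases, in which every replica broadcasts to every other replica for each request.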