2. Discovery

  • The new leader synchronizes its state with followers
  • Followers acknowledge the leader and confirm synchronization

3. Broadcast

  • The leader receives client requests and creates proposals
  • Each proposal is assigned a monotonically increasing identifier (zxid)
  • Proposals are broadcast to followers
  • When a majority of followers acknowledge a proposal, the leader commits it
  • The commit decision is broadcast to followers

ZAB Implementation Example

Here’s a simplified implementation of ZAB’s core components:

public class ZabServer {
    enum ServerState {
        LOOKING, FOLLOWING, LEADING
    }
    
    private ServerState state;
    private long currentEpoch;
    private long lastZxid;
    private List<Transaction> history;
    private Map<Long, Transaction> pendingTransactions;
    private Set<ZabServer> quorum;
    
    // Leader election
    public void startLeaderElection() {
        state = ServerState.LOOKING;
        
        // Send out vote with last zxid
        Vote myVote = new Vote(getServerId(), currentEpoch, lastZxid);
        broadcastVote(myVote);
        
        // Collect votes and determine winner
        // (Simplified - actual implementation uses multiple rounds)
        Map<Integer, Vote> receivedVotes = collectVotes();
        Vote electedLeader = selectLeader(receivedVotes);
        
        if (electedLeader.getServerId() == getServerId()) {
            becomeLeader();
        } else {
            becomeFollower(electedLeader.getServerId());
        }
    }
    
    // Leader functionality
    private void becomeLeader() {
        state = ServerState.LEADING;
        currentEpoch++;
        
        // Discovery phase
        List<FollowerInfo> followers = collectFollowerInfo();
        long newEpochZxid = createNewEpochZxid();
        
        // Find highest zxid among followers
        long highestZxid = lastZxid;
        for (FollowerInfo info : followers) {
            if (info.getLastZxid() > highestZxid) {
                highestZxid = info.getLastZxid();
            }
        }
        
        // Synchronize followers
        for (FollowerInfo follower : followers) {
            synchronizeFollower(follower, highestZxid);
        }
        
        // Start processing client requests
        startProcessingRequests();
    }
    
    // Process a client request as leader
    public void processRequest(Request request) {
        // Create transaction
        Transaction txn = new Transaction(
            createZxid(),
            request.getOperation(),
            request.getData()
        );
        
        // Store in pending transactions
        pendingTransactions.put(txn.getZxid(), txn);
        
        // Broadcast to followers
        List<ZabServer> acks = broadcastTransaction(txn);
        
        // If majority acknowledge, commit
        if (acks.size() > quorum.size() / 2) {
            commit(txn);
            broadcastCommit(txn);
        }
    }
    
    // Follower functionality
    private void becomeFollower(int leaderId) {
        state = ServerState.FOLLOWING;
        
        // Connect to leader
        connectToLeader(leaderId);
        
        // Send follower info
        sendFollowerInfo(lastZxid);
        
        // Process messages from leader
        while (state == ServerState.FOLLOWING) {
            Message message = receiveFromLeader();
            
            if (message instanceof Transaction) {
                Transaction txn = (Transaction) message;
                
                // Validate transaction
                if (isValidTransaction(txn)) {
                    // Log transaction
                    logTransaction(txn);
                    
                    // Acknowledge
                    sendAck(txn.getZxid());
                }
            } else if (message instanceof Commit) {
                Commit commit = (Commit) message;
                
                // Apply committed transaction
                applyTransaction(commit.getZxid());
            }
        }
    }
    
    // Create a new zxid
    private long createZxid() {
        // zxid is a 64-bit number:
        // - high 32 bits: epoch
        // - low 32 bits: counter
        return (currentEpoch << 32) | (++lastZxid & 0xFFFFFFFFL);
    }
    
    // Apply a transaction to the state machine
    private void applyTransaction(long zxid) {
        Transaction txn = pendingTransactions.get(zxid);
        if (txn != null) {
            // Apply to state machine
            applyToStateMachine(txn);
            
            // Add to history
            history.add(txn);
            
            // Remove from pending
            pendingTransactions.remove(zxid);
        }
    }
}

When to Use ZAB

ZAB is appropriate when:

  • You’re using ZooKeeper as your coordination service
  • You need a protocol optimized for primary-backup replication
  • You require strong ordering guarantees for operations

Challenges with ZAB

  • Specifically designed for ZooKeeper, less general-purpose
  • Less documentation and fewer implementations compared to Paxos and Raft
  • Leader-based design can be a bottleneck

Byzantine Fault Tolerance (BFT) Algorithms

The algorithms discussed so far assume crash-failure: nodes either work correctly or stop working entirely. Byzantine fault tolerance addresses a more challenging scenario where nodes can behave arbitrarily, including sending conflicting information to different parts of the system.

Practical Byzantine Fault Tolerance (PBFT)

PBFT, introduced by Castro and Liskov in 1999, was the first practical Byzantine consensus algorithm. It can tolerate up to f Byzantine failures with 3f+1 total nodes.

How PBFT Works

PBFT operates in three phases:

  1. Pre-prepare: The leader assigns a sequence number to a request and multicasts it to all replicas
  2. Prepare: Replicas verify the request and broadcast prepare messages
  3. Commit: Once a replica receives 2f prepare messages, it broadcasts a commit message

A request is executed once a replica receives 2f+1 commit messages.

When to Use BFT Algorithms

BFT algorithms are necessary when:

  • You cannot trust all nodes in your system
  • Nodes might be compromised or behave maliciously
  • You’re building systems like blockchains or critical infrastructure

Challenges with BFT

  • Higher message complexity (O(n²) where n is the number of nodes)
  • Requires more nodes to tolerate the same number of failures
  • Significantly more complex to implement correctly