Sitemap
Tech Wrench

Tech tales, anecdotes and musings…

Consistency & Consensus for System Design Interview (8): leader election

--

PREV | HOME | NEXT | System Design Resource List

Don’t forget to get your copy of Designing Data Intensive Applications, the single most important book to read for system design interview prep!

Check out ByteByteGo’s popular System Design Interview Course

Introduction

We have previously discussed single leader replication, where all writes are directed to a single leader and then applied in the same order to all the followers. Note that the writes applied to the leader must be delivered exactly once in exactly the same order to all the followers, much like a total order broadcast! So we can say that single leader replication is equivalent to a total order broadcast algorithm. There are two ways to elect a leader in case of single leader replication:

Grokking Modern System Design for Software Engineers and Managers
  1. Human Intervention: A node within a cluster can be configured as the leader. If the leader fails, a human operator is required to elect a new one. This form of single leader replication satisfies all the properties of a consensus algorithm except termination as a failure requires a human to come and designate a new leader.
  2. Auto Leader Election: Some systems allow for a new leader from among the followers to be elected in case the current one fails. This form of single leader replication can be thought of as fault-tolerant total order broadcast. The system has to be cautious about ending up in a split-brain situation, where two or more nodes consider themselves to be the rightful leader causing permanent data loss. This implies that for auto-election of the leader, a consensus must exist among the nodes as to who is the next leader after a failover. This may sound circular now, since we are saying we need consensus to elect a leader, the consensus algorithms (Paxos, Raft, ZAB, etc) are likened to total order broadcast, while total order broadcast algorithms are likened to single leader replication and single leader replication requires a leader to be elected which in turn requires consensus. Next we’ll see how a leader is elected by consensus algorithms.

Land a higher salary with Grokking Comp Negotiation in Tech.

Electing leaders

All consensus algorithms elect a leader in one form or another. Generally, a new election is held when the current leader is thought to be dead. The new election is assigned a monotonically increasing number. This number is referred to as the epoch number or has been given different names by different algorithms, e.g. Raft calls it term number, Paxos calls it ballot number and Viewstamped Replication calls it view number. The epoch numbers across elections are totally ordered. If two nodes both consider themselves to be the leaders then the tie is broken in favor of the node with the higher epoch number.

Ace the machine learning engineer interview with Grokking the Machine Learning Interview.

A node can’t declare itself as the leader but has to receive votes from a majority or quorum of other nodes in the system. Next if the leader wants to propose a change it typically sends the proposal to a majority/quorum of nodes for approval. A node receiving a proposal from the purported leader gives an approval if the node isn’t aware of any other leader with a higher epoch number. Thus there are two voting rounds, first to elect the leader and the second to approve the leader’s proposed change. The quorums voting in the two rounds must overlap, i.e. there should be at least one node that participates in both the voting rounds. This guarantees that when approving a proposed change, at least one node that also participated in the latest leader election round was also present. The leader on receiving the approval for the proposed change is thus assured that no leader election with a higher epoch number has taken place and it is still the rightful leader.

Check out the course Coderust: Hacking the Coding Interview for Facebook and Google coding interviews.

Two-phase commit may sound similar to the above described algorithm but there are some important differences. For example, in two-phase commit the coordinator must receive a yes from all the participants whereas in consensus algorithms approval is required from only a majority. Furthermore, the coordinator/leader in two-phase commit is selected and never elected, while the leader is always elected in consensus algorithms. Finally, consensus algorithms bake in a recovery process that allows for participating nodes to get into a consistent state after a new leader is elected.

Limitations

Consensus algorithms also come with their limitations:

Get a leg up on your competition with the Grokking the Advanced System Design Interview course and land that dream job!

  1. For one, a consensus algorithm can operate only with a strict majority of nodes being functional. For instance, to be able to tolerate one failure, the system must have three nodes so that if one fails, the other two remain operational and form a majority. Similarly, to tolerate two failures, there should be at least five nodes. Moreover, in case of a network partition, the majority is able to make progress, while the unreachable nodes don’t.
  2. Another limitation of consensus algorithms is that the number of nodes in the cluster is assumed to be fixed by most consensus algorithms and can’t be changed dynamically.
  3. Timeouts are used to detect failed nodes in consensus algorithms but that may pose a problem for environments with high variability in network delays, e.g. cross-continental distributed systems. In such environments it is possible that transient network issues cause participant nodes to believe the leader to be dead when in fact the leader is just fine and unavailable only for the time being due to network failures. Such situations can cause frequent leader election, whereby the system spends more time electing the leader than doing useful work. Research has also shown that under specific network failure conditions, consensus algorithms can experience edge cases and make no progress. For instance, in case of Raft, if one of the network links is assumed to be consistently faulty while the rest of the network operates correctly, Raft can see an edge case where the leadership bounces back and forth between two nodes and no progress is made by the algorithm.

If you are interviewing, consider buying our number#1 course for Java Multithreading Interviews.

Grokking the Coding Interview: Patterns for Coding Questions

Your Comprehensive Interview Kit for Big Tech Jobs

0. Grokking the Machine Learning Interview
This course helps you build that skill, and goes over some of the most popularly asked interview problems at big tech companies.

1. Grokking the System Design Interview
Learn how to prepare for system design interviews and practice common system design interview questions.

2. Grokking Dynamic Programming Patterns for Coding Interviews
Faster preparation for coding interviews.

3. Grokking the Advanced System Design Interview
Learn system design through architectural review of real systems.

4. Grokking the Coding Interview: Patterns for Coding Questions
Faster preparation for coding interviews.

5. Grokking the Object Oriented Design Interview
Learn how to prepare for object oriented design interviews and practice common object oriented design interview questions

6. Machine Learning System Design

7. System Design Course Bundle

8. Coding Interviews Bundle

9. Tech Design Bundle

10. All Courses Bundle

--

--

SystemDesign
SystemDesign

Written by SystemDesign

The ultimate Poor man’s system design interview prep guide !

No responses yet