Deep Dive into Raft: Consensus Algorithms in Distributed Systems

Henry Wu
9 min read · Jan 21, 2024


In this post, we take a deep dive into the Raft consensus algorithm, essential for distributed systems. We explore key mechanisms like leader election, log replication, and safety, plus aspects like cluster membership changes, log compaction, and client interactions. Discover how Raft achieves reliable, consistent operations in distributed networks.

Raft Logo

What is a Consensus Algorithm?

A consensus algorithm allows a group of machines to operate cohesively and agree on the system state even in the presence of failures. It typically comes up in the context of replicated state machines. In such systems, a collection of servers compute identical copies of the same state and can continue to operate even if some servers are down.

Replicated State Machines

Figure 1: Replicated State Machine Architecture [1]

Replicated state machines are an approach to building fault-tolerant services in distributed systems. They are typically implemented using a replicated log, as shown in Figure 1. In this setup, each server stores a log containing a series of commands, which its state machine executes in order. Since every log holds the same commands in the same order, each state machine processes the same command sequence.

The role of a consensus algorithm is to keep the replicated log consistent across all servers so that the servers appear to form a single, highly reliable state machine.
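
To make this concrete, here is a minimal Go sketch of a state machine driven by a replicated log. The types and names (Command, StateMachine, Apply) are illustrative, not taken from any particular Raft implementation:

```go
package main

import "fmt"

// Command is one entry in the replicated log. In a key-value
// service it might be a "set key = value" operation.
type Command struct {
	Key, Value string
}

// StateMachine applies commands from the log in order. Determinism
// is what keeps replicas identical: same commands, same order,
// same resulting state.
type StateMachine struct {
	state map[string]string
}

func (sm *StateMachine) Apply(cmd Command) {
	sm.state[cmd.Key] = cmd.Value
}

func main() {
	log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	sm := &StateMachine{state: make(map[string]string)}
	for _, cmd := range log {
		sm.Apply(cmd)
	}
	fmt.Println(sm.state) // map[x:3 y:2]
}
```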

What is Raft?

Raft is a consensus algorithm for keeping the replicated log consistent. The algorithm starts by first electing a leader and then gives the leader complete responsibility for managing the replicated log. This leader accepts log entries from clients, replicates them across other servers, and tells the servers when it is safe to apply these entries to their state machines.

The Raft consensus algorithm is composed of three relatively independent subproblems:

  1. Leader Election: One server is elected as leader for a given term. This leader manages the log and replicates it across other servers. A new leader is elected when the existing leader fails.
  2. Log Replication: The leader receives client requests, appends these commands to its log, and then replicates them across other servers.
  3. Safety: This principle ensures that if any server has applied a particular log entry to its state machine, no other server can apply a different command for the same log index. We will delve into how this property is achieved.

We will start by discussing an overview of Raft and its basic components, then explore these three subproblems in detail.

Raft Basics

Server States

In Raft, each server operates in one of three states at any given time (a minimal Go encoding follows the list):

  • Leader: Handles all client requests, replicates log entries on other servers, and tells servers when it is safe to apply these entries to their state machines.
  • Follower: Passively responds to requests from leaders and candidates without initiating requests.
  • Candidate: Engages in the election process to become a new leader.
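
These roles can be encoded in a few lines of Go; we will reuse this running sketch throughout the post (names are illustrative):

```go
package raft

// State is the role a Raft server plays at a given moment.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

func (s State) String() string {
	return [...]string{"Follower", "Candidate", "Leader"}[s]
}
```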

State Transitions

Figure 2: State Transitions [1]

As depicted in Figure 2, servers transition between states under specific conditions. Followers, which respond only to requests from others, become candidates if they receive no communication within a certain timeframe. A candidate that secures votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.

Term

Figure 3: Term [1]

Raft divides time into terms, numbered with consecutive integers, and each term begins with an election in which one or more candidates try to become the leader. Each server stores the current term number, which is exchanged whenever servers communicate and allows servers to detect obsolete information such as stale leaders.
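
Continuing the running sketch, the term-exchange rule can be captured in a few lines. The server struct and its fields are illustrative; the rule itself (a newer term demotes the receiver, an older term marks the sender as stale) is from the Raft paper [1]:

```go
package raft

// server holds the state used in this running sketch.
type server struct {
	id          int
	currentTerm int
	votedFor    int // candidate voted for in currentTerm, or -1
	state       State
}

// observeTerm is called with the term carried by any incoming RPC or
// reply. A newer term always demotes this server to follower; a term
// older than ours marks the sender (e.g. a stale leader) as obsolete.
func (s *server) observeTerm(rpcTerm int) (senderStale bool) {
	if rpcTerm > s.currentTerm {
		s.currentTerm = rpcTerm
		s.votedFor = -1 // fresh term: no vote cast yet
		s.state = Follower
	}
	return rpcTerm < s.currentTerm
}
```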

Communications

Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires two types of RPCs (sketched as Go structs after the list):

  • RequestVote RPC: Initiated by candidates to gather votes during the election process.
  • AppendEntries RPC: Issued by the leader to replicate log entries across followers; it also serves as a form of heartbeat.
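
As Go structs, the arguments and replies of these two RPCs look roughly as follows; the field names mirror those in the Raft paper [1]:

```go
package raft

// LogEntry is a single slot in the replicated log.
type LogEntry struct {
	Term    int
	Command []byte // opaque state-machine command
}

// RequestVoteArgs is sent by candidates to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

// AppendEntriesArgs is sent by the leader to replicate entries; with
// an empty Entries slice it doubles as a heartbeat.
type AppendEntriesArgs struct {
	Term         int // leader's term
	LeaderID     int
	PrevLogIndex int // index of the entry immediately preceding Entries
	PrevLogTerm  int // term of that entry, used by the consistency check
	Entries      []LogEntry
	LeaderCommit int // leader's commit index
}

type AppendEntriesReply struct {
	Term    int
	Success bool // false if the consistency check failed
}
```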

Heartbeat

The leader regularly sends heartbeat messages to all followers to maintain authority and prevent new elections.
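
A sketch of that loop, continuing the running example. The 50 ms interval is an arbitrary choice for illustration, and broadcastHeartbeat (sending empty AppendEntries RPCs to all followers) is an assumed helper:

```go
package raft

import "time"

// runHeartbeats keeps followers quiet. The interval must be well
// below the election timeout, or healthy followers will start
// needless elections.
func (s *server) runHeartbeats(stop <-chan struct{}) {
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			s.broadcastHeartbeat() // AppendEntries with no entries
		case <-stop:
			return
		}
	}
}
```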

Leader Election

Initiating as Followers

When the servers start up, they begin as followers. A server remains a follower as long as it continues to receive valid RPCs from a leader or a candidate. However, if a follower receives no valid RPCs within a predefined election timeout period, it assumes the absence of an active leader and starts a leader election process.

Transition to Candidate and Start Election

To start an election, the follower first increments its current term number, reflecting the start of a new election term, and transitions to the candidate state. It then votes for itself and sends out RequestVote RPCs to all other servers in the cluster. Importantly, each server can vote for at most one candidate per term, typically on a first-come-first-served basis.

Possible Election Outcomes

During the election, there are three potential outcomes for a candidate (the winning path is sketched in code after the list):

  1. Winning the Election: The candidate may win the election and become the leader by securing a majority of the votes.
  2. Recognizing a New Leader: If the candidate, while waiting for votes, receives information about another server claiming to be the leader, it evaluates this claim. If this new leader’s term is equal to or greater than the candidate’s current term, the candidate accepts the leader’s legitimacy and reverts to being a follower. If not, the election process continues.
  3. Election Deadlock: In cases where multiple candidates emerge simultaneously, votes can be split, preventing any single candidate from achieving a majority. When this happens, the candidates time out, increment their term, and start a new election round. Raft uses randomized election timeout to minimize the chances of split votes caused by multiple candidates.
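
Here is a sketch of the winning path (outcome 1) together with the randomized timeout that mitigates outcome 3, continuing the running example. requestVote and becomeLeader are assumed helpers: the former performs the RequestVote RPC, and the latter is expected to check that the server is still a candidate in this term before taking over. Locking is omitted for brevity:

```go
package raft

import (
	"math/rand"
	"sync/atomic"
	"time"
)

// randomizedTimeout draws an election timeout from a range (the Raft
// paper suggests 150-300 ms) so that split votes rarely repeat.
func randomizedTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

func (s *server) startElection(peers []int) {
	s.currentTerm++   // new term for this election
	s.state = Candidate
	s.votedFor = s.id // vote for ourselves
	term := s.currentTerm

	votes := int32(1) // our own vote
	majority := int32((len(peers)+1)/2 + 1)
	for _, p := range peers {
		go func(peer int) {
			if s.requestVote(peer, term) && atomic.AddInt32(&votes, 1) == majority {
				s.becomeLeader(term) // outcome 1: majority reached
			}
		}(p)
	}
	// Outcome 2 is handled in the AppendEntries handler: a claimed
	// leader with term >= ours demotes us to follower. Outcome 3: if
	// randomizedTimeout() elapses with no winner, start over in a
	// higher term.
}
```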

Understanding the leader election process provides a foundation for exploring how the leader manages log replication, which we will discuss next.

Log Replication

Initiating Log Replication

After a leader is elected, it begins to process client requests. These requests typically contain commands that need to be executed by the replicated state machines.

Log Entry Creation and Replication

Upon receiving a client request, the leader first appends the command to its log as a new entry. It then replicates this entry to its followers through the AppendEntries RPCs. Once the entry has been safely replicated to a majority of the followers, the leader applies the command to its state machine and returns the execution result to the client.
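
In the running sketch, the leader’s path for a single client command looks like this. replicate and apply are assumed helpers, and the server struct is assumed to carry a log []LogEntry field:

```go
package raft

// propose is the leader's handling of one client command.
func (s *server) propose(cmd []byte) []byte {
	// 1. Append the command to the leader's own log.
	s.log = append(s.log, LogEntry{Term: s.currentTerm, Command: cmd})
	index := len(s.log) // 1-based index of the new entry

	// 2. Replicate via AppendEntries until a majority of servers
	//    hold the entry, i.e. until it is committed.
	s.replicate(index)

	// 3. Apply the committed command to the state machine and
	//    return the result to the client.
	return s.apply(index)
}
```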

Log Structure and Commitment Process

Figure 4: Logs [1]

The logs are composed of sequentially indexed entries. Each entry contains the command, its term, and its log index. An entry is considered committed when it is replicated on a majority of the servers. The leader tracks the highest committed log index and includes this information in future AppendEntries RPCs, which allows the followers to eventually learn and apply the committed entries to their local state machine.
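
A sketch of how the leader advances its commit index, assuming it tracks a matchIndex slice (the highest entry known to be replicated on each follower). The current-term condition reflects the rule in the Raft paper that a leader only commits entries from its own term by counting replicas [1]:

```go
package raft

import "sort"

func (s *server) advanceCommitIndex() {
	// The leader's own log counts toward the majority.
	indexes := append([]int{len(s.log)}, s.matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(indexes)))

	// The middle element is the highest index present on a majority.
	candidate := indexes[len(indexes)/2]
	if candidate > s.commitIndex && s.log[candidate-1].Term == s.currentTerm {
		s.commitIndex = candidate
	}
}
```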

Log Inconsistencies and Resolutions

Normally, the logs of the leader and followers stay consistent. However, leader crashes can leave logs inconsistent: a follower may be missing entries or hold extra uncommitted entries that are absent from the leader’s log, as shown in Figure 4.

The leader identifies inconsistencies during a consistency check performed by the AppendEntries RPC. This RPC includes the index and term of the log entry that immediately precedes the new entries. If a follower’s log does not contain a matching entry, the follower rejects the new entries, signaling inconsistency. The leader resolves these inconsistencies by overwriting the conflicting entries in the followers’ logs with its own entries. The safety and correctness of this overwriting process will be explained in the “Safety” section.

Synchronizing Logs

The leader keeps a nextIndex for each follower, which indicates the next log entry to be sent. Initially, this is set to one more than the leader’s last log index, optimistically assuming followers are up-to-date. If inconsistencies arise, the leader decrements nextIndex until the logs between the leader and the follower match. Once a match is achieved, the AppendEntries RPC succeeds: it removes conflicting entries in the follower’s log and appends any new entries from the leader’s log, ensuring consistency.
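
Both sides of that exchange, continuing the sketch. The follower’s truncate-and-append is simplified in that it assumes RPCs are not reordered in flight; nextIndex is assumed to be a per-follower slice on the leader:

```go
package raft

// Follower side: accept only if our log has an entry at PrevLogIndex
// with PrevLogTerm (index 0 means "empty log"; indexing is 1-based).
func (s *server) handleAppendEntries(args AppendEntriesArgs) bool {
	if args.PrevLogIndex > len(s.log) ||
		(args.PrevLogIndex > 0 && s.log[args.PrevLogIndex-1].Term != args.PrevLogTerm) {
		return false // mismatch: the leader will back off and retry
	}
	// Drop any conflicting suffix, then append the leader's entries.
	s.log = append(s.log[:args.PrevLogIndex], args.Entries...)
	return true
}

// Leader side: on rejection, probe one entry earlier next time.
func (s *server) onAppendEntriesRejected(follower int) {
	if s.nextIndex[follower] > 1 {
		s.nextIndex[follower]--
	}
}
```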

In summary, log replication in Raft involves more than just replicating log entries; it also involves maintaining log consistency across the cluster, particularly in cases of leader failures.

Next, we’ll explore a crucial safety restriction in Raft. This restriction is what makes it safe for the leader to overwrite followers’ logs. Without it, the mechanisms described so far would risk a leader inadvertently overwriting committed entries in followers’ logs.

Safety

Ensuring Consistent Command Execution

The primary objective of safety in Raft is to guarantee that each state machine executes the same commands in the same order. No two servers may apply different commands for the same log index.

Challenges in Maintaining Consistency

The mechanisms described so far are not quite sufficient to ensure that each state machine executes the same commands in the same order. For example, a follower might be unavailable and miss several log entries while they are being committed by the leader. If this follower then becomes a new leader, it could potentially overwrite the previously committed entries with new ones, leading to different state machines executing different command sequences.

Election Restriction

To address this, Raft imposes an election restriction. Under this restriction, a candidate can only be elected as a leader for a new term if its log contains all of the entries committed in the previous terms. This ensures that any elected leader has the most up-to-date and complete log.

How the Election Restriction Works

During the leader election, candidates request votes from other servers through the RequestVote RPC. A server refuses to vote for a candidate whose log is less up-to-date than its own. This is determined by comparing the index and term of the last log entry: a log is considered more up-to-date if its last entry has a greater term, or if the terms are equal but the log is longer.
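
As code, the voter’s comparison is only a few lines; this follows the definition above directly, continuing the running sketch:

```go
package raft

// candidateIsUpToDate reports whether a candidate's log is at least
// as up-to-date as ours: a later last term wins, and equal terms
// fall back to log length.
func (s *server) candidateIsUpToDate(lastLogIndex, lastLogTerm int) bool {
	myLastIndex := len(s.log)
	myLastTerm := 0
	if myLastIndex > 0 {
		myLastTerm = s.log[myLastIndex-1].Term
	}
	if lastLogTerm != myLastTerm {
		return lastLogTerm > myLastTerm
	}
	return lastLogIndex >= myLastIndex
}
```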

This election restriction ensures that any elected leader holds all the committed entries from previous terms, preventing the scenario where a new leader overwrites previously committed entries. As mentioned earlier, leaders handle inconsistencies by overwriting conflicting entries in follower logs; the election restriction makes such overwriting safe, as it guarantees that any leader’s log already holds all committed entries.

With an understanding of how Raft elects its leader, replicates logs, ensures log consistency, and applies the same sequence of commands across state machines, we have covered the foundational aspects of the Raft consensus algorithm.

Next, we will delve into some additional aspects of Raft.

Cluster Membership Change

Raft handles configuration changes, such as adding or removing servers, through a two-phase approach. Initially, the cluster enters a transitional configuration called joint consensus. Once the joint consensus has been committed, the system transitions to the new configuration. This approach maintains continuous availability and prevents split-brain scenarios during the changeover.
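
The heart of joint consensus is its vote-counting rule: while both configurations are active, agreement (for elections and for commitment) requires separate majorities from the old and the new server sets. A self-contained sketch of that rule; the types and names are illustrative:

```go
package raft

// Configuration lists the voting members. New is nil outside of
// joint consensus.
type Configuration struct {
	Old []int // server IDs in the old configuration
	New []int // server IDs in the new configuration
}

func hasMajority(voters []int, acks map[int]bool) bool {
	n := 0
	for _, id := range voters {
		if acks[id] {
			n++
		}
	}
	return n > len(voters)/2
}

// Agreed reports whether a set of acknowledgements is sufficient.
// During joint consensus, both configurations must agree separately.
func (c Configuration) Agreed(acks map[int]bool) bool {
	if c.New == nil {
		return hasMajority(c.Old, acks)
	}
	return hasMajority(c.Old, acks) && hasMajority(c.New, acks)
}
```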

Log Compaction

Figure 5: Log Compaction [1]

In practical systems, logs cannot grow indefinitely. To address this, each server independently takes snapshots of its log. A snapshot covers all the committed entries and is stored on stable storage. Raft also includes a small amount of metadata in the snapshot, namely the index and term of the last entry the snapshot replaces, to maintain continuity and consistency within the cluster, especially during server restarts or when followers need to synchronize their logs with the leader.
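
The metadata is small; here is a sketch of what a snapshot carries (the paper also stores the latest cluster configuration alongside it [1]):

```go
package raft

// Snapshot replaces the log prefix up to LastIncludedIndex. Keeping
// that entry's index and term lets the AppendEntries consistency
// check still work for the first entry after the snapshot.
type Snapshot struct {
	LastIncludedIndex int    // index of the last entry the snapshot replaces
	LastIncludedTerm  int    // term of that entry
	Data              []byte // serialized state machine
}
```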

Client Interaction

Locating the Cluster Leader

In Raft, clients interact with the cluster leader for all requests. When a client first starts up, it connects to a randomly chosen server. If this server is not the leader, it rejects the client’s request but provides information about the current leader.
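
A client-side retry loop might look like the following sketch; sendRequest (the network call) and the reply shape are assumptions, and a real client would bound its retries:

```go
package raft

import "math/rand"

// clientReply is what a contacted server returns.
type clientReply struct {
	IsLeader   bool
	LeaderHint string // address of the current leader, if known
	Result     []byte
}

// submit follows rejections and leader hints until the request
// lands on the leader.
func submit(servers []string, cmd []byte) []byte {
	target := servers[rand.Intn(len(servers))]
	for {
		reply, err := sendRequest(target, cmd)
		switch {
		case err != nil:
			target = servers[rand.Intn(len(servers))] // try another server
		case !reply.IsLeader:
			target = reply.LeaderHint // redirect to the reported leader
		default:
			return reply.Result
		}
	}
}
```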

Linearizable Semantics

Raft ensures that both writes and reads are linearizable: every operation appears to execute instantaneously, exactly once, at some point between its invocation and its response. Any read following a write will always reflect that write, providing clients with an up-to-date view of the system’s state. To prevent stale reads, the leader verifies with the cluster that it is still the leader before responding to read requests, since its data may be stale if it has been superseded by a newer leader of which it is unaware.
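
A sketch of that stale-read guard, continuing the running example; confirmLeadership (a heartbeat round trip to a majority) and the kv field are assumptions:

```go
package raft

import "errors"

var ErrNotLeader = errors.New("not the leader")

// linearizableRead refuses to answer unless a majority has just
// confirmed this server's leadership, so a deposed leader cannot
// serve stale data.
func (s *server) linearizableRead(key string) (string, error) {
	if !s.confirmLeadership() {
		return "", ErrNotLeader // the client will retry elsewhere
	}
	return s.kv[key], nil // read from the local state machine
}
```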

Conclusion

In this exploration of the Raft consensus algorithm, we have covered its essential components including leader election, log replication, safety measures, log compaction, and cluster membership changes. Additionally, we examined Raft’s approach to client interactions and its consistency guarantees.

References

[1] Diego Ongaro and John Ousterhout. “In Search of an Understandable Consensus Algorithm (Extended Version).” Stanford University, 2014.
