Raft — Understandable Consensus

Balraja Subbiah
5 min read · Jan 3, 2023


In my previous post I talked about building distributed consensus using Paxos. Continuing from that, let's talk about the Raft protocol in this post.

Raft is described in detail in “the Raft paper”: In Search of an Understandable Consensus Algorithm (Extended Version) by Diego Ongaro and John Ousterhout. A slightly shorter version of this paper received a Best Paper Award at the 2014 USENIX Annual Technical Conference.

Raft, as the name implies, is designed to be easier to understand than Paxos. Let's get into the details.

Any data processing system can be modelled as a state machine. A state machine computes its latest state from the sequence of transactions applied to it, and these applied transactions are stored as a log. In a distributed system, all participating servers maintain a state machine that computes an identical copy of the state. These servers are collectively termed a replicated state machine.
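To make the model concrete, here is a minimal sketch of a key-value state machine driven by a log. The class name and operation format are illustrative assumptions, not something from the Raft paper.

```python
# A minimal sketch of a state machine computed from a log of operations.

class KVStateMachine:
    def __init__(self):
        self.state = {}    # current state
        self.log = []      # ordered list of applied operations

    def apply(self, op):
        """Apply one (key, value) operation and record it in the log."""
        key, value = op
        self.log.append(op)
        self.state[key] = value

# Two replicas that apply the same log in the same order
# end up with identical state.
ops = [("x", 1), ("y", 2), ("x", 3)]
a, b = KVStateMachine(), KVStateMachine()
for op in ops:
    a.apply(op)
    b.apply(op)
assert a.state == b.state == {"x": 3, "y": 2}
```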

Consensus is the fundamental problem in a replicated state machine: the servers must agree on the order of transactions processed by the state machine. Consensus algorithms allow these machines to make progress as long as a majority of the servers are alive, even if some have failed.

Raft implements consensus by first electing a leader, then giving the leader complete control of the log.

The leader:

a. accepts log entries from clients
b. replicates them on other servers and
c. notifies them when it is safe to apply log entries to their state machines.

At any time, each machine participating in Raft is in one of three states:

1. Leader: the machine has been elected leader and accepts transactions from clients
2. Follower: the machine is a replica and accepts transactions from the leader
3. Candidate: the machine is a candidate in a leader election

Raft divides time into terms. At the start of each term there is a leader election, followed by normal transaction processing. Each term has at most one leader; if there is a split vote during leader election, the term ends without a leader. The terms act as a monotonically increasing logical clock that helps detect stale leaders.
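The sketches in the rest of this post build on the following minimal server skeleton. All names here are my own, purely for illustration; observe_term captures the stale-leader rule based on terms.

```python
from enum import Enum

class Role(Enum):
    FOLLOWER = 1    # accepts entries from the leader
    CANDIDATE = 2   # competing in a leader election
    LEADER = 3      # accepts transactions from clients

class RaftServer:
    def __init__(self):
        self.role = Role.FOLLOWER
        self.current_term = 0    # monotonically increasing logical clock
        self.voted_for = None    # who this server voted for in current_term
        self.log = []            # list of {"index", "term", "op"} entries
        self.commit_index = 0    # highest log index known to be committed

    def observe_term(self, term):
        """Called for every incoming request. A higher term means this server
        is stale, so it adopts the new term and falls back to follower; a
        lower term means the sender is stale and should be rejected."""
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.role = Role.FOLLOWER
        return term >= self.current_term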

Leader Election

Leader election happens when a follower fails to receive heartbeats from its leader. At this point the follower turns itself into a candidate, increments the term number and requests votes from the other nodes (a rough sketch of the election loop follows the list below).

a. If it receives a majority of the votes, it becomes the leader for this term.
b. If it receives a request to vote for a higher term than its current term, it turns itself back into a follower.
c. If it does not receive a majority (for example, because another server became a candidate at the same time) and the vote is split, the candidate ends the current term after a timeout and starts a new one. To prevent a pile-up of repeated split votes, Raft uses randomized timeouts on each candidate.
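Here is a rough, self-contained sketch of the candidate side of an election under these rules. To keep it runnable, RequestVote responses are passed in as plain (granted, term) tuples instead of real RPCs, and the return value (how long to back off before the next election) is an assumption of mine.

```python
import random

# A sketch of the candidate side of leader election; every name here is an
# illustrative assumption, reusing Role/RaftServer from the earlier sketch.

def run_election(server, vote_responses, cluster_size):
    server.role = Role.CANDIDATE
    server.current_term += 1            # start a new term
    server.voted_for = server           # vote for itself
    votes = 1

    for granted, peer_term in vote_responses:
        if peer_term > server.current_term:
            # (b) a peer already has a higher term: step back to follower
            server.current_term = peer_term
            server.role = Role.FOLLOWER
            return 0.0
        if granted:
            votes += 1

    if votes > cluster_size // 2:
        # (a) majority of votes: this server is the leader for the term
        server.role = Role.LEADER
        return 0.0

    # (c) split vote: back off for a randomized timeout before the next
    # election so simultaneous candidates are unlikely to collide again
    return random.uniform(0.150, 0.300)    # seconds to wait, e.g. 150-300 ms
```

For example, in a three-node cluster, run_election(server, [(True, 1), (True, 1)], 3) starting from a fresh RaftServer would leave server as the leader of term 1.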

Append to log and merge

Clients send the operations to be applied to the state machine to the leader. On receiving a client request, the leader appends the operation to its log. Each entry in the log is identified by a monotonically increasing index, and each entry also stores the term number of the leader that created it.

After appending to its log, the leader forwards the operation to all followers. Once a majority of them have appended the operation to their logs, the entry is considered committed; the leader then applies the operation to its state machine and sends the response back to the client.
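A sketch of this leader-side flow, building on the skeleton above. The follower's consistency check is elided here and sketched in the next section, and follower.append is an assumed stand-in for the real replication call.

```python
# Leader-side flow: append locally, replicate, and treat the entry as
# committed once a majority of the cluster has stored it.

def handle_client_request(leader, state_machine, op, followers):
    # 1. Append to the leader's own log, tagged with the current term.
    entry = {"index": len(leader.log) + 1, "term": leader.current_term, "op": op}
    leader.log.append(entry)

    # 2. Replicate to the followers and count acknowledgements.
    acks = 1                                   # the leader's own copy
    for follower in followers:
        if follower.append(entry):             # assumed to return True on success
            acks += 1

    # 3. Commit and apply once a majority of the cluster stores the entry.
    cluster_size = len(followers) + 1
    if acks > cluster_size // 2:
        leader.commit_index = entry["index"]
        state_machine.apply(op)                # e.g. the KVStateMachine above
        return "ok"                            # response sent back to the client
    return "not committed yet"
```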

Raft guarantees the following:

“If two entries in different logs have the same index and term, then they store the same command.”

This holds because a leader creates at most one entry with a given index within a given term, and log entries never change their position in the log.

“If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.”

When asking followers to append an operation, the leader includes the index and term of the log entry immediately preceding the new entry. A follower appends the operation only if its own log contains an entry with that index and term; otherwise it rejects the request.
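A sketch of this follower-side check. The parameter names (prev_index, prev_term) are my labels for the fields described above, and indexes are 1-based as in the examples later in the post.

```python
# Follower-side consistency check for an append request.

def append_entries(follower, leader_term, prev_index, prev_term, entries):
    # Reject requests from a stale leader.
    if leader_term < follower.current_term:
        return False

    # Consistency check: the follower must already hold an entry at
    # prev_index with the same term, otherwise the request fails.
    if prev_index > 0:
        if len(follower.log) < prev_index:
            return False                       # follower is missing entries
        if follower.log[prev_index - 1]["term"] != prev_term:
            return False                       # follower has a conflicting entry

    # Safe to append: anything after prev_index is overwritten by the
    # leader's entries (this is how conflicting entries get dropped).
    follower.log[prev_index:] = list(entries)
    return True
```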

Under normal operation these rules keep the logs of the leader and followers consistent. However, a series of leader and follower crashes can leave the logs inconsistent, as shown below.

The inconsistency between leader and follower logs can arise from:

1. Missing entries in the follower’s log
2. Uncommitted entries in the follower’s log from when it was the leader.

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log.

When a leader is elected, it maintains a nextIndex for each follower, initially set to the index just after the last entry in the leader’s log. If an append request to a follower fails, the leader decrements that follower’s nextIndex and retries, sending all entries from the decremented nextIndex through the end of its log (a sketch of this retry loop follows).
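A sketch of that retry loop, reusing the append_entries sketch from above; repair_follower_log and its return value (the new nextIndex) are illustrative assumptions.

```python
# Leader-side repair of one follower's log using the nextIndex mechanism.

def repair_follower_log(leader, follower, next_index):
    """next_index starts at (leader's last log index + 1) when the leader is
    elected. Each rejected append backs it up by one and resends more."""
    while next_index >= 1:
        prev_index = next_index - 1
        prev_term = leader.log[prev_index - 1]["term"] if prev_index > 0 else 0
        suffix = leader.log[next_index - 1:]       # entries from next_index onward
        if append_entries(follower, leader.current_term,
                          prev_index, prev_term, suffix):
            return len(leader.log) + 1             # logs match; nextIndex is past the end
        next_index -= 1                            # mismatch: decrement and retry
    # Only reachable if the follower rejects even the full log, i.e. it has
    # seen a higher term; a real leader would step down in that case.
    return next_index
```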

An example depicting inconsistencies between the leader’s and the followers’ logs.

Missing entries

In the case of follower A, the leader sends the following append requests until the two logs are back in sync.

t1 — New leader comes to power and sets the nextIndex for all followers as 11

t2 — appendLog(prevIndex = 10, appendedOperations=[<11-op>])

Since follower A has no entry matching prevIndex = 10, the request fails and the leader decrements nextIndex[A] to 10.

t3 — appendLog(prevIndex = 9, appendedOperations=[<10-op>, <11-op>])
t4 — appendLog(prevIndex = 8, appendedOperations=[<9-op>, <10-op>, <11-op>])
t5 — appendLog(prevIndex = 7, appendedOperations=[<8-op>, <9-op>, <10-op>, <11-op>])

The requests at t3 and t4 fail the same consistency check, decrementing nextIndex[A] each time. The request at t5 finally matches an entry in follower A’s log, the follower appends entries 8 through 11, and both logs are in sync.
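Running follower A's case through the earlier sketches gives the same result. The entry terms used here (old entries in term 1, the new leader in term 3) are assumptions, since the example does not specify them.

```python
# Follower A from the example: its log stops at index 7, while the leader's
# log has 11 entries.

leader = RaftServer()
leader.role = Role.LEADER
leader.current_term = 3
leader.log = [{"index": i, "term": 1, "op": f"op{i}"} for i in range(1, 11)]
leader.log.append({"index": 11, "term": 3, "op": "op11"})

follower_a = RaftServer()
follower_a.current_term = 3
follower_a.log = [{"index": i, "term": 1, "op": f"op{i}"} for i in range(1, 8)]

# nextIndex[A] starts at 11; the repair loop backs it down to 8 (prevIndex 7)
# before the append succeeds.
next_index = repair_follower_log(leader, follower_a, next_index=11)
assert follower_a.log == leader.log      # entries 8..11 have been filled in
assert next_index == 12
```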

Uncommitted entries

In the case of follower B, the leader sends the following append request with the latest entry, and the follower overwrites its log with the entry from the leader, dropping its uncommitted entries.

t1 — New leader comes to power and sets the nextIndex for all followers as 11
t2 — appendLog(prevIndex = 10, appendedOperations=[<11-op>])

The follower drops entries 11 and 12 from its log and appends the latest operation from the leader at index 11.
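The same scenario traced through the append_entries sketch, again with assumed terms (follower B's stale entries in term 2, the new leader in term 3):

```python
# Follower B: entries 11 and 12 were written while B was the leader (assumed
# term 2) and never committed; the new leader (term 3) overwrites them.

follower_b_log = [{"index": i, "term": 1, "op": f"op{i}"} for i in range(1, 11)]
follower_b_log += [
    {"index": 11, "term": 2, "op": "stale-op11"},   # uncommitted
    {"index": 12, "term": 2, "op": "stale-op12"},   # uncommitted
]

# appendLog(prevIndex = 10, appendedOperations=[<11-op>]) from the new leader:
new_entry = {"index": 11, "term": 3, "op": "op11"}
prev_index = 10
# the entry at index 10 matches, so everything after it is replaced
follower_b_log[prev_index:] = [new_entry]

assert len(follower_b_log) == 11
assert follower_b_log[10] == new_entry    # stale entries 11 and 12 are gone
```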

Raft also supports seamless configuration changes (changing the set of servers participating in consensus) as well as log compaction via snapshotting. But that’s for another day.
