Raft consensus algorithm and leader election in MongoDB vs CoachroachDB

Geek Culture
Published in
12 min readAug 17, 2021

We live in a distributed world, whether we like it or not. Distributed systems are everywhere, from Kafka and Cassandra to your Mongo DB cluster. Because of this, as a developer, you don’t have to worry about your cluster failing due to replication issues or clock misalignment issues. Instead, you can consider all of the state machines (replicated state machines) as if they’re one entity.

Consensus Module architecture

For fault-tolerant, strongly consistent distributed systems, the distributed consensus is an essential primitive. A replicated log is often used to create replicating state machines. There is a log on each server, which contains a series of commands that are executed by the state machine in order. So, each state machine executes the same set of commands. The consensus algorithm’s job is to maintain consistency in the replicated log. A server’s consensus module takes commands from clients and logs them. To ensure that every log finally contains the same requests in the same sequence, even if some servers fail, it connects with consensus modules on other servers.

In the context of MongoDB, this may be seen as you providing command and the consensus algorithm handling replication throughout the replica set.

For those of us who have tried to comprehend Paxos, it has left us scratching our heads and frantically searching for a consensus algorithm that is both efficient and understandable. Or maybe trying to understand how Mongo DB works internally, you’ve come to the right place and are in for a ride. Let’s start.

Let’s build a RAFT

Before creating the algorithm, let’s set up a few limitations to simplify our design.
1. The systems are non-byzantine: AKA, we can trust all machines in our cluster.
2. broadcastTime << electionTimeout << MTBF: Before it can begin catering to requests, the algorithm is heavily reliant on an elected leader. Because of this, the time needed to communicate with the other computers should be shorter than the election timeout (will be discussed later). Because the average duration between failures ranges from a few hours to several months, we may not have to worry about it too much in real-world scenarios.


Raft, unlike Paxos, establishes consensus by first electing a distinguished leader and then delegating all responsibility for administering the replicated log to the leader. The leader collects log entries from clients, duplicates them on other servers, and notifies servers when log entries can be safely applied to their state machines. Any machine can be in any of the three states at any given time:
1. Leader: The leader is the bad boy that caters to all the requests.
2. Candidate: When a leader dies any follower can stand for an election and this is called the candidate.
3. Follower: A follower simply takes commands from the leaders and if anyone from the outside tries to communicate with it, the follow redirects the request to the last leader it knows of.

Leader Election

A leader is chosen through a procedure known as election (as if that wasn’t obvious enough). Every server has a random timeout, which resets whenever the follower receives a heartbeat from the leader. If the follower does not hear from the leader for an extended period of time, the follower becomes a candidate and begins soliciting votes to become a leader via the “RequestVote RPC”.

Raft keeps track of the passage of time using a counter called “term” When a new leader is elected, the term count is increased and the current term is sent with every pulse and Append Entry RPC. Any follower, candidate, or leader with terms less than those obtained in the heartbeat admits that it is running in an outdated state and immediately resolves to a follower state.

In the current term, all followers can vote for only one leader. Once a follower has voted for a specific term, they can only vote for candidates standing for the next term or above. Given below is the parameters that is sent when a candidate asks for a vote and under what condition the vote is granted to the candidate.

Let’s see an election in action.

Step 1: The cluster starts.
Because no leader has been nominated when the cluster begins, it will be unable to service any requests. All of the servers’ timers begin to tick. You can notice the initial term for all the servers has been initialized to 1. Hence term 1 has no leaders.

Step 2: The first election

When a follower reaches the end of its term, it promotes itself to a candidate for the next term’s election. Term 2 in this example. The candidate then begins soliciting votes and keeps track of the number of votes received. When the followers vote, they extend their term and recognise that there is a new leader.

Step 3: Heartbeat and retaining control

Once a leader is elected, the leader sends out heartbeats regularly to retain its influence. The heartbeat prevents the followers from timing out, and the clocks are reset whenever a follower receives a heartbeat.

Log Insertion

Now that we have a leader. The cluster can resume normal operations and accept commands from the client. When the leader receives a command from a client, it first adds the entry to its log and then sends a “AppendEntry RPC” to all the other servers. The AppendEntry RPC’s arguments are listed below. Let’s dig a little deeper into it.

Step 1: Receiving a request from the client

As we can see, S5 is the current term’s leader (2). When the leader receives a request, it first adds it to its log entry before sending the AppendEntry RPC to other servers with the above-mentioned arguments. The leader informs the followers of the term he is on, the index he wants the followers to add the entry to, and the format of his previous log entry. It’s worth noting that a heartbeat is an AppendEntry RPC with no data to append.

Step 2: The followers acknowledge the RPC

When the followers receive an AppendEntry RPC, they first check to see if the leader is legitimate. If the follower’s current term is greater than the leader’s term, the follower rejects the AppendEntry RPC, informing the leader that he is stale, and the leader instantly enters the follower state. However, if the words match the follower, verify its log entries; if the follower’s previous log entry does not match the leader’s, the follower rejects the RPC and awaits correction (more on this later). If all goes well, the follower adds the entry to its log and waits for the leader to issue a commit command. See the dotted lines? That means the entry is still not committed.

Step 3: Commit

When the majority of the followers acknowledge the leader, the leader knows it’s safe to commit the log. The leader then requests that everyone commit the log. For followers who respond slowly, or the leader keeps retrying until an acknowledgement is received. It is important to remember that a log is duplicated to the followers but can only be considered persistent once the leader asks everyone to commit it, which it will not do unless it hears from the majority of the followers. In other words, an uncommitted log is likely to be erased by other leaders. Raft, on the other hand, assures that any log that has been committed cannot be deleted under any circumstances.

Log Correction

Many times, a follower or leader may have been offline and was unable to keep up with the most recent update. Let’s examine how Raft deals with a failed leader and how the failed leader is brought up to speed with the rest of the servers.

Step 1: The falling of the leader

Assume S3 was the current term’s leader and it fails. The clocks on the other servers will continue to tick until an election timeout on any of the servers is reached. The server will position itself as a contender and begin soliciting votes to become the leader.

Step 2: Starting off a new term

Three additional terms have passed while server S3 was offline, and server S3 still believes he is the leader while he is down. The new leader, on the other hand, continues to work as usual while overseeing the cluster. It’s now time to bring up S3.

Step 3: Rise of S3

Server S3 wakes up and believes he is the leader. If he sends a command to any of the followers, the followers will alert S3 to the presence of a new leader, and it will instantly revert to the follower state and await commands from S4. This completes the leader correction. Let us now bring S3’s logs back up to date.

Step 4: A new request in the cluster

Assume a new request has been made to the leader. The leader broadcasts an AppendEntry RPC to all followers, together with the previous log-term and log entries. Except for S3, all of the followers agree with the RPC. Raft guarantees that the logs will be free of holes. Since S3’s logs have not been updated, it is unable to add anything new to its log. While S3 waits for log rectification, the other followers accept the logs and commit them to their logs.

Step 5: Begining log correction for S3

Inconsistencies are handled by the leader by pushing the followers’ logs to duplicate their own. Conflicting entries in follower logs will be overridden by entries from the leader’s log. To reconcile a follower’s log with the leader’s, the leader must locate the most recent log entry where the two logs agree, delete any entries in the follower’s log after that point, and transmit the follower all of the leader’s entries after that point. All of these activities occur in response to the AppendEntries RPCs’ consistency check. The leader keeps a nextIndex for each follower, which is the index of the next log entry sent to that follower by the leader. When a leader takes power for the first time, it sets all nextIndex values to the index directly after the last one in its log. If a follower’s log differs from the leader’s, the AppendEntries consistency check will fail in the subsequent AppendEntries RPC. The leader decrements nextIndex and retries the AppendEntries RPC after a rejection. NextIndex will eventually reach a point when the leader and follower logs match. When this occurs, AppendEntries succeeds, removing any conflicting entries from the follower’s log and appending entries from the leader’s log (if any). When AppendEntries succeeds, the follower’s log matches the leader’s, and it will remain that way for the duration of the term.

The leader now finds the point where the leader’s log and S3’s logs match. Now the leader will start populating S3’s log with the latest logs.

The replication happens one by one until S3 is up to pace with S4.

Raft in Coachroach DB

If you’ve heard of cockroach DB. You may have heard that data is stored in what is known as keyspaces. Keyspaces are a linearly ordered set of key-value pairs that include where it lives as well as the row’s primary key. The cluster then separates the keyspaces into what are known as ranges. Ranges are 64MB data blocks that are duplicated across at least three nodes in the cluster to ensure consistency.

You may want to replicate these ranges across geolocations to encourage faster reads across zones. Cockroach DB divides the ranges into their own consensus group. Because a node can store several ranges. Because of this architecture, a node can be a member of numerous consensus groups. As more nodes are added to the cluster, the number of heartbeats necessary to keep the leader grows. As a result, just from the heartbeat, there is a massive spike in traffic.

The above graphic clearly shows that there is a lot of traffic in the system even when the system is idle. Furthermore, the number of ranges is far greater than the number of nodes in the cluster, so traffic only multiplies when new nodes are added. To address this issue, Cockroach DB created a new feature in Raft called a multi raft.

The problem of increased traffic is solved by utilising MultiRaft (and just one Raft instance per Node — Store in Cockroach source code language — rather than per Range). CockroachDB coalesces/uncoalesces heartbeats in a single response/request.

Coachrach DB caches rpc connections between nodes and uses the periodic heartbeat to compute clock skew and link latency. Individual raft consensus groups communicate over these established links as needed, but they do not generate the heartbeats. To assess whether to become candidates, followers in consensus groups examine the health of connections (i.e. whether or not a heartbeat was received inside the last heartbeat interval) after their timeouts expire.

Here’s how the transport looks like

Raft in Mongo DB

Then there’s everyone’s favourite Mongo DB. Except for a few variations, Mongo DB and raft have a lot in common. Let’s go through the differences 1 by 1.

  1. Apply first then replicate: When the commander receives a new command in the raft. Before committing to the changes, the leader first communicates them to the followers. Mongo DB takes a different approach, with the primary returning an acknowledgement that the data has been written to one of the nodes right away. Changes are then propagated between nodes, giving rise to the concept of read and write concerns in Mongo DB.
  2. Pull vs push based algorithm: In contrast to raft, where the leader is responsible for propagating changes among nodes. Chained replication is a feature of Mongo DB that allows a node to bring in updates from the nearest follower, allowing for faster replication and reduced strain on the primary aka leader.
  3. Configurable Election Timeouts: Raft sends election timeouts to machines at random, however, if the primary network is slow, this can result in a lot of elections in the network. Mongo DB allows users to change the election timeout (rule of thumb: average time of the network going down).
  4. Priorities: Mongo DB also allows you to assign server priorities. Whereas priority 0 indicates that the server will never run for election, if you have a data centre that you know is exceptionally strong, you might assign it the highest priority so that it can frequently become a leader.



Geek Culture

Sometimes it is the people no one can imagine anything of, do the things no one can imagine.