RAFT: Re-implementation and Stress Test

Siyang Wu, Zhongtian He
Princeton Systems Course
May 19, 2024

Introduction

The consensus problem is to make all processes in a distributed system agree on a single value after one or more values are proposed. Consensus algorithms provide fault tolerance for a collection of machines working together as a group. Leslie Lamport’s Paxos has become synonymous with consensus. However, it is difficult to understand, as the paper presents a theoretical view based on a set of axioms that a consensus protocol should satisfy.

RAFT is a consensus algorithm for managing a replicated log. It is more understandable than Paxos because it takes a modular approach, separating leader election from the propose/accept phases. RAFT’s strong-leader design ensures that only the leader can propose. When the leader receives a command from a client, it appends the entry to its local log. If a follower’s log is missing entries, the leader sends it an AppendEntries RPC. If the RPC fails due to a log mismatch, which can happen after a leader change, the leader decrements the index and retries the AppendEntries RPC until the follower’s log is consistent with the current leader’s. Once a majority of replicas have matched an entry, the leader commits it. The safety of RAFT is guaranteed by the strong leader property.
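
To make this retry-and-commit loop concrete, here is a minimal Python sketch of the leader-side bookkeeping, loosely following the paper’s nextIndex/matchIndex state; it is an illustration of the idea, not the exact code in our replica.py.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    term: int
    command: str

@dataclass
class LeaderState:
    current_term: int
    log: list                                        # list of Entry
    commit_index: int = -1
    next_index: dict = field(default_factory=dict)   # follower id -> next entry index to send
    match_index: dict = field(default_factory=dict)  # follower id -> highest index known to match

def on_append_reply(leader: LeaderState, follower_id: int, success: bool, num_replicas: int) -> None:
    """Handle one AppendEntries reply: back up on mismatch, commit on majority match."""
    if not success:
        # Log mismatch (e.g. after a leader change): decrement and retry from an earlier index.
        leader.next_index[follower_id] = max(0, leader.next_index[follower_id] - 1)
        return
    leader.match_index[follower_id] = len(leader.log) - 1
    leader.next_index[follower_id] = len(leader.log)
    # An entry is committed once the leader plus a majority of followers store it
    # and it belongs to the leader's current term.
    for n in range(len(leader.log) - 1, leader.commit_index, -1):
        stored = 1 + sum(1 for m in leader.match_index.values() if m >= n)
        if stored > num_replicas // 2 and leader.log[n].term == leader.current_term:
            leader.commit_index = n
            break
```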

The leader periodically sends heartbeat messages to every follower. If a replica stops receiving messages from the leader, it starts a leader election and becomes a candidate. A candidate wins an election if it receives votes from a majority of replicas. To avoid electing multiple leaders while still guaranteeing liveness, each message carries a term. Each replica maintains a variable called current_term, the latest term seen in any message it has received, and it ignores messages from earlier terms. During leader election, each replica casts at most one vote per term, and a new term starts after a timeout if no candidate gathers a majority. Randomized election timers make a successful election more likely: each replica waits a random amount of time before becoming a candidate, which staggers candidacies and reduces the overlap and collisions between competing elections (this corresponds to the election timeout parameter described in the following sections).
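
The election logic can be sketched in a few lines of Python; `poll_message` and `request_vote` below are assumed helper calls (not the exact interface of our implementation), and the timeout range matches the ELE_TO parameter used in our experiments.

```python
import random
import time

ELE_TO = 1.0  # election timeout parameter (seconds); see the experiment section

def follower_loop(replica, num_replicas):
    """Illustrative follower behavior: become a candidate after a randomized timeout."""
    while True:
        deadline = time.monotonic() + random.uniform(ELE_TO, 2 * ELE_TO)
        heard_leader = False
        while time.monotonic() < deadline:
            msg = replica.poll_message(timeout=deadline - time.monotonic())
            if msg is not None and msg.term >= replica.current_term:
                # A heartbeat / AppendEntries from a live leader resets the timer.
                replica.current_term = msg.term
                heard_leader = True
                break
        if heard_leader:
            continue
        # Timed out without hearing from a leader: start an election for a new term.
        replica.current_term += 1
        replica.voted_for = replica.id          # each replica casts one vote per term
        votes = 1
        for peer in replica.peers:
            if replica.request_vote(peer, replica.current_term):
                votes += 1
        if votes > num_replicas // 2:           # a majority makes this candidate the leader
            return "leader"
```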

Like Paxos, RAFT guarantees perfect safety at the cost of some liveness. However, from the description of the RAFT algorithm alone, it is not obvious how available a RAFT system is, especially under failures and at larger scales, or how the parameters of the algorithm affect its performance. Thus, in this project, we study RAFT’s performance, more specifically its latency, under different failure patterns and scales, when running with different parameters.

Implementation

Platform

To fully understand RAFT’s performance under different scenarios, it is essential to run RAFT on a platform where we have maximal control of the network, including latency, failures, etc. To achieve this, we build an emulation platform where each replica and the client run as separate processes, and communication between nodes is emulated through inter-process communication. The platform emulates real-world communication latency by first randomly generating a 2D coordinate Xi for every node i; the latency between nodes i and j is then proportional to the Euclidean distance between them, i.e. |Xi-Xj|. Besides latency, the platform also supports emulating several common failure modes, such as independent node failure and network partition. We also introduce parameters controlling the failure probability p and the length of each failure (mu, sigma).
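
A minimal sketch of this latency model, assuming an illustrative scale factor that converts distance to seconds:

```python
import math
import random

def generate_latency_matrix(num_nodes, scale=0.1, seed=0):
    """Emulated one-way latency: proportional to the Euclidean distance between
    random 2D coordinates sampled uniformly from [0, 1] x [0, 1]."""
    rng = random.Random(seed)
    coords = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(num_nodes)]
    latency = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(num_nodes):
            # scale is an assumed constant of proportionality (seconds per unit distance).
            latency[i][j] = scale * math.dist(coords[i], coords[j])
    return latency
```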

  • More formally, for independent node failure, at any time, with probability p, a node i fails for an interval of length t sampled from a normal distribution with mean mu and standard deviation sigma, provided this failure would not cause more than F simultaneous failures. We denote this failure mode as Node.
  • For network partition failure, at any time, with probability p, the replicas are divided into two sets S1 and S2, with |S2| ≤ F, for an interval of length t sampled from a normal distribution with mean mu and standard deviation sigma, and there is no communication between the two sets during this interval. We provide three ways to generate S1 and S2: 1. randomly choose a subset whose size is uniformly sampled from F+1 to N as S1, denoted as Partition-Normal; 2. randomly choose a subset of size F+1 as S1, denoted as Partition-Hard; and 3. set |S1| = F+1 such that only one non-failure node is shared between two adjacent failures, denoted as Partition-Flip (a sampling sketch follows this list).
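
The sketch below shows one way these failure intervals and partitions could be sampled; the helper names are ours, and we assume S1 is the majority (non-failure) side, as described above.

```python
import random

def sample_failure_length(mu, sigma, rng=random):
    """Failure interval length t ~ Normal(mu, sigma), clipped at zero."""
    return max(0.0, rng.gauss(mu, sigma))

def sample_partition(nodes, F, mode, rng=random, prev_s1=None):
    """Sample a partition (S1, S2) with |S2| <= F, per the three modes above."""
    N = len(nodes)
    if mode == "partition-normal":
        size_s1 = rng.randint(F + 1, N)      # |S1| uniform in [F+1, N]
    else:
        size_s1 = F + 1                      # partition-hard and partition-flip
    if mode == "partition-flip" and prev_s1:
        # Keep exactly one node in common with the previous majority side S1.
        keep = rng.choice(sorted(prev_s1))
        others = [n for n in nodes if n not in prev_s1]
        s1 = {keep, *rng.sample(others, size_s1 - 1)}
    else:
        s1 = set(rng.sample(nodes, size_s1))
    return s1, set(nodes) - s1
```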

RAFT

On top of this highly controllable emulation platform, we implement the RAFT algorithm following the paper. Additionally, we implement a state machine that maintains a list variable and handles “append x to list” requests sent by the client. We choose an append-only list as our state machine for two reasons: 1. the operation is fast to execute, so the measured latency mainly comes from reaching consensus through RAFT, and 2. we can easily validate the correctness of our implementation by having the client append distinct values to the list and checking whether the returned list equals the expected one.
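
The state machine itself is only a few lines; a sketch of the idea follows (the exact command format in our code may differ):

```python
class AppendListStateMachine:
    """Replicated state machine used for validation: committed commands append to a list."""

    def __init__(self):
        self.values = []

    def apply(self, command):
        # Commands look like "append x"; applying them is cheap, so the measured
        # latency is dominated by the consensus protocol rather than execution.
        _, x = command.split(" ", 1)
        self.values.append(x)
        return list(self.values)  # the client compares this against its expected list
```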

The source code is available on GitHub. The structure of the repository is as follows:

  • network.py: This file builds the communication graph, which contains the communication latency between every pair of replicas and the time intervals during which communication fails.
  • client.py: This file implements the client, which sends append requests to the replicas and immediately sends the next request after the previous one is committed. The code that measures latency and validates correctness also lives in this file.
  • communicator.py: This file implements communication among replicas and between replicas and the client. When a message is sent from replica i to replica j, the communicator first checks the network status; if the network has not failed at that time, the message is delivered to j after the latency specified in the communication graph.
  • replica.py: This file implements the core RAFT algorithm. We define four types of messages sent among replicas, “req_vote”, “rep_vote”, “req_apd”, and “rep_apd”, corresponding to RequestVote requests and replies and AppendEntries requests and replies. We define a “Replica” class to store all of a replica’s state; the full set of states we maintain can be found in the code (a small dispatch sketch follows this list).
  • main.py: This file launches all replica and client processes. We also implement a “log_queue_status” method to monitor whether the communication queue between two processes is overloaded.
  • heap.py: This file implements utilities used to emulate the network.
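
For orientation, here is a hypothetical sketch of how a replica might dispatch the four message types listed above; the handler method names are placeholders, not the actual functions in replica.py.

```python
def handle_message(replica, msg):
    """Illustrative dispatch over the four RAFT message types."""
    kind = msg["type"]
    if kind == "req_vote":       # a candidate asks this replica for its vote
        return replica.on_request_vote(msg)
    if kind == "rep_vote":       # a voter replies: vote granted or denied
        return replica.on_vote_reply(msg)
    if kind == "req_apd":        # the leader replicates entries (or sends a heartbeat)
        return replica.on_append_entries(msg)
    if kind == "rep_apd":        # a follower reports whether its log matched
        return replica.on_append_reply(msg)
    raise ValueError(f"unknown message type: {kind}")
```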

Experiment

Experiment Setup

We carry out experiments to study how RAFT’s performance changes with the following factors:

  1. the number of replicas (denoted as #replica),
  2. the failure mode,
  3. the failure probability, and
  4. the election timeout (denoted as ELE_TO). Note: in RAFT’s leader election stage, a candidate times out after a random interval drawn uniformly from [ELE_TO, 2*ELE_TO].

In this project, we use latency to measure the liveness of RAFT, measured as the time between a request being sent by the client and being committed by the leader. Note that in our experiments, we assume near-perfect communication between the client and the replicas, where the latency for sending messages between them is negligible; thus the latency we measure is almost the same as the time it takes for RAFT to reach consensus. Besides reporting the average latency of all operations, we also report the average latency of operations during which a term change happens, which better characterizes the time spent on leader election. For every configuration, we run our program with 4 different seeds and report the average of the two metrics (the average latency over all requests and over term-change requests). In every experiment, the client runs for 1 minute and immediately sends the next request after the previous request is committed.
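
A sketch of this closed-loop measurement, where `send_and_wait_commit` is an assumed blocking helper rather than the exact interface of our client.py:

```python
import time

def run_client(send_and_wait_commit, duration_s=60.0):
    """Closed-loop client: send the next request only after the previous one commits,
    recording per-request latency."""
    latencies = []
    end = time.monotonic() + duration_s
    i = 0
    while time.monotonic() < end:
        start = time.monotonic()
        send_and_wait_commit(f"append {i}")   # returns once the leader commits the entry
        latencies.append(time.monotonic() - start)
        i += 1
    # In our experiments these latencies are averaged per configuration and over 4 seeds.
    return latencies
```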

Experiment Results

Fig 1. Communication latency between replicas for the 5-replica case in our experiments.

1. RAFT’s performance under different numbers of replicas

For this part, we fix the failure mode to Partition-Hard, the failure probability p to 0.4, and ELE_TO to 1s, and run RAFT with 3, 5, and 7 replicas. We use the same algorithm to randomly generate each replica’s location in the 2D plane (uniformly sampling x and y coordinates from [0,1]) to ensure the expected time to send a message between two nodes stays the same as #replica changes. The result (shown below) demonstrates that with more replicas the average latency is larger and it takes longer to elect a leader, which matches our expectation. It is also worth noting that the latency increases more rapidly when #replica changes from 5 to 7 than when it changes from 3 to 5. This suggests that RAFT may not scale well.

Fig 2. Performance under different numbers of replicas.

2. RAFT’s performance under different failure modes

In all the remaining experiments, the number of replicas is fixed at 5 and the communication latency between replicas is shown in Fig 1. We fix the failure probability p to 0.4 and ELE_TO to 1s. The result (shown below) shows that the Node and Partition-Normal failure modes have a similar influence on the system, whereas the latency increases when the failure mode changes to Partition-Hard or Partition-Flip. This means the latter two are stronger failure modes. We hypothesize this is because in the latter two failure modes, whenever a failure happens there are exactly F failed replicas in the system, whereas the former two may have fewer than F failures. The result also demonstrates that the fewer replicas that remain non-failed across adjacent failure intervals, the more challenging the network becomes, which follows our intuition.

Fig 3. Performance under different failure modes.

3. RAFT’s performance under different failure probabilities

We fix the failure mode to Partition-Hard and ELE_TO to 1s, and run RAFT with failure probability p equal to 0, 0.2, 0.4, 0.6, 0.8, and 1.0. The average latency over all requests increases as the failure probability increases. The average latency of term-change requests remains almost the same across failure probabilities, which means the frequency of failures (as long as the set of failed nodes does not change too rapidly) has almost no effect on the time to elect a leader.

Fig 4. Performance under different failure probabilities. Note that when the failure probability is 0, the client did not encounter any leader change, so we mark the term-change latency as 0.

4. RAFT’s performance under different election timeouts

The best election timeout depends on the network environment and the failure parameters. For this part of the experiments, we fix the failure mode to Partition-Hard and the failure probability p to 0.4, and run RAFT with ELE_TO equal to 0.5, 1.0, 1.5, and 2.0 seconds. The best choice of ELE_TO turns out to be 1.0s for this setting. We believe this is because, if the election timeout is too small, there is more overlap and collision between multiple candidates, as discussed in the introduction, so the probability of a successful election is low; on the other hand, if the election timeout is too large, time is wasted before restarting a leader election term.

Fig 5. Performance under different election timeouts.

Conclusion

In this project, we re-implemented RAFT and studied how available it is by measuring its average latency, over all requests and over term-change requests, under different scales, failure patterns, and algorithm parameters. We made several interesting observations, including:

  • the network condition is harder for RAFT (its performance is poorer) when fewer replicas remain non-failed across adjacent failure intervals, even if the number of failures stays the same,
  • the frequency of failures (as long as the set of failed nodes does not change too rapidly) has almost no effect on the time to elect a leader, and
  • an election timeout that is either too small or too large degrades RAFT’s performance.

References

Diego Ongaro and John Ousterhout. In Search of an Understandable Consensus Algorithm. USENIX Annual Technical Conference (ATC), 2014.
