Single leader replication — Node Outages

Srikant Viswanath
Architectural Tapas
5 min read · Jun 21, 2020

In this post we will look at some of the high-level options for dealing with node outages in a single-leader replication database setup.

Any database node could go down unexpectedly due to a fault, or be taken down for scheduled maintenance. In the spirit of high availability, we wouldn’t want such an event to disrupt our business operations.

Follower failure

A follower in this context is used as a read replica. What happens if the network between a follower and the leader is temporarily interrupted, or if the follower crashes?

Digression: When a node is cut off from the other nodes in terms of network connectivity (the physical cable may be damaged, the node may be at capacity and not responding, or it may simply have crashed), the situation is often referred to as a network partition, which is the P in the CAP theorem.

Every follower keeps a log of the data changes it has received from the leader on its local disk, so it knows the last transaction it processed before the fault occurred. Once the fault is rectified, the follower can connect to the leader and ask for all the transactions it missed, based on that last transaction id. When the missed transactions have been applied, the follower is said to have caught up and can continue receiving the stream of data changes as before.
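A minimal sketch of that catch-up handshake, using hypothetical Leader and Follower classes rather than any particular database’s replication API:

```python
# Hypothetical sketch of follower catch-up after an outage. The Leader/Follower
# interfaces are illustrative only, not a specific database's API.

class Leader:
    def __init__(self):
        self.log = []  # [(txn_id, change), ...] in commit order

    def changes_since(self, txn_id):
        # Return every change committed after the given transaction id.
        return [entry for entry in self.log if entry[0] > txn_id]


class Follower:
    def __init__(self, leader):
        self.leader = leader
        self.log = []                 # durable local log of data changes
        self.last_applied_txn_id = 0  # last change applied before the fault

    def catch_up(self):
        # Ask the leader for everything after the last transaction we saw.
        for txn_id, change in self.leader.changes_since(self.last_applied_txn_id):
            self.log.append((txn_id, change))
            self.last_applied_txn_id = txn_id
        # At this point the follower is "caught up" and can resume streaming.
```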

There is a very strong similarity with setting up new followers

Crank up the difficulty — Leader failure a.k.a Failover

Handling the failure of a follower is quite straightforward, because the interruption is localized to that one follower. Now, what if there is a failure at the leader? The blast radius is obviously much bigger: no more writes can enter the system.

At a high level, one of the followers needs to be promoted to be the new leader, clients (or the write load balancer) need to be reconfigured to send writes to the new leader, and all the remaining replicas need to start consuming data changes from the new leader. This process as a whole is called failover.
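The rest of this post digs into each of these steps, but the overall procedure has roughly this shape (every helper below is hypothetical, standing in for whatever the cluster’s tooling actually provides):

```python
# Rough outline of a failover, matching the three steps described above.
# elect_new_leader, routing_tier and replica.follow are hypothetical stand-ins.

def failover(followers, routing_tier):
    new_leader = elect_new_leader(followers)   # step 1: promote a follower (election, see below)
    routing_tier.set_leader(new_leader)        # step 2: reconfigure clients / the write balancer
    for replica in followers:
        if replica is not new_leader:
            replica.follow(new_leader)         # step 3: replicas consume changes from the new leader
    return new_leader
```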

Simple enough? — Not quite…

For an automated failover solution, there are multiple things to consider, and by extension things that can go very wrong, with no perfect solution. Let's see what they are in brief:

How to tell if the leader is dead? As briefly touched upon earlier in the digression, there are many things that can go wrong, with no foolproof way to tell exactly what has happened. Timeouts are the common “brute force” solution used by many systems, i.e., nodes frequently bounce heartbeat messages back and forth between each other, and if a node doesn’t respond within the cutoff, it is assumed to be dead.

Detecting leader is dead

What is the right amount of timeout? A longer value means it takes longer to detect the failure and kick in the failover recovery mechanism. A shorter value, however, tends to create unnecessary failovers. For instance, a leader already under a temporary load spike could be taking longer to respond; if its response time happens to go beyond the timeout threshold, the failover adds unnecessary overhead to an already stressed system.
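As a minimal sketch of how such a timeout check might look (the 30-second cutoff and the LeaderMonitor class are purely illustrative, not values from any particular system):

```python
import time

# Toy heartbeat monitor: each node records when it last heard from the leader.
# TIMEOUT_SECONDS is the knob discussed above: too high delays failover,
# too low triggers failovers spuriously under a temporary load spike.
TIMEOUT_SECONDS = 30  # illustrative value only

class LeaderMonitor:
    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        # Called whenever any message arrives from the leader.
        self.last_heartbeat = time.monotonic()

    def leader_presumed_dead(self):
        return time.monotonic() - self.last_heartbeat > TIMEOUT_SECONDS
```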

Choosing a new leader: This is generally a democratic process, done via an election amongst the rest of the followers. The new leader is chosen by a majority (a pure majority, sorry, no electoral college here). The best candidate usually ends up being the follower with the most up-to-date data from the previous leader. This is one instance of a general category of problems called the consensus problem.

Consensus to elect a new leader

If asynchronous replication is used, it could happen that the new leader does not have the most up-to-date data.
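As a toy illustration of that selection criterion, assuming each follower exposes the position of the last transaction it has applied (a real consensus protocol such as Raft layers voting, terms, and quorums on top of this, all omitted here):

```python
# "Most caught-up candidate wins": pick the follower whose replication log is
# furthest along. Real elections also need a quorum of votes and tie-breaking.

def best_candidate(followers):
    # followers: iterable of objects exposing .last_applied_txn_id
    return max(followers, key=lambda follower: follower.last_applied_txn_id)
```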

Old leader’s back!: What if the old leader comes back from the dead and continues to think that it is the leader? This situation is often called split brain, i.e., there are 2 nodes in the system that each think they are the leader. It gets dangerous if both nodes continue to accept writes: data is likely to be lost or corrupted if there is no proper write-conflict resolution. This situation is similar to Multi-leader replication.

A usual fallback option is to detect this split-brain situation and shut down one of the leaders, thus causing a mini ricochet of another failover.

Split-brain scenario where 2 nodes both assume they are the leader
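One possible shape of that safeguard, assuming a coordination service records who the current leader is; the Node and coordinator interfaces here are hypothetical, and real deployments additionally fence the demoted leader so that its in-flight writes get rejected:

```python
# Illustrative split-brain guard: before accepting a write, a node confirms with
# a (hypothetical) coordination service that it still holds the leader role and
# demotes itself otherwise.

class Node:
    def __init__(self, node_id, coordinator):
        self.node_id = node_id
        self.coordinator = coordinator   # e.g. a ZooKeeper-backed leader registry
        self.is_leader = True

    def accept_write(self, write):
        if self.coordinator.current_leader() != self.node_id:
            self.is_leader = False       # someone else was promoted while we were away
            raise RuntimeError("no longer the leader; redirect this write")
        self.apply(write)

    def apply(self, write):
        ...  # append to the local log, stream to followers, etc.
```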

Another dangerous situation is when the new leader does not have the most recent data and the old leader rejoins the cluster; in the meantime, the new leader might have received conflicting writes. A surprisingly common solution is to discard the old leader’s unreplicated writes, violating clients’ durability expectations.

Reconfigure writes to use the new leader: Clients’ writes now need to be re-routed to the new leader. This is usually referred to as the request-routing class of problems, which applies well beyond the scope of failover.

There are 2 high-level patterns that can be used here:

  • Clients send their writes to a routing tier first, which is kept up to date with the leader information and forwards them to the current leader. A coordination service such as Zookeeper is usually used to coordinate cluster-level metadata; the routing node can subscribe to that metadata and get notified whenever the leader changes (see the sketch further below).
  • In a more internal-services scenario, the clients themselves can be made aware of the leader and write to it directly, removing the overhead of an intermediary.

The first solution applies to a broader set of problems, including partitioning (which could be at the database, application, or cache level).
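As a rough sketch of the first pattern, here is what the routing tier’s subscription could look like with the kazoo ZooKeeper client; the /db/leader znode path, the payload format (the leader’s address stored as plain bytes), and the forward_to helper are assumptions for illustration:

```python
from kazoo.client import KazooClient

# Routing-tier sketch: keep a cached view of the current leader by watching a
# znode in ZooKeeper, and forward incoming writes to whoever that is.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

current_leader = None

@zk.DataWatch("/db/leader")
def on_leader_change(data, stat):
    # Invoked once at registration and again whenever the znode changes,
    # i.e. whenever a failover updates the leader metadata.
    global current_leader
    if data is not None:
        current_leader = data.decode("utf-8")

def route_write(request):
    # forward_to is a hypothetical helper that sends the write to the given node.
    return forward_to(current_leader, request)
```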

Conclusion

Due to some of the problems briefly discussed above, and the fact that there are no foolproof solutions, it is often preferable to do manual failovers if you can afford it, based on your application’s downtime appetite, data-consistency strictness, and scale.
