Fault Tolerant Redis Architecture

Sidhartha Mani
Koki
Published in
5 min readFeb 13, 2018

Redis is an open-source, in-memory data-structure store used as a cache, message broker and database. It can be configured to run in a highly available fashion. High availability can be achieved by removing or reducing the probability of service disruption.

However, it is impossible to prevent each and every kind of service disruption. When a disruption does occurs, the ability of the system to continue functioning is defined as fault tolerance.

The ability of a system to continue functioning in the face of failures is known as fault tolerance.

A starfish can survive even if it loses an arm.

Fault Tolerance

Reasoning about fault tolerance can be done by assessing the surface area of damage in case of failures, which can be mitigated by spanning across fault domains.

In order to asses the surface area of damages in case of failures, the different types of failures need to be understood. The various failure scenarios include:

  • Node failures — Disk crashes, overheating, or accidentally deleting a node on which the service is running can lead to failures
  • Network failures — Network device failure, channel overload, or misconfiguration can lead to service disruption
  • Application-specific failures — In case of redis, it is possible for a client to try to write to a stale master after a failure scenario where some other instance has taken over as master. In this case, data loss can occur.

Tolerating Node and Network failures

Running a service on one node, writing data into one network disk, or isolating all Redis instances to one availability zone can all lead to complete service disruption. This is because the fault domain spans across the whole service.

The best mechanism to survive such failures is to reduce the surface area of damage by spanning across multiple fault domains —

  • Run multiple masters spread out across multiple nodes. (A node is one fault domain)
  • Nodes should span across availability zones. (Availability zone is one fault domain)
  • Less performance-sensitive workloads, such as data warehousing and analytics can even be spread out across regions. (Region is one fault domain)

Redis Sentinel

Redis Sentinel is the primary mechanism to setup and run Redis in a fault tolerant manner. Redis Sentinel is a distributed service that continuously checks for the health of masters and slaves running in the cluster, and performs a fail-over if needed.

If it detects a failure or is unable to reach a particular instance, it tries to verify with other sentinels if that instance is actually down.

When the Sentinel first detects that a master is down, it designates the master a new state known as S_DOWN. This state says that the current master is Subjectively Down.

Then, it communicates with other sentinels, and if a preconfigured number of sentinels agree that the master is down, then the state is updated to O_DOWN, or Objectively Down. Then it proceeds to “vote” for the new master. The voting process requires a majority of the sentinels to agree on the new master.

The process of electing a new master to take over the tasks of a failed master is known as failover.

There are important considerations to make when setting up Sentinel for Fault Tolerance —

  • The Redis client should support Sentinel.
  • The architecture of the Sentinel-based cluster should promote robustness. It is possible to configure the cluster in such a way that sentinels cannot function effectively
  • The sentinel itself should be setup in a highly available and fault tolerant manner.

Fault Tolerant Architecture — I

Consider this setup of Redis Master and Sentinel. The larger box is one node. The smaller box can be considered one container.

+------------+
| +---+ | M - Master
| |M1 | | S - Sentinel
| +---+ | SL - Slave
| |S1 | |
+---+---+----+

If the master process dies, the sentinel will detect that the process is dead. It will know that the master is down. However, it won’t be able to bring the service back up. This setup is not fault tolerant.

Fault Tolerant Architecture — II

+------------+  
| +---+ | M - Master
| |M1 | | S - Sentinel
| +---+ | SL - Slave
| |S1 | |
+---+-|-+----+
|
+-----|------+
| +-|-+ |
| |SL1| |
| +---+ |
| |S2 | |
+---+---+----+

There setup consists of two instances — one master and one slave. If the master process dies, then the sentinels can detect it and promote the slave. This setup can tolerate master process failure.

However, if the node running master dies, then both M1 an S1 will die. In this case, the remaining sentinel will not be able to perform a failover, since it requires a majority of the total sentinels to agree on the new master. This setup will not tolerate master node failure.

Fault Tolerant Architecture — III

+------------+  
| +---+ | M - Master
| |M1 | | S - Sentinel
| +---+ | SL - Slave
| |S1 | |
+---+-|-+----+
|
+-----|------+ +-----|------+
| +-|-+ | | +-|-+ |
| |SL1| | | |SL2| |
| +---+----+------------+---+---+ |
| |S2 | | | |S3 | |
+---+---+----+ +---+---+----+

This setup consists of three instances — one master and two slaves instances. Each of the instances are run on different nodes. If the master process dies, or if the master node dies, the majority vote can be done, and one of the slaves can be promoted to be the new master. This setup is tolerant to master process failure, and master node failure.

This is the minimally fault tolerant Redis Setup that should be considered for production.

In this setup, if one master and any one of the slaves fail, then the consensus cannot be reached to promote the left over slave to master. This setup cannot tolerate the failure of one master and one slave node.

Beyond bare-minimum fault tolerance

The above architectures serve as reference architectures for fault tolerance and reasoning about them. The other important fault tolerance mechanisms are:

  • Restart processes that die unexpectedly
  • Do not couple the life-cycles of two components.

If any of the instances (process) dies for any reason, using a system like Docker or Kubernetes to automatically restart those processes will ensure that this architecture is always satisfied. Otherwise, this architecture might be honored during setup, but not after the failure of one of the processes.

In the fault-tolerant architecture III presented above, if any one of the sentinels were run on separate node, then the lifecycle of the sentinel and the lifecycle of the master will not be tied to the same nodes. This will ensure that the above setup can tolerate the failure of one master and one slave, making it more fault tolerant.

Next steps

This blog post serves as an architecture reference for fault tolerant Redis setup.

Stay tuned for more architectural deep dives into Redis and announcements of automation technologies for production ready architectures!

Legal

Redis and the Redis logo are the trademarks of Salvatore Sanfilippo in the U.S. and other countries.

--

--