Highly available Redis Architecture

Sidhartha Mani
Published in Koki
4 min read, Feb 12, 2018

Redis is one of the most widely used container images in the world¹. It is an open-source, in-memory data-structure store, used as a cache, a message broker, and a database. Because it holds data, it is classified as a stateful service.

Running a stateful service such as Redis (with or without Kubernetes) requires careful architectural planning. This article goes over the architectural concerns that need to be addressed to confidently and correctly run Redis.

Stateful services may require complex orchestration

Production Planning for Stateful Applications

Running stateful services in production requires carefully addressing at least the following set of concerns. These concerns do not apply only to Redis: the same framework of thinking has been applied to PostgreSQL in one of our previous articles (https://goo.gl/G1QW4C).

High Availability

  1. Write Consistency guarantees
  2. Replication
  3. Clustering & Sharding
  4. Load Balancing

Fault Tolerance

  1. Failover
  2. Backups and Restore

Observability

  1. Monitoring: Performance and Uptime

Management Concerns

  1. Migration
  2. Upgrades and Downgrades

Infrastructure Planning

  1. Network Partitioning
  2. Service Distribution across Failure domains

Once these concerns are addressed and tested, a stateful service can be confidently deployed in production.

The Redis Architecture

The Redis 3.0 release consists of the following components:

  • Redis Server
  • Redis CLI (and other clients)
  • Redis Sentinel
  • Redis Cluster

High Availability

One way to reason about high availability is to enumerate the ways in which it can be disrupted.

High availability can be described as the ability to serve requests even in the face of failures.

For stateful services, availability can be disrupted by:

  1. Data inconsistency: What is written should be what is available to read (with reasonable guarantees). Stale data should not be served. Redis has a built-in client acknowledgement mechanism to help ensure data consistency.
  2. Write load: Writes are significantly more expensive than reads, so write-heavy workloads start failing at a much lower request rate. Mechanisms to distribute the write load are required in such situations.
  3. Data loss and data corruption: Node failures, process crashes, and disk failures can lead to data loss. Mechanisms to guard against these are important.

Addressing Write Loads

Redis Cluster is the primary mechanism for high availability. It provides sharding: the key space is divided into 16384 hash slots (buckets), and a Redis shard is the subset of those slots that is served from a particular failure domain. For sharding purposes, failure domains are generally hosts.

In a sharded cluster, writes are distributed among the shards, so the failure of one shard is isolated and does not bring down the entire cluster.
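To make the bucket idea concrete, here is a minimal Python sketch of how a key could be mapped to one of the 16384 buckets (called hash slots in the Redis Cluster specification) and then to a shard. The slot formula, CRC16 of the key modulo 16384 with optional `{...}` hash tags, follows the Redis Cluster specification. The `shard_for_slot` helper and its even, contiguous slot layout are assumptions made for illustration; a real cluster assigns slot ranges to nodes explicitly.

```python
import binascii
from collections import Counter

HASH_SLOTS = 16384  # total number of slots (buckets) in a Redis Cluster


def key_slot(key: str) -> int:
    """Map a key to a hash slot: CRC16 (XMODEM variant) modulo 16384."""
    data = key.encode("utf-8")
    # Hash tags: if the key contains a non-empty "{...}" section, only that
    # section is hashed, so related keys can be forced into the same slot.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the XMODEM CRC16.
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


def shard_for_slot(slot: int, shard_count: int) -> int:
    """Hypothetical layout: contiguous, near-equal slot ranges per shard."""
    return slot * shard_count // HASH_SLOTS


if __name__ == "__main__":
    # Show how 10,000 illustrative keys spread over three shards.
    keys = [f"user:{i}" for i in range(10_000)]
    per_shard = Counter(shard_for_slot(key_slot(k), 3) for k in keys)
    for shard, count in sorted(per_shard.items()):
        print(f"shard {shard}: {count} keys")
```

Because each key lands in exactly one slot and each slot belongs to exactly one shard, losing a shard affects only the keys in its slot range, while writes to the other shards continue.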

Containerized platforms such as Kubernetes require careful consideration to ensure that two containers, each representing a shard, do not run on the same failure domain.

Tolerating Data Loss and Corruption via Replication

Replication requires running the Redis cluster in a Master-Slave fashion.

The Redis Server can run in two modes:

  • Redis Master
  • Redis Slave

The Redis master is the source of truth and the write head for the Redis cluster. A Redis slave (also known as a replica) acts as a standby server for redundancy. When replication is enabled, slaves can also be used to serve read requests.

The Redis slaves connect to Redis masters to receive data and continuously synchronize the data with the master. This provides redundancy through replication. Redundancy addresses the above concern of Data loss and corruption.

There are limits to how much data loss and corruption can be tolerated, partly because of the nature of replication between master and slaves. Master-slave replication is asynchronous, so a write that the master has already acknowledged may not yet be present on the slave. If the master fails before that write has been streamed to a slave, the write is lost.
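The failure window described above can be illustrated with a toy simulation. This is not Redis code: the `Master`, `Replica`, and `simulate` names are hypothetical, and the in-memory backlog only stands in for the asynchronous replication stream.

```python
from collections import deque


class Replica:
    """Toy replica: holds whatever the master has streamed so far."""

    def __init__(self):
        self.data = {}


class Master:
    """Toy master that acknowledges writes before replicating them."""

    def __init__(self, replica):
        self.data = {}
        self.replica = replica
        self.backlog = deque()  # writes not yet streamed to the replica

    def set(self, key, value):
        self.data[key] = value
        self.backlog.append((key, value))
        return "OK"  # the client sees success at this point

    def replicate(self):
        """Stream pending writes to the replica (happens some time later)."""
        while self.backlog:
            key, value = self.backlog.popleft()
            self.replica.data[key] = value


def simulate(crash_before_replication: bool) -> dict:
    """Return the data left on the replica after the master is lost."""
    replica = Replica()
    master = Master(replica)
    master.set("a", 1)
    master.replicate()
    master.set("b", 2)  # acknowledged to the client
    if not crash_before_replication:
        master.replicate()
    # The master fails here; the replica is promoted with what it has.
    return replica.data


if __name__ == "__main__":
    print(simulate(crash_before_replication=False))
    print(simulate(crash_before_replication=True))
```

In the second run the replica ends up holding only `a`: the write to `b` was acknowledged to the client but never left the master.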

Mechanisms such as multipath writes and disk-level replication can be used to prevent the data loss scenario described above.

The other way data can be lost or corrupted is when all of the masters and slaves fail at once. Distributing replicas across failure domains (for example, across availability zones) can be used to combat this failure.

Quick Notes on Kubernetes

Kubernetes provides excellent mechanisms to orchestrate the setup and provisioning of highly available Redis clusters that address both sharding and replication.

When sharding, it is important to ensure that two shards do not run on the same host. This can be done using anti-affinity rules. In addition, a node selector or node affinity makes it possible to place masters and slaves in different availability zones.
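As a rough sketch of what such rules look like, the Python snippet below builds the `affinity` section of a pod spec as a plain dictionary and prints it as JSON. The field names follow the Kubernetes pod spec. The `redis_pod_affinity` helper, the `redis-shard` label, and the zone names are assumptions for illustration only. The zone label key shown is the current well-known label; older clusters use `failure-domain.beta.kubernetes.io/zone` instead.

```python
import json


def redis_pod_affinity(app_label: str, zone: str) -> dict:
    """Build the `affinity` section of a pod spec for one Redis pod."""
    return {
        "podAntiAffinity": {
            # Hard rule: never schedule two pods carrying the same app
            # label onto the same host.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        },
        "nodeAffinity": {
            # Pin the pod to nodes in the given availability zone.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "topology.kubernetes.io/zone",
                                "operator": "In",
                                "values": [zone],
                            }
                        ]
                    }
                ]
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical placement: masters in one zone, slaves in another.
    master_affinity = redis_pod_affinity("redis-shard", zone="zone-a")
    slave_affinity = redis_pod_affinity("redis-shard", zone="zone-b")
    print(json.dumps(master_affinity, indent=2))
    print(json.dumps(slave_affinity, indent=2))
```

The anti-affinity term keeps pods with the same label on different hosts, while the node affinity term separates masters and slaves by zone, which covers both placement concerns mentioned above.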

The configuration of this cluster can be managed as one config map, and a controller can be used to orchestrate the whole process. This will lead to a 1-click-install experience for setting up and managing a HA Redis cluster.

Next Steps

This article describes architectural planning for running highly available Redis clusters. We still haven’t addressed how to make this cluster fault tolerant.

Fault tolerance can be described as the ability to recover from failures of operation of a service.

The next blog post will discuss fault tolerance in Redis. The soon-to-be-released Koki Installer for Redis will provide highly available and fault-tolerant Redis for your Kubernetes cluster. Stay tuned for more updates!

Legal

Redis and the Redis logo are the trademarks of Salvatore Sanfilippo in the U.S. and other countries.
