Preventing Pipeline Calls from Crashing Redis Clusters

By Michael Cartmell , Jiahao Huang & Sandeep Kumar

Grab
Grab
6 min readMar 23, 2020

--

On Feb 15th, 2019, a slave node in Redis, an in-memory data structure storage, failed requiring a replacement. During this period, roughly only 1 in 21 calls to Apollo, a primary transport booking service, succeeded. This brought Grab rides down significantly for the one minute it took the Redis Cluster to self-recover. This behavior was totally unexpected and completely breached our intention of having multiple replicas.

This blog post describes Grab’s outage post-mortem findings.

With Grab’s continuous growth, our services must handle large amounts of data traffic involving high processing power for reading and writing operations. To address this significant growth, reduce handler latency, and improve overall performance, many of our services use Redis — a common in-memory data structure storage — as a cache, database, or message broker. Furthermore, we use a Redis Cluster, a distributed implementation of Redis, for shorter latency and higher availability.

Apollo is our driver-side state machine. It is on almost all requests’ critical path and is a primary component for booking transport and providing great service for customer bookings. It stores individual driver availability in an AWS ElastiCache Redis Cluster, letting our booking service efficiently assign bookings to drivers. It’s critical to keep Apollo running and available 24/7.

Because of Apollo’s significance, its Redis Cluster has 3 shards each with 2 slaves. It hashes all keys and, according to the hash value, divides them into three partitions. Each partition has two replications to increase reliability.

We use the Go-Redis client, a popular Redis library, to direct all written queries to the master nodes (which then write to their slaves) to ensure consistency with the database.

For reading related queries, engineers usually turn on the ReadOnly flag and turn off the RouteByLatency flag. These effectively turn on ReadOnlyFromSlaves in the Grab gredis3 library, so the client directs all reading queries to the slave nodes instead of the master nodes. This load distribution frees up master node CPU usage.

When designing a system, we consider potential hardware outages and network issues. We also think of ways to ensure our Redis Cluster is highly efficient and available; setting the above-mentioned flags help us achieve these goals.

Ideally, this Redis Cluster configuration would not cause issues even if a master or slave node breaks. Apollo should still function smoothly. So, why did that February Apollo outage happen? Why did a single down slave node cause a 95+% call failure rate to the Redis Cluster during the dim-out time?

Let’s start by discussing how to construct a local Redis Cluster step by step, then try and replicate the outage. We’ll look at the reasons behind the outage and provide suggestions on how to use a Redis Cluster client in Go.

1. Download and install Redis from here.

2. Set up configuration files for each node. For example, in Apollo, we have 9 nodes, so we need to create 9 files like this with different port numbers(x).

3. Initiate each node in an individual terminal tab with:

4. Use this Ruby script to create a Redis Cluster. (Each master has two slaves.)

5. Finally, we try to send queries to our Redis Cluster, e.g.

Redis Cluster Server

As long as the majority of a Redis Cluster’s masters and at least one slave node for each unreachable master are reachable, the cluster is accessible. It can survive even if a few nodes fail.

Let’s say we have N masters, each with K slaves, and random T nodes become unreachable. This algorithm calculates the Redis Cluster failure rate percentage:

If you successfully built your own Redis Cluster locally, try to kill any node with a simple command-c. The Redis Cluster broadcasts to all nodes that the killed node is now unreachable, so other nodes no longer direct traffic to that port.

If you bring this node back up, all nodes know it’s reachable again. If you kill a master node, the Redis Cluster promotes a slave node to a temp master for writing queries.

With this information, we can’t answer the big question of why a single slave node failure caused an over 95% failure rate in the Apollo outage. Per the above theory, the Redis Cluster should still be 100% available. So, the Redis Cluster server could properly handle an outage, and we concluded it wasn’t the failure rate’s cause. So we looked at the client side and Apollo’s queries.

Golang Redis Cluster Client & Apollo Queries

Apollo’s client side is based on the Go-Redis Library.

During the Apollo outage, we found some code returned many errors during certain pipeline GET calls. When Apollo tried to send a pipeline of HMGET calls to its Redis Cluster, the pipeline returned errors.

First, we looked at the pipeline implementation code in the Go-Redis library. In the function defaultProcessPipeline, the code assigns each command to a Redis node in this line err:=c.mapCmdsByNode(cmds, cmdsMap).

Next, since the readOnly flag is on, we look at the cmdSlotAndNode function. As mentioned earlier, you can get better performance by setting readOnlyFromSlaves to true, which sets RouteByLatency to false. By doing this, RouteByLatency will not take priority and the master does not receive the read commands.

Now, let’s try and better understand the outage.

  1. When a slave becomes unreachable, all commands assigned to that slave node fail.
  2. We found in Grab’s Redis library code that a single error in all cmds could cause the entire pipeline to fail.
  3. In addition, engineers return a failure in their code if err != nil. This explains the high failure rate during the outage.

Our next question was, “Why did it take almost one minute for Apollo to recover?”. The Redis Cluster broadcasts instantly to its other nodes when one node is unreachable. So we looked at how the client assigns jobs.

When the Redis Cluster client loads the node states, it only refreshes the state once a minute. So there’s a maximum one minute delay of state changes between the client and server. Within that minute, the Redis client kept sending queries to that unreachable slave node.

What happened to the write queries? Did we lose new data during that one min gap? That’s a very good question! The answer is no since all write queries only went to the master nodes and the Redis Cluster client with a watcher for the master nodes. So, whenever any master node becomes unreachable, the client is not oblivious to the change in state and is well aware of the current state. See the Watcher code.

Redis Cluster Client

One way to avoid a potential outage like our Apollo outage is to create another Redis Cluster client for pipelining only and with a true RouteByLatency value. The Redis Cluster determines the latency according to ping calls to its server.

In this case, all pipelining queries would read through the master nodesif the latency is less than 1ms ( code), and as long as the majority side of partitions are alive, the client will get the expected results. More load would go to master with this setting, so be careful about CPU usage in the master nodes when you make the change.

Pipeline Usage

In some cases, the master nodes might not handle so much traffic. Another way to mitigate the impact of an outage is to check for errors on individual queries when errors happen in a pipeline call.

In Grab’s Redis Cluster library, the function Pipeline(PipelineReadOnly) returns a response with an error for individual reply.

Instead of returning nil or an error message when err != nil, we could check for errors for each result so successful queries are not affected. This might have minimized the outage's business impact.

Go Redis Cluster Library

One way to fix the Redis Cluster library is to reload nodes’ status when an error happens.In the go-redis library, defaultProcessor has this logic, which can be applied to defaultProcessPipeline.

We’ve shown how to build a local Redis Cluster server, explained how Redis Clusters work, and identified its potential risks and solutions. Redis Cluster is a great tool to optimize service performance, but there are potential risks when using it. Please carefully consider our points about how to best use it. If you have any questions, please ask them in the comments section.

Originally published at https://engineering.grab.com.

--

--

Grab
Grab
Editor for

Southeast Asia’s leading Online to Offline (O2O) mobile platform, providing the everyday services that matter most to consumers. www.grab.com