How we tamed the Redis Cluster Down demon

By Games24x7 · Published in Life at Games24x7 · 3 min read · Oct 5, 2020

We at Games24x7 depend heavily on Redis. We use it to cache a lot of real-time data, including multiplayer game data, game leaderboards, and device session data. If the Redis cluster goes down even for a couple of seconds, it can wreak havoc in our system: failures cascade and games eventually get stuck. Bad user experience! Revenue loss!

A few weeks ago, Redis cluster-down errors started appearing intermittently in one of our self-hosted Redis clusters on AWS EC2 machines. They were accompanied by a bunch of other errors, such as socket errors and redirection errors. Inter-AZ packet loss could not be the cause, since all our Redis nodes reside in a single AWS AZ.

We also run a large Redis cluster of 32 nodes, which is capable of handling very high IO. So the volume of IO we were performing, i.e. 15K IOPS on the cluster, could not be the reason for the cluster-down issue either.

After careful scrutiny of the Redis server logs, we noticed a repeating pattern: each time the cluster went down, it stayed in the fail state for exactly 5 seconds.
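As a rough illustration of how such a pattern can be measured, the sketch below pairs each "Cluster state changed: fail" line with the next "Cluster state changed: ok" line and reports the length of every fail window. The log lines in SAMPLE_LOG are made up for the example and are not taken from our servers. The regular expression assumes the usual Redis log layout (pid:role, timestamp, level, message) and may need adjusting for other setups.

```python
import re
from datetime import datetime

# Assumed layout: "<pid>:<role> <DD Mon YYYY HH:MM:SS.mmm> <level> <message>"
STATE_LINE = re.compile(
    r"^\d+:\w (?P<ts>\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) \S "
    r"Cluster state changed: (?P<state>fail|ok)\s*$"
)


def fail_windows(lines):
    """Return the length in seconds of every fail -> ok window in the log."""
    windows = []
    failed_at = None
    for line in lines:
        match = STATE_LINE.match(line)
        if match is None:
            continue  # not a cluster state change line
        ts = datetime.strptime(match["ts"], "%d %b %Y %H:%M:%S.%f")
        if match["state"] == "fail":
            if failed_at is None:
                failed_at = ts
        elif failed_at is not None:
            windows.append((ts - failed_at).total_seconds())
            failed_at = None
    return windows


# Synthetic sample lines, for illustration only.
SAMPLE_LOG = [
    "1234:M 01 Jan 2020 10:15:00.100 # Cluster state changed: fail",
    "1234:M 01 Jan 2020 10:15:03.000 * Background saving started by pid 999",
    "1234:M 01 Jan 2020 10:15:05.100 # Cluster state changed: ok",
    "1234:M 01 Jan 2020 10:42:30.250 # Cluster state changed: fail",
    "1234:M 01 Jan 2020 10:42:35.250 # Cluster state changed: ok",
]

print(fail_windows(SAMPLE_LOG))  # [5.0, 5.0]
```

A list of near-identical durations like this is what makes a fixed timeout, rather than load or network trouble, a likely suspect.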

With this failure-and-recovery pattern in hand, we started exploring the Redis cluster configuration and its values, and came across the following two settings.

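cluster-node-timeout <milliseconds>: The maximum amount of time a Redis Cluster node can be unavailable without being considered as failing.
lua-time-limit <milliseconds>: The maximum execution time of a Lua script. Once it is exceeded, Redis logs the slow script and starts replying to other clients with a BUSY error; the script itself is not stopped automatically.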
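A minimal sketch of how these two settings could be read from a node and compared with the observed outage length follows. It assumes a client object exposing config_get(pattern) that returns a dict, which is how the redis-py `redis.Redis` client behaves. StubNode and the helper names are hypothetical and exist only so the example runs without a server.

```python
TIMEOUT_SETTINGS = ("cluster-node-timeout", "lua-time-limit")


def read_timeouts(node):
    """Fetch both settings (in milliseconds) from a single node.

    `node` is anything with config_get(pattern) -> dict, for example a
    redis-py `redis.Redis` client connected to one cluster node.
    """
    return {name: int(node.config_get(name)[name]) for name in TIMEOUT_SETTINGS}


def settings_matching_outage(timeouts_ms, outage_seconds):
    """Names of the settings whose value equals the observed outage length."""
    return sorted(
        name for name, ms in timeouts_ms.items() if ms == outage_seconds * 1000
    )


class StubNode:
    """Hypothetical stand-in for a real client, for illustration only."""

    def __init__(self, settings):
        self._settings = settings

    def config_get(self, pattern):
        return {pattern: self._settings[pattern]}


node = StubNode({"cluster-node-timeout": "5000", "lua-time-limit": "5000"})
timeouts = read_timeouts(node)
print(settings_matching_outage(timeouts, outage_seconds=5))
```

Against a real cluster, the same helper would be called once per node, since each node holds its own configuration.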

Both settings were set to 5 seconds. Interesting, isn’t it? This suggested that the culprit could be a long-running Redis command or Lua script that made a Redis master node appear failed for 5 seconds and, in turn, made the cluster unreachable.

To find the long-running command, we changed the Redis slow log configuration to record all long-running Redis commands and Lua scripts. The slow log revealed that there was indeed a Lua script that took around 1.5 seconds and was executed several times in a row. The cumulative execution time of this batch of script runs was greater than 5 seconds. After we fixed this Lua script, the cluster-down, socket, and redirection errors stopped completely.
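The cumulative effect can be sketched as follows. The function groups slow log entries that ran back to back and flags any group whose total execution time exceeds the node timeout. It assumes entries shaped like the dicts returned by redis-py's `slowlog_get()` (`id`, `start_time` in Unix seconds, `duration` in microseconds, `command`). The sample entries, the count of four runs, the one-second slack, and the function name are illustrative assumptions, not our actual data or tooling.

```python
def find_blocking_bursts(entries, node_timeout_ms=5000, slack_s=1):
    """Group back-to-back slow log entries and keep the groups whose summed
    duration exceeds node_timeout_ms.

    Returns a list of (first_start_time, total_ms, entry_count) tuples.
    """
    bursts = []
    current = []
    busy_until = None
    for entry in sorted(entries, key=lambda e: (e["start_time"], e["id"])):
        start = entry["start_time"]
        # A gap larger than the slack ends the current burst.
        if current and start > busy_until + slack_s:
            bursts.append(current)
            current = []
        if not current:
            busy_until = start
        current.append(entry)
        busy_until = max(busy_until, start) + entry["duration"] / 1_000_000

    if current:
        bursts.append(current)

    flagged = []
    for burst in bursts:
        total_ms = sum(e["duration"] for e in burst) / 1000
        if total_ms > node_timeout_ms:
            flagged.append((burst[0]["start_time"], total_ms, len(burst)))
    return flagged


# Synthetic entries: four 1.5 s script runs in a row, then an isolated one.
SAMPLE_ENTRIES = [
    {"id": 1, "start_time": 1000, "duration": 1_500_000, "command": "EVALSHA ..."},
    {"id": 2, "start_time": 1001, "duration": 1_500_000, "command": "EVALSHA ..."},
    {"id": 3, "start_time": 1003, "duration": 1_500_000, "command": "EVALSHA ..."},
    {"id": 4, "start_time": 1004, "duration": 1_500_000, "command": "EVALSHA ..."},
    {"id": 5, "start_time": 2000, "duration": 1_500_000, "command": "EVALSHA ..."},
]

print(find_blocking_bursts(SAMPLE_ENTRIES))  # [(1000, 6000.0, 4)]
```

With redis-py, the input would come from `slowlog_get()` on each node, after lowering the `slowlog-log-slower-than` threshold (in microseconds) through `config_set()`. No single 1.5-second run crosses the 5-second limit, but four in a row add up to 6 seconds, which is why the burst is flagged and the isolated run is not.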

Redis redirection error before and after Lua-script fix
Redis socket time-out error before and after Lua-script fix

Conclusion

Any Redis command that takes a long time to execute can affect the cluster state. Redis executes commands on a single thread, and the cluster heartbeats (ping/pong messages) are handled by that same thread. When a long-running command delays a node's ping/pong replies, the rest of the cluster may wrongly conclude that the node is down, and the cluster eventually becomes unavailable.
