How Redis pipelining helped us improve performance by 15x

Deepak Sreekumar
Published in SAFE Engineering
4 min read · Oct 4, 2022

Scaling issues are usually the most challenging, frustrating, and ultimately most rewarding problems to solve. With the number of customers and the scale steadily increasing for SAFE, our CRQM platform, our teams are faced with such problems almost every day.

We woke up one day to such an issue, graciously raised overnight by our SIT (System Integration Testing) team. One of the workers attached to the Amazon MQ broker was performing extremely slowly under load.

Yet another Internet Explorer meme was born…

This worker was responsible for prioritizing and categorizing security findings across different assets based on their severity. It used the Redis Sorted Set data structure to maintain a leaderboard and perform related operations, and it had to execute a lot of such Redis commands for every message. After a couple of quick tests eliminated the DB (MySQL) as the bottleneck, Redis was naturally the next place to look.
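To illustrate the leaderboard pattern, here is a minimal sketch with invented finding names and scores. An in-memory stand-in plays the role of the Redis Sorted Set so the example is self-contained; it is not a real client.

```javascript
// Minimal in-memory stand-in for a Redis Sorted Set (ZSET), used only to
// illustrate the leaderboard pattern; the real worker talks to Redis.
class SortedSet {
  constructor() { this.scores = new Map(); }
  zadd(member, score) { this.scores.set(member, score); }
  // ZREVRANGE equivalent: members ordered by score, highest first
  zrevrange(start, stop) {
    return [...this.scores.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(start, stop + 1)
      .map(([member]) => member);
  }
}

// Hypothetical severity scores for findings on an asset
const findings = new SortedSet();
findings.zadd('finding:sql-injection', 9.8);
findings.zadd('finding:weak-cipher', 5.3);
findings.zadd('finding:open-port', 7.5);

const topFindings = findings.zrevrange(0, 1); // the two most severe findings
```

With a real client the signatures differ slightly: Redis's ZADD takes the key first (`ZADD key score member`), and ZREVRANGE likewise operates on a key.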

We had recently moved to AWS ElastiCache (more on that here) as an alternative to the container-based Redis server. Although it brought us a lot of advantages, we could not expect the managed service to match the round-trip time (RTT) of a container running on the same machine. Some slowness was to be expected, but not what we were observing. The first step was to check whether the increased network latency was the root cause. We performed the same load test using the container-based Redis and, surprisingly, the slowness was still reproducible in this setup, which ruled out network delay as the culprit.

Time to bring out the big guns…

We drilled down a bit further with the help of Datadog's Continuous Profiler and observed that most of the worker's execution time was spent in Redis operations. This was a promising lead, as ZSET operations can be slow at times.

  • We needed to track all the queries and their execution times on the Redis server side during the processing of a message. This was straightforward to do by setting the slowlog threshold to 0:
config set slowlog-log-slower-than 0
  • We summed the individual execution times of all the queries, but curiously, they accounted for only a tiny fraction of the worker's total execution time (5 ms out of 350 ms), far too little to explain the slowness. This eliminated the possibility of the Redis queries themselves being slow.
  • A detailed investigation into the flame graphs from the profiler highlighted the writeUtf8String as a time-consuming operation.
  • We were using ioredis as the Redis client and a quick look at their GitHub repo showed that this is part of the method responsible for writing the Redis queries to the command queue. (We also quickly tried out some alternative libraries but the performance was much worse).
  • The worker under the microscope was executing ~1500 individual Redis commands for every message. Those who are familiar with Redis internals might be quick to rightly point out that this is against best practices and could cause a slowdown.
  • For such use cases, Redis pipelining is a recommended technique for improving performance by issuing multiple commands at once without waiting for the response to each command.
  • However, to use pipelining, the worker should be able to issue a batch of commands at once, without needing the response of one command before sending the next. This was not an option for us with the current structure of the worker, as the order of execution mattered to the workflow.
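To make the trade-off concrete, here is a small self-contained sketch using a stand-in client that just counts round trips. The pipeline()/exec() shape mirrors ioredis, but the mock itself is ours, not the library.

```javascript
// Stand-in Redis client that only counts network round trips, to show
// what pipelining buys: N awaited commands cost N round trips, while a
// pipelined batch costs one.
class RoundTripCounter {
  constructor() { this.roundTrips = 0; }
  zadd() { this.roundTrips += 1; }          // each awaited command = 1 RTT
  pipeline() {
    const self = this;
    let queued = 0;
    return {
      zadd() { queued += 1; return this; }, // queued locally, no RTT yet
      exec() { self.roundTrips += 1; return queued; }, // whole batch = 1 RTT
    };
  }
}

const plain = new RoundTripCounter();
for (let i = 0; i < 1500; i++) plain.zadd();  // 1500 round trips

const pipelined = new RoundTripCounter();
const pipe = pipelined.pipeline();
for (let i = 0; i < 1500; i++) pipe.zadd();
pipe.exec();                                  // 1 round trip
```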

Time for…

With some effort, we were able to refactor the worker to split the ~1500 individual executions into 3 batches (write, read, write), which needed to be executed in order. We used pipelining to execute each of these batches.
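A sketch of what such a refactor can look like. Command names and keys are invented, and the tiny in-memory client below stands in for ioredis so the example is self-contained; with the real library, pipeline() and exec() follow the same shape, with exec() resolving to an array of [error, result] pairs.

```javascript
// Tiny in-memory stand-in for an ioredis client (illustrative only).
class MiniRedis {
  constructor() { this.zsets = new Map(); this.strings = new Map(); }
  pipeline() {
    const queued = [];
    const self = this;
    const api = {
      zadd(key, score, member) { queued.push(() => self._zadd(key, score, member)); return api; },
      zscore(key, member) { queued.push(() => self._zscore(key, member)); return api; },
      set(key, value) { queued.push(() => self._set(key, value)); return api; },
      // ioredis-style exec(): resolves to an array of [error, result] pairs
      exec() { return Promise.resolve(queued.map(fn => [null, fn()])); },
    };
    return api;
  }
  _zadd(key, score, member) {
    if (!this.zsets.has(key)) this.zsets.set(key, new Map());
    this.zsets.get(key).set(member, score);
    return 1;
  }
  _zscore(key, member) { return this.zsets.get(key)?.get(member) ?? null; }
  _set(key, value) { this.strings.set(key, String(value)); return 'OK'; }
}

// Three ordered batches, each issued as one pipeline instead of ~1500
// individually awaited commands.
async function processMessage(redis, findings) {
  // Batch 1: writes — record each finding's severity in the leaderboard
  const writes = redis.pipeline();
  for (const f of findings) writes.zadd('findings:by-severity', f.severity, f.id);
  await writes.exec();

  // Batch 2: reads that must observe batch 1's writes
  const reads = redis.pipeline();
  for (const f of findings) reads.zscore('findings:by-severity', f.id);
  const scores = await reads.exec();

  // Batch 3: writes derived from the reads
  const updates = redis.pipeline();
  scores.forEach(([, score], i) => updates.set(`finding:${findings[i].id}:score`, score));
  await updates.exec();
  return redis;
}
```

The key property is that commands within a batch never depend on each other's responses; only the batch boundaries are ordered, so each boundary is a single round trip.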

After an anxious run of the load test, the terminal stats flashed green, informing us that the worker was 10–12x faster! The writeUtf8String method no longer consumed a large chunk of the execution time. Hurray 🎉!

Some additional refactoring helped us to limit the number of batches to 2 (reducing the total Redis commands by 30%). This helped us to get to a 15x improvement from the initial state!

In Conclusion

Redis pipelining should be the way to go when you need to run a large number of commands in a short period, as long as later commands do not depend on the responses of earlier ones. It helps reduce not only the total RTT, by avoiding a full network round trip per command, but also the socket I/O cost, since batches of commands can be read and written with single system calls.
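As a rough back-of-envelope (the RTT figure below is an assumption for illustration, not a measurement from our setup), the round-trip cost alone tells the story:

```javascript
// Assumed numbers for illustration only, not measured values.
const rttMs = 1;                           // assumed RTT to the Redis server
const commands = 1500;                     // commands per message
const sequentialRttMs = commands * rttMs;  // one round trip per command: 1500 ms
const batches = 3;                         // write, read, write
const pipelinedRttMs = batches * rttMs;    // one round trip per batch: 3 ms
```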

JIRA purred in its sleep while one HIGH bug on the board moved to Done...

The movie continues where our troopers clash against the onslaught of bugs. Please clap and follow us if you’d like a front-row seat to the battle!
