Redis as a highly available database

Lev Petrushchak
Published in The Quiq Blog, Apr 24, 2020

Once upon a time, we decided to store some data. We chose Redis, an in-memory data structure store that can be used as a database, cache, and message broker, and we use it in pretty much all of those roles.

We started with a popular hosted Redis solution, which was nice and simple for our use case. But as the product grew, our requirements became stricter: 100% uptime, control over maintenance windows, splitting DBs with no downtime, and so on.

Moreover, if your company wants to pass a security audit (like SOC 2), you have to keep all data encrypted. This, of course, adds overhead to your application and latency to requests.

As a result, we decided to move off the remotely hosted service so that Redis and our application services would run in the same place.

Since AWS is our main cloud provider, Amazon ElastiCache for Redis was the obvious thing to try. And here we go again. As far as Amazon is concerned, it's just a cache, and they don't seem to care about data loss. When you need to reboot the whole cluster, all data is flushed. In case of failure, the expected failover time is close to 1 minute, which was not OK for us. One more fun thing about ElastiCache is that you can't change the DB password; you have to recreate your database.

That was more than enough to understand that we needed something custom. Luckily, Redis is open source and ships with Sentinel for high availability, so we decided to try it.

After a lot of research and testing, we finally built it: a highly available Redis cluster, https://github.com/Quiq/ha-redis, which can lose its master and fail over in at most 7 seconds. Most importantly, we are confident that no accidental writes will reach the old master if it was simply restarted. With HAProxy we get SSL-encrypted transport from the application to the Redis host (which Redis itself does not yet support). The application connects to the Sentinels and follows them through every failover. With additional Python scripts, we can perform a manual failover in less than 2 seconds. We also covered a lot of other edge cases to make the cluster reliable and highly available.
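To illustrate what "the application connects to the Sentinels" can look like on the wire, here is a minimal Python sketch that asks each Sentinel in turn for the current master address using the standard SENTINEL get-master-addr-by-name command. It is not the code from our repository: the function names, the master name "mymaster", and the addresses in the usage note are assumptions made for the example, and it talks plain TCP with no authentication or SSL. A real application would normally rely on a Sentinel-aware client library instead.

```python
import socket


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def parse_master_reply(reply: bytes):
    """Parse the reply to SENTINEL get-master-addr-by-name.

    Returns (host, port), or None when the Sentinel does not know the
    master (null reply) or answers with an error.
    """
    lines = reply.split(b"\r\n")
    if lines[0] != b"*2" or len(lines) < 5:
        return None
    # Layout: *2, $<len>, <host>, $<len>, <port>
    return lines[2].decode(), int(lines[4])


def find_master(sentinels, master_name, timeout=0.5):
    """Ask each Sentinel in turn and return the first master address found.

    `sentinels` is a list of (host, port) pairs (hypothetical values).
    Unreachable Sentinels are skipped, so the lookup still works while
    one of them is down.
    """
    command = encode_command("SENTINEL", "get-master-addr-by-name", master_name)
    for host, port in sentinels:
        try:
            with socket.create_connection((host, port), timeout=timeout) as conn:
                conn.sendall(command)
                # Simplification: assumes the short reply arrives in one read.
                addr = parse_master_reply(conn.recv(4096))
        except OSError:
            continue
        if addr is not None:
            return addr
    return None
```

Calling find_master([("10.0.0.1", 26379), ("10.0.0.2", 26379)], "mymaster") again after a connection error is the simplest way for a client to follow a failover: the Sentinels start returning the address of the newly promoted master.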

Please try out our solution and send us any feedback. This code has been running in production since June 2019 with no downtime or data loss, including through Redis and kernel upgrades. Having an always-on database is one of the biggest challenges for startups, so we hope other companies will find this helpful.
