HA Redis for “smart” and “dumb” clients

Nathan Williams
Treehouse Engineering
Apr 20, 2016 · 5 min read

Redis is an in-memory data structure store that can be used as a lightweight database, caching, or queuing system. Due to its speed and flexibility, it is a very popular infrastructure component, and can be used to solve a variety of problems, such as leaderboards or fraud scoring, among many, many others.

But when selecting infrastructure components to build application functionality on, features aren't the only consideration: just as important are the service's ability to scale and to be made highly available. Redis provides a solution for active/passive high availability called Redis Sentinel, and Sentinel's approach to HA is elegant in its simplicity, much like Redis itself.

In short, Sentinel runs as a cluster of processes that monitor and manage a group of Redis servers using gossip protocols, while also providing service discovery and failover notification to clients. If you run Sentinel next to your clients, master election can even be based on the clients' perspective of server availability! Sounds pretty cool, right?
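For example, a Sentinel-aware client (or a curious human) can ask any Sentinel where the current master lives. A quick illustration with redis-cli, assuming the conventional Sentinel port 26379 and a master named "mymaster" (the address in the output is just an example):

    # Ask Sentinel for the address of the master known as "mymaster"
    $ redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
    1) "10.0.0.11"
    2) "6379"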

But wait…

As a consequence of this approach, high availability for Redis depends on Sentinel support in the client, which needs to know how to ask Sentinel where to connect initially and how to handle failover notifications. While there are good Sentinel-aware Redis libraries available for most languages, not all of your applications may have been written to use them. You may also find that not all of the Redis clients in your infrastructure support Sentinel (two fairly common ones that we’ve encountered are Hipache and Logstash).
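Those failover notifications arrive over Sentinel's pub/sub interface: when a failover completes, Sentinel publishes a +switch-master event naming the old and new master. Shown here with redis-cli purely for illustration (the addresses and master name are placeholders, and the subscription confirmation is omitted):

    # Watch for failover events from a Sentinel
    $ redis-cli -p 26379 SUBSCRIBE +switch-master
    1) "message"
    2) "+switch-master"
    3) "mymaster 10.0.0.11 6379 10.0.0.12 6379"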

So, keeping these “dumb” clients functional in the event of a failover requires either dynamic re-configuration and reloading, or some way to re-route their traffic to the new Redis master.

How we do it at Treehouse

Re-routing could be handled by connecting the “dumb” clients to Redis through a proxy like HAProxy or twemproxy and building failover support into the proxy configuration, or by deploying a virtual IP (VIP) and tying that VIP to the state of the Redis cluster.

Offloading to a proxy just moves the goalposts, because you then have to figure out how to make the proxy itself HA, unless you run local proxies on every dumb client node, similar to SmartStack; and that is a lot of extra work for a pretty simple problem. Not wanting to add a proxy layer and then still be faced with making it HA, and not wanting to run a fleet of local proxies for the dumb clients, we elected to use a VIP on the server side.

The next choice, of course, is what to use to manage the VIP. We already use keepalived for VIP management in several other parts of our infrastructure (e.g. our load-balancers), so keepalived was a no-brainer for us, though something like the Pacemaker/Corosync combo would have worked fine as well. At Treehouse, we tend to use Pacemaker/Corosync where we need more advanced cluster logic, but for simpler active/passive setups, keepalived does quite well.

So how to implement a keepalived-managed VIP for the cluster, then?

The first thing to do is to allocate a VIP. In OpenStack, this is done by creating a new port in the appropriate subnet, and adding it to the allowed_address_pairs for the ports associated with your Redis server instances. We’ve made our OpenStack Heat template for this available here.
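Our Heat template is linked above; as a rough sketch of its shape (the network, subnet, and addresses below are placeholders, not the values from our template), the key pieces are a port that reserves the VIP and an allowed_address_pairs entry on each Redis server's port:

    heat_template_version: 2015-04-30

    resources:
      # Port that reserves the VIP in the subnet
      redis_vip_port:
        type: OS::Neutron::Port
        properties:
          network: private-net
          fixed_ips:
            - subnet: private-subnet
              ip_address: 10.0.0.100

      # Each Redis server's port must be allowed to answer for the VIP
      redis01_port:
        type: OS::Neutron::Port
        properties:
          network: private-net
          fixed_ips:
            - subnet: private-subnet
          allowed_address_pairs:
            - ip_address: 10.0.0.100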

In AWS this is a little trickier, as you need to create an EIP and then reassign it to the new master on failover. We haven’t had to do this, so YMMV, but in theory at least, replacing the virtual_ipaddress clause in our example keepalived configuration with a notify_master script to handle the EIP reassignment should do the trick.
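We haven't run this in production, but the general shape of such a notify_master hook would be something like the sketch below, using the AWS CLI (the allocation ID is a placeholder, and the instance needs IAM permission for ec2:AssociateAddress):

    #!/bin/sh
    # notify_master: run by keepalived when this node becomes MASTER.
    # Instead of moving a VIP, re-point the Elastic IP at this instance.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws ec2 associate-address \
        --allocation-id eipalloc-0123456789abcdef0 \
        --instance-id "$INSTANCE_ID" \
        --allow-reassociation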

After allocating the VIP, the next step is to install keepalived on your Redis server nodes and configure it to manage the VIP and track the Redis master. In our research, all the solutions we found for combining keepalived and Redis involved replacing Sentinel as the cluster manager, with keepalived handling master promotion and slave configuration via notify_* scripts. That isn't as convenient as using Sentinel for the “smart” clients, though, and it fails to account for the clients' perspective on which Redis servers are reachable. We wanted a solution that would take a more passive role, one in which the VIP would merely “follow” the Redis master rather than managing it.

Happily, keepalived's vrrp_script feature is a perfect fit, and Redis exposes its replication role in a way that's easy to query. The solution we came up with looks like this.
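In essence it boils down to two pieces on each node: a vrrp_script that asks the local Redis whether it is the master, and a vrrp_instance that tracks that script. A minimal sketch (the VIP, interface, virtual_router_id, and paths are placeholders, not necessarily the values we run):

    # /etc/keepalived/keepalived.conf (sketch)

    vrrp_script chk_redis_master {
        # exits 0 only when the local Redis reports role:master
        script "/usr/local/bin/chk_redis_master"
        interval 2
        weight 50
    }

    vrrp_instance redis_vip {
        state BACKUP
        interface eth0
        virtual_router_id 51
        priority 100          # the same on every Redis server
        virtual_ipaddress {
            10.0.0.100/24
        }
        track_script {
            chk_redis_master
        }
    }

    #!/bin/sh
    # /usr/local/bin/chk_redis_master
    redis-cli -p 6379 info replication | grep -q '^role:master'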

This configuration is applied to each Redis server, with equal weights. The chk_redis_master vrrp_script performs a very simple query against the local Redis server to determine whether it is the master, and the check's weight is added to or subtracted from the server's priority score based on the script's exit status. The highest-scoring server then acquires the VIP and announces the address via a gratuitous ARP reply. When Sentinel fails the cluster over, the new master ends up with the higher score, and keepalived migrates the VIP to it, updating the clients' ARP tables via another gratuitous ARP. Voilà!

Gotchas

Using a VIP like this solves the address-reassignment problem, but since Redis clients talk to Redis servers over long-lived, persistent TCP connections, moving the IP they connect to only solves half the problem. Particularly if you're using a stateful firewall, your dumb clients need to be able to handle reconnections, especially if they use something like Redis' BLPOP, which may never raise a timeout on the client side.
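The worst case is a blocking call with no timeout: the client sends nothing and waits for a reply that, once the VIP has moved to a server with no record of that connection, will never come. For illustration (the VIP and queue name are placeholders):

    # Blocks indefinitely waiting for an element to appear in "jobs";
    # after a failover this connection may simply hang, with no error raised.
    $ redis-cli -h 10.0.0.100 BLPOP jobs 0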

There are several solutions we considered for this:

First, you could keep your stateful firewall’s session table in sync between your servers. Conntrackd works well for this, and is frequently deployed in combination with keepalived for this purpose. We didn’t have prior experience with this tool, so we haven’t rolled it out yet.

Alternatively, you could force a re-connection from the server side by inserting (and then removing) a firewall rule that issues a TCP reset against connections to the VIP address. But, depending on your topology, this has a high likelihood of impacting clients that have already detected the issue and reconnected, or of missing clients that haven't sent any additional packets, as may be the case with clients waiting on the response to a BLPOP.
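For completeness, that trick would look something like the following on the node holding the VIP (the VIP address, port, and pause are placeholders), though for the reasons above we didn't go this route:

    # Briefly answer traffic to the VIP with a TCP reset, forcing clients
    # with stale connections to reconnect, then remove the rule again.
    iptables -I INPUT -d 10.0.0.100 -p tcp --dport 6379 \
        -j REJECT --reject-with tcp-reset
    sleep 5
    iptables -D INPUT -d 10.0.0.100 -p tcp --dport 6379 \
        -j REJECT --reject-with tcp-reset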

In our case, we just monitor the few dumb clients’ health from the client side and force reconnects if they’re stalled. We get additional value from doing it this way because we also detect client health issues that aren’t tied to Redis failovers, and we love solutions that solve for more than one problem.

Conclusion

Redis is a fantastic tool for building modern application stacks on, and while there are unfortunately a number of common utilities out there that don’t currently support Sentinel, it’s possible to provide a small amount of glue to help these “dumb” clients bridge the gap in order to achieve high availability.

We’re pretty happy with our solution, since it’s capable of supporting both “smart” clients via Redis’ standard methods, as well as supporting “dumb” clients, but we’re always interested in clever hacks and we’d love to hear how you’ve solved this problem if you’ve had to solve it as well!

Nathan is a Systems Developer on the Engineering team at Treehouse. We’re on a mission to design, build, and maintain the best online learning service in the world. Sound awesome? Join us!

For more posts like this one, be sure to follow the Treehouse Engineering publication. 👋
