Cold Cache, Warm Heart

Kent Hoxsey
Feb 15, 2018

Recently, we did a hard thing in a stressful situation. And in getting it done, some of my teammates came up with brilliant ideas. Not just one person, not just one idea, but many. This blog post is about the experience of bringing smart people together to handle a challenge. And a cool hack.

The Pain — AWS and Spectre/Meltdown

[Image: Nodes in a cluster run hotter as mitigations hit]

To make up for that loss, we scaled up wherever we could. We added nodes to any service that even looked like it might slow down.

The Impact

Suspicious Behavior

Even as we started investigating, we had a suspect in mind. Many of our services use ElastiCache Redis nodes to cache database queries, and we were hitting the maximum network throughput those cache nodes can manage. This problem has a clear signature in our monitoring: an obvious flat line in the network charts.
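If you want to hunt for that signature yourself, a minimal sketch with boto3 might look like the following. The cluster id is hypothetical and the bandwidth figure is illustrative; the real cap depends on the node type.

```python
# Minimal sketch: pull NetworkBytesOut for an ElastiCache node and check
# for the telltale flat line. Cluster id and cap figure are illustrative.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElastiCache",
    MetricName="NetworkBytesOut",
    Dimensions=[{"Name": "CacheClusterId", "Value": "state-cache-001"}],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])

# A ~1 Gbps node moves roughly 125 MB/s, so ~7.5 GB per minute at the cap.
cap_bytes_per_minute = 125_000_000 * 60
pinned = [p for p in points if p["Average"] >= 0.95 * cap_bytes_per_minute]

# A healthy cache varies with load; a saturated one sits pinned at the cap.
if points and len(pinned) > len(points) // 2:
    print("NetworkBytesOut is flat-lined at the cap")
```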

Remediation

Unfortunately, no. This is where the story begins.

Whoops

For most of our caches this was simply an exercise in due diligence: analyzing the system for impact, estimating the danger, and attempting to mitigate the effect. Sometimes it was possible to scale up the data source, but more often we took the simpler path of identifying less-critical times of day when the load would be more manageable.

But one of our caches was more than just a cache of data: it was the only source. As we discussed its criticality with the developers reviewing the caching code, we realized that the code was running a state machine and caching the state between events. There was no source for the data except the incoming stream of events and the actions of the code itself.
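To make the shape of that concrete, here is a minimal sketch of the pattern, with entirely hypothetical names and transitions. The only copy of each entity's state lives in Redis, and each event's effect depends on the state that came before it, so losing the cache means losing the state machine itself.

```python
# Hypothetical sketch of the pattern: the only copy of each entity's state
# lives in Redis, updated as events arrive. Lose the cache and there is
# nothing to rebuild from except future events.
import json

import redis

r = redis.Redis(host="cache.example.internal", port=6379)

def handle_event(event):
    key = f"state:{event['entity_id']}"
    raw = r.get(key)
    state = json.loads(raw) if raw else {"status": "new"}

    # Advance the state machine; the transition depends on the prior state.
    if event["type"] == "start" and state["status"] in ("new", "stopped"):
        state["status"] = "running"
    elif event["type"] == "stop" and state["status"] == "running":
        state["status"] = "stopped"

    r.set(key, json.dumps(state))
    return state
```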

[Image: One important service, no fallback data source]

No system to query from, no way to apply our method. To make matters worse, this particular cache is quite important to our users. We cannot tolerate any real downtime from this cache, and starting from a cold cache was out of the question.

We needed to figure out a way to create the new cluster offline and warm it up before switching the live flow over.

ElastiCache Snapshots

ElastiCache can seed a new cluster from a snapshot of an existing one. But in this case, the amount of downtime to spin up a new cluster was unacceptable, even before figuring in the time to take the snapshot. We discussed how to manage a transition like this, but set the idea aside for more desperate times.
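For reference, the snapshot path we passed on looks roughly like this in boto3; the identifiers and node type are hypothetical. The window between taking the snapshot and the new cluster coming up is exactly where the downtime, or the lost writes, live.

```python
# Hedged sketch of the snapshot-and-restore path (identifiers hypothetical).
import boto3

ec = boto3.client("elasticache")

# 1. Snapshot the existing node. This takes a while on a busy node.
ec.create_snapshot(
    CacheClusterId="state-cache-001",
    SnapshotName="state-cache-premigration",
)

# 2. Once the snapshot completes, seed a new cluster from it.
ec.create_cache_cluster(
    CacheClusterId="state-cache-new-001",
    Engine="redis",
    CacheNodeType="cache.r4.large",
    NumCacheNodes=1,
    SnapshotName="state-cache-premigration",
)

# Anything written to the old cache after step 1 never reaches the new
# cluster, which is fatal for a cache that is the only copy of its data.
```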

(Lack of) Replication tools

One of our teammates suggested building a simple Redis proxy to pass commands through to our main cache while also playing them against the new cluster to warm it up. We discussed this idea for a while, but were ultimately concerned about trying it for the first time under such dire circumstances. Again, we set the idea aside to see what else we could figure out.
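The shape of that idea, sketched at the client level rather than as a real wire-protocol proxy, with hypothetical hosts:

```python
# Toy sketch of the mirroring idea (hosts hypothetical): reads come from
# the live cache, writes are replayed best-effort against the new cluster.
import redis

class MirroringCache:
    def __init__(self, live, warm):
        self.live = live
        self.warm = warm

    def get(self, key):
        return self.live.get(key)

    def set(self, key, value):
        result = self.live.set(key, value)
        try:
            self.warm.set(key, value)  # never let warming fail the live path
        except redis.RedisError:
            pass
        return result

cache = MirroringCache(
    live=redis.Redis(host="cache-live.example.internal"),
    warm=redis.Redis(host="cache-new.example.internal"),
)
```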

What we were seeking in our time of need was a solution that used the systems and skills we already had. Improvising under duress is hard enough; doing so with the added uncertainty of new tools and new code elevates the stress to brain-bending levels.

Eureka

Our services pass events around with NSQ, where every channel attached to a topic receives its own full copy of the message stream. But wait, there's more. Another thing you can do with NSQ is create ephemeral topics and channels, which are not buffered to disk but instead drop messages after the queue reaches a certain size. In the past we have used these as a quick hack to create a dev-null stream: a way to sink the output from a service without disrupting the rest of the system.
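For the curious, the dev-null trick looks roughly like this with pynsq; the topic name is hypothetical, and the #ephemeral suffix is what makes the channel disposable.

```python
# Sketch of a dev-null consumer: an #ephemeral channel that acks and
# discards every message, keeping nothing on disk.
import nsq

def discard(message):
    return True  # returning True finishes (acks) the message

reader = nsq.Reader(
    message_handler=discard,
    lookupd_http_addresses=["http://127.0.0.1:4161"],
    topic="cache-driver-output",      # hypothetical topic name
    channel="devnull#ephemeral",
    max_in_flight=100,
)

nsq.run()
```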

[Image: Duplicating the event stream into a clone of the stateful service]

Knowing that we could easily duplicate the entire service inputs and outputs without disrupting the main production service got us thinking. We could run an entirely separate copy of our problem service just for its side effects, for the sole purpose of populating a new cache. At this point we began to get a little excited, and to map out the configuration we would need to fire up a clone service to warm up a new cache cluster.

The Really Hard Part

In the case of our state-machine cache driver, we configure the NSQ details in Consul K/V values and let consul-template rewrite local configuration files when we make changes. The service itself only reads that configuration on startup, a property that turns out to be important in this case.

The plan was to turn off consul-template on the running service group, effectively disconnecting those machines from the config changes we were about to make. With that done, we could reconfigure all of the NSQ settings and start up a second full copy of the cache driver service pointing at the new cluster.
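Roughly the shape of those changes, using python-consul and entirely hypothetical key names. The clone gets its own channel on the same input topic, so it receives a full copy of the stream; its outputs go to an ephemeral topic; and its Redis host points at the new cluster.

```python
# Hedged sketch of the K/V changes for the clone (key names hypothetical).
# With consul-template stopped on the live hosts, none of this touches
# the running production service.
import consul

c = consul.Consul(host="consul.example.internal")

# A fresh channel on the same input topic gives the clone its own full
# copy of the event stream.
c.kv.put("cache-driver-warm/nsq/input_channel", "cache-driver-warm")

# An ephemeral output topic sinks the clone's side effects.
c.kv.put("cache-driver-warm/nsq/output_topic", "cache-driver-out#ephemeral")

# And the clone writes its state to the new cluster.
c.kv.put("cache-driver-warm/redis/host", "cache-new.example.internal")
```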

That second version of the service consumes a full duplicate of the main NSQ stream and populates the new cache. Importantly, we throw away all of the outputs it produces, so its only impact on the world is that new, warm cache.

[Image: Switch over to use clustered cache, and decommission the rest]

After 24 hours, both the old cache and the new cache contain the same data, and we can go through the delicate dance of shutting down the cache-warming service and switching the main service over to the new clustered cache.

Conclusions

There is a compelling scene in the movie Apollo 13, when the crew discovers that the available replacement CO2 filters for the command and lunar modules are different shapes and sizes, and they are running out of filters for the lunar module. A team of engineers sits down to figure out how to connect the available square filters to the air system that uses round ones, using only the tools and items available to the astronauts.

Our constraints were enormously less difficult, our criticality infinitely lower. But as engineers it is the challenge that drives us, the puzzle of the existing constraints, the investigation of degrees of freedom, and the satisfaction of a solution within the boundaries.

