Recently, we did a hard thing in a stressful situation. And in getting it done, accomplishing the goal, some of my teammates came up with some brilliant ideas. Not just one person, not just one idea, but many. This blog post is about the experience of bringing smart people together to handle a challenge. And a cool hack.
The Pain — AWS and Spectre/Meltdown
For a bunch of us who run a lot of virtual machines in AWS, January began with some excitement: a vulnerability that affected nearly all Intel chips and the systems that run on them. The changes AWS applied to mitigate that problem ended up causing a lot of our machines to run “hotter”, consuming more CPU than before. Overall, our fleet ended up losing 15–20% of its efficiency, but some workloads got hit harder than others.
To make up for that loss, we scaled up wherever we could. We added nodes to any service that even looked like it might slow down.
Unfortunately, not all of our services were able to scale up easily. We started noticing a part of our pipeline that was not keeping up, even when we added more nodes.
Even as we started investigating, we had a suspect in mind. Many of our services use ElastiCache Redis nodes to cache database queries, and we were hitting the maximum network throughput those cache nodes can manage. This problem has a clear signature in our monitoring: an obvious flat line in the network charts.
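To make that "flat line" signature concrete, here is a minimal sketch of how one might flag it in a series of throughput samples. It is an illustration, not our monitoring setup: the function name, the `ceiling` value, and the `tolerance` and `min_run` thresholds are all assumptions chosen for the example.

```python
from typing import Sequence


def looks_saturated(
    samples: Sequence[float],
    ceiling: float,
    tolerance: float = 0.05,
    min_run: int = 5,
) -> bool:
    """Return True if min_run consecutive samples sit within tolerance of ceiling.

    samples: throughput readings in any unit, oldest first (hypothetical input).
    ceiling: the suspected network limit of the node, in the same unit.
    """
    threshold = ceiling * (1.0 - tolerance)
    run = 0
    for value in samples:
        if value >= threshold:
            run += 1
            if run >= min_run:
                return True
        else:
            # A dip below the threshold breaks the plateau.
            run = 0
    return False


# A sustained plateau near the ceiling versus short bursts that touch it.
plateau = [100, 400, 980, 990, 1000, 995, 985, 990]
bursty = [100, 990, 200, 995, 300, 1000]
print(looks_saturated(plateau, ceiling=1000))
print(looks_saturated(bursty, ceiling=1000))
```

The point of requiring a run of samples rather than a single peak is that a healthy cache can touch its limit briefly, while a saturated one stays pinned there.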
We run quite a few caches, so our entire team went to work on the problem. ElastiCache allows for Clustered Redis caches, and switching our services over to use the clusters is a reasonable config change. We started with the hottest end of the pipeline, upgrading each cache in turn as the bottleneck moved downstream. Coordination and communication between dev and ops, great teamwork by all. Mission Accomplished, right?
Unfortunately, no. This is where the story begins.
Before changing over each cache, we would team up with one of the developers to review the code that populates the cache, figure out where the data originates, and try to estimate what kind of load we would impose on that part of the system to both serve our traffic and warm up the cache.
For most of our caches this was simply an exercise in due diligence, analyzing the system for impact, estimating the danger, and attempting to mitigate the effect. Sometimes it is possible to scale up the data source, but often we took the simpler path of identifying less-critical times of day when the load would be more manageable.
But one of our caches was more than just a cache of data; it was the only source. As we were discussing its criticality with the developers reviewing the caching code, we realized that the code was running a state machine and caching the state between events. There was no source for the data except for the incoming stream of events and the actions of the code itself.
No system to query from, no way to apply our method. To make matters worse, this particular cache is quite important to our users. We cannot tolerate any real downtime from this cache, and starting from a cold cache was out of the question.
We needed to figure out a way to create the new cluster offline and warm it up before switching the live flow over.
One of the first options we looked into was to take a snapshot of the existing cache and then create the new cache from that image. Unfortunately, it can take a not-small amount of time to bring up a new cache cluster.
In this case, the amount of downtime to spin up a new cluster was unacceptable, even without figuring in the time to take a snapshot. We discussed how to manage a transition like this, but set the idea aside for more desperate times.
(Lack of) Replication tools
Somewhat to our surprise, our google-fu did not turn up any tools to replicate one Redis instance to another. It seems like the kind of itch somebody would have scratched already, but we were unable to find anything.
One of our teammates suggested building a simple Redis proxy to pass through the commands to our main cache, but also play them against the new cluster to warm it up. We discussed this idea for a while but were ultimately concerned about testing it out for the first time under such dire circumstances. Again, we set this idea aside to see what else we could figure out.
What we were seeking in our time of need was a solution that used the systems and skills we already had. Improvising under duress is hard enough; doing so with the added uncertainty of new tools and new code elevates the stress to brain-bending levels.
As we continued to explore the problem, we began to see another way. The system that populates the cache acts on events streaming in from a single NSQ topic, and then writes out its various results to a bevy of other NSQ topics. While talking this through, we remembered an NSQ feature we use in other areas of our service: NSQ makes it really easy to duplicate a stream of events by creating a new channel on the NSQ topic. This allows multiple applications to consume a topic in its entirety, without impacting the other applications.
But wait, there’s more. Another thing you can do with NSQ is to create ephemeral topics and channels, which are not buffered to disk but instead drop messages after the queue reaches a certain size. In the past we have used these as a quick hack to create a dev-null stream, a way to sink the output from a service without disrupting the rest of the system.
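The two NSQ behaviors described above can be illustrated with a small in-memory model. This is a toy sketch of the semantics only, not NSQ itself and not our code: the `Topic` and `Channel` classes, the channel names, and the tiny `mem_queue_size` are assumptions made up for the example. The one detail taken from NSQ is the "#ephemeral" name suffix that marks a topic or channel as ephemeral.

```python
from collections import deque
from typing import Deque, Dict


class Channel:
    """One consumer group's private queue of a topic's messages (toy model)."""

    def __init__(self, name: str, mem_queue_size: int = 3) -> None:
        self.name = name
        # NSQ marks ephemeral topics and channels with a "#ephemeral" suffix.
        self.ephemeral = name.endswith("#ephemeral")
        self.mem_queue_size = mem_queue_size
        self.memory: Deque[str] = deque()
        self.disk: Deque[str] = deque()  # stand-in for the on-disk overflow queue
        self.dropped = 0

    def put(self, message: str) -> None:
        if len(self.memory) < self.mem_queue_size:
            self.memory.append(message)
        elif self.ephemeral:
            # Ephemeral: overflow is discarded instead of being written to disk.
            self.dropped += 1
        else:
            # Durable: overflow spills to disk, so nothing is lost.
            self.disk.append(message)

    def depth(self) -> int:
        return len(self.memory) + len(self.disk)


class Topic:
    """A named stream; every channel receives its own copy of each message."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.channels: Dict[str, Channel] = {}

    def channel(self, name: str, mem_queue_size: int = 3) -> Channel:
        # Channels are created on first use and only see later messages.
        if name not in self.channels:
            self.channels[name] = Channel(name, mem_queue_size)
        return self.channels[name]

    def publish(self, message: str) -> None:
        for ch in self.channels.values():
            ch.put(message)


# Hypothetical names: two full consumers plus one dev-null style sink.
events = Topic("events")
prod = events.channel("cache-driver")
warm = events.channel("cache-warmer")
sink = events.channel("devnull#ephemeral")
for i in range(5):
    events.publish(f"event-{i}")
print(prod.depth(), warm.depth(), sink.depth(), sink.dropped)
```

With nobody consuming, both durable channels hold all five events, while the ephemeral one keeps only what fits in memory and counts the rest as dropped. That is the property that makes an ephemeral stream usable as a sink.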
Knowing that we could easily duplicate the entire service inputs and outputs without disrupting the main production service got us thinking. We could run an entirely separate copy of our problem service just for its side effects, for the sole purpose of populating a new cache. At this point we began to get a little excited, and to map out the configuration we would need to fire up a clone service to warm up a new cache cluster.
The Really Hard Part
We run our service as a fleet of microservices, each of which is itself an autoscaling, auto-configured fleet of virtual machines. We coordinate all of this in real-time using Consul.
In the case of our state-machine cache driver, we configure the NSQ details in Consul K/V values and let consul-template rewrite local configuration files when we make changes. The service itself only loads state on startup, a property that turns out to be important in this case.
The plan was to turn off consul-template on the running service group, effectively disconnecting those machines from the config changes we were about to make. With that done, we could reconfigure all of the NSQ settings and start up a second full copy of the cache driver service pointing at the new cluster.
That second version of the service consumes a full duplicate of the main NSQ stream and populates the new cache. Importantly, we throw away all of the outputs produced by this second cache-warming service. So the only impact it has is to populate the new cache.
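As a rough sketch of the reconfiguration, the warming service's settings can be thought of as a small transformation of the production settings. Everything here is hypothetical: the `DriverConfig` fields, the topic and channel names, and the endpoints are invented for illustration and do not reflect our actual Consul keys. It also assumes the outputs are discarded by pointing them at ephemeral NSQ topics, in the spirit of the dev-null hack mentioned earlier.

```python
from dataclasses import dataclass, replace
from typing import Tuple

# NSQ treats topic and channel names ending in this suffix as ephemeral.
EPHEMERAL = "#ephemeral"


@dataclass(frozen=True)
class DriverConfig:
    """Hypothetical settings for the cache driver service."""

    input_topic: str
    input_channel: str
    output_topics: Tuple[str, ...]
    cache_endpoint: str


def warmer_config(
    prod: DriverConfig,
    new_cache_endpoint: str,
    warmer_channel: str = "cache-warmer",
) -> DriverConfig:
    """Derive the cache-warming clone's settings from the production settings."""
    return replace(
        prod,
        # Same topic, different channel: the clone gets its own full copy
        # of the event stream without taking messages from production.
        input_channel=warmer_channel,
        # Outputs go to ephemeral topics so they are eventually dropped.
        output_topics=tuple(t + EPHEMERAL for t in prod.output_topics),
        # Side effects land in the new cluster instead of the live cache.
        cache_endpoint=new_cache_endpoint,
    )


prod = DriverConfig(
    input_topic="events",
    input_channel="cache-driver",
    output_topics=("alerts", "history"),
    cache_endpoint="old-cache.example.internal:6379",
)
warmer = warmer_config(prod, "new-cluster.example.internal:6379")
print(warmer)
```

Only three things differ between the two copies: the channel they read from, where their outputs go, and which cache they write to. The input topic is untouched, which is what lets both services see the same events.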
After 24 hours, both the old cache and the new cache contain the same data, and we can go through the delicate dance to shut down the cache warming service and switch the main service over to use the new clustered cache.
This worked really well for us. We were able to shift over a stateful service from a constrained cache instance to a much more capable cluster, and to do so without imposing downtime, lost data, or weird behavior on our users. In addition, we figured out a new way to use our tools to decouple parts of our service that originally seemed too tightly integrated to evolve.
There is a compelling scene in the movie Apollo 13, when they discover that the available replacement CO2 filters for the command and lunar modules are different shapes and sizes, and they are running out of replacements for the lunar module. A team of engineers sits down to figure out how to connect the available square replacement filters to the air system that uses round filters, using only the tools and items available to the astronauts.
Our constraints were enormously less difficult, our criticality infinitely lower. But as engineers it is the challenge that drives us, the puzzle of the existing constraints, the investigation of degrees of freedom, and the satisfaction of a solution within the boundaries.
Working with Us
Life360 is creating the largest membership service for families by developing technology that helps make managing family life easier and safer. There is so much more to do as we get there and we’re looking for talented people to join the team: check out our jobs page.