On the Life360 Infrastructure Engineering team, we manage critical system support using a standard multi-tiered on-call schedule. (Hellooooo PagerDuty). I just completed a week on the first-called tier (“Level 1” or L1), and noticed the remarkable difference in experience from similar on-call periods last summer.
Thinking it over, I realized that my team has done this really cool thing. It wasn’t dramatic or flashy; nobody made a diving catch. Through care and attention and disciplined effort, we cut our incident rate from 100s per week to less than 20.
This is not to say we don’t still have critical alarms, or that things don’t break or behave strangely. As an Operations team, Murphy rules our world. Planning for that uncertainty and building to minimize risk is a big part of sustainable ops. But looking back at where we’ve been, a few other things were critical to driving the improvements. This was a team effort; every single person contributed to our current state.
Murphy Rules Our World
Thanks to experience in the industry and outreach through meetups and gatherings around our office in San Francisco, we know quite a few people who do this kind of work. We have shared our war stories, and listened gravely (ok, gleefully) to those of our compatriots. One thing is absolutely clear: systems will fail, unexpectedly and in new, unanticipated ways. Maybe this is obvious, but if you want to get out of the cycle of normal failures and alert fatigue, you must fix the problems that fail the most.
A lot of us have worked on teams where the same issues came up every week. There was no need for a runbook (and nobody who would take the time to write one); learning the fixes for the common failures was a rite of passage and a sign of competence achieved. The Usual Suspects were simply a fixture of the environment.
It can take some kind of trigger to alter the dynamic, a breath of fresh air to change everyone’s mind.
First thing to do: change your mind
The most important contribution was a mindset change. Following a particularly difficult on-call period, we started talking about how much noise we were encountering, and the time it would take to fix. That blossomed into an initiative to make every alert require action. We decided as a team to let the person on L1 set aside their other responsibilities for their week on call, and focus on fixing or addressing every alert that came up. Then one day our manager arrived all full of intent and defined the mission: actionable alerts only.
We have a channel in our company Slack where all kinds of alerts, alarms, warnings, and notifications land. When we started on this effort last summer, there was so much traffic that the channel was useless to us except as a log to investigate during a post-mortem. Our manager established the goal of making this channel contain nothing but messages an operator needed to address. If a message hits that channel, the person on L1 should be logging in to make a change.
This mindset change took a while to take hold. With enough fires burning, project work that might be late, normal interrupt task volume, and a mindset that we didn’t have the time to invest, it took some effort and reassurance from each other to begin making progress. During our weekly team check-in, we would talk about the alerts we were seeing, and encourage the person on L1 to take a shot at fixing some of them. Over time, we each got enough encouragement to pick an annoying alert and rescale it, or fix the underlying problem, or schedule the work to make the system change to eliminate it. Small changes, incremental progress, but significant compounding of benefit.
The part that was really cool happened after we had been pursuing this mindset for a couple of months. We all began to feel empowered when we were on-call. We had explicit permission and backup from the rest of the team to deep-dive on the current operational noise, find the root cause, and squash it. And then celebrate the accomplishment, as we could chalk up another annoyance that we would not see again. This led to a couple of times when the person on L1 would perform an epic deep-dive into one of our services to tease out the hidden relationship(s) causing the alert, and then fix them.
It is hard to overstate how satisfying it is as an ops person to be empowered to fix the things that break. After all, our main reason for existence is to keep the system running. What could be more important than fixing the problems that interfere with that mission?
You’ve got to do the work
The funny thing about Murphy is that the biggest problems are hard to isolate because they are often built up from the smallest, most subtle configuration issues.
An example: one of our services would take a really long time to restart — multiple minutes. Using strace we could see that it was connecting to its consumer queues properly, although there was a long lag before initializing the producer queue. There was no problem in the code; it didn’t behave this way in development. Only after a concerted deep-dive did we realize that the problem was a tiny configuration issue. Our queuing system uses a set of coordinators split into two groups. One of those groups carries orders of magnitude less traffic. And that much-quieter group was the first coordinator in the list.
So the client would connect to the first group and begin processing the tiny flow of messages. Over the next couple minutes it would engage with the other coordinators and start processing our volume. This problem didn’t show up in development because we don’t run multiple groups of coordinators in that environment. And if by chance we had configured the groups in a different order, we would never have noticed this problem.
Fun side note: as I was writing this up, the team member who discovered all of this had to correct my original idea of what the problem was. I had not known until today that we run two groups of coordinators, and I still don’t know why. There may be a good reason, or it may simply be that the right fix is to merge the coordinator groups. One of our team’s tenets: we never stop learning.
In terms of reducing alert volume, the highest payoff for our effort came from addressing log file management. We use logrotate everywhere, but some of our services run at high enough volume they would still overflow their disks. Adjusting the rotation parameters for each service that annoyed us took a few weeks, but each time we fixed one it disappeared from our view.
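For a sense of what that tuning looked like, here is a sketch of a logrotate stanza for a hypothetical high-volume service — the service name, paths, and thresholds are made up, and our actual parameters varied per service:

```
# /etc/logrotate.d/myservice -- illustrative only, not our real config
/var/log/myservice/*.log {
    hourly              # requires logrotate to be run from an hourly cron
    rotate 48           # keep two days of hourly files
    maxsize 500M        # also rotate early if a file blows past 500MB
    compress
    delaycompress       # leave the newest rotation uncompressed
    missingok
    notifempty
    copytruncate        # rotate in place, no service restart needed
}
```

The key idea is pairing a shorter interval with `maxsize`, so a sudden burst of logging triggers a rotation before the disk fills rather than waiting for the next scheduled run.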
Another class of annoying problem is the service that dies regularly, but not predictably. We have a couple services that were suffering from a resource problem: memory consumption in one, long cache TTLs in the other. We knew these services would fall over at some point during the week, but it was hard to predict exactly when. We could have set up a cron job to restart them every day at an unobtrusive time, but there were service benefits to keeping them running longer.
The right answer was to rebuild the service to avoid the resource problem, but while that task worked its way to the top of the stack rank, we set up a watcher script to look for the error condition and restart the service. Note that there are some complications to this kind of automation: you need to ensure the script doesn’t see the startup state as matching the failure condition or you get a crash-loop. But when it works, you have automation that handles the task an operator would perform, reliably and predictably.
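The watcher pattern, including the crash-loop guard, can be sketched roughly like this. The thresholds and hook names here are hypothetical, not our actual script:

```python
import time

# Hypothetical thresholds -- tune these for your own service.
MEMORY_LIMIT_MB = 4096   # error condition: resident memory above this
STARTUP_GRACE_S = 300    # ignore readings until the service has settled

def should_restart(rss_mb, uptime_s,
                   limit_mb=MEMORY_LIMIT_MB, grace_s=STARTUP_GRACE_S):
    """Decide whether the watcher should restart the service.

    The uptime check is the crash-loop guard: a freshly started process
    can briefly resemble the failure condition (caches warming, memory
    ballooning during init), so we only trust readings taken after the
    startup grace period has passed.
    """
    if uptime_s < grace_s:
        return False          # still starting up; never restart here
    return rss_mb > limit_mb  # past the grace period, enforce the limit

def watch(read_rss_mb, read_uptime_s, restart, interval_s=60):
    """Poll the service and restart it when the error condition holds."""
    while True:
        if should_restart(read_rss_mb(), read_uptime_s()):
            restart()
        time.sleep(interval_s)
```

In practice `read_rss_mb` and `read_uptime_s` would read from /proc or your init system, and `restart` would shell out to something like systemctl; keeping the decision logic in a pure function makes the crash-loop behavior easy to test.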
Sometimes it isn’t about you
We did not do all of this ourselves. As mentioned above, one of our core services just needed a rewrite to address its resource consumption. That rewrite built on all the good things learned from the earlier version, and has been running pain-free almost since its launch day.
Another problem that bothered us regularly was a cache that would fill up and crash, every 36-ish hours. This was extra-annoying, since it meant the L1 could count on a wake-up at least once during a shift. We discussed setting up a scheduled restart to clear it at a convenient time, but before we put a bandaid on the problem we dug in to figure out what was happening.
Some spelunking in our redis cluster showed that the cached data was not the problem. The problem was a single, massive object that linked all of the cached data together. As we added items to the cluster, the linking object would accrue another index entry. Once that object overran the single node’s memory, the cluster would crash. Aha!
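This kind of spelunking can be approximated with redis-py’s `scan_iter` and `memory_usage` (both real redis-py methods); the connection details and sample threshold below are made up for illustration:

```python
import heapq

def top_keys(sizes, n=10):
    """Return the n (size, key) pairs with the largest sizes, descending."""
    return heapq.nlargest(n, ((size, key) for key, size in sizes.items()))

if __name__ == "__main__":
    import redis  # pip install redis

    r = redis.Redis(host="localhost", port=6379)  # hypothetical connection
    sizes = {}
    for key in r.scan_iter(count=1000):  # SCAN: iterate without blocking redis
        size = r.memory_usage(key)       # MEMORY USAGE <key>, bytes
        if size is not None:
            sizes[key] = size
    for size, key in top_keys(sizes):
        print(f"{size:>12}  {key!r}")
```

A single key that dwarfs everything else in this listing — like our linking object — is the smoking gun. `redis-cli --bigkeys` gives a similar quick read from the command line.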
With that information our sibling development team cracked open the code for the service and isolated the configuration for the caching library that was creating the linking object. Like many of our other big annoyances, this fix turned out to be a tiny configuration change. But the real lesson here is to look into the root cause for your problem areas, even if an easy workaround is available. A real fix is always preferable to a workaround. Not only does it remove your pain, it increases your knowledge.
Where do you want to go?
While it is nice to think we could automate our way to a system that self-heals and never throws an alert, that seems unlikely. Our current setup is far from perfect, and we have a long way to go even to accomplish the obvious tasks we see ahead of us. But those are the challenges that drive us, no?
Come join us.
Life360 is creating the largest membership service for families by developing technology that helps make managing family life easier and safer. There is so much more to do as we get there, and we’re looking for talented people to join the team: check out our jobs page.