Infrastructure/Ops Stack Rank of Pain

Kent Hoxsey
Life360 Engineering
12 min read · Mar 5, 2018

Constant interrupts making you crazy? Unable to focus on fixing known problems because of higher-priority deployments that need hand-holding? Welcome to Ops!

Communicating Priorities

All teams prioritize their work, either explicitly or implicitly. Among the teams within Life360, we use a common representation of priority both to organize the details and to communicate those priorities to each other in our sprint reviews. Within our walls, we call that common representation a “stack rank”, even though it has nothing to do with the employee-rating mechanism pioneered at GE.

But within the Infrastructure Engineering team at Life360 (“Inf-Eng”), the standard stack rank of work comes with caveats. We may schedule an important project into our next 2-week plan (say, converting some of our services to run in our Kubernetes cluster), but that work is subject to override by another team delivering a new service and needing our help to get it into production. Or by a new scaling hotspot showing up due to growth in our user base. Or by fire-fighting because of a crisis at our cloud provider (please, no) or some other uptime-threatening issue.

While the stack rank communicates the list of projects in our backlog, it does not illustrate the very real priorities of our role within the company. Worse, it ignores our most important responsibilities: managing our service for availability and performance is a key part of our role, as is helping the other teams deliver the new services they build. When these issues showed up during reviews, the reasons were usually clear to everyone. But that still left us feeling there had to be a better way to communicate what we do.

Enter Maslow

In an attempt to communicate this better, and to provide some entertainment, I introduced Maslow’s Hierarchy of Needs as a model for our operations team’s stack rank. Representing our priorities that way felt suitably atavistic, but also captured the attention of everyone attending.

Maslow’s Hierarchy of Needs. Thanks Wikipedia.

The levels of the hierarchy map reasonably well to our set of priorities:

The bottom level of the pyramid, Physiological needs, maps to Service-Down events (often called P0, for highest priority). When the service is down for whatever reason, that is our absolute focus. In a bad situation, we would be willing to forgo a less-important part of the service in order to get the main functions up and running again. Since we run a service with an API, I refer to these priorities as “API CPR”, with all of the connotations of imposed trauma. (True story: in my CPR certification class, my instructor said, “if you aren’t breaking ribs, you’re probably doing it wrong.”)

The next level, Safety, maps well to Fire Fighting (often called P1 events), the ongoing struggle to identify and address the things that go wrong on a regular basis — but without the time or other resources to fix the root problem and eliminate the issue for good. Every operator out there knows this pain, and the frustration that comes from spending a lot of time in this mode. This is where our tools haunt us, paging us so often that we have no time to think, only act. This is where the toil lives.

These situations need not always be a crisis. A well-tuned ops team can have a set of plans for adapting to known challenges. When the crisis appears, the team can swing into action with confidence, ameliorating if not solving the problems as they arise. Becoming such a well-tuned team requires putting in the time to brainstorm, to document, to try out approaches. Which is the kind of project work that tends to get delayed by…fire fighting.

The third level of the pyramid connects us to others in the organization. For us at Life360, the tasks I associate with this level are those that help our other teams accomplish their goals: deploying improvements to existing services, spinning up infrastructure to support new services, implementing any of a myriad of tunings or speedups to make our service faster or more resilient. It is fitting that this level is called Love and Belonging, and it makes me smile when I get to present to the other teams where their work sits in our world. When everyone is communicating well, these tasks move the business forward and make us all feel good about our contribution.

Of course, there is a dark side as well. As an ops team tasked with protecting and securing the infrastructure, all requests for access come to us. There are a number of areas in which we are the gatekeepers, and thus need to perform an action before another team member can even begin their own work. These kinds of tasks hit our board as Interruptions, and are often called Drive-bys. When other parts of the organization are under time pressure, these kinds of tasks arrive last-minute, under duress, and emotionally charged. If we fail to execute them in a timely manner, other teams may feel like we’re holding them back.

The next level of the model is called Esteem, and represents those things we do both for others’ esteem for us, and for our own self-esteem. I think of these as the tasks that improve the quality of our infrastructure, making our environment better. This is where we can seek the alchemy of doing-more-with-less, achieving new capabilities and amazing our organizations. Most of our project work lives here, in our backlog.

And herein lies the challenge. Our project work, our best idea of what the organization really needs from us to achieve next year’s goals, exists at the fourth priority level. A small subset of these projects are crisis-response plans that will be elevated in priority when the anticipated crisis appears, but most of them are mundane, long-term fixes: automate the database backups, implement containerization for key services, upgrade a critical infrastructure component. Mundane, boring, deadly critical to long-term capability. Easy to postpone, because whatever problem the work addresses is not critical today.

The final level of the pyramid is Self-Actualization, and maps beautifully to every infrastructure person’s desire to build a better tool. Automating our world, scratching the itch of an annoying problem. Contributing the solution out to the open-source community at large. We really want to spend time doing these things. Every ops person I’ve ever known has at least one pet project in the back of their mind, waiting for an opportunity to spend a little time on it.

Add in my little annotations as labels to the original diagram, and we get the next phase of our story:

Inf-Eng Stack Rank of Pain

Can we call this “The Stack of Pain”?

Immediately upon seeing the diagram, one of my team members asked, “Can we call it the Stack of Pain?” Thus was Maslow’s Hierarchy of Needs reborn as the Inf-Eng Stack Rank of Pain, a model for thinking about priorities for infrastructure teams.

With respect to the distinguished psychologists working in Maslow’s tradition, I am taking enormous license here: don’t read too deeply into my “model”. This presentation is meant to entertain and to provide a foil for discussing how operational priorities differ from those in other parts of the organization. The main idea I want to draw out of Maslow is that work/needs in the lower levels of the hierarchy take precedence over work/needs higher up. With that in mind, we can reorganize into a more common stack rank presentation:

1. Service-Down events (P0): API CPR
2. Fire Fighting (P1)
3. Enabling work, Interruptions, and Drive-bys
4. Infrastructure project work
5. Tool building

There are a number of observations that ring true for me when I consider our world through this lens. It is clear to everyone why nobody questions our focus when a P0 arises. It is also clear that fire-fighting pre-empts project work, even project work to fix the thing that is on fire. When the service is not responding properly to users, it is a big deal.

What may not be so clear to the organization is just how much fire-fighting work goes on, how often the ops team handles a P1 event without interrupting people in the rest of the company. It is quite easy for an ops team to get good enough at handling the “ordinary” fires so they don’t escalate into crises, but never fix the underlying problem so the fires disappear. Do that too long and it can become normal, even expected. And we won’t even go into the problems that come about in organizations that reward the heroics of putting out the fires but not the careful effort of preventing them.

Beyond the non-discretionary work, things can become more complicated. An infrastructure team can easily feel that they serve two warring masters: support work to enable the other teams in the company to deploy and deliver on their commitments, and the inevitable background work to maintain and operate the sum total of the company’s infrastructure — including technical debt.

For any growing company, new-feature deployment work pre-empts maintenance work, because keeping the other teams unblocked is a force multiplier for the organization. It is important for an infrastructure team to conceive of itself as the foundation of those other activities, to understand its role enabling the company to achieve its goals.

Often when that enabling work feels burdensome, it is because it drops in late and at high urgency. The best way to address these kinds of issues is proactive coordination with the leads of the other teams. There have been times when I was not aggressive about this outreach, and I can point to each of them as a time I suffered for it. If you lead an infrastructure team, you can probably feel a big drop-in looming out in the ether just from listening to the company cadence meetings. My advice: get out there and touch base with the managers, the team leads, the programmers. Work the back channels; the formal channels will take care of themselves.

Another bit of advice: take a look at the areas where your team plays gatekeeper: shared data access, cloud resources, new-user setup, etc. Identify the routine tasks, formalize your processes, and document them. If an operation takes more than a few commands, or if it requires touching multiple screens in a web console, automate it. Once it takes longer to complain about a drop-in than to take care of it, the problems will go away.
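
To make “automate it” concrete, here is a minimal sketch of what one common drive-by, new-user setup for shared data access, might look like as a single command. It assumes an AWS environment with boto3 and an existing read-only IAM group; the group name and workflow are hypothetical placeholders, not a description of our actual process.

```python
# Hypothetical sketch: turn the "give the new analyst access to shared data"
# drive-by into one command. Assumes AWS credentials are already configured
# and that membership in a read-only IAM group (placeholder name below)
# grants the access.
import argparse

import boto3

iam = boto3.client("iam")


def onboard_user(username: str, group: str) -> None:
    """Create the IAM user if needed, then add them to the shared-data group."""
    try:
        iam.create_user(UserName=username)
        print(f"created IAM user {username}")
    except iam.exceptions.EntityAlreadyExistsException:
        print(f"user {username} already exists, skipping create")
    iam.add_user_to_group(GroupName=group, UserName=username)
    print(f"added {username} to group {group}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="one-command new-user setup")
    parser.add_argument("username")
    parser.add_argument("--group", default="shared-data-readers")  # placeholder
    args = parser.parse_args()
    onboard_user(args.username, args.group)
```

Even a script this small turns a multi-screen console chore into something anyone on call can run in seconds, and the script itself documents the process as a side effect.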

Focusing on the needs of the rest of the organization can create a new problem, because all that project work still exists. Your project work is meant to improve the overall scalability, resilience, and cost profile of the organization’s core infrastructure. You can have a serious argument about whether that task is more important long-term than the force-multiplier of keeping the development teams unblocked. But to be successful, it is important to figure out a way to prioritize and complete those tasks.

For our environment, it seems that we operate in a binary world. When things go well for us (no P0/P1) we can knock out a lot of work during a 2-week plan, but when something goes wrong we complete none of it. We are still working out the tactics to be as effective as we’d like to be, but we have reaped a lot of benefit from two particular efforts: we invested heavily in our runbook documentation for complex common operations, and we spent several months relentlessly fixing or de-escalating alert noise. Those efforts have paid huge dividends, both in lower on-call stress and in fewer interruptions with easier remediation.

Aaaaaaargh. You’re Not Helping!

Does this all sound familiar? Triggering PTSD? Tired of people telling you to just suck it up, that the pain you feel is just part of the job? I wish I had a magic methodology to recommend, but the only real fix is communication. Amongst your team, across the teams you support, across the organization. To get you jump-started, here are some things we’ve found useful:

Tactic: read some Allspaw. This one is quite good: “engineers are defined not by what they produce, but by how they do their work”. This one is also very good, on Blameless Post-Mortems. Spend time as an ops team talking about how you do your work.

Tactic: read the SRE book. Pick one of the techniques and integrate it into the way you do your work. Repeat.

Tactic: plan out a full-team response for those problems that crop up as a P0/P1 and get “resolved” with a workaround rather than an actual fix. When the current priorities won’t allow you to take the time to evict the root cause (it happens, be cool), take a couple of hours with the team to brainstorm the right fix, set up an epic in whatever task-tracking tool you use, and document that solution. Every task, every command, right now while it is fresh in the minds of the team. The next time that problem comes around your team will be locked and loaded, and you can use the crisis to fix it the right way.
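
As an illustration of “every task, every command”, the sketch below shows one way that documented fix can live as an executable runbook rather than only a wiki page. The steps are placeholders for whatever commands your team actually wrote down; the point is that the next responder runs them in order instead of rediscovering them mid-crisis.

```python
# Hypothetical sketch: a documented remediation captured as a step-runner.
# The commands are placeholders; substitute the ones your team recorded
# while the incident was still fresh.
import subprocess
import sys

STEPS = [
    ("Drain traffic from the affected node",
     ["kubectl", "drain", "node-1", "--ignore-daemonsets"]),
    ("Roll the bad deploy back one revision",
     ["kubectl", "rollout", "undo", "deployment/api"]),
    ("Confirm the service is healthy again",
     ["curl", "-fsS", "https://example.internal/healthz"]),
]


def run_runbook() -> None:
    for i, (description, command) in enumerate(STEPS, start=1):
        print(f"[{i}/{len(STEPS)}] {description}")
        if input("  run this step? [y/N] ").strip().lower() != "y":
            print("  skipped")
            continue
        result = subprocess.run(command)
        if result.returncode != 0:
            sys.exit(f"step {i} failed (exit code {result.returncode}); stopping")


if __name__ == "__main__":
    run_runbook()
```

Even if you never automate past the prompt-and-confirm stage, writing the fix in this form forces the team to be precise about each command while the details are fresh.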

Tactic: as an ops team, discuss how to create some space from the toil for one of the team members to focus on building a tool. Everyone pledges to free somebody from the toil for a couple weeks, and that person goes heads-down to build something that really moves the rock. Be clear about the sacrifice as well as the goal. Team members sacrifice for each other, and the team gets better as a result. Do that regularly, and you can build a culture in which everybody gets a chance to stretch and build something they care about, and a bunch of hot new tools.

Tactic: treat your important infrastructure project work in the same way as all the work from other development teams. Define it with the same rigor you require of other feature work, plan the effort (epic, tickets, point estimates, everything), resource it appropriately, and advocate for it in the same planning session as all of the incoming work. When your project work gets allocated, make sure to get it done. Building a record of accomplishing your goals goes a long way to buying you more leeway in the future.

Tactic: keep score and report it out to the organization every week or two. We do regular two-week plans with a review at the end, so it fits well to give everyone the rundown. Let everyone see where the work went, and be flexible in allowing discovery and discussion. Amazing as it might seem, all the other teams in your organization are quite busy themselves, and they don’t necessarily know what you’ve been doing.

Not a bad week

We always include some statistics on our performance against SLA, because the availability of our service is an obvious factor in our customer satisfaction. We also include a chart from PagerDuty showing how many incidents we handled in the most recent two-week period. Making that effort visible to everyone, and celebrating the third-level enabling work, helps everyone understand better what we have been doing and how it impacts the company goals.
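
For the PagerDuty portion of that report, even a small script can pull the two-week incident count. The sketch below uses PagerDuty’s REST API v2 and assumes a read-only API token in the PAGERDUTY_API_KEY environment variable; filter by service or status to match whichever slice of the on-call load you want to show.

```python
# Sketch: count PagerDuty incidents opened in the trailing two weeks, for the
# end-of-sprint review. PAGERDUTY_API_KEY is assumed to hold a read-only
# REST API token.
import os
from datetime import datetime, timedelta, timezone

import requests

API_URL = "https://api.pagerduty.com/incidents"
HEADERS = {
    "Authorization": f"Token token={os.environ['PAGERDUTY_API_KEY']}",
    "Accept": "application/vnd.pagerduty+json;version=2",
}


def incident_count(days: int = 14) -> int:
    """Count incidents created in the trailing window, paging through results."""
    until = datetime.now(timezone.utc)
    since = until - timedelta(days=days)
    params = {"since": since.isoformat(), "until": until.isoformat(),
              "limit": 100, "offset": 0}
    total = 0
    while True:
        resp = requests.get(API_URL, headers=HEADERS, params=params)
        resp.raise_for_status()
        page = resp.json()
        total += len(page["incidents"])
        if not page.get("more"):
            return total
        params["offset"] += params["limit"]


if __name__ == "__main__":
    print(f"Incidents in the last two weeks: {incident_count()}")
```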

The Big Finish

One of the things that can make infrastructure challenging is the broad set of priorities, and the interrupt-driven lifestyle that can bring. I have argued above that there is real value in the areas that bring the most pain, and that value comes from enabling the other teams to stay focused rather than fussing with the underlying tools. Done well, an infrastructure team is a force multiplier for the rest of the organization.

But achieving that level of capability means vanquishing quite a lot of complicated, messy lower-level detail. Not just keeping the basic service alive, not just responding to the ongoing needs of a growing concern, but innovating on the core capabilities in a way that opens new space for the rest of the organization to occupy, and communicating that evolution to the organization so they too can plan and evolve.

Working with Us

Life360 is creating the largest membership service for families by developing technology that makes managing family life easier and safer. There is so much more to do as we get there, and we’re looking for talented people to join the team: check out our jobs page.
