On-call is not a technical problem, it is a people problem

Ann Harter
15 min read · Aug 19, 2020

For some reason I really love incident management?

As we enter a ~* brave new world *~ of remote work, it’s more important than ever that we have responsible, respectful people-centered policies, including our incident management programs. The line between being at work and being at home at this point isn’t blurry, it’s battered. This kinda means you have to care about your team, like a lot. Like I know they’re spoiled, highly paid software engineers, but this isn’t the Suffering Olympics! And even if it was, NO ONE WINS IN THE SUFFERING OLYMPICS. THERE ARE NO MEDALS. EVERYONE COMES IN LAST PLACE. The people who you think come in first place are actually just least losers. It’s not a great event. I’m not sure why we keep holding it.

On-call is not a technical problem, it is a people problem.

Not sure why this has become my rallying cry, but alongside “The Myers-Briggs is nonsensical garbage,” it will be carved into my headstone. I low-key don’t care about your technical problems — that’s what your engineers are for! But in my professional opinion, most on-call experiences suck and worse still, on-call gets treated like it’s a problem that doesn’t exist? Like, obviously it does exist, but we just kind of ignore it and hope it magically gets better. Which is a strategy for dealing with it (not making a choice is still a choice!!), but magical thinking is probably not a good strategy? Or at least, it’s not a good strategy if you want your engineers to not hate it.

Speaking of, “not hate it” is my personal low bar for feelings about on-call. It feels weird and disingenuous to make your goal “loving it”. Even if you love your job, no one “loves” getting woken up at 3 in the morning because a computer broke. In fact, universally* everyone hates that.

[* cue some dude who chimes in with “Actually, I find it really invigorating and frankly it’s healthy to know how your systems are operating and if you really care about your customers, it’s an honor to lose sleep.” Ok, my guy. You don’t have to read this? Go on, close the tab. Go on now. Bye!]

Anyway, where was I. Computers suck, and that they will break is an inevitability that no one wants but here we are. It’s 2020! No more denying reality! Your computers are going to break, and step 1 is admitting you have a problem. Not a technical problem, a people problem. A problem made by people (hi we invented computers and keep using them!!) affecting people (your customers! Your shareholders! YOUR EMPLOYEES).

So what can you do?

From the moment you have your first customer, you are on-call. Congratulations! But most of you are probably far beyond the first customer. Some of you may even be publicly traded companies with hundreds of engineers! (Or thousands? If you have thousands of engineers you’re probably not reading this. If you are, hey welcome to the group. Have a seat, we are a group of self-healers and we are committed to holding space for our mistakes so that we may learn from them.)

Let’s define some terms. To be on call is to be the engineer who has to act when something is broken. An incident is when something breaks, generally in a production environment (meaning, your customers rely on it directly or indirectly). Incident response is how, uh, you respond to incidents. Hopefully you have a plan in place that streamlines your incident response. If not, good news! Keep reading. Incident management is the overall care and feeding of your incident response plan plus some other stuff, though I sometimes use the terms interchangeably.

As I said, from the moment someone outside your company is relying on your software, you are on call, whether you have a formalized, organized process around that or not. Meaning: at some point all or part of your website will go down, and if you want to keep your customers, you or somebody is going to be responsible for bringing all or part of your website back up, preferably sooner rather than later. (I’m not going to talk further about the needs of early stage startups because it’s outside my expertise and also honestly, a robust incident management plan when you have three engineers is probably overkill.)

Briefly, you need monitoring and alerting, if you don’t have them already. (Both. You have to have both. You can’t set up alerts for things you can’t see.) I feel a little silly even bringing it up, but I’ve heard enough horror stories to know that I shouldn’t skip this part.
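
To make that concrete: an alert is just a rule evaluated against a metric your monitoring is already collecting, which is why you can't have one without the other. Here's a minimal sketch in Python; the function names and the 5% threshold are hypothetical placeholders, not any particular vendor's API.

```python
import random

# Hypothetical threshold: page somebody if more than 5% of requests are failing.
ERROR_RATE_THRESHOLD = 0.05

def get_error_rate() -> float:
    """Stand-in for whatever number your monitoring actually exposes."""
    return random.uniform(0.0, 0.10)  # fake data so the sketch runs

def page_oncall(message: str) -> None:
    """Stand-in for your paging tool of choice."""
    print(f"PAGE: {message}")

def evaluate_alert() -> None:
    rate = get_error_rate()
    if rate > ERROR_RATE_THRESHOLD:
        page_oncall(f"Error rate is {rate:.1%}, somebody should take a look")

evaluate_alert()
```

No monitoring means there's no number to feed that rule, and no alerting means nobody gets woken up when the number goes sideways. Both.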

Don’t know what kind of monitoring and alerting you have, whether it’s meeting your needs, and what kind of improvements to make? Ask your engineers. They built the dang apps and infrastructure that make up your product, and they probably have a good idea what they need. And don’t just like, ask the engineers you talk to all the time. Ask a bunch of them. Ask them to write it down. Actually give them the time to investigate and think on it and document it. Let there be a paper trail. Bring it up as a job well done in their next review. (If you’re an engineer asked to do this, keep your receipts.) Next, and this is vital, you have to act on this information. One surefire way to annoy your employees is to ask them to do a bunch of work and then yeet it into a black hole. If you need help organizing and managing this information, hire a program manager to oversee it.

But I didn’t really come here to talk about monitoring and alerting, I came to talk about incident response, so let’s say you’ve got monitoring and alerting, or you’re on your way there (nothing like building the plane in mid-air I always say). Whatever the case, you have systems in place to know when your computers are back on their bullshit. An alert goes off!!!!!!!!!

Now what.

Well, theoretically the lifecycle of an unplanned outage, what in the biz we call an “incident”, goes something like this:

  • Oh no the computer is broke
  • Another computer (your alerting system) and/or a real life human being notices the computer is broke (via monitoring, logs, or maybe your website is being really slow for some reason?)
  • An engineer responds and fixes it

That’s it! Super easy, right? Thanks for coming to my TED talk or whatever.

But in case you are curious about the ways in which a realistic, workable Incident Response Plan — you know, an organized set of policies and procedures designed to reduce ambiguity and cut down the time it takes to resolve an incident — might be beneficial to your org, you can keep reading!

GASP BUT WHAT ABOUT MY BEAUTIFUL METRICS

You’ve probably got some kind of incident response plan in your org already. How is it? I’m genuinely curious. I’m always interested to see how different orgs handle (or don’t handle) this issue. But like, don’t answer that question by telling me your mean time to resolution (MTTR) because until you have a mature incident management program I don’t care.

Ok, so first of all, MTTR is by definition an aggregate value, and without a lot of other information about your incidents for context it’s pretty useless. For example, pretend you’ve had four incidents this month that took the following amounts of time to go from broke to fixed: 10 min, 23 min, 28 min, and 7 hours. The MTTR for this month comes out to about two hours. If that feels skewed, well. Yeah.
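
If you want to check that arithmetic yourself, here's the quick version in Python (the durations are the made-up ones above). Note how the one long outage drags the mean far away from what a typical incident actually looked like:

```python
# Four made-up incident durations, in minutes. The single 7-hour outage
# does almost all the work in the mean.
durations = [10, 23, 28, 7 * 60]

mttr = sum(durations) / len(durations)
print(f"MTTR: {mttr:.0f} minutes")  # ~120 minutes, i.e. about two hours

durations.sort()
median = (durations[1] + durations[2]) / 2  # median of four values
print(f"Median: {median:.1f} minutes")  # 25.5 minutes, a very different story
```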

You could also give me numbers related to overall site availability (so many nines)! That too is a cool metric to track, but it loses a lot of information in the calculation. Which of your services are pulling that number down? Which are propping it up? What’s going on with that one service that’s always breaking?
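
For reference, here's roughly what "so many nines" buys you in allowed downtime over a 30-day month (a back-of-the-envelope sketch, nothing vendor-specific):

```python
# Allowed downtime per 30-day month at various levels of "nines".
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for availability in (0.99, 0.999, 0.9999):
    allowed = MINUTES_PER_MONTH * (1 - availability)
    print(f"{availability:.2%} available -> about {allowed:.0f} minutes of downtime per month")
```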

Am I saying you shouldn’t collect this information? No, that would be silly. Watching how these numbers trend across time and noting how those trends relate to deployments, key deliverables, or even current events is valuable information. But again, incident response is a people problem, and trying to reduce people problems to overall metrics like time to resolve and site availability obscures the people who actually have to do the work to resolve the incident and keep the site available.

In fact, for the purposes of this article the only metric I care about is how the people who have to be on call *feel about being on-call*. This is something that can be measured when you know how (if, say, you happen to have a graduate degree in psychological research and wrote your thesis around scale development, for just an example, off the top of my head).

That’s right, we’re gonna talk about capital-F Feelings.

How do your folks feel about being on call? For our purposes, imagine the type of engineer who might strongly agree or strongly disagree with a statement (as it relates to on call) like “meh, I don’t love it but it’s really not that bad”. So how do you think your team members would respond, and why do you care?

Let’s start with why you should care. Not to get super philosophical on you, but like, the world is that which is the case. If your engineers *hate* being on call, they’re going to *hate* being on call. You dig? A bad on-call experience is not solely defined by the technical problems they’re facing and the time of day they’re facing them. People’s experience of things generally reflects their attitudes toward those things. I love roller coasters, so I almost always have a great time on roller coasters, regardless of the ride. If you hate roller coasters, you’re probably not going to enjoy the thrill of plummeting rapidly toward the ground in the name of entertainment! Employees who harbor hatred and fear toward parts of their job… is probably not something you want? Like, if you don’t care whether your employees like the work they’re doing for your company, then I hope you get visited by three ghosts this Christmas because I don’t know what else will help you.

A woman on a laptop picks up her phone — perhaps she’s responding to an alert!! Source: #WOCInTech Chat via Flickr

When I surveyed a group of over 100 engineers at one company, I found that the factors that most influenced their attitudes towards being on-call did not include how much computers suck. They’re software engineers. They know better than literally anyone how terrible computers are. The folks with the most positive attitudes toward being on-call (the ones who might agree with a statement like “meh I don’t love it but it’s really not that bad”) had two things going for them:

  1. They felt prepared to go on call
  2. They felt like their team in general had a positive attitude toward being on call

Let’s break those down.

Feeling Prepared to Go On Call

What does preparation look like in this case?

Feeling prepared is another way of saying “having an idea how to navigate the problem.” When I dug deeper, this generally meant two things: feeling confident enough to know how to investigate a technical problem, and understanding what their role is during an incident.

Feeling confident enough to know how to investigate a technical problem is a tricky thing. Some engineers (mistakenly) believe that means knowing every service your team is responsible for frontwards and backwards (uhhhh probably not an achievable goal). In my incident response training, I reassured each engineer that no one expects them to know their entire app and the environment it’s running in like it’s some kind of mind palace they should be able to visit to find anomalies. In incident response, the on-call engineer is expected to triage the problem, not magically know how to fix it.

The on-call engineer’s triage duties go like this:

  • Once alerted, she should establish whether or not the problem is in fact a problem! Sometimes alerts go off that didn’t need to. That’s okay. Adjust your alerting, communicate that change to your team, and move on with your life.
  • Second, once she’s established the problem is a problem, she should determine whether she can fix it on her own in a reasonable amount of time (what is reasonable will depend on what’s wrong).
  • Third, if she can’t fix it on her own or she can’t finish the work in a reasonable amount of time, she should send up the bat signal for help.
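
If it helps to see those three decisions written down in one place, here's the same flow as a rough Python sketch. Every function and field below is a hypothetical placeholder for the judgment calls and tooling your team already has, not a real framework.

```python
def is_real_problem(alert: dict) -> bool:
    """Judgment call #1: did this alert actually need to fire?"""
    return alert.get("real", True)

def can_fix_alone_in_time(alert: dict) -> bool:
    """Judgment call #2: can I handle this myself, soon enough?"""
    return alert.get("fixable_alone", False)

def triage(alert: dict) -> str:
    """The on-call engineer's triage flow, as described above."""
    if not is_real_problem(alert):
        return "adjust the alert, tell the team, move on with your life"
    if can_fix_alone_in_time(alert):
        return "fix it yourself"
    return "send up the bat signal"

print(triage({"real": True, "fixable_alone": False}))  # -> send up the bat signal
```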

On-call engineers need to be empowered to trust their own judgment. (Do you see by now how these are people problems and not technical problems?) Sometimes they will make the wrong call! They will think the alert that went off wasn’t a problem, but it was. They will think they can fix it on their own, but they can’t. They will call for reinforcements when they didn’t need to. I have a revolutionary idea for you right now that I’m going to give you for free even though I spent years in grad school racking up student loan debt to tell you:

It’s okay.

“Everyone makes mistakes, oh yes they do,” a Sesame Street song of my childhood goes. “Your sister and your brother and your dad and mother too. Big people, small people, matter of fact, all people! Everyone makes mistakes, so why can’t you?” On an engineering team, there are plenty of opportunities for mistakes!

Unless you’re a monster (an actual one, not a lovable furry monster like Grover), you know that mistakes are how we learn. It was a mistake to deploy that app without some automated way to rotate the logs. And it was a mistake to just delete the logs and go back to sleep, knowing full well the disk will fill up again soon. Or it was a mistake not to check what was filling up disk space in the first place and instead wake up three of your teammates because the service is hard down and you panicked.

We don’t learn good judgment from all the right choices we made. We learn good judgment from our own mistakes, from observing the mistakes of others, and from keeping a healthy sense of humor about them.**

[** cue some dude who simply must chime in at this time to be like, “Actually, I *always* auto-rotate my logs so I’ve *never* had this problem. Some engineers are such amateurs, and they should feel bad about their mistakes.” Hey guy. No one likes you. You make life at work worse. You drive good people out of engineering. Sit down and shut up, or stop reading and close the tab. Bye!]

All of this is to say, engineers who feel confident to investigate a technical problem are engineers who are prepared to learn something today, even if that means they get it wrong on the first try.

Next, let’s talk about what it means for them to understand what’s expected of them during an incident.

When I started talking to engineers about being on-call, one thing I was surprised to hear was how much anxiety they had around what their role during an incident even was. My engineers got so caught up in the fear of Not Being Able To Fix Everything that they couldn’t understand what “triage” meant, even after I explained it.

“Triage means figuring out if the problem is a real problem and whether you have the ability to fix the problem on your own.”

“I can’t fix these problems!”

“What problems?”

“I don’t know! How am I supposed to know what to do when I’m on call?”

One technique I found to reduce anxiety was to describe the doomsday scenario at the beginning of my training sessions. As in, oh no something’s broken! Alerts are going off! You don’t know what the problem is, and what if you make it worse? What if the site goes down? What if it *stays* down? FOREVER?!! Our customers will bail, our stock will plummet, the company will close, and we’ll all be out of a job! ANN HOW CAN YOU PUT THIS MUCH RESPONSIBILITY ON MY SHOULDERS

Of course, I did this because shedding light on our irrational fears generally reduces the power they have over us. The actual worst thing that will probably happen while you’re on call if you make a mistake is that the incident will drag on a little longer than it otherwise would have. Maybe you wake up one or two people you didn’t need to. And as I reminded our managers and engineers over and over again, unless you make a habit of it, it’s really not a big deal. (If you do make it a habit, something else is going on, and you and your manager need to have a heart-to-heart.)

To reduce their anxiety over this ambiguity, I developed a curriculum that would set org-wide expectations about incident response. Among other things, this included knowing your own role as well as the roles that other people play during an incident. I taught everyone the roles of Incident Command, Communications, Subject Matter Experts, CustOps, and yes, even Executives. (In short, Executives hurt more than they help during an incident and should stick to bothering only the Communications person, if they absolutely cannot stay away.)

Understanding these roles meant knowing that they were part of a group effort focused on a resolution and that everyone had a part to play. They were therefore absolved from Knowing Everything and Doing Everything. I reassured them that the people they will work with during an incident are the same people they work with Monday through Friday, and when they need help, they’re not alone.

Feeling Like Their Team Had a Positive Attitude toward Being On Call

Surprising absolutely no one in 2020, people share the attitudes of their in-group. In-groups (vs. out-groups) are a handy way that we as a social species define who is us and who is them. We like our in-groups, and we want to stay in our in-groups. We share the values of our in-groups, and in a twist our ancestors never saw coming (they just needed to know who to share resources with), our in-groups now include the engineering team we’re on at work. So I was not surprised to see that people generally shared the feelings of their team when it came to being on-call.

If engineers felt like their team had a generally positive outlook toward on-call, then so did they. If they felt like their team had a negative outlook, so did they.

Attitude change, as any social psychologist will tell you, is hard to pull off, but that didn’t stop me from trying. I knew from my data that when I introduced the incident response plan to teams, I would be greeted either uniformly positively or uniformly negatively. So what did I do?

The rollout of a massive program like this can be done a couple of ways, including but not limited to:

  • Throw the info up on an internal wiki, send an email, call it a day
  • Meet with managers, have them tell their teams, call it a day
  • Meet with teams, and never call it a day because incident response is a practice, not a checkbox

Obviously I did the last one. Once I developed the training curriculum, I committed to meeting with every single engineering team in the company to go over this plan. No matter how good your incident response plan is, being on-call is disruptive to people’s lives. They are naturally resistant. If I was going to win over whole teams, I wanted the plan to have a face (my face! It’s so friendly!), and I wanted to make sure that they knew this wasn’t just some policy that some higher-up was making them do. It was essential that they understood why we were implementing this plan in the first place, what problems it would solve, and what it meant for them. I wanted them to know how much thought went into the program to make sure their on-call experience was as painless as possible. And there is nothing more valuable for making your case than literally showing up! I was able to change hearts and minds because I was there to empathize, listen, and address their concerns. Once they had a person attached to the program, they had someone to reach out to later with questions, suggestions, or what to do about weird edge cases. Two or three people even thanked me. They weren’t on their own.

tl;dr

A lot of people aren’t going to agree with this article because it’s about feelings. Those people are heartless bastards and frankly I feel bad for anyone who has to work with them. Is there way more to an incident management program than I laid out here? A-yup.

Are there technical decisions you can make that will improve the on-call experience for your engineers? Of course! Tuning your alerts takes care and attention but in the long run can save you from a lot of false alarms. Sign up for PagerDuty or one of its competitors and learn to use all their bells & whistles so you make sure you’re getting the most bang for your buck. I’m told deleting all your code is always an option. Integration tests, QA, canary deployments, whatever, can help you spot problems before they’re a big deal. Probably. I don’t know man, I’m not an engineer, but I do know that talking to engineers is the best way to find out what they need. (And gosh darnit, rotate your logs!!)

It’s not enough to write empathy into your company values if the policies and programs that actually affect people’s work aren’t empathetic. My advice to everyone who has to work with other people (so: everyone) has long been, “People just want to know that you see them.” Open your eyes. We’re right here. We’re ready to be seen!

In the current state of the world, people want more and more to be able to bring their whole entire selves to work, including their capital-F Feelings. The onslaught that 2020 has brought means that many of us don’t have a choice. We are all emotion all the time. As managers, the best thing we can do is address those feelings head on.

Ann Harter is a program manager with a background in technology, psychological research, and engineering. She’s currently looking for work in the middle of a pandemic, which is about as fun as it sounds. You can find her on LinkedIn here.
