Patterns and Anti-Patterns of the Primary Interrupt

Credit: https://www.flickr.com/photos/deborah_s_perspective/12548754903

Contents

Preface
What Is An Interrupt Rotation?
Interrupt Rotation Anti-patterns
Anti-pattern: Doing Project Work During Your Interrupt Shift
Anti-pattern: Working Ridiculous Numbers of Hours on Tickets
Anti-pattern: Silently Struggling
Anti-pattern: Taking Steps That Make You Uncomfortable
Anti-pattern: Ignoring Difficult Tickets
Interrupt Rotation Patterns
Pattern: Don’t Worry! Everything Is Cool.
Pattern: Be Nice! Interrupt Is A Customer Service Role.
Pattern: Find the Best Person To Address the Problem
Pattern: Rely On Your Team
Pattern: Push Back To Collect Details
Pattern: It’s OK To Decline Invalid Requests
Pattern: Take Only Comfortable Steps
Pattern: Document Your Progress and the Lessons You Have Learned
Pattern: Budget Your Time
Pattern: Improve the Rotation
Conclusion
Acknowledgements

Preface

Interrupt rotations can be stressful and challenging. Perhaps …

  1. you get a lot of tickets
  2. you get tickets that require several steps
  3. you get unclear tickets
  4. you get ANGRY TICKETS WRITTEN LIKE THIS
  5. you get tickets with no playbook entry, or the playbook entry is poorly written, or has frightening steps.

Do not be fooled into thinking that your job as primary interrupt is to handle every request, precisely as it’s asked for, by yourself, as fast as possible. It’s certainly a mistake if by the end you want to throttle your ticketing system.

Part of the problem is that ticket management requires a very different work style from what’s required to write software. You’re great at writing software. But like all things, being good at interrupt work requires training and practice.

Use this document to address your own unhealthy behavior while operating inside a reasonably healthy interrupt rotation of a reasonably healthy team. This document does not address the flaws of interrupt rotations. If there’s really too much work to do in a single day, or the manager is a tyrant, or nobody provides training, you have other issues.

Even so, this advice might not fit your team, so take that into consideration. Most important, as I say below:

… please don’t rush. Real emergencies are life and death situations. This is not an emergency.

If you literally deal with life and death situations, this document is not for you. You can get better guidance elsewhere.

Finally, if you are a manager or leader looking at how to build an interrupt rotation, you can use this as a guide. For instance, I urge that the primary interrupt skip normal project work during their rotation. If you’re expecting the primary interrupt to do both jobs at once, that’s your prerogative, but it comes at a well-documented cost. But I trust you.

What Is An Interrupt Rotation?

Any engineering team with something in production will have an influx of requests that disrupt their normal flow of work. Examples include fielding requests on a support mailing list, triaging bugs, ad-hoc reporting, and minor data change requests. Eventually many teams will share this responsibility across all their members on a schedule, and this is what I refer to as an interrupt rotation.

While the interrupt rotation could go by other names, for instance, on-duty rotation, this is not the same as an on-call rotation. The term on-call itself typically suggests that someone is available 24 hours a day, 7 days a week, and is reserved for urgent issues, where the engineer is paged to come triage a crisis. Interrupt rotation, done properly, is a far less adrenaline-packed experience, and the work ends when the employee goes home.

I’m making these assumptions about your team’s interrupt rotation:

  • The purpose of the interrupt rotation is to address repetitive tasks or ad-hoc external requests.
  • Your source of interrupt requests comes from a ticketing system. Your interrupt rotation might have a number of sources of work tasks (email, bugs, and a regular checklist) but let’s just keep it simple and call them tickets.
  • You have a playbook that documents how to handle most of the incoming tickets, hopefully reducing them to simple repeatable tasks.
  • You have a primary interrupt and a secondary. The secondary is often not actively working on interrupt tasks, and is usually the next primary interrupt.
  • These incoming requests are almost never urgent.
  • Your team is relatively healthy and doesn’t have an ongoing crisis on their hands.

This still leaves some room for variety: your rotation might just be responsible for triaging tickets rather than addressing them. Your secondary might have a very active role. No matter what, I hope you can find a useful way to apply this to you and your team.

Interrupt Rotation Anti-patterns

By anti-patterns I mean styles of work that are counterproductive or potentially hurtful. Each anti-pattern is described below, and alternative, positive patterns are suggested.

Anti-pattern: Doing Project Work During Your Interrupt Shift

This is probably the hardest anti-pattern to break, so let’s get it out of the way: Stop trying to do project work and interrupt work at the same time. Tell me about the time you simultaneously did both jobs well.

Avoiding project work while on rotation is a common practice among Google SREs and is well covered in Bad Machinery: Managing Interrupts Under Load. (This is recommended viewing and reading.)

While there’s always an interest in reducing time engineers have to spend on tickets, dealing with tickets is a reality of many teams, and you’re expected to budget that time appropriately. If your standard project work really cannot wait, then you should not be primary interrupt, and you can rearrange your priorities so you can handle tickets another week. Talk with your manager about choosing priorities. (It’s also possible an individual interrupt rotation is too long and your team needs shorter rotations, but the anti-pattern here is more about you being unable to withstand the pressure of your normal work routine.)

When the ticket queue is empty, use your time to Improve The Rotation.

A side note about expectations: a big reason people in the primary interrupt role do some of their normal project work during the rotation is because they feel like they’re letting the team down. That’s a normal feeling, but it’s often wrong, because what you’re doing is taking on other work so the rest of your team can move forward. And they should be grateful that you’re doing that.

Alternative healthy patterns:

Anti-pattern: Working Ridiculous Numbers of Hours on Tickets

Let’s assume at this point that you’re not doing project work at the same time as tickets. And yet tickets come in faster than you can handle them. “I’m a good primary interrupt,” you say, “and my team needs me!” So you work late one night. One night leads to two nights, leads to an entire week of working late, forgoing your personal time, and now you’re just cranky.

I had a teammate who literally worked sixty hours a week during their interrupt rotation. They finally hit their limit and said, “If I have another rotation like this, I’m going to quit.” Now, some of that was due to their precise work style and strong sense of self-reliance, and some of it was due to receiving a higher load of tickets than normal.

It doesn’t change the fact that some people have a harder time with the type of work an interrupt rotation entails. Part of why people become software engineers is because they like the type of work patterns that come with software engineering, with longer periods of thought lacking the emotional distress of switching context.

Regardless of your work ethic or incoming workload, it’s neither in your interest, nor the interest of your team, for you to handle all these tickets yourself. That amount of effort breeds resentment and frustration, and nobody wants that.

If you find yourself taking that long to resolve tickets (no matter what) try one of the alternatives below.

Alternative healthy patterns:

Anti-pattern: Silently Struggling

If you find yourself wading in unfamiliar territory and having the time of your life, then this anti-pattern doesn’t apply to you. If you find yourself wading in unfamiliar territory and getting increasingly upset, then listen up.

Please don’t silently struggle through your interrupt rotation shift. That’s definitely not what your shift is supposed to be about. Reach out and ask for help. You can get it from many directions.

Alternative healthy patterns:

Anti-pattern: Taking Steps That Make You Uncomfortable

Some tickets touch on some tricky and sensitive areas. Some tickets still require you to do surgery right in the database. Or you may be faced with an unclear playbook entry. Or perhaps the requestor is asking for something you believe is outside the bounds of proper system use.

Or perhaps you think you should consult someone on your privacy team or legal team.

In all these cases, you might be faced with choosing to move forward without a clear path because something is marked urgent, or because you have too many tickets in the queue. Don’t do it! Read some of the sections below, instead!

Alternative healthy patterns:

Anti-pattern: Ignoring Difficult Tickets

Leaving incomplete tickets for the next interrupt — well, it’s not a cardinal sin, it happens often enough. But here’s a recent event that shows how it can go poorly:

A customer requested a specialized extract of available, but relatively short-lived data. Since the ticket was marked low priority, and we didn’t have a good playbook for the information, it remained untouched through the shift. The next primary interrupt, well, they didn’t touch it either, nor did the two after.

There was very little ephemeral data left for the unhappy customer when someone actually took action on the ticket.

Alternative healthy patterns:

Interrupt Rotation Patterns

Consider these work styles when you’re overwhelmed with interrupt tickets.

Pattern: Don’t Worry! Everything Is Cool.

There are problems, and then there are real problems. Unless you received a ticket revealing a severe breach of your system, or confidential data made visible to the outside world, you’re not dealing with real problems. (I’ve been fortunate to avoid catastrophes like these thanks to well-designed systems from before my time.)

Most of your tickets will be boring, and if you’re fortunate, easy to solve.

It’s not your fault if someone clicked the wrong button and desperately needs you to change data before they get embarrassed, and it’s still not your fault if it isn’t fixed in time. Just stay cool and do your best.

Oh yes, sometimes the tickets are about something going horribly wrong. As one colleague said:

“We’ve had cases where our alerting sucked and onduty ended up getting an innocent-looking wisp-of-smoke ticket that turned out to be the world on fire.”

And yet I’m still telling you to stay cool. In cases like that, the best thing to do is locate your primary on-call and share what you can with them. And then it becomes an incident. Incident management is another topic entirely and addressed by people much smarter than me.

Not to put too fine a point on it, but we’re not supporting pacemakers or nuclear power plants here, right? (If you support pacemakers and nuclear power plants, please write in and teach me something.)

Pattern: Be Nice! Interrupt Is A Customer Service Role.

Human beings won’t write their tickets very well. It happens all the time. And they won’t prioritize their bugs just right. They’re not stupid, they’re just experts in different things than you are. After a little exhaustion, you might find it harder to give requestors your best self.

Consider that your job at this point is that of a customer service representative. I like this definition of good customer service:

Good customer service means helping customers efficiently, in a friendly manner. It’s essential to be able to handle issues for customers and to do your best to ensure they are satisfied.

Before you do any work, make sure you understand their request. If the data is unclear, suggest an alternative. If you think their ticket’s priority is too low, change it to reflect what they want and need.

And if you’re having a hard time because you need a break, Don’t Worry! Everything Is Cool. Go take a break.

Pattern: Find the Best Person To Address the Problem

In most rotations, you should not be responsible for doing all the work on a ticket. You might not have the expertise, the playbook or access permissions. In cases like that, the solution is to pass the ticket on. If you don’t know who to assign it to, try to Rely On Your Team.

This can easily turn into an anti-pattern of sorts where you don’t put in much effort before reassigning work. Please do your due diligence.

Pattern: Rely On Your Team

People think being the primary interrupt means “don’t interrupt anybody else.” But asking for help is a fine part of a healthy interrupt rotation. Sure it’s good to be competent at your job, but given that interrupt rotation can often place you in areas outside your expertise, you have to do something about it.

Just remember: you are not alone. If you find yourself stuck, here are places you can go:

Use your secondary

This is important: your secondary is there to help you. If you have a ticket you don’t understand, ask your secondary for help. If something really requires immediate assistance, ask your secondary for help.

Depending on your workload, a secondary might rarely get involved with interrupt duties until they become the primary. Even if they don’t budget time for tickets and aren’t expected to do them, the secondary should nevertheless be available and happily (not grudgingly) prepared to do tickets when asked. The primary is only asking because they can’t handle all of the tickets in the queue.

In other words, secondary changes nothing in the default case, but if you need help, your secondary is there for you. (If it does start coming up regularly that secondary gets asked to help, then that’s a problem and you should address it.)

Use your team

If your secondary is unable to help, get more support from anyone on your engineering team, whether it’s an expert on a neighboring team, your tech lead, your manager, or your team’s SREs or other production experts. Use a team communication channel to broadly request help; that’s what it’s there for.

Use a buddy

Sometimes you are concerned that you’ll fat-finger the wrong data. If that’s the case, find someone to sit next to you and verify each step. This particularly works when Take Only Comfortable Steps doesn’t cut it.

Pattern: Push Back To Collect Details

This seems like an obvious Pattern, asking the requestor what they mean, but it surprises me how often people won’t do it. It’s not your job to guess what they’re saying, but it is your job to get the right information out of them.

Maybe they’re new to the rotation, or the request is for an area they’re not yet adept in. Or, if your teams are like mine, you’re taking tickets for products that belong to sibling teams outside your own area of expertise. Leaving it unaddressed means Ignoring Difficult Tickets, and nobody wants that.

On the other side of the request, sometimes the requestor doesn’t make clear which tool or product they are using. Or they’re using business jargon specific to their team. So if there’s something you don’t know, just ask the requestor for more details.

I’m sorry, this isn’t my domain of expertise. What do you mean by TFN?

Or, once you think you understand, repeat it back to them for confirmation.

If you notice a repeated pattern of the same kind of requests, it might be a good time to Document Your Progress and the Lessons You Have Learned, or even Improve The Rotation.

Pattern: It’s OK To Decline Invalid Requests

Sometimes the response to a request is, “I’m sorry, we can’t fix that,” and sometimes it’s “I’m sorry, but this will take days to fix,” and once in a while it’s, “I’m sorry, that’s not ethical.” It costs nothing for someone to say their request is urgent and important, but we have to weigh that against the time it might take to fix, and the value it would bring. Your team (probably with advice from a lead or manager) can decide it’s not worth fixing.

Along similar lines, you might be asked to clean up the mistakes a user made just so they don’t look bad, or you might be asked to change some sensitive data that you shouldn’t. It’s fine to respond with “I’m not sure I’m allowed to do that. Let me check with my manager.”

You are always welcome to push back, and you can Rely On Your Team for support or as an escalation path, or speak with your tech lead or manager to get clarity or handle sensitive requests. Also Take Only Comfortable Steps.

No matter what you do: Be Nice! Interrupt Is A Customer Service Role.

Pattern: Take Only Comfortable Steps

Tickets are almost never clear, especially those written by human beings. And sometimes that human might not know what they want. Or they want something unethical. Or the playbook entry is out of date, requires skills you don’t have, or worse, there’s no playbook at all.

Find an expert. Find a buddy. Write down your plan, show it to a buddy, and test your steps in a non-production environment.

And most important, please don’t rush. Real emergencies are life and death situations. This is not an emergency.

Assuming you already tried Find the Best Person To Address the Problem, and Rely On Your Team (particularly Use a buddy), the next step is don’t do it. If you’re not comfortable, don’t do it. Find your manager, explain the problem, and listen to their advice. You’ll either get simpler steps, another buddy or guidance on kindly letting the requestor down.

Pattern: Document Your Progress and the Lessons You Have Learned

Consider this made-up incident: you are asked to change some data in production. You run a command that changes the data, and close the ticket. Later you’re told you changed the wrong record, and have to restore that record to its original state. But you have no information about the original state, so the only option is for someone to restore the database. For one not-terribly-consequential record.

Here’s another one. It’s the same set-up, except that nobody tells you that you changed the wrong record. Instead, something broke right around the time you ran your command. While chasing down the root cause, people want to know what you actually ran and what it actually did.

In both of these cases, having a record of your changes helps a great deal. In fact, whenever I have to change data, I follow the same four steps:

  1. Run a command that shows the data in the prior state
  2. Run a command that changes the data
  3. Run a command that shows the data in its new state
  4. Save the information, either in a document, or in the original ticket.
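The four steps above can be sketched in code. This is only an illustration, assuming the data lives in a SQL database reachable from Python's standard sqlite3 module; the helper name change_with_record, the users table, and the email values are all hypothetical, and your playbook's own tools take precedence.

```python
import json
import sqlite3


def change_with_record(conn, select_sql, update_sql, params):
    """Hypothetical helper: show, change, show again, and return a record."""
    before = conn.execute(select_sql, params).fetchall()  # step 1: prior state
    conn.execute(update_sql, params)                      # step 2: the change
    after = conn.execute(select_sql, params).fetchall()   # step 3: new state
    conn.commit()
    # Step 4: everything needed to explain or undo the change later.
    return {
        "select": select_sql,
        "update": update_sql,
        "params": params,
        "before": before,
        "after": after,
    }


# Demo against a throwaway in-memory database (made-up table and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users VALUES (42, 'old@example.com')")

record = change_with_record(
    conn,
    "SELECT id, email FROM users WHERE id = :id",
    "UPDATE users SET email = :email WHERE id = :id",
    {"id": 42, "email": "new@example.com"},
)

# Paste this text into the ticket or your diary while you work.
print(json.dumps(record, indent=2))
```

Because the record holds the prior state, restoring one wrong record no longer requires restoring the whole database.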

There’s another great place to put this: in a diary. Does your rotation have a diary? Put your notes in there. And even if it does, keep your own.

And please, write it down while you’re doing it, and not at the end of your shift when you’ve had plenty of time to forget it.

This is of particular value when you take steps not listed in the playbook. In fact, take what you’ve learned and Improve The Rotation.

Pattern: Budget Your Time

There are entire books on the subject of time management. I’ll do my best in under 500 words. First, I will once again recommend Bad Machinery: Managing Interrupts Under Load, which comes from Google’s SRE teams.

There are typically three reasons why your normal project workload may draw you away from interrupt: Standard project work, Meetings, and Short interruptions for questions and conversations.

Standard project work

Make sure your team expects you to do no project work while you’re primary interrupt. When your team asks about your availability that week, it’s none. If you’re lucky enough to have some extra time, see what you can do to Improve The Rotation.

Meetings

On my team, if you’re primary interrupt, you are welcome and encouraged to consider all meetings optional. Here’s a little tactic you can use when faced with push-back.

Them: Hey! Let’s have a meeting!

You: Sorry, I can’t. I’m primary interrupt.

Them: Really?

You: Really. There’s no emergency, right?

Them: There’s no emergency, but really really?

You: /shrug

If they can’t move the meeting, ask someone to attend on your behalf.

Naturally it doesn’t always work, but it works more often than most people realize.

Short interruptions for questions and conversations

It’s nice to have people nearby to talk with, but you have to be able to shut everyone out and focus for an hour or two — you’re entitled to sit at your desk uninterrupted for periods of time. Not only that, it’s critical to have that kind of time to get into the flow. This study suggests that any interruption increases stress and frustration. You can control the incoming flow of information by either closing your email tab or muting your instant messages.

And then we get to being interrupted in-person. Without a visual cue it’s not clear to other people that you’re trying to focus.

Open plan office rule #1: you do not speak to someone wearing headphones unless the building is on fire.

I use headphones. They go on when I want to focus and not be disrupted. That doesn’t mean I can’t be interrupted with headphones on, but if it’s not urgent, the person might have to wait. Here’s an example conversation, when someone gets my attention with my headphones on.

Me: Is it urgent?

Them: No. I just want to…

Me: Is it possible for this to wait until later? I’m trying to focus on these tickets.

Them: Oh, sure. Let me know when you’re free.

You might hate headphones. So try a sign that says “Busy, please do not disturb.” Whatever, make it obvious to people that you’re not available. Perhaps you need an agreement across your team. Start the conversation.

Pattern: Improve the Rotation

Of all the Patterns this is the one that provides the most value to your team. It makes the rotation easier for the next person, and sets the example to your team of going beyond your immediate responsibilities for more impact. Most people have heard of the Campsite Rule, the Scouting Rule, or the Boy Scout Rule, which is:

when camping, leave a campsite in a better condition than how you found it.

But here, leaving something in a better condition means updating the playbook, automating a process, fixing a tool, or learning a new way to solve common problems. Or, you could even go one step beyond and look at what’s causing issues in the first place, and make significant improvements that reduce ticket counts! Over at FullStory they embody this concept in a much larger sense. They call it Bionics. Read it! See if it’s something you want to apply to your own team.

Conclusion

The interrupt rotation can be a software engineer’s torture chamber. If you’re in a healthy work environment, that torture is self-imposed either through ignorance or habit. Good ticket management comes down to some basic self-care: Do what you can do, get help for what you can’t do, and make things easier for the next time.

Acknowledgements

Thanks to these people for their detailed and thoughtful comments: Tanya Reilly, Matt Mastracci, Irene Chung, Jon Bright, Pete Matt. Without their help, this document would be a sloppy bowl of alphabet soup.

This is a vastly expanded version of something I wrote for my team at Google in 2014. Thank you to everyone who saved me from hating every minute of my interrupt duties.

Robert Konigsberg

Software engineer. Opinions are my own.