Ultra Modern Incident Management

Travis DePuy
True Tales from Engineering
4 min read · Jul 19, 2019

Bring automation to the things that can be automated so you can get back to doing something creative. This blog article originally appeared on xMatters.com.

The Ancient History

I started my career at Peregrine Systems, doing 2nd level technical support for Service Center, which was an application for standard ITIL processes such as Incident, Change and Request Management.

As support engineers, we were on call for priority issues, and we managed who was on call when using “The Sheet.”

The Dreaded On-Call Sheet

Anyone who did on-call work in the early 2000s is probably pretty familiar with “The Sheet.” It was one Excel sheet residing on a shared drive, and it was the source of record for the on-call schedules by hour, day and week. And man, was it a nightmare. People would leave work without saving their changes and forget to unlock “The Sheet,” and we would need to work off an old copy and hope for the best. Invariably, someone would call in sick or go on vacation, and “The Sheet” would still have them in rotation, which we only discovered when they didn’t answer the phone after three tries and someone mumbled something about them being in Hawaii.

In the timeline of an incident, every moment is critical and organizations can’t afford to burn through 20 minutes before even starting to triage.

Often the incident involved more than just one team, so then we’d have to track down a person from the DBA or Network teams. These teams had their own versions of “The Sheet,” and the effort to find an on-call resource would begin again. Eventually, the teams would be assembled.

As the timeline progressed, the teams would resolve the issue and work to restore service, but not always on time. In hindsight, I would call this Ancient Incident Management.

The Modern Era

Later, when I was introduced to xMatters and then started working there, I found a much more resilient tool for scheduling on-call resources. As pagers started dying off and the world moved to smartphones, more efficient ways of tracking someone down or delivering information helped to reduce the incident timeline even further.

Cutting out much of the manual effort in favor of automation brings incident management into the modern era, and so we call it Modern Incident Management. With modern tools, teams were able to drastically cut down on the time to engage the right people and assemble representatives from different teams, both of which affect the impact and timeline of an incident.

Modern Incident Management — a more elegant weapon for a civilized age

This is all well and good, but today is a different time. Organizations are adopting digital transformation to accelerate all aspects of the business, and Incident Management is one of the most critical processes in any organization. It needs to continuously modernize and streamline as tools mature and manual tasks become automated. This is critical for maximizing the uptime of lines of business.

Ultra Modern

On-call schedule tools aside, how many other tools does your incident process touch?

Once the monitoring tool triggers an alert, what happens then? Do you paraphrase the alert to open a Jira issue? How about a ServiceNow incident? Does your process require both?

If the incident is severe enough, do you click the friendly + icon to create a channel in Slack, followed by an @ mention to the relevant people? Do you know who is currently on call?

Do you open your Statuspage application and enter the relevant details? Do you write an email informing the Stakeholders who hold stake over your job?

With so many teams to keep informed or collaborate with, and people manually shuffling data from one place to the next instead of taking more productive action, the process might look something like this:

Manual Touch Points — people pushing clipboards

How much time does updating all these tools take? 5 minutes? 10 minutes? More? Again, this is all time that could be better spent supporting the troubleshooting efforts or preparing the post mortem.

Automating manual steps will cut even more time in the process, help smooth the rough edges, and ensure consistency. So instead of all the human touch points above, you now have automated touch points in the diagram below, displayed with (x) icons.

I’m terming this Ultra Modern Incident Management. It is another melding of human process with automated steps and can further cut down an incident lifecycle.
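To make the idea a little more concrete, here is a minimal sketch of what replacing those manual touch points might look like. Everything in it is an assumption for illustration: the `Alert` fields, the severity threshold, and the handler functions are hypothetical stand-ins, not the API of xMatters, Jira, ServiceNow, Slack or Statuspage. Each handler only records what it would do instead of calling a real service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    """A monitoring alert. The field names are illustrative assumptions."""
    service: str
    summary: str
    severity: int  # 1 = most severe


@dataclass
class Incident:
    alert: Alert
    # Records what each automated touch point did, in order.
    actions: List[str] = field(default_factory=list)


# Hypothetical stand-ins for real integrations (ticketing, chat, status page).
# Each one only records what it would do; no network calls are made.
def open_ticket(incident: Incident) -> None:
    incident.actions.append(f"ticket opened: {incident.alert.summary}")


def page_on_call(incident: Incident, on_call: Dict[str, str]) -> None:
    # Look up the current on-call person for the affected service.
    person = on_call.get(incident.alert.service, "unassigned")
    incident.actions.append(f"paged: {person}")


def create_chat_channel(incident: Incident) -> None:
    incident.actions.append(f"channel created: inc-{incident.alert.service}")


def update_status_page(incident: Incident) -> None:
    incident.actions.append(
        f"status page updated: {incident.alert.service} degraded"
    )


def handle_alert(alert: Alert, on_call: Dict[str, str], severe_at: int = 2) -> Incident:
    """Run the touch points a person would otherwise do by hand."""
    incident = Incident(alert)
    open_ticket(incident)
    page_on_call(incident, on_call)
    # Only severe incidents get a dedicated channel and a public status update.
    if alert.severity <= severe_at:
        create_chat_channel(incident)
        update_status_page(incident)
    return incident


if __name__ == "__main__":
    on_call = {"checkout": "dana"}
    result = handle_alert(Alert("checkout", "error rate above 5%", 1), on_call)
    for action in result.actions:
        print(action)
```

The point of the sketch is the shape, not the details: one alert comes in, and every downstream tool is updated consistently without anyone paraphrasing the alert from one place to the next.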

Evolving out of the ancient world of spreadsheets brings the process into modern times with modern tools. Further evolution involves keeping the tools up-to-date with the processes they support. In these days of many diverse tools across many teams, a new paradigm is needed: Ultra Modern Incident Management.

Automate the things that can be automated so you can get back to doing something creative.
