Optimizing Your Critical Incidents Management Process

Published in HiredScore Engineering · 6 min read · Mar 15, 2021

Incidents happen. They always have and always will. What qualifies as an incident? One of your critical databases is not accessible, or suddenly all of your users can’t log in to your system.

Issues like these require an “all hands on deck” approach from relevant stakeholders in the company, such as Developers, Support, Customer Success, and more. These issues usually mean that your system/critical flows are not working at all, or not working as expected, resulting in very unhappy and unsettled customers.

Incidents can be caused by human error, a problem on the client’s end, or even a lightning storm that destroys half of your servers.

Handling an incident is a chaotic situation. While the key goal is to return to normal as soon as possible, it usually results in too many people trying to solve the issue, wrong or inaccurate information being shared, and stakeholders not being informed. As a result, we overlook meaningful lessons that would help us improve in the future.

This is the crossroads where you decide: will it be a crisis, or can you leverage the situation into an opportunity?

By creating a proper flow with clearly defined actions, responsibilities, and roles, and later automating it, we can turn this into a successful process from which we learn and improve.

What does an incident flow typically look like?

Most organizations feel the same. The incident flow usually looks like some sort of “organized chaos” — things eventually work, but you know deep in your heart that there must be a better way to handle it.

Does this scenario sound familiar? A customer sends an email to a specific person in the company, who then sends a global Slack message in some channel (for us, it’s the #incident channel). An assignee is asked to own the issue by a Group Leader / VP R&D and then the investigation starts.

The incident is finally over and all the communication stays in the Slack channel. A week later the cycle repeats itself because we didn’t have a proper post-incident learning session.

You can see that this process is broken in a lot of ways. In theory, the “process” should be straightforward, as can be seen in the flow below. However, most of the time it’s not.

Some clarifications:

What is the “Incident Channel”? A single place for all relevant stakeholders to communicate.
Who is the “Incident Manager”? The Incident Manager is the person responsible for bringing together all the relevant stakeholders (Support, Customer Success, Developers, and more) and ensuring the incident is handled properly by all participants. The Incident Manager is not responsible for solving the issue themselves!
What is the “Manager Summary”? A summary sent by the manager to all of the affected stakeholders. The summary includes all the relevant information regarding what happened during the incident.
What is a “postmortem”? A session between all the relevant stakeholders that were involved during that incident, to discuss what transpired throughout the timeline and identify where the team can learn and improve in the long run.

What can we do better?

Our goal is to keep the Incident Manager focused on getting things back to normal as soon as possible, so we want to make their life easier: first by making sure the flow is followed properly, and later by automating as many of the steps as possible.

In addition, I think the most important part of an incident is the “after the incident” part. We have to learn as much as we can from each incident, and a flow and templates that ensure the quality of our learnings help us achieve that. As part of this, we want to make sure the following goals are met:

Increasing information quality and availability for all of the involved stakeholders.
Allowing the Incident Manager to organize all important information in a Single Source of Truth.
Automating post-incident procedures such as the postmortem template and Manager Summary emails.
Generating reports, statistics, and KPIs.

From here, we start our journey together.

How do we do it?

We start by creating a new ticket in our JIRA ticketing system, titled “Production Issue”. This ticket holds all the information that we will use in our automations:

Things you know when you open the incident:

Who will be the Incident Manager
A detailed description of the issue
The severity of the issue
Who detected the issue (the client, monitoring, an employee).
List of impacted systems
Relevant Stakeholders
Relevant Timeline: First report and Escalation dates.
Potential # of users impacted
Number of user reports
Production Issue Impact
Workaround instructions

Things you do and identify during the incident:

Proactive communication information that was sent to users.
Any Status Page updates that were made.
The time you understood the Root Cause.

Things you need to do when the incident is over:

Filling in the Root Cause of the issue.
Filling in the First Occurrence and Resolution dates.
Filling in the Manager Summary.
Scheduling a postmortem.
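
To make the three groups of fields above more concrete, here is a minimal Python sketch of how such a ticket could be modeled. The class name, the field names, and the `missing_closing_fields` helper are all hypothetical and chosen for illustration; this is not JIRA's schema, only one way to picture the record and to check that the "incident is over" items were actually filled in.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ProductionIssue:
    """Illustrative model of a "Production Issue" ticket (field names are assumptions)."""

    # Things you know when you open the incident
    incident_manager: str
    description: str
    severity: str                 # the severity scale itself is an assumption
    detected_by: str              # "client", "monitoring", or "employee"
    impacted_systems: List[str]
    stakeholders: List[str]
    first_report: datetime
    escalated_at: Optional[datetime] = None
    potential_users_impacted: Optional[int] = None
    user_reports: int = 0
    impact: str = ""
    workaround: str = ""

    # Things you do and identify during the incident
    user_communications: List[str] = field(default_factory=list)
    status_page_updates: List[str] = field(default_factory=list)
    root_cause_identified_at: Optional[datetime] = None

    # Things you need to do when the incident is over
    root_cause: str = ""
    first_occurrence: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    manager_summary: str = ""
    postmortem_scheduled_for: Optional[datetime] = None

    def missing_closing_fields(self) -> List[str]:
        """Return the names of post-incident fields that are still empty."""
        required = [
            "root_cause",
            "first_occurrence",
            "resolved_at",
            "manager_summary",
            "postmortem_scheduled_for",
        ]
        return [name for name in required if not getattr(self, name)]
```

A check like `missing_closing_fields()` could, for example, back a reminder to the Incident Manager before the ticket is allowed to close.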

Then, using code and the built-in tools JIRA provides, we can create all the needed automations and bots.

What can we automate?

1. All the notifications: Why do you need to manually send a Slack message that an incident has started? Why do you need to update the channel every time something changes in the event flow?

Automatic reminders

2. Automatic Manager Summary: We are human and we make mistakes, especially late at night. You no longer need to remember to send a Manager Summary. Simply close the ticket and the Manager Summary is sent out automatically.

3. Writing the postmortem: Writing a postmortem is very tiring work. There is a lot of information you need to pull from a lot of different locations and compile into a single page. With automations, everything is pulled automatically into your postmortem page on Confluence.
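
The three automations above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the automation we actually run: the ticket is a plain dictionary with hypothetical field names, the "Closed" status name and the postmortem section headings are made up, and nothing here calls JIRA or Confluence. The one external detail relied on is that a Slack incoming webhook accepts a JSON body with a `text` field; the webhook URL would be your own.

```python
import json
import urllib.request
from typing import Dict, List

# Hypothetical field names; a real ticket would use your own JIRA fields.
SUMMARY_FIELDS = [
    ("Description", "description"),
    ("Severity", "severity"),
    ("Root cause", "root_cause"),
    ("First occurrence", "first_occurrence"),
    ("Resolution", "resolved_at"),
]
POSTMORTEM_SECTIONS = ["Summary", "Timeline", "Root cause", "Lessons learned", "Action items"]


def build_status_message(ticket: Dict[str, str], old: str, new: str) -> Dict[str, str]:
    """Build a Slack incoming-webhook payload announcing a status change."""
    text = (f"[{ticket['key']}] status changed: {old} -> {new} "
            f"(Incident Manager: {ticket['incident_manager']})")
    return {"text": text}


def post_to_slack(webhook_url: str, payload: Dict[str, str]) -> None:
    """POST the payload as JSON to the given incoming-webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)


def render_manager_summary(ticket: Dict[str, str]) -> str:
    """Compile the Manager Summary from ticket fields; missing ones show as TBD."""
    lines = [f"Manager Summary for {ticket['key']}"]
    lines += [f"{label}: {ticket.get(name, 'TBD')}" for label, name in SUMMARY_FIELDS]
    return "\n".join(lines)


def render_postmortem(ticket: Dict[str, str]) -> str:
    """Pre-fill a postmortem page skeleton with what the ticket already knows."""
    prefilled = {
        "Summary": ticket.get("description", ""),
        "Root cause": ticket.get("root_cause", ""),
    }
    body = [f"{name}\n{prefilled.get(name, '')}" for name in POSTMORTEM_SECTIONS]
    return "\n\n".join([f"Postmortem: {ticket['key']}"] + body)


def on_status_change(ticket: Dict[str, str], old: str, new: str) -> List[str]:
    """Return every text that should go out for this transition."""
    outgoing = [build_status_message(ticket, old, new)["text"]]
    if new == "Closed":
        # Closing the ticket is the trigger for the summary and the postmortem draft.
        outgoing.append(render_manager_summary(ticket))
        outgoing.append(render_postmortem(ticket))
    return outgoing
```

Keeping the rendering functions separate from `post_to_slack` means the texts can be checked without any network access, and the same summary could just as well be sent by email.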

The result?

By organizing and then automating the incident flow, we managed to tackle two important pain points: quality and speed.

Incidents are now resolved faster and are clear at every step for all the stakeholders, nothing falls through the cracks, and most importantly, we know how to improve for the future. We do this with the help of reports and high-quality KPIs and measurements, integrated into the process, that help us track and analyze historical data.
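
As a rough illustration of how the dates collected on each ticket can feed such KPIs, here is a small Python sketch. The three measurements below (time to detect, time to root cause, time to resolve) and the dictionary keys are assumptions made for the example; which KPIs to track is up to each team.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List

Incident = Dict[str, datetime]


def durations_minutes(incident: Incident) -> Dict[str, float]:
    """Derive per-incident durations, in minutes, from the ticket's timeline."""

    def minutes(start: str, end: str) -> float:
        return (incident[end] - incident[start]).total_seconds() / 60

    return {
        "time_to_detect": minutes("first_occurrence", "first_report"),
        "time_to_root_cause": minutes("first_report", "root_cause_identified_at"),
        "time_to_resolve": minutes("first_report", "resolved_at"),
    }


def average_kpis(incidents: List[Incident]) -> Dict[str, float]:
    """Average each duration across all incidents; empty input gives an empty result."""
    if not incidents:
        return {}
    per_incident = [durations_minutes(incident) for incident in incidents]
    return {name: mean(d[name] for d in per_incident) for name in per_incident[0]}
```

Run over the closed tickets of a quarter, a function like `average_kpis` gives a trend line you can compare from one period to the next.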

WIIFM (What’s In It For Me)?

You can turn every crisis into an opportunity, especially with automatic tools that can make your life much easier.

By clearly defining the process and designating stakeholders and information points, we can then automate it, enabling people to focus on handling the issue rather than on the “meta-process” behind it.

And what about you? Can you identify your company’s pain points and think about what the robots can do for you?