Incident management with a bot

Published in ManoMano Tech team, Mar 19, 2020

I arrived at ManoMano in January 2016 as a Lead Developer and, after a year, took the position of Lead SRE. I have experienced a multitude of incidents since I started, first as a developer and then as an SRE, and I came to the conclusion that we lacked the tooling to manage them.

Fire Truck in San Francisco — Photo by François LASSERRE on Flickr

Incident Management

Today, IT services must be available 24/7, but incidents are inevitable. If something goes wrong, the on-call person is called to get the service up and running as soon as possible.

Why does time matter so much? During downtime, you lose money every second.

For example, if you run an e-commerce website and there is an incident on your cart page, your customers cannot finish their purchases until the incident is resolved.

In fact, an Amazon outage cost on average $31k per minute in 2008; in 2020 the figure is about $220k per minute, or approximately $13M per hour (source: gremlin.com).

Fire Truck in action in San Francisco. — Photo by François LASSERRE on Flickr

FireFighter: an incident bot

Incidents are by definition unexpected and impossible to predict, so people must be prepared and trained to manage them in order to save time. An incident is always a stressful situation because each one is unique and requires several skills: team leadership, communication, troubleshooting…

At ManoMano we decided to create a bot to help manage incidents. It’s called FireFighter. It’s a Slack bot that automates many steps, which saves time and ensures that none of them is forgotten. Teams can therefore focus on the incident itself rather than on how the process should go.

Now let’s define the different incident stages.

Tools to manage an incident

Incidents cannot be avoided and it’s imperative to have a proper management workflow. To do so, multiple tools can be used like:

an alerting / monitoring system
incident tracking
chat room(s) / video chat
a documentation tool
a status page

The tools we are using to mitigate incidents are:

Confluence: for documentation and postmortems
Datadog: infrastructure and application monitoring; alerts are sent to a dedicated Slack channel, or to PagerDuty if they are more critical; notebooks are used to display graphs during the incident
PagerDuty: to trigger on-call. It also provides the name, phone number and email of the on-call person, who is then invited to the Slack channel and whose name is displayed in the documentation
Slack: to track the incident status, level and for written communication
Zoom: for audio communication

The workflow

Currently there are 6 statuses:

open: Initial status when the incident is created
investigating: Looking for a root cause
fixing: Issue’s root cause found, work has started
fixed: Fix(es) have been applied in production (incident solved)
postmortem: Postmortem is planned; a calendar link must be posted on the channel to inform everyone
closed: Everything back to normal, incident is solved and postmortem meeting is done (only if needed, see below)

There are 5 levels of severity:

SEV1: Critical issue that warrants liaison with executive committee
SEV2: Major issue impacting customers
SEV3: Minor issue affecting customers
SEV4: Minor issue not affecting customers
SEV5: Cosmetic issue not affecting customers

Postmortems are only done for incidents with severity level 1, 2 or 3.
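The statuses, severity levels and postmortem rule above can be sketched as a small data model. This is only an illustration, not FireFighter's actual code: the class and function names are hypothetical, and the assumption that statuses are visited strictly in the order listed is ours.

```python
from enum import Enum, IntEnum
from typing import List


class Status(Enum):
    """The six incident statuses, in the order the article lists them."""
    OPEN = "open"
    INVESTIGATING = "investigating"
    FIXING = "fixing"
    FIXED = "fixed"
    POSTMORTEM = "postmortem"
    CLOSED = "closed"


class Severity(IntEnum):
    """Lower number means more severe (SEV1 is critical)."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


def needs_postmortem(severity: Severity) -> bool:
    """Postmortems are only done for severity 1, 2 or 3."""
    return severity <= Severity.SEV3


def remaining_statuses(current: Status, severity: Severity) -> List[Status]:
    """Statuses still ahead of `current`, assuming a linear workflow.

    The postmortem step is skipped when the severity does not require one.
    """
    ordered = list(Status)
    ahead = ordered[ordered.index(current) + 1:]
    if not needs_postmortem(severity):
        ahead = [s for s in ahead if s is not Status.POSTMORTEM]
    return ahead
```

With this model, a SEV4 incident in the "fixing" status would only have "fixed" and "closed" left, while a SEV1 incident that is "fixed" still has to go through "postmortem" before "closed".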

An incident has just occurred

When an incident occurs, the /incident open command is used to open a new incident in FireFighter.

We strongly encourage opening incidents. If an incident was opened by mistake or turns out not to be relevant, it is closed immediately. It’s better to open an incident as soon as possible and manage it properly than to wait 10 minutes while discussing it across many channels and private messages, where messages can get lost because they are not on the dedicated incident channel.

Here is what it looks like in Slack:

window: open a new incident in the application

As the form shows, the domain of the incident has to be chosen (payment, search, main website, backoffice…), along with the severity, a brief description, and a checkbox to trigger on-call.

When the form is submitted the bot will automatically:

open a Slack channel, invite the necessary people, and post a quick summary of the incident both on that dedicated channel and on the global incident channel, with a pointer to the channel to join
create an event in Datadog to follow the incident and create a new notebook
add a message in the status page
if the on-call box is checked, activate PagerDuty
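As a rough sketch of the list above, the bot's reaction to the form could be planned as an ordered list of steps. Everything here is hypothetical (the `IncidentForm` fields, the function name and the step wording); the real bot calls the Slack, Datadog, status page and PagerDuty services, which this example deliberately does not touch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentForm:
    """Hypothetical shape of the submitted form."""
    domain: str            # e.g. "payment", "search"
    severity: int          # 1 (critical) to 5 (cosmetic)
    description: str
    trigger_on_call: bool  # state of the on-call checkbox


def plan_open_actions(form: IncidentForm) -> List[str]:
    """Return the steps to run, in order, when an incident is opened."""
    summary = f"SEV{form.severity} on {form.domain}: {form.description}"
    actions = [
        "create the incident Slack channel and invite the necessary people",
        f"post summary on incident and global channels: {summary}",
        "create a Datadog event and a new notebook",
        "add a message on the status page",
    ]
    # PagerDuty is only activated when the on-call box is checked.
    if form.trigger_on_call:
        actions.append("trigger on-call through PagerDuty")
    return actions
```

Keeping the plan separate from the calls that execute it makes the on-call condition easy to test without contacting any external service.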

Incident update

During an incident, the status can be changed with the /incident update command.

window: update an incident in the application

When the status changes, a message is posted on the incident channel and on the status page. The message is copied to the global incident channel only when the new status is “fixed”.
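The routing rule for status updates is simple enough to express as a single function. This is a minimal sketch with hypothetical target names, not the bot's actual implementation.

```python
from typing import List


def notification_targets(new_status: str) -> List[str]:
    """Where a status-change message is posted, per the rule described above."""
    # Every status change goes to the incident channel and the status page.
    targets = ["incident_channel", "status_page"]
    # Only the "fixed" status is also announced on the global incident channel.
    if new_status == "fixed":
        targets.append("global_incident_channel")
    return targets
```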

window: update incident roles in the application

Using the update command, the roles can also be assigned:

Incident commander:

acts as the single source of truth during an incident with the following responsibilities:
communicates on a single channel (the dedicated Slack channel for the incident, or Zoom if needed)
gets all the information about the incident
delegates actions during the incident

Communication lead:

helps the incident commander communicate with the different people involved during the incident
doesn’t need to know the technical side to fulfill this role
manages the incident’s status updates

Operations lead:

investigates and updates the code or configuration as decided by the incident commander
must know the technical side to fulfill this role (e.g: Dev, SRE)

Postmortem

A postmortem is a process intended to help you learn from past incidents. A successful postmortem process is based on a culture of honesty, learning, and accountability. Writing a postmortem is a collaborative effort and should include everyone involved in the incident response.

A postmortem template is used so that every postmortem follows the same format and is quick to fill in.

Once the incident is fixed, it can be given the “postmortem” status. The FireFighter bot then creates a monthly folder (for example “2020–03”) in the Confluence postmortem folder, if it doesn’t already exist. It then creates a Confluence page for this incident with all the incident information:

date of the incident
when the incident occurred
attendees
the timeline
people and roles: incident commander, communication lead, operations lead

Then a meeting is planned with all the attendees to talk about the incident and fill in the postmortem.
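The "create the monthly folder only if it doesn't exist" step can be sketched as follows. The function names are hypothetical, the set of folder names stands in for the Confluence space (no Confluence API is called), and the year-month format with a plain hyphen is an assumption based on the example folder name.

```python
from datetime import date
from typing import Set


def postmortem_folder_name(incident_date: date) -> str:
    """Year-month folder name; the exact format is an assumption."""
    return incident_date.strftime("%Y-%m")


def ensure_folder(existing_folders: Set[str], name: str) -> bool:
    """Create the folder only if it is missing.

    Returns True when the folder was created, False when it already existed.
    """
    if name in existing_folders:
        return False
    existing_folders.add(name)
    return True
```

Making the step idempotent means that several incidents in the same month can each request the folder without creating duplicates.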

Closing an incident

When the postmortem meeting is over, the incident can be closed with /incident close. This command posts the postmortem link on Slack so that everyone can read it, then automatically archives the incident channel.

window: close an incident in the application
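The three commands used in this article (/incident open, /incident update and /incident close) share the same slash command, so the bot has to look at the text that follows it. Slack delivers that text in the `text` field of the slash command payload. The parser below is a minimal sketch of that dispatch step; the function name and error messages are hypothetical.

```python
from typing import Tuple

# The actions described in this article; others could be added later.
KNOWN_ACTIONS = ("open", "update", "close")


def parse_incident_command(text: str) -> Tuple[str, str]:
    """Split the text following /incident into (action, remaining arguments)."""
    parts = text.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("missing action, expected one of: " + ", ".join(KNOWN_ACTIONS))
    action = parts[0].lower()
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    rest = parts[1] if len(parts) > 1 else ""
    return action, rest
```

Rejecting unknown actions early lets the bot reply with a helpful message instead of silently ignoring a typo during a stressful moment.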

What next?

With our bases now covered, we are investigating a few additional features:

- Extend the bot with a web interface for enhanced monitoring capabilities (incident status, duration, severity and so on…).
- More commands, like /incident info to get more detailed information about an incident.
- Generate the Postmortem’s timeline based on the incident channel activity.
- And more to come.

New posts are being planned to follow our progression on this work, so stay tuned.