Automated Incident Management Through Slack
How Airbnb automates incident management in a complex, rapidly evolving ensemble of microservices.
Incidents are unforeseeable events that disrupt normal business operations and are inevitable in complex systems that must be up and running 24/7. This is why it’s important to prepare and to train people to handle incidents in a timely and organized manner. Although each incident is unique, we follow the same procedure for detection, escalation, management, and resolution of incidents.
At Airbnb, we use a service-oriented infrastructure in which many interconnected services are managed by small teams. Quickly figuring out which service is in trouble and whom to page is paramount to timely incident resolution. We found that our teams spent a lot of time switching between applications such as Slack, PagerDuty, and Jira to raise an incident, page responders, and provide context. To resolve incidents faster, we developed an incident management bot, a centralized automation tool for incident management.
Incident Management Slack bot
Our goal was to centralize incident management in Slack. Everyone at Airbnb is familiar with and has access to Slack, and it’s easy to bring people and resources together in an incident channel. In addition, the incident channel acts like a timeline of events which makes putting together a post mortem report easy.
Our requirements were as follows:
- Run in Airbnb’s service-oriented infrastructure and have full support from our team.
- Standardize incident-related communications across tools such as Jira, Slack, and PagerDuty.
- Centralize incident management in Slack.
- Single intake funnel for incidents with clearly defined steps.
- Automate post-incident tasks such as setting up meetings and archiving channels.
- Provide incident timelines and metrics.
We decided to build our own app so that it would meet our exact specifications and be easy to customize and extend. We chose to build the app in Go because of its great community and its well-documented Slack library.
Finally, we decided to use chat commands instead of slash commands so that all commands sent to the bot would be visible to the members of the Slack channel.
Our incident management bot achieves incident response automation through four key commands:
- new incident <summary>: Create a Jira ticket and page incident managers.
- new channel <ticket>: Create an incident Slack channel for an open incident ticket.
- page <service|user>: Page the on-call(s) for a PagerDuty Service or a user directly.
- get timeline: Compile a concise timeline of important chat events for post-incident analysis.
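As a rough illustration of how chat commands like these could be recognized, here is a minimal Go sketch. It is not the bot’s actual code: the `Command` type, the `knownCommands` list, and `ParseCommand` are hypothetical names, and the sketch assumes the bot receives the message text with any mention of the bot already stripped.

```go
package main

import "strings"

// Command is a hypothetical parsed form of a chat message sent to the bot.
type Command struct {
	Name string // one of the prefixes in knownCommands
	Arg  string // free-text remainder: summary, ticket, or page target
}

// knownCommands lists the four command prefixes described above.
// None of them is a prefix of another, so the order does not matter.
var knownCommands = []string{"new incident", "new channel", "page", "get timeline"}

// ParseCommand matches text against the known prefixes, ignoring case and
// extra whitespace. It reports false when the text is not a command.
func ParseCommand(text string) (Command, bool) {
	// Collapse runs of whitespace so "new   incident" still matches.
	normalized := strings.Join(strings.Fields(text), " ")
	for _, name := range knownCommands {
		if len(normalized) < len(name) ||
			!strings.EqualFold(normalized[:len(name)], name) {
			continue
		}
		rest := normalized[len(name):]
		if rest == "" {
			return Command{Name: name}, true
		}
		// Require a space so that "pager" is not read as "page r".
		if rest[0] == ' ' {
			return Command{Name: name, Arg: rest[1:]}, true
		}
	}
	return Command{}, false
}
```

Under these assumptions, `ParseCommand("new incident checkout errors spiking")` yields the name `new incident` with the summary as its argument, and a dispatcher could switch on `Name` to call the matching handler.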
Incident Response Lifecycle
We have defined four separate phases of an incident: detection, communication, escalation and resolution. Each of the bot’s commands automates tasks that would normally require coordination during these distinct phases.
Most of our incidents are detected by our monitoring and alerting tools, although sometimes we learn about incidents from our team members or customers. No matter how an incident is detected, having a single intake funnel for all incidents is crucial for effective incident detection. Our bot solves this by providing the “new incident” command.
New incident <summary>
This command creates a blank Jira ticket with default settings and asks the user whether they’d like to page an incident manager.
Whether or not the user chooses to page an incident manager, a popup then asks them for additional information.
This allows us to escalate incidents quickly while still allowing the incident responder to provide valuable information for the incident managers. These fields are optional in the interests of urgency and can be filled out later if needed.
Another important first step is to set up communication channels and provide as much context as possible to responders.
New channel [Jira ticket]
This command takes an optional Jira ticket as a URL or key. If none is provided, it shows the five most recently opened incident tickets for the user to choose from. The bot then creates a channel named after the Jira ticket key, uses the summary as the title, and invites all incident managers.
To provide context to all users invited, the channel’s topic is set to the Jira ticket link along with the summary of the Jira ticket. In addition, we update the Jira ticket with a link to the newly created Slack channel.
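The naming and topic steps above can be sketched in Go roughly as follows. This is an illustration under assumptions, not the bot’s implementation: `ExtractTicketKey`, `ChannelName`, `ChannelTopic`, the `/browse/<key>` URL layout, and the ` | ` separator are all hypothetical choices, and the sketch only handles URLs that end in the ticket key (no query strings).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// jiraKeyPattern matches a whole Jira-style issue key such as "INC-123".
var jiraKeyPattern = regexp.MustCompile(`^[A-Z][A-Z0-9]+-[0-9]+$`)

// ExtractTicketKey accepts either a bare key or a URL whose last path
// segment is the key, and returns the key in upper case.
func ExtractTicketKey(input string) (string, bool) {
	candidate := strings.TrimRight(strings.TrimSpace(input), "/")
	if i := strings.LastIndex(candidate, "/"); i >= 0 {
		candidate = candidate[i+1:]
	}
	candidate = strings.ToUpper(candidate)
	if !jiraKeyPattern.MatchString(candidate) {
		return "", false
	}
	return candidate, true
}

// ChannelName derives the channel name from the ticket key. Lower-casing
// is an assumption made here because Slack channel names are lower case.
func ChannelName(key string) string {
	return strings.ToLower(key)
}

// ChannelTopic combines the ticket link and the ticket summary.
// baseURL is a placeholder for the Jira instance address.
func ChannelTopic(baseURL, key, summary string) string {
	return fmt.Sprintf("%s/browse/%s | %s", strings.TrimRight(baseURL, "/"), key, summary)
}
```

With these helpers, a ticket given as `https://jira.example.com/browse/INC-123` would map to a channel named `inc-123` whose topic carries the ticket link and summary.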
You may have heard about the Log4j security vulnerability, which was characterized as the single biggest and most critical vulnerability of the last decade. Within 72 hours of the vulnerability’s disclosure, there were reports of 840,000 attacks on companies globally, which turned into 100 internet-wide attacks per minute over the following weekend.
At Airbnb, we have over a thousand microservices with hundreds of small teams managing them, which posed a unique challenge for us. We had to identify all vulnerable services and quickly reach out to their respective owners for mitigation. This is where our Slack bot really shone, allowing our Incident Managers to reach service owners and coordinate the rollout of the fix much faster than before. Within minutes, the bot was used to page over 300 teams to assist with assessing impact and deploying patches. We estimate this saved 4 hours compared to paging these teams manually, not to mention reducing the time spent in a vulnerable state.
Page <shortcut|service name|slack user>
The page command can be given a service shortcut, a service name, or a Slack user.
To get started, the user can view a list of shortcuts by typing “page list”.
Each shortcut corresponds to a PagerDuty service ID which will be used when creating a PagerDuty incident. The shortcuts are easily customizable by editing a YAML file.
If a user types in a service name which doesn’t match any shortcut, a search is done in the PagerDuty service directory and results are displayed for the user to choose.
Once a user chooses the service they want to page, they’re asked to confirm, and a new PagerDuty incident is created for that service.
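The lookup order described above (shortcut first, directory search as a fallback) could be sketched like this in Go. The `Resolver` and `Service` types are hypothetical, the shortcut map stands in for whatever is loaded from the YAML file, and the in-memory `Directory` slice stands in for a real PagerDuty directory search, which this sketch does not call.

```go
package main

import (
	"sort"
	"strings"
)

// Service is a hypothetical, trimmed-down view of a PagerDuty service.
type Service struct {
	ID   string
	Name string
}

// Resolver holds the shortcuts and a local copy of the service directory.
// Shortcut keys are assumed to be stored in lower case.
type Resolver struct {
	Shortcuts map[string]string // shortcut -> PagerDuty service ID
	Directory []Service
}

// Resolve returns the service ID when query is a known shortcut. Otherwise
// it returns the directory entries whose name contains the query, sorted by
// name, so the user can pick one and confirm.
func (r Resolver) Resolve(query string) (serviceID string, candidates []Service) {
	q := strings.ToLower(strings.TrimSpace(query))
	if q == "" {
		return "", nil
	}
	if id, ok := r.Shortcuts[q]; ok {
		return id, nil
	}
	for _, s := range r.Directory {
		if strings.Contains(strings.ToLower(s.Name), q) {
			candidates = append(candidates, s)
		}
	}
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].Name < candidates[j].Name
	})
	return "", candidates
}
```

Returning candidates rather than paging the first match keeps the confirmation step with the user, which matters when several services have similar names.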
We also allow paging Slack users directly for when additional responders are required outside of those on-call.
Once the page command is sent, the bot creates a new incident in PagerDuty with the Jira ticket, summary, and Slack channel to provide context to the on-call person. After the on-call person is paged, the bot announces who was paged and invites them to the channel.
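To make the “context” part concrete, here is a hedged Go sketch that assembles a request body carrying the ticket, summary, and channel. The field layout is modeled on our understanding of PagerDuty’s REST API for creating incidents and should be checked against the current API reference; `PageContext`, `BuildPagePayload`, and the wording of the details string are hypothetical. The sketch only builds JSON and does not send anything.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// The types below approximate a create-incident request body.
type reference struct {
	ID   string `json:"id"`
	Type string `json:"type"`
}

type incidentBody struct {
	Type    string `json:"type"`
	Details string `json:"details"`
}

type incident struct {
	Type    string       `json:"type"`
	Title   string       `json:"title"`
	Service reference    `json:"service"`
	Body    incidentBody `json:"body"`
}

type createIncidentRequest struct {
	Incident incident `json:"incident"`
}

// PageContext is a hypothetical bundle of what the bot knows when paging.
type PageContext struct {
	ServiceID    string
	TicketURL    string
	Summary      string
	SlackChannel string
}

// BuildPagePayload puts the summary in the title and the Jira and Slack
// links in the details, so the on-call person sees them with the page.
func BuildPagePayload(ctx PageContext) ([]byte, error) {
	if ctx.ServiceID == "" {
		return nil, fmt.Errorf("service ID is required")
	}
	req := createIncidentRequest{Incident: incident{
		Type:    "incident",
		Title:   ctx.Summary,
		Service: reference{ID: ctx.ServiceID, Type: "service_reference"},
		Body: incidentBody{
			Type:    "incident_body",
			Details: fmt.Sprintf("Jira: %s | Slack: %s", ctx.TicketURL, ctx.SlackChannel),
		},
	}}
	return json.Marshal(req)
}
```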
Once responders confirm there is no further user impact and a root cause is known, the incident is considered resolved and the team transitions to the post-incident phase. A robust timeline is required to have an effective post-incident review and an effective post mortem report.
The “get timeline” command searches the incident channel for all chat messages marked with a specific emoji, which designates the message as a timeline event, and sends the user a direct message with the compiled timeline.
For example, we use the 📝 emoji to designate important events in the chat. While the incident is ongoing, anyone can add the emoji as a reaction to important chat events. Post-incident, the “get timeline” command compiles these chat events into a timeline that is easy to copy and paste into the post-incident report.
At Airbnb, we hold weekly after-action review (AAR) meetings where we review recent high-severity incidents and post-incident reports, and ensure that any corrective actions are called out and assigned. As soon as the Jira ticket tracking the incident is updated with the AAR meeting date, the bot notifies the owner of the Jira ticket of when the meeting will be and what is expected of them.
Oftentimes, during our blameless postmortem process, tickets for corrective actions are created and assigned to teams to avoid similar incidents in the future. To encourage quick resolutions, we set a strict deadline for these tickets. Our bot sends the user assigned to the ticket a warning message over Slack a couple of days before the deadline, and another message if the deadline has lapsed.
Archiving Incident Channels
To keep our Slack workspace tidy, the bot automatically archives incident channels ten days after the incident’s Jira ticket has been closed.
Since launch, our bot has saved our Incident Managers and responders many hours by automating and centralizing incident management within Slack. By comparing the average time each task takes to complete manually with the time it takes using the bot, we estimate that the bot has saved 44 hours so far in 2022.
To further streamline our incident response from Slack, we plan to enhance our integration with PagerDuty.
Currently, every time the page command is used, a new PagerDuty incident is created. Instead, we plan to unify all pages under a single PagerDuty incident to take advantage of PagerDuty’s incident metrics and to provide more context to responders.
Lastly, after a PagerDuty service is paged using the bot, we don’t have visibility into the status of the PagerDuty incident in Slack. Was the page acknowledged? Did the on-call not respond? Was it escalated, and to whom? We plan to build automation to follow the PagerDuty incident and report the current status to the incident’s channel. This will also allow us to record the timeline of actions taken in the PagerDuty incident after paging the service.
Attribution and Thanks
- Stephen: for being a great partner on the Airbnb Incident Management team and helping to define the incident management bot’s feature roadmap
All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.