Automate incident response with an in-house Serverless-based Slack Incident Bot

Authors:
Shawn Lim, Principal Software Engineer, Funding Societies
Suparshva Mehta, Software Engineer, Funding Societies

credits: created with https://www.canva.com/

A production incident is an unexpected event that negatively affects customers and disrupts the normal operation of a company’s services. It can impact the company’s reputation and even have financial implications on revenue.

Production incidents can happen at any time, so it is important to be prepared and have a proper incident response plan in place to handle them effectively.

In this blog, we’ll explore why we decided to build Funding Societies’ in-house incident bot, the technical aspects of building it with serverless technologies, and the results and next steps.

Problem Statement

At Funding Societies, we have a dedicated incident response process with clearly defined roles and responsibilities. Let’s look at what those roles entail.

Roles and responsibilities:

  • On-call engineers → These engineers are the first responders who raise issues/alerts. They determine whether an issue is severe enough to escalate to the respective technical leads.
  • Technical Lead → These are usually engineers who are subject matter experts; with their guidance, it is confirmed whether an issue is an incident or not. They are responsible for driving recovery strategies and investigations during an incident.
  • Incident Commander → These are usually engineering leads or senior engineers who drive incident management and delegate tasks to efficiently resolve the incident, conduct postmortems, and update necessary stakeholders with accurate information.
  • Scribe → The person assigned with this role is usually an engineer who takes notes continuously throughout the incident timeline.
  • Communication Liaison → These are engineering leads or engineering/product managers who communicate information with the appropriate stakeholders after discussing with the incident commander. They drive communication with various stakeholders, such as the compliance and risk team, customer experience team, security team, and business team, ensuring that all necessary channels and stakeholders are updated.

As we can see, each of these roles has many responsibilities on its plate. Our aim was to alleviate the stress an incident places on employees by:

  • Standardising our incident response
  • Keeping each role focused on its own responsibilities
  • Making the process faster and more streamlined by automating wherever possible

Exploring existing tools

We noticed that we interacted a lot with Slack, PagerDuty, Jira, and Confluence during a production incident. Organising information and notifying the right stakeholders in a timely manner is crucial, but things can be missed or forgotten in the heat of the moment. We were already using some Slack apps and bots, so we started exploring options along the same lines.

We explored the incident management options available in the market and came across:

  • Rootly (https://rootly.com/) — The tool seems pretty neat and does exactly what we initially planned to automate, but it didn’t fit our budget requirements.
  • Datadog — We were already using Datadog for observability, but we disregarded this approach, as not every stakeholder has access to Datadog. Datadog integrations do exist and we use some of them, but we wanted an interface like Slack that is accessible to every stakeholder.
  • We also explored open-source implementations and came across https://github.com/echoboomer/incident-bot. This checked all our boxes: the code seemed easy to understand and nicely written, so we decided to extend it for our needs. We thank the developers https://github.com/echoboomer and https://github.com/aflansburg for building such a great tool.

Code changes

With some changes, the Slack incident bot can be built on a serverless architecture that leverages AWS services. We wanted a serverless architecture because the incident bot is used only rarely, so paying for all that idle time would be a waste.

Here are the changes we performed over the open source code:

  • Move the database dependency from PostgreSQL to DynamoDB
  • Switch from server mode to serverless using EventBridge, SQS, AWS Fargate, and DynamoDB
  • Instead of a web app, use the Slack app interface as our UI, leveraging the slack-bolt Python library; essentially, the Slack bot calls our APIs
  • Use LocalStack and Docker for testing (a minimal test sketch follows this list)
  • Introduce Pulumi for deploying infrastructure instead of the Kustomize and Helm setup that the library used
  • Add new commands as per our use cases
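
To illustrate the LocalStack-based testing mentioned above, here is a minimal sketch of how a test might point boto3 at a local DynamoDB endpoint. The table name, key schema, and port are assumptions for illustration, not our exact setup.

```python
# Minimal sketch: exercising the DynamoDB layer against LocalStack in a test.
# Table name, key schema, and endpoint are illustrative assumptions.
import boto3

def make_local_dynamodb():
    # LocalStack exposes AWS-compatible APIs on port 4566 by default.
    return boto3.resource(
        "dynamodb",
        endpoint_url="http://localhost:4566",
        region_name="ap-southeast-1",
        aws_access_key_id="test",
        aws_secret_access_key="test",
    )

def test_incident_round_trip():
    dynamodb = make_local_dynamodb()
    table = dynamodb.create_table(
        TableName="incidents-test",
        KeySchema=[{"AttributeName": "incident_id", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "incident_id", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
    table.wait_until_exists()

    table.put_item(Item={"incident_id": "inc-123", "status": "investigating"})
    item = table.get_item(Key={"incident_id": "inc-123"})["Item"]
    assert item["status"] == "investigating"
```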

Architecture

The Incident bot architecture consists of the following components:

Slack app

The Slack app acts as the interface for the Incident bot. Users can interact with the bot using Slack commands, and the bot can respond with messages or notifications.
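
As a rough illustration of how this interface works with slack-bolt, the sketch below registers a hypothetical /incident slash command; the command name, acknowledgement text, and payload handling are assumptions for illustration, not the bot’s actual command set.

```python
# Minimal slack-bolt sketch: a hypothetical /incident slash command handler.
# The command name and message text are illustrative assumptions.
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

@app.command("/incident")
def handle_incident_command(ack, command, client):
    # Acknowledge within 3 seconds so Slack does not show a timeout error.
    ack("Starting a new incident...")
    # Hand the heavy lifting off to the backend, then confirm to the channel.
    client.chat_postMessage(
        channel=command["channel_id"],
        text=f"Incident declared by <@{command['user_id']}>: {command.get('text', '')}",
    )

if __name__ == "__main__":
    app.start(port=3000)
```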

AWS EventBridge

AWS EventBridge is a serverless event bus that can be used to route events between different AWS services. We use EventBridge to receive and process events generated by the Incident bot.
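
As a sketch of how an event might land on the bus, the snippet below publishes an incident-declared event with boto3; the bus name, source, and detail shape are illustrative assumptions rather than our production schema.

```python
# Minimal sketch: publishing an incident event to an EventBridge bus with boto3.
# Bus name, source, and detail fields are illustrative assumptions.
import json
import boto3

events = boto3.client("events")

def publish_incident_declared(incident_id: str, severity: str) -> None:
    events.put_events(
        Entries=[
            {
                "EventBusName": "incident-bot-bus",  # assumed bus name
                "Source": "incident-bot.slack",      # assumed source
                "DetailType": "IncidentDeclared",
                "Detail": json.dumps({"incident_id": incident_id, "severity": severity}),
            }
        ]
    )
```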

AWS SQS

Amazon Simple Queue Service (SQS) is a fully managed message queuing service that enables decoupling and scalability of micro-services, serverless applications, and distributed systems. We use SQS to decouple the processing of incoming messages from the Incident bot.
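
On the consuming side, a worker roughly like the sketch below long-polls the queue and deletes messages once they are handled; in our setup this kind of loop runs inside the Fargate container described next. The queue URL and process_event handler are placeholders for illustration.

```python
# Minimal sketch: long-polling an SQS queue and deleting messages after handling.
# The queue URL and process_event handler are illustrative placeholders.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.ap-southeast-1.amazonaws.com/123456789012/incident-bot-queue"  # placeholder

def process_event(detail: dict) -> None:
    # e.g. update DynamoDB, post a Slack message, update the Confluence page...
    print("handling", detail)

def poll_forever() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling keeps idle API calls to a minimum
        )
        for message in resp.get("Messages", []):
            process_event(json.loads(message["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])

if __name__ == "__main__":
    poll_forever()
```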

AWS Fargate

AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). We use Fargate to run the Incident bot’s containerised code.

DynamoDB

Amazon DynamoDB is a fully managed NoSQL database that provides fast and predictable performance with seamless scalability. We use DynamoDB to store incident data, such as incident status and relevant timestamps.
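
To give a feel for the data layer after the move away from PostgreSQL, here is a minimal sketch of storing and updating an incident record in DynamoDB; the table name and attribute names are assumptions, not our actual schema.

```python
# Minimal sketch: storing and updating an incident record in DynamoDB.
# Table name and attribute names are illustrative assumptions.
import time
import boto3

table = boto3.resource("dynamodb").Table("incidents")  # assumed table name

def create_incident(incident_id: str, severity: str) -> None:
    table.put_item(
        Item={
            "incident_id": incident_id,
            "severity": severity,
            "status": "investigating",
            "created_at": int(time.time()),
        }
    )

def update_status(incident_id: str, status: str) -> None:
    table.update_item(
        Key={"incident_id": incident_id},
        # "status" is a DynamoDB reserved word, so it needs an alias.
        UpdateExpression="SET #s = :s, updated_at = :t",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":s": status, ":t": int(time.time())},
    )
```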

Deployment

We use Pulumi for deploying the Incident bot’s infrastructure as code. Pulumi provides a simple yet powerful way to define infrastructure as code using familiar programming languages like Python, TypeScript, and Go. With Pulumi, we can easily deploy and manage the Incident bot’s infrastructure in a repeatable, scalable, and auditable way. We chose Pulumi over Terraform for the reasons described here. With Pulumi, our infrastructure code is unit-testable, which significantly boosts productivity and confidence, unlike Terraform, where we need to rely on integration tests.
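
As a rough illustration of what this looks like in Pulumi’s Python SDK, the sketch below declares the queue, the incidents table, and an EventBridge rule targeting the queue. Resource names and the event pattern are assumptions, and the real stack also wires up the Fargate service, IAM roles, and queue policies, which are omitted here.

```python
# Minimal Pulumi (Python) sketch: SQS queue, DynamoDB table, and an EventBridge rule -> SQS target.
# Names and the event pattern are illustrative; IAM and the Fargate service are omitted.
import json
import pulumi_aws as aws

queue = aws.sqs.Queue("incident-bot-queue")

incidents_table = aws.dynamodb.Table(
    "incidents",
    hash_key="incident_id",
    attributes=[aws.dynamodb.TableAttributeArgs(name="incident_id", type="S")],
    billing_mode="PAY_PER_REQUEST",  # no capacity to manage, in line with the serverless goal
)

rule = aws.cloudwatch.EventRule(
    "incident-declared",
    event_pattern=json.dumps({"source": ["incident-bot.slack"]}),  # assumed event source
)

aws.cloudwatch.EventTarget(
    "incident-declared-to-sqs",
    rule=rule.name,
    arn=queue.arn,
)
```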

Results

At the time of writing, the bot has been live for some time, and we have seen a significant increase in our productivity during incidents.

On discovery of an incident, the on-call engineer interacts with the incident bot, and the bot creates an incident channel, adds a meeting link, and creates a Confluence page for starters. We can update incident roles, status, and severity, and they are synced automatically with the Confluence page. When the incident is marked as resolved, we have the option to attach the discussions to the Confluence page.
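
As a simplified sketch of the first step of that flow, the snippet below creates a dedicated incident channel and posts a kickoff message via the Slack Web API; the channel naming convention and message text are illustrative assumptions.

```python
# Minimal sketch: creating an incident channel and posting a kickoff message.
# Channel naming convention and message text are illustrative assumptions.
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident_channel(incident_id: str, summary: str) -> str:
    channel = client.conversations_create(name=f"inc-{incident_id}")
    channel_id = channel["channel"]["id"]
    client.chat_postMessage(
        channel=channel_id,
        text=(
            f":rotating_light: *Incident {incident_id}*: {summary}\n"
            "Roles, severity, and status can be updated with the bot's commands."
        ),
    )
    return channel_id
```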

Screenshots

  • Bot help command
  • Incident creation and management
  • Incident export of chats and discussions, and the list command

Further extensions

There is plenty of room for further improvement. Here are some extensions to the Incident bot that we have planned:

  • Move the architecture to AWS Lambda instead of AWS Fargate, because Fargate still bills us for each provisioned container. AWS Lambda is truly serverless, where we would pay only while an incident is actually being handled (a rough handler sketch follows this list).
  • We currently use many third-party vendors, and if any of them face an outage, our customers can be impacted. The current process is to subscribe to their status events, with SRE and DevOps engineers taking the necessary action on notification. We plan to create a watcher that periodically crawls or subscribes to the status/incident pages of our third-party vendors, collates updates in a dedicated channel, and sends a PagerDuty notification to the respective on-call engineers, so we can act as soon as possible if needed.
  • We currently have a feature to export Slack discussions from the incident channel and attach them to the Confluence incident page. We want to improve it further by summarising regularly, further reducing the effort needed from the scribe.
  • Automate L1 support with a simple rule-engine-based scheduler job that assigns and resolves production tech-support issues to the relevant team without human intervention, suggests similar past issues from Jira for reference, and attaches the relevant playbook if a similar issue is found.
  • Leverage the Incident bot to update our websites and app with a banner notifying end-users of an ongoing incident, by toggling feature flags with an appropriate message whenever end-users are impacted.
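
For the first item above, an SQS-triggered Lambda handler might look roughly like the sketch below. This is a direction we are exploring rather than something we have shipped, and the process_event helper and payload shape are assumptions.

```python
# Rough sketch of an SQS-triggered Lambda handler for the planned Fargate-to-Lambda move.
# The process_event helper and payload shape are illustrative assumptions.
import json

def process_event(detail: dict) -> None:
    # The same business logic the Fargate worker runs today:
    # update DynamoDB, post to Slack, sync the Confluence page, and so on.
    print("handling", detail)

def handler(event, context):
    # An SQS event source delivers a batch of records per invocation.
    for record in event["Records"]:
        process_event(json.loads(record["body"]))
```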
