Create an Incident Response pipeline using epiphani

Pawan Uberoy
Published in epiphani
Sep 2, 2020

Hello Reader,

We now have pipelines for everything: deployments, CI/CD, even oil. It would be helpful to have automated pipelines for managing incident response too.

In the heat of the moment, people miss steps while debugging. Sometimes the subject matter expert (SME) is not available; at other times, newer engineers feel under a lot of pressure.

With Epiphani Playbooks, you can create a pipeline that handles the stages of incident response deterministically. It also builds knowledge for future incidents by preserving everything important from the current investigation.

You can easily integrate any existing scripts or automation into the pipeline using the e3 connector from this blog.

Incident Response Pipeline

Many organizations have an informal process they follow for incident response. An incident response pipeline not only creates structure around the process but automates it too.

The pipeline can have any number of stages. The one defined above is an example. I will walk through the stages below.
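To make the idea of ordered stages concrete, here is a minimal Python sketch of a pipeline that runs stages in a fixed order and passes a shared context from one to the next. It is not how epiphani implements playbooks; the `Pipeline` class, the stage functions, and the `INC-1` ticket id are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; these are not epiphani APIs.
Context = Dict[str, object]
Stage = Callable[[Context], Context]


@dataclass
class Pipeline:
    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, stage: Stage) -> "Pipeline":
        self.stages.append((name, stage))
        return self

    def run(self, context: Context) -> Context:
        # Stages run in the order they were added, so every incident
        # goes through the same steps.
        history = []
        for name, stage in self.stages:
            context = stage(context)
            history.append(name)
        context["history"] = history
        return context


# Placeholder stages: each one just records a made-up result.
def trigger(ctx: Context) -> Context:
    return {**ctx, "ticket": "INC-1"}


def health_check(ctx: Context) -> Context:
    return {**ctx, "healthy": False}


def resolve(ctx: Context) -> Context:
    return {**ctx, "resolved": True}


pipeline = (
    Pipeline()
    .add("trigger", trigger)
    .add("health_check", health_check)
    .add("resolve", resolve)
)
result = pipeline.run({"source": "alert"})
```

Because each stage only reads and returns the context, stages can be added, removed, or reordered without touching the others.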

Incident Trigger

The incident can be triggered in several different ways. One is through PagerDuty; another, described in this blog, is through Kubernetes alerts.

Once the trigger is received by the epiphani engine, you can associate a playbook with it to execute automatically.

In this example, an event comes in from PagerDuty and the playbook creates:

  • A GitHub ticket to track this incident
  • A Zoom meeting for people to collaborate
  • Another PagerDuty event to get the right people on call
  • A Slack channel with all the right people in it to talk

It then sends all of this information to the team room so everyone is aware of the incident and can join if needed.
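The fan-out described above can be sketched in a few lines of Python. This is only an illustration of the pattern (one incoming event, several independent actions, one summary message); the action functions below are hypothetical stand-ins that return strings and do not call the real GitHub, Zoom, PagerDuty, or Slack APIs.

```python
from typing import Callable, Dict

Event = Dict[str, str]
Action = Callable[[Event], str]


# Hypothetical stand-ins; real connectors would call external services.
def open_ticket(event: Event) -> str:
    return f"ticket opened for '{event['title']}'"


def start_meeting(event: Event) -> str:
    return f"meeting started for '{event['title']}'"


def page_on_call(event: Event) -> str:
    return f"on-call paged for service {event['service']}"


def create_channel(event: Event) -> str:
    return f"channel created for service {event['service']}"


def handle_trigger(
    event: Event,
    actions: Dict[str, Action],
    notify: Callable[[str], None],
) -> Dict[str, str]:
    results: Dict[str, str] = {}
    for name, action in actions.items():
        try:
            results[name] = action(event)
        except Exception as exc:
            # One failing action should not block the others.
            results[name] = f"failed: {exc}"
    # Send a single summary so the team room sees everything at once.
    notify("\n".join(f"{name}: {text}" for name, text in results.items()))
    return results


team_room = []  # collects messages in place of a real chat room
outcome = handle_trigger(
    {"title": "API latency high", "service": "checkout"},
    {
        "ticket": open_ticket,
        "meeting": start_meeting,
        "page": page_on_call,
        "channel": create_channel,
    },
    team_room.append,
)
```

Catching failures per action is a design choice for this sketch: if the meeting cannot be created, the ticket and the page still go out, and the summary shows which step failed.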

Running a Health Check

Typically, when the SMEs log in, they have to collect a lot of information from terminals and dashboards. All of that can be captured in a playbook that runs automatically when an incident matching certain rules is found.

This is a simple health check playbook that collects metrics from Splunk and AWS and sends a notification saying whether it found something wrong. It posts all of the output too. You can also use the e3 connector if you want to integrate any existing scripts.
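As a rough illustration of what such a health check does, the Python sketch below runs a list of checks, decides whether anything is wrong, and builds a report that includes every check's output. The metric names, values, and limits are made up, and the lambdas stand in for real queries to Splunk or AWS, which are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    name: str
    healthy: bool
    detail: str


Check = Callable[[], CheckResult]


def threshold_check(name: str, read_value: Callable[[], float], limit: float) -> Check:
    """Build a check that is healthy while the value stays at or below the limit."""
    def check() -> CheckResult:
        value = read_value()
        return CheckResult(name, value <= limit, f"value={value}, limit={limit}")
    return check


def run_health_checks(checks: List[Check]) -> Tuple[bool, str]:
    results = [check() for check in checks]
    all_healthy = all(r.healthy for r in results)
    headline = "Health check passed" if all_healthy else "Health check found problems"
    # Include every result, not only failures, so responders see the full output.
    lines = [headline] + [
        f"[{'OK' if r.healthy else 'FAIL'}] {r.name}: {r.detail}" for r in results
    ]
    return all_healthy, "\n".join(lines)


# Hypothetical readings; real ones would come from log and metric queries.
ok, report = run_health_checks([
    threshold_check("error_count_last_5m", lambda: 3, limit=10),
    threshold_check("cpu_percent", lambda: 91, limit=80),
])
```

The `report` string is what would be posted as the notification, with the headline stating whether something was found.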

Get the team together

One of the most important parts of resolving an outage is getting the right people together quickly and giving them the information they need to solve the problem, or at least pointing them in the right direction.

People can create playbooks that collect information using connectors and push that information into the page. That way, responders get all the information they need even before they log in.

In this example, the playbook gets information about AWS EC2 instances and some data from MySQL, and publishes both to the call for that team.

Resolve

Here you can have a set of playbooks that check your microservices and fix them. You can use the epiphani connectors, your own scripts, and playbooks to detect or remediate problems.

In this example, the playbook checks the OVS flows on a VM and fixes them if a subnet is missing.
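The check-then-fix pattern can be sketched in Python as follows. This is not the playbook from the example: it assumes the flow table has already been dumped as text lines (for instance with `ovs-ofctl dump-flows`), that a subnet counts as present when some flow matches it with `nw_dst=`, and that the bridge name `br0`, the sample subnets, and the `actions=normal` remediation are placeholders. The sketch only builds the `ovs-ofctl add-flow` commands and does not execute them.

```python
import ipaddress
from typing import Iterable, List


def missing_subnets(flow_lines: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required subnets that no flow matches on nw_dst."""
    present = set()
    for line in flow_lines:
        # Match fields are comma-separated; actions follow a space.
        for token in line.replace(" ", ",").split(","):
            if token.startswith("nw_dst="):
                value = token.split("=", 1)[1]
                present.add(ipaddress.ip_network(value, strict=False))
    return [
        subnet for subnet in required
        if ipaddress.ip_network(subnet, strict=False) not in present
    ]


def remediation_commands(bridge: str, subnets: Iterable[str]) -> List[List[str]]:
    """Build (but do not run) one add-flow command per missing subnet."""
    return [
        # The flow spec here is an assumption; a real fix would use the
        # priority, table, and actions your network design requires.
        ["ovs-ofctl", "add-flow", bridge, f"ip,nw_dst={subnet},actions=normal"]
        for subnet in subnets
    ]


# Sample flow dump lines, shortened for illustration.
flows = [
    " cookie=0x0, table=0, priority=100,ip,nw_dst=10.0.1.0/24 actions=NORMAL",
    " cookie=0x0, table=0, priority=0 actions=drop",
]
missing = missing_subnets(flows, ["10.0.1.0/24", "10.0.2.0/24"])
commands = remediation_commands("br0", missing)
```

Keeping detection and remediation as separate functions lets a playbook report what is missing first and apply the fix only when that is the desired behavior.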

You can also create a playbook to periodically update stakeholders on the state of the incident. That way they are kept up to date, and valuable developer time is not spent on mundane tasks.

Learn

One of the most important things you can do before an incident is closed is to preserve all of the information that was used to find and resolve the problem.

Epiphani has a scroll that automatically captures everything that happens during the incident. It also has a bot that can provide hints from previous similar incidents to assist.

You can also have playbooks that take the captured information and upload it to a ticket or a service so that it can be easily accessed outside the epiphani subsystem.

In this example, the playbook gets logs from AWS, from Splunk, and from a Juniper Virtual Router, and pushes them back to the incident ticket that was created.

Thanks for reading so far. I hope you see how creating a pipeline with automation helps resolve incidents faster. It also helps capture tribal knowledge in automation so it can be used when the experts are not available.

Please visit https://epiphani.ai to try this out or email us at feedback@epiphani.ai if you have questions or comments.

Thanks!
