A DevOps tutorial to set up your machine learning alerts with PagerDuty

In our last tutorial, we showed you how to detect anomalies using a combo of TICKscript in Kapacitor and Loud ML, and then how to trigger alert notifications via Slack. Not everyone is using Slack, so in this tutorial, we’re looking at PagerDuty instead.

In this tutorial, we’ll show you how to:

  • define a TICKscript in Kapacitor that sends your alerts to PagerDuty
  • trigger alert notifications via the PagerDuty Events API v2
  • let machine learning jobs automatically create and resolve the alerts

The pipeline is unchanged. We will use the same machine learning job defined in our previous tutorial (which we also demonstrated live during our DevOps.com webinar in early 2019, if you prefer the video). This basic ML job continuously monitors a single metric for normal and abnormal patterns, and the state of this metric is passed to a TICKscript defined in Kapacitor.

Part one: One-time setup, linking a PagerDuty account to Kapacitor

Only one change is needed to link your PagerDuty account to Kapacitor, and that’s in the kapacitor.conf file. Replace the routing-key value in our kapacitor.conf template below with your own value, as described in the PagerDuty integration guide.

Here’s our very own sample which you’re free to use as a template:
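For orientation, the part of kapacitor.conf that matters is the [pagerduty2] section. The fragment below is a minimal sketch of that section, assuming a stock Kapacitor install. It is not necessarily identical to the sample file used in this tutorial, and the routing key shown is a placeholder that you must replace with your own.

```toml
# Sketch of the PagerDuty v2 section of kapacitor.conf.
[pagerduty2]
  # Turn the PagerDuty v2 handler on.
  enabled = true
  # Endpoint of the PagerDuty Events API v2.
  url = "https://events.pagerduty.com/v2/enqueue"
  # Placeholder (assumption): use the integration key of your own PagerDuty service.
  routing-key = "YOUR_ROUTING_KEY"
  # When true, every alert goes to PagerDuty even without .pagerDuty2() in the TICKscript.
  global = false
```

Leaving global set to false means only the alerts that call pagerDuty2() explicitly reach PagerDuty, which suits a setup where other handlers such as Slack are still in use.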

All set, so let’s start the containers.

docker-compose up

Then open your browser at http://localhost:8888

Part two: Update the TICKscript to trigger intelligent notifications to PagerDuty

TICKscript is used to define pipelines for processing data. Our pipeline will receive input data from Loud ML predictions, and then output notifications if alert conditions are met.

Click on “Manage Tasks” on the sidebar to edit the TICKscript.

Our previous blog post explains every section of the script in detail. There, we used stateCount, which can raise the severity of an alert the longer it stays open. However, stateCount requires more code in the TICKscript, so in this tutorial we’ve opted for a much shorter alternative: changeDetect.

Let’s break it down into more detail:

  • We’re telling the script to read the is_anomaly boolean value available from the ML job. The changeDetect function passes a point to the next node only when the state changes, either from true to false (resolution) or from false to true (triggering).
  • The alert is assigned a warning severity level if the state is abnormal.
  • The alert is assigned a critical severity level if the score is higher than 90.
  • We format a message, and call the pagerDuty2 handler to send it to the PagerDuty service.
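Put together, the four steps above might look like the following sketch. It is not the exact script from this tutorial: the database and measurement names are hypothetical placeholders, and only the is_anomaly and score fields and the threshold of 90 come from the description above.

```tickscript
// Assumption: Loud ML writes its predictions to this database and measurement.
var data = stream
    |from()
        .database('loudml')
        .measurement('predictions')
    // Forward a point only when is_anomaly differs from its previous value.
    |changeDetect('is_anomaly')

data
    |alert()
        // Abnormal state: raise the alert as a warning.
        .warn(lambda: "is_anomaly" == TRUE)
        // High anomaly score: escalate to critical.
        .crit(lambda: "score" > 90.0)
        // Short message built from the alert ID and level.
        .message('{{ .ID }} is {{ .Level }}')
        // Send the event to PagerDuty through the Events API v2 handler.
        .pagerDuty2()
```

Because changeDetect only lets state changes through, the alert node is evaluated once when the anomaly starts and once when it ends, rather than on every prediction.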

Cool! Our final TICKscript now contains everything we need. (See the full script below if you’d like to copy and paste it.) Let’s see it all in action!

Part three: Watch it all in action when the ML task detects abnormal data

Here’s how an alert looks in the PagerDuty application when the state is “Triggered” and the severity is set to “Warning”.

Alert status set to “Triggered” automatically in PagerDuty, thanks to the Loud ML and InfluxData integration

Not all alerts are critical, and non-critical events, such as initial warnings, might not need any manual intervention at all. Rather than bog down your team with every alert, we can close these alerts automatically so team members can focus on the important alerts that need their attention.

This is what the same screen looks like a few minutes later. The alert is displayed as “Resolved”. So what happened?

Alert status changed to “Resolved” automatically in PagerDuty, thanks to the Loud ML and InfluxData integration

Loud ML spots both abnormal and normal data. When the data returns to normal, three things happen:

  • If neither the critical nor the warning condition is met, the alert level defaults to “OK”.
  • We format the message to indicate that the situation is back to normal, using a Go template. For more information, read about alerts in the Kapacitor documentation.
  • The pagerDuty2() handler then sends a “resolve” event to PagerDuty, which changes the state of the alert shown in the application.
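As an illustration of the message formatting step, here is a minimal sketch of a Go template that switches its wording on the alert level. The wording of the messages and the database and measurement names are assumptions made for this example, not the text used in the original script.

```tickscript
// Assumption: hypothetical database and measurement names.
stream
    |from()
        .database('loudml')
        .measurement('predictions')
    |changeDetect('is_anomaly')
    |alert()
        .warn(lambda: "is_anomaly" == TRUE)
        .crit(lambda: "score" > 90.0)
        // .Level is "OK" when neither condition above holds, so the first branch is the recovery text.
        .message('{{ if eq .Level "OK" }}{{ .ID }} is back to normal{{ else }}{{ .ID }} is {{ .Level }}, score {{ index .Fields "score" }}{{ end }}')
        .pagerDuty2()
```

Keeping both wordings in a single template means the same alert node describes the incident when it opens and when it closes.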

What we’ve learned

We’ve learned how to use TICKscript and how to configure Kapacitor to trigger notifications to a given PagerDuty account.

We’ve seen how to use Loud ML and Kapacitor to automatically resolve alerts if there is no additional action needed.

It’s now possible to generalize what we’ve seen in this tutorial and create a design pattern. You will find a generic TICKscript template on GitHub using the link below. In the template, variables are declared but their values are left to the user. TICKscript templates give you the freedom to assign values (model name, database name, measurement name, etc.) without changing the core script design. More information about Kapacitor template tasks can be found in InfluxData’s documentation.
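To give an idea of the pattern, a template version of the alert might look like the sketch below. It is an assumption-laden illustration, not the template published on GitHub: the variable names db, measurement and crit_score are hypothetical.

```tickscript
// Variables declared with a type only must be supplied when a task is created from the template.
var db string
var measurement string
// Variables declared with a value act as defaults that a task may override.
var crit_score = 90.0

stream
    |from()
        .database(db)
        .measurement(measurement)
    |changeDetect('is_anomaly')
    |alert()
        .warn(lambda: "is_anomaly" == TRUE)
        .crit(lambda: "score" > crit_score)
        .message('{{ .ID }} is {{ .Level }}')
        .pagerDuty2()
```

A template like this is registered with kapacitor define-template, and each concrete task is then created with kapacitor define, passing the template name and a JSON file of variable values through the -template and -vars options.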

Link to template on GitHub.

And don’t forget, if you need more guidance, you can also watch how we did it during the DevOps.com webinar in early 2019.

BONUS: Use TICKscripts to trigger self-healing actions

But wait, there’s more! We can also automate the healing process if the alert level is critical. TICKscripts can execute arbitrary commands, which is a convenient way to recover from a critical state.

Let’s use the exec handler of the alert node for this purpose:

// Alert that only defines a critical level
var healing = data
    |alert()
        .crit(lambda: "score" > 90.0)
        // Command to run, followed by its arguments
        .exec('/bin/echo', 'Hello world')

Pass the command name and its arguments to exec, as your self-healing action requires. Kapacitor hands the alert data to the command over STDIN in JSON format.

In our next blog post, we’ll show you how self-healing can be used (for example, to suspend account usage in the event of suspicious behavior). Follow us on Medium, or subscribe to our newsletter for more machine learning and automation stories.

Photo by Alex Simon on Unsplash