We built a bot for anomalies

Here’s how Zapier alerts its teams

As an entirely remote team, Zapier relies on Slack to stay updated on projects, collaborate on new initiatives, and goof off like a virtual water cooler. Slack is where the bulk of our work happens. We work across a dozen time zones, so there is always someone on the clock. For that reason — and because we’re an automation company — bots play a huge role in sharing news internally and alerting the team to key metrics.

Supporting nearly 1,000 apps on Zapier’s platform means we have a lot of data. You probably do, too. You want your team to take action on the important data, while letting the rest settle into the background. In other words, you want to find anomalies.

AnomalyBot calls attention to out-of-ordinary metrics

AnomalyBot is one of our most important bots. Built by the data science organization, it alerts the right internal teams when data is outside of our expected trend. If signups or errors spike, someone needs to know. In this post, we’ll share how AnomalyBot works, so you can think about how you might build something similar for your company.

Make messages relevant to the current channel

A lot of messages flow through the Zapier Slack workspace every day. We value that everyone in the company has access to these public messages. Being transparent with the information we use to make decisions helps empower everyone to default to action.

While it’s important for everyone in our workspace to have access to most messages, they shouldn’t have to see every message. Our chattiest bots live in a series of Slack channels we prefix with “feed”:

Some of the #feed-channels in Zapier Slack

If anyone wants to know the latest deploys, error alerts, or signup counts, there’s a channel to check. That keeps each individual bot message out of sight, while giving team members immediate access when they want it.

Yet sometimes important messages might go unnoticed. Some messages are critical and must be seen. In those cases, you’ll want to post in a wider team channel and alert everybody. That’s where AnomalyBot comes in.

Find your anomalies with an AnomalyBot

Zapier’s data science team has a pretty simple definition of the anomalies it wants to spot programmatically: something a human sees as out of the ordinary.

That is easy to define, but it’s harder to make a program react to data like a human would. To reach “human decision” as a benchmark, you need scale on your side. That means lots of data.

Let’s assume you know the data you’d like to monitor (more on that later). Now you need to figure out how often to monitor it. The timeframe will depend on how often the data changes and how quickly you need to respond to it. Most will fall into hourly, daily, or weekly periods. To start, use something as simple as a cron job running at your required interval.

When your job runs, you’ll want to load enough of your data into memory to make a prediction. R and Python are popular languages for working through large datasets. Each has a lot of useful libraries for data scientists.

Often you’ll want to take the latest result and split it from the rest of the data. That’s the value we want to test for an anomaly. Take the rest of the data and apply one or both of these auto-forecasting algorithms:
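As a minimal sketch in base R, the split might look like the following. The helper name split_latest and the sample numbers are illustrative assumptions, not part of AnomalyBot.

```r
# Separate the newest observation (the value to test) from the history
# used to fit the forecast. `split_latest` is a hypothetical helper name.
split_latest <- function(y) {
  stopifnot(length(y) >= 2)
  list(
    history = head(y, -1),  # everything except the last value
    latest  = tail(y, 1)    # the last value only
  )
}

daily_signups <- c(120, 131, 125, 140, 310)  # made-up daily totals
parts <- split_latest(daily_signups)
```

Here parts$history would feed the forecasting step, and parts$latest is the value checked against the expected range.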

  • The R library forecast, which has a useful function called auto.arima
  • The R and Python library Prophet, which Facebook built specifically for this purpose

Using either of these libraries, create a one-step-ahead prediction interval to get an expected range. If the data is volatile, the range will be wider. If the data is more predictable, the range will be narrower. You can tune this, though we look for prediction intervals of 99.99% or more. Remember, we’re trying to only bubble up the most actionable data.
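To make the idea of a one-step-ahead prediction interval concrete, here is a rough sketch that uses only base R. It is a stand-in, not what AnomalyBot runs: it fits a fixed ARIMA(1, 0, 0) model with stats::arima, whereas auto.arima would pick the model order for you. The function name arima_interval and the simulated series are assumptions made for illustration.

```r
# One-step-ahead prediction interval from a fixed-order ARIMA model.
# Assumption: the order (1, 0, 0) is chosen by hand for illustration;
# forecast::auto.arima would select the order automatically.
arima_interval <- function(y, level = 0.9999) {
  fit <- arima(y, order = c(1, 0, 0))
  step <- predict(fit, n.ahead = 1)   # point forecast and its standard error
  z <- qnorm(1 - (1 - level) / 2)     # two-sided normal quantile
  pred <- as.numeric(step$pred)
  se <- as.numeric(step$se)
  data.frame(lower = pred - z * se, upper = pred + z * se)
}

set.seed(42)
calm     <- 100 + rnorm(200, sd = 1)    # simulated predictable metric
volatile <- 100 + rnorm(200, sd = 10)   # simulated noisy metric

calm_range     <- arima_interval(calm)
volatile_range <- arima_interval(volatile)
```

Comparing the two results shows the behavior described above: the noisier series produces the wider expected range. Lowering level narrows both ranges and makes alerts more frequent.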

Here’s some sample code from Data Scientist Christopher Peters that uses Prophet to predict lower and upper bounds from daily totals:

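library(prophet)
library(dplyr)  # provides %>%

prophet_prediction <- function(y) {
  # Prophet expects a date column (ds), so pair each value with a placeholder date.
  history <- data.frame(ds = seq.Date(
    from = as.Date("1970-01-01"),
    by = "day", length.out = length(y)), y = y)
  m <- prophet(history)  # interval.width defaults to 0.80; raise it for a stricter range
  future <- make_future_dataframe(m, periods = 1)
  # Keep only the final row: the one-step-ahead forecast.
  forecast <- predict(m, future) %>% tail(1)
  data.frame(
    lower = forecast$yhat_lower,
    upper = forecast$yhat_upper
  )
}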

All that is left at this point is to compare that latest data point you split off to the prediction. If the value falls within the expected range, do nothing. If it’s outside the range, you’ve found an anomaly! Time to let the right people know.
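The comparison itself is small. A sketch in base R, assuming the bounds arrive as a one-row data frame with lower and upper columns (the function name is_anomaly and the numbers are made up for illustration):

```r
# TRUE when the observed value falls outside the expected range.
# `bounds` is assumed to be a one-row data frame with `lower` and `upper`
# columns, the shape a prediction function might return.
is_anomaly <- function(latest, bounds) {
  latest < bounds$lower || latest > bounds$upper
}

bounds <- data.frame(lower = 95, upper = 160)  # made-up expected range
inside <- is_anomaly(130, bounds)   # 130 sits within the range
above  <- is_anomaly(310, bounds)   # 310 exceeds the upper bound
```

Only a TRUE result would trigger an alert; everything else stays quiet.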

Share anomalies with the right team

Certain data points are often tied to specific teams. Their performance metrics may depend on the numbers, or the outcome could otherwise impact their work.

For example, Zapier’s engineering teams certainly want to know about error rates on our site. Marketing wants to know about page views and signups. The developer platform team likes to see new app updates. The final step in making AnomalyBot useful is to make sure the right teams are notified.

Each team at Zapier has a primary Slack channel. AnomalyBot maps the category of data to Slack channels, which allows it to alert the right teams. We call a webhook with a chart, message, and the channel. Naturally, we use Zapier to pass this message on to Slack.
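One way to picture that mapping is a lookup from data category to channel, plus a small function that assembles the webhook payload. This is a sketch under assumptions: the category names, channel names, fallback channel, and chart URL below are all hypothetical, and the real AnomalyBot payload may look different.

```r
# Hypothetical mapping from data category to a team's primary channel.
channel_for <- c(
  errors      = "#team-engineering",
  signups     = "#team-marketing",
  app_updates = "#team-platform"
)

# Assemble the webhook payload: a chart, a message, and the target channel.
# Unknown categories fall back to a (hypothetical) feed channel.
build_alert <- function(category, message, chart_url,
                        fallback = "#feed-anomalies") {
  channel <- channel_for[category]    # NA when the category is not mapped
  if (is.na(channel)) channel <- fallback
  list(channel = unname(channel), message = message, chart = chart_url)
}

alert <- build_alert(
  "signups",
  "Signups are above the expected range",
  "https://example.com/chart.png"
)
# Sending is left out of this sketch. An HTTP client such as
# httr::POST(webhook_url, body = alert, encode = "json") could deliver it,
# where webhook_url is whatever endpoint receives the alert.
```

Keeping the mapping in one place means adding a new team only requires a new entry, not a change to the alerting logic.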

Like most teams, we try to minimize the noise of @channel alerts, but when there’s a true anomaly, nobody seems to mind the heads up. We’ve also implemented group handles, so we can use @group-marketing to reach that team without alerting the many others who collaborate with marketing.

We’ve gone a step further by making the automations that send AnomalyBot messages to Slack available in our shared folders within our own Zapier team account. That way, anyone at Zapier can capture the anomalies, store them, or be immediately notified by direct message.

Sometimes AnomalyBot is still too chatty, so we explored some possible solutions:

  • We may need to tune the detection algorithm or its input data to get more accurate results.
  • Sometimes anomalies are not actionable, such as error rates of someone else’s API. In those cases, they’re better off in a feed channel.
  • In some cases we can take action automatically, such as with our API downtime detection.

One of the ways we make sure we’re exposing the data that matters is by asking “know why?” In our rich Slack message, that question links to a Typeform, where team members share what might have caused the anomaly.

A Typeform captures possible explanations for the data anomaly

Along with helping the data science team track its efforts, the anomaly reason helps annotate long-term charts to explain the spikes.

Since AnomalyBot has been running, it has alerted us to numerous bugs and kept us aware of the impact of new feature launches by reflecting graphs of anomalous data back to Slack. Often, mundane changes we make to the product result in anomalies. These alerts keep us aware of what others are shipping and the impact of those projects. With AnomalyBot delivering the most actionable data, we can better collaborate across teams and time zones.

Want more automation inspiration? Check out Zapier shared folders, which help teams automate anything, together.