Building a real-time anomaly detection system for time series at Pinterest

Kevin Chen | Software Engineer Intern, Visibility
Brian Overstreet | Software Engineer, Visibility

In this post, we’ll share the algorithms and infrastructure that we developed to build a real-time, scalable anomaly detection system for Pinterest’s key operational timeseries metrics. Read on to hear about our learnings, lessons, and plans for the future.

Background

Pinterest uses an in-house metrics and dashboard system called Statsboard that allows teams to gain a real-time understanding of the state of their services. Using Statsboard, engineers can create alerts using a language that wraps common time series operations into a simple interface. These alerts include:

However, many of Pinterest’s top line growth and traffic metrics (ex: site requests, user logins) exhibit dynamic patterns that make it difficult to set rule-based alerts. Without an anomaly detection system capable of building a model to handle these dynamic metrics, users are faced with three undesirable alert situations:

Landscape of anomaly detection

There is a great deal of literature on anomaly detection out in the world — from open-source packages like Twitter’s AnomalyDetection or Linkedin’s Luminol, to academic works like Rob Hyndman’s papers on feature-based anomaly detection. We can classify the most popular of these time series anomaly detection techniques into four broad categories (which are neither mutually exclusive nor all encompassing):

In practice, we observed the best performance with decomposition models. Many machine learning models perform poorly for anomaly detection because of their tendency to overfit on the training data. For anomaly detection, we need to be able to distinguish and extract even the anomalies that we haven’t seen before. Decomposition models perform well at this task by explicitly removing correlated components (seasonality, trend) from the timeseries — giving us the opportunity to apply statistical tests with gaussian assumptions on the residuals whilst preserving the anomalies’ original attributes.

Seasonal trend decomposition on the ‘nottem’ dataset

Requirements of anomaly detection for observability

To build an anomaly detection system for observability, we have to keep a number of requirements in mind:

Teletraan, our in-house deploy system

Next, we’ll dive into the challenges we faced deploying our models in real-time.

How do we update our model in real-time?

Much of the literature on anomaly detection deals with finding retrospective anomalies in a static dataset i.e fitting a model with a nonlinear optimization procedure, such as LBFGS or Nelder-Mead, and then identifying anomalies within the training set.

However, if we want to find anomalies in real-time, training once is not enough — we have to continuously keep our model up to date to adapt to the latest behavior of our metric. Thus, we have to take one of four approaches to update our model parameters over time.

In our case, we found that the operational overhead of debugging scheduled or event driven systems outweighed their benefits. Instead, we use online updates whenever available, and brute force updates otherwise. To recast our algorithms into the online setting, we use Online Gradient Descent as an optimizer.

Visualization of gradient descent on a loss surface

Seasonality and trend estimation

The most commonly used form of decomposition today, called STL, makes use of a regression procedure called Loess, which repeatedly fits low-degree polynomials to subsets of the data. Unfortunately, this procedure is not scalable for real-time analytics.

In lieu of estimating seasonality with iterated Loess, we instead use a simple ensemble of efficient seasonality estimation techniques (e.g fourier, historical sampling). We also replace the trend component of STL with robust statistics viz. median, inspired by Twitter’s S-H-ESD anomaly detection algorithm.

Prediction intervals

The 68–95–99.7 rule gives a shorthand for the percentage of values around the mean in a normal distribution

Once we have our predictions, we can then use prediction intervals to determine whether a data point is an anomaly, based on its distance from our forecast. To do this, we can take advantage of the three sigma rule, which states that 99.73% of values in a normal distribution lie within three standard deviations of the mean. If an incoming data point is more than three standard deviations of the residual from our forecast, we can reasonably classify the point as an anomaly.

If the residual data is non-normal (e.g when the time series contains anomalies), we can use power transformations to attempt to recover a normal dataset. But if these transformations do not work, we can still resort to other measures — like sampling from the previous errors — to get an estimate of the uncertainty.

Once we’ve generated these intervals, we can translate them into visual bands on the UI side for ease of interpretation and tweaking by our end users.

Statsboard’s anomaly detection menu UI

Now, let’s take a look at the infrastructure that supports these algorithms.

Anomaly detection architecture

We have a forecasting server that is responsible for constructing one-step-ahead forecasts for Statsboard metrics in real-time and persisting them to our time series database (TSDB).

High level overview of forecasting server design

The forecasting server consists of a set of autoscaling workers and a server with a job queue and a scheduler that submits one-ahead forecast jobs every minute. For batch jobs, we pre-merge overlapping data window requests in order to minimize network and I/O. We integrate with our existing Statsboard UI and alerting infrastructure to provide anomaly detection builders, alerts, and dashboards.

Summary

Anomaly detection plays an important role in obtaining visibility for metrics that exhibit complex patterns that can’t be modeled by traditional alerts. Uncovered anomalies show up in real-time dashboards and alert the relevant users when something goes wrong.

Through the use of robust, scalable, and interpretable algorithms, our anomaly detection system helps engineers recognize and react to incidents as they happen, minimizing impact to the Pinterest business and our Pinners.

Acknowledgments

Dai Nguyen built the UI for this project. Special thanks to Naoman Abbas, Humsheen Geo, Colin Probasco, and Wei Zhu of the Visibility team as well as Yegor Gorshkov of the Traffic team for design recommendations and review.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store