Using Prophet for Anomaly Detection

Michael Duan
Seismic Innovation Labs
5 min read · Sep 13, 2019

Co-Authored by: Michael Duan and Cathy Xu

Before we start, a caveat: we aren’t data science experts. We’re two summer interns who spent part of our time at Seismic tackling anomaly detection. This short article covers our progress on the topic and some areas for further improvement.

Goal

Functional

Functionally, we wanted to use anomaly detection to look for evidence of bugs or system problems. If we predict a volume of activity for today, how far is the actual value from the predicted one, and is that difference alarming? Is a dip on a certain day caused by data loss, a pipeline failure, or simply because it’s Christmas Day? Seeing trends in data loss and identifying anomalous data were both important functional use cases to consider.

Operational

From an operational standpoint, we hoped to use anomaly detection to track overall volume over time and to help forecast into the future. Given the prevalence of CI/CD in growing companies, ensuring proper data flow is crucial. Seismic holds multi-tenant data, so tracking usage or activity on the platform would allow engineers to predict load on the system and scale it accordingly.

Approach

We started off by doing a lot of research on the internet, and tried out a few different methods.

First we tried k-means clustering on the data points, but this method always yielded a fixed number of anomalous points, proportional to the size of each data set. Each data point also has only a few features, which may affect the clustering result. Then we tried flagging points by how many standard deviations they fall from the mean, but this didn’t give us a consistent enough model for our problem. Finally, we came upon Prophet, an open source time series forecasting tool developed by Facebook’s data science team. We ended up using Prophet because it allowed us to address both our functional and operational goals.
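To make the standard-deviation attempt concrete, here is a minimal sketch of that kind of rule. The function name flag_by_std, the sample counts, and the thresholds are illustrative assumptions, not the code or data used in the experiments described here.

```python
import pandas as pd

def flag_by_std(values: pd.Series, n_sigma: float = 3.0) -> pd.Series:
    """Mark points that lie more than n_sigma standard deviations from the mean."""
    mean = values.mean()
    std = values.std()  # sample standard deviation (ddof=1)
    if std == 0:
        # A constant series has no outliers under this rule
        return pd.Series(False, index=values.index)
    return (values - mean).abs() > n_sigma * std

# Toy daily counts with one obvious drop at the end (made-up numbers)
counts = pd.Series([100, 102, 98, 101, 99, 100, 103, 97, 100, 5])
flags_2_sigma = flag_by_std(counts, n_sigma=2.0)
flags_3_sigma = flag_by_std(counts, n_sigma=3.0)
```

On this toy series the drop to 5 is flagged at 2 standard deviations but not at 3, because the outlier itself inflates the standard deviation. That sensitivity to the threshold is one way a rule like this can behave inconsistently.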

A good article by Insaf Ashrapov about using Prophet for anomaly detection, found here, gave us a solid basis for how to visualize the data and calculate importance.

Data

Each row in our data set contains a date and the accompanying activity count per day, for the activity of interest. You can find a lot of different data sets to test Prophet models here.

We read the data into a pandas DataFrame and set the column names to ‘ds’ and ‘y’, which are the names Prophet expects.

import pandas as pd
df = pd.read_csv("example.csv", header=None)  # the file has no header row
df.columns = ['ds', 'y']  # date column and value column, as Prophet expects

For our purposes, we take the log of the values so that the forecast, once transformed back, cannot dip below 0. Because log(0) is undefined, we left zero counts at 0, which means values of 0 and 1 are treated the same (log(1) = 0). Prophet also provides a logistic growth model, which may address the same issue.

import numpy as np
# Log-transform the counts; zeros are mapped to 1 first so they become log(1) = 0
df['y'] = np.log(df['y'].where(df['y'] != 0, 1))

Some points from our data frame
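For the logistic growth alternative mentioned above, Prophet reads a cap column (and optionally a floor column) from the data frame. The sketch below assumes a hypothetical helper add_saturation_bounds and a made-up cap of 1000. The model in this article uses growth='linear', so treat this as an untested starting point. The bounds constrain the trend, so the forecast with seasonality added may still stray slightly outside them.

```python
import pandas as pd

def add_saturation_bounds(frame, cap, floor=0.0):
    """Return a copy of frame with the 'cap' and 'floor' columns used by logistic growth."""
    out = frame.copy()
    out['cap'] = cap      # saturating maximum (required for growth='logistic')
    out['floor'] = floor  # saturating minimum (optional)
    return out

def fit_logistic(frame, cap, floor=0.0):
    """Fit a Prophet model with logistic growth on raw (un-logged) counts."""
    # Imported lazily; the package was named fbprophet before release 1.0
    from prophet import Prophet
    m = Prophet(growth='logistic')
    m.fit(add_saturation_bounds(frame, cap, floor))
    return m

# The cap of 1000 is a made-up value; pick one that suits your data.
example = add_saturation_bounds(
    pd.DataFrame({'ds': pd.date_range('2019-01-01', periods=3), 'y': [120, 0, 95]}),
    cap=1000,
)
```

The data frame passed to predict needs the same cap and floor columns, so the helper would be applied to the future frame as well.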

Prophet

Prophet exposes several customization points, which its authors describe in more detail in their paper.

Some key factors in Prophet to consider are seasonality, holiday effects, and regressors. These factors will vary depending on the specific data and context you are modeling, but below is an example of a Prophet model we used and the reasoning behind it.

from prophet import Prophet  # the package was named fbprophet before release 1.0
m = Prophet(daily_seasonality=True, yearly_seasonality=False, weekly_seasonality=False, growth='linear', interval_width=0.8)
m.add_country_holidays(country_name='US')
m.add_regressor('weekend')

Seasonality

Prophet provides an easy way to toggle daily, weekly, and yearly seasonality. Which of these flags to set to True or False depends on the context of your data. We were tracking daily data, so we enabled Prophet’s built-in daily_seasonality, which has a default Fourier order of 4.

On top of the built-in seasonality, you can also add custom seasonality. For example, quarterly seasonality might be an interesting aspect of the data to model:

# i is the candidate Fourier order being evaluated during tuning
m = Prophet(daily_seasonality=True, yearly_seasonality=False, weekly_seasonality=False, growth='linear', interval_width=0.8)
m.add_seasonality(name='quarterly', period=365.25/4, fourier_order=i)

Fourier Order & Parameter Tuning

To tune the custom seasonality, we used sklearn’s R-squared score to decide which fourier_order worked best for our data.

Graph of r2 Score for fourier_orders 10 to 20

Based on the graph above, a fourier_order of 14 produced the best score.
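A search like the one described above could be sketched as follows. This is not the original tuning code: the helper names (r2, prophet_predict, best_fourier_order) are hypothetical, scoring on a held-out slice is an assumption, and R-squared is computed directly with NumPy using the same formula as sklearn.metrics.r2_score so the sketch has fewer dependencies.

```python
import numpy as np
import pandas as pd

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def prophet_predict(train, test, order):
    """Fit Prophet with a quarterly seasonality of the given order; predict the test dates."""
    # Imported lazily; the package was named fbprophet before release 1.0
    from prophet import Prophet
    m = Prophet(daily_seasonality=True, yearly_seasonality=False,
                weekly_seasonality=False, growth='linear', interval_width=0.8)
    m.add_seasonality(name='quarterly', period=365.25 / 4, fourier_order=order)
    m.fit(train)
    return m.predict(test[['ds']])['yhat'].to_numpy()

def best_fourier_order(train, test, orders, predict=prophet_predict):
    """Score each candidate order on the test slice; return (best order, all scores)."""
    scores = {order: r2(test['y'], predict(train, test, order)) for order in orders}
    return max(scores, key=scores.get), scores
```

With a data frame split into train and test parts, best_fourier_order(train, test, range(10, 21)) would try the same range of orders shown in the graph. The predict argument exists only so that the search logic can be exercised without fitting a real model.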

Holidays

Holidays may affect data sets in many different ways, and Prophet has a built-in feature to add holidays based on country. Additionally, there is the option to add custom days, such as the Super Bowl, as holidays.

Source: Prophet Documentation
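Custom days are passed to Prophet as a data frame with holiday and ds columns, plus optional lower_window and upper_window columns that extend the effect to surrounding days. A minimal sketch, assuming three recent Super Bowl dates, a one-day spillover, and a hypothetical helper build_model_with_holidays (this is not part of the model described in this article):

```python
import pandas as pd

# Super Bowl Sundays for 2017-2019; the windows are illustrative choices
superbowls = pd.DataFrame({
    'holiday': 'superbowl',
    'ds': pd.to_datetime(['2017-02-05', '2018-02-04', '2019-02-03']),
    'lower_window': 0,  # no effect on the days before
    'upper_window': 1,  # also cover the Monday after
})

def build_model_with_holidays(holidays):
    """Create a Prophet model with custom holidays plus the built-in US holidays."""
    # Imported lazily; the package was named fbprophet before release 1.0
    from prophet import Prophet
    m = Prophet(holidays=holidays)
    m.add_country_holidays(country_name='US')
    return m
```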

Regressors

We added a custom regressor to account for weekends, when we expected less data. One important thing we found is that adding an appropriate regressor can have a large impact on the accuracy of the Prophet model.

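def weekend(ds):
    """Return 1 if the date falls on a Saturday or Sunday, else 0."""
    ds = pd.to_datetime(ds)
    # weekday(): Monday is 0, so Saturday is 5 and Sunday is 6
    if ds.weekday() == 6 or ds.weekday() == 5:
        return 1
    else:
        return 0

df['weekend'] = df['ds'].apply(weekend)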
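One practical detail: a column registered with add_regressor has to be present in the data frame passed to predict as well as the one passed to fit. A minimal sketch, assuming hypothetical helpers add_weekend_flag and fit_and_forecast and a 30-day horizon (the horizon is an arbitrary choice):

```python
import pandas as pd

def add_weekend_flag(frame):
    """Return a copy of frame with a 0/1 'weekend' column derived from 'ds'."""
    out = frame.copy()
    # dayofweek: Monday is 0, so Saturday is 5 and Sunday is 6
    out['weekend'] = (pd.to_datetime(out['ds']).dt.dayofweek >= 5).astype(int)
    return out

def fit_and_forecast(frame, periods=30):
    """Fit the model with the weekend regressor and forecast `periods` days ahead."""
    # Imported lazily; the package was named fbprophet before release 1.0
    from prophet import Prophet
    m = Prophet(daily_seasonality=True, yearly_seasonality=False,
                weekly_seasonality=False, growth='linear', interval_width=0.8)
    m.add_country_holidays(country_name='US')
    m.add_regressor('weekend')
    m.fit(add_weekend_flag(frame))
    # The future frame only has 'ds', so the regressor column is added again
    future = add_weekend_flag(m.make_future_dataframe(periods=periods))
    return m, m.predict(future)

# Friday through Monday
sample = add_weekend_flag(pd.DataFrame({'ds': pd.date_range('2019-09-13', periods=4)}))
```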

Detecting Anomalies

We detect anomalies in much the same way as the article mentioned above, with a few changes to how the ‘importance’ is calculated.

We calculated our importance relative to the width of the prediction interval, rather than using just the difference between the actual value and the upper or lower bound. This gives more context on how drastic an anomaly is compared to the range of expected values.
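The exact formula is not spelled out here, so the following is only one plausible reading: the distance by which a point falls outside the interval, divided by the interval width. The function name score_anomalies and the sample numbers are assumptions. The column names yhat_lower and yhat_upper are the ones Prophet returns from predict.

```python
import pandas as pd

def score_anomalies(frame):
    """Label points outside [yhat_lower, yhat_upper] and score them by interval width.

    Expects columns 'y' (actual), 'yhat_lower' and 'yhat_upper'.
    """
    out = frame.copy()
    width = out['yhat_upper'] - out['yhat_lower']
    above = out['y'] > out['yhat_upper']
    below = out['y'] < out['yhat_lower']

    # +1 for higher than expected, -1 for lower than expected, 0 for normal
    out['anomaly'] = 0
    out.loc[above, 'anomaly'] = 1
    out.loc[below, 'anomaly'] = -1

    # Assumed importance: overshoot measured in units of the interval width
    out['importance'] = 0.0
    out.loc[above, 'importance'] = (out['y'] - out['yhat_upper']) / width
    out.loc[below, 'importance'] = (out['yhat_lower'] - out['y']) / width
    return out

demo = pd.DataFrame({
    'y': [5.0, 12.0, 1.0],
    'yhat_lower': [4.0, 4.0, 4.0],
    'yhat_upper': [8.0, 8.0, 8.0],
})
scored = score_anomalies(demo)
```

In this toy frame the interval is 4 wide, so the point at 12 overshoots by one full interval width (importance 1.0) and the point at 1 undershoots by three quarters of it (importance 0.75).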

Plotting Anomalies

Example Anomaly Detection Graph

The green portions of the graph represent the range between the lower and upper bounds of the forecast. Points outside of that range are considered anomalies, and the size of each red dot is proportional to the ‘importance’ factor.

Anomaly data frame

Above is an example of our anomaly data frame. We reversed the log transform by exponentiating the values in order to see the predicted range in the original units, and the sign of the ‘anomaly’ column indicates whether the data is lower or higher than we predicted.
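Undoing the log transform amounts to exponentiating the actual values and the forecast columns. A minimal sketch, assuming a hypothetical helper to_original_scale and the column names Prophet uses for its forecast:

```python
import numpy as np
import pandas as pd

def to_original_scale(frame, columns=('y', 'yhat', 'yhat_lower', 'yhat_upper')):
    """Return a copy of frame with the natural-log transform undone on the given columns."""
    out = frame.copy()
    for col in columns:
        if col in out.columns:  # skip columns that are absent from this frame
            out[col] = np.exp(out[col])
    return out

# Toy values on the log scale
logged = pd.DataFrame({
    'y': [0.0, np.log(10.0)],
    'yhat': [np.log(2.0), np.log(8.0)],
    'yhat_lower': [0.0, np.log(4.0)],
    'yhat_upper': [np.log(4.0), np.log(16.0)],
})
restored = to_original_scale(logged)
```

Because zero counts were stored as 0 on the log scale, they come back as 1 after this step rather than 0.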

Possible Improvements

Here are some open questions whose answers could improve our model:

  • How to split the training, validation and test data
  • How far in advance to forecast
  • How often to train the model
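On the first question, one common starting point for time series is to split by date rather than at random, so that the model is never evaluated on days earlier than the ones it was trained on. A minimal sketch, assuming a hypothetical helper chronological_split and arbitrary cut-off dates:

```python
import pandas as pd

def chronological_split(frame, val_start, test_start):
    """Split a frame with a 'ds' column into train / validation / test by date."""
    ordered = frame.copy()
    ordered['ds'] = pd.to_datetime(ordered['ds'])
    ordered = ordered.sort_values('ds').reset_index(drop=True)
    val_start, test_start = pd.Timestamp(val_start), pd.Timestamp(test_start)

    train = ordered[ordered['ds'] < val_start]
    val = ordered[(ordered['ds'] >= val_start) & (ordered['ds'] < test_start)]
    test = ordered[ordered['ds'] >= test_start]
    return train, val, test

# 100 days of made-up data, cut roughly 70 / 15 / 15
history = pd.DataFrame({'ds': pd.date_range('2019-01-01', periods=100),
                        'y': list(range(100))})
train, val, test = chronological_split(history, '2019-03-12', '2019-03-27')
```

Prophet also ships a cross_validation helper in its diagnostics module, which evaluates forecasts from a series of rolling cut-off dates and may be a better fit once the forecast horizon is decided.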

All in all, delving into anomaly detection was a great experience. Special thanks to our managers and the Seismic data science team for guiding us through the process!
