Using Prophet for Anomaly Detection
Co-Authored by: Michael Duan and Cathy Xu
To start off this blog, we wanted to preface by saying we aren’t data science experts or anything like that. We’re just two summer interns who spent a portion of their time at Seismic trying to tackle anomaly detection. Here’s a short article about our progress on this topic and some potential points for further improvement.
Goal
Functional
Functionally, we wanted to leverage anomaly detection to look for evidence of bugs or system problems. If we predict a volume of activity for today, how far off is the actual value from the prediction, and is that difference alarming? Would a dip in the data on a certain day be caused by loss of data, a pipeline failure, or just because it’s Christmas Day? Being able to see trends in data loss and to flag anomalous data were both important functional use cases to consider.
Operational
From an operational standpoint, we hoped to use anomaly detection to track overall volume over time and to help forecast into the future. Given the prevalence of CI/CD in growing companies, ensuring proper data flow is crucial. Seismic holds multi-tenant data, so being able to track usage and activity on the platform would allow engineers to predict load on the system and scale it accordingly.
Approach
We started off by doing a lot of research on the internet, and tried out a few different methods.
First we tried k-means clustering the data points, but this method only flagged a fixed number of anomalous points proportional to the size of each data set, and with only a few features per data point the clustering results were unreliable. Next we tried a simple statistical approach, flagging anomalies by how many standard deviations a point falls from the mean, but this didn’t give us a consistent enough model for our problem. Finally, we came upon Prophet, an open-source time series forecasting tool developed by Facebook’s data science team. We ended up using Prophet because it allowed us to address both our functional and operational goals.
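For reference, the standard-deviation baseline we tried can be sketched in a few lines (a minimal illustration, not our exact code; the threshold k is an assumption):

```python
import pandas as pd

def std_anomalies(y: pd.Series, k: float = 3.0) -> pd.Series:
    """Flag points more than k standard deviations from the mean."""
    mu, sigma = y.mean(), y.std()
    return (y - mu).abs() > k * sigma

# A flat series with one spike: only the spike is flagged.
y = pd.Series([10, 11, 9, 10, 10, 50, 10, 9])
print(std_anomalies(y, k=2.0).tolist())
```

The weakness we ran into is visible here: the mean and standard deviation are both inflated by the anomalies themselves, so the threshold shifts with the data.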
A good article by Insaf Ashrapov about using Prophet for anomaly detection, found here, gave us a solid basis for how to visualize the data and calculate importance.
Data
Each row in our data set contains a date and the accompanying activity count per day, for the activity of interest. You can find a lot of different data sets to test Prophet models here.
We read the data into a pandas DataFrame and rename the columns to ‘ds’ and ‘y’, the names Prophet expects.
import pandas as pd

df = pd.read_csv("example.csv", header=None)
df.columns = ['ds', 'y']  # Prophet requires 'ds' (date) and 'y' (value)
For our purposes we log-transform the values to ensure the trends Prophet forecasts will not dip below 0. Because of this, we treated values of 0 and 1 as the same (log(1) = 0). Prophet also provides a logistic growth model, which may address the same issue.
import numpy as np

# log non-zero values; zeros stay 0, so 0 and 1 collapse to the same value
df['y'] = np.where(df['y'] != 0, np.log(df['y']), 0)
Prophet
Prophet offers a number of customization points, which are described in more detail in the Prophet paper.
Some key factors to consider are seasonality, holiday effects, and regressors. These will vary depending on the specific data and context you are modeling, but below we provide an example of a Prophet model we used and the reasoning behind it.
m = Prophet(daily_seasonality=True, yearly_seasonality=False,
            weekly_seasonality=False, growth='linear', interval_width=0.8)
m.add_country_holidays(country_name='US')
m.add_regressor('weekend')
Seasonality
Prophet provides an easy way to toggle daily, weekly, and yearly seasonality. Which of these flags you enable depends on the context of your data. We were tracking daily data, so we used Prophet’s built-in daily_seasonality flag, which defaults to a Fourier order of 4.
On top of the built-in seasonality, you can also add custom seasonality. For example, quarterly seasonality might be an interesting aspect of the data to model:
m = Prophet(daily_seasonality=True, yearly_seasonality=False,
            weekly_seasonality=False, growth='linear', interval_width=0.8)
m.add_seasonality(name='quarterly', period=365.25/4, fourier_order=i)  # i chosen by the tuning step below
Fourier Order & Parameter Tuning
To tune the custom seasonality, we used sklearn’s R-squared score (r2_score) to decide which fourier_order worked best for the data.
Based on the graph above, a fourier_order of 14 produced the best score.
Holidays
Holidays may affect data sets in many different ways, and Prophet has a built-in feature to add holidays based on country. Additionally, there is the option to add custom days, such as the Superbowl, into holiday considerations.
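As a concrete illustration, custom holidays are passed to Prophet as a DataFrame with ‘holiday’ and ‘ds’ columns; the Super Bowl sketch below follows the holidays format in Prophet’s documentation:

```python
import pandas as pd

# Custom holiday table: one row per occurrence of the event.
superbowls = pd.DataFrame({
    'holiday': 'superbowl',
    'ds': pd.to_datetime(['2010-02-07', '2014-02-02', '2016-02-07']),
    'lower_window': 0,  # no effect before the day itself
    'upper_window': 1,  # effect extends one day past the event
})

# The table is passed in at construction time, and built-in country
# holidays can still be layered on top:
# m = Prophet(holidays=superbowls)
# m.add_country_holidays(country_name='US')
```

The lower_window and upper_window columns widen the effect around each date, which is useful for events whose impact spills into neighboring days.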
Regressors
We added a custom regressor to account for weekends, which are days we expected less data on. An important note we found was that adding an appropriate regressor can have a large impact on the accuracy of the Prophet model.
def weekend(ds):
    ds = pd.to_datetime(ds)
    # Saturday = 5, Sunday = 6
    if ds.weekday() == 5 or ds.weekday() == 6:
        return 1
    return 0

df['weekend'] = df['ds'].apply(weekend)
Detecting Anomalies
We detect anomalies in much the same way as the article linked above, with a few changes to how the ‘importance’ is calculated.
We calculate importance based on the interval range, rather than just the difference from the upper or lower bound. This gives more context on how drastic an anomaly is compared to the range of expected values.
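As an illustration, that interval-based importance can be sketched like this (column names follow Prophet’s forecast output; the exact scaling shown is an assumption on our part, not our production code):

```python
import pandas as pd

def add_anomalies(forecast: pd.DataFrame, actual: pd.Series) -> pd.DataFrame:
    """Mark points outside [yhat_lower, yhat_upper] and score each anomaly
    relative to the width of the predicted interval."""
    df = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].copy()
    df['y'] = actual.values
    width = df['yhat_upper'] - df['yhat_lower']

    # Sign of the anomaly: +1 above the interval, -1 below, 0 inside.
    df['anomaly'] = 0
    df.loc[df['y'] > df['yhat_upper'], 'anomaly'] = 1
    df.loc[df['y'] < df['yhat_lower'], 'anomaly'] = -1

    # Importance: distance outside the interval, in units of interval width.
    df['importance'] = 0.0
    up, down = df['anomaly'] == 1, df['anomaly'] == -1
    df.loc[up, 'importance'] = (df['y'] - df['yhat_upper'])[up] / width[up]
    df.loc[down, 'importance'] = (df['yhat_lower'] - df['y'])[down] / width[down]
    return df
```

Dividing by the interval width means a point just outside a wide, uncertain interval scores lower than the same distance outside a narrow, confident one.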
Plotting Anomalies
The green portion of the graph represents the lower and upper bounds of the trend. Points outside of that band are considered anomalies, and the size of each red dot is proportional to its ‘importance’ factor.
Above is an example of our anomaly data frame. We inverse-logged the values in order to see the predicted range, and the sign of the ‘anomaly’ column indicates whether the data is lower or higher than predicted.
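Recovering original units from the logged values just means exponentiating (a small sketch: np.exp inverts the np.log applied during preprocessing, and an original 0 comes back as 1 because of the log(1) = 0 convention):

```python
import numpy as np
import pandas as pd

# Logged values as stored in the anomaly data frame.
logged = pd.Series([0.0, np.log(5), np.log(120)])

# np.exp inverts the earlier np.log; the 0.0 entry maps back to 1,
# consistent with treating 0 and 1 as the same value.
original = np.exp(logged)
print(original.round().astype(int).tolist())
```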
Possible Improvements
Here are some possible improvements that could be made to our model:
- How to split the training, validation and test data
- How far in advance to forecast
- How often to train the model
All in all this was a great experience to delve into anomaly detection. Special thanks to our managers and the Seismic data science team for guiding us through the process!