Volume Forecasting and Anomaly Detection using Fbprophet

Srinivas kulkarni
Apr 13 · 7 min read

A practical implementation of time series analysis using Facebook’s open-source library fbprophet

This article is a follow-up to one of my earlier articles, Time Series Analysis — A Beginners Guide. If you are new to time series, I strongly recommend reading that article before working through this implementation example.

Installing fbprophet

I like to keep a separate environment for each suite of projects. Set up a new environment, then use the commands below from the Anaconda prompt to install the required software. Once everything is installed, open the Jupyter notebook and we are ready to go. Note that since version 1.0 the library has been published under the name prophet; this article uses the older fbprophet package name throughout.

pip install jupyter
pip install pandas
pip install matplotlib
pip install seaborn
### Installing fbprophet
pip install pystan
conda install -c conda-forge fbprophet
### open jupyter notebook
jupyter notebook

Data set

The data set consists of sales data for the past 6 months. Given this data set, the task is to forecast sales for the next month. I am not sharing the data set, but it is simple to create one or to download a similar one from Kaggle.com. Here is a sample of the data set:

Sales Data Set
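Since the original data set is not shared, here is a minimal sketch of how a comparable one could be generated. Everything in it is an assumption made for illustration: the function name make_sample_sales, the start date, the weekday and weekend levels and the noise are invented, and only the column names Date and Sales follow the code used later in this article.

```python
import numpy as np
import pandas as pd


def make_sample_sales(start="2021-01-01", days=181, seed=0):
    """Build a hypothetical daily sales frame with 'Date' and 'Sales' columns."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range(start, periods=days, freq="D")
    # Made-up weekly pattern: weekdays are busier than weekends
    weekday_level = np.where(dates.dayofweek < 5, 200.0, 80.0)
    trend = np.linspace(0.0, 30.0, days)      # gentle upward drift
    noise = rng.normal(0.0, 15.0, days)       # random day-to-day variation
    sales = np.clip(weekday_level + trend + noise, 0, None).round()
    df = pd.DataFrame({"Date": dates.strftime("%Y-%m-%d"), "Sales": sales})
    # Shuffle the rows so the frame is not sorted by date, as in the article
    return df.sample(frac=1.0, random_state=seed).reset_index(drop=True)


df = make_sample_sales()
print(df.head())
```

The dates are stored as strings on purpose, so that the datetime conversion and sorting steps in the EDA section still have something to do.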

Loading the data set and EDA

Create a new python 3 notebook and import the required libraries.

#NOTE: The first time you execute the imports, you might get a warning related to plotly; execute the cell again and the warning will disappear.

The next step is to load the data set and do exploratory data analysis (EDA) to fix any issues in the data. Note that I have intentionally taken a clean data set so that few corrections are needed, since EDA is not the focus of this post.

import pandas as pd

# Assumes df has already been loaded with 'Date' and 'Sales' columns
# Preview the first rows; the Date column is not sorted in my case
df.head()
# Plot the sales data to visualize it
df.plot()
# Inspect the column names
df.columns
# Check for null values; none detected in my case
df.isnull().sum()
# The Date column needs to be explicitly converted to datetime
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values('Date', inplace=True)
# Reset the index values after sorting
df.reset_index(drop=True, inplace=True)

White Noise Detection

Now that we have cleaned up the data set, the next step is to find out whether it is suited for time series analysis. If the data is white noise, it is best to stop here. You can refer to the article white-noise-time-series-python to learn more about how to detect white noise. The code below shows the steps I followed to detect white noise.

import numpy as np

# Check the data set stats
df.describe()
# Split the data set into 4 chunks and create a separate data frame with the
# mean and standard deviation of the full series and of each chunk.
# Then plot it to see if there is a difference in mean and std.
df_split = np.array_split(df, 4)
df_stats = pd.DataFrame(np.array([[df['Sales'].mean(), df['Sales'].std()],
[df_split[0]['Sales'].mean(), df_split[0]['Sales'].std()],
[df_split[1]['Sales'].mean(), df_split[1]['Sales'].std()],
[df_split[2]['Sales'].mean(), df_split[2]['Sales'].std()],
[df_split[3]['Sales'].mean(), df_split[3]['Sales'].std()]]),
columns=['mean', 'std'])
# The plot shows that the mean is not constant; it fluctuates
df_stats.plot()

Below is the graph, which shows that the mean is not constant.

Mean and Standard deviation

Let's now use some plots to check whether the distribution is random. We can also plot a histogram to check whether the distribution is Gaussian.

Below are the two graphs.

It does look like volumes are high on a couple of days of the week, and weekend volumes seem to be much lower. The histogram doesn’t show a Gaussian distribution. We are good so far.
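The two plots described above can be sketched roughly as follows. This is not the author's original code: the stand-in data is synthetic (made-up weekday and weekend levels) so the snippet runs on its own, and in the notebook you would use the df loaded earlier with its Date and Sales columns instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data (hypothetical values): busy weekdays, quiet weekends
dates = pd.date_range("2021-01-01", periods=181, freq="D")
rng = np.random.default_rng(0)
sales = np.where(dates.dayofweek < 5, 200.0, 80.0) + rng.normal(0, 15, len(dates))
df = pd.DataFrame({"Date": dates, "Sales": sales})

# Left: sales over time. Right: histogram of daily sales.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(12, 4))
ax_line.plot(df["Date"], df["Sales"])
ax_line.set_title("Sales over time")
ax_hist.hist(df["Sales"], bins=20)
ax_hist.set_title("Distribution of daily sales")
fig.tight_layout()

# Averaging by day of week makes the weekday/weekend gap explicit
by_weekday = df.groupby(df["Date"].dt.day_name())["Sales"].mean()
print(by_weekday)
```

A histogram with two separate humps, one for weekdays and one for weekends, is the kind of non-Gaussian shape that hints at structure worth modelling.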

The next step is to check for auto-correlation. If the auto-correlation suggests that previous days' data influences the next day’s data, we can go ahead.

Below is the correlogram showing this relation. As you can see, it indicates correlation with previous data. Sine waves like the ones seen in this example are a strong sign of seasonality in the data set.

Auto-Correlation
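A correlogram like the one above can be produced with pandas' autocorrelation_plot. The sketch below again uses synthetic stand-in data (an assumption, not the author's data set) and also prints the correlation at lags of 1 and 7 days, which is a quick numeric check for a weekly pattern.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import numpy as np
import pandas as pd
from pandas.plotting import autocorrelation_plot

# Stand-in series (hypothetical values) with a weekly weekday/weekend pattern
dates = pd.date_range("2021-01-01", periods=181, freq="D")
rng = np.random.default_rng(0)
values = np.where(dates.dayofweek < 5, 200.0, 80.0) + rng.normal(0, 15, len(dates))
sales = pd.Series(values, index=dates, name="Sales")

# Correlogram: autocorrelation for every lag, with confidence bands
ax = autocorrelation_plot(sales)

# Numeric check: correlation with the series shifted by 1 and by 7 days
lag1 = sales.autocorr(lag=1)
lag7 = sales.autocorr(lag=7)
print(f"lag 1: {lag1:.2f}, lag 7: {lag7:.2f}")
```

For white noise, both values would sit close to zero; a clearly higher value at lag 7 than at lag 1 points to weekly seasonality.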

With all this information, we are good to proceed, as our data is not white noise. Note that we need not worry about the data being stationary; fbprophet can handle this kind of data.

Modelling

Data modeling is an important step. The different stages involved are:

  1. Column renaming — fbprophet expects the column names to be ‘ds’ and ‘y’
  2. Initializing the model
  3. Creating future dates
  4. Making predictions

The code to do all this is below:

from fbprophet import Prophet

# fbprophet expects the columns to be named 'ds' (date) and 'y' (value)
df = df.rename(columns={'Date': 'ds', 'Sales': 'y'})
# Initialize the model. fbprophet automatically detects weekly seasonality.
# Note that model fitting may take time depending on the data set size.
model = Prophet()
model.fit(df)
# Run the below statement to know more about seasonality, period and mode.
model.seasonalities
# Create future dates for the next 30 days
future_dates = model.make_future_dataframe(periods=30)
# Predict for the future dates
prediction = model.predict(future_dates)
prediction.head()
prediction[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
# Plot the predicted projection
model.plot(prediction)

Within the prediction, you will get multiple columns, but we are mainly interested in three of them:

  1. yhat — The predicted value for that date
  2. yhat_upper — Upper bound of the uncertainty interval (80% by default). Roughly the maximum volume of sales we expect on that date
  3. yhat_lower — Lower bound of the uncertainty interval. Roughly the minimum volume of sales we expect on that date

It's also important to understand the plot produced by the last statement, which is explained below. The first plot shows the full data set. The next one uses a shorter date range, which I used to show clearly how to read this graph.

Cut down version of the data set

As you can see, most of the points lie within the range of yhat_lower and yhat_upper. There are certainly some outliers, but for most of the data set the prediction seems satisfactory.

Anomaly Detection

The outliers in the data set tell us that something isn’t right on those days. These are the data points we should be worried about. Once we have the forecast in place, we can use the forecast values of yhat_upper and yhat_lower as upper and lower bounds for future sales. If the actual value falls outside this range on any given day, it's a warning sign that something isn’t right.
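The bounds check described above can be sketched with plain pandas. The helper name flag_anomalies and the toy numbers are hypothetical; the only assumption taken from the article is that the forecast frame carries the ds, yhat, yhat_lower and yhat_upper columns and that the actuals use ds and y.

```python
import pandas as pd


def flag_anomalies(actuals, forecast):
    """Mark rows whose actual value falls outside [yhat_lower, yhat_upper].

    Returns the merged frame with an 'anomaly' column:
    1 = above the upper bound, -1 = below the lower bound, 0 = within range.
    """
    cols = ["ds", "yhat", "yhat_lower", "yhat_upper"]
    merged = actuals.merge(forecast[cols], on="ds", how="inner")
    merged["anomaly"] = 0
    merged.loc[merged["y"] > merged["yhat_upper"], "anomaly"] = 1
    merged.loc[merged["y"] < merged["yhat_lower"], "anomaly"] = -1
    return merged


# Toy example with made-up numbers
days = pd.to_datetime(["2021-07-01", "2021-07-02", "2021-07-03"])
actuals = pd.DataFrame({"ds": days, "y": [100.0, 250.0, 40.0]})
forecast = pd.DataFrame({
    "ds": days,
    "yhat": [110.0, 120.0, 115.0],
    "yhat_lower": [90.0, 100.0, 95.0],
    "yhat_upper": [130.0, 140.0, 135.0],
})

flagged = flag_anomalies(actuals, forecast)
print(flagged[["ds", "y", "anomaly"]])
```

In practice, the forecast argument would be the prediction frame returned by model.predict, and the actuals would be the sales recorded once those days have passed.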

It's important to note that while anomalous data can indicate software issues, critical incidents, etc., it may also point to potential opportunities, for instance a change in consumer behavior. So any deviation needs to be studied carefully, and the findings will differ from case to case.

Cross Validation

Cross validation works slightly differently for time series. Its purpose is to measure forecast error using historical data. The inputs to cross validation are the model we built and the following important parameters:

  1. initial — how many days of data to consider for training for the purpose of cross validation
  2. period — The spacing between cut-off dates
  3. horizon — The forecast horizon.

By default, the initial training period is set to three times the horizon, and cutoffs are made every half a horizon.

Below we do cross-validation to measure prediction performance on a horizon of 14 days, starting with 42 days of training data for the first cutoff and then making predictions every 21 days. This yields 6 forecasts in total.
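To see where the number of forecasts comes from, here is a rough sketch of how the cutoff dates can be derived from initial, period and horizon. It is a simplified approximation written for illustration, not the library's own code: sketch_cutoffs is a hypothetical helper, and the date range (181 daily observations starting 2021-01-01) is an assumption standing in for the six months of data.

```python
import pandas as pd


def sketch_cutoffs(ds, initial, period, horizon):
    """Approximate the cut-off dates used for time series cross validation.

    The last cutoff leaves exactly one horizon of data after it; earlier
    cutoffs step back by `period` while at least `initial` worth of
    training data remains before them.
    """
    initial, period, horizon = (pd.Timedelta(x) for x in (initial, period, horizon))
    start, end = ds.min(), ds.max()
    cutoff = end - horizon
    cutoffs = []
    while cutoff >= start + initial:
        cutoffs.append(cutoff)
        cutoff -= period
    return sorted(cutoffs)


# Assumed date range: roughly six months of daily data
ds = pd.Series(pd.date_range("2021-01-01", periods=181, freq="D"))
cutoffs = sketch_cutoffs(ds, initial="42 days", period="21 days", horizon="14 days")
for c in cutoffs:
    print(c.date())

# Defaults described above, for a 14-day horizon
horizon = pd.Timedelta("14 days")
default_initial = 3 * horizon    # three times the horizon
default_period = 0.5 * horizon   # half a horizon
print(default_initial, default_period)
```

Under these assumptions the sketch yields six cutoffs, consistent with the six forecasts mentioned above. Each cutoff produces one forecast covering the following 14 days.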

from fbprophet.diagnostics import cross_validation
df_cv = cross_validation(model, initial='42 days', period='21 days', horizon='14 days')
df_cv
Cross Validation Data

As you can see, most of the time the actual value y is within the lower and upper bounds, and in many cases the predicted value yhat is close to the actual value y.

We can also see the performance metrics and plot them to understand how our model performed.

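# Compute the performance metrics for the cross validation results
from fbprophet.diagnostics import performance_metrics
performance_metrics(df_cv).head()
# Plot the root mean squared error metric
from fbprophet.plot import plot_cross_validation_metric
fig = plot_cross_validation_metric(df_cv, metric='rmse')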
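To make the RMSE metric less of a black box, the sketch below computes it by hand from a frame shaped like df_cv. It assumes the ds, cutoff, y and yhat columns; the helper name rmse_by_horizon and the toy values are made up. It groups by exact horizon, which is a simplification: the library's own metric is computed over a rolling window of horizons, so its numbers will not match this sketch exactly.

```python
import numpy as np
import pandas as pd


def rmse_by_horizon(df_cv):
    """Root mean squared error per forecast horizon (ds minus cutoff)."""
    out = df_cv.copy()
    out["horizon"] = out["ds"] - out["cutoff"]
    out["sq_err"] = (out["y"] - out["yhat"]) ** 2
    return np.sqrt(out.groupby("horizon")["sq_err"].mean())


# Toy cross validation output: two cutoffs, each forecasting 1 and 2 days ahead
cutoffs = pd.to_datetime(["2021-03-01", "2021-03-01", "2021-03-22", "2021-03-22"])
steps = pd.to_timedelta([1, 2, 1, 2], unit="D")
toy_cv = pd.DataFrame({
    "cutoff": cutoffs,
    "ds": cutoffs + steps,
    "y": [100.0, 100.0, 100.0, 100.0],
    "yhat": [103.0, 104.0, 97.0, 96.0],  # errors of 3 at day 1 and 4 at day 2
})

rmse = rmse_by_horizon(toy_cv)
print(rmse)
```

An RMSE that grows with the horizon, as in this toy example, is the usual pattern: forecasts further from the cutoff tend to be less accurate.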

I found this library pretty useful and much simpler than other forecasting libraries. Do check it out and let me know what you think about it. The documentation shared by Facebook is also useful. You can find it at https://facebook.github.io/prophet/docs/quick_start.html.
