A deep dive into Forecasting
In Part 1, forecasting was briefly introduced as a useful technique for establishing future baselines for time series data. Although reading simple forecasts seems straightforward, there are a number of practical nuances that need to be understood to consume these forecasts correctly. This post covers some of the most important caveats of the forecasting process. Following the precedent set by the last post, the discussion will be light on code and focus on decluttering the concepts.
How does forecasting actually work?
Time series forecasting essentially involves two phases. First, a training phase learns the behavior of the time series from its past. Then, a prediction phase applies this learned behavior to forecast future values.
Let’s try to understand what happens behind the scenes using an example.
We have the monthly sales data for a store for the past 2 years. Let X = [x1, x2, x3, …, x24] represent our monthly sales values. We need to forecast the sales for the next 6 months.
Training Phase — Involves learning a function F(X) that outputs a future value given all past sales values. This function can simply be thought of as the forecasting model.
xt+1 = F(xt, xt−1, xt−2, …, x1)
Prediction Phase — The learned function can then be used to calculate future sales values as follows,
x25 = F(x24, x23, …, x1)
x26 = F(x25, x24, …, x1) and so on…
The training phase is a classic optimization problem where we learn the model representing our time series.
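The two phases can be sketched in a few lines of code. This is a minimal illustration only, assuming (purely for simplicity) that F is a linear function of the single previous value; real forecasting models use far richer functions of the history.

```python
import numpy as np

# 24 months of illustrative sales data (synthetic, for demonstration only)
rng = np.random.default_rng(42)
sales = 100 + 2 * np.arange(24) + rng.normal(0, 3, 24)

# Training phase: learn F as a linear function of the previous value,
# i.e. a least-squares fit of x[t+1] = a * x[t] + b
a, b = np.polyfit(sales[:-1], sales[1:], 1)

# Prediction phase: apply the learned F recursively for the next 6 months
forecasts = []
last = sales[-1]
for _ in range(6):
    last = a * last + b   # x25 = F(x24), then x26 = F(x25), ...
    forecasts.append(last)

print(forecasts)
```

Note how the prediction phase feeds each forecast back in as an input for the next step, exactly as in the x25, x26, … equations above.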
How is the forecasting accuracy measured?
While building a forecast model, the available data is generally divided into two sets: a training set and a validation/test set.
The training set is used to learn the model. The learned model is then used to forecast over the test set period, and the accuracy is calculated by comparing the actual and forecasted values for the test set. There are many error metrics in use, but for the sake of understanding, accuracy can be measured using the Mean Absolute Percentage Error (MAPE): the mean of the absolute errors expressed as a percentage of the actual values.
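As a concrete sketch, MAPE averages |actual − forecast| / |actual| over the test period and expresses it as a percentage:

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error:
    mean of |actual - forecast| / |actual|, as a percentage."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

# Example: a 2-month test set, off by 10% and 5% respectively
print(mape([100, 200], [110, 190]))  # -> 7.5
```

A lower MAPE means a better fit; a MAPE of 7.5 here means the forecasts were off by 7.5% of the actual value on average.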
The accuracy calculated on unseen data gives an idea of how the model would perform in the real world on future data. The plot below shows the forecasting process for the Airline Sales dataset: the black line represents the actual data, the orange line represents the learned model, and the red line represents the forecasts over the test period.
What is a Prediction Interval?
Forecasts are almost always accompanied by a prediction interval. A prediction interval is calculated for each forecasted value and signifies the range (lower and upper bound) in which the actual value is expected to lie with a stated confidence level.
For instance, recall our bike-sharing forecasts from the last post: the blue shaded region around the orange forecasts signified the 95% prediction interval. This means that, based on the learned model, the actual future values can be expected to lie in the shaded region around the forecast with 95% confidence.
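A common way such intervals are constructed, sketched here under the assumption of approximately normal forecast errors, is forecast ± z · σ, where σ is the standard deviation of the model's residuals and z ≈ 1.96 for 95% confidence:

```python
import numpy as np

# Residuals (actual - fitted) from a hypothetical trained model
residuals = np.array([1.2, -0.8, 0.5, -1.1, 0.9, -0.3])
sigma = residuals.std(ddof=1)

forecast = 50.0   # point forecast from the model (illustrative value)
z = 1.96          # z-score corresponding to a 95% interval
lower, upper = forecast - z * sigma, forecast + z * sigma
print(f"95% prediction interval: [{lower:.2f}, {upper:.2f}]")
```

Real libraries refine this (for example, widening the interval for forecasts further into the future), but the idea is the same: the interval quantifies the model's uncertainty around each point forecast.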
What kinds of Models are used for Forecasting?
As seen earlier, a time series has many components: trend, seasonality, irregular cycles, noise, etc. Forecasting can be done using different models that learn these features in different ways, so the quality of the forecasts also depends on what kind of model is chosen and whether it can capture the patterns in our data. While there are many types of models (smoothing, ARIMA, regression, deep-learning based, etc.), this post will briefly discuss the two most widely used families: Exponential Smoothing and ARIMA.
Exponential Smoothing

Forecasts produced using this method are simply weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, the more recent the observation, the higher the associated weight. For example, a forecast for time T+1 given data up to time T is a weighted sum of xT, xT−1, …, where the weights, derived from a smoothing coefficient α learned by the model, shrink for older observations.
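In its simplest form, the smoothed value at each step mixes the new observation with the previous smoothed value, and unrolling that recursion is exactly what produces the exponentially decaying weights. A sketch with a fixed smoothing factor (in practice α is learned from the data):

```python
def simple_exp_smoothing(series, alpha):
    """Each step computes s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
    Unrolling the recursion gives weights alpha, alpha*(1-alpha),
    alpha*(1-alpha)^2, ... on progressively older observations."""
    s = series[0]
    for x in series[1:]:
        s = alpha * x + (1 - alpha) * s
    return s  # one-step-ahead forecast

print(simple_exp_smoothing([10, 20, 30], alpha=0.5))  # -> 22.5
```

With α = 0.5 the result 22.5 is 0.5·30 + 0.25·20 + 0.25·10: the most recent observation dominates, as expected.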
This is a very simple form of the model, useful for developing intuition; real-world data is more complex and calls for more elaborate smoothing equations. For instance, even Excel provides a variation of smoothing in its forecast functionality. Let's look at the more specific Triple Exponential Smoothing, also known as the Holt-Winters seasonal method.
The Holt-Winters method combines a forecast equation with three smoothing components: the level ℓt, the trend bt, and the seasonality st. In the additive form, the h-step-ahead forecast is ŷt+h = ℓt + h·bt + st+h−m(k+1), where m is the seasonal period.
Without going into much detail on the actual smoothing equations and their coefficients, the key takeaway is that this method learns the trend, level and seasonality separately and then combines them to generate the forecast. One point to note is that each smoothing component has a corresponding smoothing coefficient (α, β and γ) in its equation. The smoothing coefficients are learned by the model during training and need not be provided. A simple use case of this technique is predicting monthly airline passenger sales.
The Holt-Winters method is generally a fast and simple choice whenever your data has trend, seasonality or other regular patterns, and there isn't enough data for techniques like ARIMA to work well.
ARIMA Models

Unlike smoothing, the forecasts generated by this method depend on the auto-correlations within the time series. An ARIMA model is built from three components:
AR model — Auto-Regression (order p)
I — Integrated, i.e. differencing (order d)
MA model — Moving Average (order q)
Thus an ARIMA model is described using these three orders (p, d, q). For the seasonal variation of the model, the same concepts are extended to the seasonal period, giving the seasonal orders (P, D, Q).
What is Differencing?
A time series is called stationary if its statistical properties (mean, variance, auto-correlation) don't depend on time. For example, random noise is stationary.
On the other hand, any series having a trend or seasonality is non-stationary, and its values are said to be auto-correlated. For example, if we consider the temperature values for a city, today's value is correlated in some way with yesterday's value, and so on.
We can remove auto-correlation and make a time series stationary by differencing, wherein we subtract the previous value of the series from the current value.
There are statistical tests (such as the Augmented Dickey-Fuller test) to determine the order d of differencing, i.e. how many times differencing needs to be applied to make the series stationary.
The ACF plot is used to visualize the auto-correlation between the values of a time series. The figure below shows auto-correlations for Google stock prices. The first plot shows that the auto-correlations for the original series are high and decrease gradually at longer lags. The second plot shows the auto-correlations after differencing, which are much smaller: the series is almost stationary. To figure out the order of differencing, we generally choose the lag value with the highest auto-correlation, which is 1 in the first plot and 7 in the second.
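The effect differencing has on auto-correlation can be sketched directly (synthetic trending data standing in for the stock prices):

```python
import numpy as np

def lag1_autocorr(x):
    """Correlation between the series and itself shifted by one step."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# A trending, non-stationary series (a random walk with drift)
rng = np.random.default_rng(1)
prices = np.cumsum(rng.normal(1.0, 0.5, 200))

diffed = np.diff(prices)  # first-order differencing (d = 1)

print(lag1_autocorr(prices))  # close to 1: strong auto-correlation
print(lag1_autocorr(diffed))  # much smaller: nearly stationary
```

This is the numerical counterpart of the two ACF plots: high, slowly decaying auto-correlation before differencing, and almost none after one pass of differencing.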
What is Auto-Regression?
Auto-regression is regression that predicts a future value using past values of the stationary time series itself. The current value of the series is treated as the target and the lagged past values as the regressors. The regression equation has the form y(t) = c + φ1·y(t−1) + … + φp·y(t−p) + ε(t), and the number of past values (lags) used is the order p of the model.
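An AR(1) model can be sketched as ordinary regression on the lagged series (pure NumPy, for illustration; real ARIMA libraries handle higher orders and estimation details):

```python
import numpy as np

# Simulate a stationary AR(1) process: y(t) = 0.7 * y(t-1) + noise
rng = np.random.default_rng(7)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# Regress the current value on the lagged value (order p = 1):
# least-squares fit of y(t) = c + phi * y(t-1)
phi, c = np.polyfit(y[:-1], y[1:], 1)
print(phi)  # estimated coefficient, close to the true 0.7

# One-step-ahead forecast using the fitted AR(1) equation
next_value = c + phi * y[-1]
```

Because the simulated process really is auto-regressive, the fitted coefficient recovers the true value 0.7 up to sampling noise.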
What is Moving Average?
In the moving average model, a regression-like model is built on the past forecast errors of the stationary time series instead of the values themselves. Each value y(t) in the series is expressed as a weighted average of past forecast errors: y(t) = c + ε(t) + θ1·ε(t−1) + … + θq·ε(t−q). The number of error terms included in the model determines the order q.
The following code snippet uses a basic auto-forecasting tool built on the ARIMA family of models to produce the bike-sharing forecasts seen above.
Lastly, an important pre-processing step that's often needed in time series forecasting is stabilizing the variance of the series, since most models don't handle increasing or decreasing variance in the data very well. This is done by applying a transformation such as Box-Cox or the inverse hyperbolic sine to the time series, building the forecast model on the transformed data, making predictions, and then inverse-transforming the predictions back to the original scale.
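A sketch of that pipeline with SciPy's Box-Cox transform (the "model" here is a trivial last-value forecast, purely to show the transform and inverse-transform steps):

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# A series whose variance grows over time (noise scales with the level)
rng = np.random.default_rng(5)
t = np.arange(1, 101)
series = t + t * rng.normal(0, 0.1, 100)

# 1. Transform to stabilize the variance
#    (the Box-Cox lambda is chosen by maximum likelihood)
transformed, lam = boxcox(series)

# 2. "Forecast" on the transformed scale
#    (naive last-value stand-in for a real model)
forecast_transformed = transformed[-1]

# 3. Inverse-transform the prediction back to the original scale
forecast = inv_boxcox(forecast_transformed, lam)
print(forecast)
```

The key point is step 3: predictions made on the transformed scale are meaningless until they are mapped back, so the inverse transform must use the same λ that was learned in step 1.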
By going through some widely used techniques and the internals of the forecasting process, this post should make consuming forecasts a little easier. There are, however, many details that have been skipped to keep the discussion short.
Further reading:
When Holt-Winters Is Better Than Machine Learning
Forecasting: Principles and Practice (online textbook)