Time Series Forecasting: A Deep Dive

By Anuj Saboo, Ankita Kundra, Rishabh Jain

Anuj Saboo
SFU Professional Computer Science
11 min read · Feb 4, 2020

--

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit this link.

John is a hotel manager and is given the task of forecasting the room bookings for the next season so that the hotel can plan staffing and inventory. He goes ahead and applies the following methods to estimate demand:

1. Average Method: This method assumes that what will happen tomorrow is the average of everything that has happened until now. John tries the method but soon discovers that it is flawed because hotel bookings are not constant throughout the year but have seasonal shifts, especially an increase in bookings during the holiday season.

2. Moving Average Method: Looking for a better approach, John thought that instead of considering all the historical bookings, taking only the last 12 months into account would yield better results. Though this improved the prediction by giving a better estimate of recent demand, it still fails to capture seasonal variations. Both baselines are sketched in code below.
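
To make John's two baselines concrete, here is a minimal sketch in pandas; the bookings series and its values are hypothetical, invented only for illustration:

import numpy as np
import pandas as pd

# Hypothetical monthly booking counts with a bump every December (holiday season)
months = pd.date_range('2017-01-01', periods=36, freq='MS')
bookings = pd.Series(100 + np.arange(36) * 2 + 20 * (months.month == 12), index=months)

# 1. Average method: the forecast is the mean of everything observed so far
average_forecast = bookings.mean()

# 2. Moving average method: the forecast is the mean of only the last 12 months
moving_average_forecast = bookings.rolling(window=12).mean().iloc[-1]

print(average_forecast, moving_average_forecast)

Neither number reacts to the December bump, which is exactly the problem John runs into.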

John fails at his task because he treats future bookings as nothing more than averages; he is unaware of the more sophisticated time series models that could make the forecast for him. The improved methods we discuss in this blog will help us better evaluate, learn from, and forecast time series.

Q. What is time series forecasting?

A time series is a set of observations taken at specified times, usually at equal intervals. It is used to predict future values based on previously observed values. Whether we wish to predict the trend in financial markets or forecast demand to maintain inventory, time is an important factor that must be considered in our models.

Q. Is forecasting the same as prediction?

So far, we have been using forecasting and prediction interchangeably. In the English language, prediction and forecasting often mean the same thing. However, in analytics they differ in the objective they are trying to achieve. Prediction refers to estimating a response variable from other independent variables. Forecasting, on the other hand, uses a given series of observations y₁, y₂,…, yᵢ to find future values. Hence, forecasting is always about time, whereas prediction may or may not be about time.

Types of Data

Data can be classified into two major groups based on its temporal nature:

  • Cross Sectional Data: Data is collected at a single point in time for one or more variables. Hence, it is not sequential, and data points are usually independent of one another. Methods such as regression, random forests, neural networks, etc. are applied to such datasets.
  • Time Series Data: Univariate or multivariate data observed sequentially across time at predetermined, equally spaced intervals (yearly/monthly/weekly, etc.). Hence, the ordering of data points matters for time series data.
    To forecast future values, we need each measurement of the data at regular time intervals; the common interval choices and their typical uses are summarized below:
Time Series Data Intervals and Usage

Components of Time Series

Let us consider the previous example of John where he was predicting demand based on previous bookings to make arrangements for the next season. The following components would help him understand the data behavior.

  1. Trend: Movement toward higher or lower values over a long period of time. It can be of three types:
  • Uptrend: If the hotel continues to increase its bookings year over year
  • Horizontal Trend: If the hotel reaches its capacity, bookings will trend sideways
  • Downtrend: If a competing hotel opens nearby, it could steal bookings, resulting in a sales decline

Trends in Hotel Sales over Time

2. Seasonality: Upward or downward swings that repeat within a fixed time period.

Seasonal Pattern for Bookings

Above, we see high bookings at the onset of winter, when people make reservations to take advantage of skiing and hiking. This pattern repeats year over year, and hence we can say our time series contains seasonality.

3. Cyclic Pattern: If the upward or downward fluctuations do not occur within a fixed time period, they are cyclic. The average length of a cycle is longer, and the magnitude of change larger, than for seasonal patterns. Cyclic patterns are tough to predict since they often stem from sudden events such as a stock market crash.

4. Irregularity: An event occurring over a short duration that is random and non-repeating, e.g. a natural disaster such as a flood in a town can cause a spike in medicine sales, which drops back after the disaster has receded, as can be seen in the plot below.

Irregularity in Time Series
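
These components can also be separated programmatically. Below is a minimal sketch using seasonal_decompose from statsmodels; it assumes series is a monthly pandas Series (for example, the Air Passengers data used later in this post):

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Split the series into trend, seasonal and residual (irregular) parts;
# period=12 assumes monthly data with a yearly seasonal cycle
decomposition = seasonal_decompose(series, model='additive', period=12)

decomposition.plot()
plt.show()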

Q. Can we apply time series to any time based dataset?

We cannot apply time series forecasting methods to a series that is white noise. White noise is purely random, with a mean of zero and a constant variance. Such a series contains no pattern, and hence forecasting its future values is not possible.
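
As a quick illustration (a sketch with simulated data, not from the original post), we can generate white noise and confirm that its autocorrelations are negligible at every non-zero lag:

import numpy as np
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

np.random.seed(42)
white_noise = np.random.normal(loc=0.0, scale=1.0, size=500)  # mean 0, constant variance

plot_acf(white_noise, lags=30)  # all bars beyond lag 0 should stay inside the confidence band
plt.show()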

Note that in order to apply the time series forecasting models such as AR, MA, ARMA and ARIMA, which we will discuss below, the prerequisite condition is that the time series should be stationary.

Stationarity: where change is the only constant

Stationarity is an important characteristic of time series and is a necessary prerequisite for many forecasting models. A time series is said to be stationary if it has a constant mean and variance, and its covariance is independent of time. Of course, the time series we encounter in real-world problems are often not stationary, but we can apply different transformations to make them stationary. There are different tests we can conduct to check for stationarity, some of which are:

  1. Trend Plots
    We can visually inspect a plot of the series over time to check for stationarity. In the example below, on the left the number of births remains fairly constant over time, whereas the air passenger traffic on the right increases steadily.
Trend Plots to check Stationarity

2. Rolling Statistics

We can plot the rolling mean and standard deviation to check whether they vary with time. If they remain roughly constant, the series meets the stationarity criterion; otherwise it does not.

Rolling Statistics plots to check Stationarity
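
A minimal sketch of this check with pandas, using the daily female births dataset that also appears in the ADF example below (the same file path is assumed there):

from pandas import read_csv
import matplotlib.pyplot as plt

series = read_csv('data/daily-total-female-births.csv', header=0, index_col=0, squeeze=True)

rolling_mean = series.rolling(window=30).mean()  # 30-day window for daily data
rolling_std = series.rolling(window=30).std()

plt.plot(series, color='blue', label='Original')
plt.plot(rolling_mean, color='red', label='Rolling mean')
plt.plot(rolling_std, color='black', label='Rolling std')
plt.legend(loc='best')
plt.title('Rolling Mean and Standard Deviation')
plt.show()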

3. Augmented Dickey-Fuller Test (ADF)

The above checks are visual and not rigorous enough for production use. A more reliable approach is the ADF test, a statistical test that we can run to determine whether a time series is stationary. The test defines the hypotheses as:

Null Hypothesis (H₀): the time series is non-stationary. We check whether the time series has a unit root, which implies that previous values of y provide no relevant information for predicting the change in y. If we fail to reject H₀, the time series is non-stationary.
Alternative Hypothesis (H₁): the time series does not have a unit root, so previous values of y do provide relevant information for predicting the change in y. If we reject H₀, the time series is stationary.

We conduct the ADF test and interpret the result using its p-value. A p-value below a threshold (such as 5% or 1%) suggests that we reject the null hypothesis, and a p-value above the threshold suggests that we fail to reject it.
Let us see the ADF test in action using the births and air passengers datasets.

from pandas import read_csv
from statsmodels.tsa.stattools import adfuller

series = read_csv('data/daily-total-female-births.csv', header=0, index_col=0, squeeze=True)
X = series.values
result = adfuller(X)  # returns the test statistic, p-value and critical values
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

Running the example prints the ADF test statistic of -4.808. The more negative this statistic, the more likely we are to reject the null hypothesis (and hence conclude that the dataset is stationary).

ADF Statistic: -4.808291
p-value: 0.000052
Critical Values:
1%: -3.449
5%: -2.870
10%: -2.571

From the output above, we can see that our statistic value of -4.808 is less than the value of -3.449 at 1%. This suggests that we can reject the null hypothesis with a significance level of less than 1% (i.e. a low probability that the result is a statistical fluke).

When executed for Airline Passengers data, it gives the following result. What are your conclusions on seeing the output below?

ADF Statistic: 0.815369
p-value: 0.991880
Critical Values:
1%: -3.482
5%: -2.884
10%: -2.579

The high p-value, together with the ADF statistic being greater than all of the critical values, indicates that the series is non-stationary.

Removing Non Stationarity

To remove non-stationarity, we can apply various transformation techniques such as log, square, square root, etc. This is demonstrated in the Jupyter notebook section Removing Non Stationarity; a minimal sketch is shown below.
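
The sketch below shows the kind of transformation used there: a log transform followed by first-order differencing. Here indexeddataset is assumed to be the Air Passengers data loaded as a pandas DataFrame indexed by month; the resulting names indexeddataset_log and datasetlog_shift are the ones used in the ARIMA snippet later on.

import numpy as np

# The log transform dampens the growing variance of the series
indexeddataset_log = np.log(indexeddataset)

# Subtracting the previous value (first-order differencing) removes the trend
datasetlog_shift = indexeddataset_log - indexeddataset_log.shift(1)
datasetlog_shift.dropna(inplace=True)

The ADF test can then be re-run on datasetlog_shift to confirm that the transformed series is stationary.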

Models

Since we have met the prerequisite condition of having a stationary dataset, let us move on to the forecasting models that can be applied.

Autoregressive (AR)

In an AR model, Y depends only on its own past values. This can be represented with the equation below, where e is the error term.

Q. How many past values to use for computing Y?

The AR model has a parameter p and is represented as AR(p). The value of p denotes how many past values are used: based on the formula below, we can set a value of p and decide how many past intervals will be considered.
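
In standard notation, with c a constant, \phi_1, …, \phi_p the model coefficients and e_t the white-noise error, the AR(p) model is:

Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + e_t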

Changing the parameters φ results in a different time series pattern. The variance of the error term e (white noise) only changes the scale of the series, not the pattern. The section on the PACF below explains how a good value of p can be found.

Moving Average (MA)

For every time interval, the observed value Y has an associated error term e. These error terms follow a white noise process, i.e. they have mean 0 and constant variance.

In a moving average model, Y depends only on the error terms.

Q. How many past values to use for computing Y?

The MA model has a parameter q and is represented as MA(q). Based on the formula below, we can set a value of q and decide how many past intervals will be considered.
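
In standard notation, with \mu the series mean, \theta_1, …, \theta_q the model coefficients and e_t the white-noise errors, the MA(q) model is:

Y_t = \mu + e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \dots + \theta_q e_{t-q}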

Autoregressive Moving Average (ARMA)

This is a mix of both AR and MA models, referred to as ARMA(p,q). Such a time series model depends on p of its own past values and on q past error terms.
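
In the standard notation used for the AR and MA equations above, the general ARMA(p,q) form is:

Y_t = c + \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + e_t + \theta_1 e_{t-1} + \dots + \theta_q e_{t-q}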

Autoregressive Integrated Moving Average (ARIMA)

An ARIMA model can be understood by outlining each of its components:

  • Autoregressive (AR): refers to a model in which a variable regresses on its own lagged (prior) values, as described above.
  • Integrated (I): refers to differencing the raw observations to make the series stationary, i.e. each value is replaced by the difference between the current and the previous value.
  • Moving Average (MA): incorporates the dependency between an observation and the residual errors from a moving average model applied to lagged observations.

Hence, an ARIMA model is described as an “ARIMA(p,d,q)” model, where:

  • p: the number of lag observations in the model; also known as the lag order.
  • d: the number of times that the raw observations are differenced; also known as the degree of differencing.
  • q: the size of the moving average window; also known as the order of the moving average.

Let us get into details of finding the p, d and q parameters for ARIMA models.

Autocorrelation Function (ACF, used to find q)

The coefficient of correlation between a value in a time series and the value k time periods before it is called the autocorrelation at lag k. A lag-1 autocorrelation is the correlation between values that are one time period apart; more generally, a lag-k autocorrelation is the correlation between values that are k time periods apart. For an MA(q) process, the ACF drops off after lag q, which is why it is used to choose q.

ACF Plot to find parameter q

The ACF plot gives us the value of q: a good choice is the lag where the plot first drops to 0. It can be observed from the graph on the left that it first touches 0 at x = 2.
Hence q = 2.

Integrated (find d)

Differencing is a transformation applied to time series data in order to make it stationary. To difference the data, we compute the difference between consecutive observations. Mathematically, this is shown as:

First order differencing: y'_t = y_t - y_{t-1}

Differencing removes changes in the level of a time series, reducing trend and seasonality and consequently stabilizing the mean of the time series.

Sometimes it may be necessary to difference the data a second time to obtain a stationary time series, which is referred to as second order differencing:

Second order differencing: y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2}

Partial Autocorrelation Function (PACF, used to find p)

A partial autocorrelation is the amount of correlation between a variable and a lag of itself that is not explained by correlations at all lower-order lags. For example, the lag-2 partial autocorrelation is the correlation between values two time periods apart after removing the effect of the intervening lag-1 correlation. For an AR(p) process, the PACF drops off after lag p, which is why it is used to choose p.

PACF Plot to find parameter p

The PACF plot gives us the value of p: a good choice is the lag where the plot first drops to 0. It can be observed from the graph on the left that it first touches 0 at x = 2.
Hence p = 2.
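
A minimal sketch of producing both plots with statsmodels; datasetlog_shift is assumed to be the stationary, log-differenced series from the transformation step above:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(datasetlog_shift, lags=20, ax=axes[0])   # read q where the ACF first drops to ~0
plot_pacf(datasetlog_shift, lags=20, ax=axes[1])  # read p where the PACF first drops to ~0
plt.show()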

Forecasting using ARIMA

Having satisfied the prerequisite conditions, we are now in a position to fit an ARIMA model and evaluate the results. In the example provided in the Jupyter notebook, we perform time series forecasting on the Air Passengers dataset. The values of p, d and q are found to be 2, 1 and 2 respectively, and the ARIMA model is fit using the code snippet below.

import matplotlib.pyplot as plt
from statsmodels.tsa.arima_model import ARIMA

# order is (p, d, q); with (2, 1, 0), q = 0 and the model reduces to an AR model (with differencing)
model = ARIMA(indexeddataset_log, order=(2, 1, 2))
results_AR = model.fit(disp=-1)

plt.plot(datasetlog_shift)
plt.plot(results_AR.fittedvalues, color='red')
plt.title("RSS: %.4f" % sum((results_AR.fittedvalues - datasetlog_shift["#Passengers"])**2))

After fitting the ARIMA model, we re-transform the fitted values to undo the transformations that were applied to make the data stationary; a sketch of this step is shown after the plot below, and the details can be found in the Jupyter notebook. The final results of the forecast are:

Forecast Air Passenger Traffic using ARIMA
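
The re-transformation follows the usual pattern of reversing the differencing and then the log transform. A minimal sketch, assuming the variable names from the snippets above (indexeddataset_log as a DataFrame with a "#Passengers" column and results_AR as the fitted model):

import numpy as np
import pandas as pd

# Fitted values are on the differenced log scale; undo the differencing with a cumulative sum
log_series = indexeddataset_log['#Passengers']
predictions_diff = pd.Series(results_AR.fittedvalues, copy=True)
predictions_log = pd.Series(log_series.iloc[0], index=log_series.index)
predictions_log = predictions_log.add(predictions_diff.cumsum(), fill_value=0)

# Undo the log transform to return to the original passenger scale
predictions = np.exp(predictions_log)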

