Time Series Modelling — ARIMA

Souvik Majumder · Published in Analytics Vidhya · 7 min read · May 3, 2020

This article describes a complete end-to-end process of forecasting on a time series data using ARIMA model.

ARIMA (Auto Regressive Integrated Moving Average) is a class of models that explains a given time series using its own past values, that is, its own lags and the lagged forecast errors, so that the fitted equation can be used to forecast future values.

AR — Auto Regression.
A model that uses the dependent relationship between an observation and some number of lagged observations. This is essentially a linear regression of the time series onto its own past. A parameter k represents the maximum lag. It is also known as the lag order (written p in the standard ARIMA(p, d, q) notation).

The output depends only on its own lags and therefore, Yt is a function of the lags of Yt.

I — Integrated.
The use of differencing the observations. In other words, subtracting an observation from an observation at the previous time step. A parameter d represents the number of differences required to make the time-series stationary.

MA — Moving Average.
Rather than simply averaging all past observations, this model expresses the current observation as the series mean plus a weighted combination of recent forecast errors. In other words, the moving average model corrects future forecasts based on the errors made on recent forecasts.

The MA process of order q is defined as,

Yt = c + ϵt + θ1 ϵt-1 + θ2 ϵt-2 + … + θq ϵt-q

where,

  • c = mean of the series
  • ϵt = the forecast error (random shock) at time t. In other words, it is the error for that time period.
  • t-1, t-2, … t-q denote lags 1, 2, … q respectively.
  • θ1, θ2, … θq denote the weights applied to the past errors. For this example, let us consider only order 1.

So, for order 1, the MA equation becomes,

Yt = c + ϵt + θ1 ϵt-1

A parameter q represents the maximum lag after which other lags are not significant. It is also called the order of Moving Average or the Window size.
We apply Auto-Correlation Function plot (ACF) in order to find out the value of q.

Let’s start programming

For this article, I decided to choose the Foreign Exchange Rate dataset.

We import the dataset in our Python code and plot the values to check whether the series is really non-stationary.

But first, we need to ensure that the data is converted into a time series.

Now we plot the data to visualize the stationarity manually.
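The loading, conversion and plotting steps can be sketched as below. The column names are assumptions, and a tiny inline CSV stands in for the actual FX-rate file, which is not reproduced here:

```python
import io

import pandas as pd
import matplotlib
matplotlib.use("Agg")            # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Tiny inline stand-in for the FX-rate CSV (column names are assumptions).
csv_data = io.StringIO(
    "DATE,RATE\n"
    "1979-12-31,1.70\n"
    "1980-01-01,1.72\n"
    "1980-01-02,1.71\n"
    "1980-01-03,1.75\n"
)

# parse_dates + index_col turn the raw frame into a proper time series:
# a float column indexed by a DatetimeIndex.
df = pd.read_csv(csv_data, parse_dates=["DATE"], index_col="DATE")
ts = df["RATE"]

ts.plot(title="Daily exchange rate")
plt.savefig("daily_rate.png")
```

With a real file, you would pass its path to `read_csv` instead of the `StringIO` buffer; everything else stays the same.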

The above clearly shows that the time-series data is not stationary, especially due to the presence of an increasing trend in the first half.

Resampling the Data

There are two types of Resampling:

  • Up-sampling — where we increase the frequency of the samples, such as from minutes to seconds.
  • Down-sampling — where we decrease the frequency of the samples, such as from days to months.

In our scenario, the original data is daily. So, we down-sample it to monthly data.
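A minimal down-sampling sketch, with synthetic daily values standing in for the FX series ("MS" is pandas' month-start frequency):

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the FX series.
idx = pd.date_range("1980-01-01", periods=120, freq="D")
ts_daily = pd.Series(np.linspace(1.0, 2.0, num=120), index=idx)

# Down-sample daily -> monthly: "MS" buckets by month start,
# .mean() aggregates the days inside each month.
ts_monthly = ts_daily.resample("MS").mean()
print(len(ts_monthly))
```

The aggregation function is a modelling choice: `.mean()` smooths within the month, while `.last()` would keep each month's closing rate.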

Now, before we proceed to modelling, we need to make the data stationary. To achieve that, we apply differencing, i.e. subtracting the lagged series from itself, as described in my previous article.

We keep differencing and re-checking stationarity until the p-value of the stationarity test drops below 0.05.

The above results depict that the p-value has dropped below 0.05, so the time-series data has now become reasonably stationary.

So, ts_monthly_log_diff becomes our series for modelling.

Finding out k-value

We apply Partial Auto-Correlation Function plot (PACF) in order to find out the value of k.

Why not Auto-Correlation Function (ACF) ?

We cannot use the ACF plot to pick the AR order because it shows strong correlations even for lags far in the past, since each lag's effect propagates forward through the intermediate lags, and this can eventually lead to multicollinearity issues. The PACF plot avoids this by removing the components already explained by earlier lags, so we only see the lags that correlate with the residual, i.e. the component not explained by past lags.

From the above plot, it is evident that the PACF bars drop inside the confidence band after lag 2. Therefore, the k-value becomes 2.

Finding out q-value

Order q is obtained from the ACF plot. This is the lag after which ACF crosses the upper confidence interval for the first time.

From the above plot, it is evident that the ACF graph cuts the upper confidence interval (blue region) at q value = 1.

Therefore, the q-value becomes 1.

Training the ARIMA model

Now we’re ready to build our first model. We have the following:

  • k = 2
  • d = 1
  • q = 1

Residual

We perform a Kernel Density Estimation (KDE) plot of the residuals. It is close to a normal distribution centred at zero, which suggests the model has captured the structure in the data and its predictions can be trusted.
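An illustrative version of that check, with synthetic Gaussian residuals standing in for the fitted model's `result.resid`:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Stand-in residuals; in practice use result.resid from the fitted model.
rng = np.random.default_rng(4)
residuals = pd.Series(rng.normal(0.0, 0.05, 200))

# A bell-shaped KDE centred at zero suggests the model left only noise.
residuals.plot(kind="kde", title="Residual KDE")
plt.savefig("residual_kde.png")
print(round(float(residuals.mean()), 3))
```

A skewed or clearly off-centre density would instead hint at a systematic bias the model has not captured.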

Let’s make some Predictions

The above figure shows that the predictions have a gap in the data. If we look at the data carefully, the prediction starts from January 1980, whereas in the original time series, we have data starting from December 1979.

This happened due to the differencing we did earlier, which left a blank or null value at the first time step.

So, we perform a cumulative sum of the shifted time series. We add the first value of the original time series to the cumulative sum.

This would eventually take us to the original structure of the data.
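The inversion can be sketched on a toy series: cumulative-sum the differences, then add back the first original value.

```python
import numpy as np
import pandas as pd

# Toy (log) series and its first difference; diff() drops the first point.
idx = pd.date_range("1979-12-01", periods=6, freq="MS")
ts_log = pd.Series([1.0, 1.1, 1.05, 1.2, 1.25, 1.3], index=idx)
ts_log_diff = ts_log.diff().dropna()

# Undo the differencing: running sum of the diffs + the first value,
# then re-attach that first observation at the front.
recovered = ts_log_diff.cumsum() + ts_log.iloc[0]
recovered = pd.concat([ts_log.iloc[:1], recovered])

print(np.allclose(recovered.values, ts_log.values))
```

Applied to the predicted differences instead of the true ones, the same two lines turn the model's output back into the original scale of the series.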

We now try to plot the graph containing both the original time series and predicted time series.

We also split the original time series into training and testing sets and repeat the prediction process on the test period, comparing the predicted output with the actual or expected output.

Evaluating Performance

Since time-series forecasting falls under the category of regression, given that the model predicts a continuous value via auto regression, I chose the MSE, or Mean Squared Error, as the evaluation metric.

The Mean Squared Error is 0.0001, which is quite low, so the model is good for now.
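The metric itself is just the mean of the squared errors, computable with a numpy one-liner (the values below are made up for illustration):

```python
import numpy as np

# MSE = average of squared differences between actual and predicted values.
actual = np.array([1.10, 1.05, 1.20, 1.25])
predicted = np.array([1.09, 1.07, 1.18, 1.26])

mse = np.mean((actual - predicted) ** 2)
print(round(float(mse), 5))  # 0.00025
```

`sklearn.metrics.mean_squared_error(actual, predicted)` gives the same number if scikit-learn is already a dependency.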

Summary

Let us summarize the overall steps that we performed for this experiment.

  1. Convert the imported dataset into a time-series one.
  2. Perform a down-sampling or up-sampling, based on the frequency of the current data.
  3. Perform Stationarity check and keep on lagging/differencing until p-value reduces down to a value less than 0.05.
  4. Plot PACF and ACF graphs in order to find out k-value and q-value respectively.
  5. Train an ARIMA model on the prepared time series, providing the k, d and q values as parameters.
  6. Perform a cumulative sum on the predicted time-series in order to bring it to the original time-series format.
  7. Compare the recovered predictions with the actual series and evaluate the performance using MSE.

That’s all. Hope you really liked the article.
