ARIMA models for Time Series Analysis and Forecasting

Colton Barger
Published in CodeX · Apr 18, 2022

Summary

ARIMA models are time series models often used in economics and finance to forecast the future. Here I talk a bit about ARIMA models in general and present an example of forecasting future population gains of the US using ARIMA in Python.

The Jupyter notebook in which I perform the following analyses can be found on my GitHub here.

What is ARIMA, Exactly?

ARIMA is an acronym describing a method that combines statistical techniques to model and predict time series data. The acronym stands for AutoRegressive Integrated Moving Average. It combines the techniques used in both AutoRegression and Moving Averages into a generalized model.

  • AutoRegression is a process that takes advantage of a series being correlated with itself. In short, for time series data we typically work with a single variable, whereas in linear regression we establish a relationship between two variables. AutoRegression uses past values of a series to predict future values. The number of past values we use is denoted p.
  • Integrated refers to the technique of differencing the data to make the time series stationary. To difference the data we find the difference between the current value and the value at the time before it. The number of times we need to difference the dataset to make it stationary is denoted d.
  • Moving Average is a model that takes into account a dataset’s mean and also current and past error terms to predict the future. The number of past error terms we use in an MA model is denoted q.

These three methods combine to form our ARIMA(p,d,q) model. I generally find the best way to learn something is by example (and then doing), so let’s look at an example.

The data we are looking at comes from the US Census Bureau on Kaggle found here. Loading the data into a pandas DataFrame and setting the date as the index it looks like this:

The first five rows of the population dataset.
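A minimal loading sketch is below. Since I can't reproduce the Kaggle file here, a small in-memory stand-in takes its place, and the column names (`DATE`, `value`) are assumptions; the real file runs monthly from January 1952 to December 2019.

```python
import io
import pandas as pd

# Stand-in for the Kaggle CSV; column names are assumed, not taken
# from the real file.
csv = io.StringIO(
    "DATE,value\n"
    "1952-01-01,175\n"
    "1952-02-01,178\n"
    "1952-03-01,180\n"
)

# Parse the date column and use it as the DataFrame index.
df = pd.read_csv(csv, parse_dates=["DATE"], index_col="DATE")
print(df.head())
```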

To be honest, it's not quite clear what the numbers mean from the description on Kaggle. I suspect they represent either immigration to the US during the given months or general population growth/births. Either way, population data lends itself well to time series modeling. The numbers start in January 1952 and end in December 2019, with one value per month, giving us a total of 816 rows of data to work with. A graph of the data over time looks like this.

Graph of the population data over time.

Stationarity

Looking at the graph, there is a clear upward trend over time. However, for an ARIMA model we need our data to have something called stationarity, which was mentioned above. In order for a dataset to be considered stationary, it needs to have a constant mean and standard deviation over time. From the above graph, our data obviously does not satisfy this condition, since it only increases over time.

One needs to apply transformations to the data in order for it to become stationary. Usually this is done by differencing the data. To difference the data we only need to find the difference between the value at the current time period and the value at the previous time period. This concept can be extended to find the second difference, third difference, etc. Applying the differencing method to the population data we find our data now looks like this:

First difference of the population data.
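In pandas, differencing is a one-liner via `Series.diff`. A small sketch on made-up numbers, showing the first and second differences:

```python
import pandas as pd

s = pd.Series([100.0, 103.0, 108.0, 110.0, 115.0])

first_diff = s.diff().dropna()          # value[t] - value[t-1]
second_diff = s.diff().diff().dropna()  # difference of the differences

print(first_diff.tolist())   # [3.0, 5.0, 2.0, 5.0]
print(second_diff.tolist())  # [2.0, -3.0, 3.0]
```

The `dropna()` calls discard the leading NaN(s) that differencing produces, since there is no earlier value to subtract.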

One could argue that this data is significantly more stationary than what we had before. There is a more apparent mean, most likely around 200, although it is not constant, and the standard deviation also seems smaller than that of the non-differenced data. But how can we tell whether our data is differenced enough to use in an ARIMA model? One way is the Augmented Dickey-Fuller (ADF) test, developed by statisticians David Dickey and Wayne Fuller in 1979. In simple terms, the ADF test has a null hypothesis that the data being tested is non-stationary, meaning that if we can reject the null hypothesis, our data has the necessary stationarity.

The Python statsmodels library has a built-in ADF test (adfuller), which I use on both the non-differenced and differenced data. Performing the ADF test on the non-differenced data yields a test statistic of -0.665 and a p-value of 0.855. At 95% confidence we need our p-value to be less than 0.05 to reject the null hypothesis. Thus we get the expected result that the original, unmodified data is non-stationary.

Performing the ADF test on the differenced data yields a test statistic of -2.01 and a p-value of 0.282. The p-value is still not low enough for us to reject the null hypothesis, although it is significantly lower than that of the non-differenced data. This suggests the data has some semblance of stationarity. Let's difference the data a second time in the hope of getting a more stationary dataset to work with. Differencing a second time results in our data looking like this:

Second difference of the population data.

The graph above is more indicative of a stationary dataset. It has an obvious mean of 0 and no clear trends beyond some spiking up and down, suggesting constant variance throughout. Furthermore, performing the ADF test on this transformed data yields a p-value that is near zero (2.11e-10). With a p-value less than 0.05 we have enough statistical evidence to conclude our data is now stationary. Based on this finding we might consider using a value of d=2 in our ARIMA(p,d,q) model, since we differenced the data twice to make it stationary. To find values of p and q we need to look at the ACF and PACF plots for our data.

ACF and PACF Plots

The ‘p’ term in the ARIMA(p,d,q) model tells us how many lagged time periods we are going to use in our time series data. To find p we look at the plot of the AutoCorrelation Function (ACF). The ACF plot tells us how correlated the values in a time series are with each other: the y-axis shows the correlation coefficient and the x-axis shows the number of lag periods. We have a population measure for each month in our data, so one lag is equal to one month in the past.

ACF Plots

Let’s look at the ACF plots of both the first- and second-differenced population data. The statsmodels package in Python has a built-in function to do this:

ACF plots for the first- and second-differenced data.

To clarify, a lag of 0 will always have a correlation of 1 on any ACF plot since the data points will always perfectly correlate with themselves.

Looking at the first-differenced ACF plot we can detect some seasonality in the population time series: the plot resembles a sinusoidal graph with another peak after every 12 periods of lag. This plot suggests that any value of p up to 25 could work, since the data correlates with itself at various lags, but we need more information to narrow down exactly which value of p to move forward with.

The second-differenced ACF plot suggests to me that our data is over-differenced. I say this because at a lag of 1 we immediately have a negative ACF value. In this context a negative value on the ACF plot means that if the population were increasing at one point in time, it is highly likely to be decreasing at the next. Knowing the nature of our dataset, this does not make sense, since our population is always increasing. These ACF plots therefore suggest using a value of d=1, rather than the d=2 we established before.

We are stuck in a middle ground between the first and second differences. The first-differenced data isn't fully stationary and displays some seasonality, while the second-differenced data is over-differenced. When making predictions, we would rather use data that is slightly under-differenced than data that is over-differenced.

PACF Plots

The Partial AutoCorrelation Function (PACF) plot displays the correlation between a variable and its lagged values after controlling for the effects of other lagged variables. It captures the correlation between a series and its lagged values that isn't already captured by earlier lags.

This might help us narrow down what value of p to use in our model since we concluded that a lot of values could work from the first difference ACF plot.

PACF plots for the first- and second-differenced data.

From the above PACF plot for the first-differenced data we can say that only the first lagged value in the series has a significant correlation. This means that the previous month's population has the most significant impact in predicting the current month's population. Since the other lags have much lower correlation values, the high values in the ACF plot can be explained by the first lagged value alone. Thus we can try a value of p=1 for our ARIMA(p,d,q) model.

Finding ‘q’

Finding the best value for q involves looking at the points on the ACF and PACF plots that are the first to fall outside the significant range. Based on the PACF plot for the first-differenced data we should try a value of q=1, since that seems to be the last significantly correlated lag. However, building the model is easy enough that we might as well try both q=1 and q=2.

Building the Models

Our findings suggest we try the following models: ARIMA(1,1,1) and ARIMA(1,1,2).

I am going to train the models on the first-differenced dataset, so in training I set d=0, since the data will already be differenced when I feed it into the statsmodels API. Shown below is how the models fit the first-differenced data.

Fits of the ARIMA(1,1,1) and ARIMA(1,1,2) models.

To see how the two methods compare we can take a look at their error densities:

Error densities of the models.

The density plots of the errors are very similar. The ARIMA(1,1,2) model might be slightly better centered on 0, as the ARIMA(1,1,1) model looks like it has a bias toward predicting slightly higher than it should, but I do not think the difference matters much. I will arbitrarily use the ARIMA(1,1,1) model to make future predictions.

Making Predictions

I use the first 80% of the data as a training set for a new ARIMA(1,1,1) model and the latter 20% as a test set for the predictions. The results can be seen in the plot below.

Future forecast of the population data.

The dashed black line shows the predicted population values, the blue line represents the real population values, and the gray area is a 95% confidence interval around the predictions. On inspection the model does not perform poorly at all: it captures the general trend, and the actual population falls well within the prediction's confidence interval.

Originally published at https://cbarger.com on April 18, 2022.

