Python Code on ARIMA Forecasting

Etqad Khan
Jan 31 · 4 min read

Grab your coffee, we're going to code.

ARIMA is a forecasting technique that uses the past values of a series to forecast the values to come. A basic intuition about the algorithm can be developed by going through the blog post I wrote as Part 1 of my ARIMA exploration.

The series I am using can be downloaded from here: https://drive.google.com/file/d/1W8K92lQ00Zt6J7qJnLKH4yp7MddIVBsR/view?usp=sharing

The first step is to check the data for stationarity, using the Augmented Dickey-Fuller (ADF) test. The null hypothesis of this test is that the time series is non-stationary. So, if the p-value is less than 0.05, we reject the null hypothesis and treat the series as stationary.

Let's start by importing the library modules.

import pandas as pd
from statsmodels.tsa.stattools import adfuller
import matplotlib.pyplot as plt

Read the CSV and plot it to see what the trend looks like.

df = pd.read_csv(r'''shampoo_dataset.csv''')
plt.plot(df.Month, df.Sales)
plt.xticks(rotation=90)

Let's look at the result of the ADF test to see what it says about stationarity.

result = adfuller(df.Sales.dropna())
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])

Since the p-value is well above 0.05, we cannot reject the null hypothesis, so we will difference the series to attain stationarity.

Finding Difference Order

We will plot the autocorrelation function (ACF) to understand the trends. ACF is the correlation between the observation at the current time step and the observations at previous time steps. If the ACF plot shows positive values for a significant number of lags, the series needs further differencing. On the other hand, if the autocorrelation is already negative at the first lag, we might have over-differenced it. So, let us plot the ACF for the original and the differenced series.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(3, 2, sharex=True)
axes[0, 0].plot(df.Sales); axes[0, 0].set_title('Original Series')
plot_acf(df.Sales, ax=axes[0, 1])
# 1st Differencing
axes[1, 0].plot(df.Sales.diff()); axes[1, 0].set_title('1st Order Differencing')
plot_acf(df.Sales.diff().dropna(), ax=axes[1, 1])
# 2nd Differencing
axes[2, 0].plot(df.Sales.diff().diff()); axes[2, 0].set_title('2nd Order Differencing')
plot_acf(df.Sales.diff().diff().dropna(), ax=axes[2, 1])
plt.show()

We can see that the differencing of order 1 is helping us make the series stationary, so let's choose d = 1.
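The rule of thumb used above can be checked on synthetic data. The sketch below computes the sample ACF by hand with NumPy; `sample_acf`, the seed, and both toy series are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation of x for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

rng = np.random.default_rng(1)                     # illustrative seed
trending = np.arange(100) + rng.normal(size=100)   # strong trend: needs differencing
noise = rng.normal(size=500)                       # already stationary
over_diff = np.diff(noise)                         # differencing it anyway over-differences

acf_trend = sample_acf(trending, 5)
acf_over = sample_acf(over_diff, 5)
print('trend ACF, lags 1-5:', np.round(acf_trend[1:], 2))
print('over-differenced ACF, lag 1:', round(float(acf_over[1]), 2))
```

The trending series keeps a positive ACF across the lags, while the over-differenced series turns negative at lag 1.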

Finding AR Term

We will look at the partial autocorrelation function (PACF) plot to choose the AR term. Partial autocorrelation is the correlation between two time steps in a series after removing the influence of the time steps in between. For example, today's weather depends on yesterday's, and yesterday's weather depends on the day before. So, the PACF at lag 2 would be the correlation between today and the day before yesterday after removing the influence of yesterday.

fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(df.Sales.diff()); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0,5))
plot_pacf(df.Sales.diff().dropna(), ax=axes[1])
plt.show()

As we can see, the PACF drops below the significance limit almost immediately, so let us go ahead and set p = 1 for the sake of simplicity.
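To make the "cut-off" idea concrete, here is a small sketch on a simulated AR(1) series rather than the shampoo data. It uses the standard identity that the lag-2 partial autocorrelation equals (r2 - r1²) / (1 - r1²), where r1 and r2 are the lag-1 and lag-2 autocorrelations. The coefficient `phi`, the seed, and the helper `autocorr` are hypothetical choices.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a single lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.sum(x[lag:] * x[:len(x) - lag]) / np.sum(x * x)

# Simulate an AR(1) process: each value depends only on the previous one.
rng = np.random.default_rng(2)   # illustrative seed
phi = 0.7                        # hypothetical AR coefficient
x = np.zeros(2000)
for t in range(1, len(x)):
    x[t] = phi * x[t - 1] + rng.normal()

r1, r2 = autocorr(x, 1), autocorr(x, 2)
pacf1 = r1                               # at lag 1, PACF equals ACF
pacf2 = (r2 - r1 ** 2) / (1 - r1 ** 2)   # lag-2 correlation with lag 1 removed

print('ACF  lag 1, lag 2: %.2f, %.2f' % (r1, r2))
print('PACF lag 1, lag 2: %.2f, %.2f' % (pacf1, pacf2))
```

For an AR(1) series the ACF at lag 2 stays sizeable, but the PACF at lag 2 falls close to zero. That drop after lag 1 is the pattern that suggests p = 1.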

Finding MA Term

We will revisit the ACF plot to find the MA term. The ACF tells us how many MA terms are needed to remove any remaining autocorrelation in the differenced series.

fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(df.Sales.diff()); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0,1.2))
plot_acf(df.Sales.diff().dropna(), ax=axes[1])

plt.show()

Lag 1 is above the significance limit, but lag 2 is fine. Let us set the MA term to q = 2.
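The "significance limit" in these plots is, to a common approximation, ±1.96/√N for a series of N observations. Here is a quick sketch of the arithmetic, assuming the shampoo series has 36 monthly observations (the post does not state the length, so treat `n_obs` as an assumption).

```python
import math

n_obs = 36               # assumed length of the series (not stated in the post)
n_diff = n_obs - 1       # first differencing drops one observation
limit = 1.96 / math.sqrt(n_diff)   # approximate 95% band for white noise

print('approximate significance limit: +/- %.3f' % limit)
```

With so few observations the band is wide, so only fairly large autocorrelations stand out as significant.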

Model Building

Let us build the model and see how well the chosen values work.

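# Note: statsmodels.tsa.arima_model was removed in statsmodels 0.13, so this code needs an older release
from statsmodels.tsa.arima_model import ARIMA
# ARIMA order (p,d,q)
model = ARIMA(df.Sales, order=(1,1,2))
model_fit = model.fit(disp=0)  # disp=0 suppresses the convergence output
print(model_fit.summary())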

The model can be improved further with more tuning, but note that the series is short, which limits how accurate the results can be.
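For reference, order=(1,1,2) corresponds to the following model in one common textbook notation, where y'_t is the first-differenced series. Sign conventions for the MA terms vary between references, so treat the symbols as assumptions rather than the exact parameterisation printed in the summary.

```latex
y'_t = y_t - y_{t-1}, \qquad
y'_t = c + \phi_1 y'_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}
```

Here p = 1 gives the single \phi_1 term, d = 1 gives the single round of differencing, and q = 2 gives the two \theta terms.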

Let us forecast and also check the accuracy.

model_fit.plot_predict(dynamic=False)
plt.show()

The results aren't satisfactory, but it's good to get an idea of how ARIMA works. Let's do a quick accuracy check to see how well the model can forecast future values.

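import numpy as np
# Assumed split (not shown in the original): hold out the last 6 points to match forecast(6)
train = df.Sales[:-6]
test = df.Sales[-6:]
model = ARIMA(train, order=(1, 1, 2))
fitted = model.fit(disp=-1)
# Forecast 6 steps ahead with a 95% confidence interval
fc, se, conf = fitted.forecast(6, alpha=0.05)
mape = np.mean(np.abs(fc - test)/np.abs(test)) # MAPE as a fraction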

The MAPE is 17.99%, which means the model's accuracy is about 82%.
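As a sanity check on how MAPE turns into an accuracy figure, here is a tiny worked sketch. The `actual` and `forecast` values are made up for illustration and are not taken from the shampoo series.

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0])     # hypothetical observed values
forecast = np.array([110.0, 180.0, 400.0])   # hypothetical forecasts

mape = np.mean(np.abs(forecast - actual) / np.abs(actual))  # as a fraction
mape_pct = 100 * mape            # as a percentage
accuracy_pct = 100 - mape_pct    # "accuracy" in the sense used above

print('MAPE: %.2f%%' % mape_pct)
print('Accuracy: %.2f%%' % accuracy_pct)
```

Note that the formula returns a fraction, so it has to be multiplied by 100 before being read as a percentage.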

I hope this tutorial gives a fair idea of how to use ARIMA. We can work our way into the algorithm using the code presented here. I would choose a better dataset next time. Thanks for reading, much appreciated!

Analytics Vidhya
