Introduction to Time Series: Part 2.1 ARIMA Models Forecasting

Published in

Time Series Using Python

10 min readApr 22, 2024

Recap :

Part 1: https://medium.com/time-series-using-python/introduction-to-time-series-analysis-part-1-3ed1ce9cac98

Part 1.1: https://medium.com/time-series-using-python/introduction-to-time-series-analysis-part-1-1-stationarity-0e239c9cc039

Part 2: https://medium.com/time-series-using-python/introduction-to-time-series-part-2-forecasting-ba0cdb9107a1

In Part 2 of the Introduction to Time-Series, we worked on :

Explanation of the Auto-Regressive Models as a type of time series model that uses the past values of a variable to predict its future values.
We performed tests on original and differenced time series to determine stationarity of the training data
We then demonstrated how to train and then fit AR(1) model. It was shown that AR(1) model was able to capture only the linear relationship (TREND) to some degree.
We then checked the performance of the model across different lag orders. We found that we can get best results at lag = 10. We chose lag = 10 and then use it to make a prediction. We saw that the model was able to make satisfactory prediction. Please note that we have till now not defined any metric to gauge the accuracy of the forecasts.

In this article, we will extend the discussion to ARIMA models. We will demonstrate the limitations of the AR model and explain the need of Moving Averages in Auto-Regressive Models. We will also see how we can measure the accuracy using mean-squared errors.

Need of MA in AR models

Limitations of AR models

The AR model assumes that the current value depends linearly on the past values (lags). Here are some questions that come to my mind :

What if the current value does not depend linearly on the past values?
What if the number of lags we have choosen (or we can afford to choose without making the model over complex) is not able to capture the complex dependencies on the past data?
what if the autocorrelation structure of the series is complex, such as having both short and long memory components or periodic autocorrelations that the AR part alone cannot capture.

We will examine some of these complexities by creating a time series as follows:

Have a low number of observations in the time series i.e. n = 200
Have a time series with a non-linear trend and a seasonal component with a large period
Add some Normally distributed noise to the data

from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Number of observations
n = 200

# Time index
time = np.linspace(0, 500, n)

# Generate a time series with a linear trend and a seasonal component
np.random.seed(42)
trend = time*time * 0.001 # Non- Linear trend
seasonality = 10 * np.sin(2 * np.pi * time / 50)  # Seasonal component with a period of 50
noise = np.random.normal(0, 2, n)  # Random noise

# Combine components to form the time series
time_series = pd.Series(index = time, data = trend + seasonality + noise)

# Convert to pandas series and plot
plt.figure(figsize = (15, 6))
time_series.plot(title='Time Series with Trend and Seasonality')
plt.show()

We will now proceed with testing the wave for stationarity, followed by finding the best lag to use for predictions

# Step 1: Test for stationarity
# ADF Statistic is the output of the Augmented Dickey-Fuller test which tests for stationarity.
# p-value indicates the probability that the series has a unit root and is non-stationary.
# Critical values show the thresholds for the ADF statistic at which the null hypothesis can be rejected.
import warnings
from statsmodels.tsa.stattools import adfuller, kpss

warnings.filterwarnings('ignore')

def adf_test(time_series, lag = None, verbose = True):
  if lag :
    result    = adfuller(time_series.diff(lag).dropna())
  else:
    result    = adfuller(time_series)
    lag = 0

  adf_stats = result[0]
  p_value   = result[1]
  if verbose :
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))

  if adf_stats < result[4]["5%"] :
    print (f"The time-series at {lag} is stationary")

  else:
      print (f"The time-series is {lag} non-stationary")

adf_test(time_series)

# Output:
'''
ADF Statistic: 7.498642
p-value: 1.000000
Critical Values:
 1%: -3.466
 5%: -2.877
 10%: -2.575
The time-series is 0 non-stationary
'''

# We will difference the time-series till lag = 20 (remember we have only 200 data points, so lag = 20 is also over complex model)
for lag in range(1, 20):
  adf_test(time_series, lag, verbose = False)

# Output :
'''
The time-series is 1 non-stationary
The time-series is 2 non-stationary
The time-series is 3 non-stationary
The time-series is 4 non-stationary
The time-series is 5 non-stationary
The time-series is 6 non-stationary
The time-series is 7 non-stationary
The time-series is 8 non-stationary
The time-series is 9 non-stationary
The time-series is 10 non-stationary
The time-series is 11 non-stationary
The time-series is 12 non-stationary
The time-series is 13 non-stationary
The time-series is 14 non-stationary
The time-series is 15 non-stationary
The time-series is 16 non-stationary
The time-series is 17 non-stationary
The time-series is 18 non-stationary
The time-series is 19 non-stationary
'''

From the above test results, it can be seen that the differenced time series (upto lag order = 20) remains non-stationary. This is because the time-series has a non-linear trend and has a seasonality.

# Since We are not able to make the time-series stationary within 5% confidence level,
# we will still try to model the time-series as an AR model at different lags and pick the best performing one
# Plot the results
from statsmodels.tsa.ar_model import AutoReg

test_lags =  np.arange(1, 50)
test_aic = []
# Loop over possible lags from 0 to 20
for lag in test_lags:  # including 0 to 20
    model     = AutoReg(time_series.diff(1).dropna(), lags=lag)
    model_fit = model.fit()
    aic       = model_fit.aic
    test_aic.append(aic)

plt.plot(test_lags, test_aic)
plt.xlabel("Lag Order")
plt.ylabel("AIC Values")
plt.grid()

Let’s use lag = 20 to make a prediction.

We will evaluate Root Mean Square Error as a measure of the model’s accuracy

import numpy as np

def calculate_rmse(actual, predictions):
    """
    Calculate the Root Mean Squared Error between actual values and predictions.

    Parameters:
    actual (array-like): The actual values of the data.
    predictions (array-like): The predicted values generated by the model.

    Returns:
    float: The calculated RMSE.
    """
    actual = np.array(actual)
    predictions = np.array(predictions)
    rmse = np.sqrt(np.mean((actual - predictions) ** 2))
    return rmse

# Splitting data into training and testing

split = int(0.8 * len(time_series))
train_series = time_series.iloc[:split]
test_series = time_series.iloc[split:]

# Fit AR model - assume an AR(10) to try capturing seasonality naively
ar_model = AutoReg(train_series.dropna(), lags = 20)
ar_result = ar_model.fit()

# Predict on testing data
ar_predictions = ar_result.predict(start=len(train_series), end=len(train_series)+len(test_series)-1, dynamic=False)

# Calculate RMSE for both models
print("The RMSE in predictions is ", calculate_rmse(test_series, ar_predictions))

# Output: The RMSE in predictions is  13.87282412661944
############################################################

plt.figure(figsize = (15,8 ))
train_series.plot(label = "Training Data")
test_series.plot(label = "Testing Data")
plt.plot(test_series.index, ar_predictions, label = "AR Predictions")
plt.legend()

AR Model Evaluation for the above complex non-linear time series:

We modeled the time-series with an Auto-Regressive (AR(20)) but with limitations:

We were not able to difference the time series and make it stationary. As a result, we have modeled a non-stationary time-series with an Auto_regressive model.
The model forecasts made by the AR model is able to capture some non-linear relationships between the current value and the past-values. However, it quickly diverges for higher time-points.

Can we improve the model by any chance? Answer is yes! AR model simply assumes a linear-relationship between current values and it’s past values. There’s nothing to account for errors due to sudden changes.

However, ARIMA models account for the errors due to sudden changes by adding up the moving average of the error terms (actual value at time t — prediction at time t).

Let’s see what ARIMA models are

ARIMA Model Overview:

ARIMA models enhance AR models with two additional elements:

Moving Average (MA) Component: Addresses prediction errors by incorporating past errors into the model, improving response to sudden changes in the data.
Integrated (I) Component: Differencing is applied to stabilize non-stationary data, preparing it for AR and MA modeling.

Let us now dive directly into fitting ARIMA model and see it’s performance for our example

from statsmodels.tsa.arima.model import ARIMA

# Fit ARIMA model - consider a model that allows for differencing (d=1) and a seasonal order
arima_model = ARIMA(train_series, order=(10, 1, 10))
arima_result = arima_model.fit()

# Predict on testing data
arima_predictions = arima_result.predict(start=len(train_series), end=len(train_series)+len(test_series)-1)

plt.figure(figsize = (15,8 ))
train_series.plot()
test_series.plot()
plt.plot(test_series.index, ar_predictions, label = "AR Predictions")
plt.plot(test_series.index, arima_predictions, label = "ARIMA Predictions")
plt.legend()

# Calculate RMSE for both models
print("The RMSE in predictions is ", calculate_rmse(test_series, arima_predictions))

# Output :
The RMSE in predictions is  5.336386796302866
The Average Amplitude of the Wave is  84.23939382217799

USING ARIMA, we were finally able to predict the non-linear time series satisfactorily. The Root Mean Squared Error has dropped by more than 50% (~13 -> 5.33)

WHY? The reason it worked was that ARIMA models account for “past errors” by including a Moving Average component of errors in it.

Let’s dive a little more into ARIMA model and try to work with a new dataset

Auto-Regressive Integrated Moving Average Models

Definition

The ARIMA model combines three parts: an Autoregressive (AR) part, Integration part and a Moving Average (MA) part. ARIMA models can directly handle non-stationary time-series due to the Integration(I) part. Each part addresses different properties of the time series:

Equation

Autoregressive (AR) part: This component models the current value of the time series as a linear combination of its previous values.

Yt=∑i=ipϕiYt−1

Integrated(I) Part : Helps in making the Time Series stationary by differencing the data d times to achieve stationarity.

Moving Average (MA) part: This component models the current value of the series as a linear combination of past error terms (the differences between past predictions and actual values). This allows the model to account for recent changes in the data (for example, abrupt change in amplitude height)

Yt=∑j=iqθjϵt−1

ARIMA (p,d,q) have three parameters :

p: Number of autoregressive terms (AR order).

d: Number of differences required to make the series stationary (I order).

q: Number of moving average terms (MA order).

#########################################################

Let’s use the Electricity Production dataset that we also used in our previous lab: https://medium.com/time-series-using-python/lab-1-case-study-time-series-analysis-using-electricity-production-data-02259f0823f5

ARIMA LAB

How to find p,d, q?

d can be found differencing data “d” times until its stationary (adf_fuller test)
find the AIC metric for a range of values for p and q and then using that set of value for modelling.

link — https://www.kaggle.com/datasets/shenba/time-series-datasets

Download the above data
Extract the data on local system
Upload the Electric_Production.csv to the notebook
Once uploaded, copy path

Step 1: EDA

import pandas as pd

# Load the dataset
file_path = '/content/Electric_Production.csv'
data = pd.read_csv(file_path)

# Display basic information about the dataset and the first few rows
print(data.info(), '\n')
print(data.describe(), '\n')
print(data.head(), '\n')

##########################Output##################################
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   DATE                    397 non-null    object 
 1   electricity_production  397 non-null    float64
dtypes: float64(1), object(1)
memory usage: 6.3+ KB
None 

       electricity_production
count              397.000000
mean                88.847218
std                 15.387834
min                 55.315100
25%                 77.105200
50%                 89.779500
75%                100.524400
max                129.404800 

         DATE  electricity_production
0  01-01-1985                 72.5052
1  02-01-1985                 70.6720
2  03-01-1985                 62.4502
3  04-01-1985                 57.4714
4  05-01-1985                 55.3151 
'''

import matplotlib.pyplot as plt

# Convert DATE from object to datetime and set as index
data['DATE'] = pd.to_datetime(data['DATE'], format='%m-%d-%Y')
data.set_index('DATE', inplace=True)

# Plot the time series
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['electricity_production'], label='Electricity Production', color='blue')
plt.title('Electricity Production Over Time')
plt.xlabel('Year')
plt.ylabel('Electricity Production')
plt.grid(True)
plt.legend()
plt.show()

STEP 2: Stationarity Test

# Lag = 0
adf_test(data, lag = 0)

######################Output####################################
'''
ADF Statistic: -2.256990
p-value: 0.186215
Critical Values:
 1%: -3.448
 5%: -2.869
 10%: -2.571
The time-series is 0 non-stationary
'''
##################################################################

# Lag = 1
adf_test(data, lag = 1)

######################Output####################################
'''
ADF Statistic: -7.104891
p-value: 0.000000
Critical Values:
 1%: -3.448
 5%: -2.869
 10%: -2.571
The time-series at 1 is stationary
'''
##################################################################

STEP 3: Choosing Lag for AR Part

test_lags =  np.arange(1, 50)
test_aic = []
# Loop over possible lags from 0 to 20
for lag in test_lags:  # including 0 to 20
    model     = AutoReg(data.diff(1).dropna(), lags=lag)
    model_fit = model.fit()
    aic       = model_fit.aic
    test_aic.append(aic)

plt.plot(test_lags, test_aic)
plt.xlabel("Lag Order")
plt.ylabel("AIC Values")
plt.grid()

STEP 4: Model Fitting

# Fit an ARIMA model

sample_size = int(0.8 * len(data))
train_data = data.iloc[0 : sample_size]
test_data  = data.iloc[sample_size : ]

arima_model = ARIMA(train_data['electricity_production'], order=(12,1,12)).fit()

# Predict the next 12 months
arima_pred = arima_model.predict(start=len(train_data), end=len(train_data)+len(test_data)-1)


# Calculate RMSE for the models
print("The RMSE in predictions is ", calculate_rmse(test_data['electricity_production'], arima_pred))

#####################################Output####################
The RMSE in predictions is  3.824256949921728

STEP 5: Plot Results

plt.figure(figsize = (15,8))
plt.plot(train_data.index, train_data.values, label = "Training Data")
plt.plot(test_data.index,  test_data.values, label = "Testing Data")
plt.plot(test_data.index, arima_pred, label = "ARIMA Predictions")
plt.legend()