𝗜𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝘁𝗼 𝗧𝗶𝗺𝗲 𝗦𝗲𝗿𝗶𝗲𝘀: 𝗣𝗮𝗿𝘁 𝟮.𝟮 𝗦𝗔𝗥𝗜𝗠𝗔 𝗠𝗼𝗱𝗲𝗹𝘀

Dr. Vipul Joshi

Published in

Time Series Using Python

12 min readApr 29, 2024

Recap :

Part 1: Introduction To Time Series Analysis -Part 1 Collab, Medium

Part 1.1: Introduction to Time Series Analysis — Part 1.1: Stationarity Collab, Medium

Part 2: Introduction to Time Series: Part 2 Forecasting Collab, Medium

Part 2.1: Introduction to Time Series: Part 2.1 ARIMA Models Forecasting Collab Medium

In Previous Lab,

we worked on Forecasting using ARIMA models. We first explored the limitations of AR models and introduced the Integration and Moving Average components (AR -> ARIMA) to enhance the model’s performance.
Covered various steps, including data loading, stationarity testing, lag selection using AIC, model fitting, and finally, predictions and visual comparisons between training and testing results.
Developed an understanding of how to manipulate ARIMA models to forecast time series data effectively.

Intro

In this Lab,

We will see that ARIMA models are not able to effectively model a Noisy time series if Lag Order << Period of Seasonality/Cyclicity.
We will then discuss the SARIMA model and use SARIMA (Seasonal ARIMA) to forecast such time-series and discuss it’s effectiveness.
We will discuss why Seasonal ARIMA can handle the long-period seasonal effects reasonably well.

Let’s Start with taking the time-series that we used in the last lab. Here is the equation:

# A Non-Linear Trend
trend = time*time * 0.001 # Non- Linear trend

# Seasonality with Time Period = 90
seasonality = 30 * np.sin(2 * np.pi * time / 90)

# Random Noise
noise = np.random.normal(0, 10, n)  # Random noise

You will notice, that we have increased the Time-period to 90 days and the amplitude of the noise. Let us attempt to fit an ARIMA model by following the process we have highlighted in the previous labs:

Stationarity Test: Using ADF tests, find if the data is stationary or not.
Differencing: Find the degree of differencing for which the data is stationary.
Parameter Tuning: Use the lag vs AIC curve to identify the best-suited Lag Order.
Model Fit and Predict: Fit and predict using the ARIMA model and find the RMSE to measure accuracy.

Step 1: Create and Visualise the Time Series

from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Number of observations
n = 200

# Time index
time = np.linspace(0, 500, n)
# Generate a time series with a linear trend and a seasonal component
np.random.seed(42)
trend = time*time * 0.001 # Non- Linear trend
seasonality = 30 * np.sin(2 * np.pi * time / 90)  # Seasonal component with a period of 90
noise = np.random.normal(0, 10, n)  # Random noise

# Combine components to form the time series
time_series = pd.Series(index = time, data = trend + seasonality + noise) # - cyclicity + noise)

# Convert to pandas series and plot
plt.figure(figsize = (15, 6))
time_series.plot(title='Time Series with Trend and Seasonality')
plt.show()

A noisy time-series having a non-linear trend and a seasonal pattern with longer Time-period (90 days)

Step 2 : Check Data for Stationarity

# Step 1: Test for stationarity
# ADF Statistic is the output of the Augmented Dickey-Fuller test which tests for stationarity.
# p-value indicates the probability that the series has a unit root and is non-stationary.
# Critical values show the thresholds for the ADF statistic at which the null hypothesis can be rejected.

import warnings
from statsmodels.tsa.stattools import adfuller, kpss

warnings.filterwarnings('ignore')

def adf_test(time_series, differencing_order = None, verbose = True):
  if differencing_order :
    result    = adfuller(time_series.diff(differencing_order).dropna())
  else:
    result    = adfuller(time_series)
    differencing_order = 0

  adf_stats = result[0]
  p_value   = result[1]
  if verbose :
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))

  if adf_stats < result[4]["5%"] :
    print (f"The time-series at {differencing_order} is stationary")
    return "Stationary"

  else:
      print (f"The time-series is {differencing_order} non-stationary")
      return "Non-Stationary"

adf_test(time_series)

ADF Statistic: 0.464231
p-value: 0.983744
Critical Values:
 1%: -3.466
 5%: -2.877
 10%: -2.575
The time-series is 0 non-stationary
Non-Stationary

Since the time-series is non-stationary, We will difference the time-series till differencing_order = 20. We will break the loop as soon as a differencing order is found where the time series is stationary

for d in range(1, 20):
  if adf_test(time_series, d, verbose = False) == "Stationary":
    break

The time-series at 1 is stationary

Since the data becomes stationary at differencing_order = 1, let’s take differenced time series and find the lag for which Auto-Regression will give best performance

from statsmodels.tsa.ar_model import AutoReg

test_lags =  np.arange(1, 50)
test_aic = []
# Loop over possible lags from 0 to 20
for lag in test_lags:  # including 0 to 20
    model     = AutoReg(time_series.diff(1).dropna(), lags=lag)
    model_fit = model.fit()
    aic       = model_fit.aic
    test_aic.append(aic)

plt.plot(test_lags, test_aic)
plt.xlabel("Lag Order")
plt.ylabel("AIC Values")
plt.grid()

Variation of AIC values with Lag Order to identify the best Lag Order value for training an ARIMA model

The above graph doesn’t yield a definite Lag Order value for which we can train the ARIMA model. We will fit and predict several models and gauge the performance using Root Mean Square Errors.

import numpy as np

def calculate_rmse(actual, predictions):
    """
    Calculate the Root Mean Squared Error between actual values and predictions.

    Parameters:
    actual (array-like): The actual values of the data.
    predictions (array-like): The predicted values generated by the model.

    Returns:
    float: The calculated RMSE.
    """
    actual = np.array(actual)
    predictions = np.array(predictions)
    rmse = np.sqrt(np.mean((actual - predictions) ** 2))
    return rmse

def fit_predict_arima(order, train_series, test_series):
  arima_model       = ARIMA(train_series, order = order)
  arima_result      = arima_model.fit()
  arima_predictions = arima_result.predict(start=len(train_series), end=len(train_series)+len(test_series)-1)
  return arima_predictions

# Splitting data into training and testing
from statsmodels.tsa.arima.model import ARIMA

split = int(0.80 * len(time_series))
train_series = time_series.iloc[:split]
test_series = time_series.iloc[split:]

#fit_predict at different lags :
arima_predictions_lag2  = fit_predict_arima((2, 1, 2), train_series, test_series)
arima_predictions_lag5  = fit_predict_arima((5, 1, 5), train_series, test_series)
arima_predictions_lag10 = fit_predict_arima((10, 1, 10), train_series, test_series)
arima_predictions_lag20 = fit_predict_arima((20, 1, 20), train_series, test_series)


# Calculate RMSE for both models
print("The RMSE in prediction of ARIMA model is ", calculate_rmse(test_series, arima_predictions_lag2))
print("The RMSE in prediction of ARIMA model is ", calculate_rmse(test_series, arima_predictions_lag5))
print("The RMSE in prediction of ARIMA model is ", calculate_rmse(test_series, arima_predictions_lag10))
print("The RMSE in prediction of ARIMA model is ", calculate_rmse(test_series, arima_predictions_lag20))



plt.figure(figsize = (15,8 ))
train_series.plot(label = "Training Data")
test_series.plot(label = "Testing Data")
plt.plot(test_series.index, arima_predictions_lag2, label = "ARIMA Predictions at Lag 2")
plt.plot(test_series.index, arima_predictions_lag5, label = "ARIMA Predictions at Lag 5")
plt.plot(test_series.index, arima_predictions_lag10, label = "ARIMA Predictions at Lag 10")
plt.plot(test_series.index, arima_predictions_lag20, label = "ARIMA Predictions at Lag 20")

plt.grid()
plt.legend()

The RMSE in prediction of ARIMA model is  52.03602602852389 #(2, 1, 2)
The RMSE in prediction of ARIMA model is  67.32368981801054 #(5, 1, 5)
The RMSE in prediction of ARIMA model is  55.49322640110209 #(10, 1, 10)
The RMSE in prediction of ARIMA model is  57.47679848898671 #(20, 1, 20)

Remember that the Time-period of the seasonal part of the time-series is 90 days. An ARIMA model with a Lag-Order of 90 will be extremely complex (the model will try and optimize 90 regression coefficient, 90 moving-average coefficients). Therefore, we have to limit ourselves to something reasonable (< 30)

Forecasting Results using ARIMA models — (2, 1, 2), (5, 1, 5), (10, 1, 10) and (20, 1, 20).

Arima Failure Observations

The following observations can be made about the results of the above training and predictions:

ARIMA Predictions at Order (2,1,2): This model seems to be too simple. It does not capture the variability of the testing data well, likely because it doesn’t consider enough past information to make accurate predictions.
ARIMA Predictions at Order (5,1,5): With a slightly higher lag, the model appears to perform better than the (2,1,2) model. However, it still doesn’t follow the testing data closely, suggesting it may still be too simple. Further, it’s modeling and capturing noise more than the actual pattern.
ARIMA Predictions at (10, 1, 10): This model starts to capture the trend in the testing data somewhat better than the previous two. However, it doesn’t seem to adapt quickly to changes in the trend, which could suggest that while it has enough terms to start capturing the trend, it might be missing seasonality or other dynamic aspects in the data.
ARIMA Predictions at (10, 1, 10): Interestingly, this model doesn’t seem to perform much better than the lag 10 model, and may even perform worse after a certain point. This could be due to the introduction of too much complexity, causing the model to react slowly to changes in the time series or to start overfitting.

Overall, one way to improve the model is to Better Account for Seasonality.

SARIMA Model

The SARIMA(p, d, q)(P, D, Q, s) model is characterized by several components:

Seasonal Autoregression (SAR): This component captures the relationship between an observation and a number of lagged observations within the same season. It’s denoted by the parameter : P, which represents the number of autoregressive terms for the seasonal component.
EXAMPLE : Imagine you run a shop that sells ice cream, and you notice that sales in June are often similar to sales the previous June. SAR in a forecasting model helps to predict this June’s sales by looking at the sales from Junes in previous years. The ‘P’ might tell us how many past Junes the model should look back at.
Seasonal Integration (SI): This component represents the differencing needed to make the time series stationary at the seasonal level. It’s denoted by D, which represents the number of seasonal differences.
EXAMPLE : Suppose your ice cream sales have been increasing every summer because your town’s population grows a little each year. Seasonal differencing would subtract the sales of last June from this June to remove this increasing trend and help see the underlying pattern. ‘D’ tells us how many times we do this subtraction to stabilize the seasonal fluctuations.
Seasonal Moving Average (SMA): This component captures the relationship between an observation and a residual error from a moving average model applied to lagged observations within the same season. It’s denoted by Q, which represents the number of moving average terms for the seasonal component.
EXAMPLE : Let’s say there was an exceptionally hot week last June that caused a spike in ice cream sales. This spike is not a regular thing, so we want our model to smooth it out. SMA helps to even out these random spikes by averaging them with errors (unexpected differences) from previous seasons. ‘Q’ would tell us how many past errors to use for smoothing.
Autoregression (AR): Similar to ARIMA, this component captures the relationship between an observation and a number of lagged observations. It’s denoted by p, which represents the number of autoregressive terms.
EXAMPLE : Imagine you’ve recorded your ice cream sales every day for several years. You notice that if you had a lot of customers one day, it’s likely you’ll have a good number the next day too, perhaps because of a heatwave. An AR model uses the sales from the past few days to predict the sales for the next day. If ‘p’ is 2, then to predict today’s sales, the model would look at sales from the last two days.
Integration (I): This component represents the number of differences needed to make the time series stationary. It’s denoted by d, which represents the number of non-seasonal differences.
EXAMPLE : Now, consider that your ice cream sales have been growing year over year because your town’s getting more popular. To predict future sales, you can’t just look at what you sold yesterday. You need to consider the overall upward trend in your sales. ‘d’ represents how many times you subtract previous sales from current sales to remove this growth trend and get a clearer picture of how sales change day by day.
Moving Average (MA): Similar to ARIMA, this component captures the relationship between an observation and a residual error from a moving average model applied to lagged observations. It’s denoted by q, which represents the number of moving average terms.
EXAMPLE : On some days, a local event might cause a sudden increase in sales, or a rainy day might cause them to dip unexpectedly. These are like random good or bad days for your running friend. An MA model helps smooth out these random bumps in your ice cream sales by averaging them with the unusual increases or decreases in sales from the past. If ‘q’ is 3, the model would use the last three days’ unexpected increases or decreases in sales to help predict today’s sales.
Seasonal Period (s): This parameter specifies the length of the seasonal cycle in the data. For example, if the data is monthly and exhibits yearly seasonality, s would be 12.
EXAMPLE : Aside from daily sales, you’ve also noticed that every July, the town fair brings in lots of tourists, and you sell more ice cream. This pattern repeats every year. If you’re looking at monthly sales data over several years, the seasonal period ‘s’ would be 12 to model this annual pattern. So the model will take into account that every 12th month (every July), there’s a spike in sales due to the fair.

Let’s now demonstrate how to model a SARIMA model:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Helper function to fit and predict using SARIMA model
def fit_predict_sarima(non_seasonal_order, seasonal_order, train_series, test_series):
  # Fit SARIMA model
  sarima_model = SARIMAX(train_series, order = non_seasonal_order, seasonal_order = seasonal_order)
  sarima_result = sarima_model.fit()

  # Predict on testing data
  sarima_predictions = sarima_result.predict(start=len(train_series), end=len(train_series)+len(test_series)-1)
  return sarima_predictions

# The 'order' argument (p, d, q) specifies that the model should use:
# - p past observations for autoregression (AR),
# - Differencing the series d times (I) to achieve stationarity,
# - q A moving average (MA) window of q past forecast errors to smooth out noise.

# The 'seasonal_order' (P, D, Q, s) indicates:
# - P past seasonal observation for seasonal autoregression (SAR),
# - Seasonal differencing D (SI) to remove seasonal trends,
# - A seasonal moving average (SMA) window of Q past seasonal forecast error,
# - and a seasonal cycle length of S time units.
# Parameters were chosen based on statistical analysis using PACF and ACF plots.

#fit_predict at different lags :
sarima_predictions  = fit_predict_sarima((10, 1, 10),(1, 1, 1, 15), train_series, test_series)
sarima_predictions2 = fit_predict_sarima((10, 1, 10),(2, 1, 5, 15), train_series, test_series)

sarima_rmse1 = mean_squared_error(test_series, sarima_predictions, squared=False)
sarima_rmse2 = mean_squared_error(test_series, sarima_predictions2, squared=False)



plt.figure(figsize = (15,8 ))
train_series.plot(label = "Training Data")
test_series.plot(label = "Testing Data")

plt.plot(test_series.index, sarima_predictions, label = "SARIMA Predictions (10, 1, 10),(1, 1, 1, 15)")
plt.plot(test_series.index, sarima_predictions2, label = "SARIMA Predictions (10, 1, 10),(2, 1, 5, 15)")


plt.grid()
plt.legend()

SARIMA model results for (10, 1, 10),(1, 1, 1, 15) & (10, 1, 10),(2, 1, 5, 15). Both the models are able to understand the cyclicity (long term patterns) and the non-linearity of the trend.

Let’s evaluate the model’s accuracy using RMSE and compare it to the SARIMA results

sarima_rmse1 = mean_squared_error(test_series, sarima_predictions, squared=False)
sarima_rmse2 = mean_squared_error(test_series, sarima_predictions2, squared=False)
# Calculate RMSE for both models
print("The RMSE in prediction of ARIMA model1 is ", calculate_rmse(test_series, arima_predictions_lag2))
print("The RMSE in prediction of ARIMA model2 is ", calculate_rmse(test_series, arima_predictions_lag5))
print("The RMSE in prediction of ARIMA model3 is ", calculate_rmse(test_series, arima_predictions_lag10))
print("The RMSE in prediction of ARIMA model4 is ", calculate_rmse(test_series, arima_predictions_lag20))
print("RMSE for (10, 1, 10),(1, 1, 1, 15) model : ", sarima_rmse1 )
print("RMSE for (10, 1, 10),(2, 1, 5, 15) model : ", sarima_rmse2 )

The RMSE in prediction of ARIMA model1 is  52.03602602852389
The RMSE in prediction of ARIMA model2 is  67.32368981801054
The RMSE in prediction of ARIMA model3 is  55.49322640110209
The RMSE in prediction of ARIMA model4 is  57.47679848898671
RMSE for (10, 1, 10),(1, 1, 1, 15) model :  24.937851474704146
RMSE for (10, 1, 10),(2, 1, 5, 15) model :  21.77717866602613

SARIMA Model Results:

Observations :

The SARIMA model delivers more accurate forecasts by effectively capturing seasonal patterns in the data, which are missed by the non-seasonal ARIMA model.
In the SARIMA predictions, the (10, 1, 10),(1, 1, 1, 15) model shows it adapts quickly to the data’s regular cycles, suggesting its parameters are well-tuned for the seasonal trends occurring every 15 time units. THINK WHY?
The ARIMA model, despite testing various lags, fails to align with the cyclical nature of the data, leading to predictions that drift away from actual values over time.
SARIMA’s superior performance is due to its inclusion of seasonal differencing, which adjusts for repeating fluctuations and trends that are characteristic of the dataset.
By accounting for seasonality, the SARIMA model provides a more nuanced and precise prediction, making it more reliable for this particular time series than the traditional ARIMA model.

Summary

The “Introduction to Time Series: Part 2.2” focuses on SARIMA (Seasonal ARIMA) models as an extension of ARIMA for handling time series data with significant seasonal patterns. The limitations of ARIMA models are discussed, particularly their inefficiency in capturing the seasonality of noisy time series with long periods. The lab introduces SARIMA models, which incorporate both non-seasonal and seasonal factors in the model to effectively forecast time series with distinct cyclical patterns.

The lab progresses by applying SARIMA models to a synthetic time series with a non-linear trend and a seasonal component of 90 days and fitting SARIMA models with different configurations. The results are compared using RMSE (Root Mean Square Error), demonstrating that SARIMA models provide more accurate predictions by effectively capturing the seasonal variations missed by ARIMA models. The tutorial emphasizes the importance of the seasonal components in SARIMA that adjust for recurring fluctuations, thus offering a more nuanced and precise prediction for seasonal time series data. This makes SARIMA a preferred model for scenarios where understanding and predicting seasonal patterns are crucial.