๐๐ป๐๐ฟ๐ผ๐ฑ๐๐ฐ๐๐ถ๐ผ๐ป ๐๐ผ ๐ง๐ถ๐บ๐ฒ ๐ฆ๐ฒ๐ฟ๐ถ๐ฒ๐: ๐ฃ๐ฎ๐ฟ๐ ๐ฎ.๐ฎ ๐ฆ๐๐ฅ๐๐ ๐ ๐ ๐ผ๐ฑ๐ฒ๐น๐
Recap :
Part 1: Introduction To Time Series Analysis -Part 1 Collab, Medium
Part 1.1: Introduction to Time Series Analysis โ Part 1.1: Stationarity Collab, Medium
Part 2: Introduction to Time Series: Part 2 Forecasting Collab, Medium
Part 2.1: Introduction to Time Series: Part 2.1 ARIMA Models Forecasting Collab Medium
In Previous Lab,
- we worked on Forecasting using ARIMA models. We first explored the limitations of AR models and introduced the Integration and Moving Average components (AR -> ARIMA) to enhance the modelโs performance.
- Covered various steps, including data loading, stationarity testing, lag selection using AIC, model fitting, and finally, predictions and visual comparisons between training and testing results.
- Developed an understanding of how to manipulate ARIMA models to forecast time series data effectively.
Intro
In this Lab,
- We will see that ARIMA models are not able to effectively model a Noisy time series if Lag Order << Period of Seasonality/Cyclicity.
- We will then discuss the SARIMA model and use SARIMA (Seasonal ARIMA) to forecast such time-series and discuss itโs effectiveness.
- We will discuss why Seasonal ARIMA can handle the long-period seasonal effects reasonably well.
Letโs Start with taking the time-series that we used in the last lab. Here is the equation:
# A Non-Linear Trend
trend = time*time * 0.001 # Non- Linear trend
# Seasonality with Time Period = 90
seasonality = 30 * np.sin(2 * np.pi * time / 90)
# Random Noise
noise = np.random.normal(0, 10, n) # Random noise
You will notice, that we have increased the Time-period to 90 days and the amplitude of the noise. Let us attempt to fit an ARIMA model by following the process we have highlighted in the previous labs:
- Stationarity Test: Using ADF tests, find if the data is stationary or not.
- Differencing: Find the degree of differencing for which the data is stationary.
- Parameter Tuning: Use the lag vs AIC curve to identify the best-suited Lag Order.
- Model Fit and Predict: Fit and predict using the ARIMA model and find the RMSE to measure accuracy.
Step 1: Create and Visualise the Time Series
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Number of observations
n = 200
# Time index
time = np.linspace(0, 500, n)
# Generate a time series with a linear trend and a seasonal component
np.random.seed(42)
trend = time*time * 0.001 # Non- Linear trend
seasonality = 30 * np.sin(2 * np.pi * time / 90) # Seasonal component with a period of 90
noise = np.random.normal(0, 10, n) # Random noise
# Combine components to form the time series
time_series = pd.Series(index = time, data = trend + seasonality + noise) # - cyclicity + noise)
# Convert to pandas series and plot
plt.figure(figsize = (15, 6))
time_series.plot(title='Time Series with Trend and Seasonality')
plt.show()
Step 2 : Check Data for Stationarity
# Step 1: Test for stationarity
# ADF Statistic is the output of the Augmented Dickey-Fuller test which tests for stationarity.
# p-value indicates the probability that the series has a unit root and is non-stationary.
# Critical values show the thresholds for the ADF statistic at which the null hypothesis can be rejected.
import warnings
from statsmodels.tsa.stattools import adfuller, kpss
warnings.filterwarnings('ignore')
def adf_test(time_series, differencing_order = None, verbose = True):
if differencing_order :
result = adfuller(time_series.diff(differencing_order).dropna())
else:
result = adfuller(time_series)
differencing_order = 0
adf_stats = result[0]
p_value = result[1]
if verbose :
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
print('\t%s: %.3f' % (key, value))
if adf_stats < result[4]["5%"] :
print (f"The time-series at {differencing_order} is stationary")
return "Stationary"
else:
print (f"The time-series is {differencing_order} non-stationary")
return "Non-Stationary"
adf_test(time_series)
ADF Statistic: 0.464231
p-value: 0.983744
Critical Values:
1%: -3.466
5%: -2.877
10%: -2.575
The time-series is 0 non-stationary
Non-Stationary
Since the time-series is non-stationary, We will difference the time-series till differencing_order = 20. We will break the loop as soon as a differencing order is found where the time series is stationary
for d in range(1, 20):
if adf_test(time_series, d, verbose = False) == "Stationary":
break
The time-series at 1 is stationary
Since the data becomes stationary at differencing_order = 1, letโs take differenced time series and find the lag for which Auto-Regression will give best performance
from statsmodels.tsa.ar_model import AutoReg
test_lags = np.arange(1, 50)
test_aic = []
# Loop over possible lags from 0 to 20
for lag in test_lags: # including 0 to 20
model = AutoReg(time_series.diff(1).dropna(), lags=lag)
model_fit = model.fit()
aic = model_fit.aic
test_aic.append(aic)
plt.plot(test_lags, test_aic)
plt.xlabel("Lag Order")
plt.ylabel("AIC Values")
plt.grid()
The above graph doesnโt yield a definite Lag Order value for which we can train the ARIMA model. We will fit and predict several models and gauge the performance using Root Mean Square Errors.
import numpy as np
def calculate_rmse(actual, predictions):
"""
Calculate the Root Mean Squared Error between actual values and predictions.
Parameters:
actual (array-like): The actual values of the data.
predictions (array-like): The predicted values generated by the model.
Returns:
float: The calculated RMSE.
"""
actual = np.array(actual)
predictions = np.array(predictions)
rmse = np.sqrt(np.mean((actual - predictions) ** 2))
return rmse
def fit_predict_arima(order, train_series, test_series):
arima_model = ARIMA(train_series, order = order)
arima_result = arima_model.fit()
arima_predictions = arima_result.predict(start=len(train_series), end=len(train_series)+len(test_series)-1)
return arima_predictions
# Splitting data into training and testing
from statsmodels.tsa.arima.model import ARIMA
split = int(0.80 * len(time_series))
train_series = time_series.iloc[:split]
test_series = time_series.iloc[split:]
#fit_predict at different lags :
arima_predictions_lag2 = fit_predict_arima((2, 1, 2), train_series, test_series)
arima_predictions_lag5 = fit_predict_arima((5, 1, 5), train_series, test_series)
arima_predictions_lag10 = fit_predict_arima((10, 1, 10), train_series, test_series)
arima_predictions_lag20 = fit_predict_arima((20, 1, 20), train_series, test_series)
# Calculate RMSE for both models
print("The RMSE in prediction of ARIMA model is ", calculate_rmse(test_series, arima_predictions_lag2))
print("The RMSE in prediction of ARIMA model is ", calculate_rmse(test_series, arima_predictions_lag5))
print("The RMSE in prediction of ARIMA model is ", calculate_rmse(test_series, arima_predictions_lag10))
print("The RMSE in prediction of ARIMA model is ", calculate_rmse(test_series, arima_predictions_lag20))
plt.figure(figsize = (15,8 ))
train_series.plot(label = "Training Data")
test_series.plot(label = "Testing Data")
plt.plot(test_series.index, arima_predictions_lag2, label = "ARIMA Predictions at Lag 2")
plt.plot(test_series.index, arima_predictions_lag5, label = "ARIMA Predictions at Lag 5")
plt.plot(test_series.index, arima_predictions_lag10, label = "ARIMA Predictions at Lag 10")
plt.plot(test_series.index, arima_predictions_lag20, label = "ARIMA Predictions at Lag 20")
plt.grid()
plt.legend()
The RMSE in prediction of ARIMA model is 52.03602602852389 #(2, 1, 2)
The RMSE in prediction of ARIMA model is 67.32368981801054 #(5, 1, 5)
The RMSE in prediction of ARIMA model is 55.49322640110209 #(10, 1, 10)
The RMSE in prediction of ARIMA model is 57.47679848898671 #(20, 1, 20)
Remember that the Time-period of the seasonal part of the time-series is 90 days. An ARIMA model with a Lag-Order of 90 will be extremely complex (the model will try and optimize 90 regression coefficient, 90 moving-average coefficients). Therefore, we have to limit ourselves to something reasonable (< 30)
Arima Failure Observations
The following observations can be made about the results of the above training and predictions:
- ARIMA Predictions at Order (2,1,2): This model seems to be too simple. It does not capture the variability of the testing data well, likely because it doesnโt consider enough past information to make accurate predictions.
- ARIMA Predictions at Order (5,1,5): With a slightly higher lag, the model appears to perform better than the (2,1,2) model. However, it still doesnโt follow the testing data closely, suggesting it may still be too simple. Further, itโs modeling and capturing noise more than the actual pattern.
- ARIMA Predictions at (10, 1, 10): This model starts to capture the trend in the testing data somewhat better than the previous two. However, it doesnโt seem to adapt quickly to changes in the trend, which could suggest that while it has enough terms to start capturing the trend, it might be missing seasonality or other dynamic aspects in the data.
- ARIMA Predictions at (10, 1, 10): Interestingly, this model doesnโt seem to perform much better than the lag 10 model, and may even perform worse after a certain point. This could be due to the introduction of too much complexity, causing the model to react slowly to changes in the time series or to start overfitting.
Overall, one way to improve the model is to Better Account for Seasonality.
SARIMA Model
The SARIMA(p, d, q)(P, D, Q, s) model is characterized by several components:
- Seasonal Autoregression (SAR): This component captures the relationship between an observation and a number of lagged observations within the same season. Itโs denoted by the parameter : P, which represents the number of autoregressive terms for the seasonal component.
EXAMPLE : Imagine you run a shop that sells ice cream, and you notice that sales in June are often similar to sales the previous June. SAR in a forecasting model helps to predict this Juneโs sales by looking at the sales from Junes in previous years. The โPโ might tell us how many past Junes the model should look back at. - Seasonal Integration (SI): This component represents the differencing needed to make the time series stationary at the seasonal level. Itโs denoted by D, which represents the number of seasonal differences.
EXAMPLE : Suppose your ice cream sales have been increasing every summer because your townโs population grows a little each year. Seasonal differencing would subtract the sales of last June from this June to remove this increasing trend and help see the underlying pattern. โDโ tells us how many times we do this subtraction to stabilize the seasonal fluctuations. - Seasonal Moving Average (SMA): This component captures the relationship between an observation and a residual error from a moving average model applied to lagged observations within the same season. Itโs denoted by Q, which represents the number of moving average terms for the seasonal component.
EXAMPLE : Letโs say there was an exceptionally hot week last June that caused a spike in ice cream sales. This spike is not a regular thing, so we want our model to smooth it out. SMA helps to even out these random spikes by averaging them with errors (unexpected differences) from previous seasons. โQโ would tell us how many past errors to use for smoothing. - Autoregression (AR): Similar to ARIMA, this component captures the relationship between an observation and a number of lagged observations. Itโs denoted by p, which represents the number of autoregressive terms.
EXAMPLE : Imagine youโve recorded your ice cream sales every day for several years. You notice that if you had a lot of customers one day, itโs likely youโll have a good number the next day too, perhaps because of a heatwave. An AR model uses the sales from the past few days to predict the sales for the next day. If โpโ is 2, then to predict todayโs sales, the model would look at sales from the last two days. - Integration (I): This component represents the number of differences needed to make the time series stationary. Itโs denoted by d, which represents the number of non-seasonal differences.
EXAMPLE : Now, consider that your ice cream sales have been growing year over year because your townโs getting more popular. To predict future sales, you canโt just look at what you sold yesterday. You need to consider the overall upward trend in your sales. โdโ represents how many times you subtract previous sales from current sales to remove this growth trend and get a clearer picture of how sales change day by day. - Moving Average (MA): Similar to ARIMA, this component captures the relationship between an observation and a residual error from a moving average model applied to lagged observations. Itโs denoted by q, which represents the number of moving average terms.
EXAMPLE : On some days, a local event might cause a sudden increase in sales, or a rainy day might cause them to dip unexpectedly. These are like random good or bad days for your running friend. An MA model helps smooth out these random bumps in your ice cream sales by averaging them with the unusual increases or decreases in sales from the past. If โqโ is 3, the model would use the last three daysโ unexpected increases or decreases in sales to help predict todayโs sales. - Seasonal Period (s): This parameter specifies the length of the seasonal cycle in the data. For example, if the data is monthly and exhibits yearly seasonality, s would be 12.
EXAMPLE : Aside from daily sales, youโve also noticed that every July, the town fair brings in lots of tourists, and you sell more ice cream. This pattern repeats every year. If youโre looking at monthly sales data over several years, the seasonal period โsโ would be 12 to model this annual pattern. So the model will take into account that every 12th month (every July), thereโs a spike in sales due to the fair.
Letโs now demonstrate how to model a SARIMA model:
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Helper function to fit and predict using SARIMA model
def fit_predict_sarima(non_seasonal_order, seasonal_order, train_series, test_series):
# Fit SARIMA model
sarima_model = SARIMAX(train_series, order = non_seasonal_order, seasonal_order = seasonal_order)
sarima_result = sarima_model.fit()
# Predict on testing data
sarima_predictions = sarima_result.predict(start=len(train_series), end=len(train_series)+len(test_series)-1)
return sarima_predictions
# The 'order' argument (p, d, q) specifies that the model should use:
# - p past observations for autoregression (AR),
# - Differencing the series d times (I) to achieve stationarity,
# - q A moving average (MA) window of q past forecast errors to smooth out noise.
# The 'seasonal_order' (P, D, Q, s) indicates:
# - P past seasonal observation for seasonal autoregression (SAR),
# - Seasonal differencing D (SI) to remove seasonal trends,
# - A seasonal moving average (SMA) window of Q past seasonal forecast error,
# - and a seasonal cycle length of S time units.
# Parameters were chosen based on statistical analysis using PACF and ACF plots.
#fit_predict at different lags :
sarima_predictions = fit_predict_sarima((10, 1, 10),(1, 1, 1, 15), train_series, test_series)
sarima_predictions2 = fit_predict_sarima((10, 1, 10),(2, 1, 5, 15), train_series, test_series)
sarima_rmse1 = mean_squared_error(test_series, sarima_predictions, squared=False)
sarima_rmse2 = mean_squared_error(test_series, sarima_predictions2, squared=False)
plt.figure(figsize = (15,8 ))
train_series.plot(label = "Training Data")
test_series.plot(label = "Testing Data")
plt.plot(test_series.index, sarima_predictions, label = "SARIMA Predictions (10, 1, 10),(1, 1, 1, 15)")
plt.plot(test_series.index, sarima_predictions2, label = "SARIMA Predictions (10, 1, 10),(2, 1, 5, 15)")
plt.grid()
plt.legend()
Letโs evaluate the modelโs accuracy using RMSE and compare it to the SARIMA results
sarima_rmse1 = mean_squared_error(test_series, sarima_predictions, squared=False)
sarima_rmse2 = mean_squared_error(test_series, sarima_predictions2, squared=False)
# Calculate RMSE for both models
print("The RMSE in prediction of ARIMA model1 is ", calculate_rmse(test_series, arima_predictions_lag2))
print("The RMSE in prediction of ARIMA model2 is ", calculate_rmse(test_series, arima_predictions_lag5))
print("The RMSE in prediction of ARIMA model3 is ", calculate_rmse(test_series, arima_predictions_lag10))
print("The RMSE in prediction of ARIMA model4 is ", calculate_rmse(test_series, arima_predictions_lag20))
print("RMSE for (10, 1, 10),(1, 1, 1, 15) model : ", sarima_rmse1 )
print("RMSE for (10, 1, 10),(2, 1, 5, 15) model : ", sarima_rmse2 )
The RMSE in prediction of ARIMA model1 is 52.03602602852389
The RMSE in prediction of ARIMA model2 is 67.32368981801054
The RMSE in prediction of ARIMA model3 is 55.49322640110209
The RMSE in prediction of ARIMA model4 is 57.47679848898671
RMSE for (10, 1, 10),(1, 1, 1, 15) model : 24.937851474704146
RMSE for (10, 1, 10),(2, 1, 5, 15) model : 21.77717866602613
SARIMA Model Results:
Observations :
- The SARIMA model delivers more accurate forecasts by effectively capturing seasonal patterns in the data, which are missed by the non-seasonal ARIMA model.
- In the SARIMA predictions, the (10, 1, 10),(1, 1, 1, 15) model shows it adapts quickly to the dataโs regular cycles, suggesting its parameters are well-tuned for the seasonal trends occurring every 15 time units. THINK WHY?
- The ARIMA model, despite testing various lags, fails to align with the cyclical nature of the data, leading to predictions that drift away from actual values over time.
- SARIMAโs superior performance is due to its inclusion of seasonal differencing, which adjusts for repeating fluctuations and trends that are characteristic of the dataset.
- By accounting for seasonality, the SARIMA model provides a more nuanced and precise prediction, making it more reliable for this particular time series than the traditional ARIMA model.
Summary
The โIntroduction to Time Series: Part 2.2โ focuses on SARIMA (Seasonal ARIMA) models as an extension of ARIMA for handling time series data with significant seasonal patterns. The limitations of ARIMA models are discussed, particularly their inefficiency in capturing the seasonality of noisy time series with long periods. The lab introduces SARIMA models, which incorporate both non-seasonal and seasonal factors in the model to effectively forecast time series with distinct cyclical patterns.
The lab progresses by applying SARIMA models to a synthetic time series with a non-linear trend and a seasonal component of 90 days and fitting SARIMA models with different configurations. The results are compared using RMSE (Root Mean Square Error), demonstrating that SARIMA models provide more accurate predictions by effectively capturing the seasonal variations missed by ARIMA models. The tutorial emphasizes the importance of the seasonal components in SARIMA that adjust for recurring fluctuations, thus offering a more nuanced and precise prediction for seasonal time series data. This makes SARIMA a preferred model for scenarios where understanding and predicting seasonal patterns are crucial.