Time series forecasting — ARIMA and SARIMA

S. Do.
Published in LatinXinAI
Dec 11, 2023

Time series forecasting is one of the most useful (and complex) fields of Machine Learning. In this article, the second part of my introduction to time series, I dive deeper into two of the most common forecasting models.

Barcelona at night — Photo by Pere Jurado — Unsplash

Autocorrelation

In the first part of the series, we saw that a basic data exploration can reveal trends, seasonality and other patterns in our time series. By differencing and decomposing the data, we saw that there’s a clear pattern repeating every 12 months (as expected).

Another way to detect seasonality is the Autocorrelation function (ACF). It quantifies the degree to which each observation is related to previous observations along different lags. We can set lags of different periods, like 6 months, 12, etc… It’s crucial for identifying the order of autoregressive (AR) and moving average (MA) processes, and helps a lot when selecting the appropriate parameters for ARIMA models (we’ll see them soon).

When we plot the ACF, we clearly see peaks every 12 months. This means that the series are correlated with themselves shifted by one year.

import pandas as pd
from statsmodels.tsa.stattools import acf, pacf

# Compute the ACF up to 36 lags for every consumption column
acf_df = pd.DataFrame({column: acf(df.loc[:, column], nlags=36)
                       for column in df.columns[1:]})

The autocorrelation function has a sister: the Partial Autocorrelation Function (PACF). This function also shows the relationships between observations at different lags, but it excludes the effect of the intermediate observations. For example, if we're measuring the correlation at lag 5, the function represents the relation between the current observation and the one exactly 5 time points behind, excluding the ones that are 1, 2, 3 or 4 time points behind.

PACF is particularly useful for tuning the order p in AutoRegressive models (AR(p)). This order p indicates how many past values are considered in the model. ARIMA models assume stationarity, so we apply differencing before computing the PACF to remove trends and focus on the direct autocorrelations needed to identify the order of the autoregressive terms.

In the following plots, any spikes outside the greyed areas are statistically significant, meaning that there is a strong autocorrelation between values at those lags.

# With alpha=0.05, both functions also return 95% confidence intervals
pacf_array = pacf(series.diff().dropna(), alpha=0.05)
acf_array = acf(series.diff().dropna(), alpha=0.05)
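To reproduce the plots, statsmodels also provides plotting helpers that draw the shaded significance band for you. A minimal sketch (the figure size and number of lags are my own choices, with series holding one of the consumption columns, e.g. df['domestic']):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# ACF and PACF of the differenced series; the shaded band marks the
# region where spikes are not statistically significant (alpha=0.05)
fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(series.diff().dropna(), lags=36, alpha=0.05, ax=axes[0])
plot_pacf(series.diff().dropna(), lags=36, alpha=0.05, ax=axes[1])
plt.tight_layout()
plt.show()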

Time series forecasting

ARIMA models

ARIMA stands for AutoRegressive Integrated Moving Average. It’s a combination of three different parts. The Autoregressive process models the current value of a variable as a linear combination of its past values. How many past values? That’s determined by the parameter p.
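In standard notation, an AR(p) process can be written as

y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t

where the \phi_i are the coefficients applied to the p most recent values and \varepsilon_t is a white-noise error term.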

The Moving Average part, on the other hand, models the current value as a linear combination of past forecast errors rather than of past values themselves. How many error terms are included is determined by the parameter q. In standard notation, an MA(q) process is:
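y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}

where the \theta_i are the weights on the q most recent error terms \varepsilon_{t-i}.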

Finally, the Integrated part corrects the model in case it is non-stationary. In our case, the rolling average plot clearly shows long-term trends, which makes our data non-stationary. But just looking at a plot to decide whether data is stationary or not is not very scientific, is it? Luckily, the Augmented Dickey-Fuller test was designed just for that. Its null hypothesis assumes that the data is non-stationary. For the three series (domestic, low and high voltage) the p-value is high enough that we cannot reject the null hypothesis, so we treat the data as non-stationary.

from statsmodels.tsa.stattools import adfuller

# Print the ADF p-value for each consumption column
for column in df.columns[1:]:
    print(f"{column}: {adfuller(df.loc[:, column])[1]}")

## Output
domestic: 0.842010473579625
low_voltage: 0.3599026643228239
high_voltage: 0.09802548827010471

To “stationarise” a time series, we have to compute the differences, as we did above with the first-order differencing. The number of times we have to difference the data determines the last parameter d. We can use the Dickey-Fuller test again to check how many rounds of differencing we need. With just one, we already get very low p-values in the test.

# Re-run the ADF test on the first-order differenced series
for column in df.columns[1:]:
    print(f"First order differencing {column}: "
          f"{adfuller(df.loc[:, column].diff().dropna())[1]}")

## Output
First order differencing domestic: 1.346953258378036e-27
First order differencing low_voltage: 7.674326893256915e-20
First order differencing high_voltage: 4.517936348563829e-18

The model, then, can be written as ARIMA(p, d, q). We've already decided the value of d, so our model becomes ARIMA(p, 1, q), and we still need to choose p and q. To determine them, we can look at the lags of the PACF (for p) and ACF (for q) plots, considering any peaks that fall outside the significance area. For example, let's take a closer look at the PACF plot corresponding to domestic consumption. Lags 1, 7, 8, 10 and 11 have a greater significance. Lags 6 and 9 too, but to a lesser extent. The ACF plot has a high peak at lag 12, revealing seasonality. This suggests that ARIMA may not be the best model to forecast the energy consumption, but let's keep going with it, and we'll take a look at Seasonal ARIMA (SARIMA) later.
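As an illustrative shortcut (not part of the original analysis), we can also list those lags programmatically: with alpha=0.05, pacf and acf return the estimates together with confidence intervals, and a lag is significant when its interval excludes zero.

# pacf_array and acf_array come from the calls above with alpha=0.05,
# so each is a tuple of (estimates, confidence intervals)
pacf_vals, pacf_conf = pacf_array
acf_vals, acf_conf = acf_array

# Keep the lags whose confidence interval does not contain zero
significant_pacf_lags = [lag for lag in range(1, len(pacf_vals))
                         if pacf_conf[lag, 0] > 0 or pacf_conf[lag, 1] < 0]
significant_acf_lags = [lag for lag in range(1, len(acf_vals))
                        if acf_conf[lag, 0] > 0 or acf_conf[lag, 1] < 0]

print(significant_pacf_lags)
print(significant_acf_lags)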

As we see, there are a lot of possible combinations for p and q. We could arbitrarily choose one, or we can run a for loop over a range of possible combinations and select the one that yields the minimum RMSE on the test set. As with any machine learning model, we also have to split the data into train and test sets (we're using an 80/20 ratio in this case). The code below summarizes the process.

from itertools import product

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
from statsmodels.tsa.arima.model import ARIMA

# 80/20 train/test split
split_idx = df.index[int(len(df) * .8)]
domestic_train = df.loc[:split_idx, 'domestic']
domestic_test = df.loc[split_idx:, 'domestic']

p_range = range(1, 12)
d_range = range(0, 2)
q_range = range(1, 13)

pdq_combinations = list(product(p_range, d_range, q_range))
combinations = []
rmses = []

# Fit an ARIMA model for every (p, d, q) combination and keep its test RMSE
for pdq in pdq_combinations:
    try:
        model = ARIMA(domestic_train, order=pdq).fit()
        prediction = model.predict(start=len(domestic_train), end=len(df))
        rmse = np.sqrt(mse(domestic_test, prediction))
        combinations.append(pdq)
        rmses.append(rmse)
    except Exception:
        continue

results = pd.DataFrame({'combination': combinations, 'rmse': rmses})

Even the best combinations give quite a high error, and looking at the plot we see that, although the model captures some of the fluctuations, the prediction is not very accurate.

# Best-scoring combination and its error
comb = results.loc[results.rmse.argmin(), 'combination']
error = results.rmse.min()

# Refit with the chosen order and forecast over the test period
model = ARIMA(domestic_train, order=(4, 0, 6)).fit()
forecast = model.predict(len(domestic_train), len(df))
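To compare the ARIMA forecast with the actual data, a minimal plotting sketch (assuming the variables defined above; the matplotlib styling is arbitrary) could be:

import matplotlib.pyplot as plt

# Plot training data, held-out test data and the ARIMA forecast
fig, ax = plt.subplots(figsize=(10, 4))
domestic_train.plot(ax=ax, label='train')
domestic_test.plot(ax=ax, label='test')
forecast.plot(ax=ax, label='ARIMA forecast')
ax.legend()
plt.show()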

SARIMA models

In our case, it's very clear that we have seasonality. We saw it when we first plotted the data, when we decomposed it, and finally in the ACF and PACF plots, with those clear peaks at lag 12. Seasonal AutoRegressive Integrated Moving Average (SARIMA) is an extension of ARIMA designed to capture both non-seasonal and seasonal patterns in time series data. The model is denoted as SARIMA(p, d, q)(P, D, Q)m, where (p, d, q) is the non-seasonal order (the same autoregressive order, degree of differencing and moving average order as in ARIMA), (P, D, Q) is the seasonal order, and m is the length of the seasonal cycle. The seasonal parameters (P, D, Q) mirror their non-seasonal counterparts but apply to the seasonal component.

The data we are looking at has a yearly seasonal component. Therefore, the m for us will be 12, as we have monthly data. Having a P of 1, for example, means that the model will evaluate each point based on an autoregression taking one seasonal cycle (12 months in our case) into account.

To save some space, we're not showing the for loop used to select the best parameters here.

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(domestic_train, order=(8, 0, 6), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
forecast = model.get_forecast(steps=len(df) - len(domestic_train))
predicted = forecast.predicted_mean
confidence_intervals = forecast.conf_int()
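The predicted mean and its confidence intervals can then be plotted together. A small sketch, assuming conf_int() returns the lower and upper bounds as its first and second columns (the statsmodels default):

import matplotlib.pyplot as plt

# Plot the SARIMA forecast with its confidence band over the test period
fig, ax = plt.subplots(figsize=(10, 4))
domestic_test.plot(ax=ax, label='test')
predicted.plot(ax=ax, label='SARIMA forecast')
ax.fill_between(confidence_intervals.index,
                confidence_intervals.iloc[:, 0],
                confidence_intervals.iloc[:, 1],
                alpha=0.2, label='confidence interval')
ax.legend()
plt.show()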

Conclusion

Time series forecasting is a vast field with applications in various domains, including finance, economics, and environmental science. This brief overview only touches on key concepts such as trends, seasonality, and ARIMA-family models. However, time series analysis offers a multitude of advanced techniques and models, each designed for specific data characteristics and forecasting requirements.

This introduction is just a stepping stone, so feel free to explore the diverse landscape of time series forecasting and discover many more tools available for predicting future trends and patterns.

Further steps

Beyond the introductory concepts, we could explore other machine learning algorithms like Long Short-Term Memory (LSTM) networks or ensemble methods that can capture intricate patterns that traditional models may overlook.

Another thing that I really want to do is integrate economic metrics to study the influence that the economy has on energy consumption. By correlating time series data with economic indicators such as GDP, inflation rates, or interest rates, we could see other relationships and maybe we could use them to refine predictive models. But that’s material for many other articles!
