2- Econometric & Statistical Models in Time Series

Published in

Analytics Vidhya

9 min readMay 3, 2021

Update: This article is part of a series in which I explored the time series. Check out the full series: Part 1, Part 2, Part 3.

Introduction

I have monthly average temperatures from Florida since 1743, which I discussed in the previous article. In this article, firstly I will explain basically some models and their definition and show you basic operations in time series using models such as ARIMA, SARIMA with this dataset.

Don’t explain to me, show the code

You can access the GitHub repo here.

AR(p) : Autoregression

Estimation is made by a linear combination of observations from previous time steps.
It is suitable for univariate time series without trend or seasonality.
p: time delay. If p = 1, it means that the model was established with the previous time step.

MA(q): Moving Average

Estimation is done with a linear combination of errors obtained in previous time steps.
It is suitable for univariate time series without trend or seasonality.
q: is the time delay number.

ARMA(p,q) = AR(p) +MA(q) Autoregressive Moving Average

Combines AR and MA methods.
Estimation is made with a linear combination of historical values and past errors.
It is suitable for univariate time series without trend or seasonality.
p and q are time-delay numbers. p for AR model q for MA model.

ARIMA(p, d, q) : (Autoregressive Integrated Moving Average)

ARIMA models are used to review time series data and make predictions for the future. Common usage is Arima (p, d, q). p is used for the autoregressive (ar (p)) part, and d indicates how many times the data series differ. (the new series obtained by subtracting each data from the previous data) Finally, q is used for the moving average (ma (q)) part. In this case, Arima (1,0,0) and ar (1) are the same.
arima (0,0,1) and ma (1) are the same.

If we want to summarize step by step:

Estimation is made by a linear combination of observations and errors that differed from previous time steps.
It is suitable for the univariate, trend but not seasonal data.
p: actual value delay number (autoregressive degree)
- If p = 2, it is in yt-1 and yt-2 models.
d: number of different operations (degree of difference, I)
q: error delay number (moving average degree)
What does it mean to be noticed?
- For example, it is to subtract the values of today and the previous day from each other.
What does the spotting process do, why do we get noticed?
- To stabilize the series.
What does it mean to stabilize the series?
- To eliminate the situation that the statistical properties of the series change with time.
As it can be understood, the process of taking the difference here has created the opportunity to make predictions in the series that are now in trend.

SARIMA (p, d, q) (P, D, Q)m: (Seasonal Autoregressive Integrated Moving-Average)

It is a method used especially in time series analysis and used to explain and predict data. In addition to the Arima model, there may be 3 more parameters than the purification model for use in seasonal data. These parameters are the autocorrelation, difference, etc. of seasonal data. are the values.

If we want to summarize step by step:

ARIMA + is seasonality
It can be used in univariate series that include trend and seasonality.
p, d, q are parameters from ARIMA. Trend elements *. ARIMA was able to model the trend.
p: actual value delay number (autoregressive degree)
- If P = 2, it is in yt-1 and yt-2 models.
d: number of different transactions (degree of difference)
q: error delay number (moving average degree)
If q = 2, it is in the et-1 and et-2 models.
P, D, Q seasonal lag numbers. Season elements.
m the number of time steps for a single seasonal period. It expresses the structure of seasonality.

0. Preparation of data

First of all, we divide our data as a train/test for our Arima Model. As in the previous article, while determining the data found before 1995 as a train set.

train = florida[:"1994-12-01"]
len(train)
test = florida["1995-01-01":]
len(test)

1. Arima Model

As the first step, we set up a basic Arima model and look at our test error, without applying any tune operation.

arima_model = ARIMA(train, order = (1,0,1)).fit(disp=0)
arima_model.summary()

y_pred= arima_model.forecast(225)[0]
mean_absolute_error(test, y_pred)
# 4.39981

In visualization, I visualize after 1985 for a comfortable and clear visual.

train["1985":].plot(legend=True, label = 'TRAIN')
test.plot(legend=True, label = 'TEST', figsize = (6,4))
pd.Series(y_pred, index=test.index).plot(legend=True, label = 'Prediction')
plt.title("Train, Test and Predicted Test")
plt.show()

As you can see, our guess doesn’t seem like a very good one. So let’s do a tuning process.

There are two commonly used techniques to optimize these statistical models. First, determining the model grade based on ACF and PACF charts. Second, determining the model grade based on AIC statistics.

1.1 Determining the model grade according to ACF & PACF Graphs

def acf_pacf(y, lags=30):plt.figure(figsize=(12, 7))layout = (2, 2)ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)acf_ax = plt.subplot2grid(layout, (1, 0))pacf_ax = plt.subplot2grid(layout, (1, 1))y.plot(ax=ts_ax)# Durağanlık testi (HO: Series is not Stationary. H1: Series is Stationary.)p_value = sm.tsa.stattools.adfuller(y)[1]ts_ax.set_title(‘Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}’.format(p_value))smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)plt.tight_layout()plt.show()acf_pacf(florida)

Here we are trying to give a better value to the p and q values in the graph.

Here, it is to look at the correlations between the ACF graph and the values of the previous periods in a time series. Here, we are looking at transactions with a lag value of 30.

So we’re looking at yt-1, yt-2, …. extending to yt-30.

Autocorrelation will not be significant at the points remaining in the light blue area on the graph, we have many meaningful but few meaningless autocorrelations.

If we want to summarize it briefly:

- It means that if the ACF width “DECREASES” according to the delays, and the PACF p is “CUT OFF” after the delay AR (p) pattern.

- If the ACF width q is “CUT OFF” after the delay, and the PACF width “DECREASES” according to the delay, it means that it is the MA (q) pattern.

- If the width of ACF and PACF decreases according to the lags, it means it is an ARMA model.

df_diff = florida.diff()df_diff.dropna(inplace=True)acf_pacf(df_diff)

We can easily see seasonality in autocorrelation. We cannot see a grade from this visual technique that could help determine a grade.

2 Determining Model Rank According to AIC & BIC Statistics

Now we are trying to find the best parameters by trying all combinations in the range we have determined to find the values that can give us the best result.

# Generation of combinations of p and q

p = d = q = range(0, 4)
pdq = list(itertools.product(p, d, q))

def arima_optimizer_aic(train, orders):
    best_aic, best_params = float("inf"), None

    for order in orders:
        try:
            arma_model_result = ARIMA(train, order).fit(disp=0)
            aic = arma_model_result.aic
            if aic < best_aic:
                best_aic, best_params = aic, order
            print('ARIMA%s AIC=%.2f' % (order, aic))
        except:
            continue
    print('Best ARIMA%s AIC=%.2f' % (best_params, best_aic))
    return best_params

best_params_aic = arima_optimizer_aic(train, pdq)

By adding the best values we have found to our ARIMA model, we create our tuned model.

arima_model = ARIMA(train, best_params_aic).fit(disp=0)
y_pred = arima_model.forecast(225)[0]
mean_absolute_error(test, y_pred)
# 1.4470

When we visualize our tuned model, we can say that we have a very successful model result compared to the previous one.

2. SARIMA Model

First of all, we have performed this step as we will need to create a validation in our dataset.

from statsmodels.tsa.statespace.sarimax import SARIMAX

train = florida[:"1994-12-01"]
len(train)
test = florida["1995-01-01":]
len(test)
val = train["1991-01-01":]
len(val)

2.1 Basic Model

We will do this with the function called SARIMAX.

model = SARIMAX(train, order=(1,0,1), seasonal_order=(0,0,0,12))
sarima_model = model.fit(disp=0)

2.2 Validation Error

pred = sarima_model.get_prediction(start = pd.to_datetime('1991-01-01'),dynamic=False)
pred_ci = pred.conf_int()

y_pred = pred.predicted_mean
mean_absolute_error(val, y_pred)
# 1.9192

2.3 Visualization of Validation Prediction

Our validation error looks pretty good compared to the base model.

2.4 Test Error

y_pred_test = sarima_model.get_forecast(steps=225)
pred_ci = y_pred_test.conf_int()
y_pred = y_pred_test.predicted_mean
mean_absolute_error(test, y_pred)
# 17.0051

We see that our real error is now relatively high. If we think of this as the actual score, the low validation error we saw in the previous step may indicate overfit.

2.5 Visualization of Prediction

We can see how bad the graph is.

2.6 Model Tuning

p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

def sarima_optimizer_aic(train, pdq, seasonal_pdq):
    best_aic, best_order, best_seasonal_order = float("inf"), float("inf"), None
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                sarimax_model = SARIMAX(train, order=param, seasonal_order=param_seasonal)
                results = sarimax_model.fit(disp=0)
                aic = results.aic
                if aic < best_aic:
                    best_aic, best_order, best_seasonal_order = aic, param, param_seasonal
                print('SARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, aic))
            except:
                continue
    print('SARIMA{}x{}12 - AIC:{}'.format(best_order, best_seasonal_order, best_aic))
    return best_order, best_seasonal_order

best_order, best_seasonal_order = sarima_optimizer_aic(train, pdq, seasonal_pdq)

Here we tried to keep the range small (2) due to the excess of variables. You can increase the range. Later, we wrote 12 because we know the seasonal period.

2.7 Final Model and Its Test Error

model = SARIMAX(train, order=best_order, seasonal_order=best_seasonal_order)
sarima_final_model = model.fit(disp=0)
###############################
### Final Model Test Error ###
###############################

y_pred_test = sarima_final_model.get_forecast(steps=225)
pred_ci = y_pred_test.conf_int()
y_pred = y_pred_test.predicted_mean
mean_absolute_error(test, y_pred)
# 1.013578872841977

2.8 Visualization of Final Model

2.9 SARIMA Optimizer Based on MAE

We optimized the model according to AIC, but we could also do this according to MAE. For this, I wanted to show it by opening an extra section here.

def fit_model_sarima(train, val, pdq, seasonal_pdq):
    sarima_model = SARIMAX(train, order=pdq, seasonal_order=seasonal_pdq).fit(disp=0)
    y_pred_val = sarima_model.get_forecast(steps=48)
    y_pred = y_pred_val.predicted_mean
    return mean_absolute_error(val, y_pred)

fit_model_sarima(train, val, (0, 1, 0), (0, 0, 0, 12))

p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]


def sarima_optimizer_mae(train, val, pdq, seasonal_pdq):
    best_mae, best_order, best_seasonal_order = float("inf"), float("inf"), None
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mae = fit_model_sarima(train, val, param, param_seasonal)
                if mae < best_mae:
                    best_mae, best_order, best_seasonal_order = mae, param, param_seasonal
                print('SARIMA{}x{}12 - MAE:{}'.format(param, param_seasonal, mae))
            except:
                continue
    print('SARIMA{}x{}12 - MAE:{}'.format(best_order, best_seasonal_order, best_mae))
    return best_order, best_seasonal_order


best_order, best_seasonal_order = sarima_optimizer_mae(train, val, pdq, seasonal_pdq)

model = SARIMAX(train, order=best_order, seasonal_order=best_seasonal_order)
sarima_final_model = model.fit(disp=0)

y_pred_test = sarima_final_model.get_forecast(steps=225)
pred_ci = y_pred_test.conf_int()
y_pred = y_pred_test.predicted_mean
mean_absolute_error(test, y_pred)
# 0.92