2- Econometric & Statistical Models in Time Series

May 3

Don’t explain to me, show the code

ARIMA(p, d, q): Autoregressive Integrated Moving Average

ARIMA models are used to analyze time series data and forecast future values. The standard notation is ARIMA(p, d, q): p is the order of the autoregressive part (AR(p)), d is the number of times the series is differenced (each new series is obtained by subtracting each observation from the previous one), and q is the order of the moving average part (MA(q)). In this notation, ARIMA(1, 0, 0) is equivalent to AR(1), and ARIMA(0, 0, 1) is equivalent to MA(1).

If we want to summarize step by step:

• Forecasts are made from a linear combination of past observations and past forecast errors.
• It is suitable for univariate series that have a trend but no seasonality.
• p: number of lagged actual values (autoregressive order)
- If p = 2, the model uses yt-1 and yt-2.
• d: number of differencing operations (degree of differencing, the "I" part)
• q: number of lagged errors (moving average order)
• What does differencing mean?
- For example, subtracting yesterday's value from today's value.
• What does differencing accomplish; why do we difference?
- It stabilizes the series.
• What does it mean to stabilize the series?
- It removes the dependence of the series' statistical properties (such as mean and variance) on time, i.e. it makes the series stationary.
• In other words, differencing makes it possible to produce forecasts even for series that contain a trend.

SARIMA(p, d, q)(P, D, Q)m: Seasonal Autoregressive Integrated Moving Average

SARIMA is a method used in time series analysis to describe and forecast data. It extends the ARIMA model with additional parameters so that it can handle seasonal data: seasonal counterparts of the autoregressive, differencing, and moving-average terms (P, D, Q), plus the seasonal period m.

If we want to summarize step by step:

• SARIMA is ARIMA + seasonality.
• It can be used on univariate series that contain both trend and seasonality.
• p, d, q are the parameters from ARIMA; they are the trend elements. ARIMA could already model the trend.
• p: number of lagged actual values (autoregressive order)
- If p = 2, the model uses yt-1 and yt-2.
• d: number of differencing operations (degree of differencing)
• q: number of lagged errors (moving average order)
- If q = 2, the model uses et-1 and et-2.
• P, D, Q: the seasonal lag orders; the seasonal elements.
• m: the number of time steps in a single seasonal period; it defines the structure of the seasonality.
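To make m concrete: for monthly data with a yearly cycle, m = 12, and a seasonal difference (D = 1) compares each value with the one from the same month a year earlier. A small sketch on hypothetical monthly data (not the article's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: trend + a yearly cycle (period m = 12) + noise
rng = np.random.default_rng(0)
t = np.arange(120)
idx = pd.date_range("1990-01-01", periods=120, freq="MS")
y = pd.Series(0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=120),
              index=idx)

# One seasonal difference (D = 1) with period m = 12: y_t minus y_{t-12}
y_seasonal_diff = y.diff(12).dropna()

# The yearly cycle cancels out, so the differenced series varies far less
print(y.std(), y_seasonal_diff.std())
```

This is the seasonal analogue of the ordinary differencing step: D removes the seasonal pattern the way d removes the trend.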

0. Preparation of data

```python
train = florida[:"1994-12-01"]
len(train)
test = florida["1995-01-01":]
len(test)
```

1. Arima Model

```python
arima_model = ARIMA(train, order=(1, 0, 1)).fit(disp=0)
arima_model.summary()
y_pred = arima_model.forecast(225)[0]
mean_absolute_error(test, y_pred)
# 4.39981
```

For a clearer, less cluttered plot, I only visualize the data from 1985 onward.

```python
train["1985":].plot(legend=True, label='TRAIN')
test.plot(legend=True, label='TEST', figsize=(6, 4))
pd.Series(y_pred, index=test.index).plot(legend=True, label='Prediction')
plt.title("Train, Test and Predicted Test")
plt.show()
```

As you can see, our forecast does not look very good, so let's tune the model.

There are two commonly used techniques for choosing the order of these statistical models: first, determining the model order from the ACF and PACF plots; second, determining the model order from AIC statistics.

1.1 Determining the Model Order from the ACF & PACF Plots

```python
def acf_pacf(y, lags=30):
    plt.figure(figsize=(12, 7))
    layout = (2, 2)
    ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
    acf_ax = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))
    y.plot(ax=ts_ax)
    # Stationarity test (H0: series is not stationary. H1: series is stationary.)
    p_value = sm.tsa.stattools.adfuller(y)[1]
    ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
    smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
    smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
    plt.tight_layout()
    plt.show()

acf_pacf(florida)
```

From these plots we try to find better values for p and q.

The ACF plot shows the correlations between the time series and its values in previous periods. Here we look at lags up to 30.

So we are looking at yt-1, yt-2, …, all the way back to yt-30.

Autocorrelations at points that fall inside the light blue band on the plot are not statistically significant; here we have many significant and only a few insignificant autocorrelations.

If we want to summarize it briefly:

- If the ACF tails off (decays gradually) over the lags and the PACF cuts off after lag p, the pattern is AR(p).

- If the ACF cuts off after lag q and the PACF tails off over the lags, the pattern is MA(q).

- If both the ACF and the PACF tail off over the lags, the pattern is an ARMA model.

```python
df_diff = florida.diff()
df_diff.dropna(inplace=True)
acf_pacf(df_diff)
```

We can easily see the seasonality in the autocorrelations, but this visual technique does not give us an order we could use.

1.2 Determining the Model Order from AIC & BIC Statistics

Now we search over all parameter combinations in a chosen range and keep the combination that gives the best result.

```python
# Generation of combinations of p, d and q
p = d = q = range(0, 4)
pdq = list(itertools.product(p, d, q))

def arima_optimizer_aic(train, orders):
    best_aic, best_params = float("inf"), None
    for order in orders:
        try:
            arma_model_result = ARIMA(train, order).fit(disp=0)
            aic = arma_model_result.aic
            if aic < best_aic:
                best_aic, best_params = aic, order
            print('ARIMA%s AIC=%.2f' % (order, aic))
        except:
            continue
    print('Best ARIMA%s AIC=%.2f' % (best_params, best_aic))
    return best_params

best_params_aic = arima_optimizer_aic(train, pdq)
```

By adding the best values we have found to our ARIMA model, we create our tuned model.

```python
arima_model = ARIMA(train, best_params_aic).fit(disp=0)
y_pred = arima_model.forecast(225)[0]
mean_absolute_error(test, y_pred)
# 1.4470
```

When we visualize the tuned model, we can see that it is far more successful than the previous one.

2. SARIMA Model

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

train = florida[:"1994-12-01"]
len(train)
test = florida["1995-01-01":]
len(test)
val = train["1991-01-01":]
len(val)
```

2.1 Basic Model

We will do this with the SARIMAX class from statsmodels.

```python
model = SARIMAX(train, order=(1, 0, 1), seasonal_order=(0, 0, 0, 12))
sarima_model = model.fit(disp=0)
```

2.2 Validation Error

```python
pred = sarima_model.get_prediction(start=pd.to_datetime('1991-01-01'), dynamic=False)
pred_ci = pred.conf_int()
y_pred = pred.predicted_mean
mean_absolute_error(val, y_pred)
# 1.9192
```

2.3 Visualization of Validation Prediction

Our validation error looks pretty good compared to the base model.

2.4 Test Error

```python
y_pred_test = sarima_model.get_forecast(steps=225)
pred_ci = y_pred_test.conf_int()
y_pred = y_pred_test.predicted_mean
mean_absolute_error(test, y_pred)
# 17.0051
```

We see that our actual test error is now much higher. If we take this as the real score, the low validation error from the previous step may be a sign of overfitting.

2.5 Visualization of Prediction

We can see how bad the graph is.

2.6 Model Tuning

```python
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

def sarima_optimizer_aic(train, pdq, seasonal_pdq):
    best_aic, best_order, best_seasonal_order = float("inf"), float("inf"), None
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                sarimax_model = SARIMAX(train, order=param, seasonal_order=param_seasonal)
                results = sarimax_model.fit(disp=0)
                aic = results.aic
                if aic < best_aic:
                    best_aic, best_order, best_seasonal_order = aic, param, param_seasonal
                print('SARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, aic))
            except:
                continue
    print('SARIMA{}x{}12 - AIC:{}'.format(best_order, best_seasonal_order, best_aic))
    return best_order, best_seasonal_order

best_order, best_seasonal_order = sarima_optimizer_aic(train, pdq, seasonal_pdq)
```

Here we kept the range small (0 to 2) because of the large number of parameter combinations; you can widen it. We pass 12 as the last seasonal parameter because we already know the seasonal period.

2.7 Final Model and Its Test Error

```python
model = SARIMAX(train, order=best_order, seasonal_order=best_seasonal_order)
sarima_final_model = model.fit(disp=0)

################################
# Final Model Test Error
################################
y_pred_test = sarima_final_model.get_forecast(steps=225)
pred_ci = y_pred_test.conf_int()
y_pred = y_pred_test.predicted_mean
mean_absolute_error(test, y_pred)
# 1.013578872841977
```

2.8 Visualization of Final Model

2.9 SARIMA Optimizer Based on MAE

We optimized the model according to AIC, but we could also optimize it according to MAE; I added this extra section to show how.

```python
def fit_model_sarima(train, val, pdq, seasonal_pdq):
    sarima_model = SARIMAX(train, order=pdq, seasonal_order=seasonal_pdq).fit(disp=0)
    y_pred_val = sarima_model.get_forecast(steps=48)
    y_pred = y_pred_val.predicted_mean
    return mean_absolute_error(val, y_pred)

fit_model_sarima(train, val, (0, 1, 0), (0, 0, 0, 12))

p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

def sarima_optimizer_mae(train, val, pdq, seasonal_pdq):
    best_mae, best_order, best_seasonal_order = float("inf"), float("inf"), None
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mae = fit_model_sarima(train, val, param, param_seasonal)
                if mae < best_mae:
                    best_mae, best_order, best_seasonal_order = mae, param, param_seasonal
                print('SARIMA{}x{}12 - MAE:{}'.format(param, param_seasonal, mae))
            except:
                continue
    print('SARIMA{}x{}12 - MAE:{}'.format(best_order, best_seasonal_order, best_mae))
    return best_order, best_seasonal_order

best_order, best_seasonal_order = sarima_optimizer_mae(train, val, pdq, seasonal_pdq)

model = SARIMAX(train, order=best_order, seasonal_order=best_seasonal_order)
sarima_final_model = model.fit(disp=0)
y_pred_test = sarima_final_model.get_forecast(steps=225)
pred_ci = y_pred_test.conf_int()
y_pred = y_pred_test.predicted_mean
mean_absolute_error(test, y_pred)
# 0.92
```

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by

Ogulcan Ertunc

I’m a Master Student🎓pursuing Data Analytics. I’m a Data Enthusiast 💻 😃 passionate about learning and working with new tech. https://github.com/ogulcanertunc
