A Complete Introduction To Time Series Analysis (with R): Model Selection for ARMA(p,q)

In the last section, we learned about Gaussian Time Series, a powerful and flexible assumption when it comes to ARMA(p,q) parameter estimation. In this article, we will see how to select the “best” model among a number of fitted models.
Maximum Likelihood Model Selection
In the previous section, we saw how Gaussian assumptions allow us to write down and maximize the likelihood of an ARMA(p,q) process to obtain parameter estimates for the thetas and alphas. That is, for any model, we are looking for the parameters that maximize the likelihood of the observed data.
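In symbols, collecting the ARMA coefficients and the noise variance into a single parameter vector β (a generic notation of my own), this reads roughly as

$$\hat{\beta} \;=\; \arg\max_{\beta}\; L(\beta \mid x_1, \dots, x_n).$$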

Equivalently, for ease of optimization, we generally work with the log-likelihood.
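For a zero-mean Gaussian series with covariance matrix Γ_n, one standard way to write it (my rendering; the original display may use a different notation) is

$$\ell(\beta) \;=\; \log L(\beta \mid x_1,\dots,x_n) \;=\; -\frac{n}{2}\log(2\pi) \;-\; \frac{1}{2}\log\lvert\Gamma_n(\beta)\rvert \;-\; \frac{1}{2}\,\mathbf{x}^\top \Gamma_n(\beta)^{-1}\mathbf{x}.$$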

Now, imagine we have some time series X_{t}, and we fit two models: an ARMA(4,2) and an ARMA(5,3). The question is, can’t we just use the raw likelihood of each of these models to choose one over the other? Definitely not! The fitted parameters are chosen precisely to maximize the likelihood (at least locally), so the likelihood itself is a very “optimistic” measure of fit. Why is this?
- Suppose that we calculate the log-likelihood for M different models.
- We could choose the model with the largest likelihood.
- However, the log-likelihood is an optimistic measure of fit because choosing the model with the largest log-likelihood can easily mean choosing the model with the most parameters!
- Indeed, choosing a model with more parameters may appear to provide a better fit simply because the smaller models are nested within it.

So, how do we solve this issue? The idea is to add regularization by penalizing model complexity, that is, to consider a penalized version of the log-likelihood.
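A plausible form of such a penalized criterion (my sketch, with ℓ(β̂) the maximized log-likelihood and k the number of estimated parameters) is

$$f\!\big(\ell(\hat{\beta})\big) \;+\; \mathrm{penalty}(k),$$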

where, in general, f(x) := -2x, and we then perform model selection by choosing whichever model has the lowest value. The selection criteria we consider for this task are the AIC, AICc, and BIC metrics, presented next.
AIC, AICc, and BIC metrics
Consider the log-likelihood of some maximized ARMA(p,q) model. To assess fit, the following metrics are defined:
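With k = p + q + 1 estimated parameters and n observations, the standard definitions (the exact parameter-count convention may differ in the original) are

$$\mathrm{AIC} = -2\,\ell(\hat{\beta}) + 2k, \qquad \mathrm{AICc} = -2\,\ell(\hat{\beta}) + \frac{2kn}{n-k-1}, \qquad \mathrm{BIC} = -2\,\ell(\hat{\beta}) + k\log n.$$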

You can find a very nice theoretical derivation of these here, but I will not go over the details. However, here are a couple of notes:
- AIC stands for “Akaike Information Criterion”, and comes from minimizing the Kullback-Leibler divergence, a statistical measure of how far apart two distributions are.
- BIC follows a similar principle, but stands for “Bayesian Information Criterion”.
- AICc performs similarly to the AIC in practice, but corrects the penalty for small sample sizes, where the AIC tends to overfit.
- BIC focuses more on assessing the true fit of the model, often sacrificing some predictive power for the sake of a model that is closer to the truth.
- For n > e², BIC penalizes more than the AIC/AICc and tends to prefer simpler models.
- BIC tends to perform badly when the underlying data is not normal.
- In practice, we calculate all three metrics for every model. If they disagree, we investigate the models in question further and decide whether the objective is predictive power or a model that is closer to the true model.
How to R
As usual, we start by importing some libraries:
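A minimal sketch of the imports (the exact packages used in the original post are not shown; itsmr and forecast are my assumptions):

```r
# Packages assumed for this walkthrough: 'itsmr' for the residual test
# battery and 'forecast' for auto.arima() and forecasting.
# 'forecast' is loaded second so that its forecast() masks itsmr's.
library(itsmr)
library(forecast)
```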
In this example, we are going to work with the Wave Tank data, in which a group of researchers simulated the properties of waves in a tank according to certain disturbances. The data can be downloaded from here. First, we read and inspect the data, and then transform it into an appropriate ts object:
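A sketch of this step; the file and column names below are placeholders, so substitute the ones from the downloaded data set:

```r
# Read and inspect the wave tank data (file name is a placeholder).
wave <- read.table("wave.dat", header = TRUE)
head(wave)
str(wave)

# Turn the observed wave heights (first column assumed) into a ts object.
wave_ts <- ts(wave[[1]])
```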
Let’s visualize the data and its ACF and PACF:
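One way to produce these plots, using the wave_ts object from the sketch above:

```r
# Time series plot plus sample ACF and PACF.
plot(wave_ts, main = "Wave tank data", ylab = "Wave height")
acf(wave_ts, lag.max = 40, main = "Sample ACF")
pacf(wave_ts, lag.max = 40, main = "Sample PACF")
```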



Because these are waves, it is no surprise that there is seasonality. We also see considerable attenuation in the autocorrelation. Next, we are going to fit a bunch of models, which we will compare later. We will also inspect the residuals of each.
AR(2)
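A sketch of the fit and the residual check; checkresiduals() from the forecast package is my choice here, and the original may use a different diagnostic call:

```r
# Fit an AR(2) by maximum likelihood and run residual diagnostics
# (residual plot, residual ACF, histogram, and a Ljung-Box test).
fit_ar2 <- arima(wave_ts, order = c(2, 0, 0))
checkresiduals(fit_ar2)
```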

The model is such a bad fit that the test’s degrees of freedom don’t even make sense. If we check the residuals, we do see a number of lags that fall outside the confidence bounds; however, the underlying data looks approximately normal. It might be hard to determine whether the residuals are white noise or not.
AR(4)
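The analogous fit, following the same sketch:

```r
# Fit an AR(4) and inspect its residuals.
fit_ar4 <- arima(wave_ts, order = c(4, 0, 0))
checkresiduals(fit_ar4)
```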

Same as before.
ARMA(2,2)
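And now with MA terms added:

```r
# ARMA(2,2): two AR and two MA coefficients.
fit_arma22 <- arima(wave_ts, order = c(2, 0, 2))
checkresiduals(fit_arma22)
```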

Once again, we still get a lot of points outside the boundaries in the ACF. It’s hard to say whether the residuals are white noise or not.
ARMA(4,4)
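The largest candidate, following the same sketch:

```r
# ARMA(4,4): four AR and four MA coefficients.
fit_arma44 <- arima(wave_ts, order = c(4, 0, 4))
checkresiduals(fit_arma44)
```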

Okay, this one is much better than the previous ones, but now normality looks somewhat dubious.
Stationarity tests
Let’s use another function to verify the residuals, like the ones we saw in this article. Let’s start with the ARMA(2,2) we fitted before:
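My assumption for the function used here is itsmr’s test(), which runs a battery of iid-noise tests and produces the plots described below:

```r
# Ljung-Box, McLeod-Li, turning points, difference-sign and rank tests,
# plus ACF/PACF and a normal QQ plot of the ARMA(2,2) residuals.
itsmr::test(residuals(fit_arma22))
```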

Most of the tests fail to reject the null hypothesis of iid noise, except for the difference-sign test. The residuals do look like white noise, and the normal QQ plot also supports the underlying normality assumption. How about the ARMA(4,4)?
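The same call for the ARMA(4,4) residuals:

```r
# Battery of iid-noise tests for the ARMA(4,4) residuals.
itsmr::test(residuals(fit_arma44))
```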

Similar to before, but we obtain much stronger results. You might think that this is a reasonable model to go ahead with. However! You might be uncomfortable using only these diagnostics. Why? Because they do not really tell us which model is actually better!
AIC, AICc, BIC
Instead, we should use the evaluation metrics we presented at the beginning of this article to perform model selection. We do it as follows:
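A simple way to do this with base R, using the model objects from the sketches above:

```r
# AIC and BIC for every candidate model; lower values are better.
AIC(fit_ar2, fit_ar4, fit_arma22, fit_arma44)
BIC(fit_ar2, fit_ar4, fit_arma22, fit_arma44)
```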
Here, the first output is the AIC for all the models, and the second one is the BIC. We see that based on both the AIC and BIC, we should indeed pick the ARMA(4,4) model.
AutoArima
One awesome tool that you can use, however, is the auto.arima function, which automatically fits a number of models, reports their AIC/BIC accordingly, and returns the model with the lowest evaluation metric among all those fitted. How convenient! Let’s start by fitting based on the AIC criterion:
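A sketch of the call; the argument names come from the forecast package, and the exact options used originally are not shown:

```r
# Search over ARIMA orders using the AIC; trace = TRUE prints the
# criterion value of every candidate model it tries.
fit_aic <- auto.arima(wave_ts, ic = "aic", trace = TRUE)
```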
Note that for some of these, the value is reported as Inf. This means the fitted model is so bad that the AIC metric ends up making no sense! The output model is an ARIMA(4,0,1) = ARMA(4,1) model (we will explore ARIMA and SARIMA models later). We can inspect the resulting model:
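For example:

```r
# Coefficients, sigma^2 estimate, log-likelihood and information criteria.
summary(fit_aic)
```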
We can repeat this exercise using AICc and BIC:
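For example:

```r
# Repeat the automatic search under the AICc and the BIC.
fit_aicc <- auto.arima(wave_ts, ic = "aicc", trace = TRUE)
fit_bic  <- auto.arima(wave_ts, ic = "bic",  trace = TRUE)
```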
We see that under all criteria, we obtain the same best model. This is not always the case, but usually, this is what happens. If you end up with different models, choose according to whether you need more prediction power (AIC) or a more “truthful” model (BIC). Let’s check the residuals:
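For example, with the model selected above:

```r
# Residual diagnostics for the automatically selected model.
checkresiduals(fit_aic)
```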

This is not that bad; although we see that we are still missing part of the seasonal component, this is a good first attempt. We will revisit this in the SARIMA models later.
Predictions
Now, let’s go with my favorite part: producing some predictions! Can you remember which algorithm(s) is/are used for this? :)
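A sketch using forecast() on the automatically selected model; the horizon of 20 steps is my choice for illustration:

```r
# Forecast the next 20 observations and plot the point forecasts
# together with their 80% and 95% prediction intervals.
fc <- forecast(fit_aic, h = 20)
fc        # table of point forecasts and interval bounds
plot(fc)  # original series with the forecast fan appended
```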


In the table above, we see the point forecasts, together with 80% and 95% prediction intervals, which tell us where the true future observations are most likely to lie.
Next Time
That’s it! Next time, we will explore the ARIMA(p,d,q) models, and later, the so-called Seasonal ARIMA or SARIMA models. Stay tuned, and happy learning!