Influence of dependent variable (y)’s scale on AIC, BIC

Miao Wang · Published in Analytics Vidhya · Jan 19, 2021

Introduction

When considering model selection criteria for nested statistical models, AIC and BIC usually come to mind. Both metrics are derived from the maximized log-likelihood value, with a penalty that trades off against the number of predictors.

AIC = 2k − 2 ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood (formula from Wikipedia)
BIC = k ln(n) − 2 ln(L̂), where n is the number of observations. BIC puts a heavier penalty on the number of parameters whenever ln(n) > 2 (i.e. n ≥ 8) and thus tends to prefer a simpler model (formula from Wikipedia)
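As a quick illustration (this snippet is not from the original post), the two formulas can be checked against the AIC and BIC that statsmodels reports for an ordinary least-squares fit; the toy data below are arbitrary:

```python
import numpy as np
import statsmodels.api as sm

# Arbitrary toy data: 200 observations, 3 predictors, only two of which matter.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=200)

res = sm.OLS(y, sm.add_constant(x)).fit()
k = res.df_model + 1                         # estimated parameters (incl. intercept)
n = res.nobs

print(res.aic, 2 * k - 2 * res.llf)          # AIC = 2k - 2*ln(L)
print(res.bic, k * np.log(n) - 2 * res.llf)  # BIC = k*ln(n) - 2*ln(L)
```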

However, the influence of the dependent variable y’s scale on these metrics has not been stressed enough.

Here, I will try to illustrate this with a simulated dataset.

(full exercise available on my GitHub)

Simulated Data Setting

1e4 observations, with 9 features, each following a gamma distribution
noise follows a standard normal distribution
y = 2*X1 + 2*X2 + 10*X3 + 100*X4 + noise (how y is actually generated; this would be unknown in reality)
N = 1000, N_FEATURES = 9
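Below is a minimal sketch of this setup in Python; the gamma shape and scale parameters and the random seed are my assumptions, and the exact settings live in the GitHub notebook linked above:

```python
import numpy as np
import pandas as pd

N, N_FEATURES = 1000, 9
rng = np.random.default_rng(42)

# 9 features, each drawn from a gamma distribution
X = pd.DataFrame(
    rng.gamma(shape=2.0, scale=1.0, size=(N, N_FEATURES)),
    columns=[f"x{i}" for i in range(1, N_FEATURES + 1)],
)

# y depends only on x1..x4, plus standard normal noise
noise = rng.standard_normal(N)
y = 2 * X["x1"] + 2 * X["x2"] + 10 * X["x3"] + 100 * X["x4"] + noise
```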

Scenario

Now, forget about the simulated setting for a moment. Imagine we are given the following dataset:

dataset 1

Let’s say that after some initial exploration, we narrow it down to 2 candidate models (both linear regressions, but with different features):

candidate model 1: w ~ x1 + x5 + x6 + x7 + x8 + x9
candidate model 2: w ~ x5 + x6 + x7 + x8 + x9

We decided to use AIC as the model selection criterion. Since model 1 gives the smaller AIC, we pick model 1 as our final choice.

model selection result for dataset 1
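Here is a sketch of this comparison using the statsmodels formula API; the file name dataset1.csv is hypothetical and simply stands in for dataset 1 (columns w and x1 through x9):

```python
import pandas as pd
import statsmodels.formula.api as smf

# "dataset1.csv" is a hypothetical file holding dataset 1 (columns w, x1, ..., x9).
df = pd.read_csv("dataset1.csv")

m1 = smf.ols("w ~ x1 + x5 + x6 + x7 + x8 + x9", data=df).fit()  # candidate model 1
m2 = smf.ols("w ~ x5 + x6 + x7 + x8 + x9", data=df).fit()       # candidate model 2

print("model 1 AIC:", m1.aic)
print("model 2 AIC:", m2.aic)
```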

Now, imagine we are given another dataset, with a different outcome w but the same feature space. Again, we use AIC to compare the same two candidate models.

dataset 2
model selection result for dataset 2

This time, the AIC is smaller for model 2.

So, here is what has happened so far: we were given 2 datasets with the same feature space. Dataset 1 has a smaller AIC for candidate model 1 (w ~ x1+x5+x6+x7+x8+x9), while dataset 2 has a smaller AIC for candidate model 2 (w ~ x5+x6+x7+x8+x9).

Surprise Surprise ~

Now, behold the big revelation:

In dataset 1: w = y

In dataset 2: w = y⁴

Since this is simulated data, we know that y only depends on X1, X2, X3, X4. However, by changing the scale of y, we find that AIC actually prefers model 2 (which does not contain any of the real predictors!).

Level of Scale Change on the Outcome

Now let’s repeat this exercise on a series of datasets, with the outcome variable created as w = yᵖ for different powers p (the previous example corresponds to p = 1 and p = 4).

When p ≥ 3.5, AIC starts to pick the wrong model. In other words, at that point the noise in the outcome overshadows the true signal from the features.

each line represents model selection results on a new dataset (wj, x1j, …, x9j), where wj = yj**p
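Here is a sketch of this sweep, reusing the simulated X, y and noise from the earlier snippet. The grid of p values is my choice, and unlike the figure above (where each p is evaluated on a freshly simulated dataset), the sketch reuses a single dataset:

```python
import numpy as np
import statsmodels.formula.api as smf

def aic_pick(X, y_vec, p):
    """Return which candidate model AIC prefers when the outcome is w = y**p."""
    df = X.assign(w=y_vec ** p)
    aic1 = smf.ols("w ~ x1 + x5 + x6 + x7 + x8 + x9", data=df).fit().aic
    aic2 = smf.ols("w ~ x5 + x6 + x7 + x8 + x9", data=df).fit().aic
    return "model 1 (contains x1)" if aic1 < aic2 else "model 2 (noise features only)"

for p in np.arange(1.0, 6.5, 0.5):
    print(f"p = {p:.1f}: AIC prefers {aic_pick(X, y, p)}")
```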

What if the effect of X1 on y is 10 times stronger?

Recall that X1’s effect on y is not very strong:

y = 2*X1 + 2*X2 + 10*X3 + 100*X4

What if we increase X1’s effect on y, say by a factor of 10?

y = 20*X1 + 2*X2 + 10*X3 + 100*X4

Then model 2 only achieves the smaller AIC once p ≥ 9.

same exercise, on a new dataset where X1’s effect on y is 10 times stronger than before
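The same sweep can be repeated after boosting the coefficient of x1 from 2 to 20, reusing the aic_pick helper sketched above (the range of p is again my choice):

```python
# Regenerate the outcome with x1's effect 10 times stronger.
y_strong = 20 * X["x1"] + 2 * X["x2"] + 10 * X["x3"] + 100 * X["x4"] + noise

for p in range(1, 12):
    print(f"p = {p}: AIC prefers {aic_pick(X, y_strong, p)}")
```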

This implies that the scale of the outcome can affect AIC’s ability to pick the right model. However, when the effect of the features is strong, the selection is much less susceptible to the scale of y.

Conclusion

We need to be careful when using metrics like AIC and BIC for model selection, especially when the values for the candidate models are relatively close.

How do we know whether the values are close? One quick check is to raise the outcome to some power (e.g. 3 or 4) and see whether the top model choice swaps, as sketched below.
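A minimal version of that sanity check, using the two candidate formulas from this post as an example:

```python
import numpy as np
import statsmodels.formula.api as smf

# The two candidate formulas from this post; swap in your own candidates.
formulas = [
    "w ~ x1 + x5 + x6 + x7 + x8 + x9",  # candidate model 1
    "w ~ x5 + x6 + x7 + x8 + x9",       # candidate model 2
]

def aic_winner(data):
    """Index of the candidate formula with the smallest AIC on the given data."""
    aics = [smf.ols(f, data=data).fit().aic for f in formulas]
    return int(np.argmin(aics))

def selection_is_stable(df, power=3):
    """True if the AIC winner stays the same after raising the outcome to `power`."""
    return aic_winner(df) == aic_winner(df.assign(w=df["w"] ** power))
```

If selection_is_stable returns False, the AIC values of the candidates were too close for the original pick to be trusted.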

We also showed that a different scale of y is likely to lead to a different selection result. However, when the effect of X is stronger, the selection result is much more robust.
