Influence of the dependent variable (y)’s scale on AIC and BIC
Introduction
When considering model selection criteria for nested statistical models, AIC and BIC usually come to mind. These two metrics are derived from the maximized log-likelihood value, with a penalty on the number of parameters.
However, the influence of the dependent variable y’s scale on these metrics has not been stressed enough.
Here, I will illustrate this through a simulated dataset.
(The full exercise is available on my GitHub.)
Simulated Data Setting
Scenario
Now, forget about the simulated setting for a moment. Imagine we are given the following dataset:
Let’s say that after some initial exploration, we narrow down to 2 candidate models (both linear regressions, but with different features):
We decided to use AIC as the model selection criterion. Since model 1 has the smaller AIC, we pick model 1 as our final choice.
Now, imagine we are given another dataset, with a different outcome w but the same feature space. Again, we use AIC and test the same two models.
This time, the AIC is smaller for model 2.
So, here is what has happened so far: we were given 2 datasets with the same feature space. Dataset 1 yields a smaller AIC for candidate model 1 (w ~ x1+x5+x6+x7+x8+x9), while dataset 2 yields a smaller AIC for candidate model 2 (w ~ x5+x6+x7+x8+x9).
Surprise Surprise ~
Now, behold the big revelation:
In dataset 1: w = y
In dataset 2: w = y⁴
Since this is simulated data, we know that y depends only on X1, X2, X3, X4. However, by changing the scale of y, we find that AIC actually prefers model 2 (which does not contain any of the true predictors!).
Level of Scale Change on the Outcome
If we repeat this on different datasets, with the outcome variable w created by raising y to increasing powers p (as in the w = y⁴ example above):
When p ≥ 3.5, AIC starts to pick the wrong model. In other words, at that point the noise in the outcome overshadows the true signal from the features.
What if the effect of X1 on y is 10 times stronger?
Recall that X1’s effect on y is not too strong:
y = 2*X1 + 2*X2 + 10*X3 + 100*X4
What if we strengthen the relationship between y and X1, say by a factor of 10?
y = 20*X1 + 2*X2 + 10*X3 + 100*X4
Then, model 2 has the smaller AIC only when p ≥ 9.
This implies that the scale of the outcome can affect AIC’s ability to pick the right model. However, if the effect of the features is strong, the selection will be less susceptible to the scale of y.
Conclusion
We need to be careful when using metrics like AIC and BIC for model selection, especially when the values across candidate models are relatively close.
How do we know if the values are close? We can raise the outcome to some power (e.g., 3 or 4) and see whether the top model choice swaps.
We also showed that a different scale of y will likely lead to different selection results. However, if the effect of X is stronger, the selection result will be much more robust.