Overfitting your data: Don’t do it!
This week’s reading briefly discussed the concept of overfitting a model. The concept seemed important, so I decided to investigate it further.
When choosing a model, it is important to ensure that the model fits your data set and that each variable included in the model genuinely supports the findings of your research. James et al. (2013) explain that if a variable provides only a tiny increase in R² (the square of the correlation between the observed and fitted values), you are better off leaving it out of the model. Including it will likely lead to poor results on independent test samples because the model has been overfit.
Overfitting is closely tied to sample size: the larger the sample, the less likely you are to overfit a model when investigating complex research questions. Overfitting a model can cause the “regression coefficients, p-values, and R-squared to be misleading” (Frost, 2015). In the advertising example from James et al. (2013), the R² of the model with all three advertising media (TV, radio, and newspaper) was 0.8972; the R² of the model with only TV and radio was 0.89719. Although the coefficient for newspaper was not significant, including it still increased R² slightly. This happens because adding another variable to a least squares model lets it fit the training data at least as closely as before, so R² on the training data never decreases, even when the fit to new data does not improve (James et al., 2013).
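The behavior of R² described above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the data are simulated rather than taken from the advertising data set, and the helper names (solve, fit_ols, r_squared, simulate) are hypothetical. It fits one model with a real predictor and a second model that also includes a pure-noise predictor, then compares R² on the training sample and on a fresh sample.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_ols(X, y):
    """Least squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(XtX, Xty)

def r_squared(beta, X, y):
    """R^2 = 1 - RSS/TSS for the given coefficients on the given data."""
    pred = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    rss = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - rss / tss

rng = random.Random(0)  # fixed seed so the run is repeatable

def simulate(n):
    """Simulated data: y depends on x1 only; the second column is pure noise."""
    rows, y = [], []
    for _ in range(n):
        x1, noise = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([x1, noise])
        y.append(2 + 3 * x1 + rng.gauss(0, 1))
    return rows, y

train_X, train_y = simulate(30)
test_X, test_y = simulate(30)
train_small = [[row[0]] for row in train_X]  # real predictor only
test_small = [[row[0]] for row in test_X]

beta_small = fit_ols(train_small, train_y)
beta_full = fit_ols(train_X, train_y)        # real predictor + noise predictor

r2_train_small = r_squared(beta_small, train_small, train_y)
r2_train_full = r_squared(beta_full, train_X, train_y)
r2_test_small = r_squared(beta_small, test_small, test_y)
r2_test_full = r_squared(beta_full, test_X, test_y)

print("training R^2:", r2_train_small, "->", r2_train_full)
print("test R^2:    ", r2_test_small, "->", r2_test_full)
```

The training R² of the larger model cannot be lower than that of the smaller one, because the smaller model is a special case of the larger. The test R² carries no such guarantee, which is why a tiny gain in training R² is weak evidence that a variable belongs in the model.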
Babyak (2004) suggests the following strategies to avoid overfitting: (1) collect more data, because the larger the sample size, the less likely you are to overfit the model; (2) combine predictors, reducing the number of variables in the model by combining closely correlated predictors, or fix some of the regression coefficients if the relationship has already been studied extensively; and (3) use a shrinkage estimator (included in many statistical packages), which adjusts the estimates to give a more realistic picture of how the model would perform against a new data set.
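To make the idea of a shrinkage estimator concrete, here is a minimal sketch using ridge regression with a single predictor. Ridge is my choice for illustration; it is one common shrinkage method and not necessarily the one Babyak has in mind. The data are simulated, and the function name ridge_slope and the penalty values are hypothetical. With centered data, the ridge slope is Sxy / (Sxx + lambda), so a larger penalty lambda pulls the slope toward zero.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Small simulated sample, where overfitting is most likely.
n = 15
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

def ridge_slope(x, y, lam):
    """Ridge slope for one centered predictor: Sxy / (Sxx + lam).

    lam = 0 gives the ordinary least squares slope.
    """
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / (sxx + lam)

penalties = (0.0, 1.0, 10.0, 100.0)
slopes = [ridge_slope(x, y, lam) for lam in penalties]

for lam, slope in zip(penalties, slopes):
    print("lambda =", lam, " slope =", slope)
```

In practice the penalty is usually chosen by cross-validation instead of being set by hand, and the statistical package handles the multi-predictor case.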
Let’s look at an example from Babyak (2004) about fixing the regression coefficients of well-studied relationships. The research literature is crowded with studies reporting a consistent risk ratio between age and cardiac events. Because this finding is so widely accepted, Babyak suggests that we can treat the relationship as established and fix its coefficient at the known value instead of estimating it again from our data. This saves a degree of freedom and helps to avoid overfitting the model.
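One way to picture fixing a coefficient is as an offset: subtract the known contribution of the well-studied predictor from the outcome, then estimate only the remaining coefficients. The sketch below is a simplified illustration, not Babyak's own procedure. It uses an ordinary linear model instead of a risk-ratio model, simulated data, and a made-up value for the known slope (KNOWN_AGE_SLOPE); all names are hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical slope for age, assumed to be known from prior literature.
KNOWN_AGE_SLOPE = 0.05

# Simulated data: the outcome depends on age and on a 0/1 treatment indicator.
n = 40
age = [rng.uniform(40, 80) for _ in range(n)]
treatment = [rng.choice([0.0, 1.0]) for _ in range(n)]
outcome = [1.0 + KNOWN_AGE_SLOPE * a + 0.8 * t + rng.gauss(0, 0.5)
           for a, t in zip(age, treatment)]

# Move the fixed age term to the left-hand side (an "offset").
adjusted = [y - KNOWN_AGE_SLOPE * a for y, a in zip(outcome, age)]

def simple_ols(x, y):
    """Intercept and slope of a one-predictor least squares fit."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Only the intercept and the treatment effect are estimated from the data.
intercept, treatment_effect = simple_ols(treatment, adjusted)
estimated_params = 2  # would be 3 if the age slope were estimated as well

print("intercept:", intercept)
print("treatment effect:", treatment_effect)
print("parameters estimated:", estimated_params)
```

The age slope never enters the fitting step, so the sample only has to support two estimated parameters.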
An application of this strategy in my area of interest, education and learning science, might arise when studying the outcomes of a new reading curriculum for kindergarten. If the data collected included information on the link between low socioeconomic status (SES) and low word recognition in kindergarten, we might fix this regression coefficient instead of estimating it ourselves, because the topic has been heavily researched and the relationship is generally accepted. (You can find more examples of these strategies in Babyak’s article.)
Replication helps validate your study, but if you make the mistake of overfitting a model, a failure to replicate is likely (Frost, 2015). Ultimately, this will create skepticism and raise questions about your research, and that is not a reputation any professional wants to have.
Babyak, M. A. (2004). What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine, 66(3), 411–421.
Frost, J. (2015). The danger of overfitting regression models. Retrieved from http://blog.minitab.com/blog/adventures-in-statistics-2/the-danger-of-overfitting-regression-models
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. New York, NY: Springer.