Homoscedasticity vs Heteroscedasticity
Homoscedasticity is constant (or homogeneous) variance in a set of random variables. You may be wondering how it’s possible for variance to change. Isn’t it a single number?
This is where the idea of a set of variables factors in. We are not looking at a single variable in isolation; we are looking at the relationship between a combination of variables. For example, your dependent variable has actual and predicted values. When we talk about the variance in this set, we mean the variance of the prediction errors (the residuals) and whether it stays the same across the range of the data.
One of the fundamental assumptions of linear regression is that the error in the predictions is homoscedastic. When this assumption is violated, the coefficient standard errors, and the significance tests built on them, are no longer reliable. Homoscedasticity is also required for analysis of variance (ANOVA) tests.
Heteroscedasticity, as you may have guessed, is heterogeneous variance. This is a common characteristic of many real-world relationships. For example, if you consider the relationship between income and how much money people spend on food, those who have higher incomes will have greater flexibility of choice in how much they spend. Some may spend more on luxury ingredients or dining out while others may have simpler, more frugal habits, leading to a high variance. Those with less income, on the other hand, will have much more limited budgets and less variance.
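To make the income example concrete, here is a minimal simulation sketch using only the Python standard library. The spending model (10% of income on average, with noise that scales with income) and the function name simulate_food_spending are assumptions made up for illustration, not real data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def simulate_food_spending(income):
    # Hypothetical model: spending averages 10% of income, and the
    # spread of the noise also grows in proportion to income.
    return 0.10 * income + random.gauss(0, 0.03 * income)


# Simulate 1,000 households at each of two income levels.
low_income = [simulate_food_spending(20_000) for _ in range(1_000)]
high_income = [simulate_food_spending(200_000) for _ in range(1_000)]

# With heteroscedastic data, the spread differs between the two groups.
print("std at low income: ", round(statistics.stdev(low_income)))
print("std at high income:", round(statistics.stdev(high_income)))
```

If the noise term used a fixed standard deviation instead of one tied to income, the two printed values would be roughly equal, which is the homoscedastic case.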
You can test for heteroscedasticity in linear regression with the Breusch-Pagan test, which checks for dependence between the variance in errors and the independent variables. This test is available in Python in the statsmodels package.
If the errors turn out to be heteroscedastic with respect to any of your independent variables, you can still use linear models by applying a correction:
- Convert the individual feature to a log scale
- Convert the target to a log scale or other appropriate transformation
- Apply weights to the training data in weighted least squares estimation
- Use heteroscedasticity-consistent standard error (HCSE) estimation
These alternatives to ordinary least squares estimation are available in the statsmodels package, which also has a thorough tutorial on WLS.
Homoscedasticity, heteroscedasticity and other key concepts for working with models in real-world settings are covered in my Machine Learning Flashcards: Modeling Core Concepts deck. Check it out on Etsy!