Learn Heteroskedasticity in 2 minutes
Definition
Heteroskedasticity. Scary word, isn’t it? Nothing to be afraid of: the term simply means that the variability of one variable is unequal across the range of values of a second variable that predicts it. Still scary?
Intuition
Let’s skip the formal definitions and build intuition with an example. Suppose you have a dataset of annual income vs. age. You will most likely observe little variance in income among people in their twenties: at that point in life, everyone is at the very beginning of their career. You will see much higher variance among people in their thirties and forties, because some are fortunate enough to become successful and others are not.
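To make this concrete, here is a small sketch that simulates income-vs-age data with exactly this property (all numbers are invented for illustration): the spread of the noise term grows with age, so income is far more dispersed among older people.

```python
import numpy as np

rng = np.random.default_rng(42)

age = rng.uniform(20, 50, size=2000)
# Mean income rises with age, and the noise term's spread ALSO rises
# with age -- this growing spread is what heteroskedasticity means.
income = 20_000 + 1_000 * age + rng.normal(0, 300 * (age - 18))

# Compare the income spread in the two groups: much larger for 40+.
spread_20s = income[age < 30].std()
spread_40s = income[age >= 40].std()
print(f"std in 20s: {spread_20s:,.0f}, std in 40s: {spread_40s:,.0f}")
```

If you scatter-plot `age` against `income` from this simulation, you get the characteristic cone shape described below.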
The easiest way to observe this effect visually is with a scatter plot. If the graph has a cone shape, as seen above, you are probably dealing with heteroskedasticity. The opposite is homoskedasticity:
Why should you care
You may ask, “why should I care about this?”. Heteroskedasticity undermines linear regression: the coefficient estimates stay unbiased, but their standard errors (and therefore p-values and confidence intervals) become unreliable. So make sure to check your data for this condition, and if it is present, use robust standard errors, transform the data, or choose another model. Stay safe!
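Beyond eyeballing the scatter plot, you can run a formal check. Below is a rough, NumPy-only sketch in the spirit of the Breusch–Pagan test (in practice you would reach for `statsmodels.stats.diagnostic.het_breuschpagan`): fit a line, then regress the squared residuals on the predictor. If the predictor explains the squared residuals well (large n·R²), the error variance likely depends on it.

```python
import numpy as np

def bp_statistic(x, y):
    """Breusch-Pagan-style statistic n * R^2 for a single predictor.

    Fits y ~ x by least squares, then regresses the squared residuals
    on x. Under homoskedasticity this is roughly chi-squared with 1
    degree of freedom, so values well above ~3.84 (the 5% critical
    value) suggest heteroskedasticity.
    """
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u2 = (y - X @ beta) ** 2          # squared residuals
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ gamma
    r2 = 1 - ((u2 - fitted) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()
    return len(x) * r2

rng = np.random.default_rng(0)
x = rng.uniform(20, 50, 1000)
hetero = 1_000 * x + rng.normal(0, 100 * x)            # spread grows with x
homo = 1_000 * x + rng.normal(0, 2_000, size=x.size)   # constant spread
print(bp_statistic(x, hetero), bp_statistic(x, homo))
```

The statistic comes out far above the critical value for the heteroskedastic sample and near zero for the homoskedastic one.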
Resources
You can find a real-life example of heteroskedasticity in the Kaggle kernel written by Riga Data Science Club member Dan Yachmenev: https://www.kaggle.com/danilyachmenev13/riga-eda-geopy-and-model-comparison