Learn Heteroskedasticity in 2 minutes

Dmitry Yemelyanov
Riga Data Science Club
2 min read · Aug 26, 2020

Definition

Heteroskedasticity. Scary word, isn’t it? Nothing to be afraid of. The term describes a situation in which the spread of one variable is unequal across the range of values of a second variable that predicts it. Still scary?

Intuition

Let’s skip complex definitions and build intuition with an example. Suppose you have a dataset of annual income versus age. Most probably you will observe little variance in income among people in their twenties, since at that point in life most of them are at the very beginning of their careers. You will see much higher variance among people in their thirties and forties, because some have been more fortunate and successful than others.

Heteroskedasticity example: Income vs Age (data is not real)
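If you want to play with this effect yourself, here is a minimal sketch that generates made-up income vs age data. Every number in it (the age range, the income formula, the way the noise grows) is an assumption chosen for illustration, and the function name `simulate_income` is hypothetical. It only uses the Python standard library.

```python
import random
import statistics

def simulate_income(n=2000, seed=0):
    """Return (age, income) pairs whose income spread widens with age (made-up numbers)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        age = rng.uniform(20, 50)
        # Assumption: mean income rises linearly with age, and the standard
        # deviation of the noise around that mean also grows with age.
        mean = 15_000 + 1_000 * (age - 20)
        spread = 1_000 + 500 * (age - 20)
        data.append((age, rng.gauss(mean, spread)))
    return data

data = simulate_income()
twenties = [income for age, income in data if age < 30]
forties = [income for age, income in data if age >= 40]

# Compare how widely incomes are scattered in each age group.
print("std dev, twenties:", round(statistics.stdev(twenties)))
print("std dev, forties: ", round(statistics.stdev(forties)))
```

The second number should come out clearly larger than the first, which is exactly the widening spread described in the example.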

The easiest way to observe this effect visually is by making a scatter plot. If the graph has a cone shape as seen above, you are probably dealing with heteroscedasticity. The opposite is homoscedasticity:

Homoscedasticity vs Heteroscedasticity
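A plot is not the only way to spot the cone. As a rough numeric check, you can fit a straight line, split the residuals into a lower and an upper half by x, and compare their variances. This is only an informal sketch, loosely in the spirit of the Goldfeld-Quandt test and not a replacement for a proper statistical test. The helper names `fit_line` and `spread_ratio` and the simulated data are assumptions made for this example.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def spread_ratio(xs, ys):
    """Residual variance in the upper half of x divided by that in the lower half."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    pairs = sorted(zip(xs, residuals))  # order the residuals by x
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return statistics.variance(high) / statistics.variance(low)

rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(2000)]
homo = [2 * x + rng.gauss(0, 3) for x in xs]          # same noise everywhere
hetero = [2 * x + rng.gauss(0, 0.5 + x) for x in xs]  # noise grows with x

print("homoscedastic ratio:  ", round(spread_ratio(xs, homo), 2))
print("heteroscedastic ratio:", round(spread_ratio(xs, hetero), 2))
```

A ratio close to 1 suggests an even spread, while a ratio far above 1 suggests the cone shape.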

Why should you care

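You may ask, “Why should I care about this?” Heteroscedasticity can make linear regression modeling unreliable: ordinary least squares assumes that the errors have constant variance, and when they don’t, the coefficient estimates stay unbiased but the usual standard errors, confidence intervals, and p-values can no longer be trusted. Therefore you will have to make sure your data doesn’t have this condition, or choose another model. Stay safe!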
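To see the problem in numbers, here is a small sketch that fits a line to made-up heteroscedastic data and computes the standard error of the slope in two ways: with the classical formula, which assumes constant variance, and with White's heteroskedasticity-robust (HC0) formula. The data, the noise model and the function name `slope_standard_errors` are assumptions for illustration. Statistical libraries such as statsmodels offer robust standard errors out of the box, while this version sticks to the standard library.

```python
import math
import random
import statistics

def slope_standard_errors(xs, ys):
    """Return (slope, classical SE, robust HC0 SE) for the line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Classical formula: assumes all errors share a single variance.
    classical = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # HC0 formula: weights each squared residual by its own (x - x_bar)^2.
    robust = math.sqrt(sum(((x - x_bar) * e) ** 2 for x, e in zip(xs, resid))) / sxx
    return b, classical, robust

rng = random.Random(2)
xs = [rng.uniform(0, 10) for _ in range(5000)]
# Assumption: the noise standard deviation grows quickly with x.
ys = [3 * x + rng.gauss(0, 0.2 * x * x) for x in xs]

slope, classical_se, robust_se = slope_standard_errors(xs, ys)
print(f"slope={slope:.2f}  classical SE={classical_se:.3f}  robust SE={robust_se:.3f}")
```

On data like this the classical standard error tends to come out smaller than the robust one, so the slope looks more precise than it really is.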

Resources

You can find a real-life example of heteroskedasticity in the Kaggle kernel written by a Riga Data Science Club member, Dan Yachmenev: https://www.kaggle.com/danilyachmenev13/riga-eda-geopy-and-model-comparison

Apartment price vs area

Dmitry Yemelyanov
Riga Data Science Club

Founder at Riga Data Science Club | Machine Learning Consultant