Heteroskedasticity vs. Homoskedasticity: An Assumption of Linear Regression

Sandhya S
3 min read · May 12, 2023


A Linear Regression model should be validated against all of its assumptions, including the choice of functional form. If the assumptions are violated, we need to revisit the model. Since Linear Regression is a quantitative analysis, we need to validate some assumptions about the data and about the fitted model.

Assumptions on the data, before training the model: no multicollinearity, a linear relationship, no hidden values.

Assumptions on the model, after training: normality of residuals, homoscedasticity.

Let's discuss homoscedasticity vs. heteroskedasticity.

Homoscedasticity (homo = equal, scedasticity = spread):

Homoscedasticity in a model means that the error variance is constant across the range of predicted values: the residuals have the same spread at every level of the independent variables. If a model is homoskedastic, we can assume that the residuals are drawn from a population with constant variance.

This satisfies one of the assumptions of OLS regression and makes the model's estimates more reliable. In practice, a constant error means you are looking for a constant deviation of the points from the zero-line in a residual plot.
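To see what a "constant deviation from the zero-line" looks like, here is a minimal sketch on simulated data (the synthetic setup and variable names are assumptions for illustration): fit an OLS model and plot residuals against fitted values. Under homoscedasticity, the points form a roughly constant-width band around zero.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated homoscedastic data: noise variance does not depend on x
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3 * x + 5 + rng.normal(0, 2, 200)  # constant sigma = 2

results = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs. fitted values: expect a constant-width band around zero
plt.scatter(results.fittedvalues, results.resid, alpha=0.5)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Homoscedastic residuals: constant spread around the zero-line")
plt.show()
```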

Heteroskedasticity:

It refers to situations where the variance of the residuals is unequal over the range of measured or fitted values.

When running a regression analysis, heteroskedasticity shows up as a fan or cone shape in the residual plot. Seeing patterns in the errors is an indication that something missing from your model is generating those patterns.
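A quick way to reproduce the fan shape, and to back up the eyeball check with a formal diagnostic, is sketched below. The simulated data and variable names are my own assumptions; the Breusch-Pagan test from statsmodels is one standard check, where a small p-value suggests heteroskedasticity.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated heteroskedastic data: noise standard deviation grows with x,
# which produces the fan/cone shape in a residual plot
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 3 * x + 5 + rng.normal(0, 0.5 * x)  # sigma proportional to x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Breusch-Pagan test: a small p-value suggests heteroskedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")
```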

Why does heteroskedasticity occur?

Models that utilize a wider range of observed values are more prone to heteroscedasticity.

This is generally because the difference between the smallest and largest values is more significant in these datasets, which increases the chance of heteroscedasticity.

For example, if you analyzed retail e-commerce sales for the past 30 years, the number of sales over the past 10 years would be significantly larger due to the recent prevalence of online shopping. That growth could skew the residuals and result in heteroskedasticity.

Another example is income vs. age. Younger people generally have access to a lower range of jobs, and most will sit close to minimum wage. As the population ages, the variability in job access will only expand. Some will stick closer to minimum wage, others will become very successful, etc.

Ways to fix heteroskedasticity (a combined code sketch follows the list):

1. Weighted ordinary least squares: weight is assigned to high-quality (low-variance) observations to obtain a better fit. Weighted regression minimizes the sum of the weighted squared residuals, replacing heteroscedasticity with homoscedasticity.

2. Transform the dependent variable: here, population age (the independent variable) is used to predict job access (the dependent variable). Instead, population age could be used to predict the log of job access.

3. Redefine the dependent variable: another method is to use a rate rather than the raw value. Instead of using population age to predict job access, we use population age to predict jobs per capita. Here, we're measuring the number of jobs per individual rather than the raw number of jobs.
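As a concrete illustration of all three fixes, here is a minimal sketch using statsmodels on simulated data. The dataset, variable names, and choice of weights are all assumptions made for the example; in practice the right weights or transform depend on your data.

```python
import numpy as np
import statsmodels.api as sm

# Simulated heteroskedastic data: noise spread grows with age,
# echoing the income-vs-age example above (all values are made up)
rng = np.random.default_rng(7)
age = rng.uniform(20, 65, 300)
population = rng.integers(1_000, 10_000, 300)    # hypothetical region sizes
job_access = 50 * age + rng.normal(0, 10 * age)  # spread widens with age
job_access = np.clip(job_access, 1, None)        # keep positive for the log
X = sm.add_constant(age)

# 1. Weighted least squares: weight = 1 / (assumed error variance).
#    Here we assume the variance grows like age**2, so weights are 1/age**2.
wls = sm.WLS(job_access, X, weights=1.0 / age**2).fit()

# 2. Transform the dependent variable: model log(job_access) instead.
log_ols = sm.OLS(np.log(job_access), X).fit()

# 3. Redefine the dependent variable as a rate: jobs per capita.
rate_ols = sm.OLS(job_access / population, X).fit()

for name, res in [("WLS", wls), ("log OLS", log_ols), ("rate OLS", rate_ols)]:
    print(name, res.params)
```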

Remember, heteroscedasticity does not necessarily occur in isolation from the other assumptions we saw at the top, so don't forget to check all the linear regression assumptions before addressing heteroscedasticity.

We'll see further assumptions in upcoming posts.

Thanks for Reading!

Please do follow me for Data Science and Big Data updates

@LinkedIn

Keep Learning…
