Do you know what homoscedasticity is, one of the premises of the OLS regression model?

Emerson Santos
Published in LatinXinAI
5 min read · May 17, 2024

Scedasticity means dispersion or variance. In the context of regression, scedasticity refers to the variance of a model's residuals (error terms). Thus, homoscedasticity is the property of a model whose residuals have constant variance across the entire range of predicted values. In other words, homoscedasticity occurs when the differences between the actual and predicted values of the response variable show no increasing or decreasing trend. You can see an example in the image below.
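This property can be checked directly from the residuals. Below is a minimal numpy sketch with synthetic data (all names and values here are illustrative): because the noise is generated with constant variance, the residual spread is roughly the same on the low and high ends of the predictor range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with constant error variance (homoscedastic by construction)
x = np.linspace(0, 10, 500)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)

# Fit a degree-1 OLS model and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# With homoscedastic errors, the residual spread is roughly the same
# over the low and high halves of the predictor range
std_low = residuals[x < 5].std()
std_high = residuals[x >= 5].std()
print(std_low, std_high)
```

Plotting `residuals` against the fitted values would give the kind of patternless, constant-width band the article describes.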

Analyzing homoscedasticity through a residual plot is important because the Ordinary Least Squares linear regression model assumes, in one of its 7 premises, that all residuals are drawn from a population with constant variance. Whenever one of the assumptions is violated, you can trust your model less.

Detecting heteroscedasticity can be an important sign that there is a flaw in your regression model. Now let’s look at how to identify some of these faults and how to correct them to improve the reliability of your model.
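Besides visual inspection, heteroscedasticity can be flagged numerically. As one option, Koenker's studentized version of the Breusch-Pagan test computes n·R² from regressing the squared residuals on the regressors; a minimal numpy sketch on synthetic residuals (the data here are illustrative):

```python
import numpy as np

def bp_statistic(x, residuals):
    """Breusch-Pagan LM statistic (Koenker's studentized form):
    n * R^2 from regressing the squared residuals on the regressor.
    Under homoscedasticity it is approximately chi-squared with
    1 degree of freedom, so values above ~3.84 suggest
    heteroscedasticity at the 5% level."""
    u2 = residuals ** 2
    b, a = np.polyfit(x, u2, deg=1)
    fitted = b * x + a
    ss_res = ((u2 - fitted) ** 2).sum()
    ss_tot = ((u2 - u2.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    return len(x) * r2

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 400)
res_homo = rng.normal(0, 1.0, x.size)   # constant spread
res_hetero = rng.normal(0, 0.5 * x)     # spread grows with x

print(bp_statistic(x, res_homo))
print(bp_statistic(x, res_hetero))
```

The statistic for the simulated heteroscedastic residuals comes out far above the 3.84 threshold, while the homoscedastic case stays near zero.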

Firstly, it is important to distinguish the two types of heteroscedasticity:

Pure heteroscedasticity: occurs when the model is correctly specified but heteroscedasticity is still observed. An example is predicting a target that varies over a large scale. Imagine your target ranges from 1 to 10,000: the nominal value of the residual will tend to be larger whenever the model predicts larger targets, which naturally produces heteroscedasticity. Another cause is an increase in the variability of the dependent variable over the range of values of the independent variables.

Impure heteroscedasticity: occurs when the model is misspecified, and the misspecification itself causes heteroscedasticity. An example is failing to include an important feature in the model: the portion of that feature's influence on the target is then absorbed entirely by the residuals. If the feature grows over the data range, the residuals grow as well and their variance is not constant.

Now let’s propose some solutions for the most common causes of heteroscedasticity in your model:

Solution 1:

When analyzing the residual plot and identifying heteroscedasticity, first study what may be causing it. A common first step is to look at the scale of your dependent variable. If the scale is relatively large, some strategies can be used to overcome the problem of heteroscedasticity.

If the scale is relatively large, as we have seen, there will naturally be some heteroscedasticity, because the nominal value of the residual grows with increasingly larger predicted values. In this case, apply a transformation to the dependent variable; this can be done with many different strategies. The image below shows a residual plot for a model with this failure:

For example, depending on what you want to model, it may make sense to predict, instead of the nominal value of the dependent variable, its percentage change. This transformation naturally reduces the scale on which you work with the dependent variable and reduces heteroscedasticity.

Another appropriate approach is to weight every data point by a factor p and use weighted regression. For example, if you know that the variance of the residuals increases with the expected value of the target because of growth in an independent variable x, you can use p = 1/x as a weight column and fit the data with a weighted regression model. This type of model handles heteroscedastic data better and can provide more reliable conclusions in this case.
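A minimal numpy sketch of this idea, using the 1/x weight suggested above on synthetic data (the coefficients are illustrative). Weighted least squares can be implemented by scaling the rows of the design matrix and the target by the square root of the weights:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = np.linspace(1, 10, n)
y = 4.0 * x + rng.normal(0, 0.8 * x)   # residual spread grows with x

# Weighted least squares with weight p = 1/x, implemented by scaling
# the design matrix rows and the target by sqrt(p)
p = 1.0 / x
X = np.column_stack([x, np.ones(n)])
sp = np.sqrt(p)
beta_wls, *_ = np.linalg.lstsq(X * sp[:, None], y * sp, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_wls[0], beta_ols[0])   # both near the true slope of 4
```

One note on the design choice: when the standard deviation (rather than the variance) of the errors grows linearly with x, the variance-optimal weight would be 1/x²; the 1/x weight here follows the article's suggestion and still down-weights the noisiest points.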

Another alternative is to test other data transformations, such as standardization, normalization, exponential, log, etc., and check whether heteroscedasticity has been reduced.
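As a sketch of the log case (synthetic, illustrative data): when the noise is multiplicative, residual spread scales with the magnitude of the target, and taking the log of the target turns it into additive noise with roughly constant variance.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 500)
# Multiplicative noise: residual size scales with the magnitude of y
y = np.exp(0.5 * x) * rng.lognormal(0, 0.2, x.size)

def spread_ratio(resid):
    """Residual std on the high half of x divided by the low half."""
    return resid[x >= 5].std() / resid[x < 5].std()

# Raw target: residual spread grows with the scale of y
s, c = np.polyfit(x, y, deg=1)
ratio_raw = spread_ratio(y - (s * x + c))

# Log target: the noise becomes additive with roughly constant variance
s, c = np.polyfit(x, np.log(y), deg=1)
ratio_log = spread_ratio(np.log(y) - (s * x + c))

print(ratio_raw, ratio_log)
```

A ratio near 1 after the transformation is the sign that the residual variance has become roughly constant.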

Solution 2:

If the scale is not large and you still see inconsistency in the variance of the residuals, some independent variable was probably left out of the model. Consider adding more variables and refitting. A missing variable can even be an interaction term between variables that you have already included in the model.
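A minimal sketch of the missing-interaction case on synthetic data (the coefficients are illustrative): the model with only the main effects leaves the interaction's contribution in the residuals, and adding the x1·x2 column shrinks them back to the noise level.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 800
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = x1 + x2 + 0.5 * x1 * x2 + rng.normal(0, 1.0, n)   # true interaction

def resid_std(X):
    """Residual standard deviation of an OLS fit with design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return (y - X @ beta).std()

ones = np.ones(n)
std_main = resid_std(np.column_stack([x1, x2, ones]))            # no interaction
std_full = resid_std(np.column_stack([x1, x2, x1 * x2, ones]))   # with x1*x2

print(std_main, std_full)   # the interaction term shrinks the residuals
```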

If the problem persists and you are using a degree-1 linear regression model, there is probably a non-linear relationship between a feature and the target that the model cannot learn. Try a polynomial regression of degree greater than 1, or even non-linear models. The image below shows a residual plot for a model with these flaws:
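A short sketch of this fix on synthetic data (the curve and noise level are illustrative): the degree-1 fit cannot capture the curvature and its residuals stay inflated, while the degree-2 fit learns the curve and brings the residuals down to the noise level.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 500)
y = 0.3 * x ** 2 + rng.normal(0, 1.0, x.size)   # non-linear relationship

# Degree-1 fit: residuals keep a systematic pattern from the curvature
r1 = y - np.polyval(np.polyfit(x, y, 1), x)
# Degree-2 fit: residuals shrink to the noise level
r2 = y - np.polyval(np.polyfit(x, y, 2), x)

print(r1.std(), r2.std())
```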

In conclusion, the above are some strategies that can be adopted to correct both pure and impure heteroscedasticity. Sometimes they will not be enough, and domain knowledge of the data will be needed to trace in more depth why heteroscedasticity is present, or even to decide how much of it you can tolerate without it affecting your results too much. In statistical inference, for example, heteroscedasticity distorts the confidence intervals of the regression coefficients: they are in fact wider than what was calculated.

Therefore, analyzing residuals in search of heteroscedasticity is important both to detect flaws in your model so that improvements can be made and to ensure the reliability of your results.

I hope this article has provided you with some value. Feel free to leave suggestions or comments below, or find me on LinkedIn.

Emerson Santos





Emerson Santos

As a qualified Data Scientist and Engineer with experience in solving challenging problems, I propose innovative and creative solutions using Data Science.