Homoscedasticity

Somayeh Youssefi
Dec 24, 2023

In this post we will learn about homoscedasticity. The notebook can be found here.

One crucial assumption in linear regression is homoscedasticity, which requires the variance of the errors to be constant across all levels of the independent variable(s). Put simply, the residuals should have roughly the same spread across the entire range of predicted values. If this assumption is violated, the ordinary least squares coefficient estimates are still unbiased, but their standard errors, and therefore the confidence intervals and p-values built on them, are no longer reliable.

Homoscedasticity can be evaluated visually or with one of several statistical tests. Plotting the residuals (the differences between observed and predicted values) against the predicted values gives the residual plot. If the errors are homoscedastic, the points are scattered randomly and evenly across the plot, without any recognizable pattern. A shape or pattern in the residual plot, such as a funnel, suggests heteroscedasticity (non-constant variance).

To visually test homoscedasticity, I generated two sets of data around a line and drew the residual plot for each. As can be seen below, the residuals of the first set are scattered randomly around zero, and we cannot see any pattern in them.

The second sample series is shown below, and we can see a funnel shape in the residual plot: as the predicted value increases, the variance of the residuals also increases.
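The data behind the two plots is not reproduced in this post, so the following is only a minimal sketch on synthetic data of my own choosing (the line `2 + 3x`, the noise levels, and the helper names `residuals_vs_fitted` and `spread_ratio` are all assumptions). It builds one series with constant noise and one whose noise grows with x, then compares the residual spread in the upper and lower halves of the fitted values, which is a rough numeric stand-in for what the funnel shape shows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0.0, 1.0, n))

# Series 1: constant noise around a line (homoscedastic by construction)
y_const = 2.0 + 3.0 * x + rng.normal(0.0, 0.3, n)
# Series 2: noise whose standard deviation grows with x (heteroscedastic)
y_funnel = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, n) * (0.05 + x)


def residuals_vs_fitted(x, y):
    """Fit a straight line and return (fitted values, residuals)."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    return fitted, y - fitted


def spread_ratio(fitted, resid):
    """Residual std in the upper half of fitted values over the lower half."""
    order = np.argsort(fitted)
    half = len(resid) // 2
    lower = resid[order[:half]]
    upper = resid[order[half:]]
    return upper.std(ddof=1) / lower.std(ddof=1)


ratio_const = spread_ratio(*residuals_vs_fitted(x, y_const))
ratio_funnel = spread_ratio(*residuals_vs_fitted(x, y_funnel))

print(f"Spread ratio, constant noise: {ratio_const:.2f}")
print(f"Spread ratio, growing noise:  {ratio_funnel:.2f}")
```

A ratio close to 1 matches the evenly scattered plot, while a ratio well above 1 matches the funnel. Plotting `resid` against `fitted` with any plotting library gives the residual plot itself.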

There are several statistical tests for homoscedasticity. One of them is the Goldfeld-Quandt test. To perform it, we first split the dataset, ordered by the independent variable, into two groups, fit a regression to each group, and then compare the variances of the residuals in the two groups. If the variances are significantly different, it indicates potential heteroscedasticity.

I will do the Goldfeld-Quandt test first manually and then by using the statsmodels package.

Goldfeld-Quandt test

I first look at the plot of the sample points and choose X=0.2 as the point at which to split my data set into two groups. At X=0.2, I have 28 samples in my first group and 72 samples in my second group. Then I fit a linear regression on each group and estimate RSS/dof for each group, where RSS is the residual sum of squares and dof is the degrees of freedom (the number of samples in the group minus the number of fitted parameters). The test statistic, which is an F statistic, is the ratio of this value for the second (bigger) group to the one for the first (smaller) group. The critical F value for alpha = 0.05 can be read from an F table and is ~1.5. The F-value that I estimated for my two groups is 20.68, and because it is bigger than the critical F value, I can reject the null hypothesis. But what is the null hypothesis in the Goldfeld-Quandt test?

The null hypothesis in the Goldfeld-Quandt test is that homoscedasticity is present, that is, there is no significant difference between the residual variances of the two groups. Therefore, based on my F-value and a significance level of 0.05, I can conclude that heteroscedasticity is present in my dataset, and I need to address it before drawing any conclusions from my model.

An easier way to run the Goldfeld-Quandt test is to use the statsmodels package, as shown below, which leads to the same conclusion.
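The manual steps described above can be sketched roughly as follows. This is not the notebook's code: the synthetic data, the function name `goldfeld_quandt_manual`, and the use of `scipy.stats.f` in place of a printed F table are my assumptions, so the numbers will not match the 20.68 reported for the post's data.

```python
import numpy as np
from scipy import stats


def goldfeld_quandt_manual(x, y, split_value, alpha=0.05):
    """Split at x = split_value, fit a line to each group, compare RSS/dof."""

    def rss_and_dof(xg, yg):
        slope, intercept = np.polyfit(xg, yg, 1)
        resid = yg - (intercept + slope * xg)
        # Two fitted parameters (slope and intercept) per group
        return float(np.sum(resid ** 2)), len(xg) - 2

    low = x < split_value
    rss1, dof1 = rss_and_dof(x[low], y[low])      # first group (small x)
    rss2, dof2 = rss_and_dof(x[~low], y[~low])    # second group (large x)

    # Variance is suspected to grow with x, so the second group is the numerator
    f_stat = (rss2 / dof2) / (rss1 / dof1)
    f_crit = stats.f.ppf(1.0 - alpha, dof2, dof1)  # one-sided critical value
    p_value = stats.f.sf(f_stat, dof2, dof1)       # P(F >= f_stat) under H0
    return f_stat, f_crit, p_value


# Hypothetical data: noise standard deviation grows with x
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 100)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, 100) * (0.05 + x)

f_stat, f_crit, p_value = goldfeld_quandt_manual(x, y, split_value=0.2)
print(f"F statistic: {f_stat:.2f}, critical F: {f_crit:.2f}, p-value: {p_value:.3g}")
if f_stat > f_crit:
    print("Reject the null hypothesis of homoscedasticity.")
else:
    print("Cannot reject the null hypothesis of homoscedasticity.")
```

Computing the critical value from the two groups' degrees of freedom avoids interpolating in an F table, and the p-value carries the same information as the comparison against the critical value.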

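from sklearn.linear_model import LinearRegression
from statsmodels.stats.diagnostic import het_goldfeldquandt

# `data` is assumed to be the pandas DataFrame loaded earlier in the notebook
X = data['X'].to_numpy().reshape(-1,1)
y = data['y'].to_numpy()

# Fit a linear regression and compute the predicted values
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Residuals
residuals = y - y_pred

# Perform the Goldfeld-Quandt test
# split=28 puts the first 28 rows in the first group, so the rows are
# assumed to be sorted by X
gq_test = het_goldfeldquandt(residuals, X, split=28)

# Display the test results
print("Goldfeld-Quandt Test:")
print(f"Test statistic: {gq_test[0]}")
print(f"P-value: {gq_test[1]}")

# Interpret the results
if gq_test[1] > 0.05:
    print("The null hypothesis of homoscedasticity cannot be rejected.")
else:
    print("The null hypothesis of homoscedasticity is rejected, indicating heteroscedasticity.")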


Goldfeld-Quandt Test:
Test statistic: 20.661281532864027
P-value: 6.455542744783178e-24
The null hypothesis of homoscedasticity is rejected, indicating heteroscedasticity.

Homoscedasticity is an essential assumption in linear regression. If you find evidence of heteroscedasticity, it’s important to address it before drawing conclusions from your model. This could involve variable transformations, selecting a different modeling approach, or using robust regression techniques.
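As one illustration of a variable transformation, the sketch below uses hypothetical data with multiplicative noise, where the spread of y grows with its level, and compares a simple split-half residual variance ratio before and after taking the logarithm of y. The data-generating process and the helper `variance_ratio` are assumptions for illustration, and the ratio is a quick check after a single line fit rather than the full Goldfeld-Quandt procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.sort(rng.uniform(0.0, 1.0, n))

# Multiplicative noise: the spread of y grows with its level
y = np.exp(1.0 + 2.0 * x + rng.normal(0.0, 0.3, n))


def variance_ratio(x, y):
    """Residual variance for the upper half of x over the lower half.

    Assumes x is sorted in increasing order and fits one line to all points.
    """
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    half = len(x) // 2
    return resid[half:].var(ddof=1) / resid[:half].var(ddof=1)


ratio_raw = variance_ratio(x, y)           # line fitted to y
ratio_log = variance_ratio(x, np.log(y))   # line fitted to log(y)

print(f"Variance ratio on y:      {ratio_raw:.2f}")
print(f"Variance ratio on log(y): {ratio_log:.2f}")
```

A log transformation helps here only because the noise was built to be multiplicative. On real data, the residual plot or a test should be re-checked after any transformation.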

Thank you for joining me in this exploration. Let me know in the comment section what tests and techniques you use to check homoscedasticity in your data.
