Heteroscedasticity — Nothing but another statistical concept

Manoj Das
6 min read · Aug 31, 2021

Background

The term heteroscedastic (or heteroskedastic) derives from the Ancient Greek words hetero (“different”) and skedasis (“dispersion”). Karl Pearson first used the word, with the “c” spelling, in 1905. Heteroscedasticity came into use to describe the absence of homoscedasticity. In 2003 the econometrician Robert Engle won the Nobel Memorial Prize in Economic Sciences for his studies on regression analysis in the presence of heteroscedasticity, which led to his formulation of the autoregressive conditional heteroscedasticity (ARCH) modeling technique.

Importance of Heteroscedasticity

An important assumption of the classical linear regression model is that the error terms have constant variance, i.e. that they are homoscedastic. Whenever that assumption is violated, heteroscedasticity is present in the data. Breaking this assumption means that the Gauss–Markov theorem no longer applies: the OLS estimators remain unbiased, but they are no longer the Best Linear Unbiased Estimators (BLUE), since their variance is not the lowest among all linear unbiased estimators. The existence of heteroscedasticity is therefore a major concern in regression analysis and the analysis of variance, as it invalidates statistical tests of significance that assume the modelling errors all have the same variance.
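To see the problem concretely, here is a minimal Monte Carlo sketch (pure NumPy; the data-generating process and all coefficients are illustrative assumptions, not from any real dataset). It compares the true sampling variability of the OLS slope with the conventional standard error that assumes constant variance:

```python
# Monte Carlo sketch: when the error spread grows with x, the conventional
# (constant-variance) OLS standard error understates the true sampling
# variability of the slope estimate.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2000
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

slopes, reported_se = [], []
for _ in range(reps):
    y = 1.0 + 2.0 * x + rng.normal(0, 0.1 * x**2)  # heteroscedastic errors
    beta = XtX_inv @ X.T @ y                       # OLS fit
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)               # assumes equal variance
    slopes.append(beta[1])
    reported_se.append(np.sqrt(sigma2 * XtX_inv[1, 1]))

print("true sampling SD of the slope:", round(float(np.std(slopes)), 4))
print("mean conventional SE:         ", round(float(np.mean(reported_se)), 4))
# The conventional SE comes out noticeably smaller than the true sampling SD,
# so t-tests and confidence intervals based on it would be overconfident.
```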

Heteroscedasticity in simple terms

In simple terms, heteroscedastic describes any set of data that is not homoscedastic. Homoscedastic refers to a condition in which the variance of the residuals (error terms) in a regression model is constant. When we perform regression, the data points are scattered around the fitted line; for a good regression model, this scatter should be as small as possible. When the scatter is uniform, the model is called homoscedastic; if not, it is heteroscedastic. More technically, heteroscedasticity refers to data with unequal variability (scatter) across a set of predictor variables. A heteroscedastic distribution typically looks like a cone, as shown below.

Cone-shaped distribution of the heteroscedastic data
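As a quick illustration, here is a minimal sketch (using NumPy and Matplotlib; the coefficients and noise level are assumptions chosen just to produce the pattern) that generates data with exactly this cone shape:

```python
# Simulate data whose error spread grows with x, producing the cone shape.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 300)
y = 2.0 + 1.5 * x + rng.normal(0, 0.4 * x)  # noise SD proportional to x

plt.scatter(x, y, s=10, alpha=0.6)
plt.xlabel("predictor x")
plt.ylabel("response y")
plt.title("Heteroscedastic data: the scatter widens as x grows")
plt.show()
```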

Causes of heteroscedasticity

There are many reasons why heteroscedasticity may occur in regression models; most often the data itself produces this kind of cone-shaped distribution. Models involving a wide range of values are more prone to heteroscedasticity, because the differences between the smallest and largest values are so large. Sometimes it is simply natural that the variance of the dependent variable is not constant across the entire dataset.

Another cause of heteroscedasticity is the presence of outliers in the data. In this context, an outlier is an observation that is either very small or very large relative to the other observations in the sample.

Heteroscedasticity can also be caused by the omission of variables from the model. If an important variable is left out, the researcher can no longer draw reliable conclusions from the model.

Types of heteroscedasticity

Heteroscedasticity can be categorized into two general types:

  • pure heteroscedasticity, and
  • impure heteroscedasticity.

Pure heteroscedasticity refers to cases where the model is specified correctly and yet we observe non-constant variance in the residual plots.

Impure heteroscedasticity refers to cases where the model is specified incorrectly, and that misspecification causes the non-constant variance.

When we leave an important variable out of a model, the omitted effect is absorbed into the error term. If the effect of the omitted variable varies across the observed range of the data, it can produce the telltale signs of heteroscedasticity in the residual plots. When we observe heteroscedasticity in residual plots, it is therefore important to determine whether we are dealing with pure or impure heteroscedasticity, because the solutions are different.

Detection of Heteroscedasticity

Visual approach:

A typical fitted values vs. residuals plot in which heteroscedasticity is present
  1. The simplest way to detect heteroscedasticity is with a fitted values vs. residuals plot. Once a regression line is fitted to a set of data, we can create a scatterplot that shows the fitted values of the model against the residuals of those fitted values (a sketch follows this list).
  2. Residual plots can suggest heteroscedasticity (but cannot prove it). They can be created by:
  • calculating the squared residuals,
  • plotting the squared residuals against an explanatory variable (one that is suspected to be related to the error variance),
  • making a separate plot for each explanatory variable you think is contributing to the errors.
  3. Box plots can also be used to depict the conditional distributions of the residuals.
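Here is a minimal sketch of the fitted values vs. residuals plot (using statsmodels on simulated data; the data-generating process is an assumption for illustration):

```python
# Fit OLS to heteroscedastic data and plot fitted values against residuals;
# a fan/cone shape in this plot is the classic visual sign.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = 2.0 + 1.5 * x + rng.normal(0, 0.4 * x)  # heteroscedastic errors

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

plt.scatter(fit.fittedvalues, fit.resid, s=10, alpha=0.6)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Fitted vs. residuals: a fan shape suggests heteroscedasticity")
plt.show()
```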

Statistical approach: There are several methods to test for the presence of heteroscedasticity. Although tests for heteroscedasticity between groups can formally be considered a special case of testing within regression models, some tests have structures specific to this case. Each of these tests consists of a test statistic (a mathematical expression yielding a numerical value as a function of the data), a hypothesis that is going to be tested (the null hypothesis), an alternative hypothesis, and a statement about the distribution of the statistic under the null hypothesis.

Tests in regression

  • Levene’s test
  • Goldfeld–Quandt test
  • Park test
  • Glejser test
  • Brown–Forsythe test
  • Harrison–McCabe test
  • Breusch–Pagan test
  • White test
  • Cook–Weisberg test
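Two of the most commonly used of these, the Breusch–Pagan and White tests, are available in statsmodels. A minimal sketch on simulated data (the data-generating process is an assumption for illustration):

```python
# Run the Breusch-Pagan and White tests; the null hypothesis in both is
# homoscedasticity, so a small p-value points to heteroscedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = 2.0 + 1.5 * x + rng.normal(0, 0.4 * x)  # heteroscedastic errors

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

bp_lm, bp_p, _, _ = het_breuschpagan(fit.resid, X)
w_lm, w_p, _, _ = het_white(fit.resid, X)
print(f"Breusch-Pagan: LM = {bp_lm:.2f}, p = {bp_p:.4g}")
print(f"White:         LM = {w_lm:.2f}, p = {w_p:.4g}")
```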

Tests for grouped data

  • F-test of equality of variances
  • Cochran’s C test
  • Hartley’s test
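For grouped data, SciPy covers some of these directly. A minimal sketch with two simulated groups of unequal spread (the group sizes and variances are assumptions for illustration):

```python
# Compare the variances of two groups with Levene's test and the classical
# F-test of equality of variances (the F-test assumes normality).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(0, 1.0, 50)  # smaller spread
group_b = rng.normal(0, 2.5, 50)  # larger spread

# Levene's test: null hypothesis is equal variances across groups
stat, p = stats.levene(group_a, group_b)
print(f"Levene: W = {stat:.2f}, p = {p:.4g}")

# Two-sided F-test on the ratio of sample variances
f = np.var(group_a, ddof=1) / np.var(group_b, ddof=1)
p_f = 2 * min(stats.f.cdf(f, 49, 49), stats.f.sf(f, 49, 49))
print(f"F-test: F = {f:.2f}, p = {p_f:.4g}")
```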

Fixes for heteroscedasticity

These are some common corrections for heteroscedasticity:

  • One way to fix heteroscedasticity is to transform the dependent variable in some way. One common transformation is to simply take the log of the dependent variable.
  • Another way to fix heteroscedasticity is to redefine the dependent variable. One common way to do so is to use a rate for the dependent variable, rather than the raw value.
  • Work with logarithmized data. Non-logarithmized series that are growing exponentially often appear to have increasing variability as the series rises over time.
  • Use a different specification for the model (different X variables, or perhaps non-linear transformations of the X variables).
  • Apply a weighted least squares estimation method, in which OLS is applied to transformed or weighted values of X and Y. The weights vary over observations, usually depending on the changing error variances. In one variation the weights are directly related to the magnitude of the dependent variable, and this corresponds to least squares percentage regression.
  • Heteroscedasticity-consistent standard errors (HCSE), while still biased, improve upon OLS estimates. HCSE is a consistent estimator of standard errors in regression models with heteroscedasticity. This method corrects for heteroscedasticity without altering the values of the coefficients. It can be superior to conventional OLS inference because it corrects for heteroscedasticity when it is present, while if the data are homoscedastic the standard errors are asymptotically equivalent to the conventional standard errors estimated by OLS. Several modifications of White’s method of computing heteroscedasticity-consistent standard errors have been proposed as corrections with superior finite-sample properties.
  • Use MINQUE (minimum norm quadratic unbiased estimation).
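As a sketch of two of these fixes in practice (statsmodels on simulated data; the weights assume the error SD is proportional to x, which is an assumption you would need to justify for real data):

```python
# Robust (heteroscedasticity-consistent) standard errors and weighted least
# squares, applied to the same simulated heteroscedastic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 300)
y = 2.0 + 1.5 * x + rng.normal(0, 0.4 * x)  # error SD proportional to x
X = sm.add_constant(x)

# 1) Same OLS coefficients, but HC3 robust standard errors for inference.
robust = sm.OLS(y, X).fit(cov_type="HC3")

# 2) WLS: down-weight noisier observations. If Var(error) is proportional
#    to x**2, the appropriate weights are 1/x**2 (assumed here).
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()

print("OLS + HC3 standard errors:", robust.bse)
print("WLS standard errors:      ", wls.bse)
```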

Heteroscedasticity vs Homoscedasticity

When the residuals are observed to have unequal variance, it indicates the presence of heteroscedasticity. When the residuals have constant variance, this is referred to as homoscedasticity: the variance of the residuals is the same across all values of the independent variables.

A simple example of Heteroscedasticity

The relationship between food expenditures and income

People with lower incomes tend to have food expenditures that are restricted by their budget. As incomes increase, people tend to spend more on food, as they have more options and fewer budget restrictions. Wealthier people can access a wide variety of foods with very few budget restrictions.

Therefore, there is greater variance in the food expenditures of wealthier people than in those of lower-income individuals. In such a situation, the variance of the residuals is unequal across the independent variable (income). If one were to run a regression on such a dataset, one would find heteroscedasticity present.
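A quick simulated version of this example (the income range and coefficients are purely hypothetical) shows how a formal test would flag it:

```python
# Simulate food spending whose variability grows with income, then test it.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(5)
income = rng.uniform(20, 200, 500)  # hypothetical incomes (in thousands)
food = 5 + 0.15 * income + rng.normal(0, 0.05 * income)

X = sm.add_constant(income)
fit = sm.OLS(food, X).fit()
_, p_value, _, _ = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {p_value:.4g}")  # small p -> heteroscedasticity
```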

So, heteroscedasticity is, after all, nothing but another concept floating in the ocean of statistics.
