Clarifying Regression Diagnostics for Linear Regression
How to test the required assumptions
Regression diagnostics are a set of techniques for testing the validity of a regression model in a variety of ways. These techniques can include examining the underlying mathematical assumptions of the model, exploring the model structure by considering formulas with fewer, more, or different explanatory variables, or analyzing subsets of observations, such as searching for those that are poorly represented by the model, like outliers, or that have a relatively large effect on the model's predictions. In this blog post, I’ll show you some of the approaches and tests that you can use for regression diagnostics.
One test for normality is the Jarque-Bera, or JB, test. It is generally used for large data sets: its statistic only approximately follows its reference distribution (a chi-squared distribution with 2 degrees of freedom) when the sample is large, and other checks, such as Q-Q plots (which will be discussed shortly), can become harder to judge when there are many points. To see whether the data fit a normal distribution, the Jarque-Bera test checks whether the skewness and kurtosis of the residuals match those of a normal distribution (a skewness of 0 and a kurtosis of 3). It is a common way to inspect the distribution of errors in regression.
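In Python, the JB test can be run using statsmodels. A JB value of approximately 6 or greater implies that the errors are not normally distributed. The reason is that, under the null hypothesis of normality, the JB statistic approximately follows a chi-squared distribution with 2 degrees of freedom, whose critical value at alpha = .05 is about 5.99. A JB score above that value corresponds to a p-value below your alpha, meaning that the normality null hypothesis is rejected. In comparison, a value near 0 means that the skewness and kurtosis of the residuals are consistent with a normal distribution.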
Q-Q plots are also used to check normality. When drawn against the quantiles of a normal distribution, Q-Q plots are also called normal probability plots. These plots are a good way to inspect the distribution of model errors. Normal Q-Q plots are a valuable visual check of how closely your residuals match what you would expect from a normal distribution: if the residuals are normal, the points fall roughly along a straight line. From Q-Q plots, skew, heavy- and light-tailed distributions, and outliers, all of which are departures from normality, can be identified. Below is an example of a Q-Q plot from one of my projects (and not a very good one!)
The Goldfeld-Quandt, or GQ, test is used in regression analysis to check for heteroscedasticity in the errors. Heteroscedasticity means that the variance of the error term is not constant across observations. The GQ test looks for it by ordering the observations by a chosen variable, splitting them into two groups, and checking whether the residual variance differs between the groups. In other words, it tests whether a variable can be identified that separates observations with low error variance from those with high error variance. It’s a parametric test that assumes the errors are normally distributed. This means that it is standard procedure to test for normality before moving on to the GQ test. Statsmodels provides the ability to run this test as well.
When judging the results of the GQ test, you will be looking at the F-statistic, keeping in mind that homoscedasticity is the null hypothesis. The F-statistic is the ratio of the residual variances of the two groups, so high F values suggest that the variances differ. The greater the F-statistic, the more evidence we have against the assumption of homoscedasticity, and the more likely we are to have heteroscedasticity. If the error term is homoscedastic, there would be no systematic difference in residual variance between the groups, and the F value would be close to 1. The p-value of the test indicates whether or not to reject the homoscedasticity null hypothesis: a p-value below your alpha means you reject it. A very common alpha level for rejecting null hypotheses is 0.05.
Thank you for reading! I hope this helped you better understand the different tests that can be used when checking the required assumptions for linear regression.