Python Diagnostic Plots for OLS Linear Regression (Plots Similar to R)

Vijay Prayagala
6 min read · Dec 18, 2019


In statistical terms, linear regression is an approach for finding the relationship between a dependent variable (sometimes referred to as the target or label) and one or more independent variables (sometimes referred to as features or explanatory variables). When there is only one independent variable, the regression is called Simple Linear Regression; when the process uses more than one independent variable, it is called Multiple Linear Regression. The dependent variable in linear regression is a continuous numerical type.

Linear regression is a parametric regression model because we assume the functional form that describes the dependent variable in terms of the independent variables. Since the functional form is assumed, all that needs to be done for a parametric model is to estimate the best values for its parameters.

The statistical model is assumed to be

Y = Xβ + μ, where μ ∼ N(0, Σ). Here μ represents the error component, which is assumed to follow a normal distribution with mean 0 and covariance Σ.

Depending on the properties of Σ, we have four classes available:

  • GLS : generalized least squares for arbitrary covariance Σ
  • OLS : ordinary least squares for independent and identically distributed errors Σ=I
  • WLS : weighted least squares for heteroskedastic errors diag(Σ)
  • GLSAR : feasible generalized least squares with autocorrelated AR(p) errors Σ=Σ(ρ)

All regression models define the same methods and follow the same structure, and can be used in a similar fashion. Some of them contain additional model-specific methods and attributes. GLS is the superclass of the other regression classes, except for RecursiveLS.
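As an illustration of that common structure, here is a minimal sketch of fitting an OLS model with statsmodels; the toy data and variable names are my own, not from the article's repository.

    import numpy as np
    import statsmodels.api as sm

    # toy data purely for illustration
    rng = np.random.default_rng(0)
    x_train = rng.normal(size=(100, 3))
    y_train = x_train @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

    # statsmodels does not add an intercept automatically, so add a constant column
    X = sm.add_constant(x_train)
    lm = sm.OLS(y_train, X).fit()
    print(lm.summary())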

OLS:

OLS is a commonly used and simple regression method for understanding the relationship between dependent and independent attributes. It can be treated as a first step for studying correlations, p-values, t-statistics, coefficients, and the significance of attributes. Though it is a simple method that makes certain assumptions, it is still the most widely used method for understanding the effect of the independent attributes on the dependent one. Let's look at the assumptions made for the linear regression method:

  1. Linearity: there is a linear relationship between our features and responses. This is required for our estimator and predictions to be unbiased.
  2. No multicollinearity: features are not correlated. If this is not satisfied, our estimator will suffer from high variance.
  3. Gaussian (Normal Distributed) errors: our errors are Gaussian distributed with mean 0.
  4. Homoskedasticity: errors have equal variance. If this is not satisfied, there will be other linear estimators with lower variance.
  5. Independent errors: errors are independent and identically distributed

R provides a plot function for linear regression models that produces all the diagnostic plots. Producing the equivalent plots in Python is easy, but a little tricky: it is much simpler to get these plots if we build the OLS model using statsmodels rather than sklearn's linear model.

I have wrapped the plot functions in a class to generate these plots. An object is created by passing x and y to this class. The fit method of the class is used to build the OLS model, and the diagnostic method can then be used to generate the plots and a JSON summary response, as sketched below.
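A simplified skeleton of such a wrapper class could look like the following; this is only a sketch of the idea, not the exact implementation from the repository linked below.

    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    from statsmodels.stats.stattools import durbin_watson

    class LinearRegressionResidualPlot:
        def __init__(self, x, y):
            # x: 2-D array of features, y: 1-D array of targets
            self.x = x
            self.y = y

        def fit(self):
            # build the OLS model with an explicit intercept term
            X = sm.add_constant(self.x)
            self.lm = sm.OLS(self.y, X).fit()
            return self.lm

        def diagnostic_plots(self, lm):
            # residuals vs fitted as a first diagnostic; the full version would
            # also draw the QQ, scale-location and influence plots shown below
            fig, ax = plt.subplots()
            ax.scatter(lm.fittedvalues, lm.resid, alpha=0.5)
            ax.set_xlabel("Fitted values")
            ax.set_ylabel("Residuals")
            plt.show()
            summary = lm.summary()
            diag = {"Residual_AutoCorrelation_Test": durbin_watson(lm.resid)}
            return summary, diag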

Usage:

  1. Create an object using x, y
  2. Use the class fit method for OLS
  3. Pass this model to the diagnostic_plots method to generate the plots and summary

For example:

    linear_plot = Plot.LinearRegressionResidualPlot(x_train.values, y_train.values)
    lm = linear_plot.fit()
    summary, diag_res = linear_plot.diagnostic_plots(lm)

The diagnostic plots can be used to validate whether the assumptions hold. If an assumption fails, approaches like transforming the features or fitting a polynomial regression can be used, or a more flexible algorithm can be chosen to fit the data; one such remedy is sketched below.
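As one illustrative example of such a remedy, a squared term can be added for a feature whose residual plot shows curvature. This reuses the toy x_train and y_train from the earlier sketch and is not a prescription for the data set used in this post.

    import numpy as np
    import statsmodels.api as sm

    # add x1^2 as an extra feature and refit; a purely illustrative transformation
    x_poly = np.column_stack([x_train, x_train[:, 0] ** 2])
    lm_poly = sm.OLS(y_train, sm.add_constant(x_poly)).fit()
    print(lm_poly.summary())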

I have used a sample toy data set to generate the plots below. The data and source code are available in a GitHub repository for reference: https://github.com/vprayagala/OLS_LR_DiagnosticPlots

  1. Linearity

The first assumption we check is linearity. We can visually check this by fitting an ordinary least squares (OLS) model, using it for prediction, and then plotting the residuals against the predictions. The rule of thumb for this plot is that there should be no pattern; the points should look randomly scattered for the linearity assumption to hold true.
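A minimal sketch of this residuals-vs-fitted check, assuming an already fitted statsmodels result object lm as in the earlier snippet:

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(lm.fittedvalues, lm.resid, alpha=0.5)
    ax.axhline(0, color="red", linestyle="--")
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")
    ax.set_title("Residuals vs Fitted")
    plt.show()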

2. Multicollinearity:

This can be verified from the "Cond. No." value in the summary of the OLS output; a high value indicates severe multicollinearity. Alternatively, it can be verified with the Variance Inflation Factor (VIF), as sketched below.
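A brief sketch of the VIF check with statsmodels, assuming X is the design matrix (including the constant column) from the earlier snippet:

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # VIF per column; values above roughly 5-10 are commonly read as a warning sign
    vif = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
    print(vif)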

3. Normal Distribution of Errors:

The error terms should be normally distributed. This can be verified using the QQ plot of the residuals: the points should follow the red reference line for this assumption to hold true.
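A minimal sketch of the QQ plot with statsmodels, again assuming a fitted result object lm:

    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    # standardize the residuals and compare them to the 45-degree reference line
    sm.qqplot(lm.resid, line="45", fit=True)
    plt.title("Normal Q-Q")
    plt.show()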

4. Homoskedasticity

The variance of the error terms should be constant. This can be verified by plotting the standardised residuals against the fitted values.
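One way to sketch this scale-location style check, assuming a fitted result object lm:

    import numpy as np
    import matplotlib.pyplot as plt

    # internally studentized (standardized) residuals from the influence measures
    std_resid = lm.get_influence().resid_studentized_internal
    fig, ax = plt.subplots()
    ax.scatter(lm.fittedvalues, np.sqrt(np.abs(std_resid)), alpha=0.5)
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("sqrt(|standardized residuals|)")
    ax.set_title("Scale-Location")
    plt.show()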

5. Influence Points

This plot shows the high-leverage and influential points, i.e. observations for which the model's predictions differ noticeably depending on whether those observations are included. Cook's distance is a common measure used to identify highly influential points.
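A minimal sketch of this check, assuming a fitted result object lm; statsmodels also ships a ready-made influence plot.

    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    influence = lm.get_influence()
    cooks_d = influence.cooks_distance[0]  # Cook's distance per observation
    print("Largest Cook's distances:", cooks_d.argsort()[-5:][::-1])

    # leverage vs studentized residuals, with point size scaled by Cook's distance
    sm.graphics.influence_plot(lm, criterion="cooks")
    plt.show()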

The idea of this post is to have a wrapper class that can be re-used to generate diagnostic plots similar to R's plot function by just passing the model object. In addition to these plots, it also returns the summary of the OLS model along with some basic tests of the assumptions. These can be checked together with the plots to draw conclusions about the assumptions and take the necessary steps.

Sample output for this data:

Summary of Regression: OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.703
Model: OLS Adj. R-squared: 0.702
Method: Least Squares F-statistic: 389.5
Date: Mon, 16 Dec 2019 Prob (F-statistic): 0.00
Time: 22:09:36 Log-Likelihood: -11149.
No. Observations: 2150 AIC: 2.233e+04
Df Residuals: 2136 BIC: 2.241e+04
Df Model: 13
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 70.9969 2.987 23.767 0.000 65.139 76.855
x1 7.5876 3.034 2.501 0.012 1.637 13.538
x2 4.4690 1.932 2.313 0.021 0.679 8.259
x3 1.5965 0.493 3.240 0.001 0.630 2.563
x4 -0.2435 0.234 -1.040 0.298 -0.703 0.216
x5 -1.7087 1.271 -1.345 0.179 -4.200 0.783
x6 85.2155 4.358 19.552 0.000 76.668 93.763
x7 61.3656 3.067 20.009 0.000 55.351 67.380
x8 5.6801 1.207 4.705 0.000 3.313 8.048
x9 -4.8651 1.348 -3.609 0.000 -7.509 -2.221
x10 -101.5185 3.226 -31.468 0.000 -107.845 -95.192
x11 43.8492 1.818 24.117 0.000 40.284 47.415
x12 27.1478 2.107 12.886 0.000 23.016 31.279
x13 25.4749 4.676 5.447 0.000 16.304 34.646
x14 29.2133 8.102 3.605 0.000 13.324 45.103
x15 16.3087 3.463 4.710 0.000 9.518 23.100
==============================================================================
Omnibus: 726.861 Durbin-Watson: 2.050
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5400.565
Skew: 1.392 Prob(JB): 0.00
Kurtosis: 10.248 Cond. No. 9.81e+15
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.04e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Diagnostic Tests of Regression:

{"Non_Linearity_Test": "Singular matrix",
 "Hetroskedasticity_Test": [["Lagrange multiplier statistic", 259.1734018053732],
                            ["p-value", 1.5954639458511352e-46],
                            ["f-value", 22.5214642097984],
                            ["f p-value", 3.870379196232425e-51]],
 "Residual_Normality_Test": [["Jarque-Bera", 5400.565198377461],
                             ["Chi² two-tail prob.", 0.0],
                             ["Skew", 1.3915834016963757],
                             ["Kurtosis", 10.248404065939253]],
 "MultiCollnearity_Test": [["condition no", 9811865318318130.0]],
 "Residual_AutoCorrelation_Test": [["p value", 2.050087413826988]]}

References:

  1. https://en.wikipedia.org/wiki/Linear_regression
  2. https://www.statsmodels.org/stable/regression.html
