Assumptions of Linear Regression and Easy Ways to Test Them

Pouria Salehi
Published in Human Systems Data
2 min read · Mar 1, 2017

Although this chapter describes various benefits of linear regression with a number of examples, such as prediction, effect size, identifying the best predictor, and synergy effects, it does not cover the assumptions that linear regression requires. It is important to know under which conditions we can use linear regression. For example, the dependent variable must be quantitative; otherwise, we cannot employ the linear regression method at all. There are other criteria our data set has to meet before we can confidently rely on a linear regression's results; otherwise, the model we have fitted may not be valid. In this post, we will learn about those requirements and how to test them.

According to the website of Duke University, four key assumptions legitimize the use of linear regression models for inference or prediction. Conveniently, all of them can be tested with Tukey's exploratory data analysis tools. Note that there are more assumptions, but let's start with these four important ones.

(1) Normal distribution of residuals
Errors should be normally distributed. Testing the normality of residuals is easy. First, calculate the residuals from the fitted equation. Then draw a histogram of them and check its skewness. If the skewness is between -2 and +2, we are good to move on to the second assumption. There are more formal tests and diagnostic plots built into software packages like JMP, Design-Expert, Minitab, and SPSS, but this rule of thumb keeps the check simple.
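As a minimal sketch of this check, here is how we might compute residuals and their skewness in Python with NumPy, using made-up data (the coefficients, noise level, and the -2/+2 cutoff from the rule of thumb above are illustrative, not from any particular study):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a linear relationship with normally distributed noise.
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 200)

# Fit y = b0 + b1*x by ordinary least squares and compute the residuals.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Sample skewness of the residuals (third standardized moment).
skew = np.mean((residuals - residuals.mean()) ** 3) / residuals.std() ** 3
print(f"skewness = {skew:.3f}")

# Rule of thumb from the post: skewness between -2 and +2 is acceptable.
print("normality looks OK" if -2 < skew < 2 else "check normality more formally")
```

In practice you would also draw the histogram (e.g., with matplotlib) and eyeball it, as the post suggests; the skewness number is just a quick numeric summary of the same idea.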

(2) Linearity of the Relationship between Variables
The relationship between the dependent variable and the independent variables should be linear, following a straight line rather than a curve. To examine this, we deploy another tool from exploratory data analysis: the scatter plot. Plot the residuals against the predicted (fitted) values; if the points scatter randomly with no systematic pattern, we can conclude that this condition is met. In SPSS you can produce this plot using ZRESID. Otherwise, by simply fitting a straight line when there is significant curvature, we may face lack of fit.

(3) Statistical Independence of the Errors
This assumption matters mainly when the observations are ordered in time, as in a longitudinal study. If in our research we collect the data only once, then the study is cross-sectional and we generally do not need to worry about this third requirement of linear regression.
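For time-ordered data, one common numeric check (not mentioned in the post, so treat it as a supplementary sketch) is the Durbin-Watson statistic on the residuals; values near 2 suggest no first-order autocorrelation, while values near 0 or 4 suggest positive or negative autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical time-ordered data with independent errors.
x = np.arange(100, dtype=float)
y = 5.0 + 0.3 * x + rng.normal(0, 2.0, 100)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals, divided by the sum of squared residuals. Near 2 = no
# first-order autocorrelation; near 0 or 4 = positive or negative.
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"Durbin-Watson = {dw:.2f}")
```

Since the noise here is generated independently, the statistic lands near 2; feeding in residuals from data with serially correlated errors would push it toward 0.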

(4) Constant Variance or Homoscedasticity
To investigate this assumption, we can reuse the scatter plot from the linearity check (ZRESID against the predicted values). Judging it takes some experience, but the rule of thumb is: if the dots fan out so that they fit inside a triangle, like a funnel, then we might have a problem with non-constant variance.
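As a rough numeric stand-in for spotting the funnel shape, we can compare the residual spread in the lower and upper halves of the fitted values. This sketch uses synthetic data whose noise deliberately grows with x; the 1.5 ratio threshold is an arbitrary illustrative cutoff, not a formal test (formal options include the Breusch-Pagan test):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical heteroscedastic data: error spread grows with x (funnel).
x = rng.uniform(1, 10, 300)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5 * x, 300)

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
residuals = y - fitted

# Compare residual spread in the lower vs. upper half of the fitted
# values; a large ratio suggests non-constant variance.
order = np.argsort(fitted)
lower, upper = residuals[order[:150]], residuals[order[150:]]
ratio = upper.std() / lower.std()
print(f"spread ratio (upper/lower) = {ratio:.2f}")
print("possible heteroscedasticity" if ratio > 1.5 or ratio < 1 / 1.5
      else "spread looks roughly constant")
```

On this data the upper half is clearly noisier, which is exactly the widening funnel the scatter plot would show.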

If any of these assumptions is violated, we may still be able to fix the problem with statistical techniques such as transformations, among others. For more information, please refer to these pages and videos:

https://goo.gl/iu2NQk
https://goo.gl/migimc
https://goo.gl/JRRqn3
