Assumptions of Linear Regression

Sakshi Manga · Published in Analytics Vidhya · May 18, 2020 · 4 min read

Linear Regression is a standard technique for analyzing the relationship between variables. It is a model that assumes a linear relationship between the input variables (x) and a single output variable (y). More specifically, it assumes that y can be calculated as a linear combination of the input variables (x).
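For example, with three input variables the model takes the form y = β0 + β1·x1 + β2·x2 + β3·x3 + ε, where the β coefficients are estimated from the data and ε is a random error term.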

In this blog, we are going to learn about some of its assumptions and how to check for them in a dataset.

  1. Linear Distribution
  2. Presence of Normality
  3. Multicollinearity
  4. Autocorrelation
  5. Homoscedasticity

Let’s discuss them one by one:

1. Linear Distribution: The relationship between each independent variable and the target variable should be linear, i.e., a change in an independent variable should correspond to a proportional change in the target. To check for linearity, we can simply plot a scatter plot of each feature against the target.

2. Presence of Normality: There are many distributions in statistics, and by the Central Limit Theorem the distribution of sample means approaches normality as the number of observations grows (a common rule of thumb is more than 30); for this reason, approximate normality is often assumed for large samples.

Presence of Normality simply means that all the features in the “X” feature matrix should follow a normal distribution, and to check this we can use a histogram.

3. Multicollinearity: It is defined as correlation among the features used for regression analysis, i.e., a measure of correlation among the columns of the “X” feature matrix.

For a good regression analysis, we don’t want the features to be heavily dependent on each other, since changing one might change another. We want very little or no multicollinearity, and to check for it we can use Pearson’s correlation coefficient or a heatmap.
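For reference, Pearson’s correlation coefficient between two variables x and y is r = cov(x, y) / (σx·σy); it ranges from −1 to +1, and values close to ±1 between two features signal strong linear dependence.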

4. Autocorrelation: It can be defined as correlation between adjacent observations in the vector of predictions (or of the dependent variable). Sometimes the value of y(x+1) depends on the value of y(x), which in turn depends on y(x−1). Stock market data, and time-series datasets in general, are typical examples of autocorrelated data, and we can use a line plot to check for its presence.

5. Homoscedasticity: (homo = same | scedasticity = scatter) It can be defined as a property of regression models where the errors (the “noise”, or random disturbance between input and output variables) have roughly the same variance across all values of the input variables.

If the error variance changes drastically across input values, the residual scatter plot takes a funnel shape; this condition is called heteroscedasticity and can break our regression model. We can use a scatter plot of residuals against fitted values to check for it in the dataset.

Now, let’s understand them one by one diagrammatically.

Consider a dataset having three features and one target variable.

What are the features?

TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)

Radio: advertising dollars spent on Radio

Newspaper: advertising dollars spent on Newspaper

What is the response?

Sales: sales of a single product in a given market (in thousands of widgets)
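To make the checks concrete, here is a minimal sketch in Python that loads such a dataset with pandas. The file name Advertising.csv and the column names TV, Radio, Newspaper, and Sales are assumptions based on the description above, so adjust them to match your copy of the data.

```python
import pandas as pd

# Load the advertising data (file name assumed; adjust to your copy)
df = pd.read_csv("Advertising.csv")

features = ["TV", "Radio", "Newspaper"]  # independent variables
target = "Sales"                         # dependent variable
print(df[features + [target]].head())
```

The snippets below reuse this df.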

1. Linear Distribution: To check this, we make a scatter plot between each independent variable and the target variable.
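A minimal sketch with matplotlib, assuming the df loaded above:

```python
import matplotlib.pyplot as plt

# One scatter plot per feature against the target; a roughly straight
# upward or downward trend suggests a linear relationship.
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, ["TV", "Radio", "Newspaper"]):
    ax.scatter(df[col], df["Sales"], alpha=0.6)
    ax.set_xlabel(col)
axes[0].set_ylabel("Sales")
plt.show()
```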

2. Presence of Normality: We draw a histogram for each independent variable and for the dependent variable.
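A minimal sketch, again assuming the df loaded above:

```python
import matplotlib.pyplot as plt

# One histogram per column; a roughly bell-shaped histogram suggests
# the variable is approximately normally distributed.
df[["TV", "Radio", "Newspaper", "Sales"]].hist(bins=20, figsize=(10, 6))
plt.show()
```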

3. Multicollinearity: To check for multicollinearity, we can use Pearson’s correlation coefficient or a heatmap.
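A minimal sketch using seaborn’s heatmap on the pairwise Pearson correlations, assuming the df loaded above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlations among the features; values close to
# +1 or -1 between two features indicate multicollinearity.
corr = df[["TV", "Radio", "Newspaper"]].corr(method="pearson")
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```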

4. Autocorrelation: We plot the residuals of the model in observation order and look for patterns.
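A minimal sketch, assuming the df loaded above; the model here uses scikit-learn’s LinearRegression, which the original post does not specify:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit a linear model and plot residuals in observation order; a visible
# pattern (waves, long runs of same-sign residuals) suggests autocorrelation.
X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.plot(residuals.values)
plt.xlabel("Observation order")
plt.ylabel("Residual")
plt.show()
```

Since this advertising data is not a time series, we would expect no strong pattern here.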

5. Homoscedasticity: We plot the residuals against the fitted values and check whether their spread stays roughly constant.
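A minimal sketch reusing model, X, and residuals from the previous step:

```python
import matplotlib.pyplot as plt

# Residuals vs fitted values; a constant band of points indicates
# homoscedasticity, while a funnel shape indicates heteroscedasticity.
fitted = model.predict(X)
plt.scatter(fitted, residuals, alpha=0.6)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residual")
plt.show()
```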
