Assumptions of Linear Regression

Jobanpreet Singh
3 min read · Sep 2, 2023


In this blog, I will cover five major assumptions of the Linear Regression algorithm. If you are not familiar with Linear Regression, check out my blog on it first. Let's proceed.

  1. Linear Relationship: There should be a linear relationship between the input and output features.

What if this assumption is violated?

  1. Apply non-linear transformations to the independent or dependent variables.
  2. Add other independent variables to the model.

Now, if we have multiple independent features, then every independent feature should have a linear relationship with the target feature.
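As a hedged illustration, here is a minimal Python sketch of checking this assumption with scatter plots and fixing a violation with a non-linear transformation (remedy 1 above). The DataFrame, column names, and data are hypothetical, made up only for this example.

```python
# Minimal sketch: inspect linearity per feature, then transform a violator.
# All data and column names here are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"X1": rng.normal(size=200), "X2": rng.normal(size=200)})
# y is linear in X1 but quadratic in X2, so X2 violates the assumption.
df["y"] = 2 * df["X1"] + df["X2"] ** 2 + rng.normal(scale=0.5, size=200)

# Scatter every independent feature against the target.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, col in zip(axes, ["X1", "X2"]):
    ax.scatter(df[col], df["y"], s=10)
    ax.set_xlabel(col)
    ax.set_ylabel("y")
plt.tight_layout()
plt.show()

# Remedy 1 from the list above: a non-linear transformation of the
# offending variable restores a linear relationship with y.
df["X2_squared"] = df["X2"] ** 2
```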

2. No Multicollinearity: Let's first understand what multicollinearity is.

Multicollinearity: It occurs when two or more independent features are highly correlated with each other. For example, suppose X1 and X2 are two independent features; if changing X1 produces a change in X2, or vice versa, they are collinear. This multicollinearity should not be present.

What is the problem with having multicollinearity?

In a regression model, our objective is to determine how each independent feature individually impacts the target variable.

Let's understand with the help of an example:
Y = b0 + b1X1 + b2X2 + b3X3

Here X1, X2, and X3 are independent features.

The mathematical significance of b1 is that if we change X1 while keeping X2 and X3 constant, Y changes by b1 per unit change in X1; similarly for b2 and b3. But under multicollinearity, if X1 changes, X2 also changes, because they are correlated. So the assumption breaks down, and we are no longer able to see the individual effect of each independent feature on the target feature.

“This makes the effect of X1 on Y difficult to differentiate from the effect of X2 on Y”
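To see the problem concretely, here is a small simulation (with hypothetical data) in which X2 is almost a copy of X1. Across repeated samples, ordinary least squares splits the true effect between the two features in an unstable way, so the individual coefficients cannot be trusted.

```python
# Hypothetical simulation: collinearity makes individual coefficients unstable.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
for trial in range(3):
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.01, size=100)    # X2 is nearly identical to X1
    y = 2 * x1 + rng.normal(scale=0.5, size=100)  # the true effect is on X1 only
    model = LinearRegression().fit(np.column_stack([x1, x2]), y)
    print(f"trial {trial}: b1 = {model.coef_[0]:+.2f}, b2 = {model.coef_[1]:+.2f}")
# b1 and b2 swing wildly from trial to trial, even though their sum stays
# near the true value of 2: the individual effects are blurred together.
```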

Detecting Multicollinearity using VIF (Variance Inflation Factor)

A correlation matrix and scatterplots can be used to find multicollinearity, but their findings only show bivariate relationships between independent features. VIF is preferred because it shows the correlation of a variable with a group of other variables.

“VIF determines the strength of the correlation between the independent variables”

R² is used to compute VIF: for each independent variable, regress it on all the other independent variables and take the R² of that regression. Then:

VIF = 1 / (1 − R²)

The closer that R² value is to 1, the higher the VIF and the stronger the multicollinearity between features.

When VIF = 1, there is no multicollinearity. Usually, a VIF value between 5 and 10 indicates high multicollinearity.
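Here is a minimal sketch of computing VIF with statsmodels' variance_inflation_factor. The DataFrame and feature names are hypothetical; X2 is built to be highly correlated with X1, so both should show large VIFs.

```python
# Hypothetical data: X1 and X2 are strongly correlated, X3 is independent.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "X1": x1,
    "X2": 0.9 * x1 + rng.normal(scale=0.1, size=200),
    "X3": rng.normal(size=200),
})

print(X.corr())  # the correlation matrix only shows pairwise relationships

# Add an intercept so each VIF regression includes a constant term.
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i)
     for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)  # expect large values for X1 and X2, and a value near 1 for X3
```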

Solutions to multicollinearity:
1. Drop the variables causing the problem.

2. Use ridge or lasso regression (see the sketch after this list).

3. Standardize the variables.

4. Increase the sample size.
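As a hedged sketch of remedies 2 and 3 together, the snippet below standardizes the features and fits ridge and lasso on collinear data. All names and data are hypothetical, and the alpha values are arbitrary.

```python
# Hypothetical collinear data; ridge shrinks correlated coefficients,
# while lasso may drop one of them entirely.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 0.9 * x1 + rng.normal(scale=0.1, size=200)])
y = 3 * x1 + rng.normal(size=200)

# Standardizing (remedy 3) ensures the penalty treats every feature equally.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
print("ridge coefficients:", ridge[-1].coef_)
print("lasso coefficients:", lasso[-1].coef_)
```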

3. Normal Residuals (Errors): If we plot the residuals, their distribution should be normal.

[Figure: a normal distribution of residuals]

What is a residual?

It is the difference between the actual value and the predicted value.

We can check this with the help of a Q-Q plot or a KDE plot.
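Here is a minimal sketch of both checks, assuming residuals from some fitted model; the residuals below are simulated as a stand-in.

```python
# Simulated residuals standing in for (actual - predicted) from a real model.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

rng = np.random.default_rng(3)
residuals = rng.normal(size=300)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
stats.probplot(residuals, dist="norm", plot=ax1)  # points should hug the line
sns.kdeplot(residuals, ax=ax2)                    # should look like a bell curve
ax2.set_xlabel("residual")
plt.tight_layout()
plt.show()
```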

4. Homoscedasticity: "Homo" means same and "scedasticity" means spread/scatter. The assumption is that when we plot the residuals, their spread should be equal. It should be uniform, just like in the image below.

It should not look like the two images below.
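The simulated plot below (hypothetical data) contrasts a uniform residual band, which satisfies the assumption, with a funnel-shaped spread, which violates it.

```python
# Simulated residuals vs. fitted values: uniform band vs. growing spread.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
fitted = np.linspace(0, 10, 300)
resid_homo = rng.normal(scale=1.0, size=300)         # constant spread: OK
resid_hetero = rng.normal(scale=0.2 * fitted + 0.1)  # spread grows: violation

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
ax1.scatter(fitted, resid_homo, s=8)
ax1.set_title("Homoscedastic (uniform band)")
ax2.scatter(fitted, resid_hetero, s=8)
ax2.set_title("Heteroscedastic (funnel shape)")
for ax in (ax1, ax2):
    ax.axhline(0, color="red", linewidth=1)
    ax.set_xlabel("fitted values")
ax1.set_ylabel("residuals")
plt.tight_layout()
plt.show()
```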

5. No Autocorrelation of Errors: It means that when we plot the residuals, there should be no pattern. A graph like the one in the image below should not appear.
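As a hedged sketch, the snippet below simulates uncorrelated and autocorrelated residuals and checks both with the Durbin-Watson statistic, a standard test the post does not mention: values near 2 mean no autocorrelation, while values near 0 or 4 signal trouble.

```python
# Simulated residuals: white noise (no pattern) vs. AR(1) (autocorrelated).
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
white = rng.normal(size=300)
ar = np.zeros(300)
for t in range(1, 300):
    ar[t] = 0.9 * ar[t - 1] + rng.normal(scale=0.3)

print("white noise Durbin-Watson:", durbin_watson(white))  # close to 2
print("AR(1) Durbin-Watson:", durbin_watson(ar))           # well below 2

# Plotting residuals in observation order also exposes the pattern:
# autocorrelated residuals drift in long waves instead of looking random.
plt.plot(ar)
plt.xlabel("observation order")
plt.ylabel("residual")
plt.show()
```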
