Linear Regression in Machine Learning

Lakshmi Sruthi
Published in Analytics Vidhya
Apr 22, 2021

What is Regression?

Regression is a method for predicting a continuous target (dependent) variable from one or more independent features. It falls under supervised learning. As a statistical tool, it describes the relationship between the outcome (dependent) variable and one or more variables, often called independent variables or predictors.

What is Linear Regression?

Linear regression is used for finding the linear relationship between the target and one or more predictors.

Simple linear regression models the relationship between the dependent variable (Y) and a single independent variable (X) and tries to find the best-fit line by minimizing the errors. You can describe this with a fitness function that says how good the model is, or with a cost function that measures how bad it is; the cost function measures the distances between the model's predictions and the training data. The objective of linear regression is to minimize these distances, also called errors or residuals. The predicted output is continuous and can range from -inf to +inf. The algorithm uses ORDINARY LEAST SQUARES (OLS).
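
Below is a minimal sketch of fitting a simple linear regression with OLS in Python using statsmodels; the data here is synthetic and purely illustrative.

```python
# A minimal sketch of ordinary least squares (OLS) for simple linear regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=100)            # independent variable
y = 3.0 + 2.0 * X + rng.normal(0, 1, 100)   # dependent variable with noise

X_const = sm.add_constant(X)                # adds the intercept column (beta_0)
model = sm.OLS(y, X_const).fit()            # minimizes the sum of squared residuals
print(model.params)                         # [intercept, slope]
print(model.summary())                      # full OLS summary report
```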

Best fit line formula (or) Predicted formula

1) Simple Linear Regression: Y = β0 + β1X

2) Multiple Linear Regression: Y = β0 + β1X1 + β2X2 + β3X3 + … + βnXn

β0 and β1 are two unknown constants: β1 is the slope and β0 is the Y-intercept.

P1 = actual Y data point for a given X

P2 = estimated Y value for a given X

Y bar = mean of Y

X bar = mean of X
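
As a quick illustration, the closed-form OLS estimates of the slope and intercept can be computed directly from the X bar and Y bar defined above; the numbers below are made up.

```python
# Hand-rolled closed-form OLS estimates for simple linear regression.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

x_bar, y_bar = x.mean(), y.mean()                                      # X bar, Y bar
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
beta_0 = y_bar - beta_1 * x_bar                                        # intercept

y_hat = beta_0 + beta_1 * x   # P2: estimated Y for each X
print(beta_0, beta_1)
```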

Now let us understand the performance measures of the linear regression model.

Error = actual Y value - predicted Y value

1. SSE (Residual or unexplained error): the sum of squared differences between the observed values of Y and the predicted values of Y.

2. SSR (Regression or explained error): the sum of squared differences between the predicted values of Y and the mean value of Y.

3. SST (Total sum of squares): SST quantifies the total variation of the observed values of Y around the mean of Y (SST = SSR + SSE).

4. MAE (Mean Absolute Error): the average absolute difference between the actual and predicted values.

5. MAPE (Mean Absolute Percentage Error): the average of the absolute differences between the actual and predicted values expressed as a ratio of the actual values, usually reported as a percentage.

6. RMSE (Root Mean Square Error): the square root of the average of the squared differences between the actual and predicted values.

7. R Square (R2): the R-square value gives the percentage of the variation in the dependent variable explained by the independent variables in the model; it is also known as the coefficient of determination. R2 ranges between 0 and 1, and a value close to 1 means the model's predictability is high. The value of R2 increases whenever we add more variables to the model, whether or not those variables actually contribute to it. This is the disadvantage of using R2.

8. Adjusted R Square: the Adjusted R2 value fixes this disadvantage of R2. Adjusted R2 improves only if the added variable makes a significant contribution to the model, because it adds a penalty for every extra variable. Adjusted R2 is always less than R2.

Adjusted R2 = 1 - [(1 - R2)(n - 1) / (n - k - 1)], where R2 is the R-square value, n = total number of observations, and k = total number of variables used in the model. If we increase the number of variables, the denominator (n - k - 1) becomes smaller and the overall ratio becomes larger; subtracting it from 1 then reduces the Adjusted R2. So, to increase the Adjusted R2, the contribution of the added features to the model must be significantly high.
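
The sketch below computes each of these performance measures with NumPy on hypothetical actual and predicted values, following the definitions above.

```python
# Performance measures of a linear regression model, computed from scratch.
import numpy as np

y = np.array([3.0, 5.0, 7.5, 9.0, 11.0])       # actual values
y_hat = np.array([3.2, 4.8, 7.1, 9.4, 10.9])   # predicted values
n, k = len(y), 1                               # observations, predictors

sse = np.sum((y - y_hat) ** 2)                 # residual (unexplained) error
ssr = np.sum((y_hat - y.mean()) ** 2)          # regression (explained) error
sst = np.sum((y - y.mean()) ** 2)              # total variation around the mean
mae = np.mean(np.abs(y - y_hat))               # mean absolute error
mape = np.mean(np.abs((y - y_hat) / y)) * 100  # mean absolute percentage error
rmse = np.sqrt(np.mean((y - y_hat) ** 2))      # root mean square error
r2 = 1 - sse / sst                             # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalized for extra variables
print(sse, ssr, sst, mae, mape, rmse, r2, adj_r2)
```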

Assumptions of Linear Regression

1. Multicollinearity

There should not be a high correlation between two or more independent variables. If multicollinearity is present in the model, we need to drop the offending features step by step. Multicollinearity can be checked using a correlation matrix, Tolerance, and the Variance Inflation Factor (VIF).

VIF = 1 / (1 - R2)
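
One way to compute the VIF per feature is the variance_inflation_factor helper in statsmodels; the small DataFrame below is invented just to show a near-collinear column.

```python
# Checking multicollinearity with the Variance Inflation Factor (VIF).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],   # almost a multiple of x1 -> high VIF
    "x3": [5, 3, 6, 2, 7, 4],
})
X_const = sm.add_constant(X)       # include the intercept when computing VIF

vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)                         # features with very large VIF are candidates to drop
```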

2. Normality

Normality here refers to the residuals: when the residuals follow a normal distribution, we say normality exists. If they are not normal, we can apply a transformation to the target variable. Normality can be checked using the Shapiro-Wilk test, a probability (Q-Q) plot, or skewness. Skewness should be computed on the residuals and should lie roughly between -0.5 and +0.5, while the Shapiro-Wilk p-value should be greater than 0.05.
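
A rough sketch of these normality checks on the residuals, assuming the fitted statsmodels results object `model` from the earlier sketch:

```python
# Normality checks on the residuals of a fitted OLS model.
from scipy import stats
import matplotlib.pyplot as plt

residuals = model.resid                      # residuals of the fitted model

stat, p_value = stats.shapiro(residuals)     # Shapiro-Wilk test; p > 0.05 -> looks normal
skewness = stats.skew(residuals)             # should lie roughly in [-0.5, +0.5]
stats.probplot(residuals, plot=plt)          # probability (Q-Q) plot
plt.show()

print(p_value, skewness)
```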

3. Linearity

There should be a linear relationship between the dependent and independent variables. If the data is not linear, we can transform the dependent variable to bring it into a linear form. Linearity can be checked using the rainbow test (linear_rainbow) from the statsmodels library; the p-value should be greater than 0.05.
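
A sketch of the rainbow test check, again assuming the fitted results object `model` from the earlier sketch:

```python
# Linearity check with the rainbow test from statsmodels.
from statsmodels.stats.diagnostic import linear_rainbow

f_stat, p_value = linear_rainbow(model)   # H0: the relationship is linear
print(p_value)                            # p > 0.05 -> no evidence against linearity
```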

4. Autocorrelation

Serial correlation (autocorrelation) is the association between a variable and a lagged version of itself; when the observations repeat a pattern over time, the residuals become correlated with each other. There should be no autocorrelation, and the check should be done on the residuals.

This can be checked with the Durbin-Watson test from the OLS summary report. If the Durbin-Watson statistic is 2, there is no autocorrelation; values roughly between 1.5 and 2.5 are acceptable, while values below 1.5 or above 2.5 indicate the presence of autocorrelation in the model.
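
The Durbin-Watson statistic can also be computed directly, assuming the same fitted results object `model`:

```python
# Autocorrelation check on the residuals with the Durbin-Watson statistic.
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)
print(dw)   # ~2 means no autocorrelation; roughly 1.5-2.5 is acceptable
```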

5. Homoscedasticity

The model's residuals should be homoscedastic, i.e., they should have uniform variance. This can be checked with het_goldfeldquandt from the statsmodels library; the p-value should be greater than 0.05.
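
A sketch of the Goldfeld-Quandt check, assuming the same fitted `model` as above:

```python
# Homoscedasticity check with the Goldfeld-Quandt test from statsmodels.
from statsmodels.stats.diagnostic import het_goldfeldquandt

f_stat, p_value, _ = het_goldfeldquandt(model.model.endog, model.model.exog)
print(p_value)   # p > 0.05 -> no evidence against constant (uniform) variance
```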

Keys

· If we have categorical data, we need to convert it to dummy variables before building the model (see the sketch after this list).

· No Multicollinearity

· Residuals should be normal

· Residuals should be linear

· No autocorrelation

· Presence of homoscedasticity.
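
For the first point, a minimal sketch of dummy-variable encoding with pandas; the column names are made up.

```python
# Converting a categorical column to dummy variables before model building.
import pandas as pd

df = pd.DataFrame({
    "city": ["Hyderabad", "Chennai", "Mumbai", "Chennai"],
    "price": [10.5, 8.2, 15.1, 7.9],
})
# drop_first=True avoids the dummy-variable trap (perfect multicollinearity)
df_encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(df_encoded)
```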
