Linear Regression: Pros & Cons

Satyavishnumolakala
3 min read · Jun 12, 2020

Pros & cons of one of the most popular ML algorithms

Linear Regression is a statistical method that allows us to summarize and study relationships between continuous (quantitative) variables. The term “linear” in linear regression refers to the fact that the method models the data as a linear combination of the explanatory/predictor variables (attributes).
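
Concretely, for p predictor variables the model is the familiar linear combination (standard notation, not taken from this article):

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

where the β’s are the fitted coefficients and ε is an error term.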

Here are some pros and cons of linear regression, a very popular ML algorithm:

Pros

Simple model: Linear regression is about the simplest model with which the relationship between one or more predictor variables and the predicted variable can be expressed; a minimal fit is sketched below.
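
As a minimal sketch, here is what fitting such a model looks like with scikit-learn on a tiny synthetic dataset (the library and the data are my choices, not the article’s):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic dataset: y is (roughly) 2*x1 + 3*x2 plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # an intercept plus one weight per predictor
```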

Computationally efficient: Linear regression is fast to train, as it does not require complicated calculations, and it produces predictions quickly even when the amount of data is large.
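
One reason it is so cheap: ordinary least squares has a closed-form solution, so no iterative optimization is needed. A NumPy sketch (the synthetic data is my own):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(size=10_000)

# Closed-form OLS: beta = (X'X)^-1 X'y; lstsq computes it in a
# numerically stable way, in a fraction of a second even at this size
X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend an intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta)  # intercept followed by the five weights
```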

Interpretability of the output: The ability of linear regression to quantify the relative influence of each predictor variable on the predicted value, when the predictors are independent of each other, is one of the key reasons for its popularity. The fitted model expresses directly how much a change in a predictor variable is associated with a change in the predicted or target variable.
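
For example, each fitted coefficient reads directly as “a one-unit increase in this predictor, holding the others fixed, moves the prediction by this amount.” A sketch with hypothetical feature names of my own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 5.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.2, size=200)

model = LinearRegression().fit(X, y)
for name, w in zip(["sqft", "age"], model.coef_):  # hypothetical feature names
    print(f"+1 unit of {name} -> {w:+.2f} change in prediction, others held fixed")
```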

Cons

Overly simplistic: The linear regression model is often too simple to capture real-world complexity.

Linearity assumption: Linear regression makes the strong assumption that the predictor (independent) and predicted (dependent) variables are linearly related, which may not be the case.

Severely affected by outliers: Outliers can have a large effect on the output, because the best-fit line tries to minimize the squared error for the outlier points as well, which can drag the model away from the rest of the data.
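
A quick sketch of how much a single extreme point can move the fit (synthetic data, my own construction):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=50).reshape(-1, 1)
y = 2 * x.ravel() + rng.normal(scale=0.5, size=50)

clean = LinearRegression().fit(x, y)

# Append a single wildly off-trend point at the edge of the range and refit
x_out = np.vstack([x, [[10.0]]])
y_out = np.append(y, -100.0)
dirty = LinearRegression().fit(x_out, y_out)

print("slope without the outlier:", clean.coef_[0])  # close to 2
print("slope with one outlier:   ", dirty.coef_[0])  # dragged well away from 2
```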

Independence of variables: Linear regression assumes that the predictor variables are not correlated with one another, which is rarely true in practice. It is therefore important to remove multicollinearity (for example, using dimensionality reduction techniques), because the technique assumes there is no relationship among the independent variables. In cases of high multicollinearity, two features with high correlation will influence each other’s weights and result in an unreliable model.
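
One common way to detect this is the variance inflation factor (VIF); a sketch using statsmodels (the synthetic data and the VIF > 10 rule of thumb are my additions, not the article’s):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)  # almost a copy of x1
x3 = rng.normal(size=500)                   # genuinely independent
X = add_constant(np.column_stack([x1, x2, x3]))

# A VIF well above ~10 is a common rule of thumb for problematic collinearity
for i, name in enumerate(["const", "x1", "x2", "x3"]):
    print(name, variance_inflation_factor(X, i))
```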

Assumes homoscedasticity: Linear regression models the mean of the predicted (dependent) variable as a function of the predictor (independent) variables, and assumes constant variance of the errors around that mean, which is unrealistic in most cases.
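
A quick way to check is to compare the residual spread across the range of a predictor; a sketch on deliberately heteroscedastic synthetic data of my own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=500)
y = 3 * x + rng.normal(scale=0.5 * x)  # noise grows with x: heteroscedastic

model = LinearRegression().fit(x.reshape(-1, 1), y)
resid = y - model.predict(x.reshape(-1, 1))

# Residual spread should be constant under homoscedasticity; here it is not
lo, hi = x < np.median(x), x >= np.median(x)
print("residual std, low x: ", resid[lo].std())
print("residual std, high x:", resid[hi].std())  # noticeably larger
```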

Inability to determine feature importance: As discussed under “Independence of variables” above, in cases of high multicollinearity two features with high correlation will affect each other’s weights. If we rerun a stochastic fit of the linear regression multiple times, we may get different weights for these two features each time, so we cannot really interpret the importance of these features. The sketch below illustrates the instability.
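
The article describes rerunning a stochastic fit; an equivalent way to see the same instability (my own sketch, using bootstrap refits of ordinary least squares rather than stochastic gradient descent):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly identical to x1
y = 3 * x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

# Refit on bootstrap resamples: the two correlated features trade weight
# from fit to fit, even though their sum stays near 3
for _ in range(3):
    idx = rng.integers(0, n, size=n)
    w = LinearRegression().fit(X[idx], y[idx]).coef_
    print(np.round(w, 2), "sum:", round(w.sum(), 2))
```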

In summary, despite all its shortcomings, linear regression can still be a useful tool for preliminary analysis: regularization (Lasso (L1) and Ridge (L2)) stabilizes the weights, data preprocessing handles outliers, and dimensionality reduction removes multicollinearity. The easy interpretability of the linear model keeps it widely used in statistics and data analysis.
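
As a sketch of those remedies, here is the collinear example from above refit with Ridge (L2) and Lasso (L1) from scikit-learn (the alpha values are arbitrary choices of mine):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.01, size=1000)  # collinear pair, as before
y = 3 * x1 + rng.normal(scale=0.1, size=1000)
X = np.column_stack([x1, x2])

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)  # L2 splits the weight evenly
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)  # L1 tends to zero one of them out
```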
