A Beginner’s Guide to Optimizing Linear Regression Models

Saiyam Bhatnagar · Published in Analytics Vidhya · 4 min read · Mar 20, 2020

Linear Regression is one of the most widely used statistical tools for Machine Learning problems. For those who are not familiar with it: Linear Regression is an approach to modelling the relationship between a dependent variable and one or more independent variables. In short, we seek to predict an unknown variable with the help of one or more known variables. This article focuses on ways to create a linear regression model that avoids over-fitting, generalizes well beyond the training data, and is computationally efficient at the same time.

The reader might wonder why this is worth pondering over. The reliability of a larger machine learning pipeline whose inputs rely on the outputs of a Linear regression can be seriously compromised if the regression over-fits or generalizes poorly. To add to the problem, a Linear regression model’s computational expense grows with the addition of explanatory variables (the variables used for predictions). For a quick look at what the Linear regression equation looks like, see the general form below.
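In its general form, with p explanatory variables, the equation reads:

Y = b0 + b1*X1 + b2*X2 + … + bp*Xp + e

where Y is the dependent variable, X1 … Xp are the explanatory variables, b0 … bp are the coefficients to be estimated and e is the error term.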

Suppose the available data-set contains hundreds of different features and the corresponding target values to train our regression model and obtain estimates of the coefficients (b0, b1, b2, …). With limited computational power, however, the conundrum of not only choosing the appropriate features but also estimating the optimal number of features to use still prevails. Thankfully, built-in Python libraries like scikit-learn and numpy readily provide estimates of the regression coefficients. Now we dive into the methodology of choosing the appropriate features and the number of features we want in our regression equation. Let the variable to be predicted be called ‘Y’. Our first objective is to compute the correlation coefficient between Y and each of our independent variables separately. For this we can use the corrcoef() function in the numpy library. A sample implementation is given below, where train_set is the training data (a pandas data frame).

import numpy as np

# print the correlation coefficient between Y and each explanatory variable
for column in train_set.columns:
    print(column, np.corrcoef(Y, train_set[column])[0, 1])

The output is the correlation coefficient between the dependent variable and each of the corresponding independent variables. Readers who are unfamiliar with the correlation coefficient are advised to read up on it first. Now list the variables in decreasing order of the magnitude of their correlation coefficient (remember that a correlation coefficient can also be negative); a sketch of this ranking step is given below. Prepare and pre-process these features to be candidates for the regression equation. Too many independent variables will over-fit the training data and result in a poor regression model. You can get further insight into over-fitting and under-fitting by reading about the Bias-Variance trade-off.
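As a rough sketch of the ranking step (assuming, as above, that train_set is a pandas data frame of candidate features and Y is the target), the features can be sorted by the absolute value of their correlation with Y:

import numpy as np

# correlation of each candidate feature with the target Y
correlations = {
    column: np.corrcoef(Y, train_set[column])[0, 1]
    for column in train_set.columns
}

# rank features from strongest to weakest absolute correlation
ranked_features = sorted(correlations, key=lambda column: abs(correlations[column]), reverse=True)
print(ranked_features)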

Having prepared enough candidate features, we can now check whether the addition of each independent variable improves the model’s accuracy without over-fitting it. For this we use the concept of explained versus unexplained variation. Generally speaking, the greater the explained variation, the better the predictive power of our model. However, in the quest for greater predictive power we tend to over-fit the model, so let’s learn how to estimate predictive power properly.

R-squared is the ratio of the explained variation to the total variation and is never greater than 1. Intuitively, R-squared signifies the predictive capacity of our model, and it never decreases when we add an independent variable to the regression model. Here adjusted R-squared comes to the rescue. Adjusted R-squared is not always increasing; it is a modified version of R-squared whose value decreases if the addition of another independent variable does not improve the predictive power of the model as expected. Therefore, if the value of adjusted R-squared decreases on addition of an extra independent variable, the model is not improving its predictive accuracy; instead it has just started over-fitting the training data-set.

Adjusted R-squared can be computed with the formula below, whereas R-squared itself is easily calculated by dividing the sum of squared deviations of the predicted Y values from the actual mean (mean(Y)) by the sum of squared deviations of the actual Y values from that same mean.

Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1)

where n is the number of observations in the training set and p is the number of parameters (independent variables) to be estimated.
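As a minimal sketch of both quantities in code (assuming y_actual holds the observed target values, y_predicted the model’s predictions, and p the number of independent variables in the fitted model):

import numpy as np

def r_squared(y_actual, y_predicted):
    # explained variation divided by total variation
    explained = np.sum((y_predicted - np.mean(y_actual)) ** 2)
    total = np.sum((y_actual - np.mean(y_actual)) ** 2)
    return explained / total

def adjusted_r_squared(y_actual, y_predicted, p):
    # penalizes R-squared for the number of independent variables used
    n = len(y_actual)
    r2 = r_squared(y_actual, y_predicted)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)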

Keep adding independent variables to the model, in the ranked order, until you find that the value of adjusted R-squared starts decreasing. This is the point beyond which adding more independent variables invites over-fitting. Stop there and pass the selected variables into a linear regression model using scikit-learn, numpy, etc.; a sketch of this selection loop is given below. This is the proposed methodology for creating a regression model that does not over-fit, generalizes well and is computationally efficient, since the number of features (independent variables) has also been kept in check.
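Putting the pieces together, a rough sketch of the selection loop (reusing the hypothetical ranked_features list and adjusted_r_squared() helper defined above, together with scikit-learn’s LinearRegression) could look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

selected, best_adj_r2 = [], -np.inf
for feature in ranked_features:
    candidate = selected + [feature]
    model = LinearRegression().fit(train_set[candidate], Y)
    predictions = model.predict(train_set[candidate])
    adj_r2 = adjusted_r_squared(Y, predictions, p=len(candidate))
    if adj_r2 <= best_adj_r2:
        break  # adjusted R-squared stopped improving: more variables would over-fit
    selected, best_adj_r2 = candidate, adj_r2

final_model = LinearRegression().fit(train_set[selected], Y)

The loop stops at the first variable whose addition fails to raise adjusted R-squared, which is exactly the stopping rule described above.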
