All about Linear Regression

Supreeth Manyam
10 min read · Oct 4, 2016


This blog post is intended for readers interested in the mathematics behind Linear Regression, as well as the best practices for building a Linear Regression model.

Linear Regression is a simple but useful supervised learning technique for predicting a quantitative variable (dependent variable {Y}) by modelling a linear relationship between it and one or more independent variables {X}.

The hypothesis function is the approximate relationship between the dependent variable and the independent variables.

We fit the hypothesis function so that it approximately estimates the dependent variable from the independent variables.

We minimize the squared error to find the regression coefficients, so that f(X) fits the data as accurately as possible.
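Restating this in standard notation (a sketch of the usual least squares setup, since the original formula images are not reproduced here): the hypothesis is linear in the coefficients, and the coefficients are chosen to minimize the residual sum of squares.

```latex
% Hypothesis: a linear function of the k predictors
\hat{Y} = f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k

% Least squares objective over the n training observations
\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Big)^2
```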

Reducible error can be of two kinds: error due to variance and error due to bias.

Error due to variance refers to the amount by which the estimated hypothesis function would change if it were estimated using a different training data set. Ideally, the estimate should not vary much across training sets. The least squares estimates will have low variance if n >> k.

n — number of observations in the training data

k — number of predictors

Bias is the error introduced by approximating a possibly complicated real-life relationship with a much simpler model, such as a linear fit. The least squares estimates will have low bias if the true relationship between response and predictors is approximately linear.
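For reference, the two components combine with the irreducible error in the standard decomposition of the expected test error at a point x0 (stated here as a reminder, not taken from the original figures):

```latex
E\Big[\big(y_0 - \hat{f}(x_0)\big)^2\Big]
  = \underbrace{\mathrm{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}}
  + \underbrace{\big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible error}}
```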

Linear regression is considered to be the best fit at the point where the combined error due to variance and error due to bias is minimum. This point is often referred to as the sweet spot, and this is where the bias-variance trade-off comes into the picture. I will discuss the trade-off in detail in a future post.

For Simple Linear Regression:
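For a single predictor, the least squares estimates have the familiar closed form (standard result, shown here because the original formula image is not reproduced):

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}
```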

One can also use linear algebra (the normal equation) to estimate the coefficients.
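A minimal NumPy sketch of that linear algebra route, solving the normal equation (X'X)β = X'y on toy data (all names and values here are illustrative, not from the original post):

```python
import numpy as np

# Toy data: 100 observations, 2 predictors (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so that the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Normal equation (X'X) beta = X'y, solved without forming an explicit inverse
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta_hat)  # approximately [3.0, 1.5, -2.0]
```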

Regression coefficients can also be obtained by an optimization algorithm called Gradient Descent.

The Gradient Descent algorithm uses the same squared error function, but arrives at the coefficients iteratively.

This cost function corresponds to batch gradient descent, where at each iteration the cost is computed over all the observations in the training set. If the training set is large, the number of observations m used per update can be reduced to a fraction of the data (mini-batch gradient descent) or to a single observation at a time (stochastic gradient descent, or online learning) so that the algorithm converges faster.

Algorithm:
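A minimal sketch of batch gradient descent for linear regression (my own illustration; the variable names, learning rate and iteration count are assumptions, not the author's code):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent on a design matrix X whose first column is ones."""
    m, k = X.shape
    theta = np.zeros(k)
    for _ in range(n_iters):
        residuals = X @ theta - y           # prediction errors on all m observations
        gradient = (X.T @ residuals) / m    # gradient of the (1/2m) squared error cost
        theta -= lr * gradient              # simultaneous update of every coefficient
    return theta

# Example usage on toy data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = 4.0 + 2.5 * X[:, 1] + rng.normal(scale=0.2, size=200)
print(gradient_descent(X, y))  # approximately [4.0, 2.5]
```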

Notice that the cost function for linear regression is always a quadratic function of the coefficients. Hence it is convex, and the algorithm will always converge to the global minimum, unlike more complex models where gradient descent can get stuck at a local minimum. However, choosing the right value of the learning rate is critical. If the learning rate is too small, gradient descent can be too slow to converge; if it is too large, gradient descent can overshoot the minimum and may fail to converge, or even diverge. A recommended approach is to start at 0.001, increase roughly threefold at each step (0.001, 0.003, 0.01, 0.03, ...), and choose the learning rate that works best.

Interpretation of the regression coefficients:

These standard errors are used to construct confidence intervals for the population regression coefficients and to carry out statistical tests on the coefficients.

Null hypothesis: There is no relationship between X and Y

Alternate hypothesis: There is some relationship between X and Y

We compute a test statistic and look for evidence against the null hypothesis. In practice, we use the t-statistic to assess the significance of individual coefficients.

A t-statistic greater than about 2 in absolute value, or a corresponding p-value below the significance level of 0.05, gives strong evidence against the null hypothesis. We therefore reject the null hypothesis, which implies that there is a relationship between X and Y.
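For reference, the t-statistic for a coefficient is simply its estimate divided by its standard error (standard definition, restated here because the original formula image is not reproduced):

```latex
t = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}
```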

For multiple linear regression, we additionally conduct a joint statistical test on all the coefficients using the F-statistic.

The F-statistic needs to be large (> 1), with a corresponding p-value below 0.05, to reject the null hypothesis that none of the predictors is related to the response.

m is the number of observations and k is the number of independent variables.
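With those symbols, the F-statistic compares the variation explained by the model to the residual variation (standard form, where TSS is the total sum of squares and RSS the residual sum of squares):

```latex
F = \frac{(\mathrm{TSS} - \mathrm{RSS}) / k}{\mathrm{RSS} / (m - k - 1)}
```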

Handling Qualitative Variables:

Most Python and R libraries take only numeric objects as input. When there are categorical variables, whether ordinal or nominal, they need to be properly encoded before they can be given as input to the algorithm.

If there are n levels in a categorical variable, only n-1 dummy variables need to be introduced, to avoid multicollinearity. The process of encoding categorical variables will be discussed in detail in the practical section.
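A minimal pandas sketch of this n-1 dummy encoding (the column names and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "price":        [200, 340, 150, 410],
    "neighborhood": ["A", "B", "C", "A"],  # nominal variable with n = 3 levels
})

# drop_first=True keeps n-1 = 2 dummy columns and avoids perfect multicollinearity
encoded = pd.get_dummies(df, columns=["neighborhood"], drop_first=True)
print(encoded.columns.tolist())
# ['price', 'neighborhood_B', 'neighborhood_C']
```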

Model Assessment:

The quality of the linear regression fit is assessed using two quantities: the residual standard error (RSE) and the R-squared statistic.

RSE is an estimate of the standard deviation of the error term and can be read as a measure of the lack of fit.

Adjusted R-squared corrects the statistic for the number of independent variables in the model. R², the coefficient of determination, can be computed from a correlation: in simple linear regression it equals the squared correlation between X and Y, while in multiple linear regression it equals the squared correlation between the response and the fitted values.
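In symbols, with n observations and k predictors (standard definitions, restated because the original formula images are not reproduced):

```latex
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}, \qquad
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad
\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - k - 1)}{\mathrm{TSS}/(n - 1)}
```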

R² increases automatically as the number of independent variables increases (overfitting). Even though adjusted R² accounts for this, R² alone may not be the right metric at all times. Hence, we have to validate and choose the best model using the following techniques.

Model Assumptions and validation:

  1. Relationship between the predictors and the response variable is additive and linear.

Additive: The effect of changes in a predictor X on response Y is independent of the values of the other predictors.

To relax the additive assumption, synergy or interaction terms between the variables can be introduced, which may capture the behaviour of the response variable better. It is recommended to keep the main effects X1 and X2 in the model even when terms such as the interaction X1*X2, or X1² or X2², are introduced.

The concept of interaction is also applicable to qualitative variables or a combination of qualitative and quantitative variables.
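A minimal statsmodels sketch of adding an interaction term while keeping the main effects (the data frame, column names and coefficients are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with a synergy (interaction) between x1 and x2
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 1 + 2 * df["x1"] + 3 * df["x2"] + 4 * df["x1"] * df["x2"] \
          + rng.normal(scale=0.5, size=300)

# In the formula, 'x1 * x2' expands to x1 + x2 + x1:x2,
# so the main effects stay in the model alongside the interaction term
model = smf.ols("y ~ x1 * x2", data=df).fit()
print(model.params)  # recovers roughly 1, 2, 3 and 4
```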

Linear: the change in Y due to a unit change in X is constant.

A strong pattern in the residual plot indicates non-linearity in the data.

If such patterns appear, non-linear transformations of the predictors (for example log X, √X or X²) may help the model capture the non-linear relationship.

2. Error terms should be independent and should follow a normal distribution, so that the standard errors computed for hypothesis testing are not underestimated.

The Durbin-Watson statistic is used to check the independence of the residuals, i.e. autocorrelation (correlation between the error terms).

The Durbin-Watson statistic lies between 0 and 4:

0–2 → positive autocorrelation

2 → no autocorrelation

2–4 → negative autocorrelation

A QQ-plot (theoretical quantiles vs. sample quantiles) is used to check whether the errors follow a normal distribution.
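A minimal sketch of both checks using statsmodels and SciPy (the fitted model and data here are purely illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Illustrative fit
rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
results = sm.OLS(y, X).fit()

# Independence of errors: a value near 2 suggests no autocorrelation
print("Durbin-Watson:", durbin_watson(results.resid))

# Normality of errors: points close to the reference line suggest normality
stats.probplot(results.resid, plot=plt)
plt.show()
```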

3. Error terms should have constant variance (Homoscedasticity).

Residual plots are used to check for homoscedasticity.

Heteroscedasticity can be handled by transforming the response variable Y (e.g. log Y or sqrt(Y)), since such transformations shrink the larger responses and thereby reduce the heteroscedasticity.

4. Multicollinearity

If two or more predictor variables are related to each other, it is difficult to separate out their individual contributions to the target variable, because of the additive assumption in the regression setting.

A correlation matrix is used to check for collinearity between pairs of predictor variables. However, a correlation matrix cannot detect multicollinearity among three or more variables; hence, the Variance Inflation Factor (VIF) is used to assess multicollinearity.

A solution for handling multicollinearity is to remove some of the collinear predictors, or to combine them into a single predictor (for example, their average).
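A minimal VIF sketch using statsmodels on synthetic data (a common rule of thumb flags VIF values above roughly 5-10; the variable names are made up):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic predictors where x3 is almost a copy of x1 (collinear)
rng = np.random.default_rng(4)
X = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
X["x3"] = X["x1"] + rng.normal(scale=0.05, size=500)

X_const = add_constant(X)  # compute VIFs with an intercept present
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x3 show very large VIFs; x2 stays near 1
```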

Recommendations to improve the performance of the model:

  1. Performing mean normalization (feature scaling) on the predictors helps gradient descent converge faster.

2. If the distributions of the predictors are skewed, the estimated coefficients can be distorted. Hence, it is a recommended practice to check numerical predictors for skewness and to transform them towards symmetry for improved results.
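A minimal sketch of both recommendations on hypothetical columns (the scaling and the log transform are my illustrative choices):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Hypothetical numeric predictors
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "area": rng.lognormal(mean=7, sigma=0.8, size=1000),     # right-skewed
    "age":  rng.integers(0, 100, size=1000).astype(float),
})

# 1. Mean normalization: zero mean, unit standard deviation
normalized = (df - df.mean()) / df.std()

# 2. Reduce right skew with a log transform on heavily skewed columns
print("skew before:", skew(df["area"]))
df["area"] = np.log1p(df["area"])
print("skew after:", skew(df["area"]))
```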

3. Outlier removal

An outlier is a point whose actual response value is far from the value predicted by the regression model. Leaving an outlier in may not move the regression line much, but it affects the interpretation of the model, since the RSE and the p-values of the model can change drastically.

Residual plots are used to detect outliers. Most statistical tools use the (studentized) residuals to detect them.

4. High leverage observation removal

An outlier has an unusual value of the response, whereas a high leverage observation has an unusual value of a predictor. High leverage points have a substantial effect on the regression fit, which in turn affects the coefficients.

5. Influential observation removal

An influential observation is one that combines the effects of an outlier and a high leverage point. Cook's distance is calculated to identify influential observations.

Observations with a Cook's distance higher than 1 are generally considered influential.
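A minimal sketch of these diagnostics via statsmodels' influence measures (the model, the injected outlier and the thresholds are illustrative):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative fit with one injected outlier
rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)
y[0] += 15
results = sm.OLS(y, X).fit()

influence = results.get_influence()
student_resid = influence.resid_studentized_external  # outliers: |value| > ~3
leverage = influence.hat_matrix_diag                  # high leverage observations
cooks_d, _ = influence.cooks_distance                 # influential: > 1 (rule of thumb)

print("possible outliers:", np.where(np.abs(student_resid) > 3)[0])
print("max leverage:", leverage.max())
print("max Cook's distance:", cooks_d.max())
```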

Model Diagnostics:

  1. Test for the overall model fit (R² and adjusted R²).
  2. Test for the statistical significance of the overall model (F-statistic and its p-value).
  3. Test for the statistical significance of each predictor's coefficient (t-statistic and its corresponding p-value).
  4. Test for normality (QQ-plot), homoscedasticity (residual plot) and autocorrelation (Durbin-Watson statistic) of the residuals.
  5. Test for multicollinearity.
  6. Test for outliers, high leverage points and influential observations.

Model Selection:

Subset selection: Out of k predictors, only a subset might contribute to the linear regression model; the rest could be noise, which may lead to underestimating or overestimating the errors.

To select the best subset of variables, the test error needs to be estimated, either directly through cross-validation or indirectly by applying an adjustment to the training error using the metrics discussed below.

Metrics to select the best subset:

  • Mallow's Cp: An estimate of the test error that adds a penalty to the training error for the bias introduced by the predictors used. The lower the Cp, the better the model.
  • Akaike information criterion (AIC): An estimate of the information lost when the model is used to predict the response. AIC's main goal is to select the model that predicts the response most effectively.
  • Bayesian information criterion (BIC): Similar to AIC, but it penalizes models with more predictors more heavily and therefore tends to exclude noise predictors. BIC's main goal is to identify the features that actually influence the response variable.
  • Adjusted R²: The proportion of the variance in the response explained by the independent variables, adjusted for the number of predictors.
  1. Best subset selection: Consider all possible combinations of predictors, estimate the test error indirectly using the metrics above (Cp, AIC, BIC, adjusted R²) or directly through cross-validation, and choose the subset with the least error. This exhaustive search works when there are few variables, but becomes cumbersome as the number of predictors grows.
  2. Forward stepwise selection: Start with the single variable that gives the least test error and add one predictor at a time, without removing any variable once added, until all variables are exhausted or the test error stops decreasing (see the sketch after this list).
  3. Backward stepwise selection: Start with all the variables and remove one predictor at a time, the one whose removal gives the least test error, without adding any variable back, until the test error stops decreasing.
  4. Hybrid stepwise selection: A combination of forward and backward stepwise selection, wherein at a particular step a variable can be either added or dropped to improve performance on the test data.
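A minimal sketch of forward stepwise selection using AIC as the indirect estimate of test error (my own illustration built on statsmodels OLS; the data and names are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X, y):
    """Greedily add, one at a time, the predictor that lowers AIC the most."""
    selected, remaining = [], list(X.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic  # intercept-only model
    while remaining:
        scores = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic
                  for c in remaining}
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best_aic:
            break                                    # no candidate improves AIC
        best_aic = scores[candidate]
        selected.append(candidate)
        remaining.remove(candidate)
    return selected

# Toy data: only x1 and x2 matter, x3 is noise
rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] - 3 * X["x2"] + rng.normal(size=300)
print(forward_stepwise(X, y))  # typically ['x2', 'x1']
```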

Shrinkage or Regularization:

Regularization is a technique used to control overfitting by reducing variance. For instance, there could be variables which do not actually contribute to the response, yet whose coefficients significantly influence the predictions as part of the hypothesis function.

Ridge Regression:

Ridge regression is best applied after standardizing the predictors. As the regularization parameter lambda increases, the coefficients shrink towards zero; as a result, flexibility decreases, variance decreases and bias increases. However, ridge regression does not exclude any variable completely, since it only shrinks the coefficients towards zero (effectively assuming they are randomly distributed about zero), and hence interpretability decreases.

Lasso Regression:

In lasso, coefficients close to zero become exactly zero, depending on the value of the regularization parameter, so lasso effectively performs variable selection. Hence, interpretability increases.

Use cross-validation to arrive at the best value of the regularization parameter in both ridge and lasso regression.
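A minimal scikit-learn sketch of tuning the regularization parameter by cross-validation, with the predictors standardized first as recommended for ridge (the alpha grid and the toy data are assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 5 predictors, only the first two matter
rng = np.random.default_rng(8)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=300)

alphas = np.logspace(-3, 3, 50)

ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5)).fit(X, y)
lasso = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5)).fit(X, y)

print("best ridge alpha:", ridge[-1].alpha_)
print("best lasso alpha:", lasso[-1].alpha_)
print("lasso coefficients:", lasso[-1].coef_)  # noise coefficients shrink to exactly 0
```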

I will be using the House Prices knowledge competition data to illustrate the power of linear regression through best practices and techniques, along with visualizations. (Yet to be uploaded)

References:

  1. Machine Learning, Stanford University, taught by Andrew Ng on Coursera.
  2. An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
