Linear Regression

Shikhar Chhabra
Let’s Deploy Data.
Jul 27, 2020 · 7 min read

Linear Regression is a method used to define the relationship between a dependent variable (Y) and an independent variable (X), which is simply written as y = mx + b

where y is the dependent variable, x is the independent variable, m is the slope or scale factor (coefficient), and b is the bias coefficient. The bias coefficient gives the model an extra degree of freedom. The goal is to draw the line of best fit between X and Y, which estimates the relationship between X and Y.

Suppose we have a dataset that contains information about the relationship between the ‘number of hours studied’ and ‘marks obtained’. Many students have been observed, and their hours of study and grades are recorded. This will be our training data. The goal is to design a model that can predict the marks obtained given the number of hours studied. Using the training data, a regression line is obtained that gives the minimum error. This linear equation is then used for any new data: if we give the number of hours studied by a student as input, our model should predict their marks with minimum error.

Y(pred) = b0 + b1*x

The values b0 and b1 must be chosen so that they minimize the error. If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain the line that minimizes this error.
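As a minimal sketch of what this fitting step looks like in practice (using made-up hours/marks values, not a real dataset), the coefficients b0 and b1 that minimize the sum of squared errors can be found with an ordinary least-squares fit, for example via scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: hours studied (X) and marks obtained (y)
hours = np.array([[1], [2], [3], [4], [5], [6]])   # shape (n_samples, 1)
marks = np.array([35, 45, 50, 62, 70, 78])

# Fit Y(pred) = b0 + b1 * x by minimizing the sum of squared errors
model = LinearRegression().fit(hours, marks)
b0, b1 = model.intercept_, model.coef_[0]
print(b0, b1)

# Predict the marks for a student who studied 4.5 hours
print(model.predict([[4.5]]))

LinearRegression solves for the coefficients in closed form; the same line could also be obtained with numpy’s polyfit.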

Making Predictions with Linear Regression

Given that the representation is a linear equation, making predictions is as simple as evaluating the equation for a specific set of input values.

Let’s make this concrete with an example. Imagine we are predicting weight (y) from height (x). Our linear regression model representation for this problem would be:

y = B0 + B1 * x1

or

weight = B0 + B1 * height

Where B0 is the bias coefficient and B1 is the coefficient for the height column. We use a learning technique to find a good set of coefficient values. Once found, we can plug in different height values to predict the weight.

For example, let’s use B0 = 0.1 and B1 = 0.5, plug them in, and calculate the weight (in kilograms) for a person with a height of 182 centimeters.

weight = 0.1 + 0.5 * 182

weight = 91.1

You can see that the above equation could be plotted as a line in two dimensions. B0 is our starting point regardless of the height. We can run through a range of heights from 100 to 250 centimeters, plug them into the equation, and get weight values, creating our line.
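Here is a rough sketch of that idea in Python (the coefficients B0 = 0.1 and B1 = 0.5 are just the illustrative values from above, not learned from data):

import numpy as np
import matplotlib.pyplot as plt

B0, B1 = 0.1, 0.5                    # illustrative coefficients from the example

heights = np.arange(100, 251, 10)    # heights from 100 to 250 cm
weights = B0 + B1 * heights          # evaluate the line at each height

plt.plot(heights, weights)
plt.xlabel("Height (cm)")
plt.ylabel("Predicted weight (kg)")
plt.show()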

Now that we know how to make predictions given a learned linear regression model, let’s look at some of the assumptions that a linear regression model should follow.

Assumptions

There are five basic assumptions of the Linear Regression algorithm:

1. Linear relationship between the features and target:

According to this assumption, there is a linear relationship between the features and target. Linear regression captures only linear relationships. This can be validated by plotting a scatter plot between the features and the target.

In the advertising example, the first scatter plot, of the feature TV vs. Sales, tells us that as the money invested in TV advertisements increases, the sales also increase roughly linearly. The second scatter plot, of the feature Radio vs. Sales, shows a partial linear relationship between them, although not a completely linear one.
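A quick sketch of this check, assuming a hypothetical advertising dataset stored in advertising.csv with columns "TV", "Radio" and "Sales" (the file name and column names are assumptions for illustration):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical advertising dataset with columns "TV", "Radio" and "Sales"
df = pd.read_csv("advertising.csv")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(df["TV"], df["Sales"])
axes[0].set_xlabel("TV")
axes[0].set_ylabel("Sales")
axes[1].scatter(df["Radio"], df["Sales"])
axes[1].set_xlabel("Radio")
axes[1].set_ylabel("Sales")
plt.show()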

2. Little or no multicollinearity between the features:

Multicollinearity is a state of very high inter-correlation or inter-association among the independent variables. It is a type of disturbance in the data that, if present, weakens the statistical power of the regression model. Pair plots and heatmaps (correlation matrices) can be used to identify highly correlated features.

The above pair plot shows no significant relationship between the features.

The heatmap gives us the correlation coefficient of each feature with respect to every other feature; in this example they are all below 0.4, so the features aren’t highly correlated with each other.
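Both plots can be produced with a few lines of seaborn, continuing with the same hypothetical advertising DataFrame (again, the file and column names are assumptions):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("advertising.csv")            # hypothetical dataset from above
features = df.drop(columns=["Sales"])          # keep only the feature columns

sns.pairplot(features)                         # pairwise scatter plots of the features
plt.show()

sns.heatmap(features.corr(), annot=True)       # correlation matrix as a heatmap
plt.show()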

Why is removing highly correlated features important?

The interpretation of a regression coefficient is that it represents the mean change in the target for each unit change in a feature, holding all of the other features constant. However, when features are correlated, a change in one feature in turn shifts another feature, and the stronger the correlation, the more difficult it is to change one feature without changing another. It becomes difficult for the model to estimate the relationship between each feature and the target independently, because the features tend to change in unison.

3. Homoscedasticity assumption:

Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the features and the target) has the same variance across all values of the independent variables. A scatter plot of residual values vs. predicted values is a good way to check for homoscedasticity. There should be no clear pattern in the distribution; if there is a specific pattern, the data is heteroscedastic.

In the example residual plots, the leftmost graph shows no definite pattern, i.e. constant variance among the residuals. The middle graph shows a specific pattern in which the error increases and then decreases with the predicted values, violating the constant-variance rule. The rightmost graph also exhibits a specific pattern, with the error decreasing as the predicted values increase, again depicting heteroscedasticity.
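A minimal sketch of this residuals-vs-predicted check, reusing the toy hours/marks model from earlier (the data is made up for illustration):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical toy data and model (same as the earlier sketch)
hours = np.array([[1], [2], [3], [4], [5], [6]])
marks = np.array([35, 45, 50, 62, 70, 78])
model = LinearRegression().fit(hours, marks)

predicted = model.predict(hours)
residuals = marks - predicted

# Homoscedastic residuals show no clear pattern around the zero line
plt.scatter(predicted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()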

4. Normal distribution of error terms:

The fourth assumption is that the errors (residuals) follow a normal distribution. However, a less widely known fact is that, as the sample size increases, the normality assumption for the residuals is not needed. More precisely, if we consider repeated sampling from our population, then for large sample sizes the distribution (across repeated samples) of the ordinary least squares estimates of the regression coefficients follows a normal distribution. As a consequence, for moderate to large sample sizes, non-normality of the residuals should not adversely affect the usual inferential procedures. This result is a consequence of an extremely important result in statistics known as the central limit theorem.

Normal distribution of the residuals can be validated by plotting a q-q plot.

Using the q-q plot we can infer whether the data comes from a normal distribution: if it does, the plot will show a fairly straight line. Deviations from that straight line indicate a lack of normality in the errors.
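One way to draw the q-q plot is with scipy, again reusing the toy model’s residuals (illustrative data only):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

# Hypothetical toy data and model (same as the earlier sketches)
hours = np.array([[1], [2], [3], [4], [5], [6]])
marks = np.array([35, 45, 50, 62, 70, 78])
model = LinearRegression().fit(hours, marks)
residuals = marks - model.predict(hours)

# Q-Q plot of the residuals against a theoretical normal distribution
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()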

5. Little or no autocorrelation in the residuals:

Autocorrelation occurs when the residual errors are dependent on each other. The presence of correlation in error terms drastically reduces the model’s accuracy. This usually occurs in time series models where the next instant is dependent on the previous instant.

Autocorrelation can be tested with the help of the Durbin-Watson test. The null hypothesis of the test is that there is no serial correlation.
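The statistic is readily available in statsmodels; it ranges from 0 to 4, where values near 2 indicate no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation. A sketch on the same toy residuals:

import numpy as np
from sklearn.linear_model import LinearRegression
from statsmodels.stats.stattools import durbin_watson

# Hypothetical toy data and model (same as the earlier sketches)
hours = np.array([[1], [2], [3], [4], [5], [6]])
marks = np.array([35, 45, 50, 62, 70, 78])
model = LinearRegression().fit(hours, marks)
residuals = marks - model.predict(hours)

print(durbin_watson(residuals))   # ~2 suggests little or no autocorrelation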

Evaluation Metrics

Now comes the big question: how do we know how well our model performs once we have a line of best fit?

Well, we have some evaluation metrics that help us know how well our model is performing. Following are the most important metrics that we might need:

RMSE (Root Mean Square Error)

It represents the sample standard deviation of the differences between predicted values and observed values (called residuals). Mathematically, it is calculated as:

RMSE = sqrt( (1/n) * Σ (y_i − ŷ_i)² )

where y_i is the observed value, ŷ_i is the predicted value, and n is the number of observations.
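A small sketch of computing RMSE with scikit-learn, using made-up observed and predicted values:

import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # square root of the mean squared error
print(rmse)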

MAE (Mean Absolute Error)

MAE is the average of the absolute differences between the predicted values and the observed values. The MAE is a linear score, which means that all the individual differences are weighted equally in the average. For example, the difference between 10 and 0 counts exactly twice as much as the difference between 5 and 0. The same is not true for RMSE, which, by squaring the residuals, penalizes large errors more heavily. Mathematically, it is calculated as:

MAE = (1/n) * Σ | y_i − ŷ_i |
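A matching sketch with scikit-learn, on the same made-up values:

import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical observed and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(mean_absolute_error(y_true, y_pred))   # average of |y_true - y_pred|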

R Squared (R²) and Adjusted R Squared

R Squared and Adjusted R Squared are often used for explanatory purposes: they describe how well the selected independent variable(s) explain the variability in the dependent variable. Both of these metrics are quite misunderstood, and therefore I would like to clarify them first.

Mathematically, R² is given by:

R² = 1 − MSE / Var(y) = 1 − ( (1/n) * Σ (y_i − ŷ_i)² ) / ( (1/n) * Σ (y_i − ȳ)² )

The numerator of the fraction is the MSE (the average of the squared residuals) and the denominator is the variance of the Y values. The higher the MSE, the smaller the R², and the poorer the model.
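A quick sketch, on the same made-up values, comparing scikit-learn’s r2_score with the by-hand formula above:

import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(r2_score(y_true, y_pred))

# Equivalent "by hand" computation: 1 - MSE / Var(y)
mse = np.mean((y_true - y_pred) ** 2)
var_y = np.mean((y_true - np.mean(y_true)) ** 2)
print(1 - mse / var_y)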

Adjusted R²

Just like R², adjusted R² also shows how well the terms fit a curve or line, but it adjusts for the number of terms in the model. It is given by the formula below:

Adjusted R² = 1 − (1 − R²) * (n − 1) / (n − k − 1)

where n is the total number of observations and k is the number of predictors. Adjusted R² will always be less than or equal to R².
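A final sketch that applies this formula to the toy values above (k = 1 predictor is an assumption for the example):

import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed and predicted values from a model with k predictors
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

n = len(y_true)   # total number of observations
k = 1             # number of predictors (assumed for this toy example)

r2 = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adj_r2)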
