Linear Regression, an in-depth view.

Bhanu Kiran
6 min read · Mar 5, 2022


The connection between data science and statistics is strongest when it comes to prediction. What kind of predictions, you may ask? Let me jump right into the topic with a simple example.

Consider this: now that times have advanced and we have automatic cars, you want the car to learn when to press the brakes so that it stops on time, without slowing down too much and without making the passengers feel uncomfortable. You get the idea, right?

In this blog, I will explain linear regression in the hope that you get an overview of what's happening behind the scenes and why we use a linear regression model. Because model selection is a big deal, this post aims to give you an idea of when and where you can use a linear regression model.

Simple Linear Regression

Let's say I have some data about the speed and braking distance of a car; if I plot it as a scatter graph, it would look something like this.

Scatter plot of Speed vs Distance in the inbuilt R cars dataset
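A plot like this can be reproduced with base R graphics; the axis labels below are my own choice, not part of the dataset:

plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Speed vs Distance")   # scatter plot of the built-in cars dataset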

At a glance, we can see an upward trend in the graph. To identify the relation between speed and braking distance, we can compute a correlation, which tells us whether the two depend on each other at all. But correlation only measures the strength of the relationship, whereas regression quantifies the nature of the relationship. A simple linear regression estimates how much Y changes when X is changed, and it is the simplest model of them all:

y = b0 + b1x

where y is the response/dependent/target variable and x is the explanatory/independent/feature variable. You will sometimes also see the same equation written as

y = a + bx

At a glance, this might look like the equation of a line, and that's exactly what it is! The two parameters are b0 and b1, where b0 is the intercept and b1 is the slope (gradient), which is the change in y divided by the change in x. Mathematically, the general task here is to work out the slope and the intercept. To put this into perspective, let us use the example of speed vs distance from the inbuilt cars dataset in R.

R output for the head() command of inbuilt cars dataset.
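If you want to peek at the data yourself, the same output comes from calling head() on the built-in dataset; the cars data frame has just two numeric columns, speed and dist:

head(cars)   # first six rows of the built-in cars dataset
str(cars)    # 50 observations of 2 variables: speed and dist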

In R, linear regression is run using the function lm() (lm stands for linear model). Here we use the distance as the dependent variable (target) and the speed as the independent or explanatory variable (feature):

cars.lm <- lm(cars$dist ~ cars$speed, data = cars)
cars.lm

and this outputs an intercept of b0 = -17.579 and a slope of b1 = 3.932:

Call:
lm(formula = cars$dist ~ cars$speed, data = cars)

Coefficients:
(Intercept)   cars$speed
    -17.579        3.932

Now, if I want to estimate/predict the stopping distance of a car based on a new speed value, I can just plug in the numbers accordingly. Say the new speed of the car is x = 21; then

y = b0+b1x

y = -17.579 + 3.932*21

ŷ ≈ 64.99 (read as y hat)
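The same prediction can be obtained with R's predict() function. A minimal sketch, assuming the model is refitted with the column names in the formula so that new data can be passed in cleanly (the name cars.lm2 is my own):

cars.lm2 <- lm(dist ~ speed, data = cars)              # same model, formula uses column names
predict(cars.lm2, newdata = data.frame(speed = 21))    # ~64.99, matching the hand calculation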

Goodness of fit

In general, a simple linear regression tries to find the best-fitting line for a set of data, such that there is minimal error between the line and the points. This brings us to fitted values vs residuals, an important concept in linear regression.

Fitted values are the "predictions", and residuals are what we call "prediction errors". The general linear regression formula can be written as:

y = b0 + b1x + e

where e is the prediction error. This term exists because we can only estimate or "predict" the value, and it is unlikely that the value we obtain falls right on the best-fitted line.

How do we calculate the residuals? Simple: we take the original value and subtract the predicted value from it:

e = y - ŷ

where ŷ is the predicted value and y is the original value. Considering a whole dataset, we sum the residuals over all data points, and hence

Σei = Σ(yi - ŷi)

A few points to remember about residuals. Firstly, residuals help us assess how well the regression line fits the data, and the best-fit line is obviously the one with the minimum error.

  • Positive residuals are data points above the line
  • Negative residuals are data points below the line

In order to treat positive and negative residuals the same way, we square the residual values. Hence we have the term residual sum of squares (RSS):

RSS = Σ(yi - ŷi)²
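As a quick check (reusing the cars.lm model fitted earlier; the variable names here are my own), the residuals and the RSS can be computed by hand or taken straight from the model object:

e   <- cars$dist - fitted(cars.lm)   # residuals: e = y - y-hat
rss <- sum(e^2)                      # residual sum of squares
rss
sum(residuals(cars.lm)^2)            # same value, using the residuals R stores in the model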

And there you have it: linear regression. Note that linear regression can only be used for data that has continuous variables.

In the real world, you would have multiple features/explanatory variables in the data, and for that you would use multiple linear regression. In theory, it is the same; the only difference is that the number of b values increases:

y = b0 + b1x1 + b2x2 + … + bnxn + e

The best-fitted line is the one with the least error, which is typically measured with the root mean squared error (RMSE), the square root of the mean of the squared residuals, and the residual standard error (RSE), which is the same quantity adjusted for degrees of freedom.
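As a small sketch (again reusing cars.lm; the variable names are my own), both error measures can be computed directly from the residuals:

res  <- residuals(cars.lm)               # e = y - y-hat
n    <- length(res)                      # number of observations
p    <- length(coef(cars.lm)) - 1        # number of explanatory variables
rmse <- sqrt(mean(res^2))                # root mean squared error
rse  <- sqrt(sum(res^2) / (n - p - 1))   # residual standard error, adjusted for degrees of freedom
rmse
rse                                      # matches the "Residual standard error" reported by summary() below

In R, a multiple linear regression is fitted with the same lm() call, just with more terms on the right-hand side of the formula: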

model.lm <- lm(df$target ~ df$feature1 + df$feature2 + ... + df$featuren, data = df)
model.lm

Assessing the model

Whether or not you have followed along so far, you might be wondering: how do I assess my model? How do I know if the model I have created for the data at hand is good or bad? And how can I be sure of it?

In R, once you have created a model you can run the following command to obtain a summary of the model:

summary(cars.lm)

Call:
lm(formula = cars$dist ~ cars$speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12

Ideally, you check for 3 values:

  1. R-squared
  2. F-statistic
  3. p-value

Generally, you want the R-squared value to be greater than 0.5; the closer it is to 1, the better the model. The F-statistic should be as large as possible, and the p-value must be as small as possible for the model to be considered significant. All three can be read off the summary output above, or pulled out of the summary object as sketched below.
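A small sketch (using the cars.lm model from earlier) of how these values can be extracted programmatically; r.squared, adj.r.squared and fstatistic are standard components of a summary.lm object:

s <- summary(cars.lm)
s$r.squared        # multiple R-squared (about 0.65 here)
s$adj.r.squared    # adjusted R-squared
s$fstatistic       # F-statistic with its degrees of freedom
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)   # overall p-value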

To plot your results in R, you use the following command:

plot(cars.lm)

While you can see 4 plots in the output, the first 2 plots are important and enough to assess the goodness of fit, and those are what we will discuss.
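By default, plot() on an lm object shows these diagnostic plots one at a time; a common trick (not specific to this example) is to arrange all four in a single window:

par(mfrow = c(2, 2))   # 2 x 2 grid of plots
plot(cars.lm)
par(mfrow = c(1, 1))   # reset the plotting layout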

Residuals vs Fitted plot of the regression model

  1. Residuals vs Fitted plot: this has to be as random as possible. If the points are not random, or you see them fanning in or out from left to right, it indicates heteroscedasticity, meaning that the variance of the residuals is non-uniform. This can be fixed by transforming your variables or by using weighted regression; a small example of such a transformation is sketched after this list.

QQ plot of the regression model

  2. The QQ plot, or quantile-quantile plot, must not show any signs of an "S" shape or a "banana"-shaped curve; the points should fall roughly along the diagonal line, which indicates that the residuals are approximately normally distributed.
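As a hedged illustration of the transformation mentioned above (not something the cars data necessarily needs; the name cars.log is my own), a skewed response can often be tamed with a log transform before refitting and re-checking the diagnostics:

cars.log <- lm(log(dist) ~ speed, data = cars)   # hypothetical log-transformed model
summary(cars.log)
plot(cars.log)                                   # re-check the diagnostic plots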

A few things to remember when deciding to use linear regression:

  1. As mentioned before, linear regression can only be used if both the explanatory and response variables are continuous.
  2. If the explanatory variable is categorical, then we would use ANOVA from a statistical point of view.
  3. If the model seems to be performing badly even though all the variables are continuous, you may have to transform your features.

And there you have it: linear regression, or hopefully a simple way to understand what's going on behind the scenes.

So far in this blog, I have discussed linear regression in theory, along with a simple R implementation and an example. The same can be done using Python's scikit-learn. For further reading, you could refer to the books I read from; the whole blog is also my own understanding of, and take on, the methods discussed.
