Linear Regression pt.2: The Math behind it all

Anwita Ghosh
6 min read · May 21, 2023


The linear regression algorithm estimates the value of a continuous numeric response from one or more predictors, assuming an approximately linear relationship between them. When we have a single predictor, we use a process called simple linear regression; when we have more than one predictor, we use multiple linear regression. Here, I present the math involved in both.

Simple Linear Regression

As its name suggests, simple linear regression is a simple affair: we have a continuous numeric response and exactly one predictor to base it on. The regression function takes the form:

Y ≈ β₀ + β₁X

We can read the above equation as ‘regressing Y on X’. Here, β₀ and β₁ are constants that represent the intercept and slope terms in the linear model, and are collectively called the model parameters. The goal is to fit a line that passes as closely to the data as possible.

However, in practice, β₀ and β₁ are unknown, so we need to estimate them from the data. Once estimated, they give us a prediction of the form:

𝑦̂ = 𝛽̂₀ + 𝛽̂₁x

Here, we estimate 𝛽̂₀ and 𝛽̂₁ from the training dataset, i.e. the data we use to ‘teach’ our model to identify patterns, so that it can estimate the outcome when new data comes in. The goal is to obtain estimates 𝛽̂₀ and 𝛽̂₁ such that the resulting line is as close to the observations in the data as possible.

There are many ways of measuring and enforcing this closeness, but the most common approach is to minimise what is called the least squares criterion.

The Method of Least Squares

Suppose we have a dataset with ’n’ observations of a predictor and a response, presented as a set of pairs: {(x₁, y₁), (x₂, y₂), … , (xₙ, yₙ)}.

Let the prediction for Y based on the 𝑖ᵗʰ value of X in the dataset be:

𝑦̂ᵢ = 𝛽̂₀ + 𝛽̂₁xᵢ

Now, these predictions won’t be perfect, which leaves us with residuals once the predictions 𝑦̂ᵢ have been made. Call these residuals εᵢ: the difference between the 𝑖ᵗʰ observed response (the actual yᵢ) and the value predicted by the linear model (𝑦̂ᵢ). Thus, the residual for the 𝑖ᵗʰ prediction is:

εᵢ = yᵢ − 𝑦̂ᵢ

Now, simply adding up these residuals would understate the difference between the actual and estimated responses: 𝑦̂ᵢ overestimates yᵢ in some cases (giving a positive εᵢ) and underestimates it in others (giving a negative εᵢ), so the residuals tend to cancel each other out and their sum can be close to zero. Thus, we define the Residual Sum of Squares (RSS), also called the Sum of Squared Errors (SSE), as:

RSS = ε₁² + ε₂² + … + εₙ² = Σᵢ₌₁ⁿ (yᵢ − 𝑦̂ᵢ)²

That is, we square each residual and add them up. This is done for two reasons:

  1. Squared numbers are never negative, so positive and negative residuals can no longer cancel each other out.
  2. Squaring exaggerates the larger errors. Larger differences between the actual and predicted response show up more prominently in the residual sum of squares, penalising predictions for being that far off from the actual response.

The method of least squares aims to choose the 𝛽̂₀ and 𝛽̂₁ that minimise the RSS. With a little calculus, we can show that the minimising values are:

𝛽̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²

𝛽̂₀ = ȳ − 𝛽̂₁x̄

where x̄ and ȳ are the sample means of the predictor and the response.
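
To see these formulas in action, here is a minimal NumPy sketch on a small made-up dataset; the data and variable names are purely illustrative:

```python
import numpy as np

# A small made-up dataset: one predictor x and one response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates for the simple linear model
beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0_hat = y_bar - beta_1_hat * x_bar

# Predictions and the residual sum of squares they leave behind
y_hat = beta_0_hat + beta_1_hat * x
rss = np.sum((y - y_hat) ** 2)

print(f"intercept: {beta_0_hat:.3f}, slope: {beta_1_hat:.3f}, RSS: {rss:.3f}")
```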

Computationally, we’d often do this with the help of a process called gradient descent, which iteratively adjusts the parameters to minimise a cost function until it arrives at the best values of 𝛽̂₀ and 𝛽̂₁.
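
As a rough illustration (not the exact routine any particular library uses), a bare-bones gradient descent for the simple linear model could look like this, with the mean squared error as the cost function and an arbitrarily chosen learning rate and iteration count:

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=5000):
    """Iteratively estimate the intercept and slope by minimising the MSE."""
    beta_0, beta_1 = 0.0, 0.0          # start from an arbitrary initial guess
    n = len(x)
    for _ in range(n_iters):
        y_hat = beta_0 + beta_1 * x    # predictions with the current parameters
        error = y_hat - y              # residuals with the current parameters
        # Gradients of the cost (1/n) * sum(error^2) w.r.t. each parameter
        grad_0 = (2 / n) * np.sum(error)
        grad_1 = (2 / n) * np.sum(error * x)
        # Step in the direction that reduces the cost
        beta_0 -= lr * grad_0
        beta_1 -= lr * grad_1
    return beta_0, beta_1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(gradient_descent(x, y))          # should land close to the closed-form estimates
```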

These estimates, 𝛽̂₀ and 𝛽̂₁, which minimise the RSS, characterise what is called the best fit line for the data, or the best linear description of the relationship between the response and the predictor.

Multiple Linear Regression

While simple linear regression is a convenient approach for predicting a response from a single predictor, real-life data usually contains several predictors affecting the response (sometimes running into thousands). Fitting a separate simple linear model for each of these predictors would consume more time and resources than a business (or individual) could realistically spend (or be willing to). A better idea is to accommodate all the predictors in a single model by assigning each of them their own slope coefficient, in a method called multiple linear regression.

Suppose we have ‘p’ predictors, X₁, X₂, … , Xₚ. The multiple linear regression model takes the form:

Y ≈ β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ

Suppose we represent the 𝑗ᵗʰ predictor by Xⱼ, where j = 1, 2, …, p. The corresponding slope coefficient 𝛽ⱼ can be interpreted as the average effect on Y of a one-unit increase in Xⱼ, holding all other predictors fixed.

As is the case with the simple linear model, the regression coefficients 𝛽₀, 𝛽₁, … , 𝛽ₚ are unknown and must be estimated from data. Once we have our estimates, 𝛽̂₀, 𝛽̂₁, … , 𝛽̂ₚ, we can predict the response using the equation:

𝑦̂ = 𝛽̂₀ + 𝛽̂₁x₁ + 𝛽̂₂x₂ + … + 𝛽̂ₚxₚ

Again, the residual for the 𝑖ᵗʰ observation (the 𝑖ᵗʰ of the n rows in the dataset, i = 1, 2, …, n) is the difference between the actual and predicted values of the response:

εᵢ = yᵢ − 𝑦̂ᵢ

The model parameters, 𝛽̂₀, 𝛽̂₁, … , 𝛽̂ₚ, can be estimated the same way as in the simple linear model: we choose them to minimise the sum of squared errors (or the RSS):

RSS = Σᵢ₌₁ⁿ (yᵢ − 𝑦̂ᵢ)² = Σᵢ₌₁ⁿ (yᵢ − 𝛽̂₀ − 𝛽̂₁xᵢ₁ − 𝛽̂₂xᵢ₂ − … − 𝛽̂ₚxᵢₚ)²
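
In matrix form, the 𝛽̂’s that minimise this RSS can be computed by solving a least squares problem directly. A minimal NumPy sketch, again on a tiny made-up dataset, might look like this:

```python
import numpy as np

# Tiny made-up dataset: n = 5 observations of p = 2 predictors
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([5.0, 4.1, 9.2, 8.9, 13.1])

# Prepend a column of ones so the intercept is estimated along with the slopes
X_design = np.column_stack([np.ones(len(y)), X])

# np.linalg.lstsq returns the beta_hat that minimises ||y - X_design @ beta_hat||^2,
# which is exactly the RSS above
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

intercept, slopes = beta_hat[0], beta_hat[1:]
rss = np.sum((y - X_design @ beta_hat) ** 2)
print(intercept, slopes, rss)
```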

Trying to find these coefficients manually is a long, complex process that I’d rather not get into here, but computationally, the process for simple and multiple linear regression isn’t all that different (in fact, the Python code is almost the same for both, the only difference being how many columns we specify as predictors).
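
For instance, with scikit-learn (one common choice), the fitting code barely changes between the two cases; the column names below are made up purely for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset: two predictor columns and a response column
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0],
    "y":  [5.0, 4.1, 9.2, 8.9, 13.1],
})

# Simple linear regression: a single predictor column
simple_model = LinearRegression().fit(df[["x1"]], df["y"])

# Multiple linear regression: the same call, just with more predictor columns
multiple_model = LinearRegression().fit(df[["x1", "x2"]], df["y"])

print(simple_model.intercept_, simple_model.coef_)      # estimates of beta_0 and beta_1
print(multiple_model.intercept_, multiple_model.coef_)  # estimates of beta_0, beta_1, beta_2
```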

Once we have estimated our parameters and predicted the response, we would like to know how accurate the model’s predictions are. There are many metrics that measure the accuracy of the predictions, the most prominent in the regression setting being the 𝑅² statistic (or ‘goodness of fit’) and the mean squared error (MSE). But that’s for a later post, as is the run-through of the linear regression process in Python (don’t worry, I have them lined right up).
