Demystifying the Math Behind Linear Regression

A Step-by-Step Guide

Jason Hayes
17 min read · Jan 13, 2023
Image by DALL-E 2

I’m sure we all somewhat remember the equation of a line y=mx+b. This equation is asking “if I take my input ‘x’ and multiply it by a constant ‘m’ (slope) then add a constant ‘b’ (y-intercept), what is the output?” If I do that for all numbers 1–10, where m=1.5 and b=2, and plot all the answers, I get the figure below:

y=mx+b plot

Looks like a line to me. Whereas solving for y is the process of applying the equation of a line to x when both m and b are known, regression is the process of solving for m and b when x and y are known.

Regression

Regression is an iterative algorithm: an equation is applied repeatedly, and its parameters are adjusted until some value is either maximized or minimized. Continuing with our example above, what if we only knew x and y, our input and output (answer), and wanted to find m and b, the slope and y-intercept parameters, so that we could map any future value of x to a y? This is done by repeating the following process:

  1. Guess what the values of m and b may be (most of the time these numbers are chosen randomly).
  2. Make predictions using those guesses.
  3. Calculate how wrong each prediction is. The difference between the actual and predicted value is called the residual, and a ‘Loss’ function converts it into a per-prediction error.
  4. Calculate the total error. This measures how wrong the predictions were over the entirety of the input data and is calculated by combining all residuals. The total error is known as the ‘Cost’ in Machine Learning.
  5. Correct the values of m and b by taking a step in the direction towards which the value of the cost (total error) is decreasing. This is done through a process known as Gradient Descent, which I will outline below.
  6. Repeat steps 2–5 until the cost stops decreasing or a stopping condition is met (i.e., the cost decreased by less than a set threshold value).

Let’s walk through each of these steps using the example above to solve for m and b using x and y.

Guess what the values of m and b may be

This is simple. Just choose any random number for each. For this example, I will begin with m=0.5 and b=0.25.

Make predictions using those guesses

For each value in x (1–10), run it through the equation of a line to get the prediction. (note: ^ is used to signify the prediction of a variable, so y with ^ over it means the predicted values of y):

Running each value of x through y=mx+b using the values of m and b above, the values of y^ are:
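Plugging x = 1 through 10 into ŷ = 0.5x + 0.25 gives:

ŷ = 0.75, 1.25, 1.75, 2.25, 2.75, 3.25, 3.75, 4.25, 4.75, 5.25

For reference, the true outputs from y = 1.5x + 2 are:

y = 3.5, 5, 6.5, 8, 9.5, 11, 12.5, 14, 15.5, 17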

Plotting the values of y and y^ over x produces the following graph:

Prediction vs. Actual Values

As you can see, our predictions are pretty far off. Let’s calculate how far off by computing the residual values.

Calculate how wrong each prediction is

The residual value of each prediction is calculated using the following formula:
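residual = y − ŷ

That is, for each data point, subtract the predicted value from the actual value.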

Plotting both y and y^ over x and including the residual line for each data point produces the following graph:

Adding residuals to prediction plot

The loss function (error per datapoint) used will be the Squared Error. There are others, but this function is common when using regression. It’s simply the square of each residual.

When applied over all values of y and y^, the residual and squared error calculation is as follows:
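Using the actual values (y) and predictions (ŷ) above, the ten data points work out to:

residuals (y − ŷ): 2.75, 3.75, 4.75, 5.75, 6.75, 7.75, 8.75, 9.75, 10.75, 11.75

squared errors (y − ŷ)²: 7.5625, 14.0625, 22.5625, 33.0625, 45.5625, 60.0625, 76.5625, 95.0625, 115.5625, 138.0625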

Calculate the cost (total error)

There are a number of ways to calculate the cost, the most common of which is the “Mean Squared Error”, abbreviated as MSE, which simply averages all the squared errors. To build a solid foundation for understanding what the Mean Squared Error is actually calculating, I’ll add elements of the equation one at a time until the entire Mean Squared Error formula is written.

The cost can be initially written as follows:
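cost = residual₁ + residual₂ + … + residualₙ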

This is a good start, but what if we had more data points? The cost would grow because we would have to add in more residual values. Does this mean that it’s more wrong? No! The cost only grew because more data points were added. This means that we are only able to compare cost values for datasets of equal length. To correct this, we will take the mean (average) of the residuals and set that as the cost value.
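cost = (1/n) · Σ residualᵢ

where n is the number of data points and the sum runs over all of them.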

All the above is doing is taking the sum of the residuals and dividing it by the total number of residuals to get the average residual value.

Lastly, we will square each residual before summing them all up and taking the mean. Why? Two reasons…

First, residuals can be negative, and the sum of an identical negative and positive number is 0 (-2+2=0). Does this mean there was zero error? No! It means the errors summed to zero. A more accurate answer would be to use the absolute value of each residual in the summation. This is called the Absolute Error.

example:
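Take the two residuals −2 and 2: summing them gives −2 + 2 = 0, but summing their absolute values gives |−2| + |2| = 4, which reflects the actual amount of error.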

Secondly, if I square a number greater than 1, I get a larger number, and if I square a larger number, I get a much larger number. As a number increases, its squared value increases at an ever-faster rate (often loosely described as exponential growth, though squaring technically produces quadratic growth). This is illustrated in the plot below.

Exponential Growth

By squaring each residual, not only are we ensuring a positive number (any positive or negative number squared is positive), we are also magnifying large residuals far more than small ones. These magnified residuals contribute more when calculating the cost, so the cost grows rapidly as the errors grow, which causes larger corrections to the values of m and b. Calculating the mean (average) of the squared residuals gives us the Mean Squared Error (MSE).
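MSE = (1/n) · Σ (yᵢ − ŷᵢ)²

For our ten data points this works out to roughly 60.8, compared with an average (absolute) residual of 7.25.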

Notice how much larger this value is compared to our original AVG cost value.

The plot below demonstrates how, as the average cost grows, the MSE grows much faster.

Avg cost vs. MSE

The algorithm we are slowly assembling will now correct more aggressively if we have larger residual values (errors) in our predictions. Now let’s move on to discuss how we will make those corrections.

Correct the values of m and b by using Gradient Descent

The procedure that enables an algorithm to gradually correct itself is called Gradient Descent. This will be the most challenging part of understanding how linear regression works, but it is crucial, as it is the foundation on which an algorithm can correct itself toward a minimum or maximum value.

In order to comprehend Gradient Descent, we must first discuss what a derivative is. A derivative gives the slope, or rate of change, of the output y with respect to the input x: how much y would change if x were increased or decreased by a very small amount. I know that might sound confusing, but allow me to demonstrate.

Let’s say we have the following function:
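f(x) = x²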

and we want to find the value of x that minimizes the output of f(x). Intuitively we know this value is 0:
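f(0) = 0² = 0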

Pretend we didn’t know that and wanted to solve for it. We would start by randomly selecting a value for x. For simplicity, let’s guess 2.

Now we pick a value very close to 2. Let’s pick 2.1.

Recall the equation to calculate the slope between 2 data points:
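slope = (y₂ − y₁) / (x₂ − x₁)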

Plugging in our numbers for x and y we get the following:
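f(2) = 2² = 4 and f(2.1) = 2.1² = 4.41, so:

slope = (4.41 − 4) / (2.1 − 2) = 0.41 / 0.1 = 4.1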

This tells us that as I increase x by 0.1 from 2, y increases by 0.41 from 4, giving me a rate of change (slope) of 4.1. Plotting this data produces the graph below:

Slope at x=2

What happens if x=1?
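f(1) = 1 and f(1.1) = 1.21, so the slope is (1.21 − 1) / 0.1 = 2.1.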

Slopes for x=1&2

The slope decreased (flattened), which means the rate of change decreased when moving from 1 to 1.1 as opposed to moving from 2 to 2.1. Let’s reduce x by 1 again and see what happens.
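This time f(0) = 0 and f(0.1) = 0.01, so the slope is (0.01 − 0) / 0.1 = 0.1.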

Slopes for x=0–2

Our slope is essentially flat, which means the change from x=0 to x=0.1 barely changes y.

To explore a wider range of x, let’s calculate and plot the data points and slopes for when x = -5 to +5:

Slopes of function for x=-5–5

Not exactly the most comprehensive plot, but if you look closely, you can see the slope is flat when x=0 (brown line), which is the minimum value this function produces. All other slopes point in the general direction of this minimum. What’s the takeaway? A slope will generally point towards the extremes of a function (maximum or minimum) and will flatten as you approach an extreme. I say generally because more complicated polynomial functions can have multiple peaks and valleys, so the slope may be pointing towards a local maximum or minimum instead of the actual (global) maximum or minimum. Just something to keep in mind as you will inevitably run into that situation in the future as you get deeper into this subject.

Wouldn’t it be nice if there was a way to calculate the slope at any point in a function without having to go through all the math we did above? It turns out that is what derivatives allow us to do.

Derivatives

Derivatives allow us to calculate the rate of change (slope) of a function at any point in that function. Said a different way, given any input value x for a function f(x), we are able to calculate the rate of change (slope) of the function f(x) at the input value x. This is exactly what we did above, but we will arrive at the same answer using the concept of derivatives.

Finding the derivative of a function involves following the rules of derivatives. There are a few, but in the case of our function above…

we only need the ‘Power Rule’. The Power Rule states that for a term with an exponent, such as x², you move the exponent to the front of the term as a multiplier and then decrease the exponent by 1.
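Applied to x², that looks like this:

d/dx (x²) = 2 · x^(2−1) = 2x¹ = 2x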

The exponent ‘2’ is moved in front of the variable x, and the exponent is decreased by 1, making it 1. Since anything to the power of 1 is itself, the exponent of 1 is dropped from the solution, leaving 2x.

Let’s apply this to our example above:
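f′(x) = 2x, so f′(2) = 4, f′(1) = 2, and f′(0) = 0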

An astute eye will notice a discrepancy between the solutions above (4, 2, and 0) and our original slope calculations (4.1, 2.1, and 0.1).

The difference stems from the step size we used to calculate the slope in our original solution. Remember, we moved 0.1 from 2.0 to 2.1. What if we moved only 0.01 from 2.0 to 2.01?
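slope = (2.01² − 2²) / (2.01 − 2) = (4.0401 − 4) / 0.01 = 4.01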

We are slowly moving closer to 4. This will continue as we decrease the step size.

What does this mean? With each smaller step we take from our datapoint, we are approaching the actual slope (rate of change) at that datapoint, 4, which we determined using derivatives.

Derivatives are one of the more confusing subjects in math to truly wrap your head around. If you are interested in understanding them at a deeper level, I highly encourage you to watch the video from the 3Blue1Brown YouTube channel linked below. He does an amazing job of explaining and visualizing what a derivative is.

Let’s use our knowledge of derivatives to help us solve for the values of m and b in our line equation (y=mx+b).

Recall where we left off:

Guess what the values of m and b may be

Make predictions using those guesses

Calculate how wrong each prediction is

Calculate the cost (total error)

Correct the values of m and b by using Gradient Descent

The goal is to find the correct values for m and b. If we knew the correct values for both variables, then the cost would be 0. Therefore, what we are trying to do is minimize the cost so we can find the correct values of m and b. This is accomplished by finding the derivative of the cost function (MSE) and using it to calculate the rate of change (slope) at each input (x) so we can take a step down the slope towards the minimum.

To simplify the derivation of the MSE, let’s rewrite the equation as a version seen previously above:
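MSE = (1/n) · Σ residualᵢ²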

Next, we’re going to replace ‘residual’ with the equation to solve for the residual.
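MSE = (1/n) · Σ (yᵢ − ŷᵢ)²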

To further simplify, let’s replace y^ with what it represents, the equation for a line:
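MSE = (1/n) · Σ (yᵢ − (m·xᵢ + b))²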

With the variables we are solving for (m and b) now present in the MSE equation, we can solve for the derivatives (rates of change) of the MSE with respect to m and b.

Starting with respect to m:
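∂MSE/∂m = ∂/∂m [ (1/n) · Σ (yᵢ − (m·xᵢ + b))² ]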

The summation and constant (1/n) can be moved outside the derivation:
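∂MSE/∂m = (1/n) · Σ ∂/∂m (yᵢ − (m·xᵢ + b))²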

To complete the derivative, we need to use the ‘Chain Rule’. The chain rule states that if a function is made up of another function (a function inside a function), you can find the derivative of the whole with respect to a variable by multiplying the derivative of the outer function with respect to the inner function by the derivative of the inner function with respect to that variable. The animation below shows the chain rule and how it is used to find the derivatives of the MSE.
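To keep the notation compact, call the inner function j = y − (mx + b) and the outer function k = j², so each k is one squared residual. The chain rule then says:

∂k/∂m = (dk/dj) · (∂j/∂m)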

Use the Power Rule to solve for the derivative of k with respect to j:
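dk/dj = d/dj (j²) = 2j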

A couple more derivative rules before moving forward. First, the derivative of a variable multiplied by a constant is the constant.

Second, any term that does not depend on the variable we are differentiating with respect to has a derivative of 0.

Using the rules above, we can solve for the derivative of j w.r.t. m:
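∂j/∂m = ∂/∂m [ y − (mx + b) ] = 0 − (x + 0) = −x

(y and b do not depend on m, so their derivatives are 0; the derivative of m·x with respect to m is the constant x.)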

Multiply the two derivatives together to solve for the derivative of k w.r.t. m:
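∂k/∂m = (dk/dj) · (∂j/∂m) = 2j · (−x)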

Write out j in terms of what it equals, y-(mx+b):
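∂k/∂m = 2(y − (mx + b)) · (−x)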

Move the negative x to the beginning of the statement for simplicity:
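∂k/∂m = −2x(y − (mx + b))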

Now solve with respect to b:
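∂j/∂b = ∂/∂b [ y − (mx + b) ] = 0 − (0 + 1) = −1, so:

∂k/∂b = 2j · (−1) = −2(y − (mx + b))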

Don’t forget the summation and the 1/n that we moved outside the derivatives.

For m:
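∂MSE/∂m = (1/n) · Σ −2xᵢ(yᵢ − (m·xᵢ + b))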

And b:
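∂MSE/∂b = (1/n) · Σ −2(yᵢ − (m·xᵢ + b))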

Done with derivatives! The math got a bit intense toward the end, but once you understand it, you will have unlocked the foundation on which machine learning is built. This concept can be applied to almost any algorithm. As long as you can identify a way to measure how wrong an algorithm is (cost), you can use derivatives to gradually reduce the cost and find the correct coefficient (m) and constant (b) to apply to each input (x).

The only thing left to do is iterate over the algorithm above and define how the variables m and b will update after each iteration. Since we are solving for the values of m and b that produce a cost (error) of 0, and the derivatives with respect to m and b are the slopes (rates of change) that point down towards the minimum, we update the values of m and b at each iteration by taking a step down the slope. This process is known as ‘Gradient Descent’. We find the derivatives of the cost (error) function with respect to each parameter (m, b); these derivatives are collectively referred to as the gradient. A step is then taken down the gradient towards the minimum.

The size of that step is controlled by the ‘Learning Rate’. If the steps taken are too large, the algorithm will correct faster, but it may never find the true minimum because the steps keep overstepping it. If the steps are too small, the algorithm won’t overstep a minimum, but it will take longer to converge (solve) and runs the risk of getting stuck in a local minimum (not the true/global minimum of the function). One skill you will develop over time is knowing what a good learning rate is based on the data.

By iterating a few times and plotting the updates after each iteration, you will understand the entire algorithm and be ready to turn it into code.
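Written as update rules, where lr is the learning rate, each iteration performs:

m = m − lr · ∂MSE/∂m

b = b − lr · ∂MSE/∂b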

Side note: notice how the MSE value itself is not actually used to update the variables ‘m’ and ‘b’ (only its derivatives are). It is computed strictly for tracking the performance of the algorithm to ensure it is working correctly, and it can be eliminated if tracking performance isn’t necessary.

First iteration:

Calculate y^

Calculate MSE (optional)

Update m

Update b:
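Starting from the guesses m = 0.5 and b = 0.25 with a learning rate of 0.01 (the same values used in the code below), the first iteration works out to approximately:

MSE ≈ 60.81

m: 0.5 → 1.4625

b: 0.25 → 0.395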

Prediction line after the first iteration

Second iteration:

Calculate y^

Calculate MSE (optional)

Update m

Update b
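With the updated values m = 1.4625 and b = 0.395, the second iteration works out to approximately:

MSE ≈ 3.29

m: 1.4625 → 1.6679

b: 0.395 → 0.4312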

Prediction lines after the second iteration

Third iteration:

Calculate y^

Calculate MSE (optional)

Update m

Update b
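Starting from m ≈ 1.6679 and b ≈ 0.4312, the third iteration works out to approximately:

MSE ≈ 0.65

m: 1.6679 → 1.7112

b: 0.4312 → 0.4441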

Prediction lines after the third iteration

The predictions are slowly moving towards the true line (the answer) as parameters m and b update with each iteration. The algorithm continues until a set number of iterations has been completed or a stopping condition is met (i.e., the change in MSE is below a specific threshold value).

After just 3 iterations, the MSE has dropped from 60.8 to 0.65.

MSE over 3 iterations

After 500 iterations, the predictions are close to the true line (y), meaning the correct values for m and b are close to being found.

Prediction lines after 500 iterations

You can now solve a linear regression to find variables m and b when given input x and output y. This means new input data x can be used to accurately predict future output data y. Below is the Python code I wrote to solve the linear regression problem above. I’ve commented the code to the best of my ability to help you follow along and understand what each line is executing.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Linear Regression ##

iterations=500 # number of iterations
m = 1.5 # actual value of m
b = 2.0 # actual value of b
x = np.arange(1,11) # input data 'x' = [1,2,3,4,5,6,7,8,9,10]
y = m*x+b # calculate the correct answers using actual m and b values
m_guess = 0.5 # initial guess of value of m
b_guess = 0.25 # initial guess of value of b
lr = 0.01 # learning rate


preds = [] # used to store the predictions from each iteration
mses = [] # used to store the mean squared error from every iteration
for i in range(iterations): # for (num iterations) iterations
    pred = m_guess*x + b_guess # calculate predictions using current m and b guesses
    preds.append(pred) # store the predictions for this iteration
    mse = np.mean(np.square(y-pred)) # calculate the mean squared error
    mses.append(mse) # add the mean squared error to the mses list
    print(f'mse = {mse}') # print the mse to the log
    m_guess -= lr*np.mean(-2*x*(y-pred)) # update m guess by stepping down the gradient
    b_guess -= lr*np.mean(-2*(y-pred)) # update b guess by stepping down the gradient

plt.plot(mses) # plot the mean squared error at each iteration
plt.title('Mean Squared Error per Iteration')
plt.xlabel('Iteration')
plt.ylabel('Mean Squared Error')
plt.show() # render and show the plot

plt.plot(np.arange(1,11), preds[0], ls='--', alpha=0.5, label='iteration 0') # plot the first prediction
plt.plot(np.arange(1,11), preds[49], ls='--', alpha=0.5, label='iteration 50') # plot the 50th prediction
plt.plot(np.arange(1,11), preds[99], ls='--', alpha=0.5, label='iteration 100') # plot the 100th prediction
plt.plot(np.arange(1,11), preds[499], ls='--', alpha=0.5, label='iteration 500') # plot the 500th prediction
plt.plot(np.arange(1,11), y, c='r', alpha=0.7, label=f'True Line') # plot the true y line
plt.title('Predictions vs. True') # set title
plt.xlabel('X') # set x axis label
plt.ylabel('Y') # set y axis label
plt.legend() # show a legend
plt.show() # render and show the plot

print(f'final m_guess: {m_guess}') # prints final m_guess value to log
print(f'final b_guess: {b_guess}') # prints final b_guess value to log

Linear Regression is great if you are trying to predict an output from a single input variable that has a linear relationship with that output:

The regression will find the best-fitting line by tuning the variables m and b to map any input x (hours worked) to an output y (paycheck). In this case, m would work out to be your hourly wage, and b would be a base salary (the value of y when x=0).

But what if there is more than one input? What if you would like to predict the value of a house based on its size and the year it was built? For that, we need Multiple Linear Regression, which I will cover in my next post.
