Machine learning — An illustrative mathematical explanation of linear regression using gradient descent

Elgizouli Mohamed
9 min read · Oct 4, 2023


Machine learning (abbreviated as ML) is a set of algorithms that gives computers the ability to “learn”. It uses data in the form of inputs and their corresponding outputs (or goals) to train computers with the help of various mathematical models. There are three types of ML algorithms:

1. Supervised learning:

Supervised learning is the most popular kind of ML algorithm. It tries to find a connection that maps, as closely as possible, a group of inputs (or features) to their corresponding outputs (or labels) with minimal loss. It’s the main subject of our article.

2. Unsupervised learning:

In this type, we provide only the input data and tell the algorithm to learn from this unlabeled (or raw) data, so its primary goal is to find hidden patterns, structures, or relationships within the data without any prior knowledge. It has many applications, one of them being recommendation systems, which suggest products or content based on user preferences.

3. Reinforcement learning:

In this kind, we have what’s called “an agent” learning to make decisions in an environment designed to achieve predetermined goals. Every time it comes closer to the goals, we reward it, so it knows whether it’s going in the right direction or not (basically, it’s teaching itself). One example of reinforcement learning is the parkour agent developed by DeepMind.

Funny GIF of DeepMind’s agent after teaching itself how to parkour

As we said, we will focus on supervised learning. Many excellent algorithms belong to it, such as k-nearest neighbors and decision trees, but here we will focus on linear regression, one of the simplest ML algorithms, and we will go deep into the math behind it.

So let’s do some math!!!!!

Linear Regression:

Machine learning algorithms are about predictions: we give them data and expect valid results in return. In linear regression, we do this by fitting a “linear function” to the data distribution. But when do we identify a function as linear?

The simplest form of a linear function is one that has only one feature with a positive weight and passes through the origin, the point (0, 0) of the XY-coordinate plane. Plotting it on the XY-coordinates looks like this:

and its equation looks like this:

f(x) = wx

w is an abbreviation for weight; it’s the same as the slope

where x is the value of the feature we enter to get a prediction, and w is the slope of the linear function: a measure of how fast the line’s values are changing, or how steep the line is. If the slope is positive, the line’s values increase as x grows; if it’s negative, they decrease. This GIF illustrates a line as its slope changes from -10 to 10:
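The same idea in code, as a minimal sketch (plain Python, with made-up inputs):

```python
def predict(x, w):
    """The simplest linear function: f(x) = w * x, passing through the origin."""
    return w * x

# The same inputs under different slopes: a larger |w| means a steeper line,
# and a negative w makes the values decrease as x grows.
for w in (-2.0, 0.5, 2.0):
    print(w, [predict(x, w) for x in range(1, 5)])
```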

So enough of this, let’s get into the real stuff!!!

We will explain linear regression with the help of a very famous example: house prices. Imagine you’re a broker who wants to sell a house in a particular neighborhood, and in this neighborhood the only factor that affects a house’s price is its size. Of course, as a professional broker, you must have an estimate for the price; you won’t just pull a number out of thin air. You need a criterion to make a good prediction, and the best estimation method in this case is to predict our price by comparing it to the prices and sizes of the houses that have already been sold in the same neighborhood. This is where ML comes in.

Let’s first plot the prices and sizes of the other houses:

To predict, we need to fit a line to these houses’ prices. We won’t just place it arbitrarily; there is an ideal position between these dots that we must find. We will use the line’s slope to move it and decide which position is best.

The possible lines we can choose from to decide our best fit

However, we need a method to compare the different lines and choose the one that fits our dataset best. For this, we have what’s called a “loss function”: a mathematical equation that calculates how far our prediction is from the real value. Of course, there are many loss functions, but here we will simply take the difference between the real and predicted values and then square it:

loss = (ŷ − y)²

where ŷ is the predicted price and y is the real price.

But this is just for one house, and we want the loss over the whole data distribution, so we sum the losses of all the houses and divide by the number of houses. In other words, we calculate the average loss, which has a different name: the “cost function”:

cost = (1/n) · Σᵢ (ŷᵢ − yᵢ)²

where n is the number of houses.
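Both equations are short enough to sketch in Python (the house sizes and prices below are toy numbers, made up so that the true slope is exactly 3):

```python
# Toy data: house sizes and sold prices (made up so the true slope is 3).
sizes = [50.0, 80.0, 120.0, 160.0]
prices = [150.0, 240.0, 360.0, 480.0]

def loss(predicted, real):
    """Squared-error loss for a single house."""
    return (predicted - real) ** 2

def cost(w, xs, ys):
    """Cost function: the average of the per-house losses for a given slope w."""
    return sum(loss(w * x, y) for x, y in zip(xs, ys)) / len(xs)

print(cost(2.0, sizes, prices))  # 12225.0 (a bad slope gives a large cost)
print(cost(3.0, sizes, prices))  # 0.0 (this line fits the toy data perfectly)
```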

The general graph of this function looks like a U shape:

the general shape of a function with a power of 2

Observe how the cost function’s value initially decreases and then climbs after it passes a particular threshold point. This point (the gray point) represents the minimum value of the cost function, and it is the cost of the best-fit line.

We can divide linear regression into two steps: forward propagation and backward propagation. Forward propagation is what we have seen so far: it gives us predictions from the input features and calculates, with the cost function, how far we are from the best-fit line. Backward propagation uses the value of the cost function to update our parameters in order to reach the best-fit line.

Backward propagation:

But before we dig into it, we must review a crucial subject: differentiation. For this, I asked ChatGPT to explain it: Differentiation is a way of figuring out how something changes when something else changes. It’s like understanding the relationship between two things.

Let’s say you have a car, and you want to know how its speed changes when you press the gas pedal (Acceleration). That’s where Differentiation comes in.

You start by measuring the car’s speed before you press the gas pedal, let’s say it’s going 30 miles per hour (mph). Then, you press the gas pedal and measure the speed again after a few seconds, and it’s now going 40 mph.

To find out how much the speed increased when you pressed the gas pedal, you use Differentiation. In this case, you’d say the car’s speed changed by 10 mph when you pressed the gas pedal. This is like finding the “rate of change” or “how fast something is changing.”

So, Differentiation helps you understand how one thing (speed) changes when you do something to another thing (press the gas pedal). It’s a fundamental concept in math and science used to analyze how things change in the real world.

me after using ChatGPT to save hours of writing and thinking:

So differentiation and the rate of change are the same thing. The derivative of a linear function is the same across all of its points (it’s the slope of the line), while the derivative of a non-linear function differs from one point to another; in fact, it’s the slope of the function’s tangent line at that point. But we can’t find a rate of change from just one point, so we will use a trick: we add a very small value (infinitesimally small), called h, to the x-coordinate of our point, and then calculate the rate of change between these two points. The equation looks like this:

f′(x) = lim (h → 0) [f(x + h) − f(x)] / h

“lim” here means h gets arbitrarily close to zero. This equation is called the derivative from first principles, and we use it to derive the various differentiation rules.
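We can check this definition numerically with a short sketch (the function x² and the step size h below are my own choices for illustration):

```python
def f(x):
    return x ** 2  # a simple non-linear function

def rate_of_change(f, x, h=1e-6):
    """Approximate f'(x) as (f(x + h) - f(x)) / h for a very small h."""
    return (f(x + h) - f(x)) / h

# The true derivative of x**2 is 2x, so at x = 3 we expect a value near 6.
print(rate_of_change(f, 3.0))  # ~6.000001
```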

The main thing we want out of differentiation is the sign: when we differentiate a function at a point where it’s increasing, the sign is positive, and at a point where it’s decreasing, the sign is negative. The tangent line has positive and negative slopes at different points, and it looks like this:

Note that, from here, there are two ways to proceed. We can either use a closed-form equation to solve the problem directly, which would work well in our example, or we can use gradient descent. We will use gradient descent, because many ML algorithms lack a closed-form solution and we want to build concepts for more complex articles upon this one. Additionally, in our case we have only one feature, but as the number of features increases, the closed-form solution becomes more computationally expensive than gradient descent.
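For completeness, here is what the closed-form route looks like in our one-feature, no-bias case (a sketch on the same toy data: setting the derivative of the cost to zero and solving for w gives w = Σxᵢyᵢ / Σxᵢ²):

```python
# Same toy data as before.
sizes = [50.0, 80.0, 120.0, 160.0]
prices = [150.0, 240.0, 360.0, 480.0]

# Setting d(cost)/dw = 0 and solving for w gives w = sum(x*y) / sum(x*x).
w_best = sum(x * y for x, y in zip(sizes, prices)) / sum(x * x for x in sizes)
print(w_best)  # 3.0, the best-fit slope, computed directly
```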

So, formally, gradient descent is a numerical optimization technique used to find the minimum of a function (in other applications, the maximum) by iteratively adjusting its parameters in the direction that decreases (or increases) the function’s value. It utilizes differentiation, so we will use it to find the rate of change of the cost function with respect to the line’s slope (w). Substituting wx in place of the predicted value, the cost function becomes:

cost(w) = (1/n) · Σᵢ (w·xᵢ − yᵢ)²

Differentiation uses many rules that have been derived from the first-principles equation we saw above. The derivative of the cost function with respect to the slope w looks like this:

d(cost)/dw = (2/n) · Σᵢ xᵢ · (w·xᵢ − yᵢ)
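Translated into Python (a sketch reusing the earlier toy data; the function names are mine), we can also sanity-check the result against the first-principles approximation:

```python
sizes = [50.0, 80.0, 120.0, 160.0]
prices = [150.0, 240.0, 360.0, 480.0]

def cost(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cost_derivative(w, xs, ys):
    """d(cost)/dw = (2/n) * sum of x_i * (w * x_i - y_i)."""
    return (2.0 / len(xs)) * sum(x * (w * x - y) for x, y in zip(xs, ys))

w = 2.0
analytic = cost_derivative(w, sizes, prices)
numeric = (cost(w + 1e-6, sizes, prices) - cost(w, sizes, prices)) / 1e-6
print(analytic, numeric)  # the two values should nearly match
```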

In gradient descent, we run forward and backward propagation multiple times until we are satisfied with the result. On the first iteration, we randomly initialize a slope for our line (it can be any number) and calculate the cost of this random line to see how far off we are. Then we find the derivative of the cost with respect to the slope and use it to update the slope with this equation:

w = w − learningRate · d(cost)/dw

On the second iteration, we use the updated slope and do all of this again, and so on for the following iterations, until we find a slope that gives us satisfactory results. The sign of the derivative in the equation above tells us which way to go (left or right, in our example). The “learningRate” parameter is customizable; it indicates how big each step is.
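Putting the pieces together, here is a minimal gradient-descent loop on the same toy data (the learning rate and number of iterations are arbitrary choices for this sketch):

```python
sizes = [50.0, 80.0, 120.0, 160.0]
prices = [150.0, 240.0, 360.0, 480.0]  # made up so the true slope is 3

def cost(w, xs, ys):
    """Average squared error for slope w (forward propagation)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cost_derivative(w, xs, ys):
    """d(cost)/dw = (2/n) * sum of x_i * (w * x_i - y_i)."""
    return (2.0 / len(xs)) * sum(x * (w * x - y) for x, y in zip(xs, ys))

w = 0.0                  # an arbitrary initial slope
learning_rate = 0.00002  # how big each step is

for step in range(1, 21):
    gradient = cost_derivative(w, sizes, prices)  # backward propagation
    w = w - learning_rate * gradient              # the update rule
    if step % 5 == 0:
        print(f"step {step:2d}: w = {w:.4f}, cost = {cost(w, sizes, prices):.4f}")

# w converges toward 3, the slope of the best-fit line for this toy data.
```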

The following GIFs visualize the process of learning and the value of the cost function at each step until we reach the point of convergence:

Above, the value of the derivative at each step is negative, so the slope of our prediction line increases after each step.

But here, the value of the derivative at each step is positive, so the slope of our prediction line decreases after each step.

At the end:

The example we provided might be the simplest form of linear regression, because our houses have only the size feature; in real life, brokers have to deal with numerous features, which significantly complicates the learning process. Also, the linear function normally has a second parameter, the bias: its full form is f(x) = wx + b. However, we can follow a similar procedure to backpropagate it, as the sketch below shows.
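As a hint at that extension (a sketch with rescaled, made-up data generated from f(x) = 3x + 2): we differentiate the cost with respect to b as well and update both parameters on every step:

```python
# Rescaled toy data generated from f(x) = 3x + 2 (slope and bias are made up).
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [5.0, 8.0, 11.0, 14.0]

w, b = 0.0, 0.0
learning_rate = 0.05
n = len(sizes)

for _ in range(2000):
    # Partial derivatives of the cost with respect to w and to b.
    dw = (2.0 / n) * sum(x * (w * x + b - y) for x, y in zip(sizes, prices))
    db = (2.0 / n) * sum((w * x + b - y) for x, y in zip(sizes, prices))
    w = w - learning_rate * dw  # step both parameters together
    b = b - learning_rate * db

print(round(w, 3), round(b, 3))  # approaches w = 3 and b = 2
```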

I hope this article gave you a better understanding of ML; I will try to publish more illustrative, mathy articles regularly. If you want a deeper understanding of this subject, I recommend watching Andrew Ng’s ML course on Coursera. In fact, you can watch the first module for free on YouTube.

Stay tuned!!!
