Understanding basic concepts of Machine Learning (Gradient Descent)

Pau Orti
7 min read · Aug 27, 2020


Gradient Descent is one of the most commonly used algorithms in Machine Learning. In this article we will work through its complexity and try to make it understandable for everyone.

1. Introduction to Gradient Descent

Simply put, Gradient Descent is an algorithm that searches for the optimal parameters of a model, normally for prediction purposes. For example, let's imagine that we want to find a way to accurately predict the property prices of a residential area in the vicinity of Barcelona. We are given a scatter plot (shown below) that holds the values of price and square metres of properties recently sold in the area. The x-axis holds the total square metre values, and the y-axis their corresponding prices. Hence, each dot represents a property with a given price and size in square metres.
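
If you want to play along at home, a similar-looking toy dataset can be cooked up in a few lines of Python (the sizes and prices below are made-up numbers for illustration, not real Barcelona market data):

import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-in for the article's data: size in square metres vs. price in euros.
rng = np.random.default_rng(42)
sqm = rng.uniform(40, 200, size=50)                           # property sizes
price = 3000 * sqm + 50_000 + rng.normal(0, 40_000, size=50)  # prices with some noise

plt.scatter(sqm, price)
plt.xlabel("Square metres")
plt.ylabel("Price (€)")
plt.show()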

What can the Gradient Descent algorithm help us out with in this context?

What Gradient Descent is set to do is find the best-fitting line through the dots, also known as a linear regression model. The model will express the linear relationship between square metres and price, and will hopefully help us predict the market price of properties that come onto the market in the future.

To explain the Gradient Descent algorithm we will start by drawing a randomly positioned line in our graph. And we come up with this:

2. Seeking the best Error Measurement

Now we will have to give the algorithm some mathematical error measurement so it knows how badly it has performed, how far away it is from the optimal point, and what steps to take to get there. However, prior to doing that, we'll have to teach the algorithm something even more basic, that is, what a linear function looks like mathematically.
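
In the notation we will keep using for the rest of the article, that function is simply:

predicted price = theta_0 + theta_1 · square metres

or, written generically, y_pred = theta_0 + theta_1 · x, where theta_0 is the intercept and theta_1 is the slope.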

Now the machine knows that a linear function is made up of two parameters that it can tweak to get different linear functions.
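
In code, "drawing a randomly positioned line" from the previous section simply means picking arbitrary starting values for those two parameters (the numbers below are hard-coded rather than truly random, just for illustration):

theta_0 = 20_000.0   # starting intercept, in €
theta_1 = 500.0      # starting slope, in € per square metre

def predict(sqm, theta_0, theta_1):
    """Price the current line predicts for a property of `sqm` square metres."""
    return theta_0 + theta_1 * sqm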

Next, we have to think about the error measure we are going to train the algorithm with. Our optimal line should be the one that minimizes the distance between the line drawn and the given points in our graph, that is, the linear function that most diminishes the difference between the actual prices and the predicted ones across all given square metre levels. In mathematical language it would look like this:
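
minimize  Σ (y_i − f(x_i))   over all m properties i

where y_i is the actual price of property i and f(x_i) = theta_0 + theta_1 · x_i is the price our line predicts for it.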

Therefore, what the above mathematical expression says is: look for a function f(x) that minimizes the sum of the residuals (actual minus predicted values). To be more mathematically precise, we should have said that we are looking to minimize f(theta_0, theta_1); the thetas are the parameters that change our predicted y and, therefore, our final error measurement.

But hold on a second, the function above doesn't quite add up. Let's say we have a pair of predictions: the first one misses the actual price by +50,000 €, and the second one falls short by -50,000 €. If we go ahead and calculate the error by adding up the two prediction errors, we get 0, and the algorithm would wrongly regard it as a perfect model. To prevent this from happening, we are going to tweak our error measurement equation a bit and come up with a modified version that goes by the fancy name of the mean squared error (MSE). Here you have it:
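
MSE(theta_0, theta_1) = (1/m) · Σ (y_i − y_pred_i)²    where y_pred_i = theta_0 + theta_1 · x_i

(Some textbooks divide by 2m instead of m for convenience when taking derivatives, but the idea is exactly the same.)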

The m term you see in the equation stands for the number of instances (dots) that we are going to compute; dividing the sum by m is what turns it into a mean value.

So, to recap, we are telling the algorithm we want a linear function that minimizes the mean squared error, which is a measure that computes the difference between predicted and actual values, so we want that difference to be as small as it can possibly be given a linear model.
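
If the formula feels abstract, here is the same idea as a small Python function (a quick sketch using plain NumPy):

import numpy as np

def mse(theta_0, theta_1, x, y):
    """Mean squared error of the line theta_0 + theta_1 * x against the data (x, y)."""
    y_pred = theta_0 + theta_1 * x        # prices the line predicts
    return np.mean((y - y_pred) ** 2)     # average squared residual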

3. Teaching Gradient Descent

Finally, we have to teach the algorithm how to navigate the graph looking for the optimal regression line. We'll go about this task by first asking it to compute different values for both the intercept and the slope (the components of the linear function) and evaluate their mean squared errors. The graph below shows an example of what we're trying to come up with.

Ng, Ritchie. (2020). One Variable Linear Regression (Figure). https://www.ritchieng.com/one-variable-linear-regression/

Notice that what we have here is a three-dimensional graph, where the vertical axis represents the error term (mean squared error) for every pair of values in the linear model (the intercept, theta_0, and the slope, theta_1).

Therefore, we want to find a line that minimizes that error. Put differently, we are looking for a combination of parameters (intercept and slope) that brings the error term (the graph's vertical axis) down to the lowest possible point (the minimum). For those of you who know a bit about calculus, a bell must have rung in your head: what we need here is to differentiate the equation and solve for the pair of parameters that makes the derivative equal to 0. That would be the lowest point in our graph, and hence our two parameters for the best-fitting line. Gradient Descent gets us to exactly that point, only step by step instead of solving the equation in one go.
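
You can build a surface like the one in the figure yourself by evaluating the MSE over a grid of candidate intercepts and slopes. A minimal sketch (the grid ranges are arbitrary and would have to be adapted to your data):

import numpy as np

# Same made-up toy data as before.
rng = np.random.default_rng(42)
sqm = rng.uniform(40, 200, size=50)
price = 3000 * sqm + 50_000 + rng.normal(0, 40_000, size=50)

def mse(theta_0, theta_1, x, y):
    return np.mean((y - (theta_0 + theta_1 * x)) ** 2)

# Candidate intercepts and slopes (ranges picked arbitrarily).
theta_0_grid = np.linspace(-100_000, 200_000, 100)
theta_1_grid = np.linspace(0, 6_000, 100)

# One MSE value per (theta_0, theta_1) pair: this is the bowl-shaped surface.
errors = np.array([[mse(t0, t1, sqm, price) for t1 in theta_1_grid]
                   for t0 in theta_0_grid])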

4. Intuition

The graph below shows fundamentally the same thing as the 3-D graph plotted earlier, but looked at only from theta_0's perspective; in other words, we simplified the graph by omitting one of the axes (theta_1).

The red tangent in the graph represents the derivative at the point our machine chose to start from in its search for the best-fitting line. We want the algorithm to find the point where the derivative equals 0, which would be represented as a flat tangent on our error function.

What Gradient Descent does is, through a series of iterations (steps), go progressively down the error function, approaching its minimum point. When it finally reaches it, the algorithm understands it has converged to the optimal solution and stops. As you can, hopefully, appreciate in the graph below, the steps shorten as the minimum is approached; we'll get into why that is when we explain the mathematical logic behind Gradient Descent.

It is important to bear in mind that this way-to-the-minimum process is carried out for both theta_0 and theta_1 concurrently, meaning the next point in our graph is influenced by both the step in theta_0 and the step in theta_1. Even though the latter is not depicted above, you have to imagine a second graph where the algorithm is figuring out its way down to the minimum, but this time for theta_1.

5. A bit of maths…

Understand that we are being a bit sloppy, mathematically speaking, in an effort to do away with complex mathematical notation.

Let's go through the components of the formula below from right to left. First, we have the derivative of the mean squared error (the vertical axis in our 3D graph) with respect to each theta; these are simply the derivatives we have drawn as tangents in the graph above. Note that in our graph those derivatives are negative (negative slope), and they flatten out (become less negative) as they approach the minimum. The next term is the learning rate, which is simply the step size of our Gradient Descent. By multiplying those two terms and subtracting the result from the last theta values, we get the new thetas, with which we repeat the process.
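
Written out in plain text rather than strict mathematical notation, the update rule is:

theta_new = theta_old - learning_rate · (derivative of MSE with respect to theta)

applied to theta_0 and theta_1 at every iteration.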

Note that this makes sense because if, like in our case, the derivative at theta_0's starting point is negative, let's say -8, multiplying it by a learning rate of 0.1 gives us -0.8, which, once subtracted from the prior theta value, makes it increase (the two negatives turn into a positive). This results in an advance in theta, nudged towards the right-hand side of our graph, precisely where our minimum value lies (exactly what happened in our graph).

On our second iteration, the slope will have diminished, let's say to -5, which multiplied again by the learning rate of 0.1 equals -0.5, hence 0.5 to be added to our latest theta value, resulting in a smaller step than the prior one (0.8). The next step will be smaller still, and so on. This is why the steps become increasingly smaller every time Gradient Descent advances along the error function towards the optimum point.

6. Applying Gradient Descent to our data

Now that we have a grasp of the logic the algorithm follows, let's apply Gradient Descent to the data of property prices in Barcelona we started with and see how it does.
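
As a rough sketch of what that looks like in code (plain NumPy, reusing the made-up toy data from earlier; the learning rate and the number of iterations are arbitrary choices, and the square-metre values are standardised first so that a single learning rate works well for both parameters, a practical trick that doesn't change the idea):

import numpy as np

# Made-up stand-in for the article's data: size in square metres vs. price in euros.
rng = np.random.default_rng(42)
sqm = rng.uniform(40, 200, size=50)
price = 3000 * sqm + 50_000 + rng.normal(0, 40_000, size=50)

# Standardise the input so one learning rate suits both parameters.
x = (sqm - sqm.mean()) / sqm.std()
y = price

theta_0, theta_1 = 0.0, 0.0        # starting line
learning_rate = 0.1                # step size
m = len(x)

for step in range(500):
    y_pred = theta_0 + theta_1 * x            # predictions of the current line
    error = y_pred - y                        # residuals
    grad_0 = (2 / m) * error.sum()            # derivative of MSE w.r.t. theta_0
    grad_1 = (2 / m) * (error * x).sum()      # derivative of MSE w.r.t. theta_1
    theta_0 -= learning_rate * grad_0         # the update rule from section 5
    theta_1 -= learning_rate * grad_1

# Convert the fitted parameters back to the original square-metre scale.
slope = theta_1 / sqm.std()
intercept = theta_0 - slope * sqm.mean()
print(f"price ≈ {intercept:.0f} + {slope:.0f} · square metres")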

Voilà, exactly as expected! The algorithm started by drawing a random line in the plane (the bottom line) and, through a series of iterations in which the two parameters of the line (theta_0 and theta_1) are constantly tweaked, it finally arrives at the best-fitting line, which at the same time yields the minimum possible mean squared error (our error measurement), that is, the minimum point in the 3D graph.


Pau Orti

Data Science Master's Student. Finance and Accounting enthusiast, aspiring out-of-the-box thinker, world-traveler.