Introduction to Machine Learning Pt 2

written by Stephen Wilson

In the last post we introduced the Root Mean Squared Error (RMSE), a metric commonly used in linear regression that gives us an overall impression of how well the regression model approximates the training data.

We also introduced some of the guiding principles necessary for building a machine learning system, namely:

· Providing the system with some initial values for our weights β₀ and β₁ so that it can make some first predictions,

· a way for the system to know how well it is doing (this is where we use RMSE),

· and a way for the system to make adjustments to β₀ and β₁ so that the predictions improve with each iteration.

This post explains how we can implement this error-correcting mechanism. We’ll try to keep it as high-level as possible, but we assume you have at least some familiarity with the concepts of functions and derivatives (even if it is just what you remember from school).

Correcting the error and knowing when to stop

In the last post we simply set the values β₀ and β₁ at random. The starting values don’t matter much because, at the beginning, we only want the system to be able to make initial predictions. How good these predictions are at the beginning is not important (in fact, we accept that they will most likely be quite far off the “ground truth” values in the training data).

After making the initial predictions with our starting values we calculate the RMSE for the whole training data set and see how well or how badly the system is doing. The question is: how can we tell the system how it should adjust the values of β₀ and β₁ so that on the next training iteration the predictions (on the whole) improve and the RMSE gets smaller? It turns out that there is a simple but extremely smart mechanism that enables us to do that. Before we get to the intuition underpinning this mechanism, let’s revisit some concepts that you should remember from your high school maths class.

RMSE as a function

To properly understand how the error correction mechanism in a machine learning system works, you need to refresh your memory of the concept of a mathematical function. A function at its simplest is just an operation that maps some inputs to some output. For example, we could define a function that maps a numerical input to an output by adding 2 to it:

def addTwo(x):
    return x + 2

Given the input 3 this function would map it to the output 5.

We can consider the calculation of the RMSE to be a function that takes the predicted prices and the actual listing prices as input and calculates the RMSE based on these values, returning a single number. The aim of our listing price prediction model is to have predictions that are as close to ground truth values as possible, which means in turn that the RMSE will be as small as possible for our training set.
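To make this concrete, here is a minimal sketch of the RMSE written as a Python function. The name `rmse`, the argument names and the example prices are made up for illustration and do not come from the earlier post.

```python
import math

def rmse(predicted, actual):
    """Return the Root Mean Squared Error between two equal-length lists."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical listing prices, purely for illustration.
predicted_prices = [200_000, 310_000, 150_000]
actual_prices = [210_000, 300_000, 150_000]
error = rmse(predicted_prices, actual_prices)
print(round(error, 2))
```

Whatever the size of the training set, the function returns a single number, and that is the number we want to make as small as possible.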

If we selected a range of different values for β₀ and β₁ and calculated the RMSE for each and then plotted these RMSE values on a graph, we might see something similar to the figure shown below.

The values for β₀ and β₁ are plotted along the labelled axes as shown and the corresponding RMSE is calculated for the training data and plotted on the vertical axis. This produces a 3D bowl-shaped graph. The higher edges of the bowl indicate where the RMSE is large and the smallest possible value for the RMSE is at the lowest point of the bowl, towards the centre.
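The bowl can also be explored without a plot. The sketch below assumes the model predicts with a straight line, β₀ + β₁·x, and uses a tiny made-up data set. It calculates the RMSE for a grid of candidate weights and picks the lowest point on the grid. The names `rmse_for`, `candidates` and `surface` are illustrative only.

```python
import math

# Made-up toy data that lies exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

def rmse_for(b0, b1):
    """RMSE on the toy data, assuming the prediction is b0 + b1 * x."""
    squared_errors = [(b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Candidate values 0.0, 0.5, ..., 4.0 for both weights.
candidates = [i * 0.5 for i in range(9)]

# One RMSE value per (b0, b1) pair: this is the height of the bowl surface.
surface = {(b0, b1): rmse_for(b0, b1) for b0 in candidates for b1 in candidates}

# The lowest point of the bowl on this grid.
best_b0, best_b1 = min(surface, key=surface.get)
print(best_b0, best_b1, surface[(best_b0, best_b1)])
```

Trying every combination only works for a coarse grid over two weights, which is why we need a smarter way of finding the bottom of the bowl.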

So how does this help a machine learning system to learn? The goal of fitting a model to our training data is to do it in such a way as to minimise the RMSE. In other words, we need to find the bottom of the bowl in the graph. If we knew the shape of the bowl beforehand then we could pick the values of β₀ and β₁ at that position and we would be done. But of course, things are just a little more complicated (but only a little). Here is the intuition about how we can go about finding the minimum of our RMSE function.

Recall that we simply gave random values as the starting values for β₀ and β₁ (or set them to zero). When we calculate the RMSE for these starting values we can locate the corresponding point on the bowl-shaped graph. Even though we don’t know where the bottom of the bowl is, if we were able to stand at that point on the bowl surface (bear with me) we would be able to tell the direction in which the bowl surface is sloping as well as the steepness of the slope at that point.

Then if we were to take a small step in the downhill direction we would move a little bit closer to the bottom of the bowl. From this new point on the bowl surface, we can again tell the direction and the steepness of the surface slope and take another small step. We can repeat this process until the surface is flat. Then we know we are at the bottom and we have found the values of β₀ and β₁ which produce the smallest possible RMSE for our data and we can stop.

What is described above is an algorithm called gradient descent, so called because we calculate the gradient or slope of the error function and then “descend” the function surface by stepping in the direction opposite to the gradient (that is, downhill) until we reach the function minimum, where the slope is zero. To implement the gradient descent algorithm in our machine learning system, we need to use some more maths.

Recall that for any values of β₀ and β₁ we can find the corresponding value for RMSE located on the surface of the bowl-shaped graph above. If we were to draw a tangent to the surface at that point and measure its slope, we would have the gradient we need to tell us which direction we need to move in to get to the minimum.

Conveniently, we can compute the gradient of a function by taking its derivative (with two weights, one derivative for β₀ and one for β₁). You might remember derivatives and differential calculus from school. If your recollection is hazy, don’t worry. Taking the derivative of a function at a certain point tells you if the function is increasing or decreasing at that point. This is exactly what we want. If the derivative at the point is positive, it means the function is increasing. If the derivative is negative, it means the function is decreasing. And if the derivative is zero, the function is flat at that point, which for a bowl-shaped function like ours means we have found the function minimum.
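The sign of the derivative is easy to check numerically. The sketch below uses a made-up one-dimensional bowl, f(b) = (b − 3)² + 1, rather than the RMSE itself, and approximates the derivative with a central difference. The names `f` and `derivative` are illustrative only.

```python
def f(b):
    """A simple bowl-shaped function with its minimum at b = 3."""
    return (b - 3.0) ** 2 + 1.0

def derivative(func, point, h=1e-6):
    """Approximate the slope of func at point with a central difference."""
    return (func(point + h) - func(point - h)) / (2 * h)

left_of_minimum = derivative(f, 1.0)   # negative: f is decreasing here
right_of_minimum = derivative(f, 5.0)  # positive: f is increasing here
at_minimum = derivative(f, 3.0)        # zero: the bottom of the bowl

print(left_of_minimum, right_of_minimum, at_minimum)
```

To the left of the minimum the slope is negative, so we should increase b. To the right it is positive, so we should decrease b. In both cases we move against the sign of the derivative.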

What we need to do is to calculate the derivative with respect to each weight at a given point (i.e. for some values of β₀ and β₁) and then take a small step in the opposite direction. We do this by subtracting a small multiple of the derivative from the current values of β₀ and β₁ in order to calculate their new values for the next iteration. The multiple is the step size, often called the learning rate. If the derivative with respect to a weight is negative, the RMSE decreases as that weight grows, so subtracting a negative number increases the weight and moves us downhill. If the derivative is positive (meaning the RMSE increases as the weight grows, which would take us away from the bottom of the bowl), the subtraction decreases the weight instead, because if the function is increasing we want to move in the opposite direction.
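Written as a formula, the standard gradient descent update subtracts a small multiple of each derivative from the corresponding weight. The symbol α for the step size (often called the learning rate) and the “new” superscript are notation introduced here for illustration.

```latex
\begin{aligned}
\beta_0^{\text{new}} &= \beta_0 - \alpha \, \frac{\partial\, \mathrm{RMSE}}{\partial \beta_0} \\
\beta_1^{\text{new}} &= \beta_1 - \alpha \, \frac{\partial\, \mathrm{RMSE}}{\partial \beta_1}
\end{aligned}
```

Because of the minus sign, a negative derivative increases the weight and a positive derivative decreases it, so both cases move towards the bottom of the bowl.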

With our new values for β₀ and β₁ we make new predictions, calculate our new RMSE and calculate the new gradient at these values and again take another small step towards the function minimum at the bottom of the bowl. This iterative process repeats until the gradient is zero or close enough to zero to be acceptable. Then we have the values of β₀ and β₁ which best fit our training data, as they produce the smallest RMSE.
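The whole loop can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation. It assumes the model predicts with β₀ + β₁·x, uses made-up toy data, and picks a step size, tolerance and iteration limit by hand. All function and parameter names are hypothetical.

```python
import math

def rmse(b0, b1, xs, ys):
    """RMSE of the line b0 + b1 * x on the data."""
    squared_errors = [(b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / len(xs))

def rmse_gradient(b0, b1, xs, ys):
    """Derivatives of the RMSE with respect to b0 and b1.

    Assumes the data is noisy enough that the RMSE is never exactly zero,
    otherwise the division below would fail.
    """
    n = len(xs)
    error = rmse(b0, b1, xs, ys)
    residuals = [b0 + b1 * x - y for x, y in zip(xs, ys)]
    d_b0 = sum(residuals) / (n * error)
    d_b1 = sum(r * x for r, x in zip(residuals, xs)) / (n * error)
    return d_b0, d_b1

def gradient_descent(xs, ys, b0=0.0, b1=0.0, step_size=0.01,
                     tolerance=1e-6, max_iterations=20000):
    """Step downhill until the gradient is close enough to zero."""
    for _ in range(max_iterations):
        d_b0, d_b1 = rmse_gradient(b0, b1, xs, ys)
        if math.hypot(d_b0, d_b1) < tolerance:
            break  # the surface is (almost) flat: we are at the bottom
        # Move against the gradient, i.e. downhill.
        b0 -= step_size * d_b0
        b1 -= step_size * d_b1
    return b0, b1

# Made-up, slightly noisy toy data roughly following y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

fitted_b0, fitted_b1 = gradient_descent(xs, ys)
print(fitted_b0, fitted_b1, rmse(fitted_b0, fitted_b1, xs, ys))
```

The step size matters. If it is too large the steps can overshoot the bottom of the bowl, and if it is too small the loop needs many more iterations to get there.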

The figure below shows a plan view of the bowl-shaped graph above with the steps taken at each iteration marked in red. You can clearly see how we are moving towards the function minimum in the centre with each step taken.

I am one of the Data Scientists in Residence and work mainly for Scout24’s real estate platform ImmobilienScout24. I have a PhD in Computer Science and a background in computational linguistics and speech processing. The Data Science Team is hiring! If you have a passion for machine learning and data science (and like to inspire that same passion in others) then come and join our team. Open positions can be found here.