Introduction to Machine Learning Pt 3
written by Stephen Wilson
Quick Refresher — what is classification
In earlier posts we introduced the concept of supervised learning. Supervised learning is when we want a machine learning system to learn a mapping from a set of input features to some output variable. We also touched on the two broad areas of supervised learning: regression, where the inputs map to a continuous output variable, and classification, where the inputs map to a fixed number of discrete values or classes. This post will cover the basics of one type of classification algorithm and give you the intuition behind some of the principles involved.
Simple Use Case
At Scout24 one of our goals is to provide the most relevant services and products to our consumers, ones that are personalised according to their needs and preferences. For instance, we might like to be able to look at the interactions a consumer has with the ImmobilienScout24 platform and determine from that whether they are a homeowner or not. This allows us to show them products and services that might be especially helpful or interesting to homeowners but not to other groups (such as people who are looking to rent an apartment). This is a classification task. We have a set of inputs (how someone interacts on ImmobilienScout24, the pages they visited, which sections they clicked, which ones they didn’t etc.) and an output variable that can only take on a number of fixed values (in this case two): homeowner and non-homeowner.
Let’s take a look at an example dataset. As in the linear regression example before, our dataset consists of rows and columns. Here, each row corresponds to the history of how an individual interacts with ImmobilienScout24 and the columns correspond to the input variables we would like to map to the output classes. To keep things simple, we’ll just use a single input which corresponds to the aggregated number of events that a user triggers on a single page on ImmobilienScout24. We add an additional column to represent whether someone is a homeowner, using a 1 to indicate that they are and a 0 if they are not. When preparing the dataset, we make sure that we pick people who we know have signed contracts with ImmobilienScout24 to rent out their homes, and we label these with a 1 in the homeowner column. We then randomly sample non-homeowners from the set of people who have not signed contracts with ImmobilienScout24. The figure on the left below shows a small constructed dataset for this problem.
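As a concrete sketch, a tiny version of such a dataset might look like the following in code. The event counts and labels here are made up for illustration and are not the actual figures from the post’s dataset:

```python
# A hypothetical toy dataset: each row is one user, with the aggregated
# number of events they triggered and whether they are a homeowner (1) or not (0).
dataset = [
    {"events": 4,  "homeowner": 0},
    {"events": 2,  "homeowner": 0},
    {"events": 13, "homeowner": 1},
    {"events": 9,  "homeowner": 1},
]

# Split into input features (X) and output labels (y)
X = [row["events"] for row in dataset]
y = [row["homeowner"] for row in dataset]
```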
When we plot the data we immediately notice that because we have binary values for homeowner and non-homeowner the data-points are all gathered at only two places on the y-axis: 0 and 1.
Let’s look again at the graph above. As we have already noted, all of the output values are either 0 or 1. We already know that the goal of the machine learning system is to learn how to approximate the output variables, given some input.
So, instead of the system predicting either 0 or 1, might it be enough if it predicted some value between 0 and 1, and then we applied a rule that mapped predictions greater than or equal to 0.5 to the class labelled 1 and all predictions below 0.5 to the class labelled 0?
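That mapping rule is a one-liner in code. A minimal sketch (the 0.5 cut-off is the one used in this post, but it is a tunable choice):

```python
def to_class(probability, threshold=0.5):
    # Map a predicted probability in (0, 1) to a hard class label
    return 1 if probability >= threshold else 0

print(to_class(0.7))  # 1
print(to_class(0.3))  # 0
```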
Well, yes, it would be. One technique we can apply here is called logistic regression.
We have already learned that regression is when we want the machine learning system to learn how to predict a continuously valued output variable. With logistic regression we want the system to predict a continuous value between 0 and 1. What this value really represents is the probability of belonging to the positive class (in our example, this is the homeowner class). Logistic regression becomes a classification algorithm if we apply a threshold to the output and label everything above the threshold as “homeowner” and everything below it as “non-homeowner”.
Linear combination of inputs
In logistic regression we begin with a linear combination of our input features, β₀ + β₁X. However, we also know that, due to the nature of the classification task, our output cannot be less than 0 and it cannot exceed 1. Therefore, we need a way to constrain the model so that the output always falls within the range (0,1). In the previous post we introduced the concept of a mathematical function, which is an operation that maps some input to some output. For logistic regression we will use a special type of function to constrain the output of our model. This function will accept the linear combination of features as input and transform it so that it always maps to an output in the range (0,1).
Take a look at the curve shown in the figure below. It shows a plot of a particular type of function called a logistic function. It’s often called a sigmoid function too. The name comes from the Greek letter sigma (σ) which broadly corresponds to the English letter S (and you can see that the curve is S-shaped).
This curve has some special properties. Probably the most important for our purposes is that it is bounded, which means that as values on the x-axis increase or decrease, the corresponding values on the y-axis approach 1 and 0 respectively but don’t exceed them, even for very large and very small values of x. This is exactly what we want: a function that can take input values in any range and produce output that is “squeezed” into the range (0,1).
The function can be represented by the formula σ(x) = 1 / (1 + e^(−x)).
The e in this formula is a mathematical constant with an approximate value of 2.71828. You might remember this number from school as the base of the natural logarithm. If you don’t remember it, that’s ok. The important takeaway is that there exists a function that allows us to map any value x into the range (0,1).
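In code, the logistic function is a one-liner. A minimal sketch using only Python’s standard library:

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-10))  # very close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # very close to 1
```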
So how do we use this function in our logistic regression? The formula σ(x) = 1 / (1 + e^(−x)) takes a value x and squeezes it into our desired output range. So, we can just take the linear combination of our inputs and plug it into the formula where we see x, giving σ(β₀ + β₁X) = 1 / (1 + e^(−(β₀ + β₁X))).
Ok, so we are gradually assembling the different parts that we need to apply the logistic regression algorithm. We described in previous posts that, in general, for machine learning we need:
· a starting point for the system so that it can begin to make predictions,
· a way to measure the error between the predictions when compared against the “ground truth” values,
· a mechanism to correct those errors so that the system improves with each iteration,
· a way to know when to stop.
How do we apply these steps to logistic regression? The first step is easy. Like linear regression, we can simply initialise the values of β₀ and β₁ at random to begin with.
Making a first prediction is also easy. We take the logistic function and simply plug in our values for β₀ and β₁ and X (our input value, in this case the number of events triggered by a user on a particular page). So far so good. But what is missing for logistic regression is a way for the system to know how good its predictions are as well as how to make small changes to the values of β₀ and β₁ so that the predictions get better with each iteration. For that, we need an error or loss function.
The loss function for logistic regression is −Σᵢ₌₁ⁿ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)], where yᵢ is the ground truth label for data-point i and pᵢ is the predicted probability. It looks a little intimidating, but we will explain what it means and hopefully give the intuition behind it and why it is useful.
The first part, the summation Σᵢ₌₁ⁿ, just means that we apply the function to each data-point in the training set (start at i = 1 and keep going till we reach the final data-point i = n) and sum everything up, so that we get a single value representing how well the system’s predictions match up against the ground truth labels in the training data. Note that there is a minus sign before the Σ, which means that after summing up the outputs we will multiply the sum by −1.
Before we get to the rest of the function, let’s take a look again at the toy dataset we introduced above and look at the first two rows, which show an example of the negative class (homeowner == 0) and the positive class (homeowner == 1).
Let’s also imagine that we have initialised β₀ and β₁ to have starting values of 0.5 each. In order to make the initial prediction for these two data-points we simply plug them into the linear combination of inputs formula β₀ + β₁X to get the outputs shown in the table below.
Next, we need to take these outputs and use the sigmoid function to “squeeze” them into the desired range of (0,1).
We see that with these arbitrary initial values for β₀ and β₁ the system predicts with very high probability that both these users are homeowners. However, we know from the ground truth labels that only the user with predicted probability of 0.999 is a homeowner and the other with predicted probability of 0.924 is not.
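We can reproduce this calculation in code. The input event counts below (4 and 13) are hypothetical values chosen because, with β₀ = β₁ = 0.5, they yield the probabilities quoted above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = 0.5, 0.5  # arbitrary initial values, as in the text

# (events, ground truth label); the event counts are hypothetical
for x, label in [(4, 0), (13, 1)]:
    z = b0 + b1 * x    # linear combination of inputs
    p = sigmoid(z)     # squeeze into (0, 1)
    print(f"x={x} label={label} z={z} p={round(p, 3)}")
```

With these inputs the two predicted probabilities come out to 0.924 and 0.999.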
Ok, let’s return now to the loss function. Here it is again: −Σᵢ₌₁ⁿ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]. The yᵢ value in the function represents the ground truth value from the training set and will always be either 0 or 1, and the predicted probability pᵢ is what we have just calculated for both of our example data-points using the combined inputs followed by the sigmoid function.
Let’s take both example data-points in turn and plug the corresponding values into the loss function to gain some intuition about how it can help us. We’ll leave the summation out for now and just consider each data-point individually. First the non-homeowner who has a ground truth label of 0 and a predicted probability of 0.924:
We see something really useful. Because the ground truth label is 0 the expression to the left-hand side of the + sign evaluates to zero because anything that is multiplied by zero is also zero.
The right-hand side of the + sign is all we need to evaluate, and that simply becomes log(1 − 0.924), which is the same as calculating log(0.076), and this works out to be approximately −2.58 (using the natural logarithm).
When we consider the second data-point in our example, we see that we only need to evaluate the expression on the left-hand side of the equation, because the ground truth value is 1 and the expression on the right-hand side is multiplied by 1 minus the ground truth, which in this instance evaluates to zero (1 − 1 = 0), and anything multiplied by zero is zero, so this side simply falls away. The left-hand side then becomes 1 · log(0.999), which is approximately −0.001.
In practice, the machine learning system does this for every data-point in the training set and sums all the results together. There is one final thing to note. All of the predicted probabilities necessarily lie within the range (0,1), as we have already noted. The logarithm of any value between 0 and 1 is always a negative number and consequently we will have a negative value after we perform the summation.
So, as a final step we multiply the final result by −1 to make it positive; this is what the minus sign before the Σ in the formula does. We make sure that we have a positive number because it makes it easier to compare our logistic regression classifier against other algorithms. If you recall from the post on linear regression, the machine learning system uses an algorithm called gradient descent to iteratively improve its predictions. It does this by trying to find the minimum of the error or loss function. By making sure that the loss function for logistic regression always gives a positive value, we ensure that we are also trying to find its minimum.
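Putting the pieces together, the full loss over a set of data-points can be sketched as follows (using natural logarithms):

```python
import math

def log_loss(y_true, y_pred):
    """Negative sum of y*log(p) + (1 - y)*log(1 - p) over all data-points."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total  # the leading minus sign makes the result positive

# The two example data-points: labels 0 and 1, predicted 0.924 and 0.999
print(round(log_loss([0, 1], [0.924, 0.999]), 3))
```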
The intuition behind why this loss function is useful is as follows. For positive cases in the training data, if the machine learning system does well and predicts the correct class with a very high probability, then the corresponding negative log of the probability will already be very close to zero (the logarithm of 1 is zero, so the closer the probability is to 1, the closer to zero we will be).
For negative cases, the machine learning system does well when the predicted probability is very low. For negative cases we compute the negative log of (1– predicted probability). When the predicted probability is very low this means that (1– predicted probability) will be close to 1 and the corresponding negative log will be close to zero.
Minimising the loss function with respect to our input parameters β₀ and β₁ across all data-points will produce the best possible predictions for the training data.
As touched on above, we also use the gradient descent algorithm to find the minimum of the logistic regression loss function. Recall that gradient descent works by applying the loss function to all data-points in the training data for some values of β₀ and β₁, then taking partial derivatives of the function in order to determine the gradient at those values (the gradient tells us which way is uphill, so its negative points in the direction of travel towards the function minimum).
Finally, we make small adjustments to β₀ and β₁ in the downhill direction, against the gradient, and repeat the whole process until we find the function minimum.
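A minimal gradient descent loop for this model might look like the sketch below. The gradient expressions are the standard partial derivatives of the logistic regression loss; the data, learning rate, and epoch count are illustrative choices, not values from the post:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=5000):
    b0, b1 = 0.0, 0.0  # could also be initialised at random
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of the loss with respect to b0 and b1
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / n
        g1 = sum(e * x for e, x in zip(errors, xs)) / n
        # Step downhill: adjust against the gradient
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Illustrative data: low event counts labelled 0, high counts labelled 1
b0, b1 = train([1, 2, 3, 8, 9, 10], [0, 0, 0, 1, 1, 1])
```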
As we already know, the predicted probability in the loss function for logistic regression is calculated using the sigmoid function: pᵢ = σ(β₀ + β₁Xᵢ) = 1 / (1 + e^(−(β₀ + β₁Xᵢ))).
So in order to calculate the gradient of the loss function we will have to calculate the derivative of the sigmoid function.
A really nice property of the sigmoid function is that it has a very easily computed derivative: σ′(x) = σ(x)(1 − σ(x)). In other words, to get the derivative you just multiply the predicted probability by 1 minus the predicted probability.
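This property is easy to check numerically; a quick sketch comparing the closed-form derivative against a finite-difference approximation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    p = sigmoid(z)
    return p * (1 - p)  # the predicted probability times one minus it

# Sanity check against a central finite difference
h = 1e-6
for z in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    print(round(sigmoid_derivative(z), 6), round(numeric, 6))
```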
Making a prediction
The figure below shows how the model can be used to make predictions on unseen data. We simply supply some input features to the model (the features are the same as those we used during training), the model outputs a probability, we apply a threshold and if the probability is above that threshold, we assign the positive class label.
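Tying the pipeline together, prediction on unseen data is just the sigmoid followed by the threshold. The coefficient values below are placeholders for illustration; in practice they come out of the training step:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, b0, b1, threshold=0.5):
    """Return (probability, class label) for a single unseen input."""
    p = sigmoid(b0 + b1 * x)
    label = 1 if p >= threshold else 0
    return p, label

# Placeholder coefficients for illustration
p, label = predict(13, b0=0.5, b1=0.5)
print(round(p, 3), label)
```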
I am one of the Data Scientists in Residence and work mainly for Scout24’s real estate platform ImmobilienScout24. I have a PhD in Computer Science and a background in computational linguistics and speech processing. The Data Science Team is hiring! If you have a passion for machine learning and data science (and like to inspire that same passion in others) then come and join our team. Open positions can be found here.