Ever wondered how your credit card company identifies suspicious transactions on your card and alerts you? Or how Gmail automatically classifies emails as spam or not spam?
Problems like these are called classification problems: they demand separating your data (credit card transactions or emails) into two different categories, or classes. (More than two classes are also possible with an extension of logistic regression.) If you are a doctor who wants to diagnose a possibly cancerous tumor by looking at the tissue image, or a loan officer who wants to know whether the next customer is likely to default, logistic regression can come to your rescue.
Unlike linear regression, where we estimate the trend of continuous data using a linear approximation, logistic regression gives you Yes/No answers. (I would highly encourage you to read the previous article in this series about linear regression. Here’s the link: Linear regression in layman terms.)
Let’s look at a few examples to see what logistic regression really tries to do.
Suppose you are a scientist working at a cancer research hospital. You want to look at the shapes and sizes of different tumors and predict whether each one is malignant or not. You’ve been given some samples marked with their true values, which when plotted look like this.
Based on this data you have to learn a boundary which separates the data into two distinct regions. Now when you see a new tumor, your boundary will tell you whether it lies in the malignant region or the non-malignant region.
There are a few things to note. First, algorithms of this type are known as supervised learning algorithms, because they learn from examples whose true labels are already known. Second, you need to find a boundary, called a decision boundary, given those true labels. The central premise of logistic regression is the assumption that your data set can be separated into two nice regions, one for each class, by a linear boundary (there are ways to make these boundaries non-linear, but we will not go into them in this article). Because the boundary is linear, it is also called a linear discriminant. What is a linear boundary? In two dimensions it is a straight line, in three dimensions it is a plane, and so on, with no curving. For the data above, the linear boundary looks like this:
But what does this linear boundary really mean?
Can we say that all the points to the right of (or above) this line are surely malignant and all the points to the left (or below) are surely non-malignant? No. The points farthest from the line on the right are the most likely to be malignant, i.e. you can say with higher confidence that those tumors are malignant. The closer a point is to the line, the lower your confidence, and when a point is on the line you are not really sure: there is a 50/50 chance that the tumor is malignant.
Let’s look at the mathematical aspect of this. Geometrically, this line divides the 2-D space into two distinct regions. Let us represent the tumor size by x and the red color intensity (or the shape) by y. The equation corresponding to this line will look like

$$E(x, y) = \beta_1 x + \beta_2 y - \gamma = 0$$
It is important to note that x and y are both input variables; unlike in linear regression, we are not predicting the value of y. Consider a new data point (a,b) and plug its values into the left-hand side of the equation above. The left-hand side equals 0 only when the point lies on the line; for other values of x and y it can take any non-zero value.
Hold this expression in your head, assume that beta1 = 1, beta2 = 1 and gamma = 15, and we will see how this helps us make predictions. There are 3 possible outcomes:
1. The point (a=12, b=12 in the above figure) lies to the right of this line (green cross). In this case the value of the expression is positive, lying somewhere between 0 and infinity for different values of (a,b).
2. The point (a=6, b=6 in the plot) lies to the left of this line. In this case the value of the expression is negative, lying somewhere between -infinity and 0 for different values of (a,b).
3. The point (a=8, b=7 in the plot) lies on the line, in which case the value becomes zero.
In the first two cases, the farther the point is from the line, the higher the probability that it belongs to that class; in other words, the more certain we are that it belongs to that class. The third case, where the point lies on the line, is the one where there is a 50/50 chance of the tumor being malignant.
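To make the three outcomes concrete, here is a minimal Python sketch that plugs the three example points into the expression. It assumes the expression has the form beta1*x + beta2*y - gamma; the minus sign in front of gamma is inferred from the fact that the point (8, 7) has to give exactly zero when beta1 = beta2 = 1 and gamma = 15. The function names are made up for illustration.

```python
def boundary_expression(x, y, beta1=1.0, beta2=1.0, gamma=15.0):
    """Value of beta1*x + beta2*y - gamma (assumed form); zero exactly on the line."""
    return beta1 * x + beta2 * y - gamma


def side_of_boundary(x, y):
    """Name the region a point falls in, based on the sign of the expression."""
    value = boundary_expression(x, y)
    if value > 0:
        return "malignant side"
    if value < 0:
        return "non-malignant side"
    return "on the boundary"


for point in [(12, 12), (6, 6), (8, 7)]:
    print(point, boundary_expression(*point), side_of_boundary(*point))
# (12, 12) 9.0 malignant side
# (6, 6) -3.0 non-malignant side
# (8, 7) 0.0 on the boundary
```

The sign tells you the side of the line, and the size of the number tells you how far from the line the point is.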
OK, so now we have a mathematical function which gives a value anywhere between negative infinity and infinity as the point (x, y) moves over the whole 2-dimensional space. But what we want to say is how likely it is that this tumor is malignant. For that we need a number between 0 and 1 indicating the chance of the tumor being malignant. So 0.2 means there is a 20% chance of the tumor being malignant, 0.8 means there is an 80% chance, and so on. The number indicates the confidence of the prediction.
How do we map this value to a probability which lies in the range (0,1)? We use a mathematical function known as the sigmoid function or logistic function (hence the name logistic regression).
The sigmoid function looks like the curve below. It squishes the values output by our expression E(x,y) (X axis) into the range (0,1) (Y axis).
The mathematical form of this function is

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

We also call the final value the hypothesis H:

$$H = \sigma(E(x, y)) = \frac{1}{1 + e^{-E(x, y)}}$$
But don’t worry about the mathematical expression for now; just understand that this function transforms the values of our expression into the range (0,1).
OK, so that is how you predict the class of a new data point when you already have a boundary which divides your data set into two distinct regions. But how does our algorithm learn this boundary? After all, this is a post about a machine learning algorithm, so where is the learning part?
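If you would rather see this in code than in math, here is a small Python sketch of the squishing step. It uses the standard sigmoid, 1 / (1 + e^(-z)), and again assumes the expression is beta1*x + beta2*y - gamma with beta1 = beta2 = 1 and gamma = 15. The function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the range (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow in exp() for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(x, y, beta1=1.0, beta2=1.0, gamma=15.0):
    """Estimated chance of 'malignant' for a point (x, y), assumed expression form."""
    return sigmoid(beta1 * x + beta2 * y - gamma)


for point in [(12, 12), (6, 6), (8, 7)]:
    print(point, round(hypothesis(*point), 4))
# (12, 12) 0.9999
# (6, 6) 0.0474
# (8, 7) 0.5
```

The point far to the right of the line gets a value very close to 1, the point to the left gets a value close to 0, and the point on the line gets exactly 0.5, the 50/50 case.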
The cost function. Math alert!
Similar to linear regression, logistic regression also has a cost function (as do most machine learning algorithms), and our task is to minimize that function. The general idea of a cost function is to generate a high number if the prediction made by our hypothesis function is too far from the true value. Bear with me for a while. Let’s look at the math, and we will try to build an intuition as we go along.
Remember what our hypothesis function looked like. Let the parameters (beta and gamma) start at some random initial values or guesses. Keep in mind that this function gives values lying in the range (0,1).
We want our cost to be high when the true value is 1 and this function predicts a value far from 1, and likewise when the true value is 0 and it predicts a value far from 0. Similarly, we want our cost to be low when the prediction is close to the true value.
We define our cost function in two parts. Note that Y (in capital) here is the true value of that data point. Why we choose this specific cost function (and not a linear function or any other function for that matter) is again a topic for another post.

$$\text{Cost}(H, Y) = \begin{cases} -\log(H) & \text{if } Y = 1 \\ -\log(1 - H) & \text{if } Y = 0 \end{cases}$$
Stare at the graphs for a while. These are the cost functions when the true value is 1 (left) and when it is 0 (right).
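The two-part cost can also be tried out in a few lines of Python. This sketch assumes the usual logistic regression cost (log loss): -log(H) when the true value is 1 and -log(1 - H) when it is 0. That choice matches the behavior described here, high cost for confident wrong answers and near-zero cost for confident right ones, but treat the exact formula as an assumption. The function name is made up.

```python
import math


def cost(h, y_true):
    """Cost of one prediction h (strictly between 0 and 1) for a true label of 0 or 1."""
    if y_true == 1:
        return -math.log(h)
    return -math.log(1.0 - h)


for y_true in (1, 0):
    for h in (0.1, 0.9):
        print(f"Y={y_true}, H={h}: cost={cost(h, y_true):.3f}")
# Y=1, H=0.1: cost=2.303
# Y=1, H=0.9: cost=0.105
# Y=0, H=0.1: cost=0.105
# Y=0, H=0.9: cost=2.303
```

Notice the symmetry: a prediction of 0.1 is expensive when the truth is 1 and cheap when the truth is 0, and the reverse holds for a prediction of 0.9.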
Imagine you have a data point whose true value is 1 and your hypothesis function gives you a value of 0.1 (red dot in the left graph). The cost for such a point is high. Similarly, if for a true value of 1 your hypothesis outputs 0.9, the cost goes down to near 0 (black cross in the left graph). The same mechanism works for a true value of 0, as shown in the right graph, where the red dot corresponds to a hypothesis value of 0.1 and the black cross to a hypothesis value of 0.9.
That’s it. We now have to minimize the total cost over all our data points by adjusting the parameters beta and gamma. The minimization is done by our old friend gradient descent. That is all there is to logistic regression. Let us see another example with logistic regression in action.
Suppose you want to predict whether you will be admitted to a particular university based on the scores you obtained in two different examinations. You have data for various students from previous years.
Here is how the linear discriminant looks after the first few iterations of the optimizer.
After 10 iterations our line is getting there!
After a few more iterations. Voila!
You can see that the learned boundary separates the data into two regions. The separation is not perfect, so some points will be classified incorrectly and the classification accuracy will be below 100%. But now you can predict your admission based on the scores you’ve got in the two examinations.
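If you want to watch the learning happen yourself, here is a rough, self-contained Python sketch of gradient descent on a tiny data set. Everything specific in it is an assumption made for illustration: the ten student records are invented (they are not the data behind the plots), the scores are divided by 100 so the steps behave well, the cost is the standard log loss, and the learning rate and number of iterations are arbitrary picks. It relies on the fact that for log loss the gradient for each parameter is the prediction error (H - Y) times that parameter's input.

```python
import math

# Invented (exam1, exam2, admitted) records, for illustration only.
STUDENTS = [
    (30, 40, 0), (45, 35, 0), (50, 50, 0), (35, 60, 0), (60, 30, 0),
    (70, 75, 1), (80, 60, 1), (65, 85, 1), (90, 70, 1), (75, 90, 1),
]


def sigmoid(z):
    """Logistic function, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(params, x, y):
    """Hypothesis H for one student, with expression beta1*x + beta2*y - gamma."""
    beta1, beta2, gamma = params
    return sigmoid(beta1 * x + beta2 * y - gamma)


def mean_cost(params, data):
    """Average log loss over all data points."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for x, y, label in data:
        h = min(max(predict(params, x, y), eps), 1.0 - eps)
        total += -math.log(h) if label == 1 else -math.log(1.0 - h)
    return total / len(data)


def gradient_step(params, data, learning_rate):
    """One gradient descent update of (beta1, beta2, gamma)."""
    beta1, beta2, gamma = params
    g1 = g2 = gg = 0.0
    for x, y, label in data:
        error = predict(params, x, y) - label  # H - Y
        g1 += error * x
        g2 += error * y
        gg += -error  # gamma enters the expression with a minus sign
    n = len(data)
    return (beta1 - learning_rate * g1 / n,
            beta2 - learning_rate * g2 / n,
            gamma - learning_rate * gg / n)


# Scale the scores to the range 0..1 before training.
data = [(e1 / 100.0, e2 / 100.0, label) for e1, e2, label in STUDENTS]

params = (0.0, 0.0, 0.0)  # initial guess
initial_cost = mean_cost(params, data)
for _ in range(5000):
    params = gradient_step(params, data, learning_rate=1.0)
final_cost = mean_cost(params, data)

correct = sum((predict(params, x, y) >= 0.5) == (label == 1) for x, y, label in data)
accuracy = correct / len(data)
print(f"cost: {initial_cost:.3f} -> {final_cost:.3f}, training accuracy: {accuracy:.0%}")
```

Each pass nudges the line a little in the direction that lowers the cost, which is the same process shown in the sequence of plots above.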
Logistic regression is one of the most basic classification algorithms in statistics. It has its limitations and caveats. For instance, it does not fare well if the input variables (Exam 1 scores and Exam 2 scores) are strongly correlated with each other, because the estimated parameters then become unstable. Nevertheless, it is definitely a good foray into the land of classification algorithms. I hope this post has sparked some curiosity in you to go and learn more about logistic regression. There are other techniques which are used to make the boundary non-linear or to use logistic regression for multiple classes. There is a lot to learn!
X8 aims to organize and build a community for AI that not only is open source but also looks at the ethical and political aspects of it. More such experiment driven simplified AI concepts will follow.
Thanks for Reading!