Understanding Logistic Regression
This article consists of three parts:
1- The concept of classification in machine learning
2- The concept & explanation of Logistic Regression
3- A practical example of Logistic Regression on Titanic Data-Set
There are many classification techniques, or classifiers, but the most common and widely used are the following:
- Logistic Regression
- Linear Discriminant Analysis
- K-nearest neighbors
Unlike Linear Regression, where we predict a quantitative response on the basis of X, classification requires us to predict a qualitative response.
So what really is a Qualitative Response?
In simple words, it is the categorization of our observation: we assign the output to one of a set of categories (say A, B or C). In other words, we assign a class to our observation.
When we predict a qualitative response for an observation, it is often referred to as classification, because it involves assigning the observation to a class or category.
The following are scenarios where we can make use of Logistic Regression:
- A person has a set of symptoms that may point to one of three specified medical conditions. Which of the three conditions does the person have?
- A bank must assess a transaction made by an individual to find out whether it is likely to be fraudulent. A number of features come into consideration, for example past transactions, IP address, etc.
- An incoming email has to be placed in either the spam folder or the inbox, so it has to be classified, based on its content, as Spam or Legit (Ham).
- Given the passenger details from the Titanic data-set, we have to predict the survival of each passenger as 1 or 0.
Logistic Regression can be seen as a modification of Linear Regression, in which the algorithm makes sure that the output is a probability between 0 and 1 rather than an unbounded quantitative value of Y. That probability is then used to classify the observation as 1 or 0. When plotted on a graph, the function looks like an S-shaped curve.
The function being used here is the Sigmoid Function, also called the Logistic Function.
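For reference, the sigmoid (logistic) function is conventionally written as below. The symbol S(x) follows the notation used later in this article; the formula is the standard textbook form rather than a reproduction of the original figure.

```latex
S(x) = \frac{1}{1 + e^{-x}}
```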
The figure shows that the Linear Regression line goes above 1 and below 0, while Logistic Regression limits the response to the range between 0 and 1, so that the resulting prediction can be mapped to either 1 or 0.
The e in the function is Euler's number, which has a value of approximately 2.71828. The mathematics behind e is out of scope here, but there are plenty of resources available if you want to dig deeper.
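To make the S-shaped behaviour concrete, here is a minimal Python sketch of the sigmoid function using only the standard library. The function name `sigmoid` and the sample inputs are chosen for illustration; they do not come from the article's repository.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into the range 0 to 1."""
    # Branching on the sign avoids overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# A few sample inputs, chosen only for illustration.
for z in (-6, -2, 0, 2, 6):
    print(f"sigmoid({z:>2}) = {sigmoid(z):.4f}")
```

The output stays strictly between 0 and 1, and an input of 0 gives exactly 0.5, which is the midpoint of the S-shaped curve.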
Let’s talk about the hypothesis output, that is, how we get a value of either 1 or 0 from our estimated probability. Consider the following hypothesis:
Here hθ(x) is some number. We treat that number as the estimated probability ŷ that y = 1 for a new input example x.
To understand that, let’s take a very common example, tumor classification, as follows.
Note: the 0th value of x is, by convention, always 1. The 1st value is the tumor size, and if there were other features, the 2nd value might be, say, the tumor age, and so on up to the nth value of x.
We have a feature vector x, where x0 equals 1 as always and our first feature is the size of the tumor. Suppose a patient comes in with some tumor size, we feed the feature vector x into our hypothesis, and the hypothesis outputs the number 0.8. We can interpret this as follows.
The probability that y = 1 is 0.8. In other words, the patient should be told that there is an 80% chance that the tumor is malignant.
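The tumor example can be sketched in a few lines of Python. The parameter values in `theta` and the tumor size below are hypothetical and not fitted to any data; they are picked only so that the hypothesis returns a probability close to the 0.8 used in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x): estimated probability that y = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical parameters, NOT fitted to any real data.
theta = [-3.0, 1.5]   # [theta_0 (intercept), theta_1 (weight for tumor size)]
x = [1.0, 3.0]        # [x_0 = 1 by convention, x_1 = tumor size (made-up value)]

p_malignant = hypothesis(theta, x)
print(f"Estimated probability that y = 1: {p_malignant:.2f}")  # about 0.82
```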
Let’s write this more formally, using math.
We read it as: the hypothesis output for feature vector x is the probability p that y = 1, given x, parameterized by θ (theta).
Since this is a classification task, we know that y can take only one of two values, 1 or 0. Rearranging the equation a little, we can also write the following.
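In the standard notation, the interpretation described above and its rearrangement can be written as follows. This is the usual textbook form, given here as a reconstruction rather than a copy of the original equations.

```latex
\begin{align}
h_\theta(x) &= P(y = 1 \mid x;\, \theta) \\
P(y = 0 \mid x;\, \theta) &= 1 - P(y = 1 \mid x;\, \theta) = 1 - h_\theta(x)
\end{align}
```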
We now have a clear interpretation of the hypothesis of Logistic Regression and of the mathematical formula that defines it. Let’s now take a look at the decision boundary of Logistic Regression, which is the boundary such that the prediction on one side of it is 0 and on the other side is 1.
We have to recall the sigmoid function and our hypothesis here, as follows.
S(x) forms an S-shaped curve that approaches 0 for large negative x and approaches 1 for large positive x. Here we will see when our hypothesis predicts y = 1 and when it predicts y = 0. We will also see what the hypothesis looks like when there is more than one feature.
At this point we must choose a threshold, say 0.5, such that the prediction is 0 or 1 under the following condition.
The figure below helps in understanding the above equation.
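A minimal sketch of the thresholding rule, assuming the usual convention that we predict 1 when the estimated probability is at least the threshold and 0 otherwise. The helper name `predict_label` is hypothetical, and `z` stands for the linear combination θᵀx that is fed into the sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Return 1 when sigmoid(z) >= threshold, otherwise 0."""
    return 1 if sigmoid(z) >= threshold else 0

# Sample values of z, chosen only for illustration.
for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"z = {z:>4}: probability = {sigmoid(z):.3f}, prediction = {predict_label(z)}")
```

With a threshold of 0.5, the prediction is 1 exactly when z is greater than or equal to 0, because the sigmoid crosses 0.5 at z = 0.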
Now, consider the following data-set, where the data points are represented as orange and blue spheres.
The hypothesis for predicting 1 or 0 is as follows:
The intercept term of θ is -10, so substituting it into the equation above gives
This shows that x1 + x2 = 10 forms a straight line: anything on or above that line results in the prediction y = 1, and anything below it results in the prediction y = 0.
We call this line the Decision Boundary.
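The linear decision boundary can be checked numerically with a short sketch. The intercept of -10 comes from the text; the weights of 1 for x1 and x2 are an assumption, inferred from the stated boundary x1 + x2 = 10, and the sample points are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# theta_0 = -10 is taken from the text; theta_1 = theta_2 = 1 are ASSUMED,
# inferred from the stated boundary x1 + x2 = 10.
THETA = (-10.0, 1.0, 1.0)

def predict(x1, x2, theta=THETA):
    """Predict 1 when h_theta(x) >= 0.5, i.e. when theta_0 + theta_1*x1 + theta_2*x2 >= 0."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if sigmoid(z) >= 0.5 else 0

# Made-up points on both sides of the line x1 + x2 = 10.
for x1, x2 in [(2, 3), (5, 5), (7, 6), (9, 0.5)]:
    print(f"({x1}, {x2}): x1 + x2 = {x1 + x2} -> y = {predict(x1, x2)}")
```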
An important note on the decision threshold: this is the value that determines how a predicted probability is converted to either 1 or 0. We are using a threshold of 0.5 here, but thresholds are best set by examining the cost trade-off between false negatives and false positives at different threshold values. For more complex data-sets, we have to examine the data and choose a suitable value somewhere between 0 and 1. Most of the time this means trying different thresholds and comparing the resulting predictions.
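The trade-off between false positives and false negatives can be illustrated with a small sketch. The probabilities and labels below are made up purely for illustration and are not results from any model; `confusion_counts` is a hypothetical helper name.

```python
# Made-up predicted probabilities and true labels, for illustration only.
probs = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

def confusion_counts(probs, labels, threshold):
    """Count false positives and false negatives at a given threshold."""
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return fp, fn

for threshold in (0.3, 0.5, 0.7):
    fp, fn = confusion_counts(probs, labels, threshold)
    print(f"threshold = {threshold}: false positives = {fp}, false negatives = {fn}")
```

On this toy data, lowering the threshold removes false negatives at the price of more false positives, and raising it does the opposite. Which side matters more depends on the application, for example a missed fraud case versus a wrongly blocked transaction.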
Decision boundaries are not always linear; they vary depending on the complexity of the data-set. Consider the following data-set, where the data points on either side of the decision boundary are represented by orange and blue spheres.
In the case above, the hypothesis for creating the decision boundary is represented as
Nor does the decision boundary always have to be a circle. Its shape depends on the complexity of the data-set, and we may get an irregular shape as the complexity increases.
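A circular boundary arises when squared features are added to the hypothesis. The sketch below assumes a hypothesis of the form sigmoid(θ0 + θ1·x1 + θ2·x2 + θ3·x1² + θ4·x2²) with hypothetical parameters chosen so that the boundary is the unit circle x1² + x2² = 1; the article's own parameter values are not known.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL parameters: boundary is x1^2 + x2^2 = 1 (the unit circle).
THETA = (-1.0, 0.0, 0.0, 1.0, 1.0)

def predict(x1, x2, theta=THETA):
    """Predict 1 outside (or on) the circle and 0 inside it."""
    features = (1.0, x1, x2, x1 ** 2, x2 ** 2)
    z = sum(t * f for t, f in zip(theta, features))
    return 1 if sigmoid(z) >= 0.5 else 0

print(predict(0.0, 0.0))  # inside the circle
print(predict(0.5, 0.5))  # inside: 0.25 + 0.25 < 1
print(predict(1.0, 1.0))  # outside: 1 + 1 > 1
print(predict(0.0, 2.0))  # outside
```

Adding higher-order terms (for example x1·x2 or cubic terms) to `features` would allow more irregular boundaries.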
The Cost Function measures the cost the algorithm pays when it predicts hθ(x) for an example whose label is y. In other words, the Cost Function evaluates the performance of our Machine Learning algorithm.
We want the cost function because we want to minimize it. Minimizing it means decreasing the error and thereby improving the accuracy of our model, and it is done by iterating over the training data-set and adjusting the parameters of our model, such as weights and biases.
Let’s have a look at a training data-set with m examples, as follows:
where x is a feature vector with n features and y is the label, which can take a value of either 0 or 1. Each x and y is represented as follows:
Based on the values above, we have to choose the best value of θ (theta) for our hypothesis hθ(x), such that our predictions are as accurate as possible.
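In the usual notation, the training set and the feature vector described above can be written as follows. The superscript (i) indexes the training examples, and x0 = 1 follows the convention stated earlier; this is a reconstruction in standard notation, not a copy of the original figures.

```latex
\begin{align}
\text{Training set: } & \{(x^{(1)}, y^{(1)}),\ (x^{(2)}, y^{(2)}),\ \dots,\ (x^{(m)}, y^{(m)})\} \\
x &= \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1},
\quad x_0 = 1, \quad y \in \{0, 1\}
\end{align}
```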
To do that, we use the Cost Function. Linear Regression also uses a cost function, the squared error, but it cannot be used for Logistic Regression, because combined with the sigmoid hypothesis it results in a non-convex function of the parameters θ. Gradient descent could then get stuck in a local minimum instead of reaching the global minimum.
Instead, we want a Cost Function that results in a bowl-shaped curve, in other words a convex function, so that gradient descent converges to the global minimum.
The cost function specific to Logistic Regression is represented as follows:
The important thing to understand here is that the cost function penalizes our learning algorithm heavily for confident wrong predictions: if our hypothesis hθ(x) outputs 0 but the actual label is y = 1, the cost goes to infinity, and the same applies in the opposite case.
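The per-example cost described above is commonly written in the following piecewise form (the standard log-loss, given here as a reconstruction in the article's notation).

```latex
\mathrm{Cost}(h_\theta(x), y) =
\begin{cases}
-\log\big(h_\theta(x)\big) & \text{if } y = 1 \\
-\log\big(1 - h_\theta(x)\big) & \text{if } y = 0
\end{cases}
```

When y = 1 and hθ(x) approaches 0, the term -log(hθ(x)) grows without bound, which is the infinite penalty mentioned above; when the prediction matches the label exactly, the cost is 0.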
If we write the cost function in a single line, it looks as follows:
By using this cost function, we can apply gradient descent to optimize our machine learning model and make its predictions as accurate as possible.
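The following is a minimal sketch of batch gradient descent on the log-loss cost, written in plain Python. It is not the code from the linked repository: the toy data, the learning rate, the number of iterations and the helper names are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1 - y)*log(1 - h))."""
    eps = 1e-12  # keeps log() away from exactly 0
    total = 0.0
    for xi, yi in zip(X, y):
        h = min(max(hypothesis(theta, xi), eps), 1.0 - eps)
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / len(X)

def gradient_step(theta, X, y, learning_rate):
    """One batch update: theta_j -= lr * (1/m) * sum((h - y) * x_j)."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        error = hypothesis(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += error * xij / m
    return [t - learning_rate * g for t, g in zip(theta, grad)]

# Toy data made up for illustration: each x is [x_0 = 1, one feature].
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)
for _ in range(2000):
    theta = gradient_step(theta, X, y, learning_rate=0.1)
final_cost = cost(theta, X, y)

print(f"cost before training: {initial_cost:.4f}")
print(f"cost after training:  {final_cost:.4f}")
print(f"learned theta: {[round(t, 3) for t in theta]}")
```

Because the cost is convex, a sufficiently small learning rate lowers it at every step. In practice a library implementation would normally be used instead of a hand-written loop.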
I’ve implemented Logistic Regression on the famous Titanic data-set to predict Y, the label named “Survived”, whose possible values are 1 (Yes) and 0 (No).
The full code can be accessed through the following link: https://github.com/rfhussain/Using-Logistic-Regression-for-Prediction
This article is inspired by Andrew Ng’s Artificial Intelligence Lectures and the book An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.