Part III: Logistic Regression
Logistic Regression is one of the most widely used algorithms in the industry!
But first, let us recap our journey so far.
First, we went through a gentle introduction to ML, which discussed ML at a high level.
Then, we understood the general flow of any ML algorithm which involved the following steps:
- General structure of dataset (Labelled/Supervised or Unlabelled/Unsupervised)
- Discovering the underlying function (Y is the function of X: Y = f(X))
- Cost Function to represent the difference between Y_training (known labels) and Y_predicted (based on X_training)
- Training or determining model parameters that minimize the Cost Function
We studied all the parts in the flow of an ML algorithm for Linear Regression. This post will be an extension of the same flow for the mighty Logistic Regression.
Remember how we saw that Linear Regression is essentially fitting a straight line to the given data points? Logistic Regression, in essence, is just an extension of Linear Regression.
The major difference between the two is that Linear Regression is used where the output variable is continuous, while Logistic Regression is used where the output variable belongs to one of a given set of classes/categories (Classification).
The data is represented as given in the table below.
Let us take this step-by-step to keep it simple and easy.
Step I: Logistic Regression calculates a weighted sum of the input variables.
Let us assume the case with single variable again:
- X_training is Age
- Y_training was Height in cm for Linear Regression.
- For this case, let us assume Y_training is categorical with the class ‘Short’ representing Height < 100 cm and the class ‘Tall’ representing Height >= 100 cm. This categorization has just been done to make the output categorical.
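To make this concrete, here is a minimal sketch of how such a categorization could be done. The ages and heights below are made-up numbers purely for illustration; the actual data may differ:

```python
import numpy as np

# Made-up Age (X_training) and Height values, purely for illustration
ages = np.array([3.0, 5.0, 8.0, 12.0, 15.0])
heights_cm = np.array([92.0, 108.0, 125.0, 148.0, 170.0])

# Categorize: 'Tall' (1) if Height >= 100 cm, else 'Short' (0)
y_training = (heights_cm >= 100).astype(int)
print(y_training)  # [0 1 1 1 1]
```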
In this step, the same computation as Linear Regression happens and a straight line is fitted.
The calculation made is: Step_I_Output = mX + c
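As a quick sketch (m and c here are arbitrary placeholder values, not trained parameters):

```python
import numpy as np

# Hypothetical single feature (Age) and placeholder parameters m, c
ages = np.array([3.0, 5.0, 8.0, 12.0, 15.0])
m, c = 9.5, 70.0

# Step I: the same weighted sum that Linear Regression computes
step_i_output = m * ages + c
print(step_i_output)  # real-valued numbers, not class labels
```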
If we use this Step_I_Output as Y_predicted, it will be a number while the Y_training (known) is actually categorical.
So, how do we solve this problem?
One way could be to define a threshold for the value of Y_predicted, say 100. If the value of (mX + c = Step_I_Output) is greater than or equal to 100, the particular instance will be assigned to the class ‘Tall’ and vice-versa (‘Short’ if Step_I_Output < 100).
However, there are multiple problems with this approach, including the question of what happens when the value of Step_I_Output falls below 0.
In a classifier, it makes more intuitive sense to be able to have an output between 0 and 1 which can be interpreted as a probability of belonging to a particular class. The class ‘Tall’ is taken as the positive class here and denoted as 1, while the other class ‘Short’ is taken as the negative class and denoted as 0. This leads us to the second step.
Step II: Logistic Function
Step_I_Output is fed into a logistic function. Thus, the logistic of Step_I_Output is the final Step_II_Output which is used as the estimated probability of belonging to a particular class (‘Tall’ or positive class).
I think the previous sentence packed a bit too much in it. The obvious questions you may be thinking of are:
- What is the Logistic Function?
- What is the Step_II_Output? What is meant by the logistic of Step_I_Output?
- How can Step_II_Output be used as the estimated probability of belonging to a particular class? How can it be suddenly a probability? After all, Step_I_Output is mX + c and can be any number! What sorcery is this?!!!
Let us start with “What is the Logistic Function?”
It is simple. It is just a mathematical function which takes a number as input and outputs another number, just like any other function! However, it is special in the sense that the output it gives is a number between 0 and 1. It is also called the Sigmoid Function and has an S-shaped curve.
And make no mistake, this is no coincidence! This function has been deliberately chosen so that Step_I_Output, which is the weighted sum of features (mX + c) and can be any number, is converted into a number between 0 and 1 which can represent the probability of belonging to a class.
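Concretely, the Logistic (Sigmoid) Function is sigmoid(z) = 1 / (1 + e^(-z)). A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic (Sigmoid) Function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Whatever Step_I_Output is, the result lands between 0 and 1
print(sigmoid(-100.0), sigmoid(0.0), sigmoid(100.0))  # ~0.0, 0.5, ~1.0
```

Large negative inputs map towards 0, large positive inputs map towards 1, and an input of 0 maps to exactly 0.5.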
Step III: Predicting Y_predicted for training
Once we have found Step_II_Output, which is the probability (p) that the instance belongs to the positive class, Y_predicted for the training instance can be easily obtained.
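A common convention, and the one assumed in this sketch (the cut-off itself is a modelling choice), is to predict the positive class whenever p >= 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

step_ii_output = sigmoid(np.array([-2.0, 0.3, 1.5]))  # probabilities p

# Step III: assign the positive class ('Tall' = 1) when p >= 0.5
y_predicted = (step_ii_output >= 0.5).astype(int)
print(y_predicted)  # [0 1 1]
```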
Now let us obtain a generalized version of this in vectorised form. If you remember, the vectorised form was written for a training instance as shown in the image below:
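A minimal sketch of that vectorised form, assuming the usual convention of folding the intercept c into the parameter vector m alongside a constant feature equal to 1 (so that Step_I_Output = mT.X):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training instance with the bias folded in: x = [1, Age], m = [c, slope]
x = np.array([1.0, 8.0])
m = np.array([-2.0, 0.3])  # placeholder parameter vector

p = sigmoid(m.T @ x)       # Step_II_Output = sigmoid(mT.X)
print(p)                   # estimated probability of the positive class (~0.6)
```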
Cost Function: Log Loss
In Linear Regression, we used a straight line to estimate the actual relationship between X_training and Y_training. Then we formulated a Cost Function, the Mean Squared Error (MSE), to determine how far off our estimated relationship, which outputs Y_predicted, is from the actual one, Y_training. Here, the higher the value of MSE, the greater the difference between the estimated and actual relationships.
Similarly, for Logistic Regression, we need a Cost Function which captures the same behaviour, i.e. the value of the Cost Function needs to be high when the model predicts an instance or observation as belonging to a particular class (say ‘Tall’, or positive, or 1) while it actually belongs to the other class (‘Short’, or negative, or 0). In short, the Cost Function needs to be high when the model makes an error and misclassifies the instance.
The ultimate goal in training of an ML algorithm is to estimate the parameter vector m which minimizes the Cost Function, which in turn indicates that the estimated relationship between X_training and Y_training is close to the actual one.
For Logistic Regression, we need to find the parameter vector m which ensures that the model outputs a high probability (as close to 1 as possible) for any instance belonging to the class ‘Tall’ (positive, or 1) and a low probability (as close to 0 as possible) for any instance belonging to the class ‘Short’ (negative, or 0).
The Cost Function that achieves this is called the Log-Loss function. Let us see the possible cases for a single training instance, as the general equation looks a bit intimidating!
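The standard Log-Loss over n training instances is J(m) = -(1/n) * Σ [ y*log(p) + (1-y)*log(1-p) ]. For a single instance it reduces to just two cases: if the instance is actually positive (y = 1), the cost is -log(p), which blows up as p approaches 0; if it is actually negative (y = 0), the cost is -log(1-p), which blows up as p approaches 1. A small sketch to see this behaviour:

```python
import numpy as np

def log_loss(y_true, p_predicted):
    """Log-Loss (binary cross-entropy) averaged over the training instances."""
    return -np.mean(y_true * np.log(p_predicted)
                    + (1 - y_true) * np.log(1 - p_predicted))

# Confidently wrong predictions are punished far more heavily than good ones
print(log_loss(np.array([1.0]), np.array([0.9])))   # ~0.105 (good prediction)
print(log_loss(np.array([1.0]), np.array([0.01])))  # ~4.6   (confident mistake)
```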
It is important to note that there is no closed-form solution here, i.e. nothing resembling the Normal Equation we saw for Linear Regression.
The approach to minimize this Cost Function, thus, is to use an optimization algorithm like Gradient Descent: randomly initialize the weights and then iterate towards minimizing the Cost Function.
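A minimal sketch of batch Gradient Descent for this Cost Function (the learning rate, iteration count, and tiny dataset are arbitrary illustrative choices, and zero initialization is used in place of random initialization for brevity; the gradient of the Log-Loss with respect to m works out to XT.(p - y) / n):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.1, n_iterations=1000):
    """Batch Gradient Descent on the Log-Loss; X must include a bias column of 1s."""
    m = np.zeros(X.shape[1])               # initial weights (zeros for simplicity)
    for _ in range(n_iterations):
        p = sigmoid(X @ m)                 # Step II output for every instance
        gradient = X.T @ (p - y) / len(y)  # gradient of the Log-Loss w.r.t. m
        m -= learning_rate * gradient      # step against the gradient
    return m

# Tiny made-up dataset: bias column + Age; labels 0 ('Short') / 1 ('Tall')
X = np.column_stack([np.ones(4), np.array([2.0, 4.0, 9.0, 13.0])])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(train_logistic_regression(X, y))  # learned parameter vector m
```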
Thus, Logistic Regression is a binary classifier in this form. It can be generalized and extended into a multi-class classifier known as Softmax Classifier or Multinomial Logistic Regression.
It uses the Softmax Function instead of the Sigmoid Function, and the Cost Function used is Cross Entropy.
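For a flavour of the difference, here is a minimal sketch of the Softmax Function alone (not the full Softmax classifier): it turns a vector of class scores into probabilities that sum to 1.

```python
import numpy as np

def softmax(z):
    """Softmax: turns a vector of class scores into probabilities summing to 1."""
    exp_z = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.66, 0.24, 0.10]
```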
We will not be discussing these in detail here to keep things simple and keep us focused on Logistic Regression as binary classifier.
An often asked question:
Why is Logistic Regression considered part of generalized linear models when its actual output is not linear?
It is easy to see why it is considered a linear model if you are clear about the nature of Step_I_Output. It is the weighted sum of the input features (mT.X). Thus, the features are additive in nature: there is no term in which the input features are multiplied or divided by each other, or have any other type of interaction between themselves.
So, that ends our journey with the basic flow and functioning of Logistic Regression!
I will be discussing the different ways to evaluate a model such as Confusion Matrix, AIC, AUC-ROC Curve in separate posts where we will go through them one by one.
With this background, you are now ready to go through Chapters 3 and 4 of ISLR (Introduction to Statistical Learning in R).
See you soon in the next post on Decision Trees! :)