Logistic Regression (now with the scary maths behind it!)
Logistic Regression is a type of linear model that’s mostly used for binary classification but can also be used for multi-class classification. If the term linear model sounds familiar, that might be because Linear Regression is also a type of linear model.
To proceed with this notebook you first have to make sure that you understand ML concepts like Linear Regression, Cost Function, and Gradient Descent and mathematical concepts like Logarithm and Matrices. If you don’t, then the links below can help you out.
If you would like to follow the topic with interactive code then, I have made a Kaggle notebook for this exact purpose. Click here to try it out for yourself!
To understand Logistic Regression properly, you first have to understand a few other concepts. Think of Logistic Regression like an onion. Just as you have to peel through multiple layers to reach the sweet juicy middle part of an onion, you have to work through a few concepts before you can understand Logistic Regression from scratch!
(When did onions have a sweet juicy middle part?)
(I don’t know…he probably meant…some fruit?)
Onions aside, let’s first learn about the Decision Boundary.
Decision Boundary
In the simplest terms, a decision boundary is just a line that can help us in identifying which point belongs to which class. The image below can help you understand a decision boundary much more clearly.
Here the blue line separates the two classes which are represented as green and red dots. Any point to the left of the decision boundary belongs to the class represented with the red dots. Any point to the right belongs to the class represented with the green dots. That’s all that a decision boundary does.
It can be calculated using the equation of a straight line itself. The equation of a straight line in its general form is ax + by + c = 0.
Where
- a is the coefficient of x,
- b is the coefficient of y.
- c is some arbitrary constant.
Using this equation, we can write the equation of the decision boundary as β0 + β1x1 + β2x2 = 0.
Where,
- x1 is the 1st feature variable.
- x2 is the 2nd feature variable.
If we are able to calculate x2 values for given x1 values, then we can plot our decision boundary. Rearranging the boundary equation gives x2 = −(β0 + β1x1) / β2.
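This rearrangement can be sketched in a few lines of NumPy. The coefficient values below are hypothetical, picked purely for illustration:

```python
import numpy as np

# Hypothetical coefficients for the boundary b0 + b1*x1 + b2*x2 = 0.
b0, b1, b2 = -4.0, 1.0, 2.0

# Solve for x2 in terms of x1: x2 = -(b0 + b1*x1) / b2.
x1 = np.linspace(-10, 10, 5)
x2 = -(b0 + b1 * x1) / b2

# Each (x1, x2) pair lies exactly on the decision boundary,
# so plotting x2 against x1 draws the boundary line.
print(list(zip(x1, x2)))
```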
Now that we have a way to plot the decision boundary, you might think “Why don’t we use Linear Regression for this? It can help us plot a line based on β values.”
It is true that Linear Regression can help us plot a line based on some β values, but the Cost Function of Linear Regression minimizes the distance between the line of best fit and the actual points. This isn’t helpful in classifying points. For ideal classification, we would need to get the probability of something belonging to a certain class and assign that item a class only if the probability is above a certain threshold. From this, you can infer two things,
- For Logistic Regression, we’ll need a way to get the values in terms of probabilities.
- For Logistic Regression we would need a new Cost Function.
Sigmoid Function
The sigmoid function looks like this:

σ(z) = 1 / (1 + e^(−z))

It takes in any series of numbers and maps each one into the range 0 to 1, which lets us read the outputs as probabilities. Let’s take an example.
Suppose I have a list of numbers from -100 to 100, {num | num ∈ [-100, 100]}. If I pass this list inside the sigmoid function, it would be turned into something like this.
The graph above gives us, for each number, the probability of it being greater than zero. If we treat every number whose sigmoid value is greater than 0.5 as positive, and every number whose sigmoid value is less than 0.5 as negative, then we recover all the positive and negative numbers present in our input list.
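The thresholding idea above can be sketched in a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

nums = np.arange(-100, 101)   # integers from -100 to 100
probs = sigmoid(nums)

# Thresholding at 0.5 recovers the sign of each number,
# because sigmoid(z) > 0.5 exactly when z > 0.
positives = nums[probs > 0.5]
print(positives[:5])  # → [1 2 3 4 5]
```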
We can try to predict the class of an item using β0+β1x1+β2x2. If we plot this line on a graph it would look something like this.
This line has a problem. No matter what your class names are, one of them is treated as class 1 while the other is treated as class 0, meaning our predictions should always be in the range 0 to 1, which is something this line doesn’t do. So to fix this, we pass the line through the sigmoid function, giving p = σ(β0 + β1x1 + β2x2).
This equation can also be written in terms of matrices as p = σ(XB).
Where -
- B is the matrix with all the regression coefficients.
- X is the matrix with all the feature values, with an added column of 1s.
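As a minimal sketch of the matrix form in NumPy (the feature values and coefficients below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two features for three items (hypothetical values).
features = np.array([[1.0, 2.0],
                     [3.0, 0.5],
                     [-2.0, 1.0]])

# X: prepend a column of 1s so the first coefficient acts as the intercept β0.
X = np.hstack([np.ones((features.shape[0], 1)), features])

# B: one coefficient per column of X, i.e. [β0, β1, β2].
B = np.array([-1.0, 0.8, 0.5])

# p = sigmoid(XB) gives one probability per item.
p = sigmoid(X @ B)
print(p)  # three values, each strictly between 0 and 1
```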
The sigmoid function can help us in differentiating two classes but only when we have the equation of the ideal line to pass into the function. And how can we get the equation of the ideal line? It’s simple. By minimizing the cost function for Logistic Regression.
Cost Function
Just like Linear Regression had MSE as its cost function, Logistic Regression has one too. So let’s derive it.
Likelihood Function
So…we know that Logistic Regression is used for binary classification. Meaning the predictions can only be 0 or 1 (Either it belongs to a class, or it doesn’t). So suppose, the probability of something belonging to class 1 is p, then the probability of it belonging to class 0 would be 1−p.
We can combine these two cases into a single equation: P(y) = p^y · (1 − p)^(1 − y).
If we substitute y = 1, the equation reduces to p. If we substitute y = 0, it reduces to 1 − p.
This equation is called the likelihood function, and it gives us the likelihood of a single item belonging to its class. To get the likelihood of all the items in a series, we multiply the individual likelihoods together: L = ∏ᵢ pᵢ^yᵢ · (1 − pᵢ)^(1 − yᵢ).
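A small numeric sketch of the likelihood (the predicted probabilities and labels below are made up):

```python
import numpy as np

def likelihood(p, y):
    """Likelihood of one item: p if y == 1, (1 - p) if y == 0."""
    return p ** y * (1 - p) ** (1 - y)

# Hypothetical predicted probabilities and true 0/1 labels.
p = np.array([0.9, 0.2, 0.7])
y = np.array([1, 0, 1])

per_item = likelihood(p, y)   # [0.9, 0.8, 0.7]
total = np.prod(per_item)     # 0.9 * 0.8 * 0.7 = 0.504
print(per_item, total)
```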
Log-Likelihood Function
When we start applying it to a whole series, the likelihood function multiplies many probabilities together, so the result quickly becomes vanishingly small and awkward to compute with. To tackle this problem, we take the log of the function, which turns the product into a sum.
This function takes in the values of pi and 1−pi which range from 0 to 1 (it takes in probabilities).
Let’s plot a log of numbers that fall between 0 and 1.
As you can see, the log of a number between 0 and 1 is negative, meaning the whole function P(y) would be negative for all inputs. So we multiply P(y) by −1 to fix this.
One more thing: ∑ᵢ₌₁ⁿ (yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ)) gives us the sum of all the errors, not the mean. To fix this, we divide the whole equation by n to get the mean error.
And to avoid overfitting, let’s add penalization to the equation just the way we added it to the cost function for Ridge Regression.
The function we have here is also called the Regularized Cost Function, and it gives us the error for any particular set of β values.
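A sketch of this regularized cost function in NumPy. The penalty strength `lam` is a hyperparameter you would tune, and leaving the intercept β0 out of the penalty is a common convention I am assuming here, mirroring Ridge Regression:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(B, X, y, lam=0.1):
    """Mean negative log-likelihood plus an L2 (Ridge-style) penalty.

    B   : coefficient vector [β0, β1, ..., βn]
    X   : feature matrix with a leading column of 1s
    y   : 0/1 labels
    lam : penalty strength (a hyperparameter)
    """
    n = len(y)
    p = sigmoid(X @ B)
    log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Convention: the intercept β0 is left out of the penalty.
    penalty = (lam / (2 * n)) * np.sum(B[1:] ** 2)
    return log_loss + penalty

# Sanity check: with all coefficients at zero, p = 0.5 everywhere,
# so the cost is -log(0.5) = log(2) ≈ 0.693.
X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1])
print(regularized_cost(np.zeros(2), X, y))
```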
Gradient Descent
Now that we have our Cost Function all we need to do is find the minimum value of it to get the best predictions. And we can do this by applying partial differentiation to the function.
According to the Convergence Theorem, the ideal β values can be reached by repeatedly applying the update βₙ := βₙ − α · ∂J/∂βₙ, where α is the learning rate.
All we need to do is find the value of ∂J/∂βn for each β and we are good to go.
We know the Cost Function so we can get the value of ∂J/∂β0 by applying partial differentiation to it.
We know that
On adding −1 and 1 to the above equation, we get
On substituting ∂pᵢ/∂β0 into the derivative of the cost function with respect to β0, we get ∂J/∂β0 = (1/n) ∑ᵢ (pᵢ − yᵢ).
Similarly, if you differentiate J with respect to β1, you will get ∂J/∂β1 = (1/n) ∑ᵢ (pᵢ − yᵢ) x1ᵢ.
In general, for βn you will get ∂J/∂βn = (1/n) ∑ᵢ (pᵢ − yᵢ) xnᵢ, plus the penalty term’s contribution (λ/n) βn if regularization is included.
Now that we have the Cost Function and a way to implement Gradient Descent on it, all we need to do is run a loop for some number of iterations to get the best values of all the βs for our classification problems.
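Putting it together, a minimal gradient-descent training loop might look like the sketch below. The learning rate, iteration count, and the toy data are all made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=5000, lam=0.0):
    """Gradient descent on the log-loss.

    Uses the gradient (1/n) Xᵀ(p − y), with an optional Ridge-style
    penalty on every coefficient except the intercept.
    """
    n, m = X.shape
    B = np.zeros(m)
    for _ in range(n_iters):
        p = sigmoid(X @ B)
        grad = X.T @ (p - y) / n
        grad[1:] += (lam / n) * B[1:]   # penalty gradient, if lam > 0
        B -= lr * grad
    return B

# Toy 1-feature problem: class 0 below zero, class 1 above.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # add the column of 1s
y = np.array([0, 0, 0, 1, 1, 1])

B = fit_logistic(X, y)
preds = (sigmoid(X @ B) >= 0.5).astype(int)
print(B, preds)   # preds should match y on this toy data
```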
Once we have them, we can use them to create a line and pass it through the sigmoid function, which would look something like this,
If you compare this to the line in Image 7 you can see that it overcomes the shortcoming the previous line had. The values predicted by this line are between 0 and 1.
Once we have the ideal β values we can pass them into the equation in Image 4 to get the Decision Boundary.
Sources -
- StatQuest on YouTube.
- Article on Analytics Vidhya by Anshul Saini.
- Article on Medium.com by Asha Ponraj.
- Article on KDNuggets by Clair Liu.
- Article on satishgunjal.com by Satish Gunjal.
- Article on Log Loss by Megha Setia on Analytics Vidhya.
- Kaggle notebook by Rishi Jain.