Logistic Regression Explained for Beginners
Logistic Regression (LR) is a popular algorithm borrowed by the Machine Learning community from the field of Statistics.
In this blog, we will learn how Logistic Regression works and how it can be used to make predictions on unknown data.
⚠️ Please note that knowledge of fundamental Calculus and linear algebra is a prerequisite for this article. You can brush up on your Math concepts here.
Logistic Regression is a form of supervised learning where we use labelled datasets to train algorithms to classify data or predict outcomes accurately.
In other words, for a given pair (X, y), we want the model to produce prediction values ŷ for the input X that match the known labels y (also called the true labels or ground truth).
This means that if our input (X) shows the picture of a cat, then the output ŷ also confirms it.
⚡⚡ Even though it’s called Logistic Regression, it is actually a Binary Classification algorithm.
Therefore, it can only produce two distinct outcomes — 0 and 1 for the value of ŷ, i.e. in this case, the value of ŷ is either a cat or not a cat.
Mathematically, the above statement can be formulated as follows: given an input X, we want ŷ = P(y = 1 | X), where y ∈ {0, 1}.
Now, let us assume that ŷ can be defined as the following: ŷ = σ(wᵀX + b), where w is the weight vector and b is the bias.
We define σ(z), the Sigmoid function, as below: σ(z) = 1 / (1 + e^(−z)), which squashes any real number into the range (0, 1).
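The Sigmoid function can be sketched in a few lines of NumPy (the function name and sample inputs below are illustrative, not part of any particular library):

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5, an input of 0 sits exactly halfway
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```

Because the output always lies between 0 and 1, it can be read directly as a probability.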
🎯 The goal of Logistic Regression is to learn the parameters w and b so that ŷ becomes a good estimate of the probability of y being equal to 1.
To learn the parameters w and b, we require the following:
- A Cost function (J)
- An Optimisation Algorithm (here, Gradient Descent) to minimise the aforementioned Cost Function.
Cost Function
While the Loss function (L) measures how well the model estimates the relationship between the input features (X) and the output (y) for a single training sample, the Cost function (J) measures the same over all m training samples in the dataset.
You can find a detailed explanation of Loss and Cost functions for Logistic Regression here.
Here, we define Cost Function (J) as the mean or average of Cross-Entropy loss.
Therefore, we can mathematically formulate J as the following: J(w, b) = −(1/m) Σ [ y log(ŷ) + (1 − y) log(1 − ŷ) ], where the sum runs over all m training samples.
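As a rough NumPy sketch of this formula (assuming, as a convention of my own, that X has shape (n_features, m) and y has shape (m,)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """Mean cross-entropy loss over all m training samples."""
    y_hat = sigmoid(w.T @ X + b)  # predictions for every sample at once
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Confident, correct predictions give a cost close to 0
X = np.array([[-10.0, 10.0]])
y = np.array([0.0, 1.0])
print(cost(np.array([1.0]), 0.0, X, y))  # a very small positive number
```

Note that the cost blows up when the model is confidently wrong, which is exactly what pushes Gradient Descent to correct such mistakes.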
🎯 Our goal now is to minimise the aforementioned Cost function. For this, we will use the Gradient Descent algorithm, explained in the next section.
Optimisation: Minimising Cost using Gradient Descent
Intuition
The intuition behind Gradient Descent is to iteratively find the point at which the cost function is minimum.
📌 Gradient Descent greedily takes the direction of steepest descent, moving downhill as quickly as possible until it converges to the global minimum.
Here, direction refers to how model parameters should be altered to further reduce the cost function and convergence indicates that the values of the model parameters do not change significantly any longer.
Algorithm
- In this algorithm, we initialise w and b to some values (either 0 or random).
- We try to adjust (update) the value of w and b so that the cost function J is reduced.
- Mathematically, Gradient Descent can be expressed as follows: repeat the updates w := w − α (∂J/∂w) and b := b − α (∂J/∂b) until convergence, where α is the learning rate.
📌 This means that we need to know the slope of the cost function at the current setting of the parameters, so that we can take steps of steepest descent and know in which direction to step in order to go downhill on the cost function J(w, b).
- The Learning Rate, denoted as α (alpha), is a hyperparameter that needs to be tuned during the training of our models.
It signifies how fast the algorithm should move down the slope.
For larger values of α, Gradient Descent takes bigger steps; for smaller values of α, it takes smaller steps.
- The algorithm stops when the values of w and b no longer change significantly.
It is then said to have reached a point of convergence.
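The steps above can be sketched as a short NumPy loop (the function name, array shapes, and stopping rule are illustrative choices of mine, not the only possible ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.5, tol=1e-6, max_iters=10000):
    """X: (n_features, m), y: (m,). Returns the learned w and b."""
    n, m = X.shape
    w, b = np.zeros(n), 0.0            # initialise parameters to 0
    for _ in range(max_iters):
        y_hat = sigmoid(w.T @ X + b)   # current predictions
        dz = y_hat - y                 # per-sample error
        dw = X @ dz / m                # dJ/dw
        db = dz.mean()                 # dJ/db
        w, b = w - alpha * dw, b - alpha * db
        # stop once the parameters no longer change significantly
        if np.abs(alpha * dw).max() < tol and abs(alpha * db) < tol:
            break
    return w, b
```

On a toy 1-D dataset such as X = [[-2, -1, 1, 2]] with labels [0, 0, 1, 1], the loop learns a positive weight that separates the two classes.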
⚡⚡ Choosing the right learning rate can often be a challenge. On one hand, a value of alpha that is too small may result in a long training process that can get stuck. On the other hand, a value that is too large can cause undesirable divergent behaviour, where the updates overshoot the point of the minimum and the loss keeps oscillating instead of decreasing.
Model Prediction
When the Gradient Descent finally converges, it means that the algorithm has found the values of w and b at which the Cost function J is at its minimum.
These values of w and b are now used to calculate the value of ŷ, which indicates the probability of y being equal to 1, for a given input value of X (as indicated in Fig.2).
In classification problems, we define a Threshold (also sometimes called a Decision Boundary). If our output ŷ is above that threshold, we classify the example as positive; otherwise, it is negative.
The threshold value can vary from case to case and usually depends on how strict we want our model to be in terms of classification.
For example, in the case of the image classifier above, if our threshold is 0.5, then for the values of ŷ greater than or equal to 0.5, we classify them as 1 (interpreted as cats) and below 0.5 are classified as 0 (interpreted as not cats).
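A minimal sketch of this thresholding step (the helper name classify is my own, not a library function):

```python
import numpy as np

def classify(y_hat, threshold=0.5):
    """Turn predicted probabilities into hard 0/1 labels."""
    return (y_hat >= threshold).astype(int)

probs = np.array([0.2, 0.5, 0.9])
print(classify(probs))        # [0 1 1]
print(classify(probs, 0.95))  # stricter threshold: [0 0 0]
```

Raising the threshold makes the model stricter: fewer examples are labelled as cats, trading recall for precision.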
Conclusion
Logistic Regression forms the foundation of classification, on which more complex algorithms (including Neural Networks) are built.
For real-life Python applications, it is possible to skip the implementation details by using pre-existing modules from Scikit-Learn. However, the conceptual details are essential to develop a deeper understanding of the subject.
Note: For simplicity, we have used the same notation for derivatives and partial derivatives of functions.
The graphs in this article are only meant for representational purposes and are not to scale.
✨ This blog series is inspired by the notations and theory covered by the Coursera Deep Learning specialisation by DeepLearning.ai.
✨ If you like the article, please subscribe to get my latest ones.
To get in touch, contact me on LinkedIn or via ashmibanerjee.com.