Logistic Regression with Gradient Descent Explained | Machine Learning

What is Logistic Regression? Why is it used for Classification?

Ashwin Prasad
Analytics Vidhya
7 min read · Jun 14, 2021


What is a Classification Problem?

In general, supervised learning consists of two types of problem settings.

  • Regression: This is the type of problem where the data scientist models the relationship between the independent variables and a continuous dependent variable using a suitable model, and uses that model to make accurate predictions for future input data.
    For example,
    Predicting the sales of ice cream on a given day, given the temperature.
    Here, sales of ice cream is a continuous variable, which means it can take any value,
    e.g. 500, 100, 10,000, 52,123, 931, etc.
  • Classification: This is the type of problem where the data scientist has to find the relationship between the independent variables and a discrete dependent variable and use that to make accurate future predictions.
    For example,
    Predicting whether it would rain or not based on the temperature and humidity readings. If it rains, the output would be 1. If it does not, it would be 0.
    Here, the value to be predicted can only take certain values. In the above case, rain or no rain (0 or 1).

Before continuing further, refer to Linear Regression with Gradient Descent for an understanding of how linear regression works and how an algorithm called gradient descent is the key to making it work. The same gradient descent algorithm is the one we will be using in logistic regression, and a lot of things will be similar to that post, so reading it first will lead to a better understanding of logistic regression.

Will the Linear Regression approach also work for Classification?

fig 1.1

Let’s consider a problem statement. Given the size of a tumor, we need to predict whether it is malignant or benign (y=0 for benign and y=1 for malignant).
Plotting the size of the tumor against the tumor type gives us the graph in fig 1.1.

In the regression setting, we could fit a line of best fit and predict that the points above a particular threshold on the line are malignant and the points below that threshold are benign.

There are some problems with this approach:

  • The value of Y can exceed 1 and go below 0. But we want the output to be a probability of occurrence of the expected outcome, so that this probability can be used to classify the output.
  • Leverage points (outliers far from the rest of the data) may affect this line drastically.

The Sigmoid Function and Making Predictions

The sigmoid function is the magic function that lets us get the desired probability as an output.
It does this by squishing any value to fit between 0 and 1.

fig 2.1

Fig 2.1 represents the sigmoid function. Its mathematical formula is sigmoid(x) = 1/(1+e^(-x)).

Similar to linear regression, we have weights and biases here, too. We first multiply the input by those weights and add the bias. The result then goes into the sigmoid function to give us a probability between 0 and 1.

  • z = x*w + b
    where w is the weight and b is the bias.
  • h_theta = sigmoid(z)

These weights and biases are chosen randomly at first, but the gradient descent algorithm will make sure that these parameters are updated to do a good job on the classification task.

With the help of this sigmoid function, we can successfully predict the output in terms of a probability.
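As a quick illustrative sketch (function names like predict_proba are only examples, not from any particular library), the prediction step described above could look like this in Python:

import numpy as np

def sigmoid(z):
    # squish any real value into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def predict_proba(x, w, b):
    # z = x*w + b, then pass z through the sigmoid to get a probability
    z = x * w + b
    return sigmoid(z)

# weights and biases start out random; gradient descent will tune them later
w, b = np.random.randn(), 0.0
print(predict_proba(2.5, w, b))  # some probability between 0 and 1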

Intuition behind Logistic Regression Cost Function

As gradient descent is the algorithm being used, the first step is to define a cost function (or loss function).
This function should be defined in such a way that it can tell us how much the predictions of our model deviate from the actual outcomes.

So, How do we define such a function ?

fig 3.1 (Andrew Ng’s lectures)

In figure 3.1, Cost(h_theta(x), y) is the function we have been looking for. But how does this function work? To understand this, we need to split the function into 2 parts: one part explains the function when the actual output is 1, and the other explains it when the actual output is 0.
To put it simply, we would like our model’s prediction to be as close to 1 as possible when the actual known target is 1, and close to 0 otherwise.

In the equation of J(theta), y represents the actual target value and h_theta(x) is our model’s output, the sigmoid prediction defined above.
These predictions lie between 0 and 1, so we get a probability as an output.

Part 1 : When Y = 1

When the actual target is 1, we want our model’s prediction to be as close to 1 as possible. So, our cost function should increase the penalty as our model’s prediction moves farther away from 1 and towards 0, and the penalty should decrease as the prediction comes nearer to 1. Our objective now is to define a function for this purpose,
and that function is nothing but -log(x).

fig 3.2: -log(x) graph

In fig 3.2, consider the y-axis to be the cost and the x-axis to be the model’s prediction. Note: our model’s prediction won’t exceed 1 and won’t go below 0, so that part is outside of our worries.

When the model’s prediction is closer to 1, the penalty is closer to 0. As it moves further from 1 and towards 0, the penalty increases. So, this function can be used when the actual target is 1.
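For instance, if the actual target is 1 and the model predicts 0.9, the penalty is -log(0.9) ≈ 0.1, whereas a prediction of 0.1 gives a penalty of -log(0.1) ≈ 2.3 (using the natural logarithm).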

Part 2 : When Y = 0

Similarly, when Y is equal to 0, we want our model’s predictions to be as close to 0 as possible, which means a lower penalty for values closer to 0 and a higher penalty for values farther away from 0 and towards 1.
So, the appropriate function for this is -log(1 - h_theta(x)).

fig 3.3 (second part of the cost function)

Fig 3.3 represents this second part of the cost function. That is, -log(1-h_theta(x)).

Consider the x-axis to be the value our model predicts and the y-axis to be the penalty that the model gets, assuming that the original target is 0.
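For instance, if the actual target is 0 and the model predicts 0.1, the penalty is -log(1-0.1) ≈ 0.1, whereas a prediction of 0.9 gives -log(1-0.9) ≈ 2.3.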

The 2 parts of the cost function are now ready. To ensure that the first part activates when y=1 (without the second part interfering) and the second part activates when y=0 (without the first part interfering), we multiply the first part by y and the second part by (1-y) and add them together.
In the end, we get the cost function shown in fig 3.1, highlighted in blue.
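As a small sketch of this combined cost in Python (names such as cost and y_pred are only illustrative):

import numpy as np

def cost(y_true, y_pred, eps=1e-12):
    # J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )
    # eps keeps log() away from exactly 0 or 1
    y_pred = np.clip(y_pred, eps, 1 - eps)
    m = len(y_true)
    return -(1 / m) * np.sum(y_true * np.log(y_pred)
                             + (1 - y_true) * np.log(1 - y_pred))

# when y = 1 only the first term is active; when y = 0 only the second
print(cost(np.array([1, 0]), np.array([0.9, 0.1])))  # ≈ 0.105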

Gradient Descent and Cost Function Derivatives

Now that we have defined a cost function, the aim is to find the optimal w and b that minimise this cost function for our dataset. This is where gradient descent comes in. By doing this, the model learns the parameters that reduce its penalty, thus making much more accurate predictions.

The gradient descent algorithm itself won’t be explained again; it is covered in the post linked above.

We would like to find how the cost changes with respect to w and b, so that we can slowly change the original w and b towards the optimal parameters.
The derivation of the gradients of the logistic regression cost function is shown in the figures below.

fig 4.1
fig 4.2
fig 4.3

After finding the gradients, we subtract them (scaled by a small learning rate) from the original w and b. We subtract so that the parameters move in the direction opposite to the slope, which ensures that the cost keeps decreasing.
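Putting everything together, a minimal training loop might look like the sketch below. The gradient expressions dJ/dw = (1/m) * X^T (h - y) and dJ/db = (1/m) * sum(h - y) are the standard results that the derivation in figs 4.1-4.3 leads to; the function name fit and the learning rate lr are illustrative choices, not from the original post.

import numpy as np

def fit(X, y, lr=0.1, epochs=1000):
    # X: (m, n) matrix of inputs, y: (m,) vector of 0/1 targets
    m, n = X.shape
    w, b = np.zeros(n), 0.0                 # parameters start at arbitrary values
    for _ in range(epochs):
        h = 1 / (1 + np.exp(-(X @ w + b)))  # model's probability output
        dw = (1 / m) * X.T @ (h - y)        # gradient of the cost w.r.t. w
        db = (1 / m) * np.sum(h - y)        # gradient of the cost w.r.t. b
        w -= lr * dw                        # step opposite to the slope
        b -= lr * db
    return w, b

Once fit returns the learned w and b, predictions for new points are made by passing them through the same sigmoid and thresholding the resulting probability (commonly at 0.5).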

The cost function tells us how much our model deviates from the most ideal model we could create. So, optimising the parameters in a way that reduces this cost function will ensure that we get a good classifier, assuming that the points are linearly separable, among other minor factors.

Conclusion

Similar to linear regression, we define a cost function that estimates the deviation between the model’s prediction and the actual target, and we minimise it using gradient descent by updating the original w and b.
This ensures that we can use these w and b to make future classifications with the model. The continuous output is converted to a probabilistic output using the sigmoid function.

Hence, the only two differences between logistic regression and linear regression are the cost function and the sigmoid function, which make logistic regression suitable for a classification problem setting.
