
Logistic Regression Using Gradient Descent: Intuition and Implementation

Ali H Khanafer · May 17, 2021

This is part three of a series I’m working on, in which we discuss and define introductory machine learning algorithms and concepts. At the very end of this article, you’ll find all the previous pieces of the series. I suggest you read Linear Regression: Intuition and Implementation before you dive into this one, simply because I introduce some concepts there that are very relevant to logistic regression, and I’ll refer back to them on numerous occasions.

In this article, we go through the theory behind logistic regression and then see it in action using Scikit-Learn’s LogisticRegression class.

Let’s get right into it.

Case Study

The best way to understand the concept of logistic regression is through an example. So, while reading the rest of this article, imagine yourself in the following scenario:

You’re a data scientist living in New York City, where the number of COVID-19 cases is quickly rising. Your friend has a fever, but he doesn’t want to waste $250 on a test if all he has is the flu. He asks you to train a model that will predict, based on his symptoms, whether or not he has COVID-19.

Logistic Regression

Despite its name, logistic regression isn’t used for regression but rather for classification. In classification, the goal is to classify data into a discrete number of groups based on their attributes. In our running example, we have two possible groups: positive and negative. Logistic regression is a supervised learning algorithm since it learns from pre-existing, labeled data in order to classify new, incoming data.

Instead of giving us a hard class label, logistic regression will output a probability: the model gives a value between zero and one. For example, a value of 0.75 indicates that there’s a 75% chance that a patient has COVID-19. More formally, we wish to find:

$$p(f(x) = 1 \mid x)$$

the probability that an example with features x belongs to class 1. We’ll look at the case where f(x) can only take two possible values, then we’ll leave it to you to figure out how we can extend this model to work with multiple classes.

Can linear regression be used to classify a patient as testing positive or negative? Technically, yes. Is it a good idea? Definitely not. Here’s why:

  1. In linear regression, the dependent variable can take a continuous range of values. In our scenario, we need our model to decide between a discrete number of values (see the sketch below).
  2. In order for linear regression to work well, there needs to be a linear correlation between the dependent and independent variables.
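
To see the first point in action, here’s a minimal sketch, using made-up data, of what happens when plain linear regression is fit to binary labels: the predictions aren’t confined to zero and one.

# Hypothetical demo: linear regression fit to binary (0/1) labels
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])  # made-up feature
y = np.array([0, 0, 0, 1, 1, 1])                           # binary labels

model = LinearRegression().fit(X, y)

# Predictions drift below 0 and above 1, which makes no sense
# as either a class label or a probability
print(model.predict(np.array([[0.0], [5.5], [12.0]])))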

Intuition

The term “logistic” comes from the use of the logistic function:

$$f(x) = \frac{L}{1 + e^{-k(x - x_0)}}$$

Equation 1: Logistic Function

Like the exponential function, the logistic function is used to model the growth of a population, except it takes into account the carrying capacity that limits that growth. We can manipulate this function so that it outputs a value between zero and one. L is defined as the curve’s maximum value; we want this value to be equal to one. We aren’t looking to change the growth rate, so k is also set to one. x_0 is defined as the x value of the curve’s midpoint. Setting it to zero, we get a midpoint of f(0) = 0.5, and the equation simplifies to:

$$f(x) = \frac{1}{1 + e^{-x}}$$

Equation 2: Simplified Logistic Function

Which gives us the following curve:

Figure 1: Logistic curve with L = 1, k = 1, and x_0 = 0

A few important points can be made about this curve:

  1. As x approaches positive infinity, f approaches one
  2. As x approaches negative infinity, f approaches zero
  3. There are horizontal asymptotes at y = 1 and y = 0
  4. f(0) = 0.5
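
As a quick sanity check, here’s a minimal sketch of the general logistic function in code; with L = 1, k = 1, and x_0 = 0 it reproduces the properties listed above:

import numpy as np

def logistic(x, L=1.0, k=1.0, x0=0.0):
    # General logistic function: max value L, growth rate k, midpoint x0
    return L / (1 + np.exp(-k * (x - x0)))

print(logistic(0))    # 0.5: the midpoint
print(logistic(50))   # ~1.0: f approaches one as x grows large
print(logistic(-50))  # ~0.0: f approaches zero as x grows negative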

Given a value for x, this function will spit out a value between zero and one. So what’s our x?

In the article on linear regression, we were able to come up with a general formula to be used when dealing with linearly correlated variables:

$$h(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_j x_j$$

Equation 3: Multivariate Linear Regression

Again, this equation outputs continuous numbers, so it can’t be used in a classification problem. It’s useful, however, as it describes the relationship between the different features. If we can squeeze the value output by h into a range between zero and one, then our problem is solved.

For simplicity, set x_0 = 1 in Equation 3; we can then represent it in vector form as:

$$h(x) = \theta^T x$$

Equation 4: Multivariate Linear Regression In Vector Form

Where Theta = [Theta_0, Theta_1, ..., Theta_j], x = [x_0, x_1, ..., x_j], and Theta^T is the transpose of the row vector Theta. This can then be used in our logistic function, to get:

$$f(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Equation 5: Logistic Function With x = h

And this is the equation we use for logistic regression. From the curve drawn in Figure 1 and Equation 5, we can conclude that:

  1. If h >= 0 then p(f(x) = 1 | x) >= 0.5
  2. If h < 0 then p(f(x) = 1 | x) < 0.5

In our case, if h>=0, we will predict a value of one. Otherwise, we predict a value of zero.
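
As a sketch, with hypothetical values for Theta and x, the decision rule looks like this in code:

import numpy as np

def predict(theta, x):
    # Equation 4: h = Theta^T x
    h = theta @ x
    # Equation 5: squeeze h into a probability between zero and one
    probability = 1 / (1 + np.exp(-h))
    return 1 if probability >= 0.5 else 0

theta = np.array([0.5, -1.2, 2.0])  # made-up parameters
x = np.array([1.0, 0.8, 1.5])       # x_0 = 1, followed by two features
print(predict(theta, x))            # h = 2.54 >= 0, so we predict 1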

As with linear regression, all that’s left now is finding the values of Theta that minimize our cost function.

Gradient Descent

In part two of the series, we made it a point to emphasize the fact that gradient descent isn’t only used for linear regression. In fact, the algorithm we showed wasn’t generalized enough. Here’s a better way of describing the algorithm: repeat until convergence,

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Where J is any cost function, i.e. a function that communicates how well your parameters are performing, and alpha is the learning rate.
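
As a sketch, the generalized loop can be written as follows, where gradient_of_J is a placeholder for the gradient of whichever cost function you choose:

import numpy as np

def gradient_descent(gradient_of_J, theta, alpha=0.01, iterations=1000):
    # Generic gradient descent: works for any differentiable cost function J
    for _ in range(iterations):
        theta = theta - alpha * gradient_of_J(theta)  # update all Thetas at once
    return theta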

For linear regression, our cost function was the MSE. For logistic regression, we can no longer use it. The steps to derive the cost function for logistic regression are beyond the scope of this article. As such, we’ll provide the equation and leave it up to you to dig deeper:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h(x^{(i)})\right) \right]$$

Equation 6: Logistic Regression Cost Function

Where Theta, x, and y are vectors, m is the number of training examples, x^(i) is the i-th entry in the feature vector x, h(x^(i)) is the i-th predicted value, and y^(i) is the i-th entry in the class vector y, i.e. the i-th actual value. Inserting this into the algorithm defined above will give us a Theta vector that minimizes J.
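
Here’s a minimal sketch of Equation 6 and its gradient in code; the gradient plugs straight into the loop above. X is assumed to be the m-by-(j+1) feature matrix with x_0 = 1 in its first column:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    # Equation 6: the logistic regression cost function
    m = len(y)
    h = sigmoid(X @ theta)  # vector of predicted probabilities
    return -(1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # Gradient of Equation 6 with respect to Theta
    m = len(y)
    return (1 / m) * X.T @ (sigmoid(X @ theta) - y)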

Implementation

Let’s see how we can use Scikit-Learn’s LogisticRegression class and built-in Breast Cancer dataset to find the best parameters for our model. We’ll use the tumor’s radius to predict whether the tumor is malignant or benign.

First, we import the libraries we need:

# Scikit learn's built-in Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
# Library for scikit-learn compatible arrays and matrices
import numpy as np
# Library for plotting nice graphs
import matplotlib.pyplot as plt

Then separate our feature (tumor radius) from our target variable (malignant or benign):

# Loads sklearn's Breast Cancer dataset
dataset = load_breast_cancer()
# Set x as the tumour radius
X = dataset.data[:100,0]
# Set y as the tumour type (malignant (0) or benign (1))
y = dataset.target[:100]

And split them so that 20% of our data is used for testing and the rest for training our model:

# Split data into 20% testing and 80% training
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Our data is now ready for use in our logistic regression model. The first thing we need to do is train it using our training set. Remember that logistic regression is a supervised learning algorithm, meaning it learns from previous data to predict the value of new, incoming, data:

# Train the model with our training set using logistic regression
from sklearn.linear_model import LogisticRegression
# Run Gradient Descent to get the values of Theta
regressor = LogisticRegression()
regressor.fit(X_train.reshape(-1,1),y_train)

We can inspect the values of Theta and Theta_0 obtained:

print(regressor.coef_) # Theta Vector
print(regressor.intercept_) # Theta_0
>> [[-0.8996107]]
>> [11.83182617]
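
We can also quantify how well the model does on the held-out data; scikit-learn’s score method returns the mean accuracy on the test set:

# Mean accuracy of our classifier on the 20% of data held out for testing
print(regressor.score(X_test.reshape(-1,1), y_test))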

Let’s see how well our model will perform against our test set by drawing its graph:

plt.scatter(X_test, regressor.predict(X_test.reshape(-1,1)), color='red')
plt.title('Tumour Type vs Size')
plt.xlabel('Tumor Radius')
plt.ylabel('Malignant or Benign')
plt.show()
Figure 2: Logistic Regression Predictions On Test Data
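
The figure shows the hard zero/one predictions. To visualize the underlying S-shaped probability curve from Equation 5 instead, one option is predict_proba, which returns the predicted probability of each class:

# Sketch: plot the predicted probability of class 1 (benign) across the radius range
radii = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
plt.plot(radii, regressor.predict_proba(radii)[:, 1], color='blue')
plt.scatter(X_test, y_test, color='red')
plt.title('Probability of Benign Tumour vs Radius')
plt.xlabel('Tumor Radius')
plt.ylabel('p(benign)')
plt.show()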

Conclusion

In this article, we went through the theory behind logistic regression and saw how the gradient descent algorithm is used to find the parameters that give us the model that best fits our data points. We also looked at how Scikit-Learn’s LogisticRegression class lets us easily apply this model to a dataset of our choice.

Despite the large amount of information presented in this article, there is much we didn’t cover. Here are some things for you to think about:

  • Why can’t we use the MSE as the cost function? What happens if we run gradient descent using MSE for a classification problem? Try to draw out the curve of J and see what the result is.
  • How can we extend the logic used in this article to use logistic regression for multiple class classification? For example, classifying a product as a fruit, vegetable, or other. Here, there are three different classes.
  • Try using linear regression on a classification problem to see how poor the results will be.
  • Try understanding how we got the cost function for logistic regression.
  • Why was there only one value in our Theta vector?

Past Articles

  1. Part One: Data Pre-Processing
  2. Part Two: Linear Regression Using Gradient Descent: Intuition and Implementation

References

  1. Libre Texts’ The Logistic Equation
  2. Andrew Ng’s Machine Learning Coursera Course
  3. Wikipedia’s Cost Curve
