Gradient Descent for Logistic Regression in Python

Hoang Phong
11 min read · Jul 31, 2021


In supervised machine learning, besides building regression models to predict continuous variables, it is also important to handle the classification task, starting with its simplest form, binary classification, which has two outcomes: the "0" (negative) class or the "1" (positive) class. A common threshold for these problems is 0.5: the model predicts that the input belongs to the positive class if the hypothesis function returns a value higher than 0.5, and to the negative class if the value is smaller than 0.5. If we tried to reuse the hypothesis function of Multivariate Regression, the range of target values would go from negative infinity to positive infinity, making it unreasonable to classify an outcome with a value like 3,265,634. Therefore, for a binary classifier, we need a hypothesis function whose output lies in the range 0 to 1, inclusive. This article will cover how Logistic Regression utilizes Gradient Descent to find the optimized parameters and how to implement the algorithm in Python.

Logistic Regression Intuition

It is always a good idea to go through how Logistic Regression finds a hypothesis function whose result is simultaneously larger than 0 and smaller than 1 before studying how Gradient Descent optimizes the Cost Function. Logistic Regression works much like Linear Regression: the model computes a weighted sum of the input features, but then estimates the probability that a training instance belongs to a specific class instead of returning a raw value the way Linear Regression does. Take a look at its formula:

Logistic Function
  • y or hθ(x) = Hypothesis function (the dependent variable), parameterized by the model parameters theta
  • θ0, θ1,…, θn = Weights or model parameters
  • x1, x2,…, xn = Predictors (the independent variables)
  • n = number of features
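
Putting these pieces together, the hypothesis takes the familiar weighted-sum form passed through the function g (written out here in plain notation for reference):

hθ(x) = g(θ0 + θ1x1 + θ2x2 + … + θnxn)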

Or we can use the vectorized form, with the model's parameter vector and the instance's feature vector. The complete definitions of these vectors are shown here:

Vectorized forms
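
In vectorized notation, with the parameter vector θ = [θ0, θ1, …, θn]ᵀ and the feature vector x = [1, x1, …, xn]ᵀ (the leading 1 corresponds to x0), the same hypothesis reads:

hθ(x) = g(θᵀx)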

The function g() is a sigmoid function that squashes the hypothesis into a number between 0 and 1. The sigmoid function is defined as follows:
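
In its standard form, the sigmoid function is:

g(z) = 1 / (1 + e^(-z))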

Sigmoid Function Graph

An interesting characteristic of this sigmoid curve is that the function takes the value 0.5, our threshold, when z = 0. More specifically, with a positive input the function generates a probability larger than 0.5 (i.e., the positive class is predicted), while with a negative input the negative class is predicted.

Model Prediction for Logistic Regression
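
Concretely, the prediction rule with the 0.5 threshold can be summarized as:

ŷ = 1 if hθ(x) ≥ 0.5 (equivalently, θᵀx ≥ 0)
ŷ = 0 if hθ(x) < 0.5 (equivalently, θᵀx < 0)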

Just as with Multivariate Regression, we need a way to evaluate the performance of Logistic Regression before applying Gradient Descent to optimize the model's parameters. Let's jump to our cost function.

The cost function

At this point, machine learning newcomers may find it difficult to figure out how to build a cost function for a binary classification problem when the target only contains two values (0 and 1). The idea of evaluating the Logistic model is to measure the discrepancy between the actual outcome and the predicted probability. An important reminder here is that we use the value generated by the hypothesis function before it is turned into a prediction. This is because if two models return estimated probabilities of 0.6 and 0.9 for a positive instance, their predictions both belong to the positive class, yet we would much rather use the 0.9 model than the 0.6 one. Here is the formula for assessing one training instance.

Cost function for a single training input
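
For reference, this per-instance cost is the standard log-loss, written out in plain notation:

Cost(hθ(x), y) = -log(hθ(x))     if y = 1
Cost(hθ(x), y) = -log(1 - hθ(x)) if y = 0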

To gain a deeper understanding of why these functions are chosen, let's visualize the functions -log(x) and -log(1-x).

-log(x) Graph
-log(1-x) Graph

(Note: All the code for the graphs in the article can be found here:)

From the visualization, we can clearly observe that these functions are an appropriate choice for evaluating the prediction on a single instance. If the original input belongs to the positive class (the -log(x) graph), then as its projected probability comes close to 0 (i.e., a wrong prediction), the cost grows toward positive infinity. The same curve shows that the loss is close to 0 if the prediction nears 1; in other words, the model is heavily penalized when the estimated probability moves toward 0. The same logic applies to the negative class case, in which the model is punished (a large cost) if it makes a wrong prediction and rewarded (a small cost) when the prediction is close to 0.

That's how we determine the loss incurred by a single input. By combining the two cases, the cost function for the whole model can be expressed in a single formula:

Cost function for Logistic Regression
  • J(θ) = The cost function, which takes theta as input
  • m = number of instances
  • x(i) = input (features) of the i-th training example
  • y(i) = output (label) of the i-th training example
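
Written out, this is the standard average log-loss over the training set:

J(θ) = -(1/m) · Σ [ y(i)·log(hθ(x(i))) + (1 - y(i))·log(1 - hθ(x(i))) ]

where the sum runs over i = 1, …, m.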

The intuition behind this equation is still to calculate the average cost over all training instances, and by weighting the two terms with y and (1-y), the function is guaranteed to use the appropriate expression for each actual label. The model is good only if its cost is small; therefore, our goal is to adjust the model's parameter vector to minimize the cost function by using the Gradient Descent algorithm.

How does Gradient Descent work in Logistic Regression?

In optimizing Logistic Regression, Gradient Descent works pretty much the same as it does for Multivariate Regression. In short, the algorithm simultaneously updates all the theta values at each iteration in order to find the minimum of our cost function. Gradient Descent is able to perform this task because taking steps in the opposite direction of the gradient gradually leads us toward a minimum, and since the logistic cost function is convex, that minimum is the global one.

Updating theta values
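
In plain notation, each parameter is nudged by a step proportional to the partial derivative of the cost:

θj := θj - α · ∂J(θ)/∂θj (for every j = 0, 1, …, n, updated simultaneously)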

The 𝛼 symbol represents the learning rate of the algorithm, which controls how fast the model learns. Choosing an appropriate learning rate is essential, as it ensures that our cost function converges in a reasonable time. If the model fails to converge, or takes too long to reach its minimum, the learning rate is probably a poor choice. In the Gradient Descent algorithm, the learning rate is just a constant; hence, after taking the partial derivative of the cost function, the update rule for each parameter becomes the following (using the notation defined below):

  • x^(i)_j = value of feature j in i-th training example
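
With this notation, the standard differentiated update rule reads:

θj := θj - (α/m) · Σ ( hθ(x^(i)) - y^(i) ) · x^(i)_j

where the sum again runs over the m training examples.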

Instead of calculating each theta individually, we can build the gradient vector and compute them all in one step. Before coming to the gradient vector, it is necessary to define the last component, carried over from Multivariate Regression: the matrix X.

X is the m × (n + 1) matrix that contains all the feature values in the dataset (its i-th row is [1, x1(i), …, xn(i)]), not including the outcome values.

We can then write the formula of the gradient vector for the cost function:

Gradient Vector
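
In compact form, with the sigmoid g applied element-wise and y denoting the vector of target values, the gradient vector is:

∇θ J(θ) = (1/m) · Xᵀ ( g(Xθ) - y )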

Therefore, to finalize our theta update rule in vectorized form, we have:
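
θ := θ - α · ∇θ J(θ)

This single vectorized step replaces updating every θj one by one.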

OK! We have now been through all the formulas needed to implement Gradient Descent for Logistic Regression in Python. Let's jump straight into the coding part.

Implementing Gradient Descent for Logistic Regression in Python

Normally, the independent variable set is not too difficult for a Python coder to identify and split away from the target set. However, since x_0 is a dummy feature that always equals 1, datasets do not include it. Hence, our first task is to add a new column containing all ones to the independent variable matrix, matching the definition of matrix X above. (The imports below are used throughout the rest of the code.)

import math
import numpy as np
import matplotlib.pyplot as plt

def generateXvector(X):
    """ Take the original independent variables matrix and add a column of 1s, which corresponds to x_0
    Parameters:
      X: independent variables matrix
    Return value: the matrix that contains all the feature values in the dataset, not including the outcome variables.
    """
    vectorX = np.c_[np.ones((len(X), 1)), X]
    return vectorX

Every Gradient Descent algorithm needs to start somewhere before iteratively moving toward the minimum of the function. Therefore, we need to generate a vector that contains an initial guess of theta.

def theta_init(X):
    """ Generate an initial value of vector θ from the original independent variables matrix
    Parameters:
      X: independent variables matrix
    Return value: a vector of theta filled with initial guesses
    """
    theta = np.random.randn(len(X[0]) + 1, 1)
    return theta

It would be cluttered to compute the sigmoid value inline within the Gradient Descent function, so it is better to separate these two processes.

def sigmoid_function(X):
    """ Calculate the sigmoid value of the inputs
    Parameters:
      X: values
    Return value: the sigmoid value
    """
    return 1/(1 + math.e**(-X))
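
As a quick sanity check (just a usage sketch, not part of the original pipeline), the helper maps 0 to 0.5 and squashes large positive and negative inputs toward 1 and 0:

sigmoid_function(0)                     # 0.5
sigmoid_function(np.array([-5, 0, 5]))  # roughly array([0.0067, 0.5, 0.9933])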

Finally, with enough preparation, we can build our own Logistic Regression function. It takes the feature matrix and target values as its training data, plus a learning rate and a number of iterations. It first reshapes y to match the dimensions of the target vector in the gradient vector formula. Then, for each iteration, it computes the gradient and updates the model parameter vector, giving a slightly better model, and it calculates and stores the cost value so the cost function can be plotted later.

def Logistics_Regression(X, y, learningrate, iterations):
    """ Find the Logistic Regression model for the data set
    Parameters:
      X: independent variables matrix
      y: dependent variables matrix
      learningrate: learning rate of Gradient Descent
      iterations: the number of iterations
    Return value: the final theta vector and the plot of the cost function
    """
    y_new = np.reshape(y, (len(y), 1))
    cost_lst = []
    vectorX = generateXvector(X)
    theta = theta_init(X)
    m = len(X)
    for i in range(iterations):
        # Gradient step; the derived gradient is (1/m) * X^T (sigmoid(X.theta) - y),
        # so the extra factor of 2 here only rescales the learning rate
        gradients = 2/m * vectorX.T.dot(sigmoid_function(vectorX.dot(theta)) - y_new)
        theta = theta - learningrate * gradients
        y_pred = sigmoid_function(vectorX.dot(theta))
        # Average cross-entropy cost over all training instances for this iteration
        cost_value = -np.sum(np.dot(y_new.T, np.log(y_pred)) + np.dot((1 - y_new).T, np.log(1 - y_pred))) / len(y_pred)
        cost_lst.append(cost_value)
    plt.plot(np.arange(1, iterations), cost_lst[1:], color='red')
    plt.title('Cost function Graph')
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    return theta

Examining our Gradient Descent algorithm for Logistic Regression

Amazing! We have just finalized our implementation of Gradient Descent for Logistic Regression. Our next task is to check our code by comparing its fitted parameters with the results of LogisticRegression from Scikit-Learn. In general, for a classification task, it is more appropriate to use the accuracy score to evaluate model performance, so it will take a little more coding to produce the accuracy of our algorithm. We will use the famous classification data set, the Iris dataset from Scikit-Learn. The goal of this data set is to distinguish three different species of Iris: Iris Setosa, Iris Versicolor, and Iris Virginica.


However, for the purpose of this article, we are not going to classify all three species, as Logistic Regression is inherently a binary classifier. Therefore, our goal is to determine whether an input is an Iris Setosa (target 0 in the dataset) or not.

from sklearn import datasets
iris = datasets.load_iris()
X = iris["data"]
y = (iris["target"] == 0).astype(int)  # return 1 if Iris Setosa, else 0

In order to obtain a fair evaluation, we split the dataset into a training set and a test set, then apply Feature Scaling for better convergence of the algorithm.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

We use the LogisticRegression class from Scikit-Learn to train on the data.

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0, penalty = 'none')
classifier.fit(X_train, y_train)
classifier.intercept_, classifier.coef_
>>> (array([-11.07402312]),
array([[ -1.32289075, 4.23503694, -10.11887281, -9.22137322]]))

Here is the accuracy score produced by the built-in function.

y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
accuracy_score(y_test, y_pred)
>>> 1.0

Wow! The Logistic Regression model has a perfect score when determining whether a flower is an Iris Setosa. Note that I set the penalty parameter to "none" so that no regularization is applied to the model. Since the data set is not very complicated (a small number of instances and features), the model still easily manages to find a hyperplane that perfectly separates the two classes.

Still, matching that perfect classification is a challenge for our own implementation. Let's write some code to compute the accuracy score of our model.

def column(matrix, i):
    """ Return all the values in a specific column
    Parameters:
      matrix: the input matrix
      i: the column index
    Return value: a list with the desired column
    """
    return [row[i] for row in matrix]

def accuracy_LR(X, y, learningrate, iteration, X_test, y_test):
    """ Return the accuracy score for a trained model
    """
    ideal = Logistics_Regression(X, y, learningrate, iteration)
    # Compute theta_0 + theta_1*x_1 + ... + theta_n*x_n for every test instance
    hypo_line = ideal[0]
    for i in range(1, len(ideal)):
        hypo_line = hypo_line + ideal[i] * column(X_test, i - 1)
    logistic_function = sigmoid_function(hypo_line)
    # Apply the 0.5 threshold to turn probabilities into class predictions
    for i in range(len(logistic_function)):
        if logistic_function[i] >= 0.5:
            logistic_function[i] = 1
        else:
            logistic_function[i] = 0
    # Pair each prediction with its actual label and count the matches
    last1 = np.concatenate((logistic_function.reshape(len(logistic_function), 1),
                            y_test.reshape(len(y_test), 1)), 1)
    count = 0
    for i in range(len(y_test)):
        if last1[i][0] == last1[i][1]:
            count = count + 1
    acc = count / len(y_test)
    return acc

Let's first see how our optimization is doing:

Logistics_Regression(X_train,y_train, 1, 1000000)
>>> array([[-10.81166363],
[ -2.36666046],
[ 5.55693637],
[-10.29389333],
[ -9.64684255]])

Our parameters are really close to those of the built-in implementation. What about our accuracy score?

accuracy_LR(X_train,y_train, 1, 1000000,X_test, y_test)
>>> 1.0

Perfect! We also achieve a 1.0 accuracy score. Our algorithm runs smoothly and matches the built-in result!

Final Thoughts

First of all, thank you all for reading till the end. It means so much to me. The code for this article can be found here.

In my opinion, the way Gradient Descent works in Logistic Regression is essentially the same as its operation for Multivariate Regression, so understanding one of them will pretty much carry you through the other. These two algorithms are among the most fundamental for machine learning as well as deep learning, so do make sure you get the intuition of applying Gradient Descent to them. Any feedback, comments, or thoughts would be much appreciated.

If you love machine learning, data science, or today's technical problems in general, feel free to connect with me on LinkedIn.
