Logistic Regression as a Neural Network

Rochak Agrawal
Analytics Vidhya


Logistic regression is a statistical method used for prediction when the dependent variable, or output, is categorical. It is used when we want to know whether a particular data point belongs to class 0 or class 1. In logistic regression, we need to find the probability that the output is y=1 given an input vector x; y’ denotes the predicted value when the input is x. Mathematically, it can be defined as:
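In standard notation, the prediction is the conditional probability of the positive class:

y' = P(y = 1 \mid x), \quad 0 \le y' \le 1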

Mathematical Model

Input: X is an input matrix of dimensions n x m, where n is the number of features in X and m is the number of training examples.

Parameters: W is a Weight Matrix of dimensions n x 1, where n is the number of features in X. The Bias b helps control the value at which the activation function triggers.

Output:
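In the standard formulation, the output is a linear combination of the inputs passed through the sigmoid activation \sigma described in the next section:

z = W^T X + b

y' = \sigma(z)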

Activation Function

Activation functions are essential for an Artificial Neural Network to learn and make sense of complicated relationships in the data. They introduce non-linear properties into the network. Their main purpose is to convert the input signal of a node into an output signal, which is then used as input by the next layer of the Neural Network. The activation used above is the sigmoid activation function. Mathematically, it can be defined as:
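\sigma(z) = \frac{1}{1 + e^{-z}}

The sigmoid squashes any real-valued z into the range (0, 1), which is what lets us interpret y' as a probability.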

Loss Function

Loss can be defined as the error between the actual output and the predicted output. We want the value of the loss function to be as low as possible, since a lower loss means the predicted value is closer to the actual value. The loss function used to train a neural network varies from case to case, so it is important to select a proper loss function for our use case so that the network is trained properly. The loss function we are going to use for logistic regression can be mathematically defined as:
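L(y', y) = -\left( y \log y' + (1 - y) \log(1 - y') \right)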

Let us study why this loss function is a good fit for logistic regression:

  1. When y=1, the loss function reduces to L(y’,y) = -log y’. Since we want the loss to be small, log y’ should be large, which happens when y’ is large, i.e. close to 1, so the predicted value is close to the actual value.
  2. When y=0, the loss function reduces to L(y’,y) = -log(1-y’). Since we want the loss to be small, log(1-y’) should be large, which happens when y’ is small, i.e. close to 0, so the predicted value is close to the actual value.
  3. The above loss function is convex, which means it has a single global minimum, so the network won’t get stuck in the local minima that non-convex loss functions can have.
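As a quick sanity check, here is a minimal NumPy sketch (the helper name loss is my own) that evaluates this loss for predictions close to and far from the true label:

```python
import numpy as np

def loss(y_pred, y_true):
    # Binary cross-entropy loss for a single example
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(loss(0.99, 1))  # ~0.01 -> prediction close to the actual label, small loss
print(loss(0.01, 1))  # ~4.61 -> prediction far from the actual label, large loss
print(loss(0.01, 0))  # ~0.01 -> small loss
print(loss(0.99, 0))  # ~4.61 -> large loss
```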

Cost Function

The loss function is computed for each individual training example during the training process, whereas the cost function is computed over the whole training dataset in one iteration. In other words, the cost function is the average of the loss values over the whole training dataset. Mathematically, it can be defined as:
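J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(y'^{(i)}, y^{(i)})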

In the above equation, m is the total number of training examples. The objective of training the network is to find the Weight Matrix W and Bias b such that the value of the cost function J is minimized.

Gradient Descent

The Weight Matrix W is randomly initialized. Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function; we apply it to the Cost Function J to obtain the optimal Weight Matrix W and Bias b. Mathematically, it can be defined as:
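W := W - \alpha \frac{\partial J}{\partial W}

b := b - \alpha \frac{\partial J}{\partial b}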

The first equation represents the change in the Weight Matrix W, whereas the second equation represents the change in the Bias b. The size of the change is determined by the learning rate alpha and the derivatives of the cost J with respect to the Weight Matrix W and Bias b. We repeat the updates of W and b until the Cost Function J is minimized. Now let us understand how gradient descent works with the help of the following graph:

Case 1. Assume that W was initialized with values smaller than those at which J achieves its global minimum. The slope at that point, i.e. the partial derivative of J with respect to W, is negative, and hence the weight values will increase according to the gradient descent update equation.

Case 2. Assume that W was initialized with values larger than those at which J achieves its global minimum. The slope at that point, i.e. the partial derivative of J with respect to W, is positive, and hence the weight values will decrease according to the gradient descent update equation.

Accordingly, both W and b will reach their optimal values and the value of the cost function J will be minimized.

Logistic Regression using Gradient Descent

So far, we have understood the mathematical model of both logistic regression and gradient descent. In this section, we will see how we can use gradient descent to learn the Weight Matrix W and Bias b in the context of logistic regression. Let us summarize all the equations that we know so far.
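Using the notation above, these are:

z = W^T X + b

a = y' = \sigma(z) = \frac{1}{1 + e^{-z}}

L(a, y) = -\left( y \log a + (1 - y) \log(1 - a) \right)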

  1. The first equation computes z as the linear combination of the input X with the Weight Matrix W plus the Bias b.
  2. The second equation is the sigmoid activation function, which introduces non-linearity.
  3. The third equation is the loss function, which calculates the loss between the actual output y and the predicted output y’.

These equations can be modelled using a graph known as a Computation Graph. For simplicity, let us assume that there are 2 features, x1 and x2, in a given input X. Accordingly, there will be 2 weights, w1 and w2, in the Weight Matrix W. The Computation Graph for this scenario can be defined as:
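A rough textual sketch of that graph: the inputs x1, x2 and the parameters w1, w2, b feed into z = w1·x1 + w2·x2 + b, which passes through a = σ(z), which in turn feeds the loss L(a, y).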

In the above figure, forward propagation (black arrows) is used to predict y’ and backward propagation (red arrows) is used to update the weights w1 and w2 and the bias b. As we saw in gradient descent, we need to calculate the derivatives of the weights and bias in order to update them. Using the computation graph makes it easy to calculate these derivatives. As the loss L depends on a, we first calculate the derivative da, which represents the derivative of L with respect to a, using equation 3 as follows:
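da = \frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}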

After that, we calculate the derivative dz, which represents the derivative of L with respect to z, using equation 2. This can be done using the chain rule as follows:
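dz = \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} = da \cdot a(1 - a) = a - y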

Similarly, we can find the derivatives dw1, dw2 and db using equation 1 and the chain rule. The values of these derivatives are as follows:
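dw1 = x1 \cdot dz, \quad dw2 = x2 \cdot dz, \quad db = dz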

Now that we have found all the derivatives, we update the weight and bias values with the help of gradient descent as follows:
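w1 := w1 - \alpha \cdot dw1, \quad w2 := w2 - \alpha \cdot dw2, \quad b := b - \alpha \cdot db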

With this, we have applied gradient descent to logistic regression and studied how the Weight Matrix W and Bias b are updated so that the loss L is minimized.

I suggest that readers who know calculus work through the math and derive the above equations themselves for a proper understanding of the topic.

Note that all the above calculations are for a single training example, but during the training process we have to apply these steps to all the training examples in a single iteration. Let’s figure out how we do that. We know that the Cost Function J is mathematically defined as follows:
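J = \frac{1}{m} \sum_{i=1}^{m} L(a^{(i)}, y^{(i)})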

During the training process, we need to minimize the cost J. This can be done by minimizing the loss L for each training example i over all the m training examples. As the loss function L itself depends on the weights w1 and w2 and the bias b, we minimize J with respect to w1, w2 and b as follows:
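\frac{\partial J}{\partial w1} = \frac{1}{m} \sum_{i=1}^{m} dw1^{(i)}, \quad \frac{\partial J}{\partial w2} = \frac{1}{m} \sum_{i=1}^{m} dw2^{(i)}, \quad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} db^{(i)}

That is, because J is the average of the per-example losses, each of its derivatives is the average of the corresponding per-example derivatives.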

Summary

The above sections explain the math behind how neural networks work, using logistic regression as the example. Let us sum up how we can implement logistic regression as a neural network in a few lines as follows:
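Here is a minimal NumPy sketch of one such training step over all m examples (the function name train_step and the variable names are my own, assuming X has shape n x m and Y has shape 1 x m as defined above):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_step(X, Y, W, b, alpha):
    """One iteration of gradient descent over all m training examples.

    X: (n, m) inputs, Y: (1, m) labels, W: (n, 1) weights, b: scalar bias.
    """
    m = X.shape[1]

    # Forward propagation: z = W^T X + b, a = sigmoid(z)
    Z = W.T @ X + b
    A = sigmoid(Z)

    # Cost J: average binary cross-entropy loss over all m examples
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

    # Backward propagation: derivatives averaged over all examples
    dZ = A - Y
    dW = (X @ dZ.T) / m
    db = np.mean(dZ)

    # Gradient descent update
    W = W - alpha * dW
    b = b - alpha * db
    return W, b, J
```

Repeating train_step for many iterations drives the cost J down, exactly as described in the gradient descent section.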

This is the computation done in a single step of training over all the training examples. During training, we need to perform all the above steps for many iterations, which can range from a thousand to a million depending on the task. Alternatively, we can stop the training process when the accuracy no longer improves or the cost J stops decreasing, removing the need to specify the number of iterations in advance. With improvements in hardware and software driven by the rise of Deep Learning, the time needed for such a huge amount of computation has dropped drastically.

Thus, in this story, we studied the math behind neural networks in the context of logistic regression.

I would like to thank the readers for reading the story. If you have any questions or doubts, feel free to ask them in the comments section below. I’ll be more than happy to answer them and help you out. If you like the story, please follow me to get regular updates when I publish a new story. I welcome any suggestions that will improve my stories.
