Logistic Regression as Neural Network, inspired by Andrew Ng

Sadia Afrin
4 min read · Jun 12, 2020


How a computer represents an image

Logistic regression is a supervised learning algorithm used for binary classification. We can also see logistic regression as a very small neural network.

Training set:

Suppose we have an input image of 64 pixels by 64 pixels. In that case, we would have three 64 by 64 matrices corresponding to the red, green and blue pixel intensity values of the image. So the total number of input values would be 3 * 64 * 64 = 12,288. The first thing we will do is unroll the picture into a single vector called X to give it a mathematical shape. X is basically a one-dimensional vector containing every pixel intensity value you can see in the picture above.

X = [306
376
451 …]

So we can consider n_x = 12,288 as the dimension of the input feature vector X. Our goal is to use this input to predict whether the picture is a cat picture or not.
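To make this unrolling step concrete, here is a minimal NumPy sketch (my own illustration, assuming a hypothetical 64 by 64 RGB image stored as a (64, 64, 3) array):

import numpy as np

# Hypothetical 64 x 64 RGB image with pixel intensities in [0, 255].
image = np.random.randint(0, 256, size=(64, 64, 3))

# Unroll the picture into a single column vector X of shape (12288, 1).
X = image.reshape(-1, 1)
print(X.shape)  # (12288, 1)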

Let’s say we have a training set where a picture is labeled 1 if it is a cat picture and 0 otherwise. Now we will build a model using logistic regression and see how well it can predict the label of an image. One thing to remember is that we want our model to predict an output, so we will always treat its output as a prediction. It’s like asking the model, “what is the chance that this picture is a cat picture?”

The formula of prediction:

Given the parameters w and b of logistic regression, we will compute the value of y hat: y^ = w^T x + b. This is the formula we often use for linear regression, but it’s not a good idea to use the same function for binary classification. Why? In binary classification we want y^ to be comparable with 1 or 0, and it’s quite difficult to expect that here, because w^T x + b can produce a very large value or even a negative one.

Sigmoid function:

To tackle this issue we will use the sigmoid function, which keeps the value of y^ between 0 and 1 no matter what: y^ = sigmoid(w^T x + b). For a very negative input it will be close to 0, and for a very large input it will be close to 1.

A = sigmoid(np.dot(w.T,X) + b)
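The sigmoid function itself is not defined in the snippet above; a minimal implementation could look like this (a sketch of my own, not code from the course):

import numpy as np

def sigmoid(z):
    # Squashes any real value (or NumPy array) into the range (0, 1).
    return 1 / (1 + np.exp(-z))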

Computing cost:

Now the next thing we can do is define a loss function L to measure how good our output y^ is when the actual label is y. For a single training example we use the cross-entropy loss, L(y^, y) = -(y log(y^) + (1 - y) log(1 - y^)). If we can get L for a single training example, we can also find the total cost over all training examples by averaging the losses. We call this the cost function J(w, b).

# m is the number of training examples; cost corresponds to J(w, b) above.
cost = (-1 / m) * (np.dot(Y, np.log(A).T) + np.dot((1 - Y), np.log(1 - A).T))
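For this vectorized line to work, the shapes have to line up. Here is a small sketch of the shapes I am assuming (m images stacked as columns of X, labels in a row vector Y, and zero-initialized parameters; the tiny numbers are only for illustration):

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

m, n_x = 5, 12288                      # tiny example: 5 unrolled 64 x 64 x 3 images
X = np.random.rand(n_x, m)             # each column is one image, values scaled to [0, 1]
Y = np.random.randint(0, 2, (1, m))    # labels: 1 = cat, 0 = not cat
w = np.zeros((n_x, 1))                 # one weight per pixel value
b = 0.0                                # bias

A = sigmoid(np.dot(w.T, X) + b)        # predictions, shape (1, m)
cost = (-1 / m) * (np.dot(Y, np.log(A).T) + np.dot((1 - Y), np.log(1 - A).T))
print(np.squeeze(cost))                # about 0.693 with zero-initialized parameters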

Gradient descent:

We talked about training examples, the loss function and the cost function. Now we will talk about gradient descent, which trains our parameters w and b so that we reach the lowest possible cost.

Here is an illustration of gradient descent. Our target is to find the w and b that give us the minimum of the cost function J(w, b).

Gradient Descent

Our cost function is a surface above the plane of w and b. The cost function J is basically a convex function, which looks like a bowl. We can initialize the values of w and b randomly (or simply to zeros). With gradient descent, we move in the downhill direction in each iteration, in the hope of reaching the global minimum (the red dot in the picture above).

Computing the cost function is called forward propagation, because we are moving forward through the computation. While doing forward propagation we will also keep track of the derivatives of the cost with respect to w and b, using backward propagation. Why?

# backward propagation
dw = (1 / m) * np.dot(X, (A - Y).T)
db = (1 / m) * np.sum(A - Y)

The answer to the question is that in each iteration, as we get closer to the optimum values of w and b, we also update w and b, and to update them we need their derivatives. See the code below. Here we use a learning rate, which refers to the amount by which the weights are updated in each step. I won’t elaborate on the term here.

# Updating w and b
w = w - learning_rate*dw
b = b - learning_rate*db
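Putting forward propagation, backward propagation and the update step together, a minimal training loop might look like the sketch below (the function name train and the hyperparameter values are my own assumptions, not part of the original code):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, Y, num_iterations=2000, learning_rate=0.005):
    # X has shape (n_x, m): one column per unrolled training image.
    # Y has shape (1, m): 1 = cat, 0 = not cat.
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for i in range(num_iterations):
        # Forward propagation: compute predictions and cost.
        A = sigmoid(np.dot(w.T, X) + b)
        cost = (-1 / m) * (np.dot(Y, np.log(A).T) + np.dot((1 - Y), np.log(1 - A).T))
        # Backward propagation: gradients of the cost with respect to w and b.
        dw = (1 / m) * np.dot(X, (A - Y).T)
        db = (1 / m) * np.sum(A - Y)
        # Gradient descent update: take one step downhill.
        w = w - learning_rate * dw
        b = b - learning_rate * db
        if i % 100 == 0:
            print(f"cost after iteration {i}: {cost.item():.4f}")
    return w, b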

Predict:

The last step will be to predict the output, where we reuse the sigmoid function we mentioned initially, but with the updated w and b that give a lower cost.

A = sigmoid(np.dot(w.T,X) + b)
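To turn these probabilities into hard 0 or 1 labels, the usual convention (an assumption on my part, not stated above) is to threshold at 0.5:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(w, b, X):
    # Probability that each image (each column of X) is a cat.
    A = sigmoid(np.dot(w.T, X) + b)
    # Threshold at 0.5 to get hard labels: 1 = cat, 0 = not cat.
    return (A > 0.5).astype(int)

# Usage sketch: Y_prediction = predict(w, b, X_test), where X_test holds unrolled test images.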

One thing that I would like to mention is that a lower cost doesn’t mean a better model. We have to check whether the model is overfitting. Overfitting is when the trained model gives much higher accuracy on the training set than on the test set. We will discuss this in our next story.

Thank you for reading.
