Deep Learning Specialization

Neural Networks and Deep Learning Course Notes

Madhuri Jain
Analytics Vidhya
Dec 28, 2020


I started writing notes for this course, and this article contains the Week 2 notes. If you have not yet gone through the Week 1 notes, please check out this article.

Week 2

Binary Classification

Logistic regression is an algorithm for binary classification. Let's take the problem of identifying cats in input images. Here the input x is an image and the output label y is either 0 or 1. Further, let's look at how images are represented on a computer. They are stored as three separate channels of red, green, and blue color intensities. So if the image is 64 x 64 pixels, we have three 64 x 64 matrices corresponding to the red, green, and blue pixel intensity values of the image. To turn these pixel intensities into a feature vector, we unroll them into an input feature vector x: first all the pixel values of the red channel, then those of the green channel, and finally those of the blue channel. So if the image is 64 x 64 pixels, the total dimension of the input vector x is 64 x 64 x 3 = 12,288.
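As a concrete illustration, here is a minimal NumPy sketch of this unrolling step; the randomly generated channels simply stand in for a real cat photo:

    import numpy as np

    # Three stand-in 64 x 64 matrices of red, green, and blue intensities.
    red = np.random.randint(0, 256, size=(64, 64))
    green = np.random.randint(0, 256, size=(64, 64))
    blue = np.random.randint(0, 256, size=(64, 64))

    # Unroll channel by channel: all red values, then green, then blue.
    x = np.concatenate([red.ravel(), green.ravel(), blue.ravel()]).reshape(-1, 1)

    print(x.shape)  # (12288, 1), since 64 * 64 * 3 = 12,288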

Notations

  • A single training example is represented by a pair (x, y), where x is an n_x-dimensional feature vector and y is the output label in {0, 1}.
  • m is the number of training examples: {(x¹, y¹), (x², y²), …, (xᵐ, yᵐ)}.
  • To combine all the training examples in one notation, we use X, defined by stacking the input feature vectors in columns: X = [x¹ x² … xᵐ]. Thus, X is an (n_x, m)-dimensional matrix.
  • Similarly, Y = [y¹, y², y³, …, yᵐ] is a (1, m)-dimensional matrix.
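In NumPy terms, a small sketch with made-up numbers shows what these shapes look like:

    import numpy as np

    n_x = 12288   # features per example (64 * 64 * 3)
    m = 100       # number of training examples (illustrative)

    # Each of the m feature vectors becomes one column of X.
    X = np.random.rand(n_x, m)
    # One label per example, laid out as a single row.
    Y = np.random.randint(0, 2, size=(1, m))

    print(X.shape)  # (12288, 100): an (n_x, m) matrix
    print(Y.shape)  # (1, 100): a (1, m) matrix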

Logistic Regression

Logistic regression is the learning algorithm we use when the output y is either 0 or 1. Given an input feature vector x, we want ŷ (y-hat), an estimate of the probability that y is equal to 1.

Given the parameters w and b, we can rewrite the linear regression formula using the sigmoid function as follows:

ŷ = σ(wᵀx + b)

In the above formula, wᵀx + b is denoted by z. If z is plotted on the horizontal axis, the function sigmoid of z looks like the following image. It goes smoothly from zero up to one and crosses the vertical axis at 0.5.

Image Credit: Andrew Ng

The sigmoid of z is given by σ(z) = 1/(1 + e⁻ᶻ). If the value of z is very large, e⁻ᶻ becomes very small, so the overall value approaches 1. If z is a very large negative number, e⁻ᶻ becomes a huge number, driving the overall value toward 0. So while implementing logistic regression, our job is to learn parameters w and b so that ŷ becomes a good estimate of the chance of y being equal to one.
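A minimal NumPy sketch of the sigmoid and its limiting behaviour:

    import numpy as np

    def sigmoid(z):
        # Maps any real z into the interval (0, 1).
        return 1 / (1 + np.exp(-z))

    print(sigmoid(10))   # ~0.99995: large positive z drives the output toward 1
    print(sigmoid(-10))  # ~0.00005: large negative z drives the output toward 0
    print(sigmoid(0))    # 0.5: the curve crosses the vertical axis at 0.5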

Logistic Regression Cost Function

To train the parameters w and b of the logistic regression model, we need to define a cost function. We start with the loss function, which measures how good the prediction ŷ is when the true label is y:

L(ŷ, y) = -(y log(ŷ) + (1 - y) log(1 - ŷ))

We want this loss to be as small as possible, which means ŷ has been predicted close to the true label. To see why this loss function makes sense, let's take two scenarios.

  1. When y = 1, substituting into the above equation gives L(ŷ, y) = -log(ŷ). So if y = 1, we want ŷ to be as big as possible. Since logistic regression uses the sigmoid function, ŷ cannot exceed 1, so it has to be close to 1.
  2. When y = 0, substituting into the equation gives L(ŷ, y) = -log(1 - ŷ). So if y = 0, we want log(1 - ŷ) to be as large as possible (note the negative sign), and thus ŷ should be as small as possible, i.e. close to 0.

The loss above is defined for a single training example. For the whole training set, we define the cost function as the average of the per-example losses:

J(w, b) = (1/m) ∑ L(ŷⁱ, yⁱ), summing over i = 1, …, m
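As a small sketch, the cost can be computed in NumPy like this, assuming the predictions Y_hat and labels Y are (1, m) arrays (the values below are made up):

    import numpy as np

    def cost(Y_hat, Y):
        # Average the per-example losses over all m training examples.
        m = Y.shape[1]
        losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
        return np.sum(losses) / m

    Y = np.array([[1, 0, 1]])
    Y_hat = np.array([[0.9, 0.1, 0.8]])
    print(cost(Y_hat, Y))  # ~0.14: confident, correct predictions give a small cost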

Gradient Descent

Image Credit: Andrew Ng

In the illustrated diagram, the horizontal axes represent the parameters w and b, and the vertical axis represents the cost function J(w, b). Our goal is to find the values of w and b at which the cost function is minimal.

The cost function J(w, b) used here is a convex function, so its surface looks like one big bowl with a single global minimum, as opposed to a surface with several local minima.

Further, to find the best parameter values, we first initialize them to some starting values; with logistic regression, they are normally initialized to zero. Gradient descent starts at that initial point and steps downhill in the direction of steepest descent, as quickly as possible, to reach the global minimum represented in the diagram by a red dot. Each iteration updates w and b as follows:

w := w - α (∂J/∂w)
b := b - α (∂J/∂b)

Here α (alpha) is the learning rate, which controls the size of each step. Together with the learning rate, the partial derivatives of the cost with respect to w and b determine how w and b are updated.
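A minimal sketch of one update step, assuming the gradients dw and db have already been computed (the random values here are placeholders; computing the real gradients is covered in the next section):

    import numpy as np

    alpha = 0.01                   # learning rate (illustrative value)
    w = np.zeros((12288, 1))       # weights initialized to zero
    b = 0.0                        # bias initialized to zero

    dw = np.random.rand(12288, 1)  # placeholder for dJ/dw
    db = 0.5                       # placeholder for dJ/db

    w = w - alpha * dw             # w := w - alpha * dJ/dw
    b = b - alpha * db             # b := b - alpha * dJ/db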

Python and Vectorization

While calculating the loss and gradients for m training examples, one would naturally iterate over each example with a for loop, but explicit for loops make the algorithm run inefficiently. With deep learning algorithms we keep moving to bigger and bigger datasets, so running the algorithm without explicit for loops becomes extremely important and helps it scale to much larger datasets. One really important technique that comes to the rescue here is vectorization.

Non-Vectorized Implementation of logistic regression in python:
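A minimal sketch, assuming X is (n_x, m), Y is (1, m), and w is (n_x, 1); the helper name propagate_loop is my own:

    import numpy as np

    def propagate_loop(w, b, X, Y):
        # Forward and backward pass with explicit for loops.
        n_x, m = X.shape
        J = 0.0
        dw = np.zeros((n_x, 1))
        db = 0.0
        for i in range(m):                      # loop over the m examples
            z = np.dot(w.ravel(), X[:, i]) + b  # z = w^T x + b (a scalar)
            a = 1 / (1 + np.exp(-z))            # a = sigmoid(z)
            J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
            dz = a - Y[0, i]                    # dL/dz for this example
            for j in range(n_x):                # loop over the n_x features
                dw[j, 0] += X[j, i] * dz
            db += dz
        return J / m, dw / m, db / m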

Vectorized Implementation of logistic regression in python:
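The same computation with the loops replaced by matrix operations, under the same shape assumptions:

    import numpy as np

    def propagate_vectorized(w, b, X, Y):
        # Same forward and backward pass with no explicit for loops.
        m = X.shape[1]
        Z = np.dot(w.T, X) + b    # (1, m): every z at once
        A = 1 / (1 + np.exp(-Z))  # (1, m): every prediction
        J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
        dZ = A - Y                # (1, m)
        dw = np.dot(X, dZ.T) / m  # (n_x, 1)
        db = np.sum(dZ) / m
        return J, dw, db

Both versions compute the same quantities; the vectorized one hands the inner loops to NumPy's optimized native routines, which is what lets it scale to much larger values of m and n_x.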

For the Week 3 and Week 4 notes, please click here.
