My Machine Learning Diary: Day 59

Junhong Wang
5 min read · Dec 16, 2018


Day 59

Today I completed week 2 of the first course of the Coursera Deep Learning Specialization. It was mainly about logistic regression and neural networks, which I had already learned in Coursera ML.

Notation Changes

For anyone who has followed Andrew’s Coursera ML, myself included, there are some notation changes we need to be careful with.

Matrix

It turns out to be more convenient, programmatically, to stack data in columns rather than in rows.

Notation Changes for X and Y

Unlike in Coursera ML, where samples were stacked in rows, we will stack them column-wise in this deep learning course.
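Concretely, the figure above presumably shows the new convention, where n is the number of features and m the number of training samples (my own reconstruction):

```latex
X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix} \in \mathbb{R}^{n \times m},
\qquad
Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix} \in \mathbb{R}^{1 \times m}
```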

Intercept Term b

In logistic regression, we included the intercept term b in the weights and added a 1 on top of each sample x. We will give up this convention in this course.

Explicit Intercept Term b

Instead, we will explicitly write down the intercept term in our new notation.
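In other words (my own summary of the change, with θ denoting the old Coursera ML parameter vector):

```latex
\text{Coursera ML: } z = \theta^T x, \; x_0 = 1, \; \theta \in \mathbb{R}^{n+1}
\qquad\longrightarrow\qquad
\text{This course: } z = w^T x + b, \; w \in \mathbb{R}^{n}, \; b \in \mathbb{R}
```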

Logistic Regression

Although we already saw how logistic regression works in Day 14 and Day 15, we kind of derived it intuitively. Let’s look at the derivation more formally.

Notation

Given x, the goal of logistic regression is to find out the probability that y equals one. To express this in math, we can write it as follows:

Definition of Hypothesis
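Presumably the equation in that figure is simply:

```latex
\hat{y} = P(y = 1 \mid x)
```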

We use y_hat to denote the predicted y. As we saw in Coursera ML, this y_hat is defined as follows:

Definition of y_hat (σ is sigmoid function)
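That is, reconstructing the equation from the caption and the intercept discussion above:

```latex
\hat{y} = \sigma\!\left(w^T x + b\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```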

The only difference is that now we have to explicitly write the intercept term b. Thus, the dimension of w and x is n instead of n+1.

Loss Function and Cost Function

The loss function is defined as the error of one particular sample. The cost function is defined as the average error over all samples, namely, the average of all the losses.

Definition of Logistic Loss Function L
Definition of Logistic Cost Function J
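For reference, these are the standard logistic loss and cost, which is presumably what the two figures above showed:

```latex
\mathcal{L}(\hat{y}, y) = -\left( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right),
\qquad
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\!\left(\hat{y}^{(i)}, y^{(i)}\right)
```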

The reason we don’t simply use the squared error as the loss function is that it would make the cost a non-convex function, and thus give it many local minima.

Derivation of Logistic Cost Function

This lecture note from Stanford was helpful for understanding the concept. We saw from Day 15 that the cost function makes sense intuitively, but how did someone come up with this function? The hypothesis of logistic regression can be written in the equivalent forms below.

Another form of logistic regression hypothesis 1
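The two equivalent forms referred to here are, presumably:

```latex
P(y = 1 \mid x) = \hat{y}, \qquad P(y = 0 \mid x) = 1 - \hat{y}
```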

We can do a little trick and combine the two expressions into one. Note y is either 0 or 1.

Another form of logistic regression hypothesis 2
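Combining them into one expression (since y is either 0 or 1, exactly one of the two factors is active):

```latex
P(y \mid x) = \hat{y}^{\,y} \, (1 - \hat{y})^{\,1 - y}
```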

When we feed in a sample (x, y) from our training examples, we want this probability to be large. Namely, for a single data point, the goal of logistic regression is to maximize the probability P(Y|X). Now we take the log likelihood of P. Why? I found this post and this answer that gave me an intuition for the reasons. In short, it makes the math easier. So let’s take the log of the probability and get rid of the nasty exponents.

Log Likelihood of P(Y|X)
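Taking the log of the combined form above gives:

```latex
\log P(y \mid x) = y \log \hat{y} + (1 - y) \log (1 - \hat{y}) = -\mathcal{L}(\hat{y}, y)
```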

We see that maximizing the probability means minimizing the loss function. For the whole training set, we just multiply the probabilities of all the samples together and take the log of that product.

Log Likelihood of P(Y|X) for all samples
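Assuming the training samples are independent, the likelihood of the whole training set factorizes, and its log becomes a sum:

```latex
\log \prod_{i=1}^{m} P\!\left(y^{(i)} \mid x^{(i)}\right)
= \sum_{i=1}^{m} \log P\!\left(y^{(i)} \mid x^{(i)}\right)
= -\sum_{i=1}^{m} \mathcal{L}\!\left(\hat{y}^{(i)}, y^{(i)}\right)
```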

Again, to maximize the total probability, we want to minimize the sum of all the losses; dividing by m to take the average doesn’t change the minimizer, so this is exactly the cost function we defined earlier.

Definition of Logistic Cost Function J

Gradient Descent

Logistic regression can actually be seen as the simplest form of a neural network. We forward propagate to get the prediction.

Logistic Regression Forward Propagation
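The forward pass is just the hypothesis computed step by step (writing a for the activation, i.e. the prediction ŷ):

```latex
z = w^T x + b, \qquad a = \hat{y} = \sigma(z), \qquad \mathcal{L}(a, y)
```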

Then we back propagate to update the parameters. Note that we now have two kinds of parameters, w and b. First we find out what the error is.

Logistic Regression Back Propagation: Loss Function

Note that the derivatives of L with respect to w and b are as follows:

Logistic Regression Back Propagation: Derivatives of L w/ respect to w and b
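By the chain rule, these derivatives factor through z (my own reconstruction of the figure):

```latex
\frac{\partial \mathcal{L}}{\partial w_j} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial w_j},
\qquad
\frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial b}
```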

Next, we compute ∂L/∂z.

Logistic Regression Back Propagation: Compute ∂L/∂z
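Working this out, using dL/da = -y/a + (1-y)/(1-a) and da/dz = a(1-a) for the sigmoid:

```latex
\frac{\partial \mathcal{L}}{\partial z}
= \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z}
= \left( -\frac{y}{a} + \frac{1 - y}{1 - a} \right) a (1 - a)
= a - y
```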

Then, we compute the derivatives of L with respect to w and b using ∂L/∂z.

Logistic Regression Back Propagation: Compute Derivatives of L w/ respect to w and b
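Since z = wᵀx + b, we have ∂z/∂wⱼ = xⱼ and ∂z/∂b = 1, so:

```latex
\frac{\partial \mathcal{L}}{\partial w_j} = x_j \, (a - y),
\qquad
\frac{\partial \mathcal{L}}{\partial b} = a - y
```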

Finally, we update the parameters w and b with the derivatives we computed.

Logistic Regression Back Propagation: Update Parameters
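With learning rate α, these are the usual gradient descent updates (using the course’s dw, db shorthand for the derivatives of the cost J):

```latex
w := w - \alpha \, dw, \qquad b := b - \alpha \, db
```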

Python

Here are some useful things to know when implementing logistic regression in Python.

Vectorization

We can vectorize the expressions in gradient descent and make the program run faster.

Derivatives of J w/ respect to w and b (Vectorized)
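Here is a minimal numpy sketch of one vectorized gradient descent step, assuming X has shape (n, m) and Y has shape (1, m) as in the notation above (function and variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One vectorized gradient descent step for logistic regression.

    w: (n, 1) weights, b: scalar intercept, X: (n, m) samples stacked
    in columns, Y: (1, m) labels, alpha: learning rate.
    """
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # (1, m) predictions
    dZ = A - Y                        # (1, m) errors
    dw = np.dot(X, dZ.T) / m          # (n, 1) derivative of J w.r.t. w
    db = np.sum(dZ) / m               # scalar derivative of J w.r.t. b
    return w - alpha * dw, b - alpha * db
```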

Broadcasting

Numpy automatically converts an operand into a compatible shape, then performs the arithmetic operation.

Broadcasting in numpy
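A small illustrative example of my own:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # shape (2, 3)
b = np.array([[10.0, 20.0, 30.0]])   # shape (1, 3)

# b is broadcast (virtually copied) along the first axis to shape (2, 3).
print(A + b)   # [[11. 22. 33.]
               #  [14. 25. 36.]]

# Scalars broadcast too.
print(A * 2)   # [[ 2.  4.  6.]
               #  [ 8. 10. 12.]]
```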

Numpy Vectors

Sometimes when we play around with numpy, we see vectors with shape (2,) and (2,1). What’s the difference between the two? The former is called a rank 1 array. The latter is called a column vector.

Rank 1 array and Rank 2 array

Rank 1 arrays don’t work well with rank 2 arrays in arithmetic, so it’s a good convention not to use rank 1 arrays. Also, use assert statements here and there to make sure the shapes are what we expect. A small sketch of the difference and the fix is below.
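For example (my own snippet, not from the course):

```python
import numpy as np

a = np.random.randn(2)       # rank 1 array, shape (2,)
print(a.shape, a.T.shape)    # (2,) (2,) -- transposing does nothing

c = np.random.randn(2, 1)    # column vector, shape (2, 1)
print(c.shape, c.T.shape)    # (2, 1) (1, 2)

# Avoid rank 1 arrays: reshape them into explicit column vectors,
# and assert the shape so bugs surface early.
a = a.reshape(2, 1)
assert a.shape == (2, 1)
```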

That’s it for today. The assignment was very fun to work on :)
