My Machine Learning Diary: Day 59
Today I completed week 2 of the first course of the Coursera Deep Learning Specialization. It was mainly about logistic regression and neural networks, which I had already learned in Coursera ML.
Notation Changes
For anyone who followed Andrew’s Coursera ML, myself included, there are some notation changes to be careful with.
Matrix
It turns out to be more programmatically convenient to stack data in columns rather than in rows. Unlike in Coursera ML, where samples were stacked in rows, in this deep learning course we will stack them column-wise.
Intercept Term b
In logistic regression, we used to fold the intercept term b into the weights by adding a 1 on top of each sample x. We will give up this convention in this course.
Instead, we will explicitly write down the intercept term in our new notation.
Logistic Regression
Although we already saw how logistic regression works in Day 14 and Day 15, we derived it rather intuitively back then. Let’s go through its derivation more formally.
Notation
Given x, the goal of logistic regression is to find the probability that y equals one. To express this in math, we can write it as follows:
We use y_hat to denote the predicted y. As we saw in Coursera ML, this y_hat is defined as such:
The only difference is that now we have to explicitly write the intercept term b. Thus, the dimension of w and x is n instead of n+1.
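The equations referenced above appeared as images in the original post; in standard notation they are:

```latex
\hat{y} = P(y = 1 \mid x), \qquad
\hat{y} = \sigma(w^{T}x + b), \qquad
\sigma(z) = \frac{1}{1 + e^{-z}},
```

with $w \in \mathbb{R}^{n}$, $x \in \mathbb{R}^{n}$, and $b \in \mathbb{R}$, since b is no longer folded into w.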
Loss Function and Cost Function
The loss function is defined as the error on one particular sample. The cost function is defined as the average error over all samples, namely, the average of the losses.
The reason we don’t simply use squared error as the loss function is that, combined with the sigmoid, it would make the cost function non-convex, with many local minima.
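For reference, the loss and cost used in the course (shown as images in the original post) are the cross-entropy loss and its average:

```latex
\mathcal{L}(\hat{y}, y) = -\bigl(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\bigr), \qquad
J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\bigl(\hat{y}^{(i)}, y^{(i)}\bigr)
```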
Derivation of Logistic Cost Function
This lecture note from Stanford was helpful for understanding the concept. We saw on Day 15 that the cost function makes sense intuitively, but how did someone come up with it? The hypothesis of logistic regression can be written in the equivalent forms below.
We can do a little trick and combine the two expressions into one. Note that y is either 0 or 1.
When we feed in a sample x and its label y from the training examples, we want this probability to be large. Namely, for a single data point, the goal of logistic regression is to maximize the probability P(y|x). Now we take the log of P. Why? I found this post and this answer that gave me an intuition for the reasons. In short, it makes the math easier. So let’s take the log of the probability and get rid of the nasty exponents.
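The trick and the log step above amount to the following. Since y is either 0 or 1, the two cases collapse into a single expression, and its log is exactly the negative of the loss:

```latex
P(y \mid x) = \hat{y}^{\,y}\,(1 - \hat{y})^{\,1-y}
\;\Rightarrow\;
\log P(y \mid x) = y \log \hat{y} + (1 - y)\log(1 - \hat{y}) = -\mathcal{L}(\hat{y}, y)
```

Plugging in y = 1 recovers $\hat{y}$ and y = 0 recovers $1 - \hat{y}$, so nothing is lost in the combination.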
We see that maximizing the probability means minimizing the loss function. For the entire training set, we multiply the probabilities of all the samples together and take the log of the product.
Again, to maximize the total probability, we want to minimize the average of all the loss functions, which is exactly the cost function we defined earlier.
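Assuming the training samples are independent, the likelihood of the whole training set factorizes, and taking the log turns the product into a sum:

```latex
\log \prod_{i=1}^{m} P\bigl(y^{(i)} \mid x^{(i)}\bigr)
= \sum_{i=1}^{m} \log P\bigl(y^{(i)} \mid x^{(i)}\bigr)
= -\sum_{i=1}^{m} \mathcal{L}\bigl(\hat{y}^{(i)}, y^{(i)}\bigr)
```

Maximizing the left-hand side is therefore the same as minimizing $J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$; the $\frac{1}{m}$ is just a constant scaling that doesn’t change the minimizer.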
Gradient Descent
Logistic regression can actually be seen as the simplest form of a neural network. We forward propagate to get the prediction.
Then we back propagate to update the parameters. Note that we now have two kinds of parameters, w and b. First we find out what the error is.
Note that the derivatives of L with respect to w and b are as follows:
Next, we compute ∂L/∂z.
Then, we compute the derivatives of L with respect to w and b using ∂L/∂z.
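Writing a = ŷ = σ(z), the chain rule gives the derivatives described above:

```latex
\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}, \quad
\frac{\partial a}{\partial z} = a(1-a)
\;\Rightarrow\;
\frac{\partial \mathcal{L}}{\partial z} = a - y,
```

and since $z = w^{T}x + b$,

```latex
\frac{\partial \mathcal{L}}{\partial w_j} = x_j\,(a - y), \qquad
\frac{\partial \mathcal{L}}{\partial b} = a - y.
```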
Finally, we update the parameters w and b with the derivatives we computed.
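To make the whole loop concrete, here is a minimal sketch of one gradient descent step for a single sample, following the derivatives above (the function and variable names are my own, not the course’s):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, b, x, y, lr=0.1):
    """One gradient descent step of logistic regression on a single sample."""
    # Forward propagation: z = w.x + b, prediction a = sigmoid(z)
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    a = sigmoid(z)
    # Back propagation: dL/dz = a - y
    dz = a - y
    # dL/dw_j = x_j * dz and dL/db = dz; update with learning rate lr
    w = [w_j - lr * x_j * dz for w_j, x_j in zip(w, x)]
    b = b - lr * dz
    return w, b

# Repeatedly stepping on one positive sample drives the prediction toward 1
w, b = [0.0, 0.0], 0.0
for _ in range(100):
    w, b = gradient_step(w, b, [1.0, 2.0], 1.0)
```

In practice we would of course average the gradients over all m samples before updating, which is where vectorization comes in.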
Python
Here are some useful things to know when implementing logistic regression in Python.
Vectorization
We can vectorize the expressions in gradient descent to make the program run faster.
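As a sketch of what the vectorized version looks like (NumPy, my own variable names), here is gradient descent over all m samples at once, with X of shape (n, m) following the course’s column-stacking convention:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def gradient_descent(X, Y, lr=0.1, iters=1000):
    """Vectorized logistic regression. X: (n, m) samples in columns, Y: (1, m)."""
    n, m = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    for _ in range(iters):
        A = sigmoid(w.T @ X + b)   # predictions for all samples, shape (1, m)
        dZ = A - Y                 # dL/dz for every sample at once
        dw = (X @ dZ.T) / m        # average gradient w.r.t. w, shape (n, 1)
        db = np.sum(dZ) / m        # average gradient w.r.t. b
        w -= lr * dw
        b -= lr * db
    return w, b

# Tiny 1-D example: points below 0 labeled 0, above 0 labeled 1
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0.0, 0.0, 1.0, 1.0]])
w, b = gradient_descent(X, Y)
```

The inner loop has no explicit iteration over samples: a single matrix product replaces the per-sample sums.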
Broadcasting
NumPy automatically broadcasts an operand into a compatible shape and then performs the arithmetic operation.
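A small example of what broadcasting does:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # shape (2, 3)
v = np.array([[10.0, 20.0, 30.0]])  # shape (1, 3)

# v is broadcast (conceptually copied down the rows) to shape (2, 3),
# then added elementwise
B = A + v

# Scalars broadcast too: 1 is added to every element
C = A + 1
```

This is what lets expressions like `w.T @ X + b` work even though `b` is a scalar.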
Numpy Vectors
Sometimes when we play around with NumPy, we see vectors with shape (2,) and shape (2,1). What’s the difference between the two? The former is called a rank 1 array; the latter is a column vector.
Rank 1 arrays don’t behave consistently in matrix arithmetic, so it’s a good convention not to use them. Also, sprinkling assert statements here and there helps make sure the shapes are what we expect.
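A quick demonstration of the difference, and of the reshape-and-assert habit:

```python
import numpy as np

a = np.random.randn(2)      # rank 1 array, shape (2,)
assert a.shape == (2,)
# Its transpose looks identical, which is a source of subtle bugs
assert a.T.shape == (2,)

b = np.random.randn(2, 1)   # column vector, shape (2, 1)
assert b.shape == (2, 1)
assert b.T.shape == (1, 2)

# b @ b.T is a proper (2, 2) outer product; a @ a.T collapses to a scalar
assert (b @ b.T).shape == (2, 2)
assert np.ndim(a @ a.T) == 0

# When in doubt, reshape rank 1 arrays into explicit column vectors
a = a.reshape(2, 1)
assert a.shape == (2, 1)
```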
That’s it for today. The assignment was very fun to work on:)