Coursera’s Machine Learning Notes — Week 5, Neural Networks — Loss Function, Forward and Backward Propagation

Amber
Jan 19, 2019


Notes on Coursera’s Machine Learning course, instructed by Andrew Ng, Adjunct Professor at Stanford University.

Let’s continue with what’s left over from last week.

Forward Propagation

This procedure is for prediction.

The steps are described below:

  • First, we need a weight matrix Θ (also denoted W) for each layer. We can initialize it randomly or from prior knowledge.
  • Start from the Input Layer X (also denoted a¹).
  • Do Forward Propagation through to the Output Layer and get the final value.
  • Use the final value for prediction (a minimal code sketch follows this list).
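
As a concrete illustration, here is a minimal NumPy sketch of these steps for a network with one hidden layer, assuming a sigmoid activation and weight matrices that include a bias column; the function and variable names (and the random initialization) are just for this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, Theta1, Theta2):
    """Propagate one input vector x through a single-hidden-layer network."""
    a1 = np.concatenate(([1.0], x))             # input layer with bias unit
    z2 = Theta1 @ a1                            # weighted sum into the hidden layer
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # hidden activations with bias unit
    z3 = Theta2 @ a2                            # weighted sum into the output layer
    a3 = sigmoid(z3)                            # final value h_Theta(x)
    return a3

# Example: a 4-6-1 architecture (see below) with random initialization.
rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.12, 0.12, size=(6, 5))  # 6 hidden units, 4 inputs + bias
Theta2 = rng.uniform(-0.12, 0.12, size=(1, 7))  # 1 output unit, 6 hidden units + bias
x = np.array([0.5, -1.2, 3.0, 0.1])
h = forward_propagation(x, Theta1, Theta2)
print(h)  # the final value used for prediction
```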

More Details

Let’s take a binary classification problem as an example and visualize the details of Forward Propagation. First of all, we need a complete NN architecture, that is, all weight matrices W are already known.

  • Input Layer: 4 units, Hidden Layer: 6 units, Output Layer: 1 unit.
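
With this architecture (and a bias unit added to the input and hidden layers), the two weight matrices have the shapes

```latex
\Theta^{1} \in \mathbb{R}^{6 \times 5}, \qquad \Theta^{2} \in \mathbb{R}^{1 \times 7}
```

following the convention that Θ of layer l maps layer l (plus its bias unit) to layer l+1.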

Forward Propagation (1) — From Input Layer to Hidden Layer

Forward Propagation (2) — From Hidden Layer to Output Layer

In the end, we get the final value from the Output Layer and use it for prediction.

  • If the final value is ≥ 0.5, we predict the label y=1.
  • If not, we predict the label y=0.
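
In code, this decision rule is just a threshold on the output of Forward Propagation (the value of h below is made up for illustration):

```python
h = 0.73                        # example final value from forward propagation
y_pred = 1 if h >= 0.5 else 0   # threshold at 0.5
print(y_pred)                   # -> 1
```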

Multi-Class Classification Problem

In a multi-class classification problem, there is more than one unit in the Output Layer. Each unit represents how confident the model is that the data belongs to a particular category. In this situation, we can apply the one-vs-all strategy to decide the prediction result (see the sketch below). Note that Forward Propagation works the same way as in the binary classification problem.
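
For instance, with one output unit per class, a minimal way to turn the output activations into a predicted label is to pick the most confident unit (the activation values below are made up for illustration):

```python
import numpy as np

h = np.array([0.10, 0.85, 0.05, 0.30])  # example output activations, one per class
predicted_class = int(np.argmax(h))     # index of the most confident unit
print(predicted_class)                  # -> 1
```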

Loss Function

The Loss Function is also called the Cost Function, though the name Loss Function is the one used more commonly for NNs.

The goal of the Loss Function is to measure the error that the model makes. Here, we give a general formula for the loss function. Note that each node of the NN is a logistic unit with a Sigmoid (logistic) Activation Function.
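
For reference, the general regularized form used in the course sums the logistic loss over all m training examples and all K output units, and adds an L2 penalty over all non-bias weights:

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
  \Big[ y_k^{(i)} \log \big(h_\Theta(x^{(i)})\big)_k
      + \big(1 - y_k^{(i)}\big) \log \Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \Big]
  + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}}
    \big(\Theta^{l}_{ji}\big)^2
```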

Backward Propagation

This procedure is for minimizing the Loss Function, just like the gradient descent technique we used in a previous note.

The approach used here is similar to Gradient Descent, and there are two steps:

  1. Compute the partial derivative of J(Θ).
  2. Update each element of the weight matrix Θ (see the update rule below).
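
For the second step, the update is the usual gradient descent rule applied to every weight, with learning rate α:

```latex
\Theta^{l}_{ij} := \Theta^{l}_{ij} - \alpha \, \frac{\partial J(\Theta)}{\partial \Theta^{l}_{ij}}
```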

Understanding in an Intuitive Way

In the course, Professor Ng gives an intuitive way to think about Backward Propagation. Let’s visualize the procedure by using the result of the partial derivative of J(Θ).

Let’s start with another example, assuming the weight matrices Θ (also denoted W) have been initialized. Our goal is to minimize J(Θ) by updating the weight matrices Θ.

Note: We can use random initialization for the weight matrices Θ if we have no prior knowledge of the problem.

After some calculus, we can get the result of the partial derivative of J(Θ).
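
For this three-layer network (and ignoring regularization), the result can be written compactly with the error terms δ:

```latex
\delta^{3} = a^{3} - y, \qquad
\delta^{2} = \big(\Theta^{2}\big)^{\top} \delta^{3} \circ g'\big(z^{2}\big), \qquad
\frac{\partial J(\Theta)}{\partial \Theta^{l}_{ij}} = a^{l}_{j} \, \delta^{l+1}_{i}
```

where ∘ denotes the element-wise product, g'(z²) = a² ∘ (1 − a²) for the sigmoid, and the component of δ² corresponding to the bias unit is dropped.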

Let’s dig into the meaning of δ.

  • δ³ is the error of layer 3, the Output Layer in this example.
  • δ² is the error of layer 2, the Hidden Layer in this example.
  • δ¹ does not exist since layer 1 is the Input Layer.

With the result of the partial derivative of J(Θ), you can now update the weight matrices Θ.
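
Here is a minimal NumPy sketch of one backward pass and one update step for the same single-hidden-layer network as above, using a single training example and no regularization; again, the names and the random initialization are only for this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_update(x, y, Theta1, Theta2, alpha=0.1):
    """One forward + backward pass for a single example, then a gradient descent step."""
    # Forward propagation (bias units prepended to a1 and a2).
    a1 = np.concatenate(([1.0], x))
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))
    a3 = sigmoid(Theta2 @ a2)

    # Backward propagation: error terms for the output and hidden layers.
    delta3 = a3 - y                                                       # delta^3 = a^3 - y
    delta2 = (Theta2.T @ delta3)[1:] * sigmoid(z2) * (1.0 - sigmoid(z2))  # drop the bias row

    # Gradients of J(Theta) for this example, then the update step.
    grad2 = np.outer(delta3, a2)
    grad1 = np.outer(delta2, a1)
    Theta2 -= alpha * grad2
    Theta1 -= alpha * grad1
    return Theta1, Theta2

# Example with the 4-6-1 architecture and label y = 1.
rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.12, 0.12, size=(6, 5))
Theta2 = rng.uniform(-0.12, 0.12, size=(1, 7))
x = np.array([0.5, -1.2, 3.0, 0.1])
y = np.array([1.0])
Theta1, Theta2 = backprop_update(x, y, Theta1, Theta2)
```

Running this step repeatedly over the training data is, in essence, training the network with gradient descent.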

More Details about Partial Derivative of J(Θ)

In this section, we want to give some hints about the derivation of the formulas above.

For convenience, we use the same example as above and suppose there is only one training example. Hence, the loss function J(Θ) is simpler than the general form.

In order to do the backward propagation, we need to do the forward propagation first.

Then we can compute the partial derivatives of J(Θ). Here, we show the partial derivative with respect to one element of W¹ and one element of W², respectively. If you are interested, you can work them out from the hints; one worked step is sketched below.
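
As a hint, here is the chain-rule computation for one element of W² under these assumptions (a single example, cross-entropy loss, and a sigmoid activation):

```latex
\frac{\partial J}{\partial W^{2}_{1j}}
  = \frac{\partial J}{\partial a^{3}} \cdot \frac{\partial a^{3}}{\partial z^{3}} \cdot \frac{\partial z^{3}}{\partial W^{2}_{1j}}
  = \frac{a^{3} - y}{a^{3}\,(1 - a^{3})} \cdot a^{3}\,(1 - a^{3}) \cdot a^{2}_{j}
  = \big(a^{3} - y\big)\, a^{2}_{j}
  = \delta^{3}\, a^{2}_{j}
```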

Finally, you can prove the formula for the partial derivative of J(Θ) if you follow the procedure above. Don’t worry if you are not familiar with calculus; we have given you enough concepts in the section Understanding in an Intuitive Way.

How to Implement an NN in Practice?

Here is the reference for the exercise, in which we implement an NN without using a deep learning library. We hope the tutorial helps you understand the architecture of an NN and the Forward and Backward Propagation procedures more clearly.

So far, we have discussed how to build a model using linear regression, logistic regression, and NNs. So, what is the next step once we have a trained model? Could we make it better? These questions will be discussed next week: how to evaluate the model.

I hope this article is helpful to you, and if you like it, please give me a 👏. Any feedback, thoughts, comments, suggestions, or questions are welcome!
