Deep Learning Specialization Course
Course 2: Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization (Week 3 notes)
I have been receiving a really good response to my notes on this course, and it inspires me to do well on every article. If you have not yet read my week 1 and week 2 notes, please check out this article.
Without any further delay, let's dive right into this week’s learning.
Hyperparameter Tuning
Training neural networks involves setting several hyperparameters. This week we will learn how to find good settings for them.
1. Tuning Process
The question that arises while tuning hyperparameters is: how do you select the set of values to explore? In earlier practice, it was common to lay the hyperparameters out on a grid and systematically explore their values, with a five-by-five grid (25 combinations) being a typical choice. But that works well only when the number of hyperparameters is relatively small. The recommended practice is to choose the points at random and then try out the hyperparameters on this randomly chosen set of points.
The above diagram considers only two hyperparameters. If we have three hyperparameters, we can sample inside a cube, using the third dimension for the third hyperparameter. By sampling within this three-dimensional cube, we get to try out many more distinct values of each hyperparameter, as in the sketch below.
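Here is a minimal sketch of the difference, assuming two made-up hyperparameters (a learning rate and a momentum-like value); the ranges are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search: a 5 x 5 grid gives 25 combinations, but only 5 distinct
# values are ever tried for each individual hyperparameter.
grid_alpha = np.linspace(0.0001, 1.0, 5)
grid_beta = np.linspace(0.8, 0.999, 5)
grid_points = [(a, b) for a in grid_alpha for b in grid_beta]

# Random search: the same 25 trials, but every trial draws fresh values,
# so we see 25 distinct values of each hyperparameter.
random_points = [(rng.uniform(0.0001, 1.0), rng.uniform(0.8, 0.999))
                 for _ in range(25)]

print(len(grid_points), len(random_points))  # 25 25
```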
2. Using an appropriate scale to pick hyperparameters
We saw that sampling at random over the range of hyperparameters allows us to search the space of hyperparameters more efficiently. It is just as important to pick an appropriate scale on which to explore each hyperparameter.
To understand this, consider the number of hidden units as the hyperparameter, with a range of interest from 50 to 100. Here, sampling values uniformly at random between 50 and 100 is a reasonable way to find a good value:
Now consider the learning rate, with a range between 0.0001 and 1. If we draw a number line with these extreme values and sample uniformly at random, around 90% of the sampled values will fall between 0.1 and 1. In other words, we would spend 90% of our resources searching between 0.1 and 1 and only 10% searching between 0.0001 and 0.1. That does not look right! Instead, we can sample on a log scale:
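A small sketch of both cases (the variable names are mine): the number of hidden units is sampled uniformly, while the learning rate is sampled by drawing its exponent uniformly between -4 and 0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden units between 50 and 100: plain uniform sampling is fine.
n_hidden = rng.integers(50, 101)   # integer in [50, 100]

# Learning rate between 0.0001 and 1: sample the exponent uniformly,
# so each decade (1e-4..1e-3, ..., 0.1..1) gets an equal share of trials.
r = rng.uniform(-4, 0)             # exponent in [-4, 0]
alpha = 10 ** r                    # learning rate in [1e-4, 1]

print(n_hidden, alpha)
```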
Batch Normalization
One of the important ideas in deep learning is an algorithm called batch normalization, which helps us train networks faster.
Recall that in logistic regression, normalizing the input features can speed up learning.
In deeper models, apart from the input features, we also have activations in every layer. Wouldn’t it be nice if we could normalize the mean and variance of those activations to make the training of W and b more efficient? For example, suppose we want to train W3 and b3 faster in the following network. Since a2 is the input to the next layer, normalizing it will affect the training of W3 and b3.
Here is how we can implement batch normalization for a single layer:
Given some intermediate values z⁽¹⁾ to z⁽ᵐ⁾ in one layer of the neural network, the normalization can be calculated as follows:
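$$\mu = \frac{1}{m}\sum_{i=1}^{m} z^{(i)}, \qquad \sigma^{2} = \frac{1}{m}\sum_{i=1}^{m}\left(z^{(i)}-\mu\right)^{2}, \qquad z^{(i)}_{\text{norm}} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^{2}+\epsilon}}$$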
Here, epsilon is added for numerical stability. After this step, every component of z has mean 0 and variance 1. For the hidden units, though, it does not always make sense to force the same mean and variance on every component; it is often better to let the hidden units take on a different distribution. We can use the following formula to calculate z̃ (z tilde):
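$$\tilde{z}^{(i)} = \gamma\, z^{(i)}_{\text{norm}} + \beta$$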
Here gamma and beta are learnable parameters. Batch norm applies this normalization process to the deep layers of the network in the same spirit as normalizing the inputs. The difference is that for the hidden layers we do not want to force every value to mean zero and variance one, because we want to take advantage of the nonlinearity of the activation function. Instead, the hidden units get a standardized mean and variance that are controlled by the explicit parameters gamma and beta.
Adding Batch Norm to a network
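As a rough sketch of where batch norm sits in a layer's forward pass (this is not the course's assignment code; the layer sizes and variable names are made up for illustration), the normalize-scale-shift step goes between the linear step and the activation:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z over the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)          # per-unit mean over the batch
    var = Z.var(axis=1, keepdims=True)          # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)      # zero mean, unit variance
    return gamma * Z_norm + beta                # learnable scale and shift

# One hidden layer of a toy network: linear -> batch norm -> ReLU.
# Shapes follow the course convention (units x examples).
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 32))           # 3 inputs, mini-batch of 32
W = rng.standard_normal((5, 3)) * 0.01          # 5 hidden units
gamma = np.ones((5, 1))
beta = np.zeros((5, 1))

Z = W @ A_prev                                   # no bias b needed here; beta plays its role
A = np.maximum(0, batch_norm_forward(Z, gamma, beta))
```

Note that the bias b can be dropped in this setting: subtracting the mini-batch mean cancels any constant added to z, and beta takes over its role.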
Why does Batch Norm really work?
- Normalizing the input features to mean zero and variance one speeds up learning; batch norm does the same thing for the values in the hidden layers.
- Consider the scenario of covariate shift, where the data distribution changes from one dataset to another and the algorithm fails to generalize. Something similar happens inside a neural network: from the point of view of any hidden layer, its input values keep changing as the earlier layers are updated. Batch norm reduces the amount by which the distribution of these hidden unit values shifts around.
Batch Norm at test time
When we apply batch norm during training, it works on mini-batches, computing mu and sigma² from each mini-batch.
But at test time we may want to process one example at a time, and a single example does not have a meaningful mini-batch mean and variance. Hence, we use an exponentially weighted average of mu and sigma², computed across the mini-batches during training, when calculating z_norm at test time.
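Here is a minimal sketch of that idea, assuming a decay rate of 0.9 and helper names of my own choosing (not code from the course):

```python
import numpy as np

def update_running_stats(Z_batch, running_mu, running_var, decay=0.9):
    """EWMA update of this layer's mean/variance, called once per mini-batch."""
    mu = Z_batch.mean(axis=1, keepdims=True)    # mini-batch mean per unit
    var = Z_batch.var(axis=1, keepdims=True)    # mini-batch variance per unit
    running_mu = decay * running_mu + (1 - decay) * mu
    running_var = decay * running_var + (1 - decay) * var
    return running_mu, running_var

def batch_norm_at_test(z, running_mu, running_var, gamma, beta, eps=1e-8):
    """Normalize a single test example using the stored running statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta
```

During training, running_mu and running_var (initialized, say, to zeros and ones) are updated after every mini-batch; at test time they stand in for the mini-batch statistics when normalizing a single example.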
Multiclass Classification
A generalization of logistic regression known as softmax regression lets us make predictions over more than two classes. In the image below, the output layer has four units, and we try to predict the probability of each of the four classes.
To calculate the probability of each class, we use the softmax activation function, which works a little differently from activation functions such as ReLU. In the softmax activation, we first compute a temporary vector t by exponentiating each element of the output layer's z value, and then divide by the sum of its elements, as shown below. Here the output layer has 4 units.
The output aL is then the vector t normalized so that its entries sum to 1, as in the example below.
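A quick sketch of that computation (the function and variable names are mine; the example logits follow the lecture's 4-class example):

```python
import numpy as np

def softmax(z):
    """Softmax activation for the output layer (a vector of C logits)."""
    t = np.exp(z - np.max(z))   # element-wise exponentiation (shifted for numerical stability)
    return t / np.sum(t)        # normalize so the C outputs sum to 1

z_L = np.array([5.0, 2.0, -1.0, 3.0])   # z[L] for C = 4 classes
a_L = softmax(z_L)
print(a_L, a_L.sum())                    # class probabilities, summing to 1
```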
The following are a few examples of decision boundaries created using softmax classification.
With this, we complete the second course of the Deep Learning Specialization.
Please follow me to get some more useful articles on machine learning.
Stay a lifelong learner !! Happy learning !!