DEEP LEARNING SERIES, CHAPTER 2 (PART B) | Towards AI
Introduction to Neural Networks and Their Key Elements (Part-B) — Hyper-Parameters
In the previous story (Part A) we discussed the structure and the three main building blocks of a neural network. This story will take you through the elements that make neural networks genuinely powerful and set them apart from the rest of the machine learning algorithms.
Previously we discussed Units/Neurons, Weights/Parameters & Biases; today we will discuss Hyper-Parameters.
Hyper-Parameters & Parameters:
These are the values that you must set manually. If you think of an NN as a machine, the knobs that change the behavior of the machine are the hyper-parameters of the NN. A hyper-parameter is a value required by your model about which we have very little prior knowledge; these values are mostly found by trial and error. There is no one-size-fits-all setting for hyper-parameters.
So far, for simplicity, we have not paid explicit attention to differentiating between parameters and hyperparameters, but here we will discuss the difference between them. In general, we consider a parameter of the model as a configuration variable that is internal to the model and whose value can be estimated from the data.
In contrast, by hyperparameter we refer to configuration variables that are external to the model itself and whose value in general cannot be estimated from the data and are specified by the programmer to adjust the learning algorithms. It takes a lot of experience and intuition to find the optimal values of these hyperparameters, which must be specified before starting the training process so that the models train better and more quickly.
We will not go into detail about all of them, but we have hyperparameters that are worth mentioning briefly, both at the structure and topology level of the neural network (number of layers, number of neurons, their activation functions, etc.) and at the learning algorithm level (learning rate, momentum, epochs, batch size, etc.). Next, we will introduce some of them:
- Epochs
- Batch Size
- Learning Rate
- Learning Rate Decay
- Momentum
- Initialization of parameter weights
Epochs:
The number of epochs tells us how many times all the training data passes through the neural network during training. A good clue is to increase the number of epochs until the accuracy metric on the validation data starts to decrease, even while the accuracy on the training data continues to increase (this is when we detect potential over-fitting).
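The clue above can be sketched with made-up accuracy histories (all numbers below are illustrative, not real training results): we keep adding epochs only while validation accuracy still improves.

```python
# Illustrative (fabricated) per-epoch accuracies: training keeps improving,
# but validation peaks and then declines -- the over-fitting signal.
train_acc = [0.60, 0.72, 0.80, 0.86, 0.90, 0.93, 0.95]
val_acc = [0.58, 0.69, 0.75, 0.79, 0.80, 0.78, 0.76]

def best_epoch(val_history, patience=1):
    """Return the index of the epoch after which validation accuracy
    stops improving (waiting `patience` non-improving epochs)."""
    best, best_i, waited = -1.0, 0, 0
    for i, acc in enumerate(val_history):
        if acc > best:
            best, best_i, waited = acc, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i

print("stop after epoch", best_epoch(val_acc))  # validation peaks at epoch 4
```

In practice a framework callback (such as early stopping on a validation metric) does the same bookkeeping for you.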
Batch Size:
As we have said before, we can partition the training data into mini-batches to pass them through the network. In TensorFlow, the batch size is the argument that indicates the size of the batches used in one iteration of training to update the gradient. The optimal size will depend on many factors, including the memory capacity of the computer we use to do the calculations.
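The partitioning described above can be sketched with NumPy (the array sizes and batch size here are arbitrary placeholders):

```python
import numpy as np

# Minimal sketch of splitting a training set into mini-batches.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # 100 samples, 3 features (placeholder data)
batch_size = 32                 # a common default; tune to your memory budget

batches = [X[i:i + batch_size] for i in range(0, len(X), batch_size)]
print([len(b) for b in batches])  # → [32, 32, 32, 4]
```

Note the last batch is smaller when the dataset size is not a multiple of the batch size; one gradient update is computed per batch.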
Learning Rate:
The gradient vector has a direction and a magnitude. Gradient descent algorithms multiply the magnitude of the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.
The proper value of this hyperparameter is very dependent on the problem at hand. In general, if it is too big, huge steps are taken, which can speed up learning but risks overshooting the minimum and never converging; if it is too small, progress is steady but very slow.
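A minimal numeric sketch of the update rule on the toy function f(w) = w² (whose gradient is 2w), showing both behaviours; the function and constants are chosen only for illustration:

```python
# Gradient descent on f(w) = w**2: next point = w - learning_rate * gradient.
def run(learning_rate, steps=30, w=10.0):
    for _ in range(steps):
        w = w - learning_rate * (2 * w)
    return w

print(abs(run(0.1)))   # small steps shrink |w| towards the minimum at 0
print(abs(run(1.1)))   # steps overshoot: |w| grows every iteration
```

With a learning rate of 0.1 each step multiplies w by 0.8, so the iterate converges; at 1.1 each step multiplies w by -1.2, so the iterate diverges.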
Learning Rate Decay:
But the best learning rate in general is one that decreases as the model approaches a solution. To achieve this effect, we have another hyperparameter, the learning rate decay, which is used to decrease the learning rate as epochs go by to allow learning to advance faster at the beginning with larger learning rates. As progress is made, smaller and smaller adjustments are made to facilitate the convergence of the training process to the minimum of the loss function.
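One common time-based form of such a schedule divides the initial rate by a factor that grows with the epoch number (the constants below are arbitrary examples):

```python
# Time-based learning rate decay: lr shrinks as epochs go by.
def decayed_lr(initial_lr, decay, epoch):
    return initial_lr / (1.0 + decay * epoch)

for epoch in (0, 5, 10):
    print(epoch, decayed_lr(0.1, 0.5, epoch))
```

Step decay (halving every k epochs) and exponential decay are popular alternatives; all share the same idea of large steps early and fine adjustments near convergence.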
In real and more complex cases, visually, it is as if we could find several local minima, and the loss function had a form like the one in the following figure:
In this case, the optimizer can easily get stuck at a local minimum and the algorithm may think that the global minimum has been reached, leading to suboptimal results. The reason is that the moment we get stuck, the gradient is zero and we can no longer get out of the local minimum strictly following the path of the gradient.
One way to solve this situation could be to restart the process from different random positions and, in this way, increase the probability of reaching the global minimum.
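The restart idea can be sketched on a toy one-dimensional function with one local and one global minimum (the function, learning rate, and step counts are illustrative choices):

```python
import random

# f(x) = x**4 - 3*x**2 + x has a local minimum near x ≈ 1.13
# and the global minimum near x ≈ -1.30.
def f(x):
    return x**4 - 3 * x**2 + x

def grad(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=500):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Run gradient descent from several random starts and keep the best result.
random.seed(0)
starts = [random.uniform(-2, 2) for _ in range(10)]
best = min((descend(x) for x in starts), key=f)
print(round(best, 2))  # → -1.3 (the global minimum)
```

Starts to the right of the local maximum fall into the shallow minimum; taking the best over all restarts recovers the global one.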
Momentum:
To avoid this situation, another commonly used solution involves the momentum hyperparameter. Intuitively, we can see it as if, to move forward, the optimizer takes a weighted average of the previous steps to gain a bit of impetus and get over the “bumps”, as a way of not getting stuck in local minima. If the average of the previous steps points in a better direction, perhaps it will allow us to make the jump.
But a plain average turns out to be too drastic, because gradients from many steps back are much less relevant than the most recent ones. That is why the previous gradients are weighted instead, and the momentum is a constant between 0 and 1 used for this exponential weighting, so that recent gradients count more than older ones. It has been shown that algorithms that use momentum work better in practice.
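A minimal sketch of the classical momentum update on the toy objective f(w) = w², where `beta` plays the role of the momentum hyperparameter (all constants are illustrative):

```python
# Classical momentum: keep an exponentially weighted "velocity" of past
# gradients instead of stepping along the current gradient alone.
def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    velocity = beta * velocity + grad   # weighted accumulation of gradients
    w = w - lr * velocity
    return w, velocity

# Minimize f(w) = w**2 (gradient 2*w) starting from w = 10:
w, v = 10.0, 0.0
for _ in range(100):
    w, v = momentum_step(w, v, grad=2 * w)
print(w)  # driven close to the minimum at 0
```

With beta = 0 this reduces to plain gradient descent; values around 0.9 are a common starting point in practice.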
One variant is the Nesterov momentum, which is a slightly different version of the momentum update that has recently gained popularity, and which basically slows down the gradient when it is close to the solution.
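A sketch of one common formulation of the Nesterov update, which evaluates the gradient at a look-ahead point rather than at the current position (again on the toy objective f(w) = w², with illustrative constants):

```python
# Nesterov momentum: peek ahead along the velocity before taking the gradient,
# which damps the update as the iterate approaches the solution.
def nesterov_step(w, velocity, grad_fn, lr=0.1, beta=0.9):
    lookahead = w - lr * beta * velocity
    velocity = beta * velocity + grad_fn(lookahead)
    return w - lr * velocity, velocity

w, v = 10.0, 0.0
for _ in range(100):
    w, v = nesterov_step(w, v, grad_fn=lambda x: 2 * x)
print(w)  # much closer to 0 than classical momentum after the same steps
```

Several equivalent rearrangements of this update exist in the literature and in deep learning libraries; this is one of the standard look-ahead forms.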
Initialization of parameter weights:
The initialization of the parameters’ weights is not exactly a hyperparameter, but it is as important as any of them, which is why we devote a brief paragraph to it in this section. It is advisable to initialize the weights with small random values to break the symmetry between different neurons: if two neurons have exactly the same weights, they will always receive the same gradient, which means both will have the same values in subsequent iterations and will never be able to learn different characteristics.
Initializing the parameters randomly following a standard normal distribution is reasonable, but it can lead to problems of vanishing gradients (when the values of a gradient are too small and the model stops learning, or learns too slowly, because of it) or exploding gradients (when the algorithm assigns an exaggeratedly high importance to the weights).
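One widely used scheme that keeps the random values appropriately small is He initialization, which scales a standard normal by sqrt(2 / fan_in) so that activation magnitudes stay roughly stable as depth grows; a sketch with NumPy (the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

def he_init(fan_in, fan_out):
    """He initialization: standard normal scaled by sqrt(2 / fan_in)."""
    return rng.normal(size=(fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

W = he_init(512, 256)
print(W.std())  # close to sqrt(2/512) ≈ 0.0625, far smaller than 1.0
```

Different neurons start with different random values, breaking the symmetry described above, while the fan-in scaling mitigates vanishing and exploding gradients compared with an unscaled standard normal.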
So that is all for today’s post. In the next one we will look deeper into activation functions and their role in a deep neural network. Until then, enjoy deep learning!