Optimization Problem in Deep Neural Networks

Rochak Agrawal
Analytics Vidhya

--

Training deep neural networks to achieve the best performance is a challenging task. In this post, I will explore the most common problems and their solutions: training that takes too long, vanishing and exploding gradients, and weight initialization. Together, these are known as optimization problems. Another category of issues that arises while training a network is regularization problems, which I have discussed in my previous post. If you haven’t already read it, you can read it by clicking the link below.

Input Normalization

While training a neural network, you may observe that the model is taking longer than expected. This often happens because the input data to the network is not normalized. Let us try to understand what that means by considering two input features.

In the raw data, the feature on the X-axis ranges from 5 to 50, whereas the feature on the Y-axis ranges from 3 to 7. In the normalized data, by contrast, the X-axis feature ranges from -0.15 to +0.15, and the Y-axis feature from -1.5 to +1.5.

By normalizing the data, I mean scaling the values so that the ranges of the features are similar. Normalizing the data is a two-step process.

First, subtract the mean of the data from every value; this makes the mean of the data equal to 0. Then, divide the data by its standard deviation; this scales every feature to unit variance.

mu = np.mean(X, axis=0)     # per-feature mean
X = X - mu                  # zero-center the data
sigma = np.std(X, axis=0)   # per-feature standard deviation
X = X / sigma               # scale each feature to unit variance

It is worth noting that we should use the same values of mu and sigma to transform the test data as well, because we want both sets to be scaled in exactly the same way.
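The two steps above can be sketched end to end as follows. This is a minimal example with made-up data; the array shapes and distribution parameters are purely illustrative:

```python
import numpy as np

# Hypothetical data: 100 training samples and 20 test samples, 2 features each
rng = np.random.default_rng(0)
X_train = rng.normal(loc=20, scale=5, size=(100, 2))
X_test = rng.normal(loc=20, scale=5, size=(20, 2))

# Fit mu and sigma on the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the SAME mu and sigma to both sets
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma

print(X_train.mean(axis=0))   # ~ [0, 0]
print(X_train.std(axis=0))    # ~ [1, 1]
```

After this transformation, the training features have zero mean and unit variance; the test features will be close to, but not exactly, zero mean and unit variance, since they were scaled with the training statistics.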

But why does Normalization work?

Now that we know how to normalize a dataset, let us try to understand why normalization works with the following example. Given below is a contour plot of the cost J as a function of the weights W and the bias b. The center represents the minimum cost, which is what we want to reach.

The plot on the right looks more symmetric, and that symmetry is the key to why normalization works.

If the ranges of the features vary greatly, then the values of the different weights will also vary greatly, and it will take more time to find a good set of weights. If we use normalized data instead, the weights stay in a similar range, and we obtain a good set of weights in less time.

Moreover, if we use the raw data, we have to use a lower learning rate to adjust to the varied contour heights. With normalized data, the contours are more spherical, and we can head straight for the minimum with a bigger learning rate.

The intuition is that when the features are on a similar scale, it becomes easy to optimize the weights and bias.

Vanishing and Exploding Gradients

The problem of vanishing and exploding gradients stems from the initialization of the weights. Both issues lead to improper and slower training of the network. As their names suggest, vanishing gradients occur when the gradients shrink and end up too small to meaningfully update the weights, whereas exploding gradients occur when the gradients grow and become too large. Let us understand them better with the help of an example.

Let W be the weight matrix of every layer, initialized close to the identity matrix I.

In forward propagation, the output Z of a particular layer is defined by the following formula, where W is the weight matrix, X is the input to the layer, and b is the bias:

Z = WX + b

If we perform the above computation over L layers, then, ignoring the bias, the weight matrices W are effectively multiplied together L times.

Now, if the values in W are greater than 1, say 1.5, the activations of the layers will increase exponentially with depth, the gradients will be big, gradient descent will take huge steps, and the network will take a long time to reach the minimum. This problem is known as exploding gradients.

Similarly, if the values in W are less than 1, say 0.9, the activations of the layers will decrease exponentially with depth, the gradients will be too small, and gradient descent will take minuscule steps, taking a long time to reach the minimum. This problem is known as vanishing gradients.
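The exponential growth and decay described above are easy to verify numerically. Here is a minimal sketch of the argument: a weight matrix "close to" the identity applied over L layers, with the bias and activation ignored as in the text. The layer width and depth are hypothetical:

```python
import numpy as np

n, L = 4, 50                 # hypothetical layer width and network depth
X = np.ones((n, 1))

for scale in (1.5, 0.9):
    W = scale * np.eye(n)    # every layer shares this near-identity weight matrix
    Z = X
    for _ in range(L):
        Z = W @ Z            # forward pass, ignoring bias and activation
    print(scale, Z[0, 0])    # 1.5 -> ~6.4e8 (explodes), 0.9 -> ~5.2e-3 (vanishes)
```

With only 50 layers, a per-layer factor of 1.5 already blows the activations up past 10^8, while a factor of 0.9 shrinks them below 0.01, which is exactly the behavior the backward pass inherits.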

To avoid the problems of exploding and vanishing gradients, we should follow these rules:

  1. The mean of the activations should be zero.
  2. The variance of the activations should stay the same across every layer.

If we follow the above rules, gradient descent takes steps that are neither too big nor too small, moves towards the minimum in an orderly manner, and avoids exploding and vanishing gradients. It also means that the network will train faster and optimize more quickly. Since the problem stems from improper initialization of the weights, we can fix it by initializing the weights properly.

Xavier Initialization

Xavier initialization is used when the activation function of a specific layer is Tanh. We can use the Xavier initialization in the following manner:

# Let the shape of the weight matrix be (5,3),
# i.e. 3 neurons in the previous layer
# The variance is 1/(neurons in previous layer)
# randn ensures that the mean = 0
W = np.random.randn(5,3) * np.sqrt(1/3)

He initialization

He initialization is used when the activation function of a particular layer is ReLU. We can use the He initialization in the following way:

# Let the shape of the weight matrix be (5,3),
# i.e. 3 neurons in the previous layer
# The variance is 2/(neurons in previous layer)
# randn ensures that the mean = 0
W = np.random.randn(5,3) * np.sqrt(2/3)
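We can check empirically that He initialization does what the two rules above ask for. The following sketch, with hypothetical layer sizes, pushes standard-normal inputs through a stack of ReLU layers and confirms that the activation scale stays roughly constant instead of exploding or vanishing:

```python
import numpy as np

np.random.seed(0)
n, depth = 256, 10                    # hypothetical layer width and depth
a = np.random.randn(n, 1000)          # a batch of standard-normal inputs

for _ in range(depth):
    W = np.random.randn(n, n) * np.sqrt(2 / n)   # He initialization
    a = np.maximum(0, W @ a)                     # ReLU activation

print(a.std())   # stays on the order of 1, rather than 1e8 or 1e-3
```

Swapping the factor np.sqrt(2 / n) for a plain np.random.randn(n, n) makes the activation scale drift rapidly with depth, which is precisely the vanishing/exploding behavior the initialization is designed to prevent.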

I want to thank the readers for reading the story. If you have any questions or doubts, feel free to ask them in the comments section below. I’ll be more than happy to answer them and help you out. If you like the story, please follow me to get regular updates when I publish a new story. I welcome any suggestions that will improve my stories.
