Gradient Descent Problems and Solutions in Neural Networks

Shachi Kaul · Published in Analytics Vidhya · Mar 12, 2020 · 10 min read

Milestones

  • Introduction
  • Neural Network Story in Short
    Crux of backward-propagation
  • Gradient Descent
    Intuition
    Why Derivatives in NN?
    How Gradient Descent works
    Gradient Descent Problems
    Cause of Gradient Problems
    Solutions to avoid Gradient Problems

Introduction

Gradient problems are among the main obstacles to training neural networks. They typically arise in artificial neural networks that use gradient-based methods and back-propagation. In today's deep learning era, however, various alternative solutions have been introduced that eradicate these flaws in network learning. This blog gives you a deep insight into all sorts of gradient problems, detailing the situations that cause them and the solutions that avoid them. Along the way, the blog also builds an idea of neural network architecture and the learning process, together with the key computations.

Neural Network Story in Short

A neural network is a network of interconnected neurons, each with weights, a bias and an activation function. Learning proceeds from a linear (affine) transformation of the inputs to a non-linear transformation through an activation function, passing through the phases of forward-propagation and back-propagation. Let's get to the core of it now.

Fig1

The whole learning process alternates repeatedly between forward-propagation and backward-propagation.

Forward-propagation:

Inputs are passed to the neurons of the hidden layer with randomly initialized weights and biases, as the linear transformation shown below.

z = (input * weight) + bias

To solve complex problems, a non-linear transformation is introduced, achieved by an activation function. The output of the linear transformation (z), i.e. the weighted sum of inputs, is supplied to the activation function below.

A = f(z)

For the Sigmoid activation function, this becomes:

A = 1 / (1 + e^(-z))

Every layer applies this linear transformation followed by the non-linear one. Hence, the final output of every layer is the output of its activation function.

Repeating the same process in every layer, you get the predicted output (y^) at the output layer.
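To make the forward pass concrete, here is a minimal Python sketch of a single neuron, assuming a scalar input and the Sigmoid activation (the values of x, w and b are illustrative):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5])      # input
w = np.random.randn(1)   # randomly initialized weight
b = np.zeros(1)          # bias

z = w * x + b            # linear (affine) transformation
a = sigmoid(z)           # non-linear transformation (activation)
print(a)                 # this layer's output, passed to the next layer
```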

Backward-propagation:

The predicted output (y^) may differ from the actual output (y), so a loss is calculated using a loss (cost) function (J). It tells how far our prediction has diverged from the actual value. Taking the mean squared error as the loss function:

J = (1/n) * Σ (y − y^)²
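As a quick sketch, the same loss can be computed in a couple of lines of Python (the sample values of y and y_hat below are made up for illustration):

```python
import numpy as np

def mse_loss(y, y_hat):
    # Mean of squared differences between actual and predicted outputs
    return np.mean((y - y_hat) ** 2)

y     = np.array([1.0, 0.0, 1.0])   # actual outputs
y_hat = np.array([0.9, 0.2, 0.6])   # predicted outputs
print(mse_loss(y, y_hat))           # 0.07
```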

This loss is propagated back to the initial layers while the weights of each neuron in every layer are updated. This process of propagating the error backwards to find better weights is referred to as back-propagation. Along the way, it computes how much each weight impacts the error; the idea is to adjust the weights of the neurons that contribute most to the cost function (error).

Now, to reduce the cost function (loss), the weights must be adjusted. Doing this by trial and error across training iterations would be very cumbersome; hence, an optimizer is needed to adjust the weights so as to minimize the cost function. To understand how each weight affects the loss, the derivative of the cost function is calculated, i.e. the rate of change of loss w.r.t. weight (dJ/dw). Thus came Gradient Descent.

As defined above, back-propagation computes the partial derivatives of the cost function J(w), whose values are used by the Gradient Descent algorithm. The end result is optimized weights, updated as per the equation below.

w_new = w_old − η * (dJ/dw)

The term “backward” means that the gradient computation starts from the back of the network: the gradients of the last layer's weights are computed first, and those of the first layer last.

Then the activation outputs are fed forward again, the loss is recomputed, and the cycle repeats until a satisfactory result is obtained.

Crux of backward-propagation

A set of input neurons with inputs (x) is connected to the neurons of the next layer; the inputs are multiplied by the corresponding weights (w) and passed to an activation function, giving a certain output. The error is calculated against the actual output and back-propagated using derivatives of the cost function. Let's discuss it.

Consider the below network: an input x connected to a Hidden1 neuron (activation a1), which feeds a Hidden2 neuron (activation a2) that produces the final output.

Basically, back-propagation updates the weights to reduce the loss in the next iteration, using the weight update equation:

w = w − η * (dJ/dw)

Thus we need the derivatives (dJ/dw), which is where the chain rule of differentiation, i.e. the derivative of a composite function, comes into view.

To calculate the derivative of the error w.r.t. the first weight (w1), back-propagate via the chain rule:

dJ/dw1 = (dJ/da2) * (da2/da1) * (da1/dw1)

In words: the change in the error function due to a weight is the change in the error due to the (final) activation output multiplied by the change in that activation due to the weight. The output of the whole network is the activation of the Hidden2 neuron (a2), so the chain involves the derivative of the Sigmoid function at the Hidden2 layer.

Putting the product of all these derivatives into the weight update equation gives the new weight.
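Here is a minimal Python sketch of this chain-rule computation for the two-neuron network above, assuming Sigmoid activations, a squared-error loss J = 0.5 * (y − a2)², and made-up values for the input, target and weights (biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0          # input and actual output (illustrative values)
w1, w2 = 0.4, -0.6       # weights into Hidden1 and Hidden2
lr = 0.1                 # learning rate (eta)

# Forward pass
a1 = sigmoid(w1 * x)
a2 = sigmoid(w2 * a1)    # network output (y_hat)

# Backward pass via the chain rule, for J = 0.5 * (y - a2)^2
dJ_da2  = a2 - y                  # dJ/da2
da2_dz2 = a2 * (1 - a2)           # Sigmoid derivative at Hidden2
dJ_dw2  = dJ_da2 * da2_dz2 * a1   # dJ/dw2
da1_dz1 = a1 * (1 - a1)           # Sigmoid derivative at Hidden1
dJ_dw1  = dJ_da2 * da2_dz2 * w2 * da1_dz1 * x   # dJ/dw1

# Weight update (one gradient descent step)
w1 -= lr * dJ_dw1
w2 -= lr * dJ_dw2
```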

Gradient Descent

Intuition

There is a beautiful explanation I came across during research:


Suppose you are blindfolded and have to reach a lake at the lowest point of a mountain. With zero visibility, you can only proceed by feeling the ground to get an idea of the slope. Wherever the land descends, you take a step down to reach the lake faster. This process of stepping down the slope is exactly how the gradient descent algorithm, an iterative method, works.

When we say gradient, it refers to the gradient of the loss function with respect to the weights of the network.

A gradient is a vector with direction and magnitude. It is the slope (the derivative w.r.t. the weights) of the convex cost curve, and it is calculated during back-propagation, after which the parameters (weights) get updated.

Why Derivatives in NN?

Derivatives are generally used in optimization methods such as Gradient Descent to optimize the weights (increase or decrease them) so as to reach the minimum value of the cost function.
In a NN, the derivative of the cost function with respect to a weight (w) is computed during back-propagation. It measures the impact of a change in that weight parameter when calculating the gradient descent step.

How Gradient Descent works

In a NN, the optimal weights to be propagated backwards are calculated by the gradient descent algorithm, which in turn is computed from the partial derivatives, as in fig3.

Let's talk about the figure below.
The graph of the cost function w.r.t. a weight demonstrates how the algorithm achieves its aim of reaching the global cost minimum. Each step towards the minimum is determined by the gradient (slope), i.e. the derivative, while the step size depends on the learn_rate. Choosing an inappropriate learn_rate or activation function leads to the various gradient problems discussed in later sections.


Gradient Descent Algorithm

  1. Randomly initialize the weights w.
  2. Compute the gradient G using the derivative of the cost function J(w) w.r.t. the weights.
  3. Apply the weight update equation: w = w − ηG. Here, η is the learn_rate, which should be neither too high (the minimum point gets skipped over) nor too low (the model never converges to it at all).
  4. Repeat steps 2 to 3 until the change in w becomes negligible.
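The algorithm, sketched in Python for a hypothetical one-parameter model y_hat = w * x with a mean squared error cost (all values illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])        # true relation: y = 2x

w = np.random.randn()                # step 1: randomly initialize weight
lr = 0.05                            # learning rate (eta)

for step in range(200):
    y_hat = w * x
    G = np.mean(2 * (y_hat - y) * x) # step 2: gradient dJ/dw of the MSE cost
    w_new = w - lr * G               # step 3: weight update w = w - eta * G
    if abs(w_new - w) < 1e-6:        # step 4: stop once updates become negligible
        break
    w = w_new

print(w)                             # converges near 2.0
```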

Gradient Problems

I. Vanishing Gradient

Vanishing gradient is a scenario in the learning process of neural networks where the model doesn't learn at all. It occurs when the gradient becomes so small that it almost vanishes, leaving the weights stuck, never reaching the optimal values for minimal loss (the global minimum). The network is thus unable to learn and converge. In particular, during chain-rule differentiation, back-propagating from the last layer to the initial ones may lead to no weight updates at all in the early layers.

II. Exploding Gradient

Exactly the opposite of the vanishing gradient: the model keeps on "learning" and the weights keep receiving large updates, but the model never converges. The gradient of the loss with respect to the weights becomes extremely large in the earlier layers, so large that it explodes. The updates keep oscillating, taking large steps as shown in the third figure above, and diverge away from the convergence point.

III. Saddle Point (MiniMax Point)

A saddle point on the surface of the loss function is that diplomatic point where, seen from one dimension, the critical point looks like a minimum, while from another dimension it looks like a maximum.


A saddle point creates a fuss around learning since it causes confusion. Model learning stops at this critical point, thinking a "minimum" has been reached because the slope is 0, even though from another dimension it is actually a point of maximum cost. This results in a non-optimal stopping point.

Saddle points come into view when gradient descent runs in multiple dimensions.

Here are the possible scenarios for critical points.


Cause of Gradient Problems

Before jumping into the causes of gradient problems, let's see which parameters are responsible when a neural network model is not able to converge:

  1. Learning rate
  2. Gradient Descent

Learning Rate

The learning rate controls the rate of decrement/increment of the weights. A low learning rate requires very many updates, and the model may never reach the global minimum point, which is where the cost (loss) value is lowest. A high learning rate will simply explode with too-large weight updates and may skip over the model's convergence point. Setting an optimal value makes the model behave nicely by reaching the minimum point (low cost value).

The model will also not converge if the gradient term (dJ/dw), the derivative of the error function in the weight update equation (the gradient descent formula), is too small or too large.

A too-small calculated gradient (dJ/dw), when multiplied by the learning rate, results in an even smaller value; if the learning rate is also very low, the result is smaller still. Subtracting this tiny value from the weights hardly changes them at all, so the model never converges.
Oppositely, the product of a high gradient with the learning rate yields a large value; when subtracted from the weights, it results in huge weight updates in each epoch and may bounce past the optimal value. Both scenarios prevent the model from ever converging.
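A tiny numeric sketch of both scenarios (the gradient and learn_rate values below are made up purely for illustration):

```python
# How gradient size and learning rate together decide the update w = w - lr * dJ/dw
w = 0.50

# Tiny gradient x low learning rate -> weight barely moves (vanishing-style stall)
print(w - 0.01 * 1e-4)   # 0.499999 (almost no change)

# Huge gradient x high learning rate -> weight overshoots (exploding-style bounce)
print(w - 1.0 * 50.0)    # -49.5 (blows far past any minimum)
```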

But the question is,

Why would the gradient be too low or too high?

Gradient Descent is a weight optimizer that involves the cost function and the activation function. How? Well, let's look at the chain rule used during back-propagation:

dJ/dw1 = (dJ/dy^) * (dy^/da2) * (da2/da1) * (da1/dw1)

Here, J refers to the cost function, and the term (dJ/dw1) is the derivative of the cost function w.r.t. the weight; in layman's terms, it captures how w1 impacts the error function. The term (dy^/da2) is the derivative of the activation function of the output layer, and the term (da2/da1) is the derivative of the activation function of the hidden layer. Let's say both the output and hidden layers use the Sigmoid activation function.

The range of the Sigmoid derivative is (0, 1/4]. When a chain of Sigmoid derivatives, each at most 1/4, is multiplied together, the result gets ever smaller.

Take the product of these terms: a Sigmoid derivative (at most 1/4) multiplied by a weight in the range (−1, 1) gives an even smaller value. Simple maths says that two small numbers multiply to a still smaller number, so a chain of such derivatives makes (dJ/dw) a tiny value. This is how the gradient gets so low that it almost vanishes. The same happens with the Hyperbolic Tangent (tanh) activation function, whose derivative range is (0, 1], again a small finite value that gives the same result as above. (For reference, the output ranges of Sigmoid and tanh are (0, 1) and (−1, 1) respectively.)


With the Sigmoid and tanh activation functions, the gradient decreases exponentially as it propagates from the output layer back to the initial layers. Hence, learning there is very slow or absent altogether.

Gradient problems thus arise whenever the Sigmoid activation function, or the alternative tanh function, comes into the picture, due to the range of their derivatives. To be clear: the ranges of the Sigmoid derivative, (0, 1/4], and of the tanh derivative, (0, 1], are the root causes of the problem.
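A quick sketch of that exponential shrinkage: even when every Sigmoid derivative in the chain sits at its 1/4 maximum, ten layers are enough to make the gradient negligibly small.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)        # peaks at 0.25 when z = 0

# Chaining the derivative across layers shrinks the gradient exponentially,
# even in the best case where each factor is at its 0.25 maximum.
grad = 1.0
for layer in range(10):
    grad *= sigmoid_derivative(0.0)   # at most 0.25 per layer
print(grad)                           # 0.25**10 ~ 9.5e-07
```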

Solutions to avoid Gradient Problems

To avoid gradient issues, it is best to select an appropriate activation function for the hidden layers: any activation function except Sigmoid and tanh can be used there, e.g. ReLU, LeakyReLU etc. But how do they solve the gradient problem?

ReLU:

ReLU became quite popular after the drawbacks of the Sigmoid and tanh functions surfaced, and it has been quite useful in hidden layers.

For inputs in the range (−infinity, infinity), ReLU yields outputs in the range [0, infinity), i.e. max(0, input). The derivative of the ReLU function is 0 for inputs less than 0 and 1 for inputs greater than 0.


Since the ReLU derivative is not a small fraction within (0, 1) like those of Sigmoid and tanh (it is exactly 1 for positive inputs), chained gradients do not shrink to tiny values, and the vanishing gradient problem is solved.


However, ReLU has the drawback of producing dead neurons (neurons that only ever output 0), which is beyond the scope of this blog. To overcome this shortcoming, LeakyReLU and ELU were introduced.
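For reference, here is a small Python sketch of ReLU and LeakyReLU and their derivatives (the alpha slope of 0.01 for LeakyReLU is a common but assumed default):

```python
import numpy as np

def relu(z):
    # Output is max(0, z): 0 for negative inputs, the input itself otherwise
    return np.maximum(0.0, z)

def relu_derivative(z):
    # 0 for z < 0, 1 for z > 0 (the kink at z = 0 is conventionally set to 0)
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for negative inputs keeps neurons from "dying"
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))             # [0. 0. 3.]
print(relu_derivative(z))  # [0. 0. 1.]
print(leaky_relu(z))       # [-0.02  0.    3.  ]
```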

Happy Reading!

You can get in touch with me via LinkedIn.

Feel free to share your views in the comments section, or to point out any misleading information. :)
