# The curious case of the vanishing & exploding gradient

## Understanding why gradients explode or vanish and methods for dealing with the problem.

--

In the last post, we introduced a step by step walkthrough of RNN training and how to derive the gradients of the network weights using back propagation and the chain rule. But it turns out that during this training the RNN can suffer greatly from two problems: 1. Vanishing gradients or 2. Exploding gradients.

## Why Gradients Explode or Vanish

Recall the many-to-many architecture for text generation shown below and in the introduction to RNN post, lets assume the input sequence to the network is a 20 word sentence: “I grew up in France,…….. I speak French fluently.

We can see from the example above that for the RNN to predict the word “French” which comes at the end of the sequence, it would need information from the word “France”, which occurs further back at the beginning of the sentence. This kind of dependence between sequence data is called long-term dependencies because the distance between the relevant information “France” and the point where it is needed to make a prediction “French” is very wide. Unfortunately, in practice as this distance becomes wider, RNNs have a hard time learning these dependencies because it encounters either a vanishing or exploding gradient problem.

These problems arise during training of a deep network when the gradients are being propagated back in time all the way to the initial layer. The gradients coming from the deeper layers have to go through continuous matrix multiplications because of the the chain rule, and as they approach the earlier layers, if they have small values (<1), they shrink exponentially until they vanish and make it impossible for the model to learn , this is the vanishing gradient problem. While on the other hand if they have large values (>1) they get larger and eventually blow up and crash the model, this is the exploding gradient problem

When gradients explode, the gradients could become `NaN` because of the numerical overflow or we might see irregular oscillations in training cost when we plot the learning curve. A solution to fix this is to apply gradient clipping; which places a predefined threshold on the gradients to prevent it from getting too large, and by doing this it doesn’t change the direction of the gradients it only change its length.