
The Exploding and Vanishing Gradients Problem in Time Series

In this post, we deal with exploding and vanishing gradients in time series, and in particular in Recurrent Neural Networks (RNNs), using Truncated Backpropagation Through Time and Gradient Clipping.

Dr Barak Or
MetaOr Artificial Intelligence

--

Intro

In this post, we focus on deep learning techniques for sequential data. All of us are familiar with this kind of data: text is a sequence of words, video is a sequence of images. More challenging examples come from the branch of time series data, such as medical information (heart rate, blood pressure, etc.) or finance (stock price information). The most common deep learning approach for time-series tasks is the Recurrent Neural Network (RNN). The motivation for using an RNN lies in generalizing the solution with respect to time. Since sequences mostly have different lengths, a classical deep learning architecture such as the Multilayer Perceptron (MLP) cannot be applied without modification. Moreover, the number of weights in an MLP grows with the input length and quickly becomes huge. Hence, the RNN is commonly used, with the same weights shared across the entire architecture. A simple RNN architecture is shown below, where V, W, and U are the weight matrices, and b is the bias vector.

Image by Author
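
For readers who prefer code, here is a minimal NumPy sketch of the recurrence in the figure, assuming a tanh activation and the naming (U, W, V, b) used above; the dimensions are illustrative, not taken from the post.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the post)
input_dim, hidden_dim, output_dim = 4, 8, 2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden-to-output weights
b = np.zeros(hidden_dim)                                  # bias vector

def rnn_forward(x_seq):
    """Run the simple RNN over a sequence of input vectors."""
    h = np.zeros(hidden_dim)          # initial hidden state
    outputs = []
    for x_t in x_seq:                 # the same U, W, V, b are shared across all time steps
        h = np.tanh(U @ x_t + W @ h + b)
        outputs.append(V @ h)         # estimated output at time t
    return outputs, h

outputs, last_h = rnn_forward(rng.normal(size=(10, input_dim)))
```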

If you are not familiar with RNNs, backpropagation, or MLPs, please feel free to read references [1]-[3] at the end of the post to fill in the gaps.

Backpropagation Through Time (BPTT)

Training an RNN is done by defining a loss function (L) that measures the error between the true label and the output, and minimizing it using a forward pass and a backward pass. The following simple RNN architecture summarizes the entire backpropagation-through-time idea.

For a single time step, the procedure is as follows: first, the input arrives; then it is processed through the hidden layer/state, and the estimated label is calculated. In this phase, the loss function is computed to evaluate the difference between the true label and the estimated label. The total loss function, L, is then computed, and with that, the forward pass is finished. The second part is the backward pass, where the various derivatives are calculated.

Image by Author
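
As a rough sketch of this forward/backward flow (assuming PyTorch and a toy regression loss, neither of which is specified in the post), the per-step errors are aggregated into the total loss L, and a single backward call then propagates gradients through every time step:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, input_dim, hidden_dim = 50, 4, 8   # illustrative sizes

rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
readout = nn.Linear(hidden_dim, 1)
loss_fn = nn.MSELoss()

x = torch.randn(1, seq_len, input_dim)      # one toy sequence
y = torch.randn(1, seq_len, 1)              # toy targets

# Forward pass: hidden states for every time step, then per-step outputs.
hidden_states, _ = rnn(x)
predictions = readout(hidden_states)

# The total loss L aggregates the per-step errors; backward() then propagates
# gradients through all time steps (backpropagation through time).
loss = loss_fn(predictions, y)
loss.backward()
```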

Training an RNN is not trivial, as we backpropagate gradients through layers and also through time. Hence, at each time step we have to sum up all the contributions from previous steps up to the current one, as given in the equation:

Image by Author
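
The equation in the figure is not reproduced here, but in the notation of Pascanu et al. [6] the summed contributions typically take the following form (a reconstruction; the exact symbols in the figure may differ):

```latex
\frac{\partial L}{\partial W}
  = \sum_{k=1}^{T}
    \frac{\partial L}{\partial \hat{y}_T}
    \frac{\partial \hat{y}_T}{\partial h_T}
    \frac{\partial h_T}{\partial h_k}
    \frac{\partial h_k}{\partial W},
\qquad
\frac{\partial h_T}{\partial h_k}
  = \prod_{i=k+1}^{T} \frac{\partial h_i}{\partial h_{i-1}} .
```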

In this equation, the contribution of the state at time step k to the gradient of the entire loss function L at the final time step t = T is calculated. The challenge during training lies in the ratio between consecutive hidden states, ∂h_i / ∂h_{i-1}: the term ∂h_T / ∂h_k is a product of many such factors, so its magnitude can shrink or grow exponentially as the gap between k and T increases.

The Vanishing and Exploding Gradients Problem

Two common problems that occur when backpropagating through time-series data are vanishing and exploding gradients. The equation above has two problematic cases:

Image by Author

In the first case, the term goes to zero exponentially fast, which makes it difficult to learn long-range dependencies. This problem is called the vanishing gradient. In the second case, the term goes to infinity exponentially fast, and its value overflows to NaN as the optimization becomes unstable. This problem is called the exploding gradient. In the following two sections, we review two approaches for dealing with these problems.
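
A quick NumPy experiment (not from the post) makes both cases tangible: because the same hidden-to-hidden matrix W appears in every factor of ∂h_T/∂h_k, the product behaves roughly like a matrix power of W, which decays or blows up exponentially depending on W's spectral radius.

```python
import numpy as np

def repeated_jacobian_norm(spectral_radius, steps=100, dim=8, seed=0):
    """Norm of W**steps for a random W rescaled to a chosen spectral radius.
    This roughly mimics the product of hidden-state Jacobians in a simple RNN."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(dim, dim))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # rescale the eigenvalues
    return np.linalg.norm(np.linalg.matrix_power(W, steps), 2)

print(repeated_jacobian_norm(0.9))   # shrinks roughly like 0.9**100 -> vanishing gradient
print(repeated_jacobian_norm(1.1))   # grows roughly like 1.1**100 -> exploding gradient
```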

Truncated Backpropagation Through Time (Truncated BPTT)

The following “trick” tries to overcome the vanishing gradient problem by considering a moving window during the training process. Recall that in the backpropagation training scheme, a forward pass and a backward pass are made through the entire sequence to compute the loss and the gradient. By taking a window, we also shorten the training time, since each update processes only part of the sequence.

This window is called a “chunk”. During the backpropagation process, we run the forward and backward passes over a chunk of fixed size instead of over the entire sequence.

Image by Author

Truncated BPTT is much faster than plain BPTT, and also less costly, because gradient contributions from faraway time steps are not accumulated. The downside of this approach is that dependencies longer than the chunk length are not learned during training. Another disadvantage is the detection of vanishing gradients: from the learning curve alone one might conclude that the gradient vanishes, when in fact the task itself may simply be difficult.
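
In practice, the chunking is often implemented by carrying the hidden state across chunks but detaching it, so gradients never flow past the chunk boundary. Here is a sketch, assuming PyTorch (the post itself does not prescribe a framework) and toy data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, chunk_len, input_dim, hidden_dim = 200, 20, 4, 8   # illustrative sizes

rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
readout = nn.Linear(hidden_dim, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)

x = torch.randn(1, seq_len, input_dim)   # one toy sequence
y = torch.randn(1, seq_len, 1)

h = torch.zeros(1, 1, hidden_dim)        # (num_layers, batch, hidden)
for start in range(0, seq_len, chunk_len):
    x_chunk = x[:, start:start + chunk_len]
    y_chunk = y[:, start:start + chunk_len]

    h = h.detach()                       # cut the graph: no gradients beyond this chunk
    out, h = rnn(x_chunk, h)
    loss = loss_fn(readout(out), y_chunk)

    optimizer.zero_grad()
    loss.backward()                      # backpropagates only through chunk_len steps
    optimizer.step()
```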

For the vanishing gradient problem, many other approaches have been suggested; to mention a few of them (a short sketch follows the list):

  1. Using the ReLU activation function.
  2. The Long Short-Term Memory (LSTM) architecture, where the forget gate might help.
  3. Initializing the weight matrix, W, with an orthogonal matrix and using it throughout the entire training (products of orthogonal matrices neither explode nor vanish).
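
These ideas are easy to try in practice. A minimal sketch, assuming PyTorch's nn.RNN/nn.LSTM (which expose the hidden-to-hidden matrix as weight_hh_l0); the sizes are illustrative:

```python
import torch.nn as nn
import torch.nn.init as init

# 1. ReLU activation instead of tanh
rnn = nn.RNN(input_size=4, hidden_size=8, nonlinearity='relu', batch_first=True)

# 2. An LSTM, whose gating (including the forget gate) often mitigates vanishing gradients
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)

# 3. Orthogonal initialization of the hidden-to-hidden weight matrix W
init.orthogonal_(rnn.weight_hh_l0)
```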

Gradient Clipping

Consider g, the gradient of the loss function with respect to all network parameters. Now, define some threshold and apply the following clipping condition in the background of the training process. It is a very simple and very effective condition.
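
A common way to write this condition (following Pascanu et al. [6], with threshold denoting the designer-chosen bound) is:

```latex
\text{if } \lVert g \rVert \geq \text{threshold}:\qquad
g \leftarrow \frac{\text{threshold}}{\lVert g \rVert}\, g
```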

By applying gradient clipping, we do not change the gradient's direction, only its magnitude. Since the derivative with respect to the hidden state (h) is the part that causes the exploding gradient, it is enough to clip the gradient that flows through the hidden state.

The threshold is a key parameter that the designer must define manually. We aim to choose the largest threshold that still solves the exploding gradient problem, which can be found by looking at the curve of the gradient norm during training:

Image by Author
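
Most frameworks ship clipping as a one-liner. Below is a hedged sketch of a single training step with clipping, assuming PyTorch and the same toy model as in the earlier snippets; the threshold value is illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(4, 8, batch_first=True)
readout = nn.Linear(8, 1)
params = list(rnn.parameters()) + list(readout.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x, y = torch.randn(1, 50, 4), torch.randn(1, 50, 1)   # toy data

out, _ = rnn(x)
loss = nn.MSELoss()(readout(out), y)

optimizer.zero_grad()
loss.backward()

# Rescale the global gradient norm to at most `threshold`; the function returns
# the norm before clipping, which is useful for plotting the curve mentioned above.
threshold = 1.0   # illustrative value; tune it by inspecting the gradient-norm curve
grad_norm = torch.nn.utils.clip_grad_norm_(params, max_norm=threshold)

optimizer.step()
```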

Summary

In this post, we explored the vanishing and exploding gradients problem in a simple RNN architecture. These two problems belong to the class of open problems in machine learning, and research in this area is very active. The Truncated BPTT and gradient clipping approaches were discussed, along with some tips for implementation.

About the Author

Dr. Barak Or is a professional in the field of artificial intelligence and sensor fusion. He is a researcher, lecturer, and entrepreneur who has published numerous patents and articles in professional journals. Dr. Or leads the MetaOr Artificial Intelligence firm. He founded ALMA Tech. LTD, which holds patents in the fields of AI and navigation. He has worked with Qualcomm as a DSP and machine learning algorithms expert. He completed his Ph.D. in machine learning for sensor fusion at the University of Haifa, Israel. He holds M.Sc. (2018) and B.Sc. (2016) degrees in Aerospace Engineering and a B.A. in Economics and Management (2016, Cum Laude) from the Technion, Israel Institute of Technology. He has received several prizes and research grants from the Israel Innovation Authority, the Israeli Ministry of Defense, and the Israeli Ministry of Economy and Industry. In 2021, he was nominated by the Technion for “graduate achievements” in the field of high-tech.

Website www.metaor.ai Linkedin www.linkedin.com/in/barakor/ YouTube www.youtube.com/channel/UCYDidZ8GUzUy_tYtxvVjRiQ

References and Further reading

[1] Understanding of Multilayer perceptron (MLP). Nitin Kumar Kain, at Medium. 2018.

[2] Understanding Neural Networks. From neuron to RNN, CNN, and Deep Learning. Vibhor Nigam, at Medium. 2018.

[3] Back-Propagation is very simple. Who made it Complicated? Prakash Jay, at Medium. 2017.

[4] Zhang, Jingzhao, et al. “Why gradient clipping accelerates training: A theoretical justification for adaptivity.” arXiv preprint arXiv:1905.11881 (2019).

[5] Chen, Xiangyi, Zhiwei Steven Wu, and Mingyi Hong. “Understanding gradient clipping in private SGD: A geometric perspective.” arXiv preprint arXiv:2006.15429 (2020).

[6] Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. “On the difficulty of training recurrent neural networks.” International conference on machine learning. 2013.

[7] Ribeiro, António H., et al. “Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness.” International Conference on Artificial Intelligence and Statistics. 2020.
