Intelligent Signals: Unstable Deep Learning. Why It Happens and How to Solve It

Sayon Dutta
6 min read · May 18, 2017


Deep learning models are basically deep, layered neural networks in which we try to discover hidden patterns/information through multiple layers of abstraction. These abstractions are the reason deep neural networks have an advantage over traditional machine learning algorithms when dealing with complex data at large scale.

As we all know, at least in basic terms, backpropagation is the key behind the learning/training of these deep neural networks. It's backpropagation that trains a deep learning model, and it's the maths behind backpropagation that tells us why such a model is hard to train at the same time.

We all know that a deep neural network is essentially a black box of weights and biases, trained on large amounts of data to discover hidden patterns/information that would otherwise be impossible, or at least not scalable, for humans to find. When we zoom down to the level of an individual neuron (a node in the network), we find that neurons in different layers learn at different speeds, i.e. they have gradients of different magnitudes.

Why, then, are these deep learning models hard to train?
While training, why does the training loss get stuck at a constant value after a certain number of iterations?
Why does the training error sometimes increase gradually after a certain number of iterations?

Learning, or as we say training, propagates from the later layers (right side) back to the early layers (left side). As a result, the later layers learn well but the early layers learn very little during the process. This gets worse as the number of layers increases. And there are also cases where the earlier layers learn extremely fast and the later layers face the slowdown.

Learning about these difficulties and understanding them helps us understand the reasonable solutions available and how training can be made better. Remember, it's not magic, it's mathematics.

VANISHING GRADIENT PROBLEM

Instead of giving you a definition, it's better to create, explore and witness this problem yourself. Follow these steps (a small code sketch after them reproduces the experiment):

- Try creating a neural network with one hidden layer
- Add a couple more hidden layers, one by one
- Observe the gradient of the loss w.r.t. the nodes at different layers
- You will observe that the gradient values get relatively smaller as you move from the later layers to the early layers
- The more layers you add, the more the gradients of the early layers' nodes shrink
- This shows that the early-layer neurons tend to learn more slowly than the later-layer neurons
- The condition gets worse with the addition of more layers.

The phenomenon described in the steps above is what we know as the "vanishing gradient problem".
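If you want to witness this yourself rather than take my word for it, here is a minimal sketch, assuming PyTorch and arbitrary layer widths and data sizes, that builds sigmoid networks of increasing depth and prints the mean absolute gradient of each layer's weights (earliest layer first):

```python
import torch
import torch.nn as nn

def layer_gradients(n_hidden_layers, width=30, n_samples=64, seed=0):
    torch.manual_seed(seed)
    layers, in_dim = [], 10
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(in_dim, width), nn.Sigmoid()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))
    net = nn.Sequential(*layers)

    # one forward/backward pass on random data
    x = torch.randn(n_samples, 10)
    y = torch.randn(n_samples, 1)
    loss = nn.MSELoss()(net(x), y)
    loss.backward()

    # mean absolute weight gradient of each Linear layer, earliest first
    return [m.weight.grad.abs().mean().item()
            for m in net if isinstance(m, nn.Linear)]

for depth in (1, 3, 5):
    print(depth, "hidden layer(s):",
          ["%.1e" % g for g in layer_gradients(depth)])
```

As the depth grows, the values printed for the earliest layers shrink sharply relative to those of the later layers.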

EXPLODING GRADIENT PROBLEM

Try the same approach as above, but now suppose the observations are the exact opposite, i.e. the gradients become larger in the earlier layers. This is known as the "exploding gradient problem".

We will get into the details of both with basic examples. But pause for a moment: sometimes the gradients vanish and sometimes they explode, which means these gradients, the key quantities behind backpropagation, are actually unstable.

So even if the later layers learn properly, the overall model still struggles to classify inputs correctly, because learning in the earlier layers has been hampered, and the earlier layers carry the initial abstractions on which the later layers' abstractions depend.

Simulating a Vanishing Gradient Problem

Let us consider an example:

A multi-layered neural network where each layer contains only one neuron/node;
a_j is the post-activation output of layer j (except a_0, which is the input node);
z_j = w_j · a_(j−1) + b_j is the weighted input to layer j, i.e. the weight times the previous node's output, plus the bias.

Let's try backpropagation on this network, with sigmoid as the activation function at each layer.

Since, for every layer j,

z_j = w_j · a_(j−1) + b_j and a_j = σ(z_j),

therefore, for the bias of the last (4th) layer (where C is the loss),

∂C/∂b_4 = ∂C/∂a_4 · σ′(z_4)

similarly, moving one layer back,

∂C/∂b_3 = ∂C/∂a_4 · σ′(z_4) · w_4 · σ′(z_3)

as a result, chaining all the way back to the first layer,

∂C/∂b_1 = ∂C/∂a_4 · σ′(z_4) · w_4 · σ′(z_3) · w_3 · σ′(z_2) · w_2 · σ′(z_1)

Since the sigmoid function is

σ(z) = 1 / (1 + e^(−z)),

therefore the derivative of the sigmoid function is

σ′(z) = σ(z) · (1 − σ(z)),

which attains its maximum value of 1/4 at z = 0.

Let's say the weights were initially sampled from a standard normal distribution, i.e. mean = 0 and standard deviation = 1, so typically |w_j| < 1.

Let's check further: every extra factor we pick up while moving one layer back then satisfies

|w_j · σ′(z_j)| < 1/4,

so, comparing the first and the third layers,

|∂C/∂b_1| = |σ′(z_1) · w_2| · |σ′(z_2) · w_3| · |∂C/∂b_3| < (1/4)² · |∂C/∂b_3|.

As you can see, the gradient of the loss with respect to the bias in the first layer is far smaller than the gradient with respect to the bias in the third layer, i.e. learning slows down as we move back from the third layer to the first. The gradient at a node is effectively the speed of learning/training at that node.
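To make these magnitudes concrete, here is a minimal NumPy sketch of the single-neuron-per-layer chain above. The input value, the random seed, and the simplification ∂C/∂a_4 = 1 are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

L = 4                                  # four layers, one neuron each
w = rng.standard_normal(L + 1)         # w[1..4] used; index 0 unused for clarity
b = rng.standard_normal(L + 1)

# forward pass: a[0] is the input, z[j] = w[j]*a[j-1] + b[j], a[j] = sigmoid(z[j])
a, z = np.zeros(L + 1), np.zeros(L + 1)
a[0] = 0.5
for j in range(1, L + 1):
    z[j] = w[j] * a[j - 1] + b[j]
    a[j] = sigmoid(z[j])

# backward pass for dC/db_j, taking dC/da_4 = 1 for simplicity
grad, dC_db = 1.0, np.zeros(L + 1)
for j in range(L, 0, -1):
    grad *= d_sigmoid(z[j])            # multiply in sigma'(z_j)
    dC_db[j] = grad                    # this is dC/db_j
    grad *= w[j]                       # multiply in w_j before stepping one layer back

print("dC/db_3 =", dC_db[3], " dC/db_1 =", dC_db[1])
```

With standard-normal weights, dC/db_1 should typically come out one to two orders of magnitude smaller than dC/db_3, which is exactly the slowdown described above.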

Simulating an Exploding Gradient Problem

Taking the same setup one step further: there are also cases where the weights grow large, either at initialisation or during training, so that the factor multiplied in at each successive layer is greater than 1. The updates then become huge and training fails to converge to the optimum. Instead of vanishing gradients we get the totally opposite scenario, called the exploding gradient problem.

Let's consider the same multi-layered neural network (with sigmoid activation functions) from the section above and see how a gradient explosion might occur when the weights are large.

If |w_j · σ′(z_j)| > 1, then for every layer we move back through, a factor greater than 1 gets multiplied into the gradient, so the gradients grow at each earlier layer. For example, if a weight is around 100 and the bias keeps z_j close to zero (so that σ′(z_j) stays near its maximum of 0.25), that factor is roughly 100 × 0.25 = 25 per layer, which quickly explodes the gradients in the earlier layers.
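As a quick numeric check of that per-layer factor (the weight of 100 and z ≈ 0 are the assumptions stated above):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

w, z = 100.0, 0.0
factor = abs(w * d_sigmoid(z))
print(factor)        # 25.0: each extra layer multiplies the gradient by ~25
print(factor ** 3)   # 15625.0: three layers back, the gradient has exploded
```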

What's the actual issue here?

Going through both the vanishing and the exploding gradient cases, we can easily conclude that we are really dealing with an unstable gradient problem in our deep learning models. Every additional hidden layer adds more instability.

Moreover, the toy cases considered above had only one neuron (node unit) per layer, unlike actually deployed deep learning models, which comprise many layers with many neurons each.

Those of you who have studied Recurrent Neural Networks (RNNs), where we backpropagate through time, will recognise the vanishing gradient problem there too: because of it, long-term dependencies and knowledge are lost. That is why gated cell units like LSTMs and GRUs are used instead of the basic RNN cell, since their gates are designed to tackle the vanishing gradient problem.

Possible Solutions

Here are some possible solutions you can put into practice whenever you train your deep learning models (a short sketch after the list shows them together):

- Try using a different activation function (apart from sigmoid), e.g. ReLU
- Incorporate momentum-based stochastic gradient descent
- Initialise the weights and biases properly
- Regularisation (add a regularisation loss to your data loss and minimise the total)
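As a rough illustration, here is a minimal sketch, assuming PyTorch and arbitrary layer sizes, of how these suggestions can be combined: ReLU activations instead of sigmoid, He/Kaiming weight initialisation, momentum-based SGD, and L2 regularisation via weight decay.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),      # ReLU instead of sigmoid
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

# proper initialisation: He/Kaiming init suits ReLU layers
for m in net:
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

# momentum-based SGD; weight_decay adds the L2 regularisation term for us
optimizer = torch.optim.SGD(net.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

# one training step on random data, just to show the loop
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = nn.MSELoss()(net(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Weight decay in the optimiser is one convenient way to add the regularisation term; adding an explicit penalty to the loss works equally well.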

P.S.: Feel free to comment, drop any queries, and suggest possible edits.
