Vanishing And Exploding Gradient Problems in Deep Learning

Fraidoon Omarzai
4 min read · Jul 23, 2024


Exploring the Vanishing and Exploding Gradient Issues in Deep Learning.

Vanishing Gradient Problem

  • It occurs when the gradients of the loss function with respect to the parameters (weights) become very small during backpropagation, effectively preventing the weights from updating properly.
  • It primarily affects deep neural networks, where many layers sit between the input and the output; the more layers there are, the more pronounced the problem becomes.
  • As the gradient is propagated backward, it gets smaller and smaller and approaches zero, leaving the weights of the lower (earlier) layers nearly unchanged. As a result, gradient descent may never converge to a good optimum; a short demonstration follows below.
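
A minimal sketch, assuming PyTorch (the article does not name a framework): a 20-layer MLP with sigmoid activations, where a single backward pass typically leaves the earliest layers with gradient norms many orders of magnitude smaller than the last ones.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A deliberately deep MLP with sigmoid activations between every linear layer.
layers = []
for _ in range(20):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(32, 1))

x = torch.randn(64, 32)
loss = model(x).pow(2).mean()
loss.backward()

# Gradient norms shrink dramatically from the output side toward the input side.
for i, module in enumerate(model):
    if isinstance(module, nn.Linear):
        print(f"layer {i:2d}: grad norm = {module.weight.grad.norm():.2e}")
```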

Causes of the Vanishing Gradient Problem

1. Activation Functions:

  • Sigmoid and tanh squash their inputs into a small range (0 to 1 for sigmoid, -1 to 1 for tanh). Their derivatives are therefore small (the sigmoid's derivative never exceeds 0.25) and approach zero once the units saturate, so each layer scales the gradient down.

2. Weight Initialization:

  • Poor weight initialization can produce pre-activations that are very large or very small; fed into a saturating activation function, these land in regions where the derivative, and hence the gradient, is close to zero.

3. Deep Architectures:

  • As the number of layers increases, small per-layer gradient factors are multiplied together, so the overall gradient can shrink exponentially with depth. A rough calculation of this compounding effect follows below.
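
A back-of-the-envelope illustration of that compounding, in plain Python: if every layer scales the gradient by at most 0.25 (the maximum derivative of the sigmoid), the overall factor decays exponentially with depth.

```python
# If each layer multiplies the gradient by at most 0.25, the factor after
# n layers is 0.25 ** n, which quickly becomes negligible.
for depth in (5, 10, 20, 50):
    print(f"{depth:2d} layers -> gradient scaled by at most {0.25 ** depth:.1e}")
```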

Solutions to the Vanishing Gradient Problem

1. Alternative Activation Functions:

  • ReLU (Rectified Linear Unit): ReLU does not saturate for positive inputs (its derivative there is exactly 1), so it does not keep shrinking the gradient the way sigmoid and tanh do. Variants such as Leaky ReLU, Parametric ReLU, and Exponential Linear Unit (ELU) also keep a non-zero gradient for negative inputs, avoiding the "dead neuron" problem; a small comparison is sketched after this list.
  • Swish and Mish: More recent activation functions that mitigate vanishing gradients by letting a small gradient flow even for small or negative inputs.
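
The small comparison mentioned above, sketched with PyTorch (assumed here): it prints the gradient each activation passes through at a few fixed input values.

```python
import torch
import torch.nn as nn

# ReLU passes gradient 1 for positive inputs; Leaky ReLU and ELU also keep a
# small non-zero gradient for negative inputs, while sigmoid and tanh gradients
# fall toward zero once the input moves away from the origin.
acts = {"sigmoid": nn.Sigmoid(), "tanh": nn.Tanh(),
        "relu": nn.ReLU(), "leaky_relu": nn.LeakyReLU(0.01), "elu": nn.ELU()}

x = torch.tensor([-4.0, -1.0, 0.5, 4.0], requires_grad=True)
for name, act in acts.items():
    x.grad = None                     # reset the accumulated gradient
    act(x).sum().backward()           # d(activation)/dx at each input value
    print(name, x.grad.tolist())
```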

2. Weight Initialization Techniques:

  • Xavier Initialization: Helps to keep the scale of the gradients roughly the same across layers.
  • He Initialization: Specifically designed for layers with ReLU activation functions to maintain the variance of gradients.
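
Both schemes are one-liners in common frameworks; a minimal sketch assuming PyTorch, with hypothetical layer sizes:

```python
import torch.nn as nn

tanh_layer = nn.Linear(256, 256)
relu_layer = nn.Linear(256, 256)

nn.init.xavier_uniform_(tanh_layer.weight)                       # Xavier / Glorot
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")  # He / Kaiming
nn.init.zeros_(tanh_layer.bias)
nn.init.zeros_(relu_layer.bias)
```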

3. Batch Normalization:

  • Normalizes each layer's inputs over the mini-batch (roughly zero mean and unit variance, followed by a learnable scale and shift), keeping the distribution of activations stable during training and the gradients at an appropriate scale.
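
A minimal sketch of where batch normalization usually sits in a fully connected block, assuming PyTorch:

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),   # normalizes each feature over the mini-batch
    nn.ReLU(),
)
```

Calling block.eval() at inference time switches BatchNorm1d to its running statistics instead of per-batch ones.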

4. Residual Networks (ResNets):

  • Introduce skip connections that allow gradients to flow directly through the network, effectively reducing the depth that gradients need to travel through and mitigating the vanishing gradient problem.
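
A minimal residual block sketched with PyTorch (assumed here); the skip connection adds the input back to the transformed output, giving the gradient a short path around the block:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))  # skip connection: x is added back

x = torch.randn(8, 64)
print(ResidualBlock(64)(x).shape)  # torch.Size([8, 64])
```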

5. Gradient Clipping:

  • A technique where gradients are capped at a maximum norm or value during backpropagation. Its main purpose is to stop gradients from becoming too large, which keeps the size of each update under control and training stable.
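
A sketch of where clipping sits in a training step, assuming PyTorch: after backward() has computed the gradients and before the optimizer applies them. The model, optimizer, and loss_fn names are placeholders.

```python
import torch

def training_step(model, optimizer, loss_fn, x, y, max_norm=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Rescale gradients so their combined L2 norm is at most max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```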

6. Using Long Short-Term Memory (LSTM) Networks:

  • In the context of recurrent neural networks (RNNs), LSTMs are specifically designed to address the vanishing gradient problem.
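
A minimal sketch of an LSTM layer, assuming PyTorch; its gating mechanism is what lets gradients survive across many time steps better than in a plain RNN.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=1, batch_first=True)
x = torch.randn(4, 100, 16)           # batch of 4 sequences, 100 time steps each
output, (h_n, c_n) = lstm(x)          # output: (4, 100, 32)
print(output.shape, h_n.shape, c_n.shape)
```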

7. Regularization Techniques:

  • Adding L2 regularization or dropout can help improve the stability of the learning process, indirectly mitigating vanishing gradients.
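
A sketch of both regularizers, assuming PyTorch: dropout as a layer inside the model, and L2 regularization via the optimizer's weight_decay argument.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly zeroes activations during training
    nn.Linear(128, 10),
)
# weight_decay adds an L2 penalty on the weights to every update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```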

Exploding Gradient Problem

  • It occurs when the gradients of the loss function with respect to the weights grow very large during backpropagation. It primarily affects deep neural networks, recurrent neural networks (RNNs), and models with long sequences of operations, and it is more severe the more layers or time steps there are.
  • As backpropagation progresses, the gradient keeps getting larger and larger, producing very large weight updates that cause gradient descent to diverge (the loss often blows up to NaN). A minimal sketch of the effect follows below.
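
A minimal sketch, assuming PyTorch: the same kind of deep MLP as in the earlier sketch, once with a reasonably scaled (roughly He) initialization and once with weights several times too large; the total gradient norm in the second case is astronomically larger.

```python
import torch
import torch.nn as nn

def deep_relu_mlp(weight_std):
    torch.manual_seed(0)
    layers = []
    for _ in range(10):
        linear = nn.Linear(32, 32)
        nn.init.normal_(linear.weight, std=weight_std)  # scale under our control
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.ReLU()]
    return nn.Sequential(*layers, nn.Linear(32, 1))

x = torch.randn(64, 32)
for std in (0.25, 1.0):               # ~He-scaled vs. far too large
    model = deep_relu_mlp(std)
    model(x).pow(2).mean().backward()
    grads = torch.cat([p.grad.flatten() for p in model.parameters()])
    print(f"weight std {std}: total gradient norm = {grads.norm():.2e}")
```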

Causes of the Exploding Gradient Problem

1. Weight Initialization:

  • Weights initialized at too large a scale make activations, and therefore gradients, grow from layer to layer, producing very large gradient values.

2. Deep Architectures and Long Sequences:

  • Deep neural networks and RNNs with many layers or long sequences exacerbate the problem due to the compounding effect of gradients.

Solutions to the Exploding Gradient Problem

1. Gradient Clipping:

  • Norm Clipping: Clip the gradients if their norm exceeds a certain threshold. This keeps the gradients within a manageable range.
  • Value Clipping: Directly limit the values of gradients to be within a specified range.
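
Both variants are single calls in PyTorch (assumed here), applied after loss.backward() and before optimizer.step():

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()

# Norm clipping: rescale all gradients so their combined L2 norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Value clipping: clamp every individual gradient entry to [-0.5, 0.5].
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```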

2. Weight Initialization Techniques:

  • Xavier Initialization: Helps maintain the variance of gradients and activations across layers, suitable for tanh and sigmoid activation functions.
  • He Initialization: Specifically designed for ReLU activation functions to maintain the variance of gradients.

3. Proper Activation Functions:

  • Use activation functions like ReLU and its variants (Leaky ReLU, Parametric ReLU, ELU). Their derivatives never exceed roughly 1, so the activation itself does not amplify the gradient; combined with suitable weight initialization, this helps keep gradient magnitudes under control.

4. Normalization Techniques:

  • Batch Normalization: Normalizes the input of each layer to have a stable distribution, reducing the risk of exploding gradients.
  • Layer Normalization: Similar to batch normalization but normalizes across the features instead of the batch dimension.
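
A small sketch contrasting the two layers, assuming PyTorch; BatchNorm1d normalizes each feature across the mini-batch, while LayerNorm normalizes across the features of each sample and therefore also works with a batch size of 1.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 128)            # batch of 32 samples, 128 features each
bn = nn.BatchNorm1d(128)            # statistics taken over the batch dimension
ln = nn.LayerNorm(128)              # statistics taken over the feature dimension
print(bn(x).shape, ln(x).shape)     # both keep the (32, 128) shape
```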

5. Regularization Techniques:

  • Techniques such as L2 regularization can help in controlling the growth of weight values, indirectly helping with exploding gradients.

6. Adaptive Learning Rates:

  • Using optimization algorithms like RMSprop, AdaGrad, and Adam that adapt the learning rate based on the gradients can help manage large updates.
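
A minimal sketch, assuming PyTorch; the three optimizers share the same interface, so switching between them is a one-line change.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)

# Each optimizer scales its updates using per-parameter statistics of past gradients.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Alternatives with the same interface:
# torch.optim.RMSprop(model.parameters(), lr=1e-3)
# torch.optim.Adagrad(model.parameters(), lr=1e-2)
```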
