Overcoming the Silent Killer of Neural Networks: The Vanishing Gradient Problem
What is the Vanishing Gradient Problem?
The vanishing gradient problem is a significant issue in the training of deep neural networks, particularly those with many layers. It occurs when the gradients used to update the neural network's weights during training become extremely small, effectively preventing the weights from changing their values and thus stalling the training process.
This is particularly problematic for activation functions like the sigmoid or tanh, which squash their inputs into a small range, leading to small gradients.
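To see this concretely, here is a minimal NumPy sketch (purely illustrative, with hand-picked input values): it prints the sigmoid output and its derivative, and once the input moves away from zero the derivative collapses toward zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid squashes every input into (0, 1); its derivative s * (1 - s)
# peaks at 0.25 and shrinks toward zero as the input saturates.
for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(f"x = {x:5.1f}   sigmoid = {s:.5f}   derivative = {s * (1 - s):.2e}")
```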
Use of Gradient Descent
In deep learning, gradient descent is used to minimize the loss function, which measures the difference between the network's predictions and the actual targets. The network's weights are adjusted in the direction opposite to the gradient of the loss function with respect to the weights. This adjustment is scaled by the learning rate.
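As a reminder of what a single update looks like, here is a minimal sketch of one gradient descent step on a toy least-squares problem (the data, weights, and learning rate are made up for illustration):

```python
import numpy as np

# Toy least-squares problem: loss(w) = mean((X @ w - y) ** 2)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5])           # targets from known weights

w = np.zeros(3)                              # current weights
learning_rate = 0.1

gradient = 2.0 / len(X) * X.T @ (X @ w - y)  # d(loss)/d(w)
w = w - learning_rate * gradient             # step opposite to the gradient
print("updated weights:", w.round(3))
```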
Causes of the Problem
1. Activation Functions:
Functions like sigmoid and tanh squash input values into a small range (0 to 1 for sigmoid, -1 to 1 for tanh), so their derivatives are small: at most 0.25 for sigmoid, at most 1 for tanh, and close to zero once the inputs saturate. When these factors are multiplied across many layers, the resulting gradient can become vanishingly small.
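A back-of-the-envelope calculation shows how quickly this compounds. Even in the best case, the gradient that reaches the early layers of a sigmoid network is bounded by 0.25 raised to the number of layers (ignoring the weights themselves):

```python
# Chain rule: backprop multiplies one activation derivative per layer.
# With sigmoid that factor is at most 0.25, so the gradient magnitude
# reaching the early layers is bounded by 0.25 ** depth.
for depth in [5, 10, 20, 50]:
    print(f"depth {depth:2d}: upper bound on gradient factor = {0.25 ** depth:.2e}")
```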
2. Initialization:
Poor weight initialization can exacerbate the problem. If initial weights are too small, the signals shrink rapidly as they pass through each layer.
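The following sketch (with made-up layer sizes and an intentionally tiny weight scale) shows the forward signal shrinking layer by layer; the gradients flowing back through the same weights shrink in the same way.

```python
import numpy as np

rng = np.random.default_rng(42)
h = rng.normal(size=(64, 256))     # a batch of 64 activations, 256 features

# 20 linear + tanh layers with weights drawn from a distribution that is
# far too narrow: the activation scale collapses as depth increases.
for layer in range(1, 21):
    W = rng.normal(scale=0.01, size=(256, 256))   # too-small initialization
    h = np.tanh(h @ W)
    if layer % 5 == 0:
        print(f"layer {layer:2d}: activation std = {h.std():.2e}")
```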
Effects
Slow Learning:
Layers close to the input layer learn very slowly since their gradients are too small to make significant updates.
Poor Performance:
The network may fail to learn important features, leading to suboptimal performance on the task.
Solutions
1. Activation Functions:
Use activation functions that mitigate this problem, such as ReLU or its variants. These functions do not saturate for positive inputs, so they maintain stronger gradients.
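A quick comparison of the two derivatives (illustrative helper functions, not library code) shows why: the sigmoid's gradient dies off for large inputs, while ReLU's stays at 1 for any positive input.

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_derivative(x):
    # 1 for any positive input, so ReLU never saturates on the positive side.
    return 1.0 if x > 0 else 0.0

for x in [0.5, 2.0, 5.0, 10.0]:
    print(f"x = {x:4.1f}   sigmoid' = {sigmoid_derivative(x):.2e}   ReLU' = {relu_derivative(x):.1f}")
```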
2. Weight Initialization:
Techniques like "He initialization" (for ReLU) or "Xavier initialization" (for tanh and sigmoid) can help maintain the scale of gradients.
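Both rules simply scale the random weights according to the layer's fan-in (and, for Xavier, the fan-out). Here is a minimal NumPy sketch of the two schemes for a fully connected layer; the layer sizes are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    # He initialization (ReLU): normal with std = sqrt(2 / fan_in)
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def xavier_uniform(fan_in, fan_out):
    # Xavier/Glorot initialization (tanh, sigmoid):
    # uniform in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W_relu = he_normal(256, 256)
W_tanh = xavier_uniform(256, 256)
print("He std:", W_relu.std().round(3), "| Xavier std:", W_tanh.std().round(3))
```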
3. Batch Normalization:
This technique normalizes the inputs to each layer, maintaining a stable distribution of activations throughout training, which helps preserve gradients.
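In essence, each feature is standardized over the current mini-batch and then rescaled by two learnable parameters. A simplified sketch of the forward computation (training-mode only, with made-up activation statistics):

```python
import numpy as np

def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    # Standardize each feature over the batch, then apply a learnable
    # scale (gamma) and shift (beta).
    mean = h.mean(axis=0)
    var = h.var(axis=0)
    h_hat = (h - mean) / np.sqrt(var + eps)
    return gamma * h_hat + beta

rng = np.random.default_rng(1)
h = rng.normal(loc=3.0, scale=10.0, size=(64, 256))   # badly scaled activations
h_bn = batch_norm(h)
print("before:", round(h.mean(), 2), round(h.std(), 2))
print("after :", round(h_bn.mean(), 2), round(h_bn.std(), 2))
```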
4. Residual Networks:
ResNets use skip connections that allow gradients to flow more directly through the network, mitigating the vanishing gradient problem.
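A minimal fully connected residual block in PyTorch (a simplified sketch, not the convolutional blocks of the original ResNet paper) looks like this; the `x +` in the forward pass is the skip connection that gives gradients a direct path backward.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal fully connected residual block: output = x + F(x)."""

    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The skip connection adds the input back, so during backprop the
        # gradient has a direct, unattenuated path around self.body.
        return x + self.body(x)

block = ResidualBlock(dim=128)
x = torch.randn(32, 128)
print(block(x).shape)   # torch.Size([32, 128])
```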
#BackwardPropagation #DataScience #GradientDescent #DeepLearning #VanishingGradient #ActivationFunction #NeuralNetwork #ArtificialIntelligence #GenerativeAI #LLM