Neeharika Patel
Python's Gurus
2 min read · Jul 4, 2024


๐Ÿ’ก๐—ข๐˜ƒ๐—ฒ๐—ฟ๐—ฐ๐—ผ๐—บ๐—ถ๐—ป๐—ด ๐˜๐—ต๐—ฒ ๐—ฆ๐—ถ๐—น๐—ฒ๐—ป๐˜ ๐—ž๐—ถ๐—น๐—น๐—ฒ๐—ฟ ๐—ผ๐—ณ ๐—ก๐—ฒ๐˜‚๐—ฟ๐—ฎ๐—น ๐—ก๐—ฒ๐˜๐˜„๐—ผ๐—ฟ๐—ธ๐˜€: ๐—ง๐—ต๐—ฒ ๐—ฉ๐—ฎ๐—ป๐—ถ๐˜€๐—ต๐—ถ๐—ป๐—ด ๐—š๐—ฟ๐—ฎ๐—ฑ๐—ถ๐—ฒ๐—ป๐˜ ๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ ๐Ÿ“‰

Photo by Marcus Bellamy on Unsplash

๐—ช๐—ต๐—ฎ๐˜ ๐—ถ๐˜€ ๐—ฎ ๐—ฉ๐—ฎ๐—ป๐—ถ๐˜€๐—ต๐—ถ๐—ป๐—ด ๐—š๐—ฟ๐—ฎ๐—ฑ๐—ถ๐—ฒ๐—ป๐˜ ๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ-

The vanishing gradient problem is a significant issue in training deep neural networks, particularly those with many layers. It occurs when the gradients used to update the network's weights during training become extremely small, effectively preventing the weights from changing and thus stalling the training process.

This is particularly problematic for activation functions like the sigmoid or tanh, which squash their inputs into a small range, leading to small gradients.
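
A quick way to see this numerically is to push a signal through a deep stack of sigmoid layers and backpropagate a gradient through it. The NumPy sketch below is only an illustration: the depth, width, and random weight scale are arbitrary choices, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth, width = 20, 64
weights = [rng.normal(scale=0.5, size=(width, width)) for _ in range(depth)]

# Forward pass: keep each layer's activation for the backward pass
x = rng.normal(size=width)
activations = []
for W in weights:
    x = sigmoid(W @ x)
    activations.append(x)

# Backward pass: the chain rule multiplies in a sigmoid derivative (< 0.25) per layer
grad = np.ones(width)  # pretend dLoss/dOutput = 1 at the last layer
for i in reversed(range(depth)):
    a = activations[i]
    grad = weights[i].T @ (grad * a * (1 - a))
    if i % 5 == 0:
        print(f"gradient norm entering layer {i:2d}: {np.linalg.norm(grad):.2e}")
```

On a typical run the gradient norm drops by several orders of magnitude between the last layer and the first, which is exactly the stalling effect described above.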

๐—จ๐˜€๐—ฒ ๐—ผ๐—ณ ๐—š๐—ฟ๐—ฎ๐—ฑ๐—ถ๐—ฒ๐—ป๐˜ ๐——๐—ฒ๐˜€๐—ฐ๐—ฒ๐—ป๐˜-

In deep learning, gradient descent is used to minimize the loss function, which measures the difference between the network's predictions and the actual targets. The network's weights are adjusted in the direction opposite to the gradient of the loss function with respect to the weights. This adjustment is scaled by the learning rate.
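
In symbols, each weight w is updated as w ← w − η · ∂L/∂w, where η is the learning rate. A minimal sketch of that update on a toy one-parameter loss (the quadratic loss and the values here are purely illustrative):

```python
def loss(w):        # toy loss with its minimum at w = 3.0
    return (w - 3.0) ** 2

def grad(w):        # derivative of the loss with respect to w
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1    # initial weight and learning rate
for _ in range(50):
    w -= lr * grad(w)   # step opposite to the gradient, scaled by the learning rate
print(w)                # ends up close to 3.0, the minimum of the loss
```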

๐—ฅ๐—ฒ๐—ฎ๐˜€๐—ผ๐—ป๐˜€ ๐—ณ๐—ผ๐—ฟ ๐˜๐—ต๐—ฒ ๐—–๐—ผ๐—ป๐—ฑ๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€-

๐Ÿญ. ๐—”๐—ฐ๐˜๐—ถ๐˜ƒ๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—™๐˜‚๐—ป๐—ฐ๐˜๐—ถ๐—ผ๐—ป๐˜€:

Functions like sigmoid and tanh squash input values to a small range (0 to 1 for sigmoid, -1 to 1 for tanh), causing their derivatives to be less than 1. When multiplied across many layers, these small derivatives can result in extremely small gradients.
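
To make the scale concrete: the sigmoid's derivative never exceeds 0.25, so stacking layers multiplies together factors of at most 0.25. A tiny back-of-the-envelope check (the depths are arbitrary):

```python
# Upper bound on the gradient factor contributed by each sigmoid layer
max_sigmoid_derivative = 0.25

for depth in (5, 10, 20):
    print(depth, max_sigmoid_derivative ** depth)
# 5  -> ~9.8e-04
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
```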

๐Ÿฎ. ๐—œ๐—ป๐—ถ๐˜๐—ถ๐—ฎ๐—น๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป:

Poor weight initialization can exacerbate the problem. If initial weights are too small, the signals shrink rapidly as they pass through each layer.
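
A short forward-pass sketch shows the shrinkage; the weight scale of 0.01 below is deliberately chosen too small to illustrate the failure mode.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=256)          # input signal with roughly unit variance

for layer in range(10):
    W = rng.normal(scale=0.01, size=(256, 256))   # weights initialized too small
    x = np.tanh(W @ x)
    print(f"layer {layer}: activation std = {x.std():.2e}")
# The standard deviation collapses toward zero layer after layer,
# and the gradients flowing back through these activations shrink with it.
```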

๐—˜๐—ณ๐—ณ๐—ฒ๐—ฐ๐˜๐˜€-

๐—ฆ๐—น๐—ผ๐˜„ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด:

Layers close to the input layer learn very slowly since their gradients are too small to make significant updates.

๐—ฃ๐—ผ๐—ผ๐—ฟ ๐—ฃ๐—ฒ๐—ฟ๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐—ป๐—ฐ๐—ฒ:

The network may fail to learn important features, leading to suboptimal performance on the task.

๐—ฆ๐—ผ๐—น๐˜‚๐˜๐—ถ๐—ผ๐—ป๐˜€-

๐Ÿญ. ๐—”๐—ฐ๐˜๐—ถ๐˜ƒ๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—™๐˜‚๐—ป๐—ฐ๐˜๐—ถ๐—ผ๐—ป๐˜€:

Use activation functions that mitigate this problem, such as ReLU or its variants. These functions do not saturate for positive inputs, so they maintain stronger gradients deep into the network.
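
Swapping the activation is usually a one-line change. The sketch below uses PyTorch purely as an example framework (the article does not prescribe one) and compares the first-layer gradient of a deep sigmoid stack against a ReLU stack:

```python
import torch
import torch.nn as nn

def make_net(activation, depth=20, width=64):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation()]
    return nn.Sequential(*layers)

for name, act in [("sigmoid", nn.Sigmoid), ("relu", nn.ReLU)]:
    torch.manual_seed(0)
    net = make_net(act)
    out = net(torch.randn(32, 64))
    out.pow(2).mean().backward()                 # any scalar loss will do here
    g = net[0].weight.grad.abs().mean().item()   # gradient reaching the first layer
    print(f"{name:7s}: mean |grad| at first layer = {g:.2e}")
# The ReLU network typically keeps a much larger gradient at the first layer.
```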

๐Ÿฎ. ๐—ช๐—ฒ๐—ถ๐—ด๐—ต๐˜ ๐—œ๐—ป๐—ถ๐˜๐—ถ๐—ฎ๐—น๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป:

Techniques like He initialization (for ReLU) or Xavier initialization (for tanh and sigmoid) can help maintain the scale of the gradients across layers.
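
In PyTorch (again just an example framework), both initializers are available in `torch.nn.init`:

```python
import torch.nn as nn

# He (Kaiming) initialization for a layer followed by ReLU
fc_relu = nn.Linear(256, 256)
nn.init.kaiming_normal_(fc_relu.weight, nonlinearity="relu")
nn.init.zeros_(fc_relu.bias)

# Xavier (Glorot) initialization for a layer followed by tanh or sigmoid
fc_tanh = nn.Linear(256, 256)
nn.init.xavier_uniform_(fc_tanh.weight, gain=nn.init.calculate_gain("tanh"))
nn.init.zeros_(fc_tanh.bias)
```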

๐Ÿฏ. ๐—•๐—ฎ๐˜๐—ฐ๐—ต ๐—ก๐—ผ๐—ฟ๐—บ๐—ฎ๐—น๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป:

This technique normalizes the inputs to each layer, maintaining a stable distribution of inputs throughout training, which helps preserve the gradients.
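
As a sketch (PyTorch again, with arbitrary layer sizes), batch normalization is inserted between each linear layer and its activation:

```python
import torch.nn as nn

# BatchNorm1d is the variant for fully connected layers;
# BatchNorm2d would be used for convolutional feature maps.
model = nn.Sequential(
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
```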

๐Ÿฐ. ๐—ฅ๐—ฒ๐˜€๐—ถ๐—ฑ๐˜‚๐—ฎ๐—น ๐—ก๐—ฒ๐˜๐˜„๐—ผ๐—ฟ๐—ธ๐˜€:

Residual networks (ResNets) use skip connections that let gradients flow more directly through the network, mitigating the vanishing gradient problem.
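
A minimal residual block looks like the sketch below (PyTorch, fully connected for brevity); the `x + ...` skip connection is what gives gradients a direct path back to earlier layers:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Fully connected residual block: output = x + F(x)."""

    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity term passes gradients straight through,
        # even when the body's own gradients are small.
        return x + self.body(x)

block = ResidualBlock(64)
print(block(torch.randn(8, 64)).shape)   # torch.Size([8, 64])
```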

#BackwardPropagation #DataScience #GradientDescent #DeepLearning #VanishingGradient #ActivationFunction #NeuralNetwork #ArtificialIntelligence #GenerativeAI #LLM
