TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Courage to Learn ML: Tackling Vanishing and Exploding Gradients (Part 2)

Amy Ma
37 min read · May 3, 2024


Activation Functions

What exactly are activation functions, and how do I choose the right one?

Image created by the author using ChatGPT.

How does the choice of activation function affect issues like vanishing and exploding gradients? What criteria define a good activation function?
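A quick way to see the vanishing side of this is to look at the sigmoid's derivative, which never exceeds 0.25. The sketch below (my own illustration, not code from the article) checks this with autograd and then chains sigmoids to show how the gradient shrinks with depth:

```python
import torch

# The sigmoid's derivative is sigma(x) * (1 - sigma(x)): it peaks at 0.25 (x = 0)
# and decays toward 0 for large |x| -- the root of the vanishing-gradient problem.
x = torch.linspace(-6, 6, 7, requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # every entry <= 0.25, nearly 0 at the tails

# Chaining sigmoids compounds the shrinkage: the gradient reaching the input
# is a product of per-step factors that are each at most 0.25.
x0 = torch.tensor([1.0], requires_grad=True)
h = x0
for _ in range(20):          # 20 sigmoid steps, no weights, just to isolate the effect
    h = torch.sigmoid(h)
h.backward()
print(x0.grad)               # on the order of 1e-13: effectively vanished
```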

Given these essential properties, how do popular activation functions build upon our basic model, the Sigmoid, and what makes them stand out?

Image created by the author using Mathcha.com.
Cumulative distribution function of the normal distribution, from Wikipedia. Source: https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Normal_Distribution_CDF.svg/300px-Normal_Distribution_CDF.svg.png

What exactly makes a good activation function, judging by ReLU and the other activation functions it inspired?

Can you give me a more intuitive explanation of why we want activation functions to output negative values?

Image created by the author using ChatGPT.

Could we create an activation function like ReLU that zeros out positive inputs instead, similar to using min(0, x)? Why do we prefer functions that approach zero from the negative side rather than zeroing out the positive inputs?

I get that for functions like Leaky ReLU, we want to output negative values to keep the output centered around zero. But why are ELU, SELU, and GELU specifically designed to saturate with negative inputs?
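As a quick numeric illustration (my own sketch, not the article's code), evaluating these functions on increasingly negative inputs shows the saturation in question: ELU flattens toward −α, SELU toward roughly −1.76, and GELU toward 0 from below, while Leaky ReLU keeps growing linearly:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-0.5, -1.0, -2.0, -5.0, -10.0])

# ELU saturates toward -alpha (default alpha = 1), SELU toward about -1.7581,
# and GELU toward 0 from below; none of them grows without bound on the
# negative side, unlike Leaky ReLU.
print(F.elu(x))         # approaches -1.0
print(F.selu(x))        # approaches about -1.7581
print(F.gelu(x))        # approaches 0 from below
print(F.leaky_relu(x))  # -0.005, -0.01, ...: scales linearly, no saturation
```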

When should we choose each activation function? Why is ReLU still the most popular activation function in practice?

Weight Initialization

Why is weight initialization important, and how can it help mitigate unstable gradients?

What is a good way to initialize weights?

Why not initialize all weights with a small random number?

If so, why not just use a standard normal distribution (N(0,1)) for weight initialization?

So, to control the output values in the middle layers of a neural network, which also serve as inputs for subsequent layers, we use distributions with carefully chosen mean and variance for weight initialization. But how do the most popular methods achieve control over this variance?

Different types of initializations

| Initialization | Activation functions | σ² (Normal) |
| -------------- | ----------------------------- | ----------- |
| Xavier/Glorot | None, tanh, logistic, softmax | 1 / fan_avg |
| He/Kaiming | ReLU and variants | 2 / fan_in |
| LeCun | SELU | 1 / fan_in |
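The variances in the table translate directly into standard deviations once you know a layer's fan-in and fan-out. A small helper to make the arithmetic explicit (my own sketch; the layer sizes are arbitrary):

```python
import math

def init_std(fan_in: int, fan_out: int, scheme: str) -> float:
    """Standard deviation sigma = sqrt(variance) from the table above."""
    fan_avg = (fan_in + fan_out) / 2
    if scheme == "xavier":   # none / tanh / logistic / softmax
        return math.sqrt(1.0 / fan_avg)
    if scheme == "he":       # ReLU and variants
        return math.sqrt(2.0 / fan_in)
    if scheme == "lecun":    # SELU
        return math.sqrt(1.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")

# Example: a fully connected layer with fan_in = 512, fan_out = 256
print(init_std(512, 256, "xavier"))  # ~0.051
print(init_std(512, 256, "he"))      # ~0.0625
print(init_std(512, 256, "lecun"))   # ~0.0442
```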

How is weight initialization implemented in PyTorch, and what makes it special?
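For reference, this is roughly how those schemes are applied through torch.nn.init (a generic sketch, not the article's code; the layer sizes are placeholders):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

def init_weights(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        # He/Kaiming init matches the ReLU layers above; swap in
        # nn.init.xavier_uniform_ for tanh/logistic networks.
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")

model.apply(init_weights)  # recursively applies init_weights to every submodule
```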

I get that we need to be careful in choosing the mean and variance when initializing the weights from a distribution. But what I’m still not clear on is why we would want to draw the initial weights from a normal distribution versus a uniform distribution. Can you explain the reasoning behind using one over the other?
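Both distributions can hit the same target variance; the uniform version just uses a wider bound, since Var(U(−a, a)) = a²/3. A quick empirical check (my own sketch):

```python
import torch
import torch.nn as nn

w_normal = torch.empty(512, 256)
w_uniform = torch.empty(512, 256)

# Both target the Xavier variance 1 / fan_avg; the uniform bound works out to
# a = sqrt(3 * sigma^2) = sqrt(6 / (fan_in + fan_out)).
nn.init.xavier_normal_(w_normal)
nn.init.xavier_uniform_(w_uniform)

print(w_normal.std(), w_uniform.std())  # both close to sqrt(1/384) ~ 0.051
```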

Do we also use those weight initialization methods for the bias terms? How do we initialize the biases?
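The usual answer in code is simply to zero the biases while the weights get the variance-scaled scheme, e.g. (my own sketch):

```python
import torch.nn as nn

layer = nn.Linear(256, 64)

# Weights get a variance-scaled scheme; the bias is simply set to zero.
# (PyTorch's own default for nn.Linear draws both from a small uniform range,
# but explicit zero biases are the common choice.)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)
```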

Batch Normalization

Why does batch normalization work? Why is making the input to each layer have zero mean and unit variance helpful for solving gradient issues?
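To ground the idea, here is the batch-norm transform written out by hand and checked against nn.BatchNorm1d in training mode (a sketch with made-up batch statistics and default settings):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 8) * 5 + 3          # a batch with mean ~3 and std ~5 per feature
eps = 1e-5

# Normalize each feature with the *batch* statistics, then rescale with the
# learnable gamma (weight) and shift with beta (bias); at init gamma=1, beta=0.
mean = x.mean(dim=0, keepdim=True)
var = x.var(dim=0, unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + eps)

bn = nn.BatchNorm1d(8)
bn.train()
print(torch.allclose(bn(x), x_hat, atol=1e-5))  # True: same transform
print(x_hat.mean(dim=0), x_hat.std(dim=0))      # ~0 mean, ~1 std per feature
```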

How do we apply batch normalization? Should it be placed before or after the activation? How do we handle it during training and testing?
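A minimal sketch of both placements and of the train/test switch (my own example; which placement works better in a given architecture is an empirical question):

```python
import torch
import torch.nn as nn

# Placement is a design choice: the original paper applies BN before the
# activation, but putting it after the activation is also common in practice.
bn_before_act = nn.Sequential(nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU())
bn_after_act = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.BatchNorm1d(64))

x = torch.randn(16, 64)

# Training: normalize with the current batch's statistics and update the
# running mean/variance estimates as a side effect.
bn_before_act.train()
_ = bn_before_act(x)

# Testing: freeze and reuse the running statistics, so even a single example
# gets a deterministic output that doesn't depend on its batch.
bn_before_act.eval()
with torch.no_grad():
    _ = bn_before_act(torch.randn(1, 64))
```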

Why is batch normalization applied during the forward pass rather than directly to the gradients during backpropagation?

You mentioned the option of clipping gradients instead of directly normalizing them. Why exactly do we choose to clip gradients rather than flooring them?
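For completeness, the two standard clipping utilities in PyTorch, applied between backward() and the optimizer step (a generic sketch, not the article's code):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()

# Clip by global norm: if the combined L2 norm of all gradients exceeds 1.0,
# rescale them together (the direction is preserved, only the magnitude shrinks).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Alternative: clamp each gradient element into [-0.5, 0.5] (can change direction).
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

optimizer.step()
```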

In Practice (Personal Experience)

What’s the reality? What’s the common process in practice?



Written by Amy Ma

Tech, life, and the chaos in between—fueled by curiosity, caffeine, and a toddler 🍼☕🐾 Want more? My newsletter -https://theamyma101.substack.com
