Understanding Deep Neural Networks for beginners — Part 2

Chamuditha Kekulawala
5 min read · Jul 13, 2024


In part 1, we gave an introduction to DNNs and their challenges. Now let’s talk about non-saturating activation functions.

Non-saturating Activation Functions

The vanishing/exploding gradients problems we discussed in the previous article were in part due to a poor choice of activation function. For a long time, most people assumed that since biological neurons use roughly sigmoid activation functions, they must be an excellent choice for artificial neurons as well. But it turns out that other activation functions behave much better in DNNs, in particular the ReLU activation function, mostly because it does not saturate for positive values (and also because it is quite fast to compute).

Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as the dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network’s neurons are dead, especially if you used a large learning rate.

A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting 0s, and gradient descent does not affect it anymore, since the gradient of the ReLU function is 0 when its input is negative.
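
To make this concrete, here is a minimal NumPy sketch of ReLU and its gradient: for any negative input the gradient is exactly 0, so gradient descent has nothing to propagate through the neuron.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # derivative of ReLU: 1 for z > 0, 0 otherwise
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]  -> no gradient for negative inputs
```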

Unless it is part of the first hidden layer, a dead neuron may sometimes come back to life: gradient descent may indeed tweak neurons in the layers below, in such a way that the weighted sum of the dead neuron’s inputs is positive again.

To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU. This function is defined as:

LeakyReLUα(z) = max(αz, z)

The hyperparameter α defines how much the function “leaks”: it is the slope of the function for z < 0, and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.
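
As a rough sketch, the leaky variant is a one-liner in NumPy, and Keras ships it as a layer you place after a Dense layer (the slope argument is named alpha in older Keras versions and negative_slope in more recent ones, so it is passed positionally below):

```python
import numpy as np
from tensorflow import keras

def leaky_relu(z, alpha=0.01):
    # slope alpha for z < 0, identity for z >= 0
    return np.maximum(alpha * z, z)

# In Keras, use the built-in LeakyReLU layer after a Dense layer:
model = keras.models.Sequential([
    keras.layers.Dense(64, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(0.2),  # slope 0.2 for z < 0 (the "huge leak" setting)
    keras.layers.Dense(10, activation="softmax"),
])
```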

A 2015 paper compared several variants of the ReLU activation function, and one of its conclusions was that the leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (a huge leak) seemed to result in better performance than α = 0.01 (a small leak).

They also evaluated the randomized leaky ReLU (RReLU), where α is picked randomly in a given range during training, and it is fixed to an average value during testing. It also performed fairly well and seemed to act as a regularizer (reducing the risk of overfitting the training set).

Finally, they also evaluated the parametric leaky ReLU (PReLU), where α is authorized to be learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). This was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
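
In Keras, PReLU is available as a built-in layer: by default it learns one α per unit, updated by backpropagation like any other weight. RReLU has no standard Keras layer, so this sketch only covers PReLU:

```python
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Dense(64, kernel_initializer="he_normal"),
    keras.layers.PReLU(),  # alpha is a trainable parameter, learned during training
    keras.layers.Dense(10, activation="softmax"),
])
```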

Last but not least, a 2015 paper proposed a new activation function called the exponential linear unit (ELU) that outperformed all the ReLU variants in their experiments: training time was reduced and the neural network performed better on the test set. It is defined as:

ELUα(z) = α(exp(z) − 1) if z < 0, and z if z ≥ 0

It looks a lot like the ReLU function, with a few major differences:

  1. It takes on negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem, as discussed earlier. The hyperparameter α defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter if you want.
  2. It has a nonzero gradient for z < 0, which avoids the dead neurons problem.
  3. If α is equal to 1, then the function is smooth everywhere, including around z = 0, which helps speed up gradient descent, since it does not bounce around as much to the left and right of z = 0.
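
Here is a minimal NumPy sketch of the definition above (Keras also provides it built in, e.g. activation="elu"):

```python
import numpy as np

def elu(z, alpha=1.0):
    # alpha * (exp(z) - 1) for z < 0, identity for z >= 0
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

z = np.array([-5.0, -1.0, 0.0, 1.0])
print(elu(z))  # approaches -alpha for large negative z; equals z for z >= 0
```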

The main drawback of the ELU activation function is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.

Moreover, in a 2017 paper called “Self-Normalizing Neural Networks”, the authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function (which is just a scaled version of the ELU activation function), then the network will self-normalize: the output of each layer will tend to preserve mean 0 and standard deviation 1 during training, which solves the vanishing/exploding gradients problem. As a result, this activation function often outperforms other activation functions very significantly for such neural nets (especially deep ones).

However, there are a few conditions for self-normalization to happen:

  • The input features must be standardized (mean 0 and standard deviation 1).
  • Every hidden layer’s weights must also be initialized using LeCun normal initialization.
  • The network’s architecture must be sequential. Unfortunately, if you try to use SELU in non-sequential architectures, such as RNNs or networks with skip connections (i.e., connections that skip layers, such as in wide & deep nets), self-normalization will not be guaranteed, so SELU will not necessarily outperform other activation functions.
  • The paper only guarantees self-normalization if all layers are dense. However, in practice the SELU activation function seems to work great with convolutional neural nets as well.
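
As a sketch of these conditions in Keras (the depth and layer sizes below are arbitrary choices for illustration), a self-normalizing net combines SELU activations with LeCun normal initialization in every hidden layer, assuming the inputs have already been standardized:

```python
from tensorflow import keras

# A plain stack of dense layers: SELU activation + LeCun normal initialization.
# The inputs fed to this model are assumed to be standardized (mean 0, std 1).
model = keras.models.Sequential()
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="selu",
                                 kernel_initializer="lecun_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))
```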

Which activation function should you use for hidden layers?

Although your mileage will vary, in general:

SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.

  • If the network’s architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0).
  • If you care a lot about runtime latency, then you may prefer leaky ReLU.
  • If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.

In part 3 we’ll talk about Batch normalization! Thanks for reading 🎉
