[ML Shot of the Day] Clash of the Titans

Why is ReLU Activation preferred over Sigmoid Activation?

Diving Deeper into Deep Learning — ReLU vs Sigmoid Activation function.

Pritish Jadhav
Geek Culture


Photo by Ivan Aleksic (Unsplash)
  • Over the last few years, Deep Neural Network architectures have played a pivotal role in solving some of the most complex machine learning problems.
  • Training a deep neural network is not trivial and often involves optimizing billions of parameters over terabytes of data.
  • Building a robust trainable network is an art that needs practice and an understanding of how the basic building blocks interact with each other.
  • In this blog post, we will dive deeper into the choice of activation function and, specifically, discuss the advantages of using the ReLU activation over the sigmoid activation function.

A Crash Course on ReLU and Sigmoid Activation:

  • Consider a 1-unit, 1-layer neural network (equivalent to logistic regression when the activation is sigmoid) with an input of size n and a 1-D (scalar) output; a minimal sketch of this setup follows.
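To make the setup concrete, here is a small sketch of such a unit in plain NumPy (my own illustration, not code from the original post): a dot product plus a bias, followed by an activation. The input size, weights, and random seed are arbitrary choices.

```python
# A 1-unit, 1-layer network: z = w.x + b, followed by an activation g(z).
# (Illustrative sketch; the input size and weights are arbitrary.)
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 4                          # input size
rng = np.random.default_rng(0)
x = rng.standard_normal(n)     # input vector
w = rng.standard_normal(n)     # weights of the single unit
b = 0.0                        # bias

z = w @ x + b                  # scalar pre-activation
print("ReLU output:   ", relu(z))
print("Sigmoid output:", sigmoid(z))
```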

ReLU Activation:

  • ReLU is a monotonically non-decreasing, piecewise-linear activation function: f(x) = max(0, x).
  • It is important to note that the ReLU activation function maps every negative input to zero.
  • As a result, a negative pre-activation produces an activation of 0, and since the derivative of a constant is zero, it also contributes a gradient of 0 for that unit.
  • This phenomenon is termed “dead neurons”. Such neurons stop learning and may result in suboptimal performance of the network.
  • The problem of “dead neurons” can arise either from large learning rates or from a large negative bias.
  • Using a smaller learning rate or a variation of ReLU (Leaky ReLU, ELU) can help ameliorate the problem of “dead neurons”; a short sketch of ReLU and its gradient follows this list.
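The sketch below (mine, not from the original post) evaluates ReLU and its gradient on a few sample pre-activations, which makes the dead-neuron behaviour visible: negative inputs yield both a zero activation and a zero gradient.

```python
# ReLU, its gradient, and the "dead neuron" effect: any negative
# pre-activation yields activation 0 and gradient 0, so no update signal
# flows through that unit.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative is 1 for z > 0 and 0 for z < 0 (undefined at exactly 0;
    # frameworks conventionally use 0 or 1 there).
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("activation:", relu(z))       # [0.  0.  0.  0.5 3. ]
print("gradient:  ", relu_grad(z))  # [0. 0. 0. 1. 1.]
```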

Sigmoid Activation:

  • The sigmoid activation is an ‘S-shaped’ curve, g(x) = 1 / (1 + e^(-x)), that maps input values into the range (0, 1).
  • The value of the sigmoid function asymptotically approaches 0 for large negative inputs and 1 for large positive inputs, which is in line with the way probabilities behave; a short sketch of the function and its derivative follows this list.
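The following sketch (my own, plain NumPy) evaluates the sigmoid and its derivative, g'(x) = g(x) * (1 - g(x)), at a few points. Note that the derivative peaks at 0.25 at x = 0 and decays towards 0 for large |x|, which is the saturation behaviour discussed below.

```python
# Sigmoid and its derivative: the derivative never exceeds 0.25 and
# vanishes as the input saturates in either direction.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("g(z): ", np.round(sigmoid(z), 4))       # [0.0067 0.2689 0.5 0.7311 0.9933]
print("g'(z):", np.round(sigmoid_grad(z), 4))  # [0.0066 0.1966 0.25 0.1966 0.0066]
```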

Most deep neural network architectures with state-of-the-art results use ReLU activation by default over sigmoid activation. This raises the question: why is ReLU preferred over sigmoid?

The answer lies in backpropagation.

  • Training a deep neural network involves two steps: forward propagation and backward propagation.
  • Backward propagation computes the gradient of the cost function with respect to the weights. By the chain rule, this involves the derivative of each layer's activation function with respect to its pre-activation.
  • The derivative of the sigmoid function is g'(x) = g(x)(1 - g(x)). Since both factors are less than 1 (and their product never exceeds 0.25), each sigmoid layer shrinks the gradient.
  • It is easy to see that repeatedly multiplying such sigmoid derivatives yields a value that approaches 0.
  • In addition, the gradient at the current layer depends on the gradient of the next layer. For n hidden layers this ultimately results in the multiplication of n small values, preventing early layers in the network from training effectively (see the toy calculation after this list).
  • This is termed the problem of vanishing gradients, and it prevents us from effectively training deep neural networks with sigmoid activations.
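Here is a toy illustration (not a real training loop, and not from the original post) of why chained sigmoid gradients vanish: backpropagation multiplies one g'(z) <= 0.25 factor per layer, so the signal reaching the early layers shrinks geometrically with depth.

```python
# Multiply one sigmoid-derivative factor per layer and watch the
# backpropagated signal shrink as the network gets deeper.
import numpy as np

def sigmoid_grad(z):
    g = 1.0 / (1.0 + np.exp(-z))
    return g * (1.0 - g)

np.random.seed(0)
for n_layers in (2, 10, 30):
    z = np.random.randn(n_layers)           # one pre-activation per layer
    grad_signal = np.prod(sigmoid_grad(z))  # product of per-layer factors
    print(f"{n_layers:>2} layers -> gradient factor ~ {grad_signal:.2e}")
```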

How to prevent gradients from vanishing?

  • One of the easiest ways to fix the problem of vanishing gradients is to replace the sigmoid activation function with ReLU, whose gradient is 1 for every positive pre-activation and therefore does not shrink the backpropagated signal.
  • Some of the more complex solutions involve residual networks, where the activations from an earlier layer are added to the pre-activation of a layer deeper in the network, giving gradients a shortcut path (a minimal sketch follows).
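Below is a highly simplified residual-block sketch in plain NumPy (my own, assuming two small linear layers as the block body): the input x is added to the block's output before the final activation, so gradients can flow through the identity path even when the block's internal gradients are small.

```python
# A minimal residual block: output = relu(F(x) + x), where the skip
# connection adds the earlier activation to the deeper pre-activation.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    f = W2 @ relu(W1 @ x)   # F(x): two small linear layers with a ReLU
    return relu(f + x)      # skip connection added to the pre-activation

d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W1 = rng.standard_normal((d, d)) * 0.1
W2 = rng.standard_normal((d, d)) * 0.1
print(residual_block(x, W1, W2).shape)   # (8,)
```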

Final Thoughts:

  • One could argue that using the ReLU activation may instead lead to the problem of exploding gradients.
  • However, exploding gradients are easier to fix by employing gradient clipping (a small sketch follows this list).
  • It is important to note that both “dead neurons” and vanishing gradients are roadblocks to training an optimal network.
  • However, with deep neural network architectures, the probability of encountering vanishing gradients is higher, leaving the early layers unable to learn useful features.
  • On the other hand, dead neurons in the later stages of the network do not prevent the early layers from training.
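As a closing illustration, here is a hedged sketch of gradient clipping by global norm in plain NumPy (written for this post; deep learning frameworks ship equivalent utilities): if the combined gradient norm exceeds a threshold, all gradients are rescaled proportionally.

```python
# Gradient clipping by global norm: rescale all gradients when their
# combined L2 norm exceeds max_norm, capping the update size.
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm)      # 13.0
print(clipped)   # rescaled so the new global norm is 5.0
```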
