Activation Functions: Sigmoid, Tanh, ReLU, Leaky ReLU, Softmax

Mukesh Chaudhary
Aug 28, 2020 · 6 min read


Comparing the sigmoid function with other activation functions, and the importance of ReLU in the hidden layers of a neural network

In this blog, I will compare and analyze the sigmoid (logistic) activation function against other activation functions: tanh, ReLU, Leaky ReLU, and softmax. In my previous blog, I described how the sigmoid function works in the logistic regression algorithm, with the mathematical details and Python implementation code; if you want to read it, you may visit the link. Now I want to analyze the sigmoid function alongside the other activation functions. All of these are activation functions commonly used in neural networks and deep learning. I will not go into depth about neural networks here; we will focus only on activation functions and cover neural networks in another blog.

Basic idea of how a Neural Network works:

There are many algorithms available for solving classification problems. A neural network is one of them, and it is well known for making accurate predictions, although it can take a lot of computational time. It is inspired by the way biological neural systems process data: it contains layers of interconnected nodes, or neurons.

The information moves from the input layer to the hidden layers. In the simplest case, at each layer we just multiply the inputs by the weights, add a bias, apply an activation function to the result, and pass the output to the next layer. We keep repeating this process until we reach the last layer.

Activation Function:

Activation functions are generally of two types:

  1. Linear or Identity Activation Function
  2. Non-Linear Activation Function.

Generally, neural networks use non-linear activation functions, which help the network learn complex data, compute and learn almost any function representing a problem, and provide accurate predictions. They allow back-propagation because they have a derivative that relates changes in the output to changes in the inputs.

Non-linear Activation Functions:

All of the activation functions listed above are non-linear, and we will discuss each of them in more detail below.

  1. Sigmoid Activation Function:

The sigmoid activation function is very simple: it takes a real value as input and outputs a probability that is always between 0 and 1. Its curve looks like an "S" shape.

Sigmoid function and its derivative

It is non-linear, continuously differentiable, monotonic, and has a fixed output range. Its main advantages are that it is simple and works well for classifiers. Its big disadvantages are that it gives rise to the "vanishing gradients" problem, because its derivative becomes very small for inputs far from zero, and that its output is not zero-centered (0 < output < 1), which makes the gradient updates swing in different directions and makes optimization harder. Computing the exponential also adds computational cost in the hidden layers of a neural network.

import numpy as np

# sigmoid function
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

# Derivative of sigmoid function
def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

2. Tanh or Hyperbolic tangent:

Tanh helps to solve the non-zero-centered problem of the sigmoid function. Tanh squashes a real-valued number into the range [-1, 1]. It is non-linear too.

Its derivative has a shape very similar to that of the sigmoid's derivative.

It solves the sigmoid's zero-centering drawback, but it still cannot remove the vanishing gradient problem completely.

When we compare the tanh activation function with the sigmoid, the picture below gives a clear idea of the difference.

# tanh activation function
def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

# Derivative of tanh activation function
def tanh_prime(z):
    return 1 - np.power(tanh(z), 2)

3. ReLU (Rectified Linear Unit):

This is the most popular activation function, and it is widely used in the hidden layers of neural networks. The formula is deceptively simple: max(0, z). Despite its name and appearance, it is not linear, and it provides much the same benefits as the sigmoid but with better performance.

ReLU activation function and its derivative

Its main advantage is that it avoids and rectifies the vanishing gradient problem and is less computationally expensive than tanh and sigmoid. But it also has a drawback: some gradients can be fragile during training and can die, which leads to dead neurons. In other words, for activations in the region z < 0 of ReLU, the gradient is 0, so the corresponding weights are not adjusted during gradient descent. Neurons that fall into this state stop responding to variations in the error or the input (since the gradient is 0, nothing changes). So we should choose the activation function carefully, according to the requirements of the problem.

When we compare it with the sigmoid activation function, it looks like this:

# ReLU activation function
def relu(z):
    return max(0, z)

# Derivative of ReLU activation function
def relu_prime(z):
    return 1 if z > 0 else 0

4. Leaky ReLU

It prevents the dying ReLU problem. This variation of ReLU has a small positive slope in the negative region, so it still allows back-propagation even for negative input values.

Leaky ReLU does not provide consistent predictions for negative input values. Also, if the learning rate is set very high during training, the weight updates can overshoot and kill the neuron.

The idea of Leaky ReLU can be extended even further: instead of multiplying x by a small fixed constant, we can multiply it by a parameter that the network learns during training, which seems to work better than Leaky ReLU. This extension is known as Parametric ReLU (PReLU); a small sketch of it follows the Leaky ReLU code below.

When we compare Leaky ReLU with ReLU, the difference between them becomes clear.

# Leaky ReLU activation function
def leakyrelu(z, alpha):
    return max(alpha * z, z)

# Derivative of Leaky ReLU activation function
def leakyrelu_prime(z, alpha):
    return 1 if z > 0 else alpha

5. Softmax

Generally, we use this function in the last layer of a neural network; it calculates the probability distribution of an event over 'n' different classes. The main advantage of the function is that it can handle multiple classes.

When we compare the sigmoid and softmax activation functions on the same input values, they produce different results:

Sigmoid input values: -0.5, 1.2, -0.1, 2.4

Sigmoid output values: 0.37, 0.77, 0.48, 0.91

Softmax input values: -0.5, 1.2, -0.1, 2.4

Softmax output values: 0.04, 0.21, 0.05, 0.70

The probabilities produced by a sigmoid are independent, and they are not constrained to sum to one: 0.37 + 0.77 + 0.48 + 0.91 = 2.53. This is because the sigmoid looks at each raw output value separately. The softmax outputs, in contrast, are interrelated: by design the softmax probabilities always sum to one, 0.04 + 0.21 + 0.05 + 0.70 = 1.00. In this case, if we want to increase the likelihood of one class, the others have to decrease by the same total amount.

Conclusion:

In conclusion, we can see the advantages and disadvantages of all these activation functions, and we can choose the one that fits our requirements. Generally, we use ReLU in the hidden layers to avoid the vanishing gradient problem and to get better computational performance, and we use the softmax function in the last output layer.

