“Activation Functions” in Deep Learning Models: How to Choose?

Shubham Koli
11 min read · Mar 9, 2023


Sigmoid, tanh, Softmax, ReLU, Leaky ReLU EXPLAINED !!!

Introduction

A neuron computes a weighted sum of its inputs and adds a bias; the activation function is then applied to this value to decide whether the neuron should be activated or not. The activation function’s goal is to introduce non-linearity into a neuron’s output.

What is the role of Activation function in Deep Learning?

A neural network without activation functions is basically a linear regression model: no matter how many layers it has, it can only represent linear mappings. Activation functions perform non-linear computations on the inputs of a neural network, enabling it to learn and perform more complex tasks.

Single Neuron in Action

We know that each neuron in a neural network works with its weights, a bias, and an activation function. During training, the weights and biases are updated based on the error at the output; this process is called back-propagation. Activation functions make back-propagation possible because their gradients are propagated backwards along with the error to update the weights and biases.
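To make this concrete, here is a minimal NumPy sketch (illustrative code, not from the original article; the input, weight, and bias values are made up) of a single neuron computing a weighted sum plus bias and passing it through a sigmoid activation:

```python
import numpy as np

def single_neuron(x, w, b):
    """Weighted sum plus bias, passed through a sigmoid activation."""
    z = np.dot(w, x) + b                # pre-activation: weighted sum + bias
    return 1.0 / (1.0 + np.exp(-z))     # activation: sigmoid squashes z into (0, 1)

x = np.array([0.5, -1.2, 3.0])          # example inputs (illustrative values)
w = np.array([0.4, 0.1, -0.6])          # example weights
b = 0.2                                 # example bias
print(single_neuron(x, w, b))
```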

An activation function determines whether a neuron should be activated or not.

What is Activation Function?

“In artificial neural networks, each neuron forms a weighted sum of its inputs and passes the resulting scalar value through a function referred to as an activation function.”

An activation function transforms the activation level of a unit (neuron) into an output signal based on a given set of inputs. It is one of the most important components in building a deep neural network.

Types of Activation functions

  1. Linear Activation Function
  2. Binary Step Function
  3. Non-linear Activation Functions

1. Linear Activation Function

The linear activation function, often called the identity activation function, is proportional to its input, and its range is (-∞, ∞). It simply returns the weighted sum of the inputs (plus bias) unchanged.

Mathematically, it can be represented as:

f(x) = x

Linear Activation Function — Graph
  • Problems: The derivative of a linear function is a constant that no longer depends on the input “x”, so back-propagation provides the same gradient everywhere and the network cannot learn any non-linear behaviour. Stacking linear layers also collapses into a single linear layer.
  • Applications: The linear activation function is typically used only in the output layer, for example in regression.
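A quick NumPy sketch (illustrative only, not part of the original article) of the linear/identity activation and its constant derivative:

```python
import numpy as np

def linear(x):
    return x                     # identity: output is proportional to the input

def linear_derivative(x):
    return np.ones_like(x)       # derivative is a constant (1), independent of x

x = np.array([-2.0, 0.0, 3.5])
print(linear(x))                 # [-2.   0.   3.5]
print(linear_derivative(x))      # [1. 1. 1.]  (the same gradient everywhere)
```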

2. Binary Step Function

In a binary step activation function, a threshold value determines whether a neuron should be activated or not.

Binary Step Function — Graph

Mathematically, it can be represented as:

f(x) = 0 for x < 0, f(x) = 1 for x ≥ 0

Pros and Cons

  • It cannot provide multi-value outputs — for example, it cannot be used for multi-class classification problems.
  • The step function’s gradient is zero, which makes the back propagation procedure difficult.
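A minimal NumPy sketch of the binary step function (illustrative code, not from the original article; the threshold of 0 is the common default):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    # outputs 1 when the input reaches the threshold, 0 otherwise
    return np.where(x >= threshold, 1, 0)

x = np.array([-1.5, -0.1, 0.0, 2.3])
print(binary_step(x))   # [0 0 1 1]
```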

3. Non-linear Activation Functions

A) Sigmoid

Sigmoid accepts a real number as input and returns a number between 0 and 1. It is simple to use and has several desirable qualities of an activation function: non-linearity, continuous differentiability, monotonicity, and a fixed output range.

Derivative or Differential: the change along the y-axis with respect to the change along the x-axis. It is also known as the slope.

Monotonic function: A function which is either entirely non-increasing or non-decreasing.

It is mainly used in binary classification problems, where the sigmoid function gives the probability of a particular class being present.

Sigmoid Activation Function — Graph

Mathematically, it can be represented as:

f(x) = 1 / (1 + e^(-x))

Pros and Cons

  • It is non-linear in nature. Combinations of this function are also non-linear, and it gives an analogue activation, unlike the binary step activation function. It also has a smooth gradient and is well suited to classifier-type problems.
  • The output of the activation function lies in the range (0, 1), compared to (-∞, ∞) for the linear activation function, so the activations are bounded.
  • The sigmoid function gives rise to the problem of “vanishing gradients”: sigmoids saturate and kill gradients.
  • Its output isn’t zero-centered; because every output lies between 0 and 1, gradient updates tend to be pushed in the same direction, which makes optimization harder.
  • As a result, the network either stops learning or learns extremely slowly.
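A short NumPy sketch (illustrative, not from the original article) of the sigmoid and its derivative, showing how the gradient saturates for large positive or negative inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # maximum value is 0.25, at x = 0

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))                  # outputs squashed into (0, 1)
print(sigmoid_derivative(x))       # near zero for large |x|: saturated, "killed" gradients
```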

B) TanH (Hyperbolic Tangent)

TanH compresses a real-valued number into the range [-1, 1]. It is non-linear but, unlike Sigmoid, its output is zero-centered. The main advantage is that strongly negative inputs are mapped to strongly negative outputs, while inputs near zero are mapped to outputs near zero.

TanH Activation Function — Graph

Mathematically, TanH function can be represented as:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Pros and Cons

  • TanH also has the vanishing gradient problem, but the gradient is stronger for TanH than sigmoid (derivatives are steeper).
  • TanH is zero-centered, and gradients do not have to move in a specific direction.
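A minimal NumPy sketch (illustrative only) of TanH and its derivative, which is steeper than the sigmoid's (its maximum is 1 at x = 0):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)               # zero-centered output in [-1, 1]

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2    # steeper than the sigmoid's derivative

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(tanh(x))
print(tanh_derivative(x))
```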

C) ReLU (Rectified Linear Unit)

ReLU stands for Rectified Linear Unit and is one of the most commonly used activation functions in practice. It mitigates the vanishing gradient problem because the gradient of ReLU is 1 for every positive input, so the function never saturates in the positive region. The range of ReLU is [0, ∞).

ReLU Activation Function — Graph

Mathematically, it can be represented as:

f(x) = max(0, x)

Pros and Cons

  • Since only a fraction of neurons are activated at any time (the outputs are sparse) and no exponentials are involved, the ReLU function is far more computationally efficient than the sigmoid and TanH functions.
  • ReLU accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property.
  • One of its limitations is that it should only be used within hidden layers of an artificial neural network model.
  • Some gradients can be fragile during training.
  • In other words, for activations in the region x < 0, the gradient of ReLU is 0, so the corresponding weights are not adjusted during gradient descent. Neurons that get stuck in this state stop responding to variations in the input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem.

Causes of the dying ReLU

  • Setting high learning rates
  • Having a large negative bias

Note: For more details about the dying ReLU problem, check here.
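A short NumPy sketch (illustrative only, not from the original article) of ReLU and its derivative, showing the zero gradient for negative inputs that causes dying ReLU:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    return np.where(x > 0, 1.0, 0.0)   # gradient is exactly 0 for x < 0 (dying ReLU)

x = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])
print(relu(x))                         # [0.  0.  0.  0.5 4. ]
print(relu_derivative(x))              # [0. 0. 0. 1. 1.]  negative inputs get no gradient
```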

D) Leaky ReLU

Leaky ReLU is a modified version of the ReLU activation function designed to solve the dying ReLU problem: it has a small positive slope in the negative region. However, how consistent this benefit is across tasks is still unclear.

Leaky ReLU Activation Function — Graph

Mathematically, it can be represented as,

f(x) = x for x > 0, f(x) = 0.01·x for x ≤ 0 (the slope 0.01 can more generally be a small constant α)

Pros and Cons

  • The advantages of Leaky ReLU are the same as those of ReLU, with the addition that it enables back-propagation even for negative input values.
  • By making this minor modification for negative input values, the gradient on the left side of the graph becomes a small non-zero value, so neurons in that region no longer die.
  • The predictions may still be inconsistent for negative input values.
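A minimal NumPy sketch of Leaky ReLU (illustrative only; the slope 0.01 is the common default for α):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small positive slope (alpha) in the negative region keeps a non-zero gradient
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])
print(leaky_relu(x))                   # [-0.04  -0.005  0.     0.5    4.   ]
print(leaky_relu_derivative(x))        # [0.01 0.01 0.01 1.   1.  ]  never exactly zero
```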

E) ELU (Exponential Linear Units)

ELU is another variation of ReLU that also solves the dying ReLU problem. Just like Leaky ReLU, ELU handles negative values, by introducing an alpha parameter that scales an exponential term for negative inputs.

ELU is slightly more computationally expensive than Leaky ReLU, and it is very similar to ReLU except for negative inputs; both behave as the identity function for positive inputs.

ELU Activation Function — Graph

Mathematically, it can be represented as:

f(x) = x for x > 0, f(x) = α·(e^x - 1) for x ≤ 0

Pros and Cons

  • ELU is a strong alternative to ReLU. Unlike ReLU, ELU can produce negative outputs.
  • ELU involves exponential operations, which increase the computation time.
  • The α value is not learned during training, and ELU does not address the exploding gradient problem.
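A short NumPy sketch of ELU (illustrative only; α = 1.0 is a common default):

```python
import numpy as np

def elu(x, alpha=1.0):
    # identity for positive inputs, smooth exponential curve for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(elu(x))   # negative inputs saturate towards -alpha instead of being zeroed out
```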

F) Softmax

The Softmax function can be thought of as a combination of many sigmoids. It determines relative probabilities: like the sigmoid activation function, it returns a probability for each class/label. In multi-class classification, the softmax activation function is most commonly used in the last layer of the neural network.

The softmax function gives the probability of the current class relative to the others, which means it takes the other classes into account as well.

Softmax Activation Function — Graph

Mathematically, it can be represented as:

softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)

Pros and Cons

  • It mimics the one-hot encoded label better than raw absolute values.
  • We would lose information if we used absolute (modulus) values, whereas the exponential preserves the relative ordering of the scores on its own.
  • Softmax should only be used in the output layer for multi-class classification; it is not appropriate for multi-label classification or regression tasks.
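A minimal NumPy sketch of softmax (illustrative only; the logits are made-up raw scores for three classes):

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability; exponentiate and normalise to sum to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw scores for three classes
probs = softmax(logits)
print(probs)          # approx. [0.659 0.242 0.099]
print(probs.sum())    # 1.0, a valid probability distribution over the classes
```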

G) Swish

Swish allows a small number of negative values to propagate, whereas ReLU sets all non-positive values to zero. This is a crucial property behind the success of smooth, non-monotonic activation functions such as Swish in progressively deeper neural networks.

It’s a self-gated activation function created by Google researchers.

Swish Activation Function — Graph

Mathematically, it can be represented as:

f(x) = x · sigmoid(x) = x / (1 + e^(-x))

Pros and Cons

  • Swish is a smooth activation function: it does not abruptly change direction near x = 0 the way ReLU does. Instead, it bends smoothly below zero for small negative inputs and then curves back upwards.
  • ReLU zeroes out all non-positive values, yet negative values may still be useful for detecting patterns in the data. Swish keeps small negative values while still wiping out large negative ones, preserving sparsity without discarding all of that information.
  • Being non-monotonic, Swish enriches the input-weight interactions the network can learn.
  • It is slightly more computationally expensive than ReLU, and its behaviour in practice is less thoroughly understood.
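A short NumPy sketch of Swish (illustrative only; β = 1 gives the basic version, x · sigmoid(x)):

```python
import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x): smooth and non-monotonic, dips slightly below zero
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(swish(x))   # small negative values survive; large negatives are driven towards 0
```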

Why derivative/differentiation is used ?

When updating the weights, the derivative (slope) tells us in which direction and by how much to change them. That is why differentiation is used in almost every part of Machine Learning and Deep Learning.

Desirable features of an activation function

  1. Vanishing Gradient problem: Neural networks are trained using gradient descent. The backward-propagation step is essentially the chain rule used to compute how the weights should change to reduce the loss after every epoch. Consider a two-layer network where the first layer is represented as f₁(x) and the second as f₂(x); the overall network is o(x) = f₂(f₁(x)). During the backward pass we get o′(x) = f₂′(f₁(x)) · f₁′(x). Here f₁(x) is itself a compound function, Act(W₁·x₁ + b₁), where Act is the activation function after layer 1. Applying the chain rule again, f₁′(x) = Act′(W₁·x₁ + b₁) · W₁, which means the gradient depends directly on the derivative of the activation. Now imagine such a chain rule running through many layers during back-propagation. If the derivative of Act() is always between 0 and 1, several such values get multiplied together to compute the gradient of the initial layers. This shrinks the gradient for the initial layers, and those layers are not able to learn properly: their gradients tend to vanish because of the depth of the network and because the activation pushes the value towards zero. This is called the vanishing gradient problem (a short numeric sketch follows this list). So we want our activation function not to push the gradient towards zero.
  2. Zero-Centered: Output of the activation function should be symmetrical at zero so that the gradients do not shift to a particular direction.
  3. Computational Expense: Activation functions are applied after every layer and need to be calculated millions of times in deep networks. Hence, they should be computationally inexpensive to calculate.
  4. Differentiable: As mentioned, neural networks are trained using gradient descent, hence the layers in the model need to be differentiable, or at least differentiable in parts. This is a necessary requirement for a function to work as an activation function layer.
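As a small numeric illustration of the vanishing gradient point above (hypothetical numbers, not from the article): the sigmoid's derivative is at most 0.25, so multiplying such factors across many layers shrinks the gradient very quickly.

```python
# each backprop step through a sigmoid multiplies the gradient by at most 0.25
max_sigmoid_grad = 0.25
for depth in (5, 10, 20):
    print(depth, max_sigmoid_grad ** depth)
# 5  -> ~9.8e-04, 10 -> ~9.5e-07, 20 -> ~9.1e-13: early-layer gradients effectively vanish
```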

What Activation Function Should I Use?

I will answer this question with the best answer there is: it depends.

Which activation function should you use? Here are some tips.

  • All hidden layers generally use the same activation function. The ReLU activation function should be used only in the hidden layers for better results.
  • Sigmoid and TanH activation functions should not be utilized in hidden layers due to the vanishing gradient.
  • The Swish function has been used in artificial neural networks deeper than about 40 layers.
  • Regression problems should use linear activation functions.
  • Binary classification problems should use the sigmoid activation function.
  • Multiclass classification problems should use the softmax activation function.

Neural network architecture and their usable activation functions,

  • Convolutional Neural Network (CNN): ReLU activation function
  • Recurrent Neural Network (RNN): TanH or sigmoid activation functions

Specifically, it depends on the problem you are trying to solve and the value range of the output you’re expecting.

For example, if you want your neural network to predict values larger than one, then tanh or sigmoid is not suitable for the output layer, and a ReLU (or linear) activation should be used instead.

On the other hand, if we expect the output values to be in the range [0,1] or [-1, 1] then ReLU is not a good choice for the output layer and we must use sigmoid or tanh.

If you perform a classification task and want the neural network to predict a probability distribution over the mutually exclusive class labels, then you should use the softmax activation function in the last layer.
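As an illustration of these guidelines, here is a minimal Keras sketch (assuming TensorFlow is installed; the layer sizes, class count, and the X_train/y_train placeholders are made up for the example) with ReLU in the hidden layers and softmax in the output layer for a mutually exclusive multi-class problem:

```python
import tensorflow as tf

# ReLU in the hidden layers, softmax over 10 mutually exclusive classes in the output layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10)   # X_train: (n_samples, n_features), y_train: integer labels
```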

End Notes:

If you liked this post, share it with your interest group, friends and colleagues. Comment your thoughts, opinions and feedback below. I would love to hear from you. Do follow me for more such articles and to keep me motivated 😀.

It doesn’t cost you anything to clap. 👏
