10 Popular Types of Activation Functions

Everything you need to know about activation functions in deep learning!

Gontla Praveen
Aug 13, 2020 · 9 min read

Let us quickly revise the Artificial Neural Network: a neuron in an ANN takes inputs from the input layer or from previous layers, multiplies those inputs by weights, adds a bias, applies an activation function to the result, and passes the output on to the next layer.

Artificial Neural Network (source)
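As a quick sketch of that forward pass (the numbers and the choice of sigmoid here are arbitrary, just for illustration):

```python
import numpy as np

def sigmoid(z):
    # One possible activation function (covered in detail later in this post)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(inputs, weights, bias, activation=sigmoid):
    # Multiply the inputs by the weights, add the bias, then apply the activation
    z = np.dot(weights, inputs) + bias
    return activation(z)

# Toy example: three inputs feeding a single neuron
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05
print(neuron_forward(x, w, b))  # this output is passed on to the next layer
```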

Now, let us try to understand activation functions in detail.

What is an Activation Function?

An activation function is nothing but a mathematical “gate” between the input of the current neuron and its output going to the next layer. Activation functions decide whether that neuron should be activated or not. Their main purpose is to add non-linearity to the neural network, which helps the network focus on important information and suppress irrelevant data.

Neuron (source)

Is it necessary to use Activation functions?

If we don't use activation functions, the neural network can only perform linear transformations, repeatedly multiplying inputs by weights and adding biases. This reduces computation, but such a network cannot model non-linear, complex data. Thus we need non-linear transformations in the network, and activation functions are what add this non-linearity.

Properties of Activation function:

  1. Differentiable: Activation functions should be differentiable, because backpropagation needs the derivative of every neuron's activation in order to optimize the loss function.
  2. Computational Expense: Activation functions are applied after every layer and are evaluated millions of times in a deep neural network, so they should be computationally inexpensive.
  3. Vanishing Gradient problem: We will discuss this in the next section in detail. Intuitively, the model stops learning because the gradient barely changes for extreme input values; this is the vanishing gradient problem.
  4. Monotonic: Most activation functions are monotonic, i.e. entirely non-decreasing or entirely non-increasing (though some, such as Swish below, are not).

You need to be familiar with basic calculus to understand these properties.

Okay, let's understand the different types of activation functions:

Types of Activation Functions

  • Binary Step function
  • Linear function
  • Non-Linear function

Binary Step Function

In simple terms, a binary step function activates a neuron only if its input is above a threshold value; otherwise the neuron is deactivated and its output is not passed on to the next hidden layer.

Let us look at it mathematically-
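f(x) = 1 if x ≥ threshold, otherwise f(x) = 0 (the threshold is often simply taken to be 0).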

Limitations:

  • Applicable only for binary classification, not multi-class classification
  • The gradient of the step function is zero, so weights and biases won't be updated during backpropagation

Linear Function:

Linear Function and Derivative of Linear Function

Equation: A linear function has an equation similar to that of a straight line,

i.e. y = ax

Linear activation is proportional to the input. The variable ‘a’ is any constant value.

Linear activation functions have one advantage over the binary step: the gradient is not zero. However, the derivative is a constant that does not depend on the input x at all, so weights and biases are updated during backpropagation, but with the same updating factor for every input.

If we use linear activation functions, no matter how many layers the neural network has, the output of the last layer is still a linear function of the input to the first layer (a linear combination of linear functions is itself a linear function).

Limitations:

  • The derivative is a constant, so the gradient carries no information about the input.
  • All layers of the neural network effectively collapse into a single layer.

Non-Linear Activation Functions

Practical real-life problems are often non-linear in nature, and such problems cannot be solved without non-linear functions. Non-linear activation functions are essential to introduce this non-linearity into the network, which helps neural networks learn and model non-linear, complex data such as images, video, audio, and high-dimensional datasets.

10 commonly used Non-Linear Activation Functions

  1. Sigmoid Function
  2. Tanh Function
  3. ReLU
  4. Leaky ReLU
  5. Exponential Linear Unit (ELU)
  6. Parametric ReLU
  7. SWISH
  8. Softmax
  9. Soft Plus
  10. MaxOut

1. Sigmoid function

The sigmoid function squashes input values into the range between 0 and 1.

“Sigmoid activation function is mainly preferred for classification problems”

Here is the mathematical expression for sigmoid:
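sigmoid(x) = 1 / (1 + e^(-x))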

Sigmoid function (blue) Derivative of Sigmoid (Orange) (source)

Advantages

  • Smooth gradient, preventing “jumps” in output values.
  • Sigmoid is an S-shaped, ‘monotonic’ and ‘differentiable’ function.
  • Output values lie between 0 and 1, normalizing the output of each neuron.

Disadvantages

  • Vanishing gradient — for extreme values of x, the gradient is almost zero. No gradient means no learning; this is the vanishing gradient problem (a small numerical sketch follows this list).

Note: The vanishing gradient arises because sigmoid squashes a large input range into the small output range (0, 1).

  • Not zero-centered: since the output always lies between 0 and 1, gradient updates tend to zig-zag, which makes optimization harder.
  • Computationally expensive: the exponential makes each evaluation costly and convergence slow.
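Here is a small numerical sketch of the vanishing-gradient effect (the x values are arbitrary illustrative points):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
# At x = +/-10 the gradient is about 0.00005, so the weights connected to this
# neuron barely move during backpropagation.
```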

2. Tanh function

The tanh function is similar to the sigmoid function, but it is symmetric around the origin (a zero-centered function). Tanh ranges from -1 to 1. The Tanh function is defined as-
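tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))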

Tanh Function (Blue) Derivative of Tanh (orange) (source)

Advantages

  • Tanh is a ‘monotonic’ and ‘differentiable’ function.
  • Zero-centered function: makes it easier for the model to handle strongly negative, neutral, and strongly positive input values.
  • Optimization is easier than with sigmoid.

Disadvantages:

  • Vanishing gradient problem.
  • Computationally expensive; Slow convergence due to exponential function

3. ReLU (Rectified Linear Unit)

The ReLU activation function activates a neuron only if its input is greater than 0; otherwise the output is 0 and the neuron is deactivated. ReLU is mathematically defined as:
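f(x) = max(0, x), i.e. f(x) = x for x > 0 and f(x) = 0 for x ≤ 0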

ReLU Function (Blue) Derivative of ReLU (orange) (source)

Advantages:

  • ReLU and its derivative are monotonic (ReLU is differentiable everywhere except at x = 0).
  • Computationally efficient — it converges very quickly.

Disadvantages

  • ReLU is not a zero-centered function.
  • The Dying ReLU problem — ReLU outputs zero for every negative input. If the bias is very negative and the inputs are never positive enough to overcome it, the neuron always outputs zero; such dead neurons don’t learn.

4. Leaky ReLU

Unlike ReLU, Leaky ReLU allows a small constant slope for negative inputs, so backpropagation still works even for negative input values. It is defined as:
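f(x) = x for x > 0 and f(x) = 0.01·x for x ≤ 0 (0.01 is the commonly used fixed slope)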

Leaky ReLU Function (Blue) Derivative of Leaky ReLU (orange) (source)

Advantages:

  • Addresses the Dying ReLU problem.
  • Enables backpropagation, even for negative input values
  • Leaky ReLU and its derivative are both monotonic.

Disadvantages

  • Inconsistent results for negative inputs, since the small negative slope is a fixed, arbitrary choice rather than a learned value.

5. Exponential Linear Unit (ELU)

The Exponential Linear Unit (ELU) is an improved version of ReLU designed to solve the Dying ReLU problem. It is defined as:
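f(x) = x for x > 0 and f(x) = α(e^x − 1) for x ≤ 0, where α is a positive constant (often 1)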

ELU function

Advantages:

  • Addresses the Dying ReLU problem.
  • ELU uses an exponential curve, α(e^x − 1), for negative values instead of a straight line.
  • Its mean activations are closer to zero (closer to a zero-centered function).

Disadvantages:

  • Computationally intensive.
  • Slow convergence due to exponential function.

6. Parametric ReLU

PReLU is an extended version of ReLU. For negative inputs, PReLU has a small slope, which avoids the Dying ReLU problem and dead neurons. Unlike ELU's exponential curve, PReLU uses a parameter ‘α’ as a linear slope for negative input values.

Mathematically, Parametric ReLU is defined as:
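f(x) = x for x > 0 and f(x) = αx for x ≤ 0, where α is learned during training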

ReLU vs PReLU (source)

Advantages

  • Proposed to solve the Dying ReLU problem.
  • Allows the negative slope to be learned — α, the slope of the negative part of the function, is a parameter that is trained during backpropagation, so the network learns the most appropriate value of α.
  • Even if the slope is small, it does not tend to 0, which is a certain advantage.
  • Faster and more optimum convergence.

Disadvantages

  • Shares the remaining drawbacks of Leaky ReLU.
  • May perform differently for different problems.
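To make the differences between these ReLU variants concrete, here is a minimal NumPy sketch (the slope 0.01 and the α values below are common illustrative choices, not fixed standards):

```python
import numpy as np

def relu(x):
    # 0 for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Small fixed slope for negative inputs instead of a hard zero
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # Exponential curve for negative inputs, identity for positive inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def prelu(x, alpha):
    # Same shape as Leaky ReLU, but alpha is learned during training
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu(x))              # [0. 0. 0. 1. 3.]
print(leaky_relu(x))        # [-0.03 -0.01  0.    1.    3.  ]
print(elu(x))               # approx. [-0.95 -0.63  0.    1.    3.  ]
print(prelu(x, alpha=0.1))  # [-0.3 -0.1  0.   1.   3. ]
```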

7. SWISH

The Swish function was inspired by the way the sigmoid function is used for gating in LSTMs and highway networks. In Swish, the same value is used both as the input and as its own gate, which simplifies the gating mechanism; this is called self-gating.

Mathematically, SWISH is defined as:

f(x) = x * sigmoid(x)

SWISH Function (blue) Derivative of SWISH (orange) (source)

Advantages:

  1. In the experiments reported by its authors, Swish achieved higher test accuracy than ReLU on deep networks.
  2. Across the batch sizes tested, Swish performed at least as well as ReLU.
  3. Swish is a non-monotonic function: over part of the negative input range, the output decreases even as the input increases (see the quick numerical check below).
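A quick numerical check of that non-monotonicity (the sample points are arbitrary):

```python
import numpy as np

def swish(x):
    # f(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

for x in [-4.0, -2.0, -1.278, -0.5, 0.0, 1.0]:
    print(f"x={x:7.3f}  swish={swish(x):.4f}")
# From x = -4 up to about x = -1.278 the output *decreases* as x increases
# (reaching a minimum of roughly -0.278), then it rises again toward 0,
# so Swish is not monotonic.
```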

8. Softmax

The Softmax function is used for multiclass classification problems. It returns a vector of size K (K = number of classes) whose entries are probabilities (values in the range [0, 1], summing to 1) that the data point belongs to each individual class.

Here is the mathematical expression for softmax:
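softmax(x_i) = e^(x_i) / Σ_j e^(x_j), where the sum in the denominator runs over all K classes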

Softmax Function
Softmax Activation Function (source)

For a multiclass classification problem, the output layer has the same number of neurons as there are target classes. For example, if you have three target classes, there will be three neurons in the output layer. Suppose the raw output values of those neurons are [1.2, 0.9, 0.75]. After passing these values through the softmax function, you get a probability vector of roughly [0.42, 0.31, 0.27]. These values represent the probability of the data point belonging to each target class; here the data point is assigned to the first class, since its probability is highest.
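You can verify those numbers with a few lines of NumPy (the logits [1.2, 0.9, 0.75] are the same illustrative values as above):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max is a standard trick for numerical stability
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

logits = np.array([1.2, 0.9, 0.75])
probs = softmax(logits)
print(np.round(probs, 2))  # approx. [0.42 0.31 0.27]
print(probs.sum())         # approx. 1.0, so the outputs form a probability distribution
```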

Advantages

  • Able to handle multiple classes, unlike sigmoid.
  • Used for output neurons — softmax is typically applied only in the output layer, to classify inputs into multiple categories.

9. Soft Plus

The Softplus function, also known as smooth ReLU, is nothing but ReLU with one difference: it is smooth everywhere, with no kink at zero.

Softplus function: f(x) = ln(1 + e^x)

Soft plus function produces outputs in the range of (0, +∞).

Surprisingly, the derivative of Softplus is the sigmoid (logistic) function:

f'(x) = 1 / (1 + e^(-x))
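The one-line derivation, by the chain rule:

d/dx ln(1 + e^x) = e^x / (1 + e^x) = 1 / (1 + e^(-x)) = sigmoid(x)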

10. MaxOut

The Maxout activation (introduced by Goodfellow et al. in 2013) generalizes ReLU and Leaky ReLU. It is a learnable activation function: a Maxout unit computes several linear functions of its input and outputs only the maximum of them, discarding the others.

Mathematically, MaxOut is defined as:
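f(x) = max(w1·x + b1, w2·x + b2) for a two-piece Maxout unit (in general, the maximum is taken over k such linear pieces)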

ReLU and Leaky ReLU are special cases of the Maxout function: with two pieces, Maxout reduces to ReLU when w1 = 0 and b1 = 0. The Maxout neuron therefore enjoys all the benefits of a ReLU unit (a linear regime of operation, no saturation) without its drawbacks, though it multiplies the number of parameters per neuron.
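A minimal sketch of a two-piece Maxout unit (the weights and inputs below are arbitrary illustrative values):

```python
import numpy as np

def maxout(x, W, b):
    # W has shape (k, n_inputs) and b has shape (k,): one row per linear piece.
    # The unit outputs the maximum of the k affine functions of the input.
    return np.max(W @ x + b, axis=0)

x = np.array([0.5, -1.0])      # arbitrary inputs
W = np.array([[0.0, 0.0],      # piece 1: w1 = 0, b1 = 0 (this piece makes it behave like ReLU)
              [1.0, 2.0]])     # piece 2: an ordinary linear unit
b = np.array([0.0, 0.1])
print(maxout(x, W, b))         # max(0, 0.5*1 + (-1.0)*2 + 0.1) = max(0, -1.4) = 0.0
```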

Activation functions through dance moves:

Dance moves of Activation Functions (source)

Wrapping up

If you understood the activation functions discussed above, you understand almost all of the commonly used activation functions. This blog is only meant to help you get started with the various activation functions; you will still have to study and experiment on your own. If you have any queries, feel free to leave them in the comments.
Happy Learning!!!

References:

https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/

https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/

https://youtu.be/WK9P_s6hXIc

https://medium.com/swlh/firing-up-the-neurons-all-about-activation-functions-55d1b6a8eff
