Activation Function: Choose the Best Function for your Network

Among the various activation functions that you can choose from, which one is the best for your model?

Dhruv Kumar Patwari
Nerd For Tech
9 min read · Jul 11, 2021


Activation functions are the most crucial part of a neural network; they can make or break a model. This article summarises all the available activation functions and their pros, cons, and use cases. You can use this as a reference to find the one which best suits your needs and read more about that elsewhere.

To learn more about Neural networks and how activation fits into the larger scheme of things, check out my previous articles, Part 1 and Part 2.

What is an activation function?

Simply defined, an activation function is a function added to an artificial neural network to help it learn complicated patterns in the input data. Normalising the values flowing through the network is one of its prominent roles: no single value should get more influence merely because it happens to be numerically large. The activation function takes a neuron's raw output signal and transforms it into the form that is passed as input to the next neuron.
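
To make this concrete, here is a minimal sketch (using NumPy; the weights, bias, and input values are made up for illustration) of a single neuron computing its weighted sum and then passing it through a sigmoid activation before handing the result to the next layer:

```python
import numpy as np

def sigmoid(z):
    """Squash a raw neuron output into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical neuron with three inputs.
x = np.array([0.5, -1.2, 3.0])   # incoming signals
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # raw (pre-activation) output
a = sigmoid(z)                   # activated output fed to the next neuron
print(z, a)                      # roughly -0.72 and 0.33
```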

Activation Function in a neuron

Your neural network will, in general, contain two sorts of activation functions: those in the hidden layers and the one in the output layer. All neurons within a layer usually share the same activation function, but different hidden layers in a network can use different functions.

Desirable properties of an activation function

An activation function should have a few characteristics without which the network’s accuracy or speed might take a hit. I will explain the main ones here.

Differentiable

Neural networks learn using gradient descent, and gradient descent requires the activation function to be differentiable (at least almost everywhere, as with ReLU). A function that cannot provide gradients cannot be used as an activation function.

Vanishing Gradient problem

As the name suggests, the vanishing gradient problem means that the gradient of the loss function approaches 0 as more layers with certain activation functions are added, making the network difficult to train. During backpropagation (the learning phase of the neural network), the gradient shrinks as it is propagated backwards through the layers, so the earlier layers receive almost no signal to learn from. The network therefore hits a ceiling on the accuracy it can achieve, irrespective of how good the dataset is.
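
Here is a small, illustrative sketch (NumPy, with randomly drawn pre-activation values) of why this happens with a saturating function like sigmoid: each layer multiplies the gradient by the activation's derivative, which is at most 0.25, so the product collapses as depth grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # peaks at 0.25 when z = 0

# Toy illustration: during backpropagation the gradient picks up one
# activation-derivative factor per layer.  Each sigmoid factor is <= 0.25,
# so the product shrinks geometrically with depth.
rng = np.random.default_rng(0)
pre_activations = rng.normal(size=20)   # one assumed pre-activation per layer
grad = 1.0
for depth, z in enumerate(pre_activations, 1):
    grad *= sigmoid_grad(z)
    if depth % 5 == 0:
        print(f"after {depth:2d} layers, gradient factor = {grad:.2e}")
# The factor collapses towards zero: that is the vanishing-gradient problem.
```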

It is apparent now that we should avoid using any function with a vanishing gradient problem.

Zero-Centered

If a function is zero-centred, its output can swing in both the positive and negative directions, which captures the structure of the data better than a function whose range is restricted to values between 0 and infinity.

Computational Expense

Our network can have thousands of densely connected nodes, so the activation function may be evaluated millions of times during training. If that calculation is computationally expensive, the network will learn slowly and require more computational capacity. Hence, the activation function should be computationally inexpensive.

Types of activation function

Binary step function

Step Function
Source: Wikimedia | Step function

The binary step function does exactly what the name suggests: if x is greater than 0, then y is 1; otherwise, y is 0. It was one of the first activation functions used in a neural network.

There is, however, a massive problem with this function: whether x is 3 or 300,000, the output is always 1, and any number at or below zero becomes 0. This makes learning difficult, because it is impossible to tell from the output how large the value was before it was fed into the function. Its derivative is also zero everywhere (and undefined at 0), so the optimiser gets no signal about how close or far it is from the desired value and cannot tweak the weights.
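
A quick sketch (NumPy; the sample inputs are arbitrary) makes the problem visible: wildly different positive inputs collapse to the same output, and the gradient is zero almost everywhere.

```python
import numpy as np

def binary_step(x):
    """1 for strictly positive inputs, 0 otherwise."""
    return np.where(x > 0, 1, 0)

print(binary_step(np.array([3, 300_000, -0.5, -42])))   # [1 1 0 0]
# Both 3 and 300,000 map to the same output, and the derivative is zero
# everywhere (undefined at 0), so gradient descent gets no signal for weights.
```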

Linear Function

Source: Wikimedia | Linear function

A linear function is another activation function that was used in the past but is hardly used anymore. Linear functions cannot fit the non-linear relationships we encounter in most neural network problems; in fact, stacking layers with linear activations collapses the whole network into a single linear transformation.

Non-Linear Function

A non-linear function is a function that cannot be represented using straight lines. A good example is the sine function. Non-linear functions can provide granular information about the data as well as normalising it.

Source: Wikimedia | Non-linear function

In the present day, all neural networks use non-linear activation functions because of the various advantages mentioned above. Hence, in this article, we will only discuss the different non-linear activation functions.

Types of non-linear function

1. Sigmoid function

The function formula and chart are as follows.

Sigmoid Function equation
Sigmoid Function Graph
Sigmoid Function Differentiation Graph

This function is included mainly for the sake of completeness; it is rarely used these days.

Advantages:

  • Smooth gradient, preventing “jumps” in output values.
  • Output values bound between 0 and 1, normalising the output of each neuron.
  • Clear predictions: for strongly positive or negative inputs, the output is very close to 1 or 0.

Disadvantages:

  • Prone to vanishing gradients, as the sketch below illustrates
  • Function output is not zero-centred
  • Exponential operations are relatively time-consuming, i.e. computationally expensive.
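
As a small numeric sketch (NumPy; arbitrary sample inputs), the values below show both disadvantages at once: sigmoid outputs never drop below 0, and the gradient peaks at 0.25 and collapses towards 0 for large positive or negative inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)     # derivative of the sigmoid

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z))        # stays in (0, 1), never negative -> not zero-centred
print(sigmoid_grad(z))   # at most 0.25, nearly 0 for large |z| -> vanishing gradient
```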

2. Tanh function

Tanh Equation
Tanh Graph

This function, too, is included only for completeness. Tanh is the hyperbolic tangent function.

The curves of the tanh function and the sigmoid function are quite similar, so let us contrast them. For inputs that are very large or very small, both curves flatten out and the gradient becomes tiny, which is inconvenient for weight updates. What distinguishes them is the output interval: tanh maps to (-1, 1), whereas sigmoid maps to (0, 1).

Tanh is slightly better than sigmoid because it is zero-centred, but it still has the other two disadvantages, which outweigh the benefit.
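
A quick comparison (NumPy; arbitrary sample inputs) shows the difference in output interval and the shared saturation behaviour.

```python
import numpy as np

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.tanh(z))                  # zero-centred: outputs lie in (-1, 1)
print(1.0 / (1.0 + np.exp(-z)))    # sigmoid: outputs lie in (0, 1), never negative
# Both curves flatten out for large |z|, so both still suffer from saturation.
```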

3. ReLU function

ReLU Function Equation
ReLU Function Graph
ReLU Function Differentiation Graph

ReLU is piecewise linear: it is the identity for positive inputs and zero for negative ones.

Advantages:

  • When the input is positive, there is no vanishing gradient problem.
  • The calculation is much faster. ReLU involves only a comparison and a linear pass-through, so both the forward and the backward pass are much faster than with sigmoid and tanh, which need to compute an exponential.

Disadvantages:

  • When the input is negative, ReLU is completely inactive, meaning a neuron can "die" if it only ever receives negative values. During forward propagation this is not a problem in itself: some units are simply active and others are not. During backpropagation, however, the gradient for negative inputs is exactly zero, so those weights stop updating, which echoes the problem seen with the sigmoid and tanh functions. The sketch after this list shows both issues.
  • The ReLU function's output is either 0 or a positive value, so it is not zero-centred.
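
The sketch below (NumPy; illustrative inputs) shows both disadvantages: the gradient is exactly zero for negative inputs, and the outputs are never negative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-5.0, -0.1, 0.5, 5.0])
print(relu(x))        # [0.  0.  0.5 5. ] -> never negative, so not zero-centred
print(relu_grad(x))   # [0.  0.  1.  1. ] -> zero gradient for negative inputs ("dead" units)
```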

4. Leaky ReLU function

Leaky ReLU Equation
Leaky ReLU Function Graph. Here, a = 0.01

To fix the dead ReLU problem, it was suggested to set the negative half of ReLU to 0.01x instead of 0. A related, parameter-based idea is Parametric ReLU: f(x) = max(αx, x), where α is learnt by backpropagation. In principle, Leaky ReLU has all of the benefits of ReLU with no dead-ReLU problems, but in practice it has not been shown that Leaky ReLU is always better than ReLU.
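
As a minimal sketch (NumPy; the 0.01 slope and sample inputs follow the description above), Leaky ReLU lets a small gradient through on the negative side, and Parametric ReLU is the same formula with the slope treated as a learnable parameter.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, a small slope alpha for negative ones."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-5.0, -0.1, 0.5, 5.0])
print(leaky_relu(x))   # [-0.05  -0.001  0.5  5. ] -> negative inputs still carry a gradient
# Parametric ReLU (PReLU) uses the same formula but learns alpha by
# backpropagation instead of fixing it at 0.01.
```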

5. ELU (Exponential Linear Units) function

ELU Function Equation
ELU Function Graph

ELU is also offered as a solution to ReLU’s issues. ELU has all of the benefits of ReLU, as well as:

  • There are no Dead ReLU problems.
  • The output’s mean is near 0 (zero-centred).

One minor drawback is that it requires somewhat more processing power because of the exponential. As with Leaky ReLU, which is theoretically superior to ReLU, there is currently no strong evidence in practice that ELU is always better than ReLU.
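
A short sketch (NumPy; α = 1 is a common default, assumed here) of the ELU behaviour described above: the identity for positive inputs and a smooth exponential curve for negative ones.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Identity for x > 0, alpha * (exp(x) - 1) for x <= 0."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))   # negative inputs saturate smoothly towards -alpha instead of being cut to 0
```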

6. PRelu (Parametric ReLU)

PRelu Equation

Above, yᵢ is the input on the i-th channel and aᵢ is the slope of the negative part, a learnable parameter. PReLU is slightly better than ELU in that it is not computationally intensive.

PRelu Function Graph

If aᵢ = 0, f becomes ReLU.

If aᵢ is fixed at a small positive value (e.g. 0.01), f becomes Leaky ReLU.

If aᵢ is a learnable parameter, f becomes PReLU.

7. Softmax

Source: Wikipics | Softmax Function Graph
Softmax Function

The formula might scare you a little, but it simply raises e (≈ 2.71828) to the power of each input and then divides by the sum of all these exponentials.

As the graph suggests, raising e to the power of an input always gives a positive value and amplifies the differences between the inputs, which helps emphasise the strongest result.

Dividing by the sum of these exponentials normalises the values so they all lie between 0 and 1 and add up to 1, letting them be read as probabilities.
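
Here is a minimal sketch of softmax in NumPy (the score values are made up; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the mathematical definition):

```python
import numpy as np

def softmax(logits):
    """Exponentiate each score and divide by the sum of exponentials."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)          # approximately [0.659 0.242 0.099]
print(probs.sum())    # 1.0 -> the outputs behave like class probabilities
```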

8. Swish (A Self-Gated) Function

Swish Function Graph

y = x * sigmoid(x)

Swish's design was influenced by the use of sigmoid functions for gating in LSTMs and highway networks. To simplify the gating process, Swish uses the same value as both the input and the gate, which is known as self-gating.

Self-gating has the benefit of requiring only a single scalar input, whereas conventional gating requires several scalar inputs. This characteristic allows self-gated activation functions like Swish to readily substitute activation functions that accept a single scalar as input (like ReLU) without affecting the hidden capacity or number of parameters.
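
A minimal sketch of Swish (NumPy; the sample inputs are arbitrary), directly following the formula y = x * sigmoid(x):

```python
import numpy as np

def swish(x):
    """Self-gated activation: the input gates itself through a sigmoid."""
    return x * (1.0 / (1.0 + np.exp(-x)))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))   # smooth, dips slightly below zero for negative inputs,
                  # and is roughly linear for large positive inputs
```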

9. Maxout

Maxout Function Equation

The Maxout activation generalises the ReLU and leaky ReLU functions. Maxout can be thought of as adding an extra layer to the network, controlled by a parameter k: unlike a ReLU or sigmoid layer, it computes k candidate activations per unit and outputs the highest of them.
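
A rough sketch of the idea (NumPy; the layer sizes, k = 4, and the random weights are made-up illustrative values): each output unit gets k affine candidates and keeps the largest.

```python
import numpy as np

def maxout(x, W, b):
    """Element-wise max over k affine transformations of the input."""
    # W has shape (k, out_features, in_features); b has shape (k, out_features).
    candidates = np.einsum('koi,i->ko', W, x) + b   # k candidate activations per unit
    return candidates.max(axis=0)                   # keep the highest one

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # 3 inputs
W = rng.normal(size=(4, 2, 3))    # k = 4 pieces, 2 output units
b = rng.normal(size=(4, 2))
print(maxout(x, W, b))
# ReLU is the special case max(0, x): one piece is the identity, the other is constant zero.
```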

10. Softplus

Softplus Function Graph compared to ReLU

The softplus function is similar to the ReLU function but is smooth everywhere. Like ReLU, it applies unilateral suppression to negative inputs, and its output covers a wide range, (0, +inf).
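
A short sketch comparing the two (NumPy; arbitrary sample inputs), where softplus is ln(1 + eˣ):

```python
import numpy as np

def softplus(x):
    """Smooth approximation of ReLU: ln(1 + e^x)."""
    return np.log1p(np.exp(x))   # log1p keeps small values accurate

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(softplus(x))   # always positive and smooth everywhere
print(relu(x))       # hard kink at zero
```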

Recommended usage

Sigmoid and tanh functions are rarely used these days because of their vanishing gradient problem, so we should avoid them. ReLU and its variants are a good starting point for your models; you can switch to other functions if you do not get the desired results.

Softmax is generally used in the output layer of classification networks to normalise the raw outputs into probabilities, which makes it easy to see which class the network favours and by how much.

Conclusion

There are many activation functions to choose from, and I hope this article can help you cut through all the nonsense and give you a clear picture of which function to use for your network.

TL;DR: When in doubt, use the ReLU function in the hidden layers and the softmax function in the output layer for classification problems.

Do let me know what you want me to write about next. Until next time, Happy learning!
