Activation Functions in Neural Networks

TEJDEEP K RAJKUMAR · Published in Analytics Vidhya · Apr 29, 2020
Activation Function

The activation function is a (usually non-linear) transformation applied to a neuron’s input before the result is sent to the next layer of neurons or used as the final output.

  • We need activation functions to introduce the non-linear, real-world properties of data into artificial neural networks.

So Why Are Non-linear Functions Needed?

Functions whose graphs are not straight lines (for example, polynomials of degree greater than one) are called non-linear functions.

  • Neural networks use non-linear activation functions, which help the network learn complex data, compute and learn almost any function mapping inputs to outputs, and provide accurate predictions.

A neural network without an activation function is simply a linear regression model, which has limited power and does not perform well most of the time.

Activation functions can be broadly divided into two types:

  1. Linear Activation Function
  2. Non-linear Activation Functions

Linear or Identity Activation Function

Linear Activation Function and Derivative

As you can see, the function is a straight line, so its output is not confined to any particular range.

Equation: f(x) = x → the output is the same as the input

Range: (-infinity to infinity)

It does not help the network capture the complexity of the typical data that is fed to it.

y = mx + c (in neural-network terms the slope m represents the weight W and the intercept c represents the bias b, so the equation can be rewritten as y = Wx + b)

It takes the inputs (Xi’s), multiplies them by the weights (Wi’s) for each neuron, and creates an output proportional to the input. In simple terms, the output is proportional to the weighted sum of the inputs.

Problems with the linear function:

  • A linear function has limited power and limited ability to handle complexity; it is suited only to simple tasks where interpretability is the priority.
  • Its derivative is a constant.
  • All layers of the neural network collapse into one.
  • Because every layer is linear, the output of the nth layer is just another linear function of the input, so it is no more expressive than the first layer (see the sketch below).
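
To make the collapse concrete, here is a minimal NumPy sketch (the weights are arbitrary made-up values, not anything from the article): two stacked layers with an identity activation reduce to a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with a linear (identity) activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer_out = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into one linear layer: y = W @ x + b
W = W2 @ W1
b = W2 @ b1 + b2
one_layer_out = W @ x + b

print(np.allclose(two_layer_out, one_layer_out))  # True
```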

Non-linear Activation Function

Most modern neural networks use non-linear activation functions to decide whether a neuron fires. The reason is that they allow the model to create complex mappings between the network’s inputs and outputs, which is essential for learning and modelling complex data such as images, video, audio, and data sets that are non-linear or high-dimensional.

Advantages of non-linear functions over the linear function:

  • Differentiation is possible for all of these non-linear functions, so gradients can be back-propagated.
  • Stacking of layers becomes useful, which lets us build deep neural networks.
  • They make it easier for the model to generalize or adapt to a variety of data and to differentiate between outputs.

The main terms you need to understand for non-linear functions are:

Derivative or differential: the change along the y-axis with respect to the change along the x-axis. It is also known as the slope.

Monotonic function: A function which is either entirely non-increasing or non-decreasing.

1. Sigmoid or Logistic Activation Function

Sigmoid activation function and Derivative

The Sigmoid Function curve looks like an S-shape.

The main reason why we use the sigmoid function is that its output lies between 0 and 1. Therefore, it is especially used for models where we have to predict a probability as the output. Since the probability of anything exists only in the range of 0 to 1, sigmoid is the right choice.

The function is differentiable. That means we can find the slope of the sigmoid curve at any point.

The function is monotonic but the function’s derivative is not.

The logistic sigmoid function can cause a neural network to get stuck during training.

So what’s the problem with the sigmoid function?

If we look carefully at the graph, towards the ends of the function the y values react very little to changes in x.

Let’s think about what kind of problem this creates. The derivative values in these regions are very small and converge to 0. This is called the vanishing gradient problem, and learning becomes minimal; if the gradient is 0, no learning happens at all. When learning is this slow, the optimization algorithm that minimizes the error can get stuck in local minima, and we cannot get the maximum performance out of the artificial neural network model.
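
A quick NumPy sketch (the helper functions are mine, purely for illustration) showing how the sigmoid’s gradient vanishes at the tails:

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); maximum value is 0.25 at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.5f}  derivative={sigmoid_derivative(z):.5f}")
# At z = +/-10 the derivative is about 0.00005: the gradient has all but vanished.
```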

2. Tanh or Hyperbolic Tangent Activation Function

Tanh Activation function and Derivative

Tanh is similar to the logistic sigmoid but often works better. The range of the tanh function is (-1 to 1), and it is also sigmoidal (S-shaped).

It has a structure very similar to the sigmoid function, but this time the output range is (-1, +1). The advantage over the sigmoid function is that its derivative is steeper, so the gradients it produces are larger, and its wider output range allows faster learning. But again, the problem of vanishing gradients at the ends of the function remains.
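
An illustrative NumPy comparison (again, just a sketch) of tanh and its derivative; the maximum slope is 1.0 at z = 0, versus 0.25 for the sigmoid, but the gradient still vanishes for large |z|:

```python
import numpy as np

def tanh_derivative(z):
    # d/dz tanh(z) = 1 - tanh(z)^2; equals 1.0 at z = 0 and decays towards 0 at the tails
    return 1.0 - np.tanh(z) ** 2

for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"z={z:+.1f}  tanh={np.tanh(z):+.4f}  derivative={tanh_derivative(z):.4f}")
```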

3. ReLU (Rectified Linear Unit) Activation Function

ReLU activation Function and Derivative

ReLU is the most used activation function in the world right now, since it appears in almost all convolutional neural networks and other deep learning models.

As you can see, the ReLU is half rectified (from bottom). f(z) is zero when z is less than zero and f(z) is equal to z when z is above or equal to zero.

Range: [ 0 to infinity)

The function and its derivative are both monotonic.

But the issue is that all negative values become zero immediately, which decreases the model’s ability to fit or train on the data properly. Any negative input given to the ReLU activation function turns into zero immediately, so negative values are never mapped appropriately.

Let’s imagine a large neural network with many neurons. Sigmoid and hyperbolic tangent cause almost all neurons to be activated to some degree, which means the activation is dense and the computational load is high. What we would prefer is sparse activation, where only some of the neurons in the network are active, giving an efficient computational load. We get exactly that with ReLU: a value of 0 on the negative axis means the network runs faster. The fact that the computational load is lower than with the sigmoid and hyperbolic tangent functions has made ReLU the preferred choice for multi-layer networks.

But even ReLU isn’t perfect. Why? Because of the very zero-value region that gives us this speed: no learning happens in that area (the dying ReLU problem). So we need a new activation function with a trick.
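
A minimal sketch (my own helper names) of ReLU and its gradient, which makes the dead region on the negative side explicit:

```python
import numpy as np

def relu(z):
    # ReLU: f(z) = max(0, z)
    return np.maximum(0.0, z)

def relu_derivative(z):
    # Gradient is 1 for z > 0 and 0 for z <= 0: the negative side never learns
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(z))  # [0. 0. 0. 1. 1.]
```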

4. Leaky ReLU

The leak on the negative side is an attempt to solve the dying ReLU problem.

Leaky ReLU and derivative

The leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01 or so.

When a is not fixed at a value like 0.01 but is instead chosen at random, it is called Randomized ReLU.

Therefore the range of the Leaky ReLU is (-infinity to infinity).

Both the Leaky and Randomized ReLU functions are monotonic in nature, and their derivatives are monotonic as well.
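
A short illustrative sketch of Leaky ReLU with a = 0.01; unlike plain ReLU, the negative side keeps a small non-zero gradient, so those neurons can still learn:

```python
import numpy as np

def leaky_relu(z, a=0.01):
    # Leaky ReLU: z for z > 0, a * z for z <= 0 (a is a small fixed slope, e.g. 0.01)
    return np.where(z > 0, z, a * z)

def leaky_relu_derivative(z, a=0.01):
    # Gradient is 1 for z > 0 and a for z <= 0
    return np.where(z > 0, 1.0, a)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(z))             # [-0.03  -0.005  0.5    3.   ]
print(leaky_relu_derivative(z))  # [0.01 0.01 1.   1.  ]
```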

5. ELU (Exponential Linear Units) Activation Function:

ELU activation function and derivative
  • ELU was also proposed to solve the problem of dying neurons.
  • No dead-ReLU issues.
  • Zero-centred output.

Cons:

  • Computationally intensive.
  • As with Leaky ReLU, although ELU is theoretically better than ReLU, there is currently no good evidence in practice that it is always better.
  • f(x) is monotonic only if alpha is greater than or equal to 0.
  • f’(x) derivative of ELU is monotonic only if alpha lies between 0 and 1.
  • Slow convergence due to exponential function.
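
The article does not write out the ELU formula, so as a hedged sketch: ELU is commonly defined as f(z) = z for z > 0 and alpha * (exp(z) - 1) for z <= 0. A minimal NumPy illustration:

```python
import numpy as np

def elu(z, alpha=1.0):
    # ELU: z for z > 0, alpha * (exp(z) - 1) for z <= 0
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def elu_derivative(z, alpha=1.0):
    # Gradient is 1 for z > 0 and alpha * exp(z) for z <= 0 (small but never exactly zero)
    return np.where(z > 0, 1.0, alpha * np.exp(z))

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(elu(z))             # negative inputs saturate towards -alpha instead of being cut to 0
print(elu_derivative(z))  # note the exponential (slower) computation on the negative side
```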

6. PReLU (Parametric ReLU) Activation Function

Parametric ReLU
  • The idea of leaky ReLU can be extended even further.
  • Instead of multiplying x by a fixed constant, we multiply it by a trainable parameter, which seems to work better than Leaky ReLU. This extension of Leaky ReLU is known as Parametric ReLU.
  • The parameter α is generally a number between 0 and 1, and it is generally relatively small.
  • It has a slight advantage over Leaky ReLU because the slope is a trainable parameter.
  • It handles the problem of dying neurons.

Cons:

  • Same as Leaky ReLU.
  • f(x) is monotonic when a ≥ 0, and f’(x) is monotonic when a = 1.
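
An illustrative NumPy sketch (function names are my own) of the PReLU forward pass and the gradient with respect to the learnable slope a, which is what lets the network train a itself:

```python
import numpy as np

def prelu(z, a):
    # PReLU: like Leaky ReLU, but the negative-side slope `a` is a learnable parameter
    return np.where(z > 0, z, a * z)

def prelu_grad_a(z):
    # Gradient of the output w.r.t. `a`: z on the negative side, 0 on the positive side
    return np.where(z > 0, 0.0, z)

a = 0.25                           # current value of the learnable slope
z = np.array([-2.0, -0.5, 1.0, 3.0])
print(prelu(z, a))                 # [-0.5   -0.125  1.     3.   ]
print(prelu_grad_a(z))             # [-2.  -0.5  0.   0. ]  -> used to update `a` during training
```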

7. Swish (Self-Gated) Activation Function (Sigmoid Linear Unit)

Swish activation function and derivative
  • Google Brain Team has proposed a new activation function, named Swish, which is simply f(x) = x · sigmoid(x).
  • Their experiments show that Swish tends to work better than ReLU on deeper models across a number of challenging data sets.
  • The curve of the Swish function is smooth and the function is differentiable at all points. This is helpful during the model optimization process and is considered to be one of the reasons that swish outperforms ReLU.
  • Swish function is “not monotonic”. This means that the value of the function may decrease even when the input values are increasing.
  • Function is unbounded above and bounded below.

“Swish tends to consistently match or outperform ReLU.”

Note that the output of the Swish function may fall even when the input increases. This is an interesting, Swish-specific feature, due to its non-monotonic character.
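
A tiny NumPy sketch of Swish, f(x) = x · sigmoid(x), that makes the non-monotonic dip on the negative side visible:

```python
import numpy as np

def swish(z):
    # Swish / SiLU: f(z) = z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"z={z:+.1f}  swish={swish(z):+.4f}")
# swish(-5) ~ -0.033 is *greater* than swish(-1) ~ -0.269: the output dips and comes back up.
```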

8. Softmax Function

  • The softmax function is also a type of sigmoid function, but it is very useful for handling multi-class classification problems.
  • Softmax can be described as a combination of multiple sigmoidal functions.
  • The softmax function returns the probability of a data point belonging to each individual class.
  • Note that the sum of all the output values is 1.
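
A short NumPy sketch of softmax over some hypothetical class scores (subtracting the maximum is a standard numerical-stability trick, not something the article mentions):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, then normalize the exponentials
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for three classes
probs = softmax(logits)
print(probs)                          # approximately [0.659 0.242 0.099]
print(probs.sum())                    # 1.0
```
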
List of activation functions and their equations
