Activation Functions — All You Need To Know!
So, what is an Activation Function?
An activation function is a function added to an artificial neural network to help the network learn complex patterns in the data. Drawing an analogy with the neurons in our brains, the activation function sits at the end of a unit and decides what gets fired on to the next neuron.
In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be “ON” (1) or “OFF” (0), depending on input. — Wikipedia
So, summing it up, activation functions are mathematical equations that determine the output of a neural network.
In this blog, we will go through the widely used activation functions, the mathematics behind how they work, and the various ways to choose the best one for your specific deep learning problem statement.
Before jumping in-depth into the different types of activation functions, let's take a quick look at how an artificial neuron works: each input is multiplied by its weight, the weighted inputs are summed together with a bias, and the result is passed through an activation function, which produces the neuron's output.
Mathematically, the output of a neuron is y = f(w₁x₁ + w₂x₂ + … + wₙxₙ + b), where f is the activation function.
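To make this concrete, here is a minimal sketch of a single artificial neuron in plain NumPy (the weights, bias, and inputs are made-up illustrative values, and sigmoid is used as the example activation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example values: 3 inputs, 3 weights, 1 bias
x = np.array([0.5, -1.2, 3.0])   # inputs from the previous layer
w = np.array([0.4, 0.7, -0.2])   # weights of this neuron
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum (pre-activation)
y = sigmoid(z)                   # the activation function decides the output
print(z, y)
```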
By now, you should be well acquainted with how an ANN works and the role the activation function plays in the process!
So, grab your coffee🥤, and let’s begin!
1. Sigmoid Activation Function -
The Sigmoid Function looks like an S-shaped curve.
Formula : f(z) = 1/(1+ e^-z)
Why and when do we use the Sigmoid Activation Function?
- The output of a sigmoid function ranges between 0 and 1. Since the output values are bounded between 0 and 1, it normalizes the output of each neuron.
- Specially used for models where we have to predict the probability as an output. Since the probability of anything exists only between the range of 0 and 1, sigmoid is the perfect choice.
- Smooth gradient, preventing “jumps” in output values.
- The function is differentiable. That means we can find the slope of the sigmoid curve at any point.
- Clear predictions, i.e., outputs very close to 1 or 0.
What are some disadvantages of the Sigmoid activation function?
- Prone to vanishing gradients (when the sigmoid's output is very close to 0 or 1, the derivative becomes very small, i.e. << 1; this causes vanishing gradients and poor learning in deep networks).
- The function output is not centered on 0, which reduces the efficiency of the weight updates.
- The sigmoid function performs exponential operations, which are computationally slower.
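To see both the shape and the vanishing-gradient issue in numbers, here is a small NumPy sketch (illustrative values only) of the sigmoid and its derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # derivative of the sigmoid

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(z, sigmoid(z), sigmoid_grad(z))
# At z = ±10 the gradient is ~4.5e-05, effectively zero, so learning stalls.
```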
2. Tanh or Hyperbolic Tangent Activation Function -
The tanh activation function is also sort of sigmoidal (S-shaped).
Tanh is the hyperbolic tangent function.
Formula : f(z) = (e^z - e^-z) / (e^z + e^-z)
The curves of the tanh and sigmoid functions are relatively similar, but tanh has some advantages over sigmoid. Let's look at what they are.
Why is tanh better compared to sigmoid activation function?
- Like sigmoid, when the input is very large or very small, the output curve is almost flat and the gradient is small, which is not conducive to weight updates. The difference lies in the output interval: tanh outputs values in (-1, 1), and the whole function is zero-centric, which is better than sigmoid.
- The major advantage is that negative inputs are mapped to strongly negative values and zero inputs are mapped near zero on the tanh graph.
Note: In general binary classification problems, the tanh function is used for the hidden layer and the sigmoid function is used for the output layer. However, these are not static, and the specific activation function to be used must be analyzed according to the specific problem, or it depends on debugging.
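A quick NumPy comparison (illustrative inputs only) shows the difference in output intervals: sigmoid maps into (0, 1), while tanh maps into (-1, 1) and is zero-centric:

```python
import numpy as np

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

sig = 1.0 / (1.0 + np.exp(-z))   # outputs in (0, 1), mean > 0
tanh = np.tanh(z)                # outputs in (-1, 1), centred on 0

print(sig)    # [0.047 0.269 0.5   0.731 0.953]
print(tanh)   # [-0.995 -0.762  0.     0.762  0.995]
```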
3. ReLU (Rectified Linear Unit) Activation Function-
The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) equals z when z is greater than or equal to zero.
Formula : f(z) = max(0, z)
Range: [ 0 to infinity)
The ReLU (Rectified Linear Unit) function is an activation function that is currently more popular compared to other activation functions in deep learning.
Compared with the sigmoid function and the tanh function, it has the following advantages:
- When the input is positive, there is no gradient saturation problem.
- The calculation speed is much faster. The ReLU function has only a linear relationship. Whether it is forward or backward, it is much faster than sigmoid and tanh. (Sigmoid and tanh need to calculate the exponent, which will be slower.)
Of course, there are disadvantages:
1) Dead ReLU problem: when the input is negative, ReLU is completely inactive and outputs zero. During forward propagation this is not necessarily a problem (some regions are simply sensitive and others are not), but during backpropagation the gradient for negative inputs is exactly zero, so the corresponding weights never get updated and the neuron can "die". In that respect it shares the vanishing-gradient weakness of the sigmoid and tanh functions.
2) The output of the ReLU function is either 0 or a positive number, which means that ReLU is not a zero-centric function.
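A minimal NumPy sketch of ReLU and its gradient (illustrative values) makes the "dead" region explicit: negative inputs are clipped to zero, and their gradient is zero as well:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)         # f(z) = max(0, z)

def relu_grad(z):
    return (z > 0).astype(float)      # 1 for positive inputs, 0 otherwise

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.] -> no gradient flows for negative inputs
```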
4. Leaky ReLU Activation Function-
An activation function specifically designed to compensate for the dying ReLU problem.
Why Leaky ReLU is better than ReLU?
- Leaky ReLU addresses the problem of zero gradients for negative values by giving negative inputs a very small linear component of x (0.01x).
- The leak helps to increase the range of the ReLU function. Usually, the value of the slope a is 0.01 or so.
- Range of the Leaky ReLU is (-infinity to infinity).
Note : In theory, Leaky ReLU has all the advantages of ReLU, plus there will be no problems with Dead ReLU, but in actual operation, it has not been fully proved that Leaky ReLU is always better than ReLU.
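Here is a minimal NumPy sketch of Leaky ReLU (the slope 0.01 is just the commonly used default, not a requirement):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Small linear slope alpha for negative inputs instead of a hard zero
    return np.where(z > 0, z, alpha * z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(z))   # [-0.05 -0.01  0.    1.    5.  ] -> negatives keep a small gradient
```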
5. ELU (Exponential Linear Units) function-
ELU is also proposed to solve the problems of ReLU. In contrast to ReLUs, ELUs have negative values which pushes the mean of the activations closer to zero. Mean activations that are closer to zero enable faster learning as they bring the gradient closer to the natural gradient.
Obviously, ELU has all the advantages of ReLU, and:
- No Dead ReLU issues; the mean of the output is close to 0 (zero-centered).
- In contrast to ReLUs, ELUs have negative values which allows them to push mean unit activations closer to zero like batch normalization but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect.
- ELUs saturate to a negative value with smaller inputs and thereby decrease the forward propagated variation and information.
One small problem is that it is slightly more computationally intensive. Similar to Leaky ReLU, although theoretically better than ReLU, there is currently no good evidence in practice that ELU is always better than ReLU.
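A small NumPy sketch of ELU (α = 1.0 is the usual default; the inputs are illustrative):

```python
import numpy as np

def elu(z, alpha=1.0):
    # Linear for positive inputs, smoothly saturates to -alpha for negative inputs
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(z))   # [-0.993 -0.632  0.     1.     5.   ] -> saturates near -1 for large negatives
```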
6. PRelu (Parametric ReLU)-
PReLU is also an improved version of ReLU.
Looking at the formula of PReLU, f(xᵢ) = xᵢ if xᵢ > 0, and f(xᵢ) = aᵢxᵢ otherwise, the parameter aᵢ is generally a small number between 0 and 1, typically on the order of 0.01.
- if aᵢ=0, f becomes ReLU
- if aᵢ>0 is a small fixed constant (e.g. 0.01), f becomes Leaky ReLU
- if aᵢ is a learnable parameter, f becomes PReLU
Coming to the advantages of PReLU-
- In the negative region, PReLU has a small slope, which can also avoid the problem of ReLU death.
- Compared to ELU, PReLU is a linear operation in the negative region. Although the slope is small, it does not tend to 0, which is a certain advantage.
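A minimal NumPy sketch of a PReLU forward pass, treating aᵢ as a per-channel parameter that would be learned together with the weights (the initial value 0.25 follows the PReLU paper; the inputs are made up):

```python
import numpy as np

def prelu(z, a):
    # a is a learnable slope for the negative region (one value per channel here)
    return np.where(z > 0, z, a * z)

z = np.array([[-2.0, 1.0], [0.5, -3.0]])   # batch of 2 samples, 2 channels
a = np.full(2, 0.25)                       # learnable parameters, initialised to 0.25
print(prelu(z, a))   # negative entries are scaled by 0.25 instead of being zeroed
```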
7. Softmax
Softmax is used as the activation function for multi-class classification problems where class membership is required on more than two class labels. For an arbitrary real vector of length K, Softmax can compress it into a real vector of length K with a value in the range (0, 1), and the sum of the elements in the vector is 1.
Softmax is different from the normal max function: the max function only outputs the largest value, and Softmax ensures that smaller values have a smaller probability and will not be discarded directly. It is a “max” that is “soft”; it can be thought to be a probabilistic or “softer” version of the argmax function.
The denominator of the Softmax function combines all factors of the original output value, which means that the different probabilities obtained by the Softmax function are related to each other.
The major drawbacks of the softmax activation function are:
1. It is normally used only in the output layer of a multi-class classifier, not in hidden layers.
2. The exponentials can overflow numerically for large input values, so in practice the maximum logit is subtracted from all inputs before exponentiating (as in the sketch below).
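Here is a sketch of a numerically stable softmax in NumPy; subtracting the maximum logit before exponentiating is the standard stability trick:

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)      # stability: avoids exp() overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)             # probabilities that sum to 1

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())   # [0.659 0.242 0.099] 1.0
```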
8. Swish (A Self-Gated) Function
The formula is: y = x * sigmoid(x)
Swish’s design was inspired by the use of sigmoid functions for gating in LSTMs and highway networks. We use the same value for gating to simplify the gating mechanism, which is called self-gating.
The advantage of self-gating is that it only requires a simple scalar input, while normal gating requires multiple scalar inputs. This feature enables self-gated activation functions such as Swish to easily replace activation functions that take a single scalar as input (such as ReLU) without changing the hidden capacity or number of parameters.
Note: According to the Swish paper, its gains over ReLU are most noticeable in deeper networks (in the paper's experiments, roughly 40+ layers); there is nothing preventing its use in shallower networks.
The major advantages of the Swish activation function are as follows:
1. Unboundedness above helps prevent the gradient from slowly approaching 0 and saturating as activations grow large during training.
(At the same time, being bounded below is also an advantage, since it acts as a form of regularization: large negative inputs are simply forgotten.)
2. Non-monotonicity: unlike ReLU, Swish dips slightly below zero for small negative inputs, which its authors argue increases expressiveness.
3. Smoothness also plays an important role in optimization and generalization.
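A minimal NumPy sketch of Swish (illustrative inputs; note the small dip for moderately negative values, which is the non-monotonic part):

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))   # [-0.033 -0.269  0.     0.731  4.967]
```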
9. Maxout
A Maxout unit takes the maximum over a set of linear (affine) functions of its inputs. As stated in the Maxout paper, even an MLP with just 2 maxout units can approximate any continuous function.
A single Maxout unit can be interpreted as making a piecewise linear approximation (PWL) to a real-valued function where the line segment between any two points on the graph of the function lies above or on the graph (convex function).
Maxout can also be implemented for a d-dimensional vector(V).
Consider two convex functions h1(x) and h2(x), each approximated by a Maxout unit. By the above proposition, their difference g(x) = h1(x) - h2(x) is a continuous PWL function.
Hence, it is found that a Maxout layer consisting of two Maxout units can approximate any continuous function arbitrarily well.
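A sketch of a single Maxout unit in NumPy (the weights and biases here are made-up; in a real network they are learned):

```python
import numpy as np

def maxout(x, W, b):
    # W has shape (k, d), b has shape (k,): k affine pieces over a d-dimensional input.
    # The unit outputs the maximum of the k affine functions.
    return np.max(W @ x + b)

x = np.array([1.0, -2.0])                              # d = 2 input
W = np.array([[0.5, 1.0], [-1.0, 0.3], [2.0, -0.5]])   # k = 3 pieces (made-up)
b = np.array([0.0, 0.1, -0.2])
print(maxout(x, W, b))   # max of the three affine outputs
```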
10. Softplus-
Softplus function: f(x) = ln(1 + exp(x))
The derivative of softplus is:
f ′(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(−x)),
which is exactly the logistic (sigmoid) function.
The softplus function is similar to the ReLU function, but it is smooth. Like ReLU, it suppresses one side: negative inputs are squashed toward (but never exactly to) zero.
Its output range is (0, +∞).
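A small NumPy sketch of softplus and its derivative (illustrative inputs; note that the derivative is exactly the sigmoid):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))        # ln(1 + e^x), a smooth version of ReLU

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))   # the sigmoid function

x = np.array([-3.0, 0.0, 3.0])
print(softplus(x))       # [0.049 0.693 3.049]
print(softplus_grad(x))  # [0.047 0.5   0.953]
```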
Generally speaking, each of these activation functions has its own advantages and disadvantages; the only way to find out what works best is to experiment with them on various problem statements.
And with this, we come to the end of this blog. Hope it gives you a decent understanding of the most used activation functions in deep learning.
😎
If you are a beginner in Data Science and Machine Learning and have some specific queries with regard to Data Science/ML-AI, guidance for Career Transition to Data Science, Interview/Resume Preparation, or even want to get a Mock Interview before your D-Day, feel free to book a 1:1 call here. I will be happy to help!