In this blog we will learn about the activation functions most widely used in Deep Learning. Before jumping in, let's briefly recap the basic architecture of a neural network and how it works.
For simplicity, consider the multilayer perceptron.
We had a set of inputs. These inputs were sent to the hidden layer, where we calculated the Z value as the product of the input features and some weights. After this operation we applied an activation function to calculate the output H. We used the sigmoid activation function on Z. The results were then sent to the next layer, where we had some weights and a bias, and we then calculated the output O.
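The recap above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code from the previous blog: the layer sizes, the weights and the zero hidden bias are made-up numbers, and the helper names `dense` and `forward` are hypothetical.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: z_j = sum_i(inputs[i] * weights[i][j]) + biases[j]."""
    return [
        sum(x * w_row[j] for x, w_row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Inputs -> Z -> H = sigmoid(Z) -> O."""
    z = dense(x, w_hidden, b_hidden)   # weighted sum in the hidden layer
    h = [sigmoid(v) for v in z]        # activation applied to Z
    o = dense(h, w_out, b_out)         # weights and bias of the next layer
    return z, h, o

# Hypothetical numbers: 2 inputs, 2 hidden neurons, 1 output neuron.
x = [1.0, 2.0]
w_hidden = [[0.5, -0.5], [0.25, 0.75]]  # shape: inputs x hidden neurons
b_hidden = [0.0, 0.0]                   # assumed zero for this sketch
w_out = [[1.0], [-1.0]]                 # shape: hidden neurons x outputs
b_out = [0.5]

z, h, o = forward(x, w_hidden, b_hidden, w_out, b_out)
print(z)  # [1.0, 1.0]
print(o)  # [0.5]
```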
For more detail, you can refer to my previous blog: https://medium.com/@gauravrajpal1994/introduction-to-neural-networks-1d111bb4649
The question arises: why do we need an activation function at all? Let's understand it better.
One main reason is that without an activation function the network could not capture complex relationships within the data. We would be left with just a linear combination of inputs and weights along with a bias, no matter how many layers we stack.
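To see why stacking layers does not help without an activation function, here is a small sketch with a single input and made-up weights (`w1`, `b1`, `w2`, `b2` are hypothetical values). Two linear layers applied one after the other give exactly the same result as one linear layer with combined weight and bias.

```python
def linear_layer(x, w, b):
    """A layer with no activation function: just weight * input + bias."""
    return w * x + b

# Hypothetical weights and biases for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# w2 * (w1 * x + b1) + b2 == (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2

def one_linear_layer(x):
    return linear_layer(x, w_combined, b_combined)

for x in [-2.0, 0.0, 1.5, 10.0]:
    print(x, two_linear_layers(x), one_linear_layer(x))
```

Both columns printed by the loop match for every input, so the second layer adds no extra modelling power.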
Let us visualize using TensorFlow Playground.
In the TensorFlow Playground we can see multiple data options on the left. In the middle we have our architecture defined: inputs X1 and X2, a single hidden layer, and then the output layer. We can also see options to change the activation function, learning rate, etc. The objective is to create a decision boundary that separates the orange points from the blue points. So, let's see what the result is with a linear activation function.
We observe in the figure that in less than 100 epochs the model created the best decision boundary for the simple dataset we had chosen, and both train and test loss were 0.
Let us now select another dataset in TensorFlow Playground where the decision boundary is not that easy to create. It is a circular one, and when we train the model using the linear activation function we see that it cannot arrive at a decision boundary which separates the orange and blue points, even after many more epochs, and there is no improvement in train and test loss.
Now let us visualize the decision boundary when we select the sigmoid activation function, keeping everything else the same.
We see that the model trains well and classifies the two types of data points properly in less than 200 epochs.
So, changing the activation function from linear to sigmoid added non-linearity to the network, which made it strong enough to capture the relationship within the data. Hope the use of activation functions is now clear to you guys.
Now that you are aware of the importance of activation functions, let's dive deep into the types of activation functions used in Deep Learning.
TYPES OF ACTIVATION FUNCTIONS
LINEAR ACTIVATION FUNCTION
The linear activation function is the simplest activation function. It does not capture any non-linearity in the data, as we observed earlier.
The mathematical equation for the linear activation function is y = ax, which says that for any input x, the output will be a times x.
Consider a = 1; the graph will look like this:
Based on the above graph we can say that the input can be any value in (-infinity, +infinity), and the same goes for the output. The function has no breaks or jumps, so it satisfies the first condition for an activation function: it should be continuous.
The second condition for an activation function is that it should be differentiable at every point. Let us look at the derivative of the linear activation function. When we take the derivative w.r.t. x we get the coefficient of x, i.e. a.
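As a quick sketch of the two points above (the function names are hypothetical, and a = 1 is just the default used in the graph): the output is a times x, and the derivative is the same constant a for every x.

```python
def linear(x, a=1.0):
    """Linear activation: y = a * x."""
    return a * x

def linear_derivative(x, a=1.0):
    """dy/dx of a * x is the coefficient a, whatever the value of x."""
    return a

for x in [-100.0, -1.0, 0.0, 2.5, 100.0]:
    print(x, linear(x), linear_derivative(x))
```

Because the gradient never depends on x, the function cannot bend to follow a non-linear pattern in the data.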
This is the simplest activation function and does not capture non-linear relationships within the data. It is often used at the output layer of a regression problem. Consider the example where we need to predict income based on age, experience and qualification.
SIGMOID ACTIVATION FUNCTION
It is one of the most popular activation functions, and we used it earlier to demonstrate how it is useful in capturing non-linearity in the data via TensorFlow Playground.
The mathematical equation of the sigmoid activation function is as follows: sigmoid(x) = 1 / (1 + e^(-x))
The best part of the sigmoid activation function is that it restricts the output values to between 0 and 1. These values are generally treated as probabilities, and hence the sigmoid function is generally used at the output layer where we need to calculate the probability of the classes. Also, from the graph above we can see that the sigmoid activation function is continuous and differentiable at every point.
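Here is a minimal sketch of the sigmoid function in plain Python. The branch on the sign of x is my own addition, used only so that `math.exp` never overflows for very negative inputs; mathematically both branches are the same formula 1 / (1 + e^(-x)).

```python
import math

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^(-x)), written to avoid overflow in math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)          # safe: x is negative, so e is between 0 and 1
    return e / (1.0 + e)

for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(x, round(sigmoid(x), 4))
```

Every printed value lies between 0 and 1, and sigmoid(0) is exactly 0.5.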
Let us look at the derivative of SIGMOID activation function.
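From the above plot we can see that the curve is quite flat, which means that the gradient (the derivative) of this activation function is quite small: it equals sigmoid(x) * (1 - sigmoid(x)), peaks at 0.25 at x = 0 and approaches 0 for large positive or negative x.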
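A small sketch to put numbers on how flat the derivative is, using the standard identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)). The function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-5.0, -2.0, 0.0, 2.0, 5.0]:
    print(x, round(sigmoid_derivative(x), 4))
```

The largest value is 0.25 at x = 0, and by x = 5 the gradient has already dropped below 0.01.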
For a better understanding of the steps by which we arrive at the derivative of the sigmoid activation function, you can refer to the article "Derivative of the Sigmoid function".
TANH ACTIVATION FUNCTION
The tanh activation function is quite similar to the sigmoid activation function. We can say it is a scaled and shifted version of sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
The mathematical equation of the tanh activation function is as follows: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
In the tanh activation function the output values lie in (-1, 1), whereas in the sigmoid activation function we saw the output values range over (0, 1).
DIFFERENCE BETWEEN SIGMOID & TANH
From the graph above we can see that the tanh function is steeper at the center, around 0. The graph also makes it clear that tanh is a scaled version of sigmoid.
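The "scaled version of sigmoid" statement can be checked numerically. The sketch below uses the identity tanh(x) = 2 * sigmoid(2x) - 1 and compares it with Python's built-in `math.tanh`; the helper name `tanh_from_sigmoid` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """Stretch sigmoid from (0, 1) to (-1, 1) and double the input scale."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(x, round(math.tanh(x), 6), round(tanh_from_sigmoid(x), 6))
```

The two columns agree, and all the values stay inside (-1, 1).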
Let us look at the derivative of the TANH activation function.
Compared to the sigmoid activation function, the derivative values of the tanh activation function are comparatively larger around x = 0 (a peak of 1, versus 0.25 for sigmoid). Hence training is faster with tanh, as the gradient values are larger and the weights are updated faster.
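A short sketch comparing the two gradients near the center, using the standard identities tanh'(x) = 1 - tanh(x)^2 and sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)). The function names are hypothetical.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_derivative(x):
    """d/dx tanh(x) = 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

for x in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    print(x, round(tanh_derivative(x), 4), round(sigmoid_derivative(x), 4))
```

Around 0 the tanh gradient is several times larger than the sigmoid gradient. Far from 0 both gradients shrink towards 0, so the advantage applies mainly to inputs near the center.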
ReLU ACTIVATION FUNCTION
ReLU stands for Rectified Linear Unit. It is one of the most commonly used activation functions in deep learning.
This function returns 0 for all negative values, and for any value greater than 0 it returns the input unchanged. Let's look at the equation below.
We can see that for all values greater than 0 it acts like a linear function, and it can be represented as max(0, x), where x is any real number. It is also clear that for any negative input the result is 0, which means those neurons are not activated in the forward propagation process. Since only a certain number of neurons are activated, the ReLU activation function is computationally efficient compared to the sigmoid and tanh activation functions.
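The max(0, x) form translates directly into code. A minimal sketch, with made-up Z values to show which neurons end up "switched off":

```python
def relu(x):
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)

# Hypothetical Z values for five neurons in one layer.
z_values = [-2.0, -0.5, 0.0, 0.5, 3.0]
activations = [relu(z) for z in z_values]
print(activations)  # [0.0, 0.0, 0.0, 0.5, 3.0]

# Neurons whose output is 0 are not activated in the forward pass.
active = sum(1 for a in activations if a > 0)
print(active)  # 2
```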
DIFFERENCE BETWEEN ReLU, TANH & SIGMOID
Going back to the tanh and sigmoid activation functions, we saw that both are differentiable at every point, but the ReLU activation function is not differentiable at the point x = 0.
Let us look at the derivative of ReLU activation function.
As we see, for all values greater than 0 its derivative is 1, and for values less than 0 it is 0. Its derivative is not defined at x = 0.
For implementation purposes the value of derivative at x=0 is considered to be 0.
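In code, the convention described above looks like this. It is only a sketch of that convention (the function name is hypothetical), with the value at x = 0 set to 0 as stated in the text.

```python
def relu_derivative(x):
    """1 for x > 0, 0 for x < 0, and 0 at x = 0 by convention."""
    return 1.0 if x > 0 else 0.0

for x in [-2.0, 0.0, 2.0]:
    print(x, relu_derivative(x))
```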
There is still one problem with this function. The derivative for some neurons becomes 0, hence some of the weights and biases are not updated. To resolve this issue we have another activation function.
LEAKY ReLU ACTIVATION FUNCTION
Leaky ReLU is an activation function which overcomes the disadvantage encountered in ReLU, i.e. the derivative for some neurons becoming 0. To resolve this issue it returns a small fraction of x, 0.01x, for x < 0 instead of 0.
Let us look at the equation below: f(x) = x for x > 0, and f(x) = 0.01x otherwise.
Let's look at the derivative of the Leaky ReLU activation function.
So, when we calculate the derivative of the Leaky ReLU activation function, it is 0.01 for all values of x ≤ 0 and 1 for all values of x > 0. As with ReLU, the value used at exactly x = 0 is a convention, since the derivative is not defined there.
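A minimal sketch of Leaky ReLU and its derivative, following the description above. The parameter name `alpha` is my own label for the 0.01 slope.

```python
def leaky_relu(x, alpha=0.01):
    """x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_derivative(x, alpha=0.01):
    """1 for x > 0, alpha for x <= 0 (the value at x = 0 is a convention)."""
    return 1.0 if x > 0 else alpha

for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```

Unlike ReLU, the derivative is never 0, so the weights feeding a neuron with a negative Z still receive a small update.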
SOFTMAX ACTIVATION FUNCTION
SoftMax activation is generally used for multiclass classification.
Before jumping into why this activation function is used for multiclass classification, let us first understand what exactly a multiclass classification problem is. For example:
Consider the figure below: for each observation we have 5 features, and the target variable has 3 classes (Class 1, Class 2 and Class 3).
Let us create a simple neural network for the problem discussed above. We have 5 input features in the input layer. Next we have 1 hidden layer which has 4 neurons. Obviously we can increase the number of neurons and the number of layers in the architecture, but for now we are considering only a single hidden layer with 4 neurons. Each of these neurons uses the inputs, weights and bias to calculate the value Z, represented by Zij (for the 1st neuron of the 1st layer we call it Z11, and so on). Over these values we apply an activation function and send the result to the output layer.
Now can you guess the number of neurons in the output layer?
If you guessed 3, you were right, as we have 3 classes in the target variable of our data set. Each individual neuron will give you the probability of an individual class.
In the above figure, we can see that the 1st neuron in the output layer gives us the probability of the data point belonging to Class 1. Similarly, the 2nd neuron gives us the probability of it belonging to Class 2, and finally the 3rd neuron gives us the probability of it belonging to Class 3.
Now, suppose we calculate the Z values using the weights and bias of the output layer and apply the sigmoid activation function. Knowing that sigmoid gives us values between 0 and 1, we will get some output values.
If we think deeper, we can see that we will encounter 2 problems in this case. First, if we apply a threshold of 0.5, it will tell us that the input data point belongs to 2 classes (Class 1: 0.84 and Class 2: 0.67). Secondly, the probability values are independent of each other (the probability that the data point belongs to Class 1 does not take into account the probabilities of the other 2 classes).
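Both problems can be reproduced with a few lines of Python. The Z values below are hypothetical: I picked them only so that the first two sigmoid outputs come out close to the 0.84 and 0.67 mentioned above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical Z values for the 3 output neurons.
logits = [1.66, 0.71, -1.0]
probs = [sigmoid(z) for z in logits]
print([round(p, 2) for p in probs])  # [0.84, 0.67, 0.27]

# Problem 1: two classes pass the 0.5 threshold at the same time.
above_threshold = [i + 1 for i, p in enumerate(probs) if p > 0.5]
print(above_threshold)  # [1, 2]

# Problem 2: each value is computed on its own, so the total is not 1.
total = sum(probs)
print(round(total, 2))  # 1.78
```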
This is the reason why the sigmoid activation function is not preferred for multiclass classification problems. So instead of sigmoid we use the SoftMax activation function.
Using the SoftMax activation we get relative probabilities, which means that it uses the values of all the classes in the target to calculate each final output.
Let us see how the SoftMax activation function works.
The SoftMax function turns logits [2.0, 1.0, 0.1] into probabilities of approximately [0.7, 0.2, 0.1], and the probabilities sum to 1.
In deep learning, the term logits layer is popularly used for the last neuron layer of neural network for classification task which produces raw prediction values as real numbers ranging from [-infinity, +infinity ]. — Wikipedia
SoftMax turns logits (the numeric output of the last linear layer of a multi-class classification neural network) into probabilities by taking the exponential of each output and then normalizing each number by the sum of those exponentials, so the entire output vector adds up to one.
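The description above maps directly to code. Here is a minimal sketch in plain Python using the logits from the example. Subtracting the largest logit before exponentiating is my own addition for numerical safety; it does not change the result.

```python
import math

def softmax(logits):
    """Exponentiate each logit, then divide by the sum of the exponentials."""
    shift = max(logits)                       # keeps math.exp from overflowing
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print([round(p, 1) for p in probs])  # [0.7, 0.2, 0.1]
```

Rounded to one decimal place these are the [0.7, 0.2, 0.1] quoted earlier, and the unrounded values add up to 1.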
Let us consider step by step what basically happens (with assumed values):
STEP 1: Assume we got following values for output layer.
STEP 2: Applying SoftMax activation function to each of these neurons.
We must note that these are the probability values for the input data point belonging to the respective classes, and that the sum of the probabilities in this case is 1. So in this case it is clear that the input belongs to Class 1. Also, if the probability value of any other class changes, the probability value for Class 1 will also change.
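The last point, that the classes are tied together, can be seen by changing only the Class 2 value and watching the Class 1 probability move. The numbers are hypothetical and this is only a sketch.

```python
import math

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

before = softmax([2.0, 1.0, 0.1])
after = softmax([2.0, 3.0, 0.1])   # only the Class 2 value was raised

print(round(before[0], 3), round(after[0], 3))
print(round(sum(before), 6), round(sum(after), 6))
```

The Class 1 value itself never changed, yet its probability dropped, and both outputs still sum to 1.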
This is all about SOFTMAX activation function. Hope you understood it ?
HOW TO CHOOSE ACTIVATION FUNCTION FOR OUR NEURAL NETWORK ?
Up till now we have studied the various activation functions, looked at their mathematical equations and derivatives, and understood why they are useful. In this part we will explore which activation function to use for our neural network.
1. Linear Activation Function
It is used for REGRESSION problems at the output layer, where the target variable is continuous. As we already discussed, the linear activation function cannot capture non-linearity in the data, hence it is preferred only at the output layer, while we can use non-linear functions like ReLU and tanh in the hidden layers.
2. Sigmoid Activation Function
As we already know, it returns values between 0 and 1 which are treated as probabilities of the output classes. Generally it is used at the output layer for BINARY CLASSIFICATION problems, while we can use other activation functions in the hidden layers.
3. ReLU & TanH Activation Function
These activation functions are popularly used for the HIDDEN LAYERS of the neural network. In fact, the ReLU activation function has been shown to perform better than other activation functions and is the popular choice.
4. Softmax Activation Function
Similar to the sigmoid activation function, the softmax activation function returns the probabilities of each class. It is used at the output layer, most frequently in MULTICLASS CLASSIFICATION.
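The rules of thumb from the list above can be written down as a small lookup. This is just a sketch to summarize the list: the task names used as keys and the function name `suggest_activations` are hypothetical labels, not part of any library.

```python
# Output-layer choice per problem type, as summarized in the list above.
OUTPUT_ACTIVATION = {
    "regression": "linear",
    "binary_classification": "sigmoid",
    "multiclass_classification": "softmax",
}

# Common hidden-layer choices, with ReLU as the popular default.
HIDDEN_ACTIVATIONS = ("relu", "tanh")

def suggest_activations(task):
    """Return a starting-point suggestion for hidden and output activations."""
    if task not in OUTPUT_ACTIVATION:
        raise ValueError(f"unknown task: {task!r}")
    return {"hidden": HIDDEN_ACTIVATIONS[0], "output": OUTPUT_ACTIVATION[task]}

print(suggest_activations("multiclass_classification"))
```

Treat the result as a starting point only; the best choice still depends on the data and the architecture.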
This brings us to the end of our topic of discussion. Hope you enjoyed it and liked exploring the theoretical concepts behind the activation functions used in Deep Learning. If so, please like it and give it a clap.
Do connect with me on LinkedIn : https://www.linkedin.com/in/gaurav-rajpal/
Stay tuned for further updates on Optimizers / Loss Functions and demo projects on Deep Learning.
Gaurav Rajpal (firstname.lastname@example.org)