Activation Functions in Neural Networks

Gaurav Rajpal
Oct 1 · 11 min read
[Cover image. Source: Internet]

In this blog we will learn about the activation functions that are most widely used in Deep Learning. Before jumping in, let's briefly recap the basic architecture of a neural network and understand how it works.

For simplicity, consider a multilayer perceptron.

[Figure: multilayer perceptron architecture. Source: Analytics Vidhya]

We have a set of inputs. These inputs are sent to the hidden layer, where we calculate the value Z as the weighted sum of the input features plus a bias. After this operation we apply an activation function to Z to get the hidden output H; here we used the sigmoid activation function. The result is then sent to the next layer, where with another set of weights and a bias we calculate the output O.

For more detail, you can refer to my previous blog for a better understanding: https://medium.com/@gauravrajpal1994/introduction-to-neural-networks-1d111bb4649

The question arises: why do we need an activation function at all? Let's understand this better.

Without one, the network cannot capture complex relationships within the data; we would be left with just a linear combination of inputs and weights along with a bias.

Let us visualize using TensorFlow Playground.

[Figure: TensorFlow Playground interface. Source: Analytics Vidhya]

In TensorFlow Playground we can see multiple dataset options on the left. In the middle we have our architecture: inputs X1 and X2, a single hidden layer, and then the output layer. We can also see options to change the activation function, learning rate, etc. The objective is to create a decision boundary that separates the orange points from the blue points. So, let's see the result when the activation function is linear.

We will observe in the figure that in less than 100 epochs it creates a good decision boundary for the simple dataset we chose, and the train and test losses drop to 0.

[Figure: linear activation on the simple dataset, loss reaching 0. Source: Analytics Vidhya]

Let us now select another dataset in TensorFlow Playground where the decision boundary is not that easy to create. It is a circular one, and while training the model with the linear activation function we will see that we cannot arrive at a decision boundary that separates the orange and blue points, even after many more epochs, with no improvement in train and test loss.

[Figure: linear activation failing on the circular dataset. Source: Analytics Vidhya]

Now let us visualize the decision boundary when we select the sigmoid activation function, keeping everything else the same.

We will see that the model trains well and classifies the two types of data points properly in less than 200 epochs.

[Figure: sigmoid activation separating the circular dataset. Source: Analytics Vidhya]

So, changing the activation function from linear to sigmoid added non-linearity to the network, which made it strong enough to capture the relationship within the data. I hope the purpose of activation functions is now clear.

Now that you are aware of the importance of activation functions, let's dive into the types of activation functions used in Deep Learning.

TYPES OF ACTIVATION FUNCTIONS

The linear activation function is the simplest activation function. It does not capture any non-linearity in the data, as we observed earlier.

The mathematical equation for the linear activation function is y = ax, which says that for any input x, the output is a times x.

For a = 1, the graph looks like this:

[Figure: graph of y = x. Source: Analytics Vidhya]

Based on the graph above, the input can be any value in (-infinity, +infinity), and the same goes for the output. This satisfies one condition for an activation function: it should be continuous.

[Image. Source: Analytics Vidhya]

The second condition for an activation function is that it should be differentiable at every point. Let us look at the derivative of the linear activation function: taking the derivative of y = ax with respect to x gives the coefficient of x, i.e. a.

[Figure: derivative of the linear activation function, dy/dx = a. Source: Analytics Vidhya]

This is the simplest activation function and does not capture any non-linear relationship within the data. It is often used at the output layer of a regression problem, for example when we need to predict income based on age, experience and qualification.

[Figure: regression example. Source: Analytics Vidhya]

The sigmoid activation function is one of the most popular activation functions, and we used it earlier to demonstrate how it captures non-linearity in the data via TensorFlow Playground.

The mathematical equation of sigmoid activation function is as follows:

[Figure: sigmoid(x) = 1 / (1 + e^(-x)) and its S-shaped curve. Source: Analytics Vidhya]

The best part of the sigmoid activation function is that it restricts the output values to between 0 and 1. These values are generally treated as probabilities, and hence the sigmoid function is generally used at the output layer where we need to calculate class probabilities. Also, from the graph above we can see that the sigmoid activation function is continuous and differentiable at every point.

Let us look at the derivative of the sigmoid activation function.

[Figure: derivative of sigmoid, sigmoid(x) * (1 - sigmoid(x)). Source: Analytics Vidhya]

From the above plot we can see that the curve is quite flat, which means that the gradient (derivative) of this activation function is quite small; its maximum value is only 0.25, at x = 0.

For a better understanding of how we arrive at the derivative of the sigmoid activation function, you can refer to the link below.
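
As a quick illustration (not part of the original post), here is a small NumPy sketch of the sigmoid function and its derivative:

```python
import numpy as np

def sigmoid(x):
    # squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))             # ~[0.007, 0.5, 0.993]
print(sigmoid_derivative(x))  # ~[0.007, 0.25, 0.007]
```

Notice how small the derivative becomes away from 0; this is the flatness mentioned above.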

The tanh activation function is quite similar to the sigmoid activation function; we can say it is a scaled version of sigmoid.

The mathematical equation of the tanh activation function is as follows:

[Figure: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)). Source: Analytics Vidhya]

In the tanh activation function the output values are between (-1, 1), whereas in the sigmoid activation function the output values range over (0, 1).

[Figure: graph of tanh compared with sigmoid. Source: Analytics Vidhya]

From the graph above we can see that the tanh function is steeper around 0. The graph also makes it clear that tanh is a scaled version of sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.

Let us look at the derivative of the TANH activation function.

[Figure: derivative of tanh, 1 - tanh(x)^2. Source: Analytics Vidhya]

Compared to the sigmoid activation function, the derivative values of tanh are larger. Hence training with tanh is often faster, as the larger gradients lead to larger weight updates.
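
Here is an equivalent NumPy sketch for tanh and its derivative (again just for illustration):

```python
import numpy as np

def tanh(x):
    # outputs lie in (-1, 1); np.tanh computes (e^x - e^-x) / (e^x + e^-x)
    return np.tanh(x)

def tanh_derivative(x):
    # tanh'(x) = 1 - tanh(x)^2, which peaks at 1 (versus 0.25 for sigmoid)
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))             # ~[-0.964, 0.0, 0.964]
print(tanh_derivative(x))  # ~[0.071, 1.0, 0.071]
```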

ReLU stands for Rectified Linear Unit. It is one of the most commonly used activation functions in deep learning.

This function returns 0 for all negative inputs, and for any value greater than 0 it returns the input unchanged. Let's look at the equation below.

[Figure: ReLU(x) = max(0, x). Source: Analytics Vidhya]

We can see that for all values greater than 0 it acts like a linear function, and it can be written as max(0, x), where x is any real number. It is also clear that for any negative input the result is 0, which means that neuron is not activated during forward propagation. Since only a certain number of neurons are activated at a time, the ReLU activation function is computationally efficient compared to the sigmoid and tanh activation functions.

[Figure: graph of ReLU. Source: Analytics Vidhya]

Going back to the tanh and sigmoid activation functions, we saw that both are differentiable at every point; the ReLU activation function, however, is not differentiable at x = 0.

Let us look at the derivative of ReLU activation function.

[Figure: derivative of ReLU. Source: Analytics Vidhya]

As we see, for all values greater than 0 the derivative is 1, and for values less than 0 it is 0. The derivative is not defined at x = 0.

For implementation purposes, the value of the derivative at x = 0 is usually taken to be 0.
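
A minimal NumPy sketch of ReLU and its derivative, using the convention above of treating the derivative at x = 0 as 0 (names are illustrative):

```python
import numpy as np

def relu(x):
    # max(0, x): negative inputs are clipped to 0, positive inputs pass through
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for x > 0, 0 for x < 0; at x = 0 we follow the convention above and return 0
    return (x > 0).astype(float)

x = np.array([-3.0, 0.0, 4.0])
print(relu(x))             # [0. 0. 4.]
print(relu_derivative(x))  # [0. 0. 1.]
```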

There is still one problem with this function: the gradient of some neurons becomes 0, so some of the weights and biases are never updated. To resolve this issue we have another activation function.

Leaky ReLU is an activation function that overcomes this disadvantage of ReLU, i.e. the gradient of some neurons becoming 0. To resolve the issue it returns a small value, 0.01 * x, for x < 0 instead of 0.

Let us look at the equation below.

[Figure: Leaky ReLU equation. Source: Analytics Vidhya]

Let's look at the derivative of the Leaky ReLU activation function.

[Figure: derivative of Leaky ReLU. Source: Analytics Vidhya]

So, when we calculate the derivative of the Leaky ReLU activation function, it is 0.01 for all values of x ≤ 0 and 1 for all values of x > 0.
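
A corresponding NumPy sketch for Leaky ReLU with the 0.01 slope described above (for illustration only):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for x > 0, alpha * x otherwise, so negative inputs keep a small gradient
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # 1 for x > 0, alpha (here 0.01) for x <= 0
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, 0.0, 4.0])
print(leaky_relu(x))             # [-0.03  0.    4.  ]
print(leaky_relu_derivative(x))  # [0.01 0.01 1.  ]
```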

SoftMax activation is generally used for multiclass classification.

Before jumping into why this activation function is used for multiclass classification, let us first understand what exactly a multiclass classification problem is. For example:

Consider the figure below: for each observation we have 5 features, and the target variable has 3 classes (Class 1, Class 2 and Class 3).

[Figure: dataset with 5 features and a 3-class target. Source: Analytics Vidhya]

Let us create a simple neural network for the problem discussed above. We have 5 input features in the input layer. Next we have 1 hidden layer with 4 neurons. Obviously we could increase the number of neurons and layers in the architecture, but for now we consider a single hidden layer with 4 neurons. Each of these neurons uses the inputs, weights and bias to calculate a value Zij (the 1st neuron of the 1st layer we call Z11, and so on). Over these values we apply an activation function and send the result to the output layer.

Now, can you guess the number of neurons in the output layer?

If you guessed 3, you were right, as we have 3 classes in the target variable of our data set. Each individual neuron will give you the probability of one class.

[Figure: network with 3 output neurons, one per class. Source: Analytics Vidhya]

In the above figure, we can see that the 1st neuron in the output layer gives the probability of the input belonging to Class 1. Similarly, the 2nd neuron gives the probability of Class 2, and the 3rd neuron the probability of Class 3.

Now, suppose we calculate the Z values using the weights and bias of the output layer and apply the sigmoid activation function. Knowing that sigmoid gives values between 0 and 1, we get some output values.

[Figure: sigmoid outputs at the output layer. Source: Analytics Vidhya]

If we think deeper, we can see that we encounter 2 problems in this case. First, if we apply a threshold of 0.5, it tells us that the input belongs to 2 classes (Class 1: 0.84 and Class 2: 0.67). Secondly, the probability values are independent of each other (the probability that the data point belongs to Class 1 does not take into account the probabilities of the other 2 classes).

Using the SoftMax activation we get relative probabilities, which means it uses the values of all the classes in the target to calculate the final output.

Let us see how the SoftMax activation function works.

[Figure: SoftMax turning logits into probabilities. Source: Internet]

The SoftMax function turns the logits [2.0, 1.0, 0.1] into probabilities of approximately [0.7, 0.2, 0.1], and the probabilities sum to 1.

SoftMax turns logits (the numeric output of the last linear layer of a multi-class classification neural network) into probabilities by taking the exponent of each output and then normalizing each number by the sum of those exponents, so that the entire output vector adds up to one.
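
The following NumPy sketch reproduces the example above (the max-subtraction is a standard numerical-stability trick, not something from the original post):

```python
import numpy as np

def softmax(z):
    # subtracting the max does not change the result but avoids overflow in exp
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # ~[0.659, 0.242, 0.099], roughly the [0.7, 0.2, 0.1] quoted above
print(probs.sum())  # 1.0
```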

Let us consider step by step what happens (with assumed values):

STEP 1: Assume we get the following values at the output layer.

[Figure: assumed raw output values for the three neurons. Source: Analytics Vidhya]

STEP 2: Apply the SoftMax activation function to each of these neurons.

[Figure: SoftMax probabilities for the three classes. Source: Analytics Vidhya]

Note that these are the probability values for the input data point belonging to the respective classes, and that the sum of the probabilities in this case is 1. Here it is clear that the input belongs to Class 1. Also, if the raw value of any class changes, the probability value for Class 1 will also change.

That is all about the SoftMax activation function. I hope you understood it.

HOW TO CHOOSE AN ACTIVATION FUNCTION FOR OUR NEURAL NETWORK?

Up till now we have studied the various activation functions, looked at their mathematical equations and derivatives, and understood why they are useful. In this part we will explore which activation function to use in our neural network.

1. Linear Activation Function

It is used for REGRESSION problems at the output layer, where the target variable is continuous. As we already discussed, the linear activation function cannot capture non-linearity in the data, hence it is preferred only at the output layer, while non-linear functions like ReLU and tanh are used in the hidden layers.

2. Sigmoid Activation Function

As we already know, it returns values between 0 and 1, which are treated as probabilities of the output classes. Generally it is used at the output layer for BINARY CLASSIFICATION problems, while other activation functions are used in the hidden layers.

3. ReLU & TanH Activation Function

These activation functions are popularly used for the HIDDEN LAYERS of the neural network. In fact, ReLU has been shown to perform better than other activation functions in many cases and is the most popular choice.

4. Softmax Activation Function

Similar to the sigmoid activation function, the softmax activation function returns the probability of each class; it is used at the output layer, most frequently for MULTICLASS CLASSIFICATION.
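
To tie these guidelines together, here is a small Keras sketch of the 5-feature, 3-class network from the SoftMax example above, with ReLU in the hidden layer and softmax at the output (the layer sizes and optimizer are illustrative choices, not prescriptions from this article):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),             # 5 input features
    layers.Dense(4, activation='relu'),     # hidden layer: ReLU (or tanh)
    layers.Dense(3, activation='softmax'),  # output layer: softmax over 3 classes
])

# For regression you would instead end with layers.Dense(1, activation='linear');
# for binary classification, layers.Dense(1, activation='sigmoid').
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```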

CONCLUSION

This brings us to the end of our discussion. I hope you enjoyed exploring the theoretical concepts behind the activation functions used in Deep Learning. If so, please like it and give it a clap.

Do connect with me on LinkedIn : https://www.linkedin.com/in/gaurav-rajpal/

Regards,

Gaurav Rajpal (gauravrajpal1994@gmail.com)
