
# Neural Networks-Part(2): Activation Functions

## A friendly guide to the most widely used neural network activation functions

# Activation Functions

In the previous article, Neural Networks-Part(1): Introduction to Neuron and Single Neuron Neural Network, we mentioned activation functions. Just like Optimus Prime transforms, *activation functions help us transform the data each neuron has been fed before giving it to the next neuron in line.* There are two solid reasons why we need to transform our data:

- **To provide non-linearity to the data.**
- **To ensure the gradients remain large enough through all the hidden layers.**

We haven’t talked about layers in a neural network yet, but the brief intuition is that a layer is made of one or more neurons. We usually add several layers between the input and output layers. When a model is deployed to production, the user usually has access only to its input and output layers and not to the layers in between, so the latter are conventionally called hidden layers. Why do we need hidden layers? Because a neural network with at least one hidden layer, a non-linear activation function, and enough neurons can approximate any continuous function on a bounded input range to arbitrary accuracy. Adding hidden layers increases the complexity of the model, but this added complexity may also lead to higher accuracy. For example, each layer may capture different features of the data and thus enhance the power of our model. This will be illustrated better in the next articles.
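To make the first reason (non-linearity) concrete, here is a minimal sketch with scalar neurons. The weights `w1`, `b1`, `w2`, `b2` and the helper names are hypothetical values chosen only for illustration. It shows that two stacked neurons without an activation function collapse into a single linear neuron, while putting a non-linear function (here ReLU, discussed later in this article) between them does not.

```python
def linear_neuron(x, w, b):
    """A neuron without an activation function: weighted input plus bias."""
    return w * x + b


def relu(z):
    """A simple non-linear activation: 0 for negative inputs, z otherwise."""
    return max(0.0, z)


# Hypothetical weights and biases, chosen only for illustration.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5


def two_linear_layers(x):
    """Two neurons in a row with no activation in between."""
    return linear_neuron(linear_neuron(x, w1, b1), w2, b2)


def collapsed_single_layer(x):
    """w2*(w1*x + b1) + b2 rewritten as one linear neuron."""
    return (w2 * w1) * x + (w2 * b1 + b2)


def two_layers_with_relu(x):
    """Same two neurons, with ReLU applied after the first one."""
    return linear_neuron(relu(linear_neuron(x, w1, b1)), w2, b2)


for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, two_linear_layers(x), collapsed_single_layer(x), two_layers_with_relu(x))
```

The first two columns of output always agree, so the extra linear layer adds nothing. The third column differs whenever the first neuron outputs a negative value, which is what lets stacked layers represent more than a straight line.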

Let’s get back on track and learn more about activation functions. There are some properties an activation function should ideally have: *(i) The function should be defined for every real input, i.e. on (-infinity, infinity).* *(ii) The derivative of the function should be continuous at each point (note: ReLU is an exception because it is differentiable everywhere except at 0).* *(iii) The function should be monotonically increasing.* There are various activation functions that we can use, but the following are some of the most common ones and the ones we will be discussing:

- Sigmoid Function
- Hyperbolic Tangent Function
- Softmax Function
- Rectified Linear Unit (ReLU) Function

# 1. Sigmoid Function

A sigmoid function is a mathematical function having a characteristic “S”-shaped curve or sigmoid curve. It is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point and exactly one inflection point (Wikipedia).

It is represented as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The function is defined on the domain (-infinity, infinity) and returns a value (y axis) in the range (0, 1); it approaches 0 and 1 but never reaches them. Hence, it keeps the values near the origin. This constitutes an advantage since neural networks learn more efficiently when initialized with small random values. The sigmoid function is the function behind logistic regression, and it is widely used in the output layer for binary classification because its output can be read as a probability. Used as a hidden-layer activation, its non-linearity is what allows the model to recognize a non-linear decision boundary. There are disadvantages to using the sigmoid function: *(i) The derivative vanishes for values far from the origin.* *(ii) If you use sigmoid in the output layer of a multi-class classifier, each output is computed independently, so the transformed values will generally not sum to 1.*
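As a rough illustration of the vanishing derivative, the sketch below implements the sigmoid and its derivative using only Python's standard `math` module. The function names and sample inputs are my own choices, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^(-x)), split in two branches to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) is at most 1
    return z / (1.0 + z)


def sigmoid_derivative(x):
    """Derivative of the sigmoid, which simplifies to s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Illustrative inputs: the derivative peaks at 0 and shrinks quickly away from it.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  derivative={sigmoid_derivative(x):.5f}")
```

The derivative is largest at the origin (0.25) and is already tiny at x = 10 or x = -10, which is the behaviour described in point (i).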

# 2. Hyperbolic Tangent Function

The hyperbolic tangent function, sometimes referred to as the tanh function, is very similar to the sigmoid function described above. It has the same domain as the sigmoid function, but the range changes to (-1, 1). A positive input is mapped into (0, 1) and a negative one into (-1, 0). It is represented as:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The tanh function generally performs better than the sigmoid function in hidden layers, largely because its output is centered on zero, but it has a similar drawback: the gradient vanishes for inputs far from the origin.
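The similarity to the sigmoid can be checked numerically. The sketch below uses `math.tanh` from the standard library and a plain sigmoid helper (the helper names and sample inputs are my own) to show that tanh is a rescaled and shifted sigmoid, tanh(x) = 2·sigmoid(2x) − 1, and that its derivative 1 − tanh²(x) also shrinks far from the origin.

```python
import math


def sigmoid(x):
    """Plain logistic sigmoid; fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    rescaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
    print(f"x={x:5.1f}  tanh={math.tanh(x):.5f}  "
          f"2*sigmoid(2x)-1={rescaled_sigmoid:.5f}  derivative={tanh_derivative(x):.5f}")
```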

# 3. Softmax Function

Softmax is a mathematical function that converts a vector of numbers into a vector of probabilities, where the probabilities of each value are proportional to the relative scale of each value in the vector (Jason Brownlee).

What it means is that if you have a vector, it normalizes each value with respect to the others. For example, if you have four different numbers, you sum them and ask what proportion of that sum each number holds; that proportion is the intuition behind the output of the softmax activation function.

Example:

Numbers: 5, 7, 6, 4

The sum of these numbers: 5+7+6+4=22

What is the contribution of the number 4 to this sum? We compute:

4/22 = 0.18, roughly 18%.

Similarly, we calculate it for other numbers,

5/22 = 0.23

7/22 = 0.32

6/22 = 0.27

Notice that the sum of these proportions is 1, which is also the total probability of a distribution (each individual probability ranges from 0 to 1), and hence the softmax function is an excellent transformation to obtain probabilities. That’s the reason why it is widely used as an activation function in the output layer of neural networks for multi-class classification problems. Softmax does not simply divide each value by the sum, though: it first maps each value through an exponential function, and is represented as:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{p} e^{x_j}}, \quad i = 1, \dots, p$$

Where *p* is the total number of inputs in a vector or the number of classes. For each class, we will have an output node in the output layer. You can apply the above formula to the above example and play around.
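For readers who want to play around, here is a minimal sketch comparing the plain proportions from the example with the actual softmax scores. The function names are my own, and subtracting the maximum before exponentiating is a common stability trick that does not change the result.

```python
import math


def proportions(values):
    """Each value divided by the plain sum (the intuition from the example)."""
    total = sum(values)
    return [v / total for v in values]


def softmax(values):
    """Softmax: exponentiate each value, then divide by the sum of exponentials."""
    shift = max(values)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


numbers = [5, 7, 6, 4]
print("proportions:", [round(p, 2) for p in proportions(numbers)])
print("softmax:    ", [round(s, 2) for s in softmax(numbers)])
print("softmax sum:", sum(softmax(numbers)))
```

Because of the exponential, softmax exaggerates differences between inputs: on this example it gives about 0.64 to the largest input 7, compared with 0.32 from the plain proportions.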

The softmax function is often preferred over the sigmoid function in the output layer, since the latter provides a value between 0 and 1 for each class in a multi-class classification problem, but not relative to the other classes. Hence, when you sum up the sigmoid outputs, the result is generally not 1 (it is often greater than 1), which is not desirable for interpretation purposes.
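A small sketch of this comparison follows, applying both functions to the same output-layer scores. The two score vectors are hypothetical values picked for illustration, not taken from any model.

```python
import math


def sigmoid(x):
    """Logistic sigmoid applied to a single score."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Softmax over a list of scores (max subtracted for numerical stability)."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a three-class problem.
positive_scores = [2.0, 1.0, 0.1]
negative_scores = [-3.0, -2.0, -4.0]

for scores in (positive_scores, negative_scores):
    per_class_sigmoid = [sigmoid(s) for s in scores]
    print("scores:      ", scores)
    print("sigmoid sum: ", round(sum(per_class_sigmoid), 3))
    print("softmax sum: ", round(sum(softmax(scores)), 3))
```

With the first set of scores the sigmoid outputs sum to well above 1, and with the second they sum to well below 1, while the softmax outputs sum to 1 in both cases.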

Figure: Softmax scores for the input [1, 3.25, 5.5, 7.75, 10].

# 4. Rectified Linear Unit (ReLU) Function

The ReLU activation function is a mathematical function that outputs 0 if the input is negative and the input itself otherwise, i.e. f(x) = max(0, x). It is used in many modern neural networks and is very common because of how effective it makes model training.

Some benefits of using the ReLU function are: *(i) It mitigates the vanishing gradient problem of the tanh and sigmoid functions, because its gradient is 1 for every positive input.* *(ii) Since negative inputs are mapped to zero, the activations become sparse, which can make training easier.* *(iii) It is computationally less expensive, since it only requires a max() function, whereas tanh and sigmoid use the more complex exponential function.* *(iv) Its behaviour is almost linear, which is an advantage, since optimization is easier for neural nets with linear or close to linear behaviour.*
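The points above can be sketched in a few lines. The helper names and sample inputs are my own, and returning 0 for the derivative at exactly x = 0 is an assumed convention, since ReLU is not differentiable there.

```python
def relu(x):
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)


def relu_derivative(x):
    """Gradient of ReLU: 1 for positive inputs, 0 for negative ones.

    ReLU is not differentiable at 0; returning 0 there is an assumed convention.
    """
    return 1.0 if x > 0 else 0.0


inputs = [-3.0, -0.5, 0.0, 0.5, 3.0]
outputs = [relu(x) for x in inputs]
gradients = [relu_derivative(x) for x in inputs]

# Fraction of activations that are exactly zero (the sparsity from point ii).
sparsity = outputs.count(0.0) / len(outputs)

print("outputs:  ", outputs)
print("gradients:", gradients)
print("sparsity: ", sparsity)
```

Unlike the sigmoid and tanh derivatives, the gradient here does not shrink as the input grows: it stays at 1 for any positive input, however large.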

That is it for today, thank you very much!