ACTIVATION FUNCTIONS

anurag rakesh
10 min read · Aug 11, 2019


Neural networks are mainly used to predict outputs from large datasets with complex patterns (patterns that are very hard for the human eye to identify) by combining an input layer, multiple hidden layers with non-linear activation functions, and an output layer.

How does a neural network work?

First, the inputs are given to the input nodes, which pass the values on to the hidden layers. Each hidden layer contains a set of nodes, and each node takes a group of weighted inputs, applies an activation function, and returns an output. The weight on an input path decides how much that input contributes to the output of the node.

Why are the layers between the input and output layers called hidden layers?

Hidden layers add additional transformations (non-linearity) to the inputs to solve complex problems. They are called hidden because all the processing and estimation of weights and biases happens inside these layers, and their intermediate outputs are not directly observed.

Let’s take an example: estimating the cost of a computer.

The output of each node in these layers is produced by an activation function, which helps the output layer interpret the inputs clearly.

Let us understand the activation functions in depth!

ACTIVATION FUNCTIONS

What is an activation function?

An activation function is a function applied at each node: the node mathematically calculates the weighted sum of all its incoming inputs and adds a bias to it, and the activation function then transforms that value into some meaningful output before passing it on to the next layer.

The bias value allows the activation function to be shifted to the left (when the bias increases) or to the right (when the bias decreases), to better fit the data. Changes to the weights alter the steepness of the curve, while the bias shifts it.

Function with no bias (left) and Function with bias (right).
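Putting the weighted sum, the bias, and the activation together, a single node can be sketched as follows (the NumPy code and names such as node_output are illustrative, not taken from any particular library):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def node_output(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus the bias, passed through the activation
    z = np.dot(weights, inputs) + bias
    return activation(z)

print(node_output(inputs=np.array([0.5, -1.2, 3.0]),
                  weights=np.array([0.4, 0.1, -0.6]),
                  bias=0.2,
                  activation=sigmoid))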

Why do we use an activation function?

The significance of the activation function is to introduce non-linear properties into the output of a node by passing the linear weighted sum through a non-linear function.

Wow! This is very technical. Let’s understand it in another way.

The activation function maps the input to a meaningful output which later layers can build further features on, so that the system predicts the correct output; if it does not predict the target, it undergoes further training.

We need activation functions to be differentiable so that we can train iteratively with optimization techniques like gradient descent. Gradient descent finds the values of the parameters (weights and biases) that minimize the cost function (a measure of the difference between the prediction and the target value).

The weights being updated with the help of the gradient.
Gradient descent in 3-D

After each cycle of training, the cost function (error) is calculated. The derivative of this cost function is computed and propagated back through the network using a technique called backpropagation. Each node's weights are then adjusted relative to how much they contributed to the total error. This process is repeated until the network error drops below an acceptable threshold.
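A heavily simplified sketch of this loop for a single sigmoid node with a squared-error cost (the data, initial values, and learning rate below are assumptions chosen purely for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = 2.0, 1.0            # one toy input and its target
w, b, lr = 0.1, 0.0, 0.5        # initial weight, bias, learning rate

for epoch in range(100):
    z = w * x + b               # weighted sum plus bias
    pred = sigmoid(z)           # activation
    cost = 0.5 * (pred - target) ** 2

    # Backpropagation: chain rule through the cost and the sigmoid
    dcost_dpred = pred - target
    dpred_dz = pred * (1 - pred)            # derivative of the sigmoid
    w -= lr * dcost_dpred * dpred_dz * x    # gradient descent updates
    b -= lr * dcost_dpred * dpred_dz

print(round(sigmoid(w * x + b), 3))         # prediction moves towards the target of 1.0

Real networks repeat the same chain-rule bookkeeping across every node in every layer, but the idea is unchanged.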

What if we do not use an activation function?

If we do not use an activation function, the neural network will not be able to learn unstructured data like images, audio, video, and text, because the whole network will simply behave like a linear regression model, which has limited power.

Activation functions are also known as transfer functions, because they sit between the layers of a neural network and transfer one layer's output to the next.

Activation functions can basically be divided into two types:

Linear Activation Function

Non-linear Activation Functions

Linear activation functions

Equation : y = ax+b

Range : -inf to +inf

Linear activation function (blue) and its derivative (red)

Where is it implemented?

It is mainly used in the output layer. We can connect a few nodes together, and if more than one activates, we can take their maximum and decide the output. It gives a range of activations, rather than a binary activation that only says YES/NO or 1/0.

What are the problems in it?

If we differentiate a linear function, the result no longer depends on the input x: the derivative is a constant. So if there is an error in the prediction, the changes made by backpropagation are constant and do not depend on the change in the input.

No matter how many layers we have, if every node is linear, the final output of the last layer is nothing but a linear function of the input to the first layer.
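A quick way to convince yourself of this (the layer sizes below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two stacked layers with purely linear activations
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True: stacking linear layers adds no power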

Application : Predicting the price of a house is a regression problem. The price can take any large or small value, so we can apply a linear activation at the output layer. Even in this case, the neural network must have non-linear activation functions in its hidden layers.

Non-Linear Activation Functions:

Non-linear activation functions are mainly distinguished on the basis of their range and the shape of their curves.

Sigmoid Function :-

It is a function which is plotted as an ‘S’-shaped graph.

Equation : A = 1/(1 + e^-x)

Nature : Non-linear. For example, for X values between -2 and 2, the curve is very steep, which means small changes in X bring about large changes in the value of Y.

Value Range : 0 to 1

The sigmoid function and its derivative, showing the vanishing gradient at the tails.

Where is it implemented?

It is especially used in models where we have to predict a probability as the output, usually in the output layer of a binary classifier. Sigmoid is still very popular for classification problems.

Why is it implemented?

Its output always lies between 0 and 1, so it can be interpreted as a probability: the result is easily predicted as class 1 if the value is greater than 0.5, and class 0 otherwise.
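A sketch of that decision rule (the 0.5 cut-off is the usual convention for binary classification, and the sample inputs are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.2, 0.1, 4.0])    # weighted sums coming into the output node
probs = sigmoid(z)                      # squeezed into the range (0, 1)
labels = (probs > 0.5).astype(int)      # class 1 when the probability exceeds 0.5
print(probs, labels)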

What are the problems in it?

Vanishing gradient problem: towards either end of the sigmoid curve, large increases or decreases in X produce only tiny changes in Y, so the gradient in those regions is very small. This gives rise to the problem of “vanishing gradients”: the updates to the weights and biases during backpropagation become negligible, the earlier layers in the network train the slowest, and the network learns drastically slowly or stops learning further.

The two circles show the saturation regions of the sigmoid, where the gradient is close to zero.

Its output isn’t zero-centered. Because 0 < output < 1, the gradient updates for the weights tend to all share the same sign (sometimes all positive, sometimes all negative), which makes the updates zig-zag and makes optimization harder.

The sigmoid saturates near Y values of 0 and 1: large negative inputs map to outputs near 0 and large positive inputs to outputs near 1. The sigmoid also converges slowly because the maximum value of its derivative is only 0.25, which keeps the updates to the weights and biases small.
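Both points are easy to check numerically; the derivative S(x)(1 - S(x)) follows from the sigmoid equation given above, and the sample inputs are arbitrary:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)              # peaks at 0.25 when z = 0

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z))       # the gradient shrinks towards 0 as |z| grows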

Application : In ecology, modeling population growth; in medicine, modeling the growth of tumors.

Tanh Function :-

An activation that almost always works better than the sigmoid function is the tanh function, also known as the hyperbolic tangent. It is mathematically a scaled and shifted version of the sigmoid; the two are similar and can be derived from each other.

Equation :-

f(x) = tanh(x) = 2/(1 + e^-2x) - 1

OR

tanh(x) = 2 * sigmoid(2x) - 1

Value Range :- -1 to +1

Nature :- non-linear

Where is it implemented?

Usually used in the hidden layers of a neural network, because its output is zero-centered, which makes learning for the next layer much easier.

Why is it implemented?

It helps in centering the data: its values lie between -1 and 1, so the mean activation of the hidden layer comes out to be 0 or very close to it, and convergence is faster. This makes learning for the next layer much easier.

x-axis: number of epochs, y-axis: error. The plot compares the convergence of the three activation functions.

As you can see in the figure below, the derivative of tanh is steeper than that of the sigmoid. This is because the output of tanh is centered around 0. With the sigmoid, strongly negative inputs produce outputs very near zero and the network can get ‘stuck’ during training, whereas tanh maps negative inputs to values between -1 and 0. Optimization is therefore easier, so in practice tanh is usually preferred over the sigmoid. The tanh derivative also reaches up to 1.0, making the updates of W and b larger than with the sigmoid.

for sigmoid : d(S(x))/dx = (1/(1+e^-x)) - (1/(1+e^-x))^2
for tanh : d(f(x))/dx = 1 - [(e^x - e^-x) / (e^x + e^-x)]^2
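A quick numerical check of both relationships, using NumPy's built-in tanh (the sample points are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)

# tanh as a scaled and shifted sigmoid
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # True

# Derivative of tanh: 1 - tanh(z)^2, which reaches 1.0 at z = 0 (vs. 0.25 for sigmoid)
print((1 - np.tanh(z) ** 2).max())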

What are the problems in it?

Tanh also has the vanishing gradient problem.

The two circles show the saturation regions of tanh, where the gradient is close to zero.

RELU :-

ReLU stands for rectified linear unit. It is the most widely used activation function.

Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.

Value Range :- [0, inf)

Nature :- non-linear. We can easily backpropagate the errors because the derivative in the positive (linear) region of the function is a constant.

Where is it implemented?

Chiefly implemented in the hidden layers of a neural network, partly because of the sparsity it induces.

Sparsity here means that many of the activations are exactly 0, which can improve space and time efficiency.
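A small illustration of that sparsity (the random pre-activations below stand in for whatever a real hidden layer would produce):

import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=1000)     # roughly half of these are negative
activations = relu(pre_activations)

print(f"{np.mean(activations == 0):.0%} of the activations are exactly zero")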

Why is it implemented?

Pink nodes are activated and contribute to the output, illustrating the sparsity in the network.

ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any time only some of the nodes are activated, making the network sparse and therefore efficient and easy to compute. In simple words, ReLU learns much faster than the sigmoid and tanh functions because gradient descent converges faster, thanks to its linear, non-saturating form.

It suffers less from vanishing gradients than sigmoidal activation functions, which saturate in both directions.

It has been reported to give roughly a 6x improvement in convergence speed over the tanh function.

How does ReLU bring non-linearity to a neural network?

We can feed the output of one ReLU into other ReLU units, and this builds up a neural network. Below is a simple network with two ReLU units: the first has weights w0, w1, and w2, the second has weights u0, u1, and u2. We then combine the two ReLUs with a linear model with weights v0, v1, and v2 to get the final output.

linear_output = w0 + w1*x + w2*y          # output of a plain linear unit
relu_output = max(0, linear_output)       # the same unit with a ReLU: negatives become 0

Without the ReLU activations, the final output is a linear model on top of linear models, which is still a linear model. ReLU makes the difference by chopping off the negative part of the output, introducing non-linearity like the one shown below:

Single Linear(left) vs single Relu(right) activation function

Any function can be approximated with combinations of ReLUs: from a group of piecewise-linear functions we can build a non-linear function. In effect, we can draw almost any curve out of multiple straight line segments.
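For example, two ReLU units combined by a linear output layer can already reproduce the V-shaped absolute-value curve, which no single linear model can; the weights here are hand-picked for illustration:

import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.linspace(-2, 2, 9)

h1 = relu(x)      # first ReLU unit: active for positive x
h2 = relu(-x)     # second ReLU unit: active for negative x

y = 1.0 * h1 + 1.0 * h2            # linear output layer with weights v1 = v2 = 1
print(np.allclose(y, np.abs(x)))   # True: a non-linear V shape built from straight pieces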

What are the problems in it?

It should only be used within the hidden layers of a neural network model.

The derivative for negative inputs is zero, which means that for activations in that region the weights are not updated during backpropagation. This is called the dying ReLU problem, a form of the vanishing gradient problem, and it typically arises when the learning rate is set too high. It can be mitigated by reducing the learning rate or adjusting the bias.
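A tiny illustration of why those nodes stop learning (the sample pre-activations are arbitrary):

import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)        # 1 for positive inputs, 0 otherwise

pre_activations = np.array([-5.0, -0.1, 0.3, 2.0])
print(relu_grad(pre_activations))       # [0. 0. 1. 1.]: no gradient, and hence no weight
                                        # update, flows through nodes stuck at negative values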

The ReLU output is not zero-centered, and this hurts neural network performance: it can introduce an undesirable zig-zag in the gradient descent updates for the weights. This can be handled with normalization techniques like batch normalization. Also, because gradients are summed across a batch of data, the final update for the weights can have variable signs, which somewhat mitigates the issue.

It is unbounded in the positive region of the x-axis, which means the activations can blow up.

Application : Object Detection, Image Classification, Speech Recognition
