Activation Function - Simplified Mathematics

Eshant Sah
6 min read · Jan 1, 2019

--

An activation function in deep learning decides whether a neuron should be activated at a given point in time.

Now let's understand this with the example of our best friend, the human brain.

The human brain is fed a lot of information, of which only a small fraction is useful, so the brain tries to understand it and classify it into productive and non-productive information.

Using the same analogy, we need a mechanism that helps our neural networks do this classification and grasp only the useful information. This mechanism is called the activation function in a neural network.

Some prerequisites before we understand this interesting topic -

Important Note 1: "The activation function should be differentiable".

Why so??

Differentiation (computing gradients) is performed during backpropagation, so let's understand backpropagation first -

Once the feedback (actual - predicted) is obtained, it is propagated backwards through the neural network to compute the gradients of the error (loss) with respect to the weights. The weights are then optimized using gradient descent or another optimization technique to reduce the error, which is called 'training the model' or 'learning'.
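As a rough illustration, here is a minimal sketch of one such training step for a single weight, assuming a squared-error loss; the input, target, weight and learning rate values are made up purely for demonstration:

# One training step for a single weight w, assuming loss = (y_pred - y_true)^2
x, y_true = 2.0, 10.0   # illustrative input and target
w = 1.0                 # current weight
lr = 0.01               # learning rate

y_pred = w * x                 # forward pass
error = y_pred - y_true        # feedback: predicted - actual
grad_w = 2 * error * x         # gradient of the loss with respect to w
w = w - lr * grad_w            # gradient descent update ("learning")

print(w)   # the weight moves toward the value that reduces the error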

Important Note 2: "The activation function should be non-linear".

Why so?

Without non-linearity, a stack of layers collapses into a single linear transformation, so the network could only learn linear mappings. A non-linear activation lets the network model complex relationships while still letting us backpropagate the errors, and this enhances the learning of our model.
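Here is a small NumPy sketch of that point, using two arbitrary illustrative weight matrices W1 and W2: without a non-linear activation between them, two layers are equivalent to one.

import numpy as np

np.random.seed(0)
W1 = np.random.randn(4, 3)    # first "layer" weights
W2 = np.random.randn(2, 4)    # second "layer" weights
x = np.random.randn(3)        # an input vector

two_layers = W2 @ (W1 @ x)    # layer 1 then layer 2, no activation in between
one_layer = (W2 @ W1) @ x     # a single equivalent weight matrix

print(np.allclose(two_layers, one_layer))   # True: no extra expressive power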

Now it's time to deep-dive into the "pool of mathematics".

Types of Activation function:

  1. Binary Step
  2. Linear Function
  3. Sigmoid
  4. Tanh
  5. ReLU
  6. Leaky ReLU
  7. Softmax

1)Binary Step function:

Mathematically-
f(x) = 1, where x >= 0
f(x) = 0, where x < 0

In a simple way: if the input x is above the threshold value (here 0), the neuron is activated, else it is left deactivated.

fig: graph of the binary step function

So what's the problem with this model ??

If we differentiate the binary step function, the derivative is zero everywhere (and undefined at x = 0).

Derivative of the step function:

f'(x) = 0, for all x

Hence zero (0) is propagated backwards as feedback, which means no weight updates, no activation of any neuron and no learning.
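A small NumPy sketch of the binary step and its derivative (the helper names below are just for illustration):

import numpy as np

def binary_step(x):
    # 1 for x >= 0, otherwise 0
    return np.where(x >= 0, 1.0, 0.0)

def binary_step_derivative(x):
    # zero everywhere (and undefined exactly at x = 0), so no gradient flows back
    return np.zeros_like(x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(binary_step(x))              # [0. 0. 1. 1.]
print(binary_step_derivative(x))   # [0. 0. 0. 0.]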

“Don’t worry, we have many activation functions”

2)Linear Function:

Mathematically-
f(x) = cx, where c is a constant

In a simple way: let us consider an example-

f(x) = 5x, here the value of c = 5 in f(x) = cx

So when we take the gradient, it becomes-

f'(x) = 5

So the problem is-

If we differentiate a linear function, the result no longer depends on x and the gradient becomes a constant, here f'(x) = 5, so every input produces the same gradient and the updates do not reflect the data.

Graphically

fig: graph for f(x) = 5x
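A quick NumPy sketch of the linear activation with c = 5 and its constant derivative (helper names are illustrative):

import numpy as np

def linear(x, c=5.0):
    return c * x

def linear_derivative(x, c=5.0):
    # the gradient no longer depends on x
    return np.full_like(x, c)

x = np.array([-1.0, 0.0, 2.0])
print(linear(x))              # [-5.  0. 10.]
print(linear_derivative(x))   # [5. 5. 5.]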

3)Sigmoid: Sigmoid is a widely used activation function. It is a non-linear function, which means we can backpropagate the errors and activate multiple neurons.

Mathematically- f(x) = 1/(1 + e^-x)
The output of this function ranges from 0 to 1 only.

In a simple way: let us take its gradient

f'(x) = e^-x/(1+e^-x)^2

This shows that the derivative still depends on x, hence the function is smooth and its gradient varies with the input, which makes our neural network better in every iteration.

It is used in the output layer for binary classification, where the result is either 0 or 1: the result can easily be predicted as 1 if the value is greater than 0.5, and 0 otherwise.

So what's the problem with this function:

The function is flat beyond the +4 and -4 region, so once the input falls in that region the gradients become very small. The gradient approaches zero and the network is not really learning (the vanishing gradient problem).

Another problem is that the output values only range from 0 to 1. This means that the sigmoid function is not symmetric around the origin and the values received are all positive.
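A NumPy sketch of the sigmoid and its derivative, showing how the gradient shrinks once |x| goes beyond about 4 (helper names are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)    # equivalent to e^-x / (1 + e^-x)^2

x = np.array([-6.0, -4.0, 0.0, 4.0, 6.0])
print(sigmoid(x).round(4))              # outputs squashed into (0, 1)
print(sigmoid_derivative(x).round(4))   # largest at 0, nearly zero at the flat ends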

4)Tanh: The tanh (hyperbolic tangent) function is very similar to the sigmoid function. It is actually just a scaled version of the sigmoid function, and it is also non-linear, which means we can backpropagate the errors and activate multiple neurons.

Mathematically-
f(x) = tanh(x) = 2/(1 + e^-2x) - 1
OR
tanh(x) = 2 * sigmoid(2x) - 1

In Simple Terms:

Its values lie between -1 and 1, so the mean of the hidden-layer activations comes out to be 0 or very close to it. This helps in centering the data by bringing the mean close to 0, which makes learning for the next layer much easier.

Graphically:

Derivative -
f'(x) = 1 - tanh^2(x)

So what's the problem with this:

The gradient of the tanh function is steeper than that of the sigmoid function, but similar to the sigmoid we still have the vanishing gradient problem beyond the +2, -2 region, where the graph of the tanh function is flat and the gradients are very low.
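A NumPy sketch of tanh and its derivative; the outputs are zero-centred in (-1, 1), but the gradient still vanishes outside roughly the +2, -2 region (helper names are illustrative):

import numpy as np

def tanh(x):
    return np.tanh(x)    # same as 2 * sigmoid(2x) - 1

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, -2.0, 0.0, 2.0, 3.0])
print(tanh(x).round(4))              # zero-centred outputs
print(tanh_derivative(x).round(4))   # 1 at x = 0, tiny at the ends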

5)ReLU: stands for Rectified Linear Unit. It is the most widely used activation function, mostly implemented in the hidden layers of a neural network.

Mathematically:-
f(x)=max(0,x)

The main advantage of using the ReLU function over other activation functions is “that it does not activate all the neurons at the same time”.

What does this mean?

If the input to the function is negative, it is converted to zero, while positive values remain unchanged. So only a few neurons are activated at a time, making the network sparse, efficient and easy to compute.

Graphically:-

Derivative:-
f'(x) = 1, x >= 0
      = 0, x < 0

So finally, does this function have some problem?

The answer is YES

But ReLU also has the problem of gradients approaching zero. If we look at the negative side of the graph, the gradient is zero (the region on the negative x-axis), so the weights are not updated during backpropagation and those neurons never get activated (the dying ReLU problem).

ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations, and since only a few neurons are activated, it is efficient and easy to compute.
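A NumPy sketch of ReLU and its derivative; negative inputs are zeroed out, which keeps the network sparse but also kills the gradient on the negative side (helper names are illustrative):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # using the convention above: 1 for x >= 0, 0 for x < 0
    return np.where(x >= 0, 1.0, 0.0)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))              # [0. 0. 0. 2.]
print(relu_derivative(x))   # [0. 0. 1. 1.]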

6)Leaky ReLU: It is an improved version of the ReLU function. A closely related variant is the Parameterised ReLU function.

Mathematically:-
f(x) = ax, x < 0
     = x,  x >= 0

Here 'a' is a small constant (typically 0.01); in the Parameterised ReLU, 'a' is a trainable parameter, which can help convergence.

In a simple way:

In the ReLU function, the gradient is 0 for x < 0, which makes neurons die for activations in the negative x-axis region.

Leaky ReLU is defined to solve this problem. Instead of defining the ReLU function as 0 for x less than 0, we define it as a small linear component of x.

fig: ReLU vs Leaky ReLU

So the derivative is

f'(x) = a, x < 0
      = 1, x >= 0
Here a = 0.01 in the standard Leaky ReLU.

Hence, in this case, the gradient on the left side of the graph is non-zero (some value 'a'), so there are no dead neurons in that region.
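A NumPy sketch of Leaky ReLU with a = 0.01; unlike plain ReLU, the negative side keeps a small non-zero gradient (helper names are illustrative):

import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def leaky_relu_derivative(x, a=0.01):
    return np.where(x >= 0, 1.0, a)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))              # [-0.03  -0.005  0.     2.   ]
print(leaky_relu_derivative(x))   # [0.01 0.01 1.   1.  ]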

7)Softmax: The softmax function is a generalisation of the sigmoid function used to handle classification problems. The sigmoid function can handle just two classes, for example whether a patient has cancer or not, while the softmax function handles multiple classes.

In a simple way:

Consider this example with one-hot encoded classes:
[1,0,0] #cat
[0,1,0] #dog
[0,0,1] #bird

When we apply the softmax function to the network's raw output scores, we would get something like [0.7, 0.2, 0.1].
The output probabilities say the model is 70% sure it is a cat, 20% a dog and 10% a bird. So now we can use these as probabilities for the input belonging to each class.
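A NumPy sketch of softmax; the raw scores below are made up so that the output roughly matches the [0.7, 0.2, 0.1] example above:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.75, 0.05])   # raw network outputs for cat, dog, bird
print(softmax(scores).round(2))        # approximately [0.7 0.2 0.1], summing to 1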

Code Time:

# Imports needed for the model below
from keras.models import Sequential
from keras.layers import Dense

# Initialising the ANN by creating an object of Sequential
classifier = Sequential()
# Adding the input layer and the first hidden layer (ReLU activation)
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
# Adding the second hidden layer (ReLU activation)
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
# Adding the output layer (sigmoid activation for binary classification)
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
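A minimal follow-up sketch for compiling and training this classifier, assuming X_train and y_train are placeholders for your already-prepared training data:

# Compiling the ANN (binary cross-entropy matches the sigmoid output)
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Fitting the ANN; X_train and y_train are assumed placeholders
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)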
