NEURAL NETWORK - ACTIVATION FUNCTION

arshad alisha sd · Published in Analytics Vidhya · 6 min read · Aug 30, 2020


In a neural network, activation functions determine the output of each neuron. These functions are attached to every neuron and decide whether that neuron should activate, based on whether the neuron's input is relevant for the model's prediction. In this article we will go through the most commonly used activation functions and their advantages and disadvantages.

Before studying the individual activation functions, let's see how an activation function works.

Fig 1. Activation function

Each neuron contains an activation function. The neuron takes as input the sum of the outputs of the previous layer, each multiplied by its corresponding weight, and this weighted sum is passed through the activation function to produce the neuron's output.
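As a rough illustration, here is a minimal NumPy sketch of a single neuron; the weights, the inputs, and the choice of sigmoid as the activation are made up for the example:

```python
import numpy as np

def sigmoid(z):
    # One possible activation function; any of the functions below could be used here
    return 1.0 / (1.0 + np.exp(-z))

# Made-up outputs of the previous layer and their corresponding weights
prev_outputs = np.array([0.5, -1.2, 0.8])
weights = np.array([0.4, 0.1, -0.6])

z = np.dot(weights, prev_outputs)  # weighted sum of the previous layer's outputs
output = sigmoid(z)                # activation function applied to that sum
print(z, output)                   # -0.4 and roughly 0.401
```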

Commonly used activation functions

1. Sigmoid Function:-

The sigmoid function is one of the most popular activation functions. It is defined as f(x) = 1 / (1 + e^(-x)).

The sigmoid function always gives an output in the range (0, 1). Its derivative is f′(x) = f(x)(1 − f(x)), which takes values in (0, 0.25].

Generally, the sigmoid function is used in the final (output) layer.

Sigmoid activation function

Advantages

1 Smooth gradient, preventing “jumps” in output values.

2 Output values bound between 0 and 1, normalizing the output of each neuron.

Disadvantages

1 Not a zero-centered function.

2 Suffers from vanishing gradients.

3 For inputs far from the origin, the output saturates and the gradient becomes close to zero.

4 Computationally expensive, because it has to evaluate an exponential.
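A minimal NumPy sketch of the sigmoid and its derivative (the function names are just illustrative):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = f(x) * (1 - f(x)), values in (0, 0.25]
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))             # approx [0.0067, 0.5, 0.9933]
print(sigmoid_derivative(x))  # tiny at the extremes -> vanishing gradients
```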

2 Tanh or Hyperbolic tangent activation function:-

To overcome the non-zero-centered output of the sigmoid function, the tanh activation function was introduced. It is defined as f(x) = (e^x − e^(-x)) / (e^x + e^(-x)), and its graph is shown below.

Tanh activation function

The output of the tanh activation function always lies in (−1, 1), and its derivative, f′(x) = 1 − f(x)², lies in (0, 1].

Advantages

The tanh function has all the advantages of the sigmoid function, and it is also zero-centered.

Disadvantages

1 More computationally expensive than the sigmoid function.

2 Suffers from vanishing gradients.

3 For inputs far from the origin, the output saturates and the gradient becomes close to zero.
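A minimal NumPy sketch of tanh and its derivative (in practice np.tanh can be used directly):

```python
import numpy as np

def tanh(x):
    # Equivalent to np.tanh(x); output in (-1, 1), zero-centered
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_derivative(x):
    # f'(x) = 1 - f(x)^2, values in (0, 1]; still saturates for large |x|
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, 0.0, 3.0])
print(tanh(x))             # approx [-0.995, 0.0, 0.995]
print(tanh_derivative(x))  # approx [0.0099, 1.0, 0.0099]
```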

3 ReLU (Rectified Linear Unit):-

The two activation functions above suffer from the vanishing gradient problem; to overcome it, the ReLU activation function was introduced.

The ReLU activation function is simply f(x) = max(0, x): if the input x is positive, the output is also x; if x is negative, the output is zero, which means that particular neuron is deactivated.

ReLU activation function

Advantages

1 No vanishing gradient for positive inputs.

2 The derivative is constant (1 for positive inputs, 0 for negative ones).

3 Less computationally expensive (no exponentials to evaluate).

Disadvantages

1 For negative inputs the neuron is completely inactive, no matter what (the “dying ReLU” problem).

2 Not a zero-centered function.
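A minimal NumPy sketch of ReLU and its derivative:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): passes positive values, zeros out negative ones
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 for negative inputs (no saturation on the positive side)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```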

4 Leaky ReLU:-

Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function being zero when x < 0, a leaky ReLU has a small slope (around 0.01) in the negative region, i.e. f(x) = 0.01x for x < 0.

Leaky ReLU

In theory, Leaky ReLU has all the advantages of ReLU and avoids the dead-ReLU problem, but in practice it has not been conclusively shown that Leaky ReLU is always better than ReLU.
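A minimal NumPy sketch of Leaky ReLU (the slope value 0.01 follows the text above):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # f(x) = x for x > 0, negative_slope * x otherwise,
    # so the gradient never becomes exactly zero
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(leaky_relu(x))  # [-0.1  -0.01  0.  1.  10.]
```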

5 ELU (Exponential Linear Unit):-

The ELU activation function was also introduced to overcome dead neurons. Instead of multiplying by 0.01 when x < 0, ELU outputs α(e^x − 1).

ELU activation function

Advantages

1 No Dead ReLU issues.

2 The mean of the output is close to 0, so it is approximately zero-centered.

Disadvantages

One small problem is that it is slightly more computationally intensive. Similar to Leaky ReLU, although theoretically better than ReLU, there is currently no good evidence in practice that ELU is always better than ReLU.
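A minimal NumPy sketch of ELU (α = 1.0 is an assumed default here):

```python
import numpy as np

def elu(x, alpha=1.0):
    # f(x) = x for x > 0, alpha * (e^x - 1) otherwise;
    # saturates smoothly toward -alpha for large negative inputs
    return np.where(x > 0, x, alpha * np.expm1(x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))  # approx [-0.993 -0.632  0.  1.  5.]
```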

6 PReLU (Parametric ReLU):-

PReLU is also an improved version of ReLU. In the negative region, PReLU has a small slope, which also avoids the dying-ReLU problem. Compared to ELU, PReLU is a linear operation in the negative region: although the slope is small, it does not tend to 0, which is an advantage.

PReLU

Looking at the formula of PReLU, f(yᵢ) = yᵢ if yᵢ > 0 and f(yᵢ) = aᵢyᵢ otherwise. The parameter aᵢ is generally a small number between 0 and 1. When aᵢ is fixed at 0.01, PReLU becomes Leaky ReLU, so Leaky ReLU can be regarded as a special case of PReLU.

Above, yᵢ is any input on the ith channel and aᵢ is the negative slope which is a learnable parameter.

  • if aᵢ=0, f becomes ReLU
  • if aᵢ is a small, fixed value (such as 0.01), f becomes leaky ReLU
  • if aᵢ is a learnable parameter, f becomes PReLU
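A minimal NumPy sketch of the PReLU forward pass; in a real network aᵢ would be updated by backpropagation, which is omitted here, and the value 0.25 below is just an illustration:

```python
import numpy as np

def prelu(x, a):
    # f(x) = x for x > 0, a * x otherwise; unlike Leaky ReLU,
    # the slope `a` is a learnable parameter updated during training
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(prelu(x, a=0.01))  # behaves like Leaky ReLU
print(prelu(x, a=0.25))  # a larger, "learned" negative-region slope (illustrative)
```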

7 Swish (a self-gated function):-

The Swish function was proposed by Google Brain; it simply multiplies the input by the sigmoid of the input, i.e. f(x) = x · sigmoid(x).

Google Brain's experiments show that Swish tends to work better than ReLU on deeper models across a number of challenging datasets.

The advantage of self-gating is that it only requires a simple scalar input, while normal gating requires multiple scalar inputs. This feature enables self-gated activation functions such as Swish to easily replace activation functions that take a single scalar as input (such as ReLU) without changing the hidden capacity or number of parameters.
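A minimal NumPy sketch of Swish:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # f(x) = x * sigmoid(x): the input gates itself ("self-gated")
    return x * sigmoid(x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))  # approx [-0.0335 -0.269  0.  0.731  4.967]
```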

8 Softmax (Normalized Exponential function):-

The softmax activation function is used in neural networks when we want to build a multi-class classifier. For a vector x it is defined as softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ).

softmax activation function

Softmax always returns a probability distribution over the target classes in a multi-class classification problem.

Example

For instance, if you have three classes [A, B, C], there would be three neurons in the output layer. Suppose the outputs of those neurons are [-0.21, 0.47, 1.72]. Applying the softmax function over these values gives approximately [0.1, 0.2, 0.7]. These values represent the probability of the data point belonging to each class, so from this result we can say the input belongs to class C.
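A minimal NumPy sketch of softmax, reproducing the example above:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([-0.21, 0.47, 1.72])  # raw outputs for classes [A, B, C]
probs = softmax(logits)
print(probs)           # approx [0.10 0.20 0.70], sums to 1
print(probs.argmax())  # 2 -> class C
```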

9 Softplus activation function:-

The softplus function is similar to the ReLU function, but it is smooth. Like ReLU, it suppresses negative inputs (unilateral suppression), and its output range is (0, +∞).

RELU vs SOFTPLUS

The softplus function is f(x) = ln(1 + e^x). Its derivative is f′(x) = e^x / (1 + e^x) = 1 / (1 + e^(-x)), which is also called the logistic function.
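A minimal NumPy sketch of softplus and its derivative:

```python
import numpy as np

def softplus(x):
    # f(x) = ln(1 + e^x): a smooth approximation of ReLU
    return np.log1p(np.exp(x))

def softplus_derivative(x):
    # f'(x) = 1 / (1 + e^(-x)): the logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
print(softplus(x))             # approx [0.0067 0.693 5.0067]
print(softplus_derivative(x))  # approx [0.0067 0.5 0.9933]
```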

10 Maxout activation function:-

The Maxout activation is a generalization of the ReLU and leaky ReLU functions. It is a learnable activation function: a piecewise linear function that returns the maximum of several learned linear transformations of the input, and it was designed to be used in conjunction with the dropout regularization technique.

Maxout
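A minimal NumPy sketch of a maxout unit; the shapes and random weights below are made up for illustration, and in a real network W and b would be learned:

```python
import numpy as np

def maxout(x, W, b):
    # Each of the k "pieces" is a linear function W[i] @ x + b[i];
    # maxout returns the element-wise maximum over the k pieces.
    return np.max(W @ x + b, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # made-up input vector
W = rng.normal(size=(3, 2, 4))  # k=3 pieces, 2 output units, 4 inputs
b = rng.normal(size=(3, 2))
print(maxout(x, W, b))          # maximum over the 3 linear pieces, per output unit
```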

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

NOTE: Generally speaking, each of these activation functions has its own advantages and disadvantages. There is no blanket rule that says which ones do not work and which ones are good; what works well or badly has to be determined by experiment.
