Top Activation Functions at a glance

Anushka Bajpai
13 min read · Jun 9, 2022


Universal Approximation Theorem states that a neural network can learn any continuous function.

Question is — what makes it do so?

Answer — the non-linearity induced by activation functions


Activation functions act as a mathematical “gate” between the input feeding the current neuron and its output going to the next layer. They decide whether, and to what extent, the neuron should be activated.

PROPERTIES

There is a certain set of properties that an activation function needs to have in order to serve its purpose well.

Let’s cover them in detail. . .

1. Non Linearity

The purpose of the activation function is to induce non-linearity into the network in order to model a response variable (or target variable) that varies non-linearly with its explanatory variables (or independent variables).

Output of a single perceptron : w.x + b

where x is the input, w is the weight and b is the bias

Usually, by convention, we say that a set of parameters (w, b) is a valid separator if y(w.x + b) > 0 for all the data points and not a valid separator if y(w.x + b) < 0 for any of the data points.

If we did not have the activation function, the weights and bias would simply perform a linear transformation. A linear equation, although simple to solve, is limited in its capacity to solve complex problems and has less power to learn complex functional mappings from data.

So, a neural network without an activation function is just a linear regression model with limited functionality.

Another way to think of it: without a non-linear activation function in the neural network, no matter how many layers it has, it will behave just like a single-layer perceptron, because composing these layers just gives us another linear function.

In many cases, where the output does not have a linear relationship with its independent variables, a single perceptron is likely to fail, as it works on the principle of linearity and is itself nothing but a line (a line of separation).

In such a case, we need an activation function to capture the non-linear relationship!
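A quick way to see this in code: stacking two purely linear layers collapses into a single linear layer, while adding a non-linearity in between does not. Below is a minimal NumPy sketch (with made-up weight shapes) illustrating the point.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # a small batch of inputs (made-up shapes)
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked linear layers...
two_linear = (x @ W1 + b1) @ W2 + b2

# ...are exactly one linear layer with combined weights W1 @ W2.
one_linear = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_linear, one_linear))   # True: the extra layer added nothing

# With a non-linearity (ReLU) in between, the equivalence breaks.
nonlinear = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(nonlinear, one_linear))    # False
```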


2. Continuously differentiable

Activation function should be differentiable.

This property is necessary for enabling gradient-based optimization methods.

Why do activation functions need to be differentiable?

As we know, using backpropagation the network calculates the errors it made previously and, using this information, updates the weights accordingly to reduce the overall network error.

To perform this, the network uses the gradient descent approach, which needs the derivative of the activation functions.
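As a minimal sketch of where that derivative shows up, here is one gradient-descent step for a single sigmoid neuron with a squared loss (the numbers are made up); the chain rule passes through σ′(z) = σ(z)(1 − σ(z)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron: z = w*x + b, a = sigmoid(z), L = (a - y)^2 / 2
x, y = 1.5, 1.0          # made-up input and target
w, b = 0.2, -0.1         # made-up parameters

z = w * x + b
a = sigmoid(z)

# Chain rule: dL/dw = (a - y) * sigmoid'(z) * x
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) -- this is where the
# differentiability of the activation function is needed.
grad_w = (a - y) * a * (1 - a) * x
w -= 0.1 * grad_w        # one gradient-descent step
```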

3. Zero Centered (approximates identity near the origin)

The output of the activation function should be zero centered. This keeps the calculated gradients from all being pushed in the same direction (all positive or all negative), which would make the weight updates zig-zag and slow down convergence.

When activation functions have this property, the neural network will learn efficiently when its weights are initialized with small random values.

When the activation function lacks this property, special care should be taken while initializing the weights.

4. Range & Saturation

When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights.

e.g : sigmoid and tanh squeeze their outputs into the ranges (0, 1) and (-1, 1) respectively; they are hence saturating in nature and can give rise to the vanishing gradient issue.

When the range is infinite, training is generally more efficient as pattern presentations drastically affect the majority of the weights.

5. Monotonic

Monotonic Functions : Either entirely increasing or decreasing

When the activation function is monotonic, the error surface associated with a single-layer model is guaranteed to be convex, i.e. updates made to the weights are more likely to have the right impact on reducing the error of the cost function.

TYPES

1. SIGMOID

suitable for classification problems

Ranges between : 0 to 1

Derivative ranges between : 0 to 0.25

Suitable for : Classification problems where we need to find a probability/make predictions (as probability values range between 0 and 1).

Usually used in the final output layer where prediction/classification needs to be made


Advantages of Sigmoid Function : —

  1. Smooth gradient, prevents “jumps” in output values.
  2. Output values are bound between 0 and 1, normalizing the output of each neuron and can be used as probability values for binary classification (i.e, clear prediction)
  3. Continuously differentiable and non linear (capable of working on non linear data distribution)
  4. Monotonic in nature (either increasing or decreasing over its entire domain); its derivative, however, isn’t monotonic.

Sigmoid has three major disadvantages : —

  1. Prone to gradient vanishing (gradient saturation or gradient dispersion). As sigmoid is a saturating function (restricted between 0 and 1), when the input is even slightly away from the origin, the gradient of the function becomes very small (close to zero). During backpropagation, after the gradient passes through many sigmoid functions, the differential becomes negligibly small, as a result of which the weights have little or no effect on the loss function.

2. The function output is not zero-centered (i.e. its mean is not zero), which may reduce the efficiency of weight updates and take a longer time to converge or reach the global minimum.

3. The sigmoid function performs exponential operations, which are slower to compute, hence more time consuming and computationally expensive.

These days, sigmoid is rarely used in hidden layers as we have multiple better options available. It is preferred only in the output layer for binary classification problems.
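A minimal NumPy sketch of sigmoid and its derivative; it shows the derivative peaking at 0.25 at the origin and collapsing towards zero for inputs far from it, which is exactly the saturation problem described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value is 0.25, reached at x = 0

print(sigmoid_grad(0.0))          # 0.25
print(sigmoid_grad(10.0))         # ~4.5e-05 -> the gradient has vanished
```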

2. TANH (hyperbolic tangent)

Ranges between : -1 to 1

Derivative ranges between : 0 to 1

Suitable for : classification problems. In binary classification problems, the tanh function is generally used for the hidden layers and the sigmoid function for the output layer.

Usually used in hidden layers, where its zero-centered output helps the next layer learn.

Tanh is the hyperbolic tangent function. The curves of the tanh and sigmoid functions are quite similar (both are S-shaped). As with sigmoid, when the input is very large or very small the output flattens out and the gradient becomes small, which is not conducive to weight updates. The difference lies in the output interval.

Advantages

  1. The function is differentiable and non-linear.

2. The function is monotonic while its derivative is not monotonic.

3. The output interval of tanh is (-1, 1), and the whole function is zero-centered, which is better than sigmoid.

Both tanh and logistic sigmoid activation functions can be used in feed-forward neural networks.

Disadvantages

  1. More computationally expensive than the sigmoid function.
  2. Suffers from vanishing gradients as it is also a saturating function.
  3. For inputs far away from the origin, the gradient is close to zero.

It only solves one issue of the sigmoid function: the slow training caused by outputs that are not centered around zero.
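A corresponding sketch for tanh: the outputs are zero-centred and the derivative reaches 1 at the origin, but it still collapses towards zero for large inputs, so the function remains saturating.

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # peaks at 1.0 when x = 0

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))       # outputs lie in (-1, 1) and are centred on 0
print(tanh_grad(x))     # ~0.0002, 0.42, 1.0, 0.42, ~0.0002 -> saturation at the ends
```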

3. RELU (and its Variants)

suitable for regression problems

Ranges between : 0 to infinity

Derivative : either 0 or 1 (not a continuous range), since f′(x) = {0 if x < 0, 1 if x > 0}

Suitable for : the hidden layers of deep networks; its output is unbounded above, so it is not used to produce probabilities

Usually used in hidden layers rather than the final output layer.

ReLU outputs the maximum of zero and its input. It is not differentiable everywhere (there is a kink at zero), but we can take a sub-gradient. Although ReLU is simple to implement, it is usually not advisable for output layers.

Advantages

1) Solves the problem of gradient vanishing or gradient saturation for positive inputs, since the derivative there is exactly 1

2) The calculation is much faster (and the network often converges faster). ReLU involves only a simple threshold and a linear relationship; whether in the forward or backward pass, it is much faster than sigmoid and tanh, which need to compute an exponential.

3) Non linear and non-saturating in nature (for positive region)

Disadvantages

1) When the input is negative, ReLU becomes totally inactive and suffers from the dead activation problem (the neuron dies out). During backpropagation, if the input to a neuron is negative, its gradient becomes exactly zero, so its weights stop updating (an effect similar to the saturated regions of sigmoid and tanh)

2) The output of the ReLU function is either 0 or a positive number, hence ReLU is not a zero-centric function (just like sigmoid). Batch normalization can help with this problem.

3) Not differentiable at zero (a sub-gradient has to be used)

ReLU is one of the most widely used and best activation functions available at present.
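ReLU itself is a one-liner; the sketch below also shows the sub-gradient convention of taking 0 at x = 0, and makes the dead-gradient behaviour for negative inputs visible.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Sub-gradient: 1 for x > 0, 0 for x <= 0 (we choose 0 at the kink)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.] -> negative inputs receive no gradient
```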

In order to deal with the DEAD NEURON problem, we have some variants of RELU. . .

3 i) LEAKY RELU

LReLU(x) = { x, if x > 0 ; αx, if x ≤ 0 }

Ranges between : -infinity to infinity

Derivative ranges between : α to 1

The leak helps to increase the range of the ReLU function. Usually, the value of alpha is between 0.01–0.03.

When α is instead chosen randomly (rather than being fixed at a small constant such as 0.01), the function is called Randomized ReLU.

Both Leaky and Randomized ReLU functions are monotonic in nature. Their derivatives are also monotonic in nature.

In order to solve the Dead ReLU problem, the negative half of ReLU is set to 0.01x (a small slope) instead of 0.

Another intuitive idea is a parameter-based method, Parametric ReLU: f(x) = max(αx, x), in which α can be learned through backpropagation.

In theory, Leaky ReLU has all the advantages of ReLU, plus there will be no problems with Dead ReLU, but in actual operation, it has not been fully proved that Leaky ReLU is always better than ReLU.

Advantages

  1. Solves the dying RELU problem
  2. Faster in performance

Disadvantages

  1. It may suffer from a vanishing gradient effect: if inputs are negative across several layers, the small slope of 0.01 gets multiplied several times and results in a very small gradient
  2. In actual operation, it is not guaranteed to perform better than RELU.
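A minimal sketch of Leaky ReLU with α = 0.01 (the usual default), which keeps a small non-zero gradient for negative inputs so that neurons do not die.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)   # never exactly zero

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))        # [-0.03  -0.005  0.5  3. ]
print(leaky_relu_grad(x))   # [0.01 0.01 1.   1.  ]
```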

3 ii) ELU

ELU is also proposed to solve the problems of ReLU.

It is the same as ReLU for x > 0; for x ≤ 0 it outputs α(e^x - 1), which smoothly saturates towards -α.

ELU has all the advantages of ReLU, plus:

  • No Dead ReLU issues
  • The mean of the output is close to 0, zero-centered

Disadvantages

Slightly more computationally intensive than ReLU (it involves an exponential).

Similar to Leaky ReLU: although ELU is theoretically better than ReLU, there is currently no good evidence in practice that it is always better.
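A minimal sketch of ELU with the commonly used α = 1.0; the negative branch α(e^x - 1) saturates smoothly towards -α instead of being clipped to zero.

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))   # [-0.993 -0.632  0.     1.     5.   ] -> mean output is close to 0
```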

3 iii) PReLU (Parametric ReLU)

Alternatively, instead of using a fixed value such as 0.01 as in Leaky ReLU, PReLU uses a parameter α, which is learned during training along with the weights.

Leaky ReLU vs PReLU

3 iv) Swish (self-gated function)

The formula is : y = x * sigmoid (x)

Swish’s design was inspired by the use of sigmoid functions for gating in LSTMs and highway networks. We use the same value for gating to simplify the gating mechanism, which is called self-gating.

The advantage of self-gating is that it only requires a simple scalar input, while normal gating requires multiple scalar inputs. This feature enables self-gated activation functions such as Swish to easily replace activation functions that take a single scalar as input (such as ReLU) without changing the hidden capacity or number of parameters.

1) Being unbounded above helps prevent the gradient from gradually approaching 0 during training, which would cause gradient saturation. At the same time, being bounded below acts as a form of regularization and helps the function deal with large negative inputs.

2) At the same time, smoothness also plays a very crucial role in optimization and generalization.
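Swish is simply the input multiplied by its own sigmoid gate; a minimal sketch:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # equivalent to x * sigmoid(x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))   # small negative dip around x ~ -1, then approximately x for large x
```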

4. MAXOUT

— takes the maximum value among the outputs of n linear functions

Here, the number of linear functions (pieces) is determined beforehand. Approximating a function using multiple linear functions is known as piece-wise linear approximation.

The Maxout neuron (introduced by Goodfellow et al.) generalizes ReLU and its leaky version. Both ReLU and Leaky ReLU are special cases of this form (for example, ReLU corresponds to one of the two pieces having w = 0 and b = 0). It is a learnable activation function.

The Maxout activation is a generalization of the ReLU and the leaky ReLU functions and hence incorporates all the advantages of RELU while eliminating the dying RELU issue.

Maxout can be interpreted as adding a layer of activation function to the deep learning network, which contains a parameter k. Compared with ReLU, sigmoid, etc., this layer is special as it adds k neurons and then outputs the largest activation value.
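A minimal sketch of a single Maxout unit with k linear pieces (made-up shapes); setting one piece's weights and bias to zero recovers ReLU, which illustrates the generalization mentioned above.

```python
import numpy as np

def maxout(x, W, b):
    """x: (features,), W: (k, features), b: (k,) -> max over the k linear pieces."""
    return np.max(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# k = 2 pieces; zeroing out one piece's weights and bias recovers ReLU(w.x + b)
w = rng.normal(size=4)
W = np.stack([w, np.zeros(4)])
b = np.array([0.3, 0.0])
print(maxout(x, W, b))                # equals max(w.x + 0.3, 0)
print(max(np.dot(w, x) + 0.3, 0.0))   # same value
```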

Here is a detailed paper on Maxout Functions.

5. Softplus

The softplus function is similar to the ReLU function, but it is relatively smooth near zero.

Range : (0, + inf).

Softplus function: f(x) = ln(1 + e^x)

Derivative of Softplus: f′(x) = 1 / (1 + e^-x), which is the sigmoid function.

It is more complex to compute than ReLU.
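A minimal sketch of softplus and its derivative; the derivative works out to be exactly the sigmoid function, which is why softplus is often described as a smooth approximation to ReLU.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))        # ln(1 + e^x)

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))   # the derivative is sigmoid(x)

x = np.array([-5.0, 0.0, 5.0])
print(softplus(x))        # [0.0067 0.6931 5.0067] -> approaches ReLU away from 0
print(softplus_grad(x))   # [0.0067 0.5    0.9933]
```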


Here is a detailed article on Softplus and softminus functions.

ACTIVATION FUNCTIONS AT A GLANCE


It isn’t easy to recommend an activation function that works for all use cases. There are many considerations: how accurate it will be for a given set of data, how difficult it is to compute its derivative (if it is differentiable in the first place), how quickly it will make the network converge, how smooth it is, whether it satisfies the universal approximation theorem, whether it preserves normalization, and many more.

How can we pick an activation function then?

There are pros and cons to each activation function, as we have already covered, and hence we need to be careful while picking one. Listed below are some common points to keep in mind when choosing an activation function:

  • Sigmoid functions (softmax too) and their combinations generally work better in the case of classification problems.
  • Sigmoid and tanh functions are still avoided in hidden layers due to the vanishing gradient problem.
  • Tanh is avoided most of the time due to its saturating (gradient-killing) behaviour.
  • ReLU activation function is widely used and is a default choice as it yields better results (than sigmoid and tanh).
  • ReLU function however should only be used in the hidden layers (and never in the output layer)
  • An output layer can use a linear activation function in the case of regression problems but needs a non-linear one for classification tasks.
  • If we encounter a case of dead neurons in our networks the leaky ReLU function is the best choice.
  • ReLU activation function is currently the most commonly used function for the hidden layers (and never for the output layer) for any type of neural network.
  • Although the swish activation function has not been shown to consistently outperform ReLU, it is mostly worth using only for larger neural networks with depths greater than 50 layers.
  • For binary classification applications, the output (top-most) layer should be activated by the sigmoid function; the same applies to multi-label classification.
  • For multi-class applications, the output layer must be activated by the softmax activation function.
  • The linear activation function should only be used in the output layer of a simple regression neural network.
  • For recurrent neural networks (RNNs) the tanh activation function is preferred for the hidden layer(s). It is set by default in TensorFlow.
  • If ReLU fails to provide the best results, changing to leaky ReLU might in some cases yield better results and overall performance.
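Putting these guidelines together, here is a minimal tf.keras sketch (the layer sizes and input dimension are hypothetical) using ReLU in the hidden layers and softmax in the output layer for a multi-class problem.

```python
import tensorflow as tf

# Hypothetical sizes: 20 input features, 10 output classes.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),     # ReLU for hidden layers
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # softmax for multi-class output
])
# For binary classification, the last layer would be Dense(1, activation="sigmoid");
# for regression, Dense(1) with the default linear activation.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```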

Leaky Rectified Linear Unit, or Leaky ReLU, is based on ReLU, but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training. This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.

For Tensorflow activation function implementations, feel free to check out this module from — Tensorflow’s official website.

Final Thoughts

We now know that activation functions are an important factor in a neural network: they decide whether or not a neuron will be activated and its output transferred to the next layer. This simply means they decide whether the neuron’s input is relevant to the prediction process. For this reason, the activation function is also referred to as the threshold or transformation for the neurons, which helps the network converge.

Here are some detailed articles on the same :

Fundamentals of Deep Learning — Activation Functions and When to Use Them?

Deep Learning: Which Loss and Activation Functions should I use?

Activation Functions Explained — GELU, SELU, ELU, ReLU and more

Happy Activation!!

Source : V7 Labs
