Activation Functions used in Neural Networks

Rina Mondal
Dec 22, 2023 · 5 min read


Activation functions play a crucial role in neural networks by introducing non-linearities, allowing the network to learn complex relationships and patterns. In an artificial neural network (ANN), each neuron forms a weighted sum of its inputs and passes the resulting scalar value through a function referred to as an activation function or transfer function.

Now, let’s look at the criteria for an ideal activation function:

  1. Non-linear, so that the network can capture non-linear patterns.
  2. Differentiable.
  3. Computationally inexpensive.
  4. Zero centered or normalized.
  5. Non-saturating.

Here are some commonly used activation functions:


1. Sigmoid Function (Logistic):

The sigmoid function has an S-shaped curve and is defined by the formula:

σ(x) = 1/(1 + e^(-x))

Here:
- σ(x) is the output of the sigmoid function.
- x is the input to the function.
- e is the base of the natural logarithm (approximately 2.71828).

The sigmoid function maps any real-valued number to the range between 0 and 1. As x approaches positive infinity, σ(x) approaches 1, and as x approaches negative infinity, σ(x) approaches 0. The midpoint of the curve is at x = 0, where σ(0) = 1/2.
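To make this concrete, here is a minimal NumPy sketch of the sigmoid and its derivative; the function and variable names are my own choices for illustration, not part of any library API.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative: sigma'(x) = sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))             # outputs approach 0 and 1 at the extremes
print(sigmoid_derivative(x))  # gradients shrink toward 0 for large |x| (saturation)
```

The derivative printout already hints at the saturation drawback discussed below: for inputs far from zero, the gradient is nearly zero.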

Advantage:

i. It is used in the output layer of binary classification models, where the goal is to produce the probability that an input belongs to a particular class.

ii. It can capture non-linearity in the data.

iii. It squashes input values into the range (0, 1), making it suitable for tasks where a binary decision is needed.

iv. It is differentiable.

Drawback:

i. Its output saturates (becomes very close to 0 or 1) for extreme input values, which can lead to the “vanishing gradient” problem during backpropagation in deep neural networks.

ii. It is not zero-centered.

iii. It is computationally expensive because it involves the exponential function.

In practice, the sigmoid function is generally used only in the output layer, for binary classification.

2. Hyperbolic Tangent Function (tanh):

The hyperbolic tangent function (tanh) is a scaled and shifted version of the sigmoid function and is defined by the formula:

tanh(x) = (e^(2x) - 1)/(e^(2x) + 1)

Here:
- tanh(x) is the output of the tanh function.
- x is the input to the function.
- e is the base of the natural logarithm (approximately 2.71828).

The tanh function maps any real-valued number to the range between -1 and 1. As x approaches positive infinity, tanh(x) approaches 1, and as x approaches negative infinity, tanh(x) approaches -1. The midpoint of the curve, where tanh(0) = 0, is at x = 0.
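As a quick check of the zero-centered property, the short sketch below (variable names are purely illustrative) compares tanh and sigmoid outputs on the same symmetric inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)   # symmetric inputs: -3, -2, ..., 3
print(np.tanh(x))               # values spread over (-1, 1), symmetric around 0
print(np.mean(np.tanh(x)))      # mean is ~0 (zero-centered)
print(np.mean(sigmoid(x)))      # mean is ~0.5 (all outputs positive, not zero-centered)
```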

Advantage:

  1. It can capture non-linearity.
  2. It is differentiable.
  3. It is zero-centered, so training often converges faster than with sigmoid.

Drawback:

Like sigmoid, it is a saturating function and suffers from the vanishing gradient problem for inputs of large magnitude.

3. Rectified Linear Unit (ReLU):

ReLU, which stands for Rectified Linear Unit, is a popular activation function used in artificial neural networks. The ReLU function is defined as: ReLU(x) = max(0, x)

In other words, if the input x is positive, the output is equal to x, and if the input is negative, the output is zero. The function introduces non-linearity to the model and is computationally efficient.
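A one-line NumPy sketch makes the definition explicit (the names here are illustrative, not from any particular library):

```python
import numpy as np

def relu(x):
    """ReLU: keeps positive inputs unchanged and clips negative inputs to zero."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # [0.  0.  0.  0.5 2. ]
```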

Advantage:

  1. The ReLU activation is simple and computationally efficient, involving only a thresholding operation, so training converges much faster.
  2. It does not saturate in the positive region, unlike sigmoid and tanh.
  3. ReLU can lead to sparsity in the network, because some neurons output zero for certain inputs, effectively reducing the number of active neurons.

Drawback:

  1. It is not differentiable at zero.
  2. It is not zero-centered.
  3. Neurons with a ReLU activation can sometimes become “dead” during training, meaning they always output zero for any input. This occurs when a large gradient flows through a ReLU neuron and the resulting weight update pushes the neuron into a region where it always outputs zero. This is known as the dying ReLU problem.

To address this, variants like Leaky ReLU and Parametric ReLU have been proposed.

Popular variants of ReLU include:

  1. Leaky ReLU: f(x) = max(αx, x), where α is a small positive constant (typically close to zero, e.g., 0.01) used in place of zero for negative inputs.

Advantage:

  • Non-saturated.
  • Easily Computed.
  • No Dying ReLU Problem.
  • Close to Zero-centered.

2. Parametric ReLU (PReLU):
f(x) = max(αx, x)
where α is a learnable parameter.

3. Exponential Linear Unit (ELU):

ELU(x) = x if x > 0, ELU(x) = α(e^x - 1) if x ≤ 0

  • Smooth version of ReLU that allows negative values with a small penalty.

4. Scaled Exponential Linear Unit (SELU):

SELU(x) = λx if x > 0, SELU(x) = λα(e^x - 1) if x ≤ 0.

Here λ and α are fixed constants (λ ≈ 1.0507, α ≈ 1.6733), chosen so that activations self-normalize across layers.
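The fixed-form variants above can be sketched in a few lines of NumPy. PReLU is omitted because its α is learned during training; the α and λ defaults shown are the commonly cited values and are used here only for illustration.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a small fixed slope alpha for negative inputs avoids dead neurons."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """ELU: smooth exponential curve for negative inputs, identity for positive inputs."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def selu(x, lam=1.0507, alpha=1.6733):
    """SELU: a scaled ELU with fixed lambda and alpha chosen for self-normalization."""
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))
print(elu(x))
print(selu(x))
```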

4. Softmax Function:

It takes as input a vector of real numbers and transforms them into a probability distribution over multiple classes.

The Softmax function is defined as follows:

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)

The Softmax function exponentiates each element of the input vector and normalizes the results by the sum of all exponentiated values. This normalization ensures that the output vector sums to 1, making it interpretable as a probability distribution.
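In code, it is common to subtract the largest logit before exponentiating so that the exponentials do not overflow; the short sketch below (names illustrative) shows this:

```python
import numpy as np

def softmax(z):
    """Softmax: exponentiate each logit and normalize so the outputs sum to 1."""
    z = z - np.max(z)          # numerical-stability shift; does not change the result
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # roughly [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```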

Advantages:

1. The output of the softmax function represents a probability distribution over the classes. Each element of the output vector indicates the probability of the input belonging to the corresponding class.

2. The softmax function is sensitive to the magnitudes of the input values. Larger input values result in larger exponentiated values, leading to higher probabilities.

3. The softmax function is differentiable, which is crucial for training neural networks using gradient-based optimization algorithms.

4. The softmax function is often used in the final layer of a neural network for tasks involving multiple classes, such as image classification or natural language processing. It helps convert the raw output scores (logits) into probabilities, facilitating decision-making in multi-class scenarios.

These activation functions serve different purposes and may be suitable for specific types of problems or network architectures. The choice of activation function depends on factors such as the nature of the problem, the type of data, and the characteristics of the network being used.
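If you work with a framework such as PyTorch, all of the activations above are available as built-in modules (nn.Sigmoid, nn.Tanh, nn.ReLU, nn.LeakyReLU, nn.PReLU, nn.ELU, nn.SELU, nn.Softmax). The toy model below is only a sketch of how they slot into a network; the layer sizes, batch size, and number of classes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Toy multi-class classifier: ReLU in the hidden layers, softmax applied to the output logits.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 5),   # 5 output classes (raw logits)
)

x = torch.randn(8, 20)                # a batch of 8 examples with 20 features each
probs = torch.softmax(model(x), dim=1)
print(probs.shape, probs.sum(dim=1))  # each row sums to 1
```

In practice, loss functions such as nn.CrossEntropyLoss expect the raw logits and apply the softmax internally, so the explicit softmax here is only for inspecting the probabilities.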

Explore Complete Data Science Roadmap.

Visit my YouTube Channel where I explain Data Science related topics for free.

If you found this guide helpful, why not show some love? Give it a clap 👏, and if you have questions or topics you’d like to explore further, drop a comment 💬 below 👇. If you appreciate my hard work, please follow me; that is the only way I can continue my passion.
