Understanding Different Activation Functions
In my previous post, we looked into the significance of activation functions in neural networks and explored why nonlinearity is the secret ingredient that gives these networks their remarkable capabilities. In this post, let's dive deeper into the various activation functions.
Let's begin by exploring the key properties of activation functions:
- Nonlinearity: This is the most fundamental property: activation functions introduce nonlinearity into the network. This is important because real-world relationships and patterns are rarely linear. Nonlinear activation functions enable neural networks to capture complex structures and solve complex problems.
- Range: Activation functions have specific output ranges that affect how information is propagated through the network. Some functions produce outputs between 0 and 1 (Sigmoid), while others span from 0 to infinity (ReLU) or from -1 to 1 (Tanh). The short sketch after this list shows these ranges in practice.
- Monotonicity: A monotonic activation function never decreases (or never increases) as its input increases. This property ensures that as inputs change, the neuron's output moves in a consistent direction.
- Continuity: A continuous activation function produces smooth and continuous changes in output as inputs change slightly. This property helps in smooth gradient computations during backpropagation.
- Differentiability: Differentiability is essential for gradient-based optimization algorithms like backpropagation. Activation functions that are differentiable across their domain allow gradients to be computed for weight updates during training.
- Sparsity: Some activation functions promote sparsity by having their outputs be zero for a large portion of input space. This can be beneficial in reducing the complexity of neural networks.
- Computational Efficiency: When we're dealing with deep networks, we want things to happen quickly. Activation functions that can be calculated easily, without complex math such as exponentials or divisions, act like shortcuts that speed up training. Simpler calculations mean the neural network learns faster.
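To make properties like range and sparsity concrete, here is a small sketch (a quick preview using functions we will cover in detail below; the exact numbers depend on the sample inputs chosen) that applies a few common activations to the same inputs and reports their output ranges and the fraction of outputs that are exactly zero.
import numpy as np
# Sample inputs from -5 to 5 in steps of 1
x = np.linspace(-5, 5, 11)
activations = {
    'Sigmoid': 1 / (1 + np.exp(-x)),   # outputs in (0, 1)
    'Tanh': np.tanh(x),                # outputs in (-1, 1)
    'ReLU': np.maximum(0, x),          # outputs in [0, infinity)
}
for name, y in activations.items():
    zeros = np.mean(y == 0)  # fraction of outputs that are exactly zero (sparsity)
    print(f"{name:>7}: min={y.min():.3f}, max={y.max():.3f}, zeros={zeros:.0%}")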
Now, let's get into the different activation functions and how they differ in practice.
1. Binary Step Function:
- The Binary Step function is like a decision-maker and is the simplest activation function.
- If the input is at or above a threshold (here, 0), it outputs 1; otherwise, it outputs 0.
The Binary Step function can be mathematically represented as
f(x) = 1, if x >= 0
f(x) = 0, if x < 0
import numpy as np
import matplotlib.pyplot as plt
def binary_step_function(x):
    if x >= 0:
        return 1
    else:
        return 0
# Generate a range of input values
input_values = np.linspace(-10, 10, 100)
output_values = [binary_step_function(x) for x in input_values]
# Plotting
plt.plot(input_values, output_values, label='Binary Step Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Binary Step Function')
plt.legend()
plt.grid()
plt.show()
- The Binary Step function is generally associated with binary classifiers. However, it cannot be used for multiclass classification, where the target variable has more than two classes.
- The gradient of the Binary Step function is zero: the derivative of f(x) with respect to x is zero everywhere (and undefined at x = 0), so gradient-based learning cannot update the weights.
- The Binary Step function can be used in very simple scenarios where decisions are binary, like basic logic gates or simple threshold-based classifiers.
- But its inability to capture complex patterns makes it less suitable for real-world problems.
2. Linear Function:
- The linear activation function is directly proportional to the input.
- The main drawback of the Binary Step function was its zero gradient, since its output contains no component of x.
- The linear function removes that limitation.
Mathematically, the linear activation function is given by
f(x) = ax
Here, 'a' is a constant that determines the slope of the line.
If 'a' is positive, the line goes up as the input increases;
if 'a' is negative, the line goes down.
- The derivative of the Linear activation function is not zero, but it’s just a constant value. This means that during the backpropagation step, which is how neural networks learn, the weights and biases get updated using the same constant value.
- This doesn't lead to much improvement, because the gradient has the same value on every iteration, so the network cannot make meaningful progress on the error. Moreover, stacking layers with a linear activation still produces a linear mapping, so the network will not be able to identify complex patterns in the data (see the sketch after the code below).
No Learning Complexity:
Imagine you’re trying to learn from a book that only has one sentence, repeated over and over. You won’t gain much knowledge because there’s no complexity to learn from. Similarly, the Linear function’s constant gradient doesn’t help neural networks identify complex patterns in data. It’s like trying to understand a painting with only one color — you’ll miss all the details.
import numpy as np
import matplotlib.pyplot as plt
def linear_activation_function(x, a=1):
    return a * x
# Generate a range of input values
input_values = np.linspace(-10, 10, 100)
output_values = linear_activation_function(input_values)
# Plotting
plt.plot(input_values, output_values, label='Linear Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Linear Activation Function')
plt.legend()
plt.grid()
plt.show()
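To see why a purely linear network gains no expressive power, here is a minimal sketch (my own illustration, with randomly chosen weights) showing that two stacked layers with a linear activation collapse into a single linear layer with combined weights:
import numpy as np
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # a small batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))   # weights of layer 1
W2 = rng.normal(size=(5, 2))   # weights of layer 2
# Two layers with a linear activation f(z) = z
out_two_layers = (x @ W1) @ W2
# The same mapping expressed as ONE linear layer with combined weights
out_one_layer = x @ (W1 @ W2)
print(np.allclose(out_two_layers, out_one_layer))  # True: stacking adds no complexity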
3. Sigmoid Activation Function:
- The Sigmoid AF is sometimes referred to as the logistic function or squashing function.
- The Sigmoid is a non-linear AF used mostly in feedforward neural networks.
Mathematical Expression:
The Sigmoid function is written as:
f(x) = 1 / (1 + e^(-x))
Here, 'e' is a special number (around 2.71828) and 'x' is the input value.
- Sigmoid function is continuously differentiable and a smooth S-shaped function.
- The derivative of the function is: f'(x) = sigmoid(x) * (1 - sigmoid(x))
- It's great for scenarios where you need to measure things in terms of probabilities. For instance, if you want to know the likelihood of an image containing a dog, Sigmoid gives you a number between 0 (not a dog) and 1 (definitely a dog).
Vanishing Gradient Problem:
Now, imagine you’re teaching a robot to dance. Every time it does a dance move, you give feedback to improve. But what if the robot starts moving like a snail whenever you say “nice job”? That’s what happens with Sigmoid. When inputs get really big or really small, the dance (or gradient) becomes so slow that learning almost stops. This is known as the “vanishing gradient” problem.
The vanishing gradient problem makes it unsuitable for deep networks. However, it's still used when you need outputs between 0 and 1 (a small numeric check of the shrinking gradient follows the plotting code below).
Not Zero Centered:
The Sigmoid activation function isn't "zero-centered." Its outputs are always positive (sigmoid(0) = 0.5), so they are shifted to one side of zero rather than being centered around it.
Imagine you’re training a neural network to recognize objects in images. If the Sigmoid’s output is always biased towards one direction, it might impact the network’s ability to distinguish between different objects, even when they’re similar.
Additionally, this lack of zero-centering leads to another issue: because all the outputs share the same sign, the gradients for a layer's weights also share the same sign, producing inefficient, zig-zag weight updates. Combined with the vanishing gradient, this makes it difficult for the network to learn effectively and can slow down the learning process.
Modern activation functions, like the Rectified Linear Unit (ReLU), have become more popular due to their ability to address the challenges faced by Sigmoid.
import numpy as np
import matplotlib.pyplot as plt
def sigmoid_activation_function(x):
    return 1 / (1 + np.exp(-x))
# Generate a range of input values
input_values = np.linspace(-10, 10, 100)
output_values = sigmoid_activation_function(input_values)
# Plotting
plt.plot(input_values, output_values, label='Sigmoid Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Sigmoid Activation Function')
plt.legend()
plt.grid()
plt.show()
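To put a number on the vanishing gradient problem described above, here is a small sketch showing how quickly the Sigmoid's gradient, f'(x) = sigmoid(x) * (1 - sigmoid(x)), shrinks as the input moves away from zero:
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)  # peaks at 0.25 when x = 0
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f} -> sigmoid'(x) = {sigmoid_derivative(x):.6f}")
# The derivative is at most 0.25, so multiplying such factors layer by layer
# during backpropagation quickly drives the gradient towards zero.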
4. Tanh
- The Tanh function is also known as the hyperbolic tangent function.
- The Tanh function takes any input, big or small, and transforms it into a value between -1 and 1.
Mathematical Expression
Mathematically, the Tanh activation function can be expressed as:
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Here, 'e' is Euler's number (around 2.71828), and 'x' is the input value.
- It's like the Sigmoid function, but it has one advantage: it's zero-centered. This means that when the input is 0, the output is also 0 (see the quick check after the code below). This property can be beneficial when dealing with data that has a mean of zero, as it helps in the efficient learning of both positive and negative features.
- However, like Sigmoid, Tanh also faces the “vanishing gradient” problem. As you travel deep into the network, the gradients become small, affecting learning. But because Tanh is zero-centered, it’s better at handling certain situations.
- Tanh doesn't work well with sparse data, i.e. data that mostly contains zeros and just a few non-zero values. Think of it like a bunch of tiny toy cars, some red and some blue. Imagine putting these cars through Tanh: the small cars, whether a bit positive (+0.1) or a bit negative (-0.1), stay very close to zero.
- Now, if those little cars actually carried important information, that can be a problem. It's like misplacing these special cars among the others, making it harder for the neural network to notice and use them effectively. This becomes especially troublesome when you're handling data that's scattered and not closely grouped.
import numpy as np
import matplotlib.pyplot as plt
def tanh_activation_function(x):
    return np.tanh(x)
# Generate a range of input values
input_values = np.linspace(-10, 10, 100)
output_values = tanh_activation_function(input_values)
# Plotting
plt.plot(input_values, output_values, label='Tanh Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Tanh Activation Function')
plt.legend()
plt.grid()
plt.show()
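Here is a quick check of the zero-centering claim (a sketch with randomly generated zero-mean inputs): Sigmoid outputs cluster around 0.5, while Tanh outputs cluster around 0.
import numpy as np
rng = np.random.default_rng(1)
x = rng.normal(size=100_000)  # zero-mean inputs
sigmoid_out = 1 / (1 + np.exp(-x))
tanh_out = np.tanh(x)
print(f"mean of sigmoid outputs: {sigmoid_out.mean():.3f}")  # about 0.5, shifted to one side
print(f"mean of tanh outputs:    {tanh_out.mean():.3f}")     # about 0.0, zero-centered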
5. ReLU
- ReLU stands for Rectified Linear Unit and is a non-linear activation function that is widely used in neural networks.
- It generally offers better performance and generalization in deep learning compared to the Sigmoid and Tanh activation functions.
- It is a piecewise linear function that returns the input as the output if it is positive; otherwise, it returns 0.
Mathematical Expression:
The ReLU function can be shown as
f(x) = max(0, x)
Here, 'x' is the input value, and the function returns 'x'
if 'x' is positive and returns 0 if 'x' is negative.
- It mitigates the vanishing gradient problem observed in the earlier activation functions because its gradient is exactly 1 for all positive inputs (while all negative inputs are forced to 0).
- ReLU is mostly used in the hidden layers of deep neural networks, with other activation functions (such as Sigmoid or Softmax) typically used in the output layer.
- One really good thing about using ReLU is that it’s fast in calculations. It doesn’t need to do hard math like exponentials or divisions.
- It's even better than other functions because it doesn't activate all the neurons at once; only the neurons receiving positive inputs fire. The flip side is that for negative inputs the gradient is zero, which is why the corresponding weights and biases don't change during training (illustrated after the plotting code below).
- Even though ReLU is awesome, it has a small downside. Its unbounded outputs can let the network memorize too much and overfit, especially compared to the Sigmoid function. To fix this, people use the Dropout technique.
- It also suffers from a problem called “dying ReLU” where some neurons can become permanently inactive.
Imagine you have a group of friends who lose interest in a game and never want to play again. Similarly, ReLU has neurons that become inactive during training and stay that way, reducing the network’s effectiveness in learning. This is the “dying ReLU” problem.
To address this limitation, different versions of ReLU, such as Leaky ReLU, Parametric ReLU (PReLU), and Exponential Linear Units (ELU) were introduced.
import numpy as np
import matplotlib.pyplot as plt
def relu_activation_function(x):
    return np.maximum(0, x)
# Generate a range of input values
input_values = np.linspace(-10, 10, 100)
output_values = relu_activation_function(input_values)
# Plotting
plt.plot(input_values, output_values, label='ReLU Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('ReLU Activation Function')
plt.legend()
plt.grid()
plt.show()
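The "dying ReLU" problem mentioned above can be illustrated with a tiny sketch (a hypothetical neuron whose pre-activations happen to be all negative for a batch): both its output and its gradient are zero, so no learning signal reaches the weights feeding it.
import numpy as np
def relu(x):
    return np.maximum(0, x)
def relu_gradient(x):
    return (x > 0).astype(float)  # 1 for positive inputs, 0 for negative inputs
# Pre-activations of a neuron that only ever sees negative values
pre_activations = np.array([-3.2, -0.7, -1.5, -4.1])
print(relu(pre_activations))           # [0. 0. 0. 0.] -> the neuron never fires
print(relu_gradient(pre_activations))  # [0. 0. 0. 0.] -> no gradient flows back,
                                       # so the weights feeding it stop updating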
6. Leaky ReLU:
- Leaky ReLU works almost like ReLU, but when the input is negative, instead of turning it completely off, it applies a small slope so the output is a small negative value.
Mathematical Expression:
The Leaky ReLU function can be shown as
f(x) = x if x > 0
f(x) = ax if x <= 0 (where a is a small positive constant)
Here, 'x' is the input value, and 'a' is a small positive number, like 0.01.
- Unlike the regular ReLU, which encounters dead neuron issues, Leaky ReLU addresses the “dying ReLU” problem with the introduction of an alpha parameter. This parameter serves as a solution to prevent neurons from becoming inactive, ensuring that gradients are never completely zero during training.
- The small alpha parameter in Leaky ReLU guarantees that even negative inputs contribute slightly to the output.
- However, the value of alpha (usually denoted by 'α') needs careful tuning. If it's too small, it might not make a significant difference. If it's too big, it could lead to a network that doesn't learn well.
- The performance of Leaky ReLU can vary depending on the dataset and the problem at hand. It might not always be the ideal choice for all scenarios. In some cases, using plain ReLU or other activation functions might yield better results.
import numpy as np
import matplotlib.pyplot as plt
def leaky_relu_activation_function(x, alpha=0.01):
    return np.maximum(alpha * x, x)
# Generate a range of input values
input_values = np.linspace(-10, 10, 100)
output_values = leaky_relu_activation_function(input_values)
# Plotting
plt.plot(input_values, output_values, label='Leaky ReLU Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Leaky ReLU Activation Function')
plt.legend()
plt.grid()
plt.show()
7. Parametric ReLU:
- The Parametric ReLU, known as PReLU, is another variant of ReLU.
- Unlike the regular ReLU, which has a fixed slope for negative values, Parametric ReLU (PReLU) enables the network to learn and adjust this slope using a parameter (‘a’).
Mathematical Expression
PReLU's mathematical expression looks like this
f(x) = x if x > 0
f(x) = ax if x <= 0 (where 'a' is a learnable parameter)
The 'x' is the input value, and 'a' is the parameter which is adjusted while training.
- When the value of ‘a’ is initially set to 0.01, the function behaves similarly to the Leaky ReLU. However, in Parametric ReLU (PReLU), ‘a’ is not fixed; it’s a parameter that the network can adjust.
- This adaptability allows the network to learn the most suitable value of ‘a’ during training, ensuring quicker and more effective convergence.
- PReLU can be used when you want an activation function that’s adaptable and fine-tunes itself to your data.
- PReLU is a great fit when your data has a mix of positive and negative values, and you need an activation function that handles both effectively.
- It’s also beneficial if you’re aiming for faster and more efficient learning by enabling the network to learn the most suitable parameter value.
import numpy as np
import matplotlib.pyplot as plt
def prelu_activation_function(x, alpha=0.25):
    return np.where(x > 0, x, alpha * x)
# Generate a range of input values
input_values = np.linspace(-10, 10, 100)
output_values = prelu_activation_function(input_values)
# Plotting
plt.plot(input_values, output_values, label='Parametric ReLU Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Parametric ReLU Activation Function')
plt.legend()
plt.grid()
plt.show()
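To show what a "learnable slope" means in practice, here is a minimal hand-rolled sketch (not how a framework would implement it, and the upstream gradients are made-up numbers purely for illustration) of one gradient-descent update of the parameter 'a':
import numpy as np
def prelu(x, a):
    return np.where(x > 0, x, a * x)
# Gradient of the PReLU output with respect to 'a':
# d f(x) / d a = x when x <= 0, and 0 when x > 0
def prelu_gradient_wrt_a(x):
    return np.where(x > 0, 0.0, x)
x = np.array([-2.0, -0.5, 1.0, 3.0])
a = 0.25                                               # initial slope, as in Leaky ReLU
upstream_gradients = np.array([0.1, -0.2, 0.3, 0.05])  # hypothetical gradients from the loss
# Chain rule: accumulate the gradient for 'a' over the batch, then take a small step
grad_a = np.sum(upstream_gradients * prelu_gradient_wrt_a(x))
a = a - 0.1 * grad_a
print(f"updated a = {a:.4f}")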
8. ELU
- ELU (Exponential Linear Unit) handles both positive and negative inputs, unlike ReLU, which zeroes out the negative ones.
Mathematical Expression:
f(x) = x if x > 0
f(x) = α * (exp(x) - 1) if x ≤ 0
Here, 'x' is the input value, and 'α' is a hyperparameter (a small positive constant) that controls the value the function saturates to (-α) for large negative inputs.
- ELU prevents the "dying ReLU" problem by activating neurons even for negative values. This ensures that all neurons stay active and contribute to learning.
- By including negative values, ELU allows the network to learn from a wider range of inputs. This can be advantageous in scenarios where different input patterns matter.
- ELU is less sensitive to noise in the data. Learning from wider range of inputs, including both positive and negative values, enables ELU to distinguish between valuable information and noise. This contributes to improved stability and performance in various scenarios.
- While ELU offers these benefits, its exponential calculations can be computationally more expensive compared to simpler functions like ReLU.
- Additionally, determining the optimal ‘α’ value may require some experimentation, but it’s a necessary step to ensure ELU works optimally for your Neural Network and dataset.
- ELUs provide quicker learning and better generalization than ReLU and Leaky ReLU, especially in networks with over five layers.
- One limitation is that ELUs don’t center values at zero. To address this, the parametric ELU was introduced.
import numpy as np
import matplotlib.pyplot as plt
def elu_activation_function(x, alpha=0.5):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))
# Generate a range of input values
input_values = np.linspace(-10, 10, 100)
output_values = elu_activation_function(input_values)
# Plotting
plt.plot(input_values, output_values, label='Exponential Linear Unit (ELU)')
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Exponential Linear Unit (ELU)')
plt.legend()
plt.grid()
plt.show()
9. Swish
- The Swish function is an activation function that was proposed by researchers at Google in 2017.
- It is designed to combine the advantages of the Rectified Linear Unit (ReLU) and Sigmoid activation functions.
Mathematical Expression:
The Swish function is defined as follows
f(x) = x * sigmoid(x)
Here, 'x' is the input to the function, and 'sigmoid' represents the sigmoid activation function, which squashes input values between 0 and 1.
- Swish is like a mix of two types of functions. It looks at the input number 'x'. If 'x' is positive, the sigmoid factor is close to 1, so the output stays close to 'x' itself, much like ReLU. If 'x' is negative, the output becomes a small negative value that smoothly approaches 0.
- This mix of behaviors gives Swish its special character that's different from other functions.
- Swish offers smoothness similar to functions like sigmoid and tanh, contributing to stable learning during training.
- It often gives more accurate results compared to other functions like ReLU.
- For positive numbers, it acts like one thing (ReLU), and for negative ones, it behaves differently (like Sigmoid). This ability helps it handle different kinds of tasks and find different patterns in the data.
- However, Swish comes with a few disadvantages: it's computationally slower than ReLU and can be sensitive to the initial weights, so using a suitable weight-initialization technique is important for smooth and effective learning.
- Swish's runtime cost can also vary with the hardware it runs on: the extra exponential may be cheap on one machine and comparatively expensive on another, so it's worth considering the hardware's capabilities when using Swish in real-life situations.
import numpy as np
import matplotlib.pyplot as plt
def sigmoid_activation_function(x):
    return 1 / (1 + np.exp(-x))
def swish_activation_function(x):
    return x * sigmoid_activation_function(x)
# Generate a range of input values
input_values = np.linspace(-10, 10, 100)
output_values = swish_activation_function(input_values)
# Plotting
plt.plot(input_values, output_values, label='Swish Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Swish Activation Function')
plt.legend()
plt.grid()
plt.show()
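To back up the smoothness claim, here is a small sketch that differentiates Swish analytically, f'(x) = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x)), checks it against a finite-difference estimate, and shows that the gradient stays nonzero for mildly negative inputs (unlike ReLU's):
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def swish(x):
    return x * sigmoid(x)
def swish_derivative(x):
    s = sigmoid(x)
    return s + x * s * (1 - s)  # d/dx [x * sigmoid(x)]
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
eps = 1e-6
numerical = (swish(x + eps) - swish(x - eps)) / (2 * eps)  # finite-difference check
print(np.round(swish_derivative(x), 4))
print(np.round(numerical, 4))     # matches the analytic derivative
print(swish_derivative(-0.5))     # nonzero for a negative input, unlike ReLU's gradient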
10. Softmax
- The Softmax activation function is often used in the final layer of a neural network when dealing with multi-class classification problems.
- It takes a set of numerical values as input and transforms them into probabilities that sum up to 1. This makes it suitable for predicting the probability of an input belonging to each class in a classification task.
The mathematical expression for the Softmax function for an input vector 'x' is as follows
softmax(x_i) = exp(x_i) / sum(exp(x_j)) for all j
Where 'i' and 'j' iterate over the elements of the input vector 'x'.
- The Softmax function takes a vector of real numbers as input and produces another vector of the same dimension as output.
- For each input value 'x', the Softmax function calculates the exponential of 'x', and then it normalizes these exponentiated values by dividing them by the sum of all the exponentiated values. This ensures that the output values represent probabilities.
- In neural networks used for classification, the Softmax function is typically applied to the output layer.
- It converts the raw scores or logits produced by the previous layers into probabilities. The class with the highest probability is then predicted as the final output class.
- The primary strength of Softmax lies in its ability to handle multi-class classification problems where each input belongs to exactly one of several mutually exclusive classes. By providing a probability for every class, it expresses the model's relative confidence across all of them.
- Softmax is best suited when you’re dealing with multi-class classification problems and need to assign probabilities to different classes.
Additionally, Softmax has its disadvantages:
- It is sensitive to outliers in the input data. If the input values are extremely large or small, the exponential calculations involved in Softmax can lead to numerical instability and cause the function to produce unreliable results.
- Softmax is sensitive to noisy or ambiguous training labels, leading to misclassification errors. This lack of robustness to label noise can affect the model’s performance and generalization capabilities.
- It assumes that the classes are mutually exclusive and equally important. However, in scenarios with imbalanced class distributions, where some classes have significantly more samples than others, Softmax may not perform well. It can lead to biased predictions towards the majority class and result in poor performance for minority classes.
import numpy as np
import matplotlib.pyplot as plt
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
# Example input values
input_values = np.linspace(-10, 10, 100) # Creating a range of input values
# Applying Softmax to the input values
softmax_output = softmax(input_values)
# Plotting the Softmax output
plt.figure(figsize=(8, 5))
plt.plot(input_values, softmax_output, label="Softmax Output")
plt.xlabel("Input Values")
plt.ylabel("Probability")
plt.title("Softmax Activation Function")
plt.legend()
plt.grid(True)
plt.show()
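The softmax implementation above subtracts the maximum before exponentiating. Here is a short sketch of why that matters (the logits are deliberately extreme to trigger overflow; the shift does not change the mathematical result because it cancels in the ratio):
import numpy as np
logits = np.array([1000.0, 1001.0, 1002.0])  # deliberately large scores
# Naive softmax: exp(1000) overflows to infinity, producing nan values
naive = np.exp(logits) / np.exp(logits).sum()
# Stable softmax: subtracting the max leaves the probabilities unchanged
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()
print(naive)   # [nan nan nan] (with an overflow warning)
print(stable)  # [0.09003057 0.24472847 0.66524096]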
Now that we have understood the different activation functions, along with their advantages and limitations, it's important to note that there's no universal activation function that suits every problem.
Activation functions are essential components in machine learning, each with its own distinct characteristics. However, the effectiveness of these functions often relies on the context, data, and specific challenges. It’s not about finding the single “best” function, but rather understanding when and how to use them effectively.
As you explore different activation functions, keep in mind that trying out various options is essential for constructing your models, acquiring insights, and successfully addressing real-world challenges.