All about Activation functions & Choosing the Right Activation Function

Anushruthika
18 min read · Nov 9, 2023


  • An activation function is applied to the weighted sum of a neuron’s inputs (including the bias term), and the result becomes the neuron’s output, which is then passed to the next layer (see the sketch below).
  • It is used to introduce non-linearity into the model.
  • This allows the network to learn complex patterns and makes it capable of approximating arbitrary functions.
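
As a minimal sketch of that idea (the weights, bias, and inputs below are made-up values, and ReLU stands in for whichever activation function is chosen):

import numpy as np

def neuron_output(x, w, b, activation):
    # weighted sum of inputs plus bias, then the activation function
    z = np.dot(w, x) + b
    return activation(z)

# hypothetical inputs, weights, and bias for a single neuron
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.7])
b = 0.2
print(neuron_output(x, w, b, lambda z: np.maximum(0, z)))  # ReLU as the activation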

Step Activation function:

The step function, also known as the Heaviside step function, is a mathematical function that takes on one of two constant values. Often denoted H(x), it equals 0 for x < 0 and 1 for x ≥ 0. In other words, it “steps” from 0 to 1 at x = 0.

H(x) = { 0 for x < 0, 1 for x ≥ 0 }

Graph:

Step Function graph and its derivative graph

In the graph, the function has a value of 0 for all x values less than 0 and a value of 1 for x values greater than or equal to 0. The transition between 0 and 1 occurs instantaneously at x = 0, creating a step.

Advantages:

  • Simplicity: The step function is straightforward and easy to understand.
  • Binary Output: It provides a binary output, making it suitable for problems where you want to make a clear distinction between two classes or states.

Disadvantages:

  • Lack of Continuity: The step function is not continuous at x = 0. This discontinuity can make it challenging to work with in some mathematical contexts.
  • Non-Differentiability: The step function is non-differentiable at x = 0, which can be a problem in optimization algorithms that rely on derivatives.
  • Not Suitable for Gradient-Based Learning: In the context of neural networks and machine learning, the step function’s non-differentiability makes it unsuitable for gradient-based learning algorithms like backpropagation.

Code:

import numpy as np

def step_function(x):
    # 1 for x >= 0 and 0 otherwise, matching the definition of H(x) above
    return np.where(x >= 0, 1, 0)

Sigmoid Activation Function:

The sigmoid activation function, also known as the logistic function, is a classic non-linear activation function used in artificial neural networks.

The sigmoid function is defined as:

σ(x) = 1 / (1 + e^(-x))
  • Here, σ(x) represents the output of the sigmoid function for a given input x.
  • The function maps any real number to a value between 0 and 1.

Graph:

Sigmoid graph and its derivative graph

The sigmoid function produces an S-shaped curve. It approaches 0 as the input becomes very negative, passes through 0.5 at x = 0, and approaches 1 as the input becomes very positive.

Derivative of Sigmoid: The derivative is σ’(x) = σ(x)(1 − σ(x)). In the graph, you can see that the derivative values lie in the range [0, 0.25], with the maximum value of 0.25 occurring at the midpoint x = 0.

This bell-shaped curve is typical of the sigmoid function’s derivative. The derivative is largest around the center (x = 0) and shrinks toward 0 as |x| grows; gradient-based optimization algorithms like backpropagation rely on this derivative during neural network training.
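
A small sketch of that derivative (sigmoid computed inline so the snippet stands alone):

def sigmoid_derivative(x):
    # σ'(x) = σ(x) * (1 − σ(x)); peaks at 0.25 when x = 0
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)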

Use Cases:

  • Binary Classification: Sigmoid is commonly used in the output layer of binary classification models. It squashes the network’s raw output to a probability-like value between 0 and 1, which can be interpreted as the probability of belonging to one of the two classes.

Advantages:

  1. Smooth Gradient: The sigmoid function provides a smooth gradient, which makes it suitable for gradient-based optimization algorithms like gradient descent. This smoothness leads to more stable convergence during training.
  2. Output Range: The output of the sigmoid function is bounded between 0 and 1, which is useful in the context of probabilities.

Disadvantages:

  1. Vanishing Gradients: Sigmoid activation functions are prone to the vanishing gradient problem. For very positive or very negative inputs, the gradient becomes extremely small, causing slow convergence and making it harder for deep networks to train effectively.
  2. Output Not Centered at Zero: The output of the sigmoid function is always positive and not centered at zero, which can slow down learning in some cases. The outputs of hidden neurons in subsequent layers are influenced by the previous layer’s outputs and can become biased towards either 0 or 1.
  3. Not Sparse: Sigmoid activations are not sparse; they always produce some activation regardless of the input, which may not be efficient for certain tasks.

Due to the vanishing gradient issue and the availability of better activation functions like ReLU and its variants, sigmoid functions are less commonly used in hidden layers of deep neural networks today. However, they are still relevant in output layers for binary classification tasks.

Code:

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

Tanh Activation Function:

The hyperbolic tangent, often abbreviated as tanh, is another popular activation function used in artificial neural networks. It is similar to the sigmoid function but has some advantages, particularly in deep learning models.

Mathematical Formula: The tanh function is defined as:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
  • The tanh function, like the sigmoid, also squashes input values to a range between -1 and 1.
  • It is an S-shaped curve that is symmetric around the origin (0,0). This means that the tanh function outputs negative values for negative inputs and positive values for positive inputs, ranging between -1 and 1.
  • The function is centered at 0, so its outputs tend to have a mean near zero. This, together with its stronger gradients, helps it cope with vanishing gradients in deep networks better than the sigmoid.

Graph:

Tanh graph and its derivative graph

The tanh function has a similar S-shaped curve as the sigmoid but is centered at 0. It has steeper slopes near the origin compared to the sigmoid, which allows it to capture stronger gradients during backpropagation.

Derivative of tanh :

  • Symmetry: Similar to the tanh function itself, its derivative is symmetric with respect to the y-axis. This means that tanh’(x) is an even function, and its values are the same for positive and negative inputs.
  • Range: The range of tanh’(x) is (0, 1]. The derivative equals 1 at x = 0 and decays toward 0 as ‘x’ moves toward either negative or positive infinity.
  • Critical Points: The critical points of tanh’(x) are at x = 0, where the derivative reaches its maximum value of 1, and at x = ±∞, where the derivative approaches 0.
  • Bell Shape: Unlike the tanh function itself, the derivative of tanh is bell-shaped. It rises gradually from 0 for very negative x, reaches its maximum value of 1 at x = 0, and then gradually falls back toward 0 as |x| becomes very large (a short code sketch follows this list).
  • Gradient Near the Origin: tanh’(x) has a steeper gradient near the origin (x = 0) compared to the sigmoid function. This means that it can alleviate the vanishing gradient problem better than the sigmoid function, which can be advantageous in deep neural networks.
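
A minimal sketch of that derivative, using the identity tanh’(x) = 1 − tanh²(x):

def tanh_derivative(x):
    # tanh'(x) = 1 - tanh(x)^2; equals 1 at x = 0 and decays toward 0 for large |x|
    return 1 - np.tanh(x) ** 2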

Use Cases:

  • The tanh function is often used in hidden layers of neural networks, especially in recurrent neural networks (RNNs) and Long Short-Term Memory networks (LSTMs).
  • It can be useful when you want to map input data to values that are both positive and negative, and when the data distribution is centered around zero.

Advantages:

  1. Zero-Centered: The outputs of the tanh function are centered around zero, which keeps activations balanced and, combined with its stronger gradients, helps mitigate the vanishing gradient problem.
  2. Stronger Gradients: Tanh has steeper gradients around the origin, which can lead to faster convergence during training compared to the sigmoid function.

Disadvantages:

  1. Vanishing Gradient: Although less prone to the vanishing gradient problem than sigmoid, it can still suffer from it when used in deep networks.

The tanh activation function is a valuable alternative to the sigmoid, especially in situations where zero-centered outputs and stronger gradients are desired. It is commonly used in hidden layers of neural networks and can help improve convergence in training deep models.

Code:

def tanh(x):
    return np.tanh(x)

Relu Activation function:

The Rectified Linear Unit (ReLU) is a widely used activation function in the hidden layers of a neural network. It is defined as:

f(x) = max(0, x)

Here, `x` is the input to the function, and `f(x)` is the output.

Graph:

Relu graph and its derivative graph
  • The ReLU function is defined as ReLU(x) = max(0, x), which means it returns the input value if it’s non-negative (greater than or equal to zero) and returns zero if the input is negative.
  • For all x values greater than or equal to zero, the ReLU function returns x itself. It’s a linear function with a slope of 1 in this region.
  • For all x values less than zero, the ReLU function returns zero. This means it’s a flat horizontal line at y = 0 for negative inputs.
  • The ReLU function is a piecewise linear function with a “kink” at x = 0.

Derivative of ReLU:

  • The derivative of the ReLU function, denoted as ReLU’(x), is a piecewise function as well.
  • For all x values greater than zero, ReLU’(x) = 1. This means the derivative is a constant 1 for positive inputs.
  • For all x values less than zero, ReLU’(x) = 0. The derivative is zero for negative inputs.
  • The derivative is technically undefined at exactly x = 0 due to the kink in the ReLU function. In practice, you can set ReLU’(0) to either 0 or 1, depending on your implementation; some libraries use 0, and others use 1 (see the sketch after this list).
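
A minimal sketch of that derivative, using the common convention ReLU’(0) = 0:

def relu_derivative(x):
    # 1 for x > 0, 0 for x <= 0 (the value at exactly x = 0 is a convention choice)
    return np.where(x > 0, 1.0, 0.0)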

Interpretation:

  • The ReLU activation function introduces non-linearity into neural networks because it behaves linearly for positive values and remains inactive (outputting zero) for negative values.
  • The derivative of ReLU is either 1 or 0, making it suitable for backpropagation in deep learning models, as it effectively propagates errors backward through the network.
  • ReLU is a popular activation function for deep neural networks, as it helps mitigate the vanishing gradient problem, allowing networks to learn complex patterns and representations.

In the graph, you’ll see a linear increase for positive values in the ReLU plot, and the derivative is 1 in that region. For negative values, the ReLU plot remains flat at y = 0, and the derivative is 0. The kink at x = 0 is where the function changes behavior.

Key Characteristics:

1. Piecewise Linear: ReLU is a piecewise linear function, which means that it’s linear for positive values of `x` and zero for negative values of `x`. This linearity makes it computationally efficient.

2. Non-Linearity: Although ReLU is linear for positive values, it introduces non-linearity into the network. This non-linearity allows neural networks to model complex, non-linear relationships in data, making them suitable for a wide range of tasks, including image recognition and natural language processing.

3. Sparsity: ReLU introduces sparsity in the network because it sets all negative values to zero. Sparse activations can help reduce overfitting by preventing the co-adaptation of neurons.

4. Vanishing Gradient: Unlike activation functions like sigmoid and tanh, ReLU does not suffer from the vanishing gradient problem to the same extent. This allows for more stable and efficient training of deep networks.

5. Rectification: The term “Rectified” in ReLU comes from the fact that it rectifies (sets to zero) any negative input values, effectively removing the negative part of the signal.

6. Range: The range of ReLU is [0, ∞), which means that for positive inputs, it passes the input value as is, and for negative inputs, it outputs zero.

Advantages:

  • Simplicity: ReLU is computationally efficient and easy to implement.
  • Non-linearity: It introduces non-linearity to the model, allowing it to learn complex patterns.
  • Mitigating Vanishing Gradient: It helps address the vanishing gradient problem, making it suitable for deep networks.

Disadvantages:

  • Dead Neurons: ReLU neurons can sometimes be “dead” (always output zero) during training if their weights are updated in a way that keeps their inputs always negative. This can slow down learning.
  • Not Centered at Zero: ReLU is not centered at zero, which can lead to convergence issues in some cases.
  • Exploding Gradients: ReLU can suffer from exploding gradients, although this is less common than the vanishing gradient problem.

Variations:

  • Leaky ReLU: To address the “dying ReLU” problem, Leaky ReLU allows a small, non-zero gradient for negative inputs. It is defined as `f(x) = max(αx, x)`, where `α` is a small positive constant.
  • Parametric ReLU (PReLU): PReLU extends Leaky ReLU by making the leakage rate (`α`) learnable during training.
  • Exponential Linear Unit (ELU): ELU is another variation that has a non-zero gradient for negative inputs and is centered at zero.

ReLU is a fundamental activation function in deep learning and has contributed to the success of deep neural networks in various applications. Its simplicity, non-linearity, and ability to mitigate vanishing gradients make it a popular choice in practice.

Code:

def relu(x):
    return np.maximum(0, x)

Leaky Relu Activation Function:

The Leaky ReLU (Rectified Linear Unit) activation function is a variant of the standard ReLU function. It addresses some of the limitations of the ReLU function, which can “die” during training due to neurons getting stuck in the zero-activation region.

Leaky ReLU(x) = x if x > 0
Leaky ReLU(x) = α * x if x <= 0

Here, α (usually a small positive value like 0.01) is the slope of the function for negative inputs.

Graph:

Leaky Relu graph and its derivative graph

In the graph of Leaky ReLU, for all positive inputs (x > 0), it behaves like a standard ReLU, which means the output is equal to the input. When the input is negative (x < 0), it grows linearly with a slope of α, allowing a small gradient.

In the derivative graph, you’ll see that for all positive inputs, the derivative is 1, indicating that the gradient is preserved. For negative inputs, it’s α, a constant smaller than 1; here α is taken to be 0.01 (see the sketch below).
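
A small sketch of that derivative:

def leaky_relu_derivative(x, alpha=0.01):
    # 1 for positive inputs, alpha for non-positive inputs
    return np.where(x > 0, 1.0, alpha)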

Characteristics:

1. Non-linearity: Leaky ReLU is a non-linear activation function that introduces non-linearity into neural networks, allowing them to learn complex patterns and representations.

2. Variability: Unlike the standard ReLU function, Leaky ReLU has a small slope for negative values, which adds some variability. This helps prevent neurons from becoming inactive, as is the case with the ReLU function.

Advantages:

1. Mitigates “Dying ReLU” Problem: Leaky ReLU helps prevent the “dying ReLU” problem, where neurons can become inactive (always outputting zero) during training. By introducing a small slope for negative inputs, Leaky ReLU allows gradients to flow and neurons to recover from being stuck in the zero-activation region.

2. Closer to Zero Mean: Because negative inputs map to small negative outputs, Leaky ReLU activations have a mean closer to zero than standard ReLU activations, which are always non-negative. This can make optimization easier.

3. Simple Implementation: Implementing Leaky ReLU is straightforward, and it has only one additional hyperparameter (α) to tune.

Disadvantages:

1. Not Always Better: While Leaky ReLU addresses some issues with the standard ReLU, it doesn’t guarantee improved performance. The choice between ReLU, Leaky ReLU, or other activation functions depends on the specific problem and dataset.

2. Unbounded Activation: Leaky ReLU is unbounded, which means it can produce very large or very small activations. In some cases, this unbounded nature can lead to exploding gradients during training.

In summary, Leaky ReLU is a variant of the ReLU activation function designed to prevent neurons from becoming inactive during training. While it can be a useful choice for mitigating the “dying ReLU” problem, it is not a one-size-fits-all solution, and its performance may vary depending on the specific problem and dataset.

Code:

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

Parameterized ReLU Activation Function:

Parameterized Rectified Linear Unit (PReLU) is an activation function that is a variation of the traditional Rectified Linear Unit (ReLU) activation.

The PReLU activation function introduces a learnable parameter, α, which scales the negative side of the function that can be adjusted during training. It is defined as:

f(x) = { αx if x < 0; x if x ≥ 0 }

Key Differences between Leaky ReLU and Parameterized ReLU:

  • Leaky ReLU has a fixed, predefined slope (α) that remains the same for all neurons and features. In contrast, PReLU introduces learnable parameters, allowing different neurons to have different slopes.
  • The slope of Leaky ReLU is a hyperparameter set by the user, while the slope of PReLU is learned from the data during training.
  • PReLU is generally considered more flexible because it can adapt to the data by learning different slopes for different features or neurons.

Characteristics:

1. Non-linearity: PReLU introduces non-linearity into the model, which is essential for learning complex patterns in data.

2. Learnable Parameters: Unlike traditional ReLU, PReLU has learnable parameters. It adds a slope to the negative side of the activation, allowing the network to learn the optimal slope during training.

3. Variability: PReLU introduces variability in activation functions, which can help the model adapt to different types of data.

Advantages:

1. Mitigating Dead Neurons: PReLU helps mitigate the “dying ReLU” problem, which can occur when neurons always output zero. By allowing negative values and learning the slope, it can address this issue.

2. Increased Model Flexibility: PReLU adds an extra degree of flexibility to the model by allowing it to learn different slopes for different neurons.

3. Improved Training: It can lead to faster and more stable convergence during training.

Disadvantages:

1. Increased Model Complexity: PReLU introduces more parameters into the model, making it more complex and potentially prone to overfitting if not properly regularized.

Graph:

The graph of the PReLU activation function resembles the ReLU function but with a learnable slope on the negative side. The function outputs the input value for non-negative inputs (x ≥ 0) and scales the negative inputs by α for x < 0.

Parameterized Relu graph and its derivative graph

The graph on the negative side of the function can have various slopes, which are determined during training. This allows the network to adapt to the data and learn the optimal slopes for different neurons.

In summary, PReLU is a variant of ReLU that addresses the dying ReLU problem by introducing learnable parameters. While it can make training more efficient and stable, it also adds complexity to the model, which should be carefully managed to prevent overfitting.

Code:

def parametric_relu(x, alpha=0.01):
    # same form as Leaky ReLU, but in a real network alpha is a learnable parameter
    return np.where(x > 0, x, alpha * x)
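
To show how α could actually be learned, here is a minimal, framework-free sketch of the gradient of the PReLU output with respect to α and a single update step (prelu_alpha_grad, lr, and the example arrays are illustrative assumptions, not part of any library):

def prelu_alpha_grad(x, upstream_grad):
    # d f / d alpha is x where x <= 0 and 0 where x > 0, summed over the batch
    return np.sum(upstream_grad * np.where(x > 0, 0.0, x))

# one hypothetical gradient-descent step on alpha
alpha = 0.01
lr = 0.001
x = np.array([-2.0, -0.5, 1.0, 3.0])
upstream_grad = np.array([0.1, -0.2, 0.3, 0.05])  # gradient flowing back from the loss
alpha -= lr * prelu_alpha_grad(x, upstream_grad)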

Exponential Linear Unit

It addresses some of the limitations of the traditional ReLU function, especially the “dying ReLU” problem, by replacing the flat negative part of the function with a smooth exponential curve.

For x > 0: eLU(x) = x
For x <= 0: eLU(x) = alpha * (e^x - 1)

where alpha is a hyperparameter set to a positive value, commonly 1.0 (the default used in the code below).

Graph:

Exponential Linear Unit graph and its derivative graph
  • For positive values of x, the eLU is simply the identity function, and it passes the input through unchanged.
  • For non-positive values of x, it follows the exponential curve alpha * (e^x - 1), which bends smoothly toward -alpha as x becomes more negative.

The eLU graph is characterized by:

  • It is a smooth curve that passes through the origin (0,0).
  • It allows negative values and saturates smoothly toward -alpha for strongly negative inputs.
  • It is continuous everywhere and, for the usual choice alpha = 1, differentiable everywhere as well.

Its derivative is:

eLU'(x) = { 1, if x > 0
            eLU(x) + alpha, if x <= 0 }

The derivative graph of eLU is characterized by:

  • It is also a smooth curve.
  • For positive x, the derivative is 1, which means the gradient is preserved for positive inputs.
  • For negative x, the derivative is alpha * e^x (equivalently eLU(x) + alpha): it equals alpha at x = 0 and decays smoothly toward 0 for strongly negative inputs, but it never drops to exactly zero the way ReLU’s derivative does, so some gradient keeps flowing (see the sketch after this list).
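
A minimal sketch of that derivative:

def elu_derivative(x, alpha=1.0):
    # 1 for x > 0; alpha * e^x (i.e. elu(x) + alpha) for x <= 0
    return np.where(x > 0, 1.0, alpha * np.exp(x))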

Characteristics:

  1. Not Strictly Zero-Centered: eLU outputs saturate at -alpha for negative inputs (alpha is commonly set to 1.0). This pushes mean activations closer to zero than ReLU does, but the outputs are not exactly zero-centered.
  2. Outputs for Negative Values: eLU outputs negative values for negative inputs, allowing neural networks to learn from negative input values.
  3. No Vanishing Gradient or Dead Neurons: eLU mitigates the “dying ReLU” problem by providing a gradient for all input values, including negative ones. This helps prevent neurons from becoming inactive during training.
  4. Well-Defined Derivative: The derivative of the eLU function is defined and computationally tractable for all input values, which helps avoid the vanishing gradient problem. However, its derivative involves exponential functions and can be computationally more expensive than other activation functions like ReLU.
  5. Time Complexity for Derivative: The time complexity for computing the derivative of eLU is higher due to the presence of exponential functions, which can make it slower to compute compared to simpler activation functions.

Advantages:

  1. Avoids the “dying ReLU” problem: The eLU function doesn’t suffer from the vanishing gradient problem for negative inputs as experienced with traditional ReLU.
  2. Smooth gradient: It provides a smooth gradient for all values of x, which can help during training.
  3. Allows negative values: eLU can output negative values for negative inputs, which allows the network to have a more balanced learning.

Disadvantages:

  1. Computational cost: The exponential function calculation can be computationally expensive compared to simpler activation functions like ReLU.
  2. Not exactly zero-centered: although eLU takes on negative values for x <= 0, its outputs are bounded below at -alpha rather than balanced around zero, which can lead to issues in training under certain conditions.

Code:

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

Swish Activation Function

The Swish activation function is a relatively newer activation function introduced by researchers at Google. It is characterized by its smoothness and is designed to combine some of the advantages of other activation functions.

Swish(x) = x * sigmoid(beta * x)

Here, `beta` is a hyperparameter that can be tuned.

Graph:

Swish Activation function graph and its derivative graph

The graph of the Swish function is a smooth curve that looks like a slightly softened ReLU: as x approaches negative infinity, the output approaches zero; for moderately negative x it dips slightly below zero; and as x approaches positive infinity, the output grows without bound. The function is continuous and differentiable everywhere.

For beta = 1, the derivative is Swish'(x) = Swish(x) + sigmoid(x) * (1 - Swish(x)); in general, Swish'(x) = beta * Swish(x) + sigmoid(beta * x) * (1 - beta * Swish(x)).

The derivative of Swish is also a smooth, non-monotonic curve, which lets the function capture complex patterns in the data. It stays close to 0 for very negative inputs, passes through 0.5 at the origin, and settles near 1 for large positive inputs (with a slight overshoot above 1 in between), so useful gradient is preserved during backpropagation.
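
A minimal sketch of that derivative, with sigmoid computed inline so the snippet stands alone:

def swish_derivative(x, beta=1.0):
    # Swish'(x) = beta * Swish(x) + sigmoid(beta * x) * (1 - beta * Swish(x))
    s = 1 / (1 + np.exp(-beta * x))   # sigmoid(beta * x)
    f = x * s                         # Swish(x)
    return beta * f + s * (1 - beta * f)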

Characteristics:

1. Smoothness: Swish is a smooth, differentiable function. It inherits this property from the sigmoid function, which is part of its definition.

2. Non-linearity: Swish is a non-linear function, that allows neural networks to model complex relationships between inputs and outputs.

3. Zero-Centered: Swish is approximately zero-centered, meaning it has values around zero for most inputs, which can help with training stability.

Advantages:

1. Training Speed: Swish has been reported to train faster than ReLU on some deep neural networks, which is often attributed to its smooth, non-monotonic shape.

2. Smoothness: The smoothness of Swish helps with gradient propagation during training, reducing the likelihood of gradient-related issues like vanishing gradients.

3. Zero-Centered: Being approximately zero-centered helps with optimization, as it can maintain negative and positive activations, preventing weights from drifting too far in one direction.

4. Empirical Performance: Swish has shown good empirical performance on various tasks and is a competitive alternative to other activation functions.

Disadvantages:

1. Computational Cost: Swish is more computationally expensive than ReLU, primarily due to the sigmoid operation.

2. Hyperparameter: The `beta` hyperparameter in the Swish function needs to be tuned. While setting it to 1 is common, it might not be the optimal value for all tasks.

3. Lack of Theoretical Justification: While Swish has shown good performance in practice, it lacks the strong theoretical justification of some other activation functions like ReLU.

Code:

def swish(x, beta=1.0):
    # x * sigmoid(beta * x), written out as x / (1 + e^(-beta * x))
    return x / (1 + np.exp(-beta * x))

SoftMax Activation Function:

The softmax activation function is commonly used in multi-class classification problems. It takes an input vector and squashes the values between 0 and 1, normalizing them so that they add up to 1. This makes it useful for converting a vector of arbitrary real values into a probability distribution.

Given an input vector z = [z_1, z_2, …, z_k], the softmax function is defined as:

softmax(z_i) = e^(z_i) / (e^(z_1) + e^(z_2) + … + e^(z_k))

Here, e is the base of the natural logarithm, and k is the number of classes.

Characteristics:

  • Normalization: Softmax converts input values into probabilities that sum to 1, making it suitable for multi-class classification problems.
  • Non-Negativity: The output probabilities are always between 0 and 1.
  • Differentiability: The function is differentiable, which is essential for training neural networks using gradient-based optimization algorithms.

Advantages:

1. Interpretability: Softmax provides a clear probability distribution over classes, aiding in result interpretation.
2. Compatibility: Well-suited for multi-class classification tasks.

Disadvantages:

1. Sensitivity to Outliers: Softmax is sensitive to extreme values in the input vector.
2. Mutual Exclusivity: Assumes that classes are mutually exclusive, which might not be suitable for all problems.

Use Case:

  • Softmax is often the activation function of choice in the output layer of neural networks for multi-class classification problems.

Graph:

  • As the input values increase, the corresponding softmax values tend to 1.
  • As the input values decrease, the corresponding softmax values tend to 0.
  • The softmax function essentially amplifies the differences between the input values, emphasizing the most significant ones in the probability distribution.

Derivative:

  • The derivative of each softmax output with respect to its own input is p_i * (1 - p_i). It reaches its maximum at the point where the softmax output is 0.5, which means the derivative is highest when the model is uncertain about the class assignment.

Soft Max Activation function graph and its derivative graph

Code:

def softmax(x):
    # subtract the per-row maximum for numerical stability before exponentiating
    exp_values = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=-1, keepdims=True)
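
A quick usage check with hypothetical scores, confirming the outputs behave like a probability distribution:

scores = np.array([2.0, 1.0, 0.1])   # hypothetical raw scores for 3 classes
probs = softmax(scores)
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0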

Choosing the right activation function

Step Function:

  • Rarely used in modern neural networks. Occasionally in the output layer for binary classification problems.

Sigmoid Function:

  • Output layer for binary classification. Occasionally in hidden layers when outputs need to be squashed into the (0, 1) range.

Tanh Function:

  • Similar to sigmoid but with zero-centered output. Used in scenarios where zero-centered output is desired.
  • Sigmoid and tanh functions are at times avoided in deep hidden layers due to the vanishing gradient problem.

ReLU (Rectified Linear Unit):

  • Hidden layers for most tasks. Simple and effective but may not perform well on all types of data.

Leaky ReLU:

  • A variation of ReLU, useful to address the dying ReLU problem. Suitable for scenarios where ReLU might be too aggressive.

Parametric ReLU (PReLU):

  • Similar to Leaky ReLU but with a learnable parameter. Can be useful when the degree of leakiness needs to be optimized.

Exponential Linear Unit (ELU):

  • Addresses vanishing gradient problem, allows negative values, and can be suitable when the model needs to capture more complex patterns.

Swish Activation:

  • Designed to perform well across various tasks. Can be a good default choice if unsure.

Softmax Activation:

  • Output layer for multi-class classification. Converts raw scores into probabilities.

Choosing the right activation function often involves experimentation, and there is no one-size-fits-all solution. The choice depends on the specific characteristics of the data and the requirements of the task at hand.
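
As a small illustration of these choices in practice, here is a sketch of a two-layer forward pass (hypothetical layer sizes and random weights) that uses ReLU in the hidden layer and softmax at the output, reusing the functions defined earlier in this article:

rng = np.random.default_rng(0)

# hypothetical dimensions: 4 input features, 8 hidden units, 3 classes
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def forward(x):
    hidden = relu(x @ W1 + b1)        # ReLU in the hidden layer
    return softmax(hidden @ W2 + b2)  # softmax for multi-class output

print(forward(rng.normal(size=(2, 4))))  # two samples -> two rows of class probabilities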
