Activation Functions in Neural Networks

Tejas T A
Published in Analytics Vidhya
Dec 6, 2020 · 15 min read

Activation functions determine the output of a neural network. They strongly influence the accuracy of the network and the computational power needed to train it. An activation function is attached to each neuron in the network and acts as a gate, 'firing' the neuron when the right set of inputs is received.

Activation functions are a fundamental component of artificial neural networks, influencing both the network’s ability to learn complex patterns and its computational efficiency. By introducing non-linearity, activation functions enable neural networks to approximate intricate functions and solve complex tasks such as image and speech recognition.

Example: when you dip your hand in cold water you feel the sensation of cold, and when you dip it in hot water you sense the heat. This happens due to the activation of certain neurons inside the brain that are responsible for detecting hotness and coldness.

Representation of how an activation function works in a neuron:

Representation of the Activation Function

Types of Activation functions -

Linear Activation Functions

  1. Binary Step Function

This is a threshold-based AF. You decide on a threshold; if the value fed to the function is greater than the threshold the neuron is activated, and if it is less than the threshold it is not fired. The problem with this AF is that if multiple neurons are activated, which one should we consider? As a result, a network employing the Binary Step function cannot classify the output into one of the many desired categories.

Binary Step AF

The Binary Step function is not differentiable at the threshold and has zero gradient everywhere else, so there is no scope for backpropagation.
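As a rough sketch (using NumPy, with a threshold of 0 chosen purely for illustration), the Binary Step function might look like this:

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Binary step activation: 1 if the input exceeds the threshold, else 0."""
    return np.where(x > threshold, 1.0, 0.0)

# The gradient is 0 almost everywhere, so backpropagation cannot make use of it.
print(binary_step(np.array([-2.0, -0.5, 0.3, 4.0])))  # [0. 0. 1. 1.]
```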

2. Linear Function

The neuron takes the inputs multiplied by the weights and adds a bias; the Linear AF then produces an output proportional to this weighted sum. It can therefore produce distinct outputs for multiple categories, and we can pick the neuron with the maximum output.

The slope of a straight line is constant, so during backpropagation the gradient does not tell us by how much the weights should be adjusted, or for which neuron. Moreover, a stack of linear layers collapses into a single linear layer (as the sketch below shows), so the network will not be efficient, will not understand the inputs properly, and will end up with very low accuracy at the output layer.
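A minimal NumPy sketch (the layer sizes and random weights are made up for illustration) shows why stacking purely linear layers adds no expressive power:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                     # a batch of 4 inputs with 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two "layers" with a linear activation (no non-linearity in between)
two_layer = (x @ W1 + b1) @ W2 + b2

# ...collapse into a single equivalent linear layer
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layer, one_layer))  # True: depth adds nothing without non-linearity
```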

Non-Linear Function

Non-linear functions work well with complex networks, allowing us to build deeper networks with many layers. They can handle complex data such as images, audio, and video, and they allow the weights to be adjusted efficiently during backpropagation.

Sigmoid AF

Whatever input it receives, the output values will lie between 0 and 1, since σ(x) = 1 / (1 + e^(−x)).

Sigmoid AF

Disadvantages -

● The output is not zero-centered, so optimization is computationally expensive and reaching the global minimum takes a lot of time.

● The vanishing gradient problem occurs during backpropagation: for inputs far from zero the gradient of the sigmoid is nearly zero, so the updated weight ends up almost equal to the old weight and learning of the network is diminished (see the sketch below).
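A small NumPy sketch of the sigmoid and its derivative illustrates the vanishing gradient: for inputs far from zero the gradient is almost zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # the maximum value is 0.25, at x = 0

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # outputs squashed into (0, 1)
print(sigmoid_grad(x))  # gradients near 0 for large |x| -> vanishing gradients
```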

Tanh AF

Whatever input it receives, the output values will lie between −1 and 1, since tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).

tanh AF

Advantage: the output is zero-centered, so optimization is less computationally expensive.

Disadvantage: the vanishing gradient problem still occurs during backpropagation, so the updated weight is nearly equal to the old weight and learning of the network is diminished.

ReLu (Rectified Linear Unit) AF

● If the value of x > 0, then output is x.

● If the value of x ≤ 0, then output is 0.

Advantage: mitigates the vanishing gradient problem, since the gradient is 1 for all positive inputs.

Disadvantage: the dead neuron (dying ReLU) condition. The derivative of ReLU is 1 for positive inputs but 0 for negative inputs, so a neuron whose pre-activation stays negative receives no gradient during backpropagation, its weights are never updated, and it effectively dies.
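A minimal NumPy sketch of ReLU and its gradient shows why negative pre-activations receive no weight updates:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (the source of "dead" neurons)
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.] -- no gradient flows for negative pre-activations
```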

Leaky ReLU AF

To avoid the dead neuron state, for negative inputs we multiply x by a small positive number (e.g., 0.01). However, this small slope means the gradient for negative inputs is tiny, which can reintroduce the vanishing gradient problem, as we see in the example illustrated below.

Leaky ReLU AF
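A quick NumPy sketch of Leaky ReLU (the slope of 0.01 used here is the commonly quoted default, assumed for illustration):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: pass x through unchanged when positive, scale by a small
    slope alpha when negative so the gradient is never exactly zero."""
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-4.0, -1.0, 0.0, 2.0])))  # [-0.04 -0.01  0.    2.  ]
```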

ELU (Exponential Linear Unit) AF

ELU outputs x for x > 0 and α(e^x − 1) for x ≤ 0. Because of the exponential term for negative inputs, the computation is more expensive than ReLU's.

ELU
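A short NumPy sketch of ELU (α = 1.0 is assumed here, a commonly used default):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (exp(x) - 1) for negative inputs.
    The exponential term is what makes ELU costlier to compute than ReLU."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-3.0, -1.0, 0.0, 2.0])))  # smoothly saturates towards -alpha for x << 0
```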

PReLU (Parametric ReLU)

● If x > 0, then the output is x.

● If x ≤ 0, then the output is αx.

PReLU AF

For positive inputs, the first condition makes the activation function behave exactly like ReLU; the difference lies in how negative inputs are scaled.

Here α is the slope applied to negative inputs (not a learning rate).

● If α = 0.01 (a small fixed value), it becomes Leaky ReLU.

● If α = 0, it becomes ReLU.

● For any other value of α, learned during training, it is Parametric ReLU.

Note: in Leaky ReLU, α is a fixed hyperparameter; in PReLU, α is a parameter learned along with the other network weights. A short sketch follows.
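A minimal NumPy sketch of PReLU; the α values passed in below are illustrative stand-ins for what the network would learn during training:

```python
import numpy as np

def prelu(x, alpha):
    """PReLU: identical to Leaky ReLU except that alpha is a learned parameter,
    updated by gradient descent together with the network weights."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 1.5])
print(prelu(x, alpha=0.01))  # behaves like Leaky ReLU
print(prelu(x, alpha=0.0))   # behaves like ReLU
print(prelu(x, alpha=0.25))  # a value alpha might converge to during training
```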

Swish

Introduction

The Swish activation function is a relatively recent development in the field of deep learning, introduced by Ramachandran et al. (2017). It has gained attention due to its ability to improve neural network performance over traditional activation functions like ReLU. Swish is a smooth, non-monotonic function that combines the input feature and the sigmoid function, allowing it to retain negative values while introducing non-linearity. Its adaptability and performance benefits make it a valuable addition to modern neural network architectures.

Mathematical Definition

The Swish activation function is defined as:

f(x) = x · σ(βx) = x / (1 + e^(−βx))

where:

  • x is the input to the neuron.
  • σ(βx) is the sigmoid function applied to βx.
  • β is a trainable or fixed parameter that controls the shape of the function.

In practice, β can be set to 1, resulting in the simplified version:

f(x) = x · σ(x) = x / (1 + e^(−x))
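A small NumPy sketch of Swish under these definitions (β defaults to 1 here, the simplified version):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x). With beta = 1 this is also known as SiLU."""
    return x * sigmoid(beta * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))  # small negative values are retained instead of being clipped to 0
```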

Properties

  • Range: The output of Swish is unbounded above and bounded below, similar to ReLU but with negative values allowed.
  • Differentiability: Swish is differentiable everywhere, facilitating gradient-based optimization.
  • Non-Monotonicity: Unlike many activation functions, Swish is non-monotonic, which can improve the expressiveness of neural networks.
  • Smoothness: Swish is a smooth function without abrupt changes, potentially leading to better optimization dynamics.

Derivative

The derivative of the Swish function with respect to x is:

f′(x) = βf(x) + σ(βx)(1 − βf(x))

which, for β = 1, simplifies to f′(x) = f(x) + σ(x)(1 − f(x)). This derivative combines the properties of the sigmoid function and the input x, enabling efficient computation during backpropagation.

Advantages

Improved Performance

Empirical studies have demonstrated that Swish can outperform traditional activation functions like ReLU and its variants in deep neural networks, particularly in very deep architectures with more than 40 layers (Ramachandran et al., 2017).

Non-Monotonicity

The non-monotonic nature of Swish allows it to better model complex patterns and interactions in data, potentially leading to higher representational capacity and improved learning.

Avoids Dying Neurons

Unlike ReLU, which can suffer from the “dying ReLU” problem where neurons become inactive, Swish allows negative inputs to contribute to the output, reducing the risk of neurons ceasing to learn.

Smoothness

The smoothness of Swish can lead to better optimization characteristics, as it avoids the sharp transitions present in functions like ReLU. This can result in more effective gradient propagation and faster convergence.

Disadvantages

Computational Complexity

Swish involves computing the sigmoid function, which includes an exponential operation. This makes it computationally more intensive than ReLU, which only requires a simple comparison operation.

Hyperparameter Sensitivity

The parameter β introduces additional complexity. While setting β=1 simplifies the function, tuning β can lead to better performance but requires careful hyperparameter optimization.

Vanishing Gradient Risk

For large negative inputs, the sigmoid component of Swish can cause the gradient to become very small, potentially leading to vanishing gradient issues in very deep networks.

Applications

Deep Neural Networks

Swish has been applied successfully in various deep learning models, including:

  • Convolutional Neural Networks (CNNs): For image recognition tasks where deep architectures are common.
  • Recurrent Neural Networks (RNNs): In models requiring the handling of sequential data.
  • Transformer Models: As an activation function in attention mechanisms and feedforward networks.

Large-Scale Models

Swish is particularly beneficial in large-scale models with many layers, where its properties can help alleviate issues like vanishing gradients and improve overall performance.

Implementation Considerations

Computational Optimization

To mitigate the increased computational cost, optimized implementations of Swish can leverage hardware acceleration and efficient mathematical libraries. Techniques such as approximation of the sigmoid function or using lookup tables can also be employed.

Parameter β

  • Fixed β: Setting β=1 simplifies the function and reduces the need for hyperparameter tuning.
  • Trainable β: Allowing β to be learned during training can adapt the activation function to the data, potentially improving performance but adding complexity.

Framework Support

Deep learning frameworks like TensorFlow and PyTorch support Swish and provide built-in functions, making it easier to integrate into models.
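As an illustrative sketch (assuming a recent PyTorch version, where Swish with β = 1 is exposed as torch.nn.SiLU; TensorFlow similarly provides tf.keras.activations.swish; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# A small feedforward block using Swish (SiLU) as the hidden activation.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.SiLU(),          # Swish activation: x * sigmoid(x)
    nn.Linear(64, 10),
)

x = torch.randn(32, 128)
print(model(x).shape)   # torch.Size([32, 10])
```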

Theoretical Insights

Automatic Search for Activation Functions

Swish was discovered through an automated search using reinforcement learning to identify novel activation functions that improve model performance (Ramachandran et al., 2017). This approach underscores the potential for discovering new functions that outperform traditional choices.

Mathematical Relationship

Swish can be seen as a smooth blend between linear and nonlinear activation behaviors, combining aspects of identity and sigmoid functions. This allows it to adaptively gate information, enhancing the network’s ability to model complex patterns.

Empirical Performance

Studies have shown that Swish can lead to consistent performance gains over ReLU and other activation functions in various tasks:

  • Image Classification: Improved accuracy on benchmarks like ImageNet.
  • Natural Language Processing: Enhanced performance in models for language understanding and generation.
  • Generative Models: Better convergence and sample quality in generative adversarial networks (GANs).

Ramachandran et al. (2017) reported that Swish outperformed ReLU in deep networks, suggesting that its properties are particularly advantageous in complex models.

Comparison with Other Activation Functions

ReLU

  • Advantages over ReLU: Swish avoids the dying neuron problem and provides smooth gradients.
  • Disadvantages: Swish is computationally more intensive.

Sigmoid and Tanh

  • Advantages: Swish mitigates the vanishing gradient problem associated with sigmoid and tanh in deep networks.
  • Non-Monotonicity: Swish’s non-monotonicity offers benefits not present in sigmoid or tanh functions.

Conclusion

The Swish activation function represents a significant advancement in neural network activation functions, offering performance improvements over traditional functions like ReLU. Its unique combination of smoothness, non-monotonicity, and the ability to retain negative inputs enhances the learning capacity of deep neural networks. While it introduces additional computational complexity, the benefits in model performance and convergence make Swish a valuable tool in modern deep learning applications.

Practitioners should consider the trade-offs between computational cost and performance gains when deciding to implement Swish in their models. Future research may explore further optimizations and applications of Swish, as well as the discovery of new activation functions through automated search methods.

Softplus

Introduction

The Softplus activation function is a smooth approximation of the Rectified Linear Unit (ReLU) function. Introduced by Dugas et al. (2001), it offers the benefits of ReLU while maintaining differentiability everywhere, which can be advantageous in certain neural network architectures. The Softplus function enables the modeling of complex relationships by introducing non-linearity and is particularly useful in deep learning models where smooth activation functions are preferred.

Mathematical Definition

The Softplus function is defined as:

f(x) = ln(1 + e^x)

where x is the input to the neuron.

Derivative

The derivative of the Softplus function is the logistic sigmoid function:

f′(x) = 1 / (1 + e^(−x)) = σ(x)

This relationship simplifies the computation of gradients during backpropagation.
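A small NumPy sketch checking numerically that the Softplus gradient matches the sigmoid (the test point x = 0.7 is arbitrary):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))      # ln(1 + e^x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Finite-difference check that d/dx softplus(x) equals sigmoid(x).
x, eps = 0.7, 1e-6
finite_diff = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)
print(finite_diff, sigmoid(x))      # both approximately 0.668
```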

Properties

  • Range: (0,∞)
  • Differentiability: Infinitely differentiable; smooth everywhere.
  • Non-Linearity: Introduces non-linear characteristics, allowing the network to model complex patterns.
  • Monotonicity: The function is monotonically increasing.
  • Approximation to ReLU: As x approaches large positive values, Softplus behaves like ReLU; for large negative values, it approaches zero smoothly.

Advantages

Smoothness

The Softplus function is smooth and differentiable at all points, including at x = 0. This contrasts with ReLU, which is not differentiable at zero. The smoothness can lead to more stable and reliable training in some neural networks.

Avoids Dying Neurons

Since the function outputs a small positive value even for large negative inputs, it mitigates the “dying ReLU” problem, where neurons become inactive and stop learning because they output zero and have zero gradients.

Theoretical Benefits

The smooth nature of Softplus allows for the use of optimization techniques that require higher-order derivatives. It can be advantageous in models where the activation function’s smoothness affects convergence and performance.

Disadvantages

Computational Complexity

The Softplus function involves computing the natural logarithm and the exponential function, which are computationally more intensive than the simple max operation in ReLU. This can slow down training, especially in large-scale neural networks.

Less Sparse Activation

Unlike ReLU, which outputs exact zeros for negative inputs leading to sparse activations, Softplus outputs small positive values. This can result in less sparsity in the network, potentially affecting model generalization and increasing computational load.

Vanishing Gradient Problem

For large negative inputs, the gradient of Softplus approaches zero, similar to the sigmoid function. This can cause the vanishing gradient problem in deep networks, where early layers learn very slowly.

Applications

Deep Learning Models

  • Variational Autoencoders (VAEs): Softplus is often used in VAEs for modeling latent variables that are strictly positive.
  • Probabilistic Models: In networks where outputs represent rates or variances, Softplus ensures positive values, which are necessary in certain probabilistic contexts.
  • Regression Tasks: Suitable for regression problems where the output variable is positive and smooth activation functions are preferred.

Comparison with ReLU

Softplus serves as an alternative to ReLU when a smooth activation function is desired. It retains many of ReLU’s benefits while providing differentiability everywhere.

Implementation Considerations

Numerical Stability

To prevent numerical overflow when x is a large positive number, implementations often use the following numerically stable formula:

Softplus(x) = max(x, 0) + ln(1 + e^(−|x|))

This reformulation ensures that the exponential is never evaluated at a large positive argument and therefore does not overflow for large x.
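A minimal NumPy sketch of this stable formulation, compared against the naive definition:

```python
import numpy as np

def softplus_naive(x):
    return np.log1p(np.exp(x))                    # overflows for large positive x

def softplus_stable(x):
    # max(x, 0) + ln(1 + e^(-|x|)): the exponent is never positive, so no overflow.
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(softplus_stable(x))   # approximately [0. 0.3133 0.6931 1.3133 1000.]
# softplus_naive(x) would overflow and return inf at x = 1000
```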

Optimization

Due to the additional computational complexity, it is essential to optimize the implementation of Softplus using efficient mathematical libraries and leveraging hardware acceleration when available.

Framework Support

Popular deep learning frameworks like TensorFlow and PyTorch include optimized implementations of the Softplus function, facilitating its integration into neural network models.

Theoretical Insights

Relationship to Sigmoid Function

The derivative of the Softplus function is the sigmoid function:

d/dx ln(1 + e^x) = e^x / (1 + e^x) = σ(x)

This relationship indicates that Softplus integrates the properties of the sigmoid function into its gradient, influencing how weights are updated during training.

Probabilistic Interpretation

Softplus can be interpreted in the context of probabilistic models, such as in the activation functions of probabilistic neural networks, where outputs represent parameters of probability distributions requiring positive values.

Empirical Performance

Studies comparing Softplus and ReLU have shown mixed results, with ReLU often outperforming Softplus in terms of training speed and model accuracy due to its computational simplicity and sparsity. However, Softplus may provide advantages in specific contexts where smoothness and differentiability are critical.

For instance, Glorot et al. (2011) observed that while ReLU accelerates convergence in deep networks, Softplus can be beneficial in models sensitive to activation function smoothness.

Conclusion

The Softplus activation function offers a smooth and differentiable alternative to ReLU, with theoretical benefits in terms of gradient computation and avoidance of dying neurons. While it introduces additional computational overhead and may suffer from vanishing gradients for large negative inputs, it is valuable in specific neural network architectures and applications requiring smooth activation functions.

Understanding the trade-offs between Softplus and other activation functions allows practitioners to select the most appropriate function for their specific application, balancing computational efficiency, training stability, and model performance.

SoftMax

Introduction

The Softmax activation function is a fundamental component in neural network architectures, particularly in multiclass classification problems. It extends the logistic function to multiple dimensions, transforming a vector of arbitrary real-valued scores into a probability distribution over predicted output classes. By ensuring that the output probabilities sum to one, Softmax facilitates the interpretation of outputs as probabilities, which is essential in many machine learning applications.

Mathematical Definition

Given an input vector z = [z1, z2, …, zK] representing the non-normalized log probabilities (also known as logits) for each of the K classes, the Softmax function σ(z) is defined as:

σ(z)i = e^zi / ∑j=1 to K e^zj

This formula computes the exponential of each zi and normalizes it by the sum of all the exponentials, resulting in a vector of probabilities.
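A minimal NumPy sketch of this definition for a single logit vector (the example logits are arbitrary):

```python
import numpy as np

def softmax(z):
    """Softmax over a vector of logits: exponentiate, then normalize to sum to 1."""
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```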

Properties

Range: Each σ(z)i ∈ (0,1).

Sum to One: ∑i=1 to K σ(z)i = 1, ensuring a valid probability distribution.

Differentiability: The Softmax function is differentiable, allowing for gradient-based optimization methods in training neural networks.

Non-Linearity: Introduces non-linearity, enabling the network to capture complex patterns.

Derivatives

The derivative of the Softmax function is essential for backpropagation in neural networks. The partial derivative of σ(z)i with respect to zj is:

∂σ(z)i / ∂zj = σ(z)i (δij − σ(z)j)

where δij is the Kronecker delta, equal to 1 if i = j and 0 otherwise.

Advantages

  • Probability Interpretation: Outputs can be directly interpreted as probabilities, which is useful for probabilistic models and decision-making processes.
  • Multiclass Classification: Facilitates the modeling of multiclass classification problems by converting logits into a probability distribution over classes.
  • Smooth Gradient: The differentiable nature of Softmax allows for efficient gradient-based optimization algorithms.

Disadvantages

  • Computational Complexity: Involves exponential calculations and normalization, which can be computationally intensive for large K.
  • Numerical Stability: Exponential functions can lead to numerical overflow or underflow. Careful implementation, such as subtracting the maximum logit zmax from each zi before exponentiating, is required:

σ(z)i = e^(zi − zmax) / ∑j=1 to K e^(zj − zmax)

Applications

Multiclass Classification

Softmax is predominantly used in the output layer of neural networks designed for multiclass classification tasks, where the goal is to assign an input to one of K possible classes. Examples include:

  • Image Recognition: Classifying images into categories, such as identifying objects or scenes in computer vision.
  • Natural Language Processing (NLP): Tasks like language modeling, part-of-speech tagging, and text classification.
  • Speech Recognition: Deciphering spoken words and phrases into text or commands.

Probabilistic Models

Softmax enables the use of probabilistic interpretations in models, allowing for:

  • Uncertainty Quantification: Estimating the confidence of predictions.
  • Decision Making: Facilitating threshold-based or probabilistic decision rules in applications like medical diagnosis or autonomous systems.

Implementation Considerations

Numerical Stability

To prevent numerical issues due to large exponentials, implement the Softmax function using the log-sum-exp (max-subtraction) trick:

σ(z)i = e^(zi − zmax) / ∑j=1 to K e^(zj − zmax)

where zmax = max over j of zj.
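A short NumPy sketch of the max-subtraction form; the extreme logits are chosen to show that the naive version would overflow while this one does not:

```python
import numpy as np

def softmax_stable(z):
    # Subtract the maximum logit before exponentiating (log-sum-exp trick);
    # this leaves the result unchanged but prevents overflow.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

logits = np.array([1000.0, 999.0, 0.0])
print(softmax_stable(logits))   # approximately [0.731 0.269 0.] -- no overflow despite huge logits
```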

Integration with Loss Functions

Softmax is often combined with the cross-entropy loss function for training classification models. The cross-entropy loss measures the difference between the predicted probability distribution and the true distribution.

Optimization

Efficient implementations leverage vectorization and parallel computing to handle high-dimensional data and large numbers of classes.

Theoretical Insights

Relation to Logistic Regression

Softmax generalizes the logistic sigmoid function used in binary classification to multiple classes. In logistic regression, the sigmoid function models the probability of a single class, whereas Softmax handles multiple classes simultaneously.

Entropy and Information Theory

Softmax outputs can be interpreted in the context of entropy and information theory, where the distribution over classes reflects the uncertainty or information content of the predictions.

Empirical Performance

Studies have shown that Softmax activation, when used in conjunction with appropriate architectures and regularization techniques, leads to robust and accurate models in various domains (Goodfellow et al., 2016).

Conclusion

The Softmax activation function is a critical tool in the design of neural networks for multiclass classification problems. Its ability to produce a probability distribution over classes makes it indispensable in applications where probabilistic interpretation is essential. While it introduces computational challenges, careful implementation and optimization can mitigate these issues, allowing Softmax to contribute significantly to the performance of complex neural network models.

Acknowledgments -

  1. Krish Naik — https://www.youtube.com/user/krishnaik06/featured
  2. https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
  3. https://en.wikipedia.org/wiki/Activation_function

References -

  • Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv preprint arXiv:1710.05941.
  • Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Networks, 107, 3–11.
  • Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  • Dugas, C., Bengio, Y., Bélisle, F., Nadeau, C., & Garcia, R. (2001). Incorporating Second-Order Functional Knowledge for Better Option Pricing. In Advances in Neural Information Processing Systems (pp. 472–478).
  • Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 315–323).
  • Nwankpa, C., Ijomah, W., Gachagan, A., & Marshall, S. (2018). Activation Functions: Comparison of Trends in Practice and Research for Deep Learning. arXiv preprint arXiv:1811.03378.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Bridle, J. S. (1990). Training Stochastic Model Recognition Algorithms as Networks Can Lead to Maximum Mutual Information Estimation of Parameters. In Advances in Neural Information Processing Systems (pp. 211–217).

Reach me at —

Email — tejasta@gmail.com

LinkedIn — https://www.linkedin.com/in/tejasta/

Thanks for reading!
