Activation Functions: The Hidden Heroes of Neural Networks

Neha Purohit
9 min read · Sep 13, 2023

Last week we saw the perceptron and its importance.

The entertainment industry has undergone a captivating transformation in recent years, thanks to remarkable advancements in artificial intelligence and machine learning. Behind many of these advances sit activation functions, a crucial component of every neural network, including the models behind natural language processing tasks such as machine translation, text summarization, and sentiment analysis, where they introduce the non-linearity needed to capture complex patterns.

History of Activation Function

The hyper-parameters of a neural network are traditionally designed through a time-consuming process of trial and error that requires substantial expert knowledge. Neural Architecture Search (NAS) algorithms aim to take the human out of the loop by automatically finding a good set of hyper-parameters for the problem at hand.

These algorithms have mostly focused on hyper-parameters such as the architectural configuration of the hidden layers and the connectivity of the hidden neurons, but there has been relatively little work on automating the search for completely new activation functions, even though the activation function is one of the most crucial hyper-parameters to choose. The widely used activation functions of today are simple and work well, but there has nonetheless been interest in finding better ones.

INTRODUCTION

In a neural network, the output of a neuron is the result of a linear function followed by an arbitrary function. The linear function is a weighted sum of the inputs, and the arbitrary function is the activation function. The activation function introduces non-linearity into the network, which is essential for learning complex patterns.
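As a minimal sketch (with made-up weights and tanh standing in for the arbitrary activation), a single neuron computes activation(w · x + b):

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])   # inputs to the neuron
w = np.array([0.8, 0.1, -0.4])   # learned weights
b = 0.2                          # learned bias

z = np.dot(w, x) + b             # linear part: weighted sum of the inputs plus bias
a = np.tanh(z)                   # activation: non-linear transformation of z
print(z, a)                      # roughly -0.72 and -0.62
```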

A neural network with only linear activation functions is only capable of learning linear relationships. This is because the composition of linear functions is still linear. To learn non-linear relationships, we need to use non-linear activation functions.

The Universal Approximation Theorem states that a neural network with one hidden layer and a bounded, non-constant, continuous activation function can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons. This means that, in principle, a network with a single hidden layer can represent the input-output mapping of essentially any continuous task.

The specific properties that make a good activation function are not fully understood. However, some properties that have been shown to be beneficial include:

  • Monotonicity: The activation function should be monotonically increasing or decreasing. This helps the network converge during training.
  • Continuous differentiability: The activation function should be differentiable everywhere, so the network can be trained with backpropagation.
  • Zero-centeredness: The activation function's outputs should be centered around zero. This can help improve the stability of training.
  • Non-saturating: The activation function should not flatten out to a constant value for large inputs; otherwise its gradient shrinks toward zero and the network stops learning.

The rectified linear unit (ReLU) is a popular activation function that satisfies many of these properties. It is a simple function that outputs the input directly if it is positive, and zero otherwise. ReLU has been shown to be effective for training deep neural networks.

There has been some research in designing new activation functions. However, most of the research has focused on hand-designed activation functions. There is an opportunity to develop new activation functions using machine learning techniques. This could lead to the discovery of activation functions that are even more effective than the ones that are currently used.

(Image source: cMelGAN: An Efficient Conditional Generative Model Based on Mel Spectrograms)

Work in the literature has mostly focused on designing new activation functions by hand, or on choosing from a set of predefined functions, while more recent work presents evolutionary algorithms that automate the search for completely new activation functions and compares the evolved functions to existing, commonly used ones.

In this blog post, let’s explore this safari of activation functions and discuss why they have become a critical component in modern machine learning applications.

The Foundation of Perceptrons was discussed last week (https://medium.com/@neha.purohit.ai/the-future-of-neural-networks-may-lie-in-the-co-existence-of-neurons-and-perceptrons-d9cd0dfdd130)

The perceptron’s inability to solve complex tasks that require nonlinear decision boundaries led to the development of more versatile neural network architectures.

Expanding Horizons: Multilayer Neural Networks

Researchers developed multilayer neural networks, also known as feedforward neural networks, which introduce hidden layers and activation functions and can capture nonlinear relationships within data.

Key features of multilayer neural networks:

  • Multiple hidden layers.
  • Nonlinear activation functions (e.g., Sigmoid, Tanh, ReLU, Leaky ReLU, Softmax).
  • Improved ability to handle complex tasks.

Why do Neural Networks Need an Activation Function?

To add non-linearity to the neural network. Let’s now dive into the individual activation functions:

Binary Step Function:

This activation function is the most basic one, and it is the first that comes to mind whenever we try to bound an output. It is essentially a threshold-based classifier: we choose a threshold value, and the neuron is activated or deactivated depending on whether its input crosses that threshold.

Mathematical expression: f(x) = 1 if x ≥ 0, and f(x) = 0 otherwise (with 0 acting as the threshold).
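A minimal sketch of the binary step, assuming a threshold of 0 and using NumPy to handle vectorized inputs:

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Return 1 where the input reaches the threshold, 0 otherwise."""
    return np.where(x >= threshold, 1.0, 0.0)

print(binary_step(np.array([-2.0, -0.5, 0.0, 0.7, 3.0])))
# [0. 0. 1. 1. 1.]
```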

Limitations of binary step function:

Non-differentiability: Binary step functions are non-differentiable at the threshold point, making them unsuitable for gradient-based optimization.

Lack of Information: They can’t convey information about input intensity or magnitude, leading to information loss.

Vanishing Gradients: The gradient of the step function is zero everywhere (and undefined at the threshold), so no gradient signal flows back through a deep network, hindering learning.

Linear Activation Function:

The linear (or identity) activation function simply passes its input through, scaled by a constant. It is mathematically represented as: f(x) = ax, or simply f(x) = x when a = 1.

Limitations: The activation function resides within the hidden layers and must not be linear. Regardless of how intricate the architecture may be, a network built only from linear activations collapses into a single linear layer, because the composition of linear functions is itself linear. Moreover, real-world problems are inherently highly non-linear. The one scenario where a linear activation is useful is the output layer of regression tasks, such as predicting future sales.
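A quick sketch (with randomly generated NumPy weights) shows why stacking linear layers gains nothing: two linear layers collapse into a single equivalent one.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # a batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))   # weights of the first linear layer
W2 = rng.normal(size=(5, 2))   # weights of the second linear layer

two_layers = x @ W1 @ W2       # output of two stacked linear layers
one_layer = x @ (W1 @ W2)      # a single layer using the combined weights

print(np.allclose(two_layers, one_layer))  # True: the composition is still linear
```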

Non-Linear Activation Functions

Sigmoid Activation Function:

The sigmoid activation function, also popularly known as the logistic function, is a common non-linear activation function. It is named “sigmoid” because its graph is S-shaped; it maps any real value to a value between 0 and 1, which makes it suitable for binary classification problems and for producing probabilities.

Sigmoid’s formula is σ(x) = 1/(1+exp(-x)) where x is any real value.

Graphically, sigmoid is an S-shaped curve that flattens toward 0 for large negative inputs and toward 1 for large positive inputs.
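A minimal NumPy sketch of the sigmoid and its derivative; the derivative σ(x)(1 − σ(x)) shrinks toward zero for large |x|, which is behind the vanishing-gradient issue discussed next:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; it approaches 0 for large |x| (saturation)."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))       # close to 0 at the left end, 0.5 at zero, close to 1 at the right
print(sigmoid_grad(x))  # nearly 0 at both extremes, peaks at 0.25 for x = 0
```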

Fundamentally, although multilayer neural networks paved the way for significant progress across machine learning applications, they came with their own challenges, such as vanishing gradients in deep networks.

Due to the vanishing gradient problem and other limitations such as saturation and a non-zero-centered output, sigmoid activations were largely replaced by the hyperbolic tangent (tanh), the rectified linear unit (ReLU), and its variants in many modern neural network architectures. However, sigmoid activations are still used in the output layer of binary classification models, where the goal is to produce a probability for two classes.

The hyperbolic tangent (tanh) activation function was introduced as an alternative to the sigmoid activation function to address its limitations. Because sigmoid’s output is always positive, the weight updates within a layer all share the same sign, causing unbalanced, zig-zagging updates and slower convergence during training. In contrast, tanh provides a zero-centered output range of (-1, 1).

Tanh is similar in shape to the sigmoid function; it’s easy to swap out sigmoid activations with tanh activations in existing models without significantly altering the network’s behavior. This similarity allowed researchers and practitioners to transition from sigmoid to tanh activations relatively smoothly. The tanh function is symmetric around the origin (0, 0) which is beneficial in neural networks. It allows positive and negative signals to be propagated effectively through the network, enabling the network to learn more balanced representations of data.

Tanh is represented as: tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x)), which is equivalent to 2σ(2x) − 1.
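A small sketch comparing tanh with sigmoid (using NumPy’s built-in np.tanh) highlights the zero-centered output range:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.tanh(x))    # symmetric around 0, outputs in (-1, 1)
print(sigmoid(x))    # always positive, outputs in (0, 1)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True: tanh is a rescaled, shifted sigmoid
```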

The disadvantages of the hyperbolic tangent (tanh) activation function (it still saturates for large inputs, so gradients still vanish in deep networks), along with those of the sigmoid activation function outlined above, paved the way for widespread adoption of the rectified linear unit (ReLU). Here’s how the limitations of tanh and sigmoid led to the emergence of ReLU:

ReLU was introduced as a remedy for the vanishing gradient problem. Unlike tanh, ReLU is piecewise linear and does not saturate for positive inputs. Consequently, it avoids the vanishing gradient issue for positive values, and its non-saturating behavior there allows deep networks to train faster and more effectively.

ReLU’s formula is: f(x) = max(0, x).

If the function receives a negative input, it returns 0; if it receives any positive value x, it returns that value unchanged. As a result, the output ranges from 0 to infinity.
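A minimal NumPy sketch of ReLU:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: passes positive values through, zeroes out the rest."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.1, 0.0, 0.5, 4.0])
print(relu(x))  # [0.  0.  0.  0.5 4. ]
```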

The ReLU function has some flaws of its own, such as exploding gradients. This is the polar opposite of the vanishing gradient: it occurs when large errors accumulate during training, resulting in massive updates to the model weights. The model becomes unstable as a result and is unable to learn from the training data. In addition, ReLU’s gradient is 0 for all inputs less than zero, which deactivates the neurons in that region and may cause the dying ReLU problem.

Leaky ReLU was defined to address this problem. Instead of outputting 0 for negative input values x, it outputs an extremely small linear component of x.

Leaky ReLU is represented as: f(x) = x if x > 0, and f(x) = αx otherwise, where α is a small constant such as 0.01.
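A minimal sketch, assuming the commonly used slope of 0.01 for negative inputs:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope so their gradient never dies."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))  # [-0.03  -0.005  0.     2.   ]
```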

Another activation function, used for multi-class classification tasks, is the softmax function, whose goal is to compute class probabilities across multiple classes in the output layer. It takes a vector of real-valued scores and transforms it into a probability distribution over the classes, where each output represents the likelihood that the corresponding class is the correct one. Softmax is used primarily in the output layer of a neural network for tasks such as image classification, natural language processing, and other multi-class classification problems, converting the network’s raw scores into class probabilities.

(Image source: Sentence-Level Classification Using Parallel Fuzzy Deep Learning Classifier)

The formula for Softmax is: softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ), computed for each score xᵢ in the input vector.
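A minimal, numerically stable NumPy sketch (subtracting the maximum score before exponentiating is a standard trick to avoid overflow):

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw scores (logits) into a probability distribution."""
    shifted = scores - np.max(scores)   # improves numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # roughly [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```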

The transition from traditional activation functions to attention mechanisms in neural networks represents an evolution driven by the need to handle more complex and context-aware tasks.

Activation functions like sigmoid, tanh, ReLU, etc., are applied element-wise within a layer to transform the weighted sum of inputs into an output. These activations determine whether, and how strongly, each neuron fires for a given input.

A paradigm shift in the field of machine learning has been brought about by the introduction of optimizers for minimizing the loss function.

More on Optimizers next week.

If you enjoy reading stories like these and want to support my writing, please consider following and liking. I’ll cover most deep learning topics in this series.

