ReLU: A Short Overview of the Most Popular Activation Function

Pranam Shetty
4 min read · Apr 5, 2024


Among the most commonly used activation functions are the hyperbolic tangent (tanh), sigmoid, and rectified linear unit (ReLU), each with its own advantages and applications. This analysis aims to compare these functions in terms of their gradient distribution, computational efficiency, and impact on network training dynamics.

Top Activation Functions at a Glance

Rectified Linear Unit (ReLU):

ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero.
The equation is:
f(x) = max(wx + b, 0)

In other words, ReLU is a linear function that is clipped to zero below the origin. What happens if you sum a bunch of these together? Where two units are both in their linear region, their sum is a line with a new slope; where one unit outputs zero, the sum is just the other line unchanged. So a sum of ReLU units is a piecewise linear function: a set of connected line segments (hyperplanes in higher dimensions) with different slopes, which is what lets a network approximate curved functions.
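As a quick, minimal sketch of this idea (the weights and biases below are made up purely for illustration), summing a few ReLU units in NumPy produces exactly such a piecewise linear curve:

```python
import numpy as np

def relu(z):
    # Elementwise max(z, 0)
    return np.maximum(z, 0.0)

x = np.linspace(-3, 3, 7)

# Three hypothetical "neurons" with different weights w and biases b
w = np.array([1.0, -2.0, 0.5])
b = np.array([0.0, 1.0, -0.5])

# Each row is relu(w_i * x + b_i); summing the rows gives one
# piecewise linear function of x, with kinks where w_i * x + b_i = 0
hidden = relu(np.outer(w, x) + b[:, None])
output = hidden.sum(axis=0)

print(output)
```

Plotting `output` against `x` shows straight segments joined at the points where each unit switches on or off.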

This simplicity offers several advantages:

  1. Computational Efficiency:
    ReLU’s forward pass and gradient computation are extremely fast, as the derivative is either 0 (for negative inputs) or 1 (for positive inputs). This efficiency is a significant advantage in deep networks with many layers.
  2. Gradient Propagation:
    ReLU facilitates effective gradient propagation on the activated nodes, allowing for deeper networks without the severe vanishing gradients problem seen in tanh and sigmoid functions. This is because ReLU does not saturate in the positive domain, unlike tanh and sigmoid, which have regions where their derivatives are almost zero, leading to vanishing gradients.
  3. Sparsity and Regularization:
    ReLU inherently introduces sparsity in the network’s activations, as some neurons output zero. This sparsity can be beneficial for the model by introducing natural regularization. Additionally, the use of L1 or L2 regularization with ReLU can further mitigate the potential issues of dying ReLUs (where neurons permanently output zeros) and exploding weights, improving the network’s generalization capabilities.
  4. Enhancements and Alternatives:
    1. Leaky ReLU: To address the problem of dying ReLUs, Leaky ReLU replaces the hard zero with a small slope (for example, 0.01) for negative inputs. This modification retains most of ReLU’s benefits while keeping a non-zero gradient flowing through the network even for negative inputs (see the sketch after this list).
    2. Normalization and Non-Linearity: It’s crucial to note that while ReLU and its variants may appear linear at first glance, they introduce the non-linearity essential for deep learning models. This non-linearity is especially evident when data is normalized around zero, since the activation behaves differently depending on the sign of its input.
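To make the sparsity and dying-ReLU points concrete, here is a minimal sketch comparing ReLU and Leaky ReLU and their gradients (the 0.01 slope is the commonly used default, but it is a tunable hyperparameter):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Derivative is 0 for negative inputs, 1 for positive inputs
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # A small slope alpha keeps negative inputs from being zeroed out entirely
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))             # zeros for all non-positive inputs -> sparse activations
print(relu_grad(z))        # 0 for negative inputs -> a unit stuck there learns nothing
print(leaky_relu(z))       # small negative values instead of zeros
print(leaky_relu_grad(z))  # gradient is alpha, not 0, so the unit can recover
```

The gradient comparison is the whole story of the dying-ReLU fix: where plain ReLU’s gradient is exactly zero, Leaky ReLU’s is a small constant, so weight updates can still reach the unit.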

Now let’s take a look at how some of the other activation functions compare with ReLU.

Hyperbolic Tangent (tanh) vs. Sigmoid

Both tanh and sigmoid functions are s-shaped curves but differ in their output ranges. The tanh function outputs values between -1 and 1, making it zero-centered, which helps distribute gradient updates more evenly across the weights. Its derivative also peaks at 1 (versus 0.25 for sigmoid), so vanishing gradients, the common problem in deep networks where gradients shrink until training slows or stalls, set in less quickly with tanh, although both functions still saturate for large-magnitude inputs.

In contrast, the sigmoid function outputs values between 0 and 1. While it is useful where a probability-like output is needed (for example, in binary classification), it has two drawbacks: its non-zero-centered output biases weight updates in one direction, and its derivative is at most 0.25 and approaches zero outside a narrow range of inputs. In very deep networks, the product of many such small derivatives during backpropagation shrinks exponentially, significantly diminishing the network’s learning capacity.
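A small sketch of the three derivatives side by side, using their standard closed-form expressions, makes the saturation argument concrete (the values in the comments are rounded):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2  # peaks at 1.0 when z = 0

def relu_grad(z):
    return (z > 0).astype(float)  # always 0 or 1, never saturates for z > 0

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid_grad(z))  # ~[0.0066 0.1966 0.25 0.1966 0.0066]
print(tanh_grad(z))     # ~[0.0002 0.42 1.0 0.42 0.0002]
print(relu_grad(z))     # [0. 0. 0. 1. 1.]
```

Multiplying many numbers like 0.0066 together across layers is exactly how sigmoid gradients vanish, while ReLU’s gradient of 1 on active units passes through unchanged.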

📚 Resources:

  1. Andrej Karpathy’s notes for Stanford’s CS231n: things to consider when choosing activation functions
  2. Overview of the above in video form: YouTube
  3. Research paper: “A Survey on Activation Functions and their relation with Xavier and He Normal Initialization”

“Make it work, make it right, make it fast.” — Kent Beck

