Activation Functions: Coffee for neurons☕🧠 (Part 1)

Aditya Jethani
9 min read · Mar 27, 2024

1. Introduction

Think of neurons, the basic units of the nervous system, as sleepy coffee drinkers. They take in different inputs, some strong, some weak, just like different strengths of coffee. But just as a person needs a jolt of caffeine to truly wake up, a neuron needs its activation function to decide when to "fire" and pass its signal along.

Why “Coffee”? The Need for Nonlinearity
Why "Coffee"? The Need for Nonlinearity
Picture this: you're trying to teach a robot how to recognize a cat. Without activation functions, the robot would just see a bunch of pixel values, a jumble of numbers. Just as coffee lets us see the world in a more nuanced way, activation functions transform those raw pixel values into meaningful patterns. They introduce something scientists call "nonlinearity."

But there is a catch. If a neural network were nothing but layers multiplied and added together, it would be no more powerful than your average calculator: stacked linear operations collapse into one linear transformation. The magic happens because of the activation function. This little piece of math injects "non-linearity" into the network, letting straight lines bend into curves and making it possible to model all kinds of complex patterns that straight lines alone never could.
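To see this collapse in action, here is a minimal NumPy sketch (my own illustration, not from the original post): two stacked layers with no activation are exactly equivalent to a single linear layer, and inserting a ReLU breaks that equivalence.

```python
import numpy as np

# Two "layers" with no activation collapse into one linear map.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # a batch of 4 inputs with 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

two_layers = (x @ W1 + b1) @ W2 + b2             # layer 1 then layer 2, no activation
one_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)       # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))        # True: the extra layer added nothing

# Insert a non-linearity (here ReLU) and the equivalence breaks.
relu = lambda z: np.maximum(0, z)
with_activation = relu(x @ W1 + b1) @ W2 + b2
print(np.allclose(with_activation, one_layer))   # False: the network can now curve
```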

Activation functions allow our neural network to bend and twist its decision boundaries, morphing them to fit the complexities of the real world. Suddenly, instead of a single ‘cat vs. dog’ line, we have an intricate web of curves, perfectly tailored to recognize furry friends of all shapes and sizes.

Let’s make it concrete. Say you’re feeding images of cats and dogs into your robot. Ideally, you want ‘cat’ neurons to fire strongly when a cat picture comes in, and stay silent when it’s a dog. Activation functions make this possible, allowing the robot to learn complex distinctions far beyond what simple calculations could achieve. The most commonly used activation functions are: Sigmoid σ(x), Tanh, ReLU, Leaky ReLU, and Softmax.

Let’s dive into these functions one by one. We’ll see how their unique shapes make them suitable for different problems, and we’ll play with real-world datasets to see how they behave.

2. Description

Activation functions fall mainly into three classes: binary step, linear, and non-linear. Depending on the problem statement, the type of dataset, and the desired output, you can choose among them to fine-tune your neural network architecture.

2.1 Binary Step Function:

The binary step function acts like a strict gatekeeper for the neuron: a light switch with a single setting. It compares the input to a predefined threshold value. If the input surpasses that limit, the neuron gets the green light to fire its signal; otherwise, the signal gets stopped in its tracks.

Binary Step Function (source: v7labs)
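As a quick illustration, here is a minimal NumPy sketch of that gatekeeper behaviour (the threshold of 0 is an assumed default; any cut-off works):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Fire (1) if the input reaches the threshold, otherwise stay silent (0)."""
    return np.where(x >= threshold, 1, 0)

print(binary_step(np.array([-2.0, -0.1, 0.0, 0.5, 3.0])))  # [0 0 1 1 1]
```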

Use cases:

  • Simplified Models: In scenarios where computational efficiency is paramount and the network architecture is deliberately simple, the binary step function can be used to create very basic decision-making neurons.
  • Control Systems: In certain control systems, where clear on/off states are needed, binary step functions can be used to trigger actions based on thresholds.

Advantages

  • Extremely Simple: Its on/off nature makes it incredibly easy to compute.
  • Interpretability: The threshold-based behavior provides very clear decision points.

Disadvantages

  • Vanishing Gradients: The gradient of the binary step function is zero almost everywhere. This makes it impossible to train using traditional gradient-based learning algorithms like backpropagation, severely limiting its use in modern neural networks.
  • Not Differentiable: The sharp discontinuity at the threshold prevents the calculation of derivatives.

2.2 Linear Activation Function:

The linear activation function is the most straightforward one in the neural network toolbox. It’s like a direct pipeline — the input goes in, and the exact same value comes out the other side. There’s no fancy transformation or squashing involved.

Math:

f(x)=x

A simple f(x) = x line represents a linear activation (source: v7labs)
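For completeness, a tiny NumPy sketch of the identity mapping and its constant derivative, which is exactly what the disadvantages below hinge on:

```python
import numpy as np

def linear(x):
    """Identity: the input passes straight through."""
    return x

def linear_derivative(x):
    """The gradient is 1 everywhere, regardless of the input."""
    return np.ones_like(x)

x = np.array([-3.0, 0.0, 4.2])
print(linear(x))             # [-3.   0.   4.2]
print(linear_derivative(x))  # [1. 1. 1.]
```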

Use cases:

  • Output Layers for Regression: Sometimes, especially when predicting continuous values (house prices, stock trends), you want the raw output of your network. A linear activation at the output layer makes sense in these cases.

Advantages:

  • Initializing Networks: Sometimes, neural networks are initialized with linear activation functions in certain layers. This can be a starting point, and nonlinear activations are introduced later in the training process.

Disadvantages:

  • Constant Gradient: The derivative of f(x) = x is a constant with no relation to the input x, so backpropagation cannot use it to tell the network how to adjust its weights in an input-dependent way; hidden layers with linear activations therefore fail to learn effectively.
  • No Extra Expressive Power: Stacked linear layers collapse into a single linear transformation, so the network can only ever model linear relationships.

2.3 Sigmoid:

Imagine a gentle S-shaped curve. The sigmoid function smoothly squashes any number it receives down to a value between 0 and 1.

Theory and Math:

f(x) = 1 / (1 + e^(-x))

The larger (more positive) the input, the closer the output is to 1.0; the smaller (more negative) the input, the closer the output is to 0.0.

Sigmoid or Logistic Activation Function (Source: v7labs)
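Here is a short NumPy sketch of the curve and its derivative; note how the gradient flattens out at the tails, which is the vanishing-gradient problem listed under the disadvantages:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)      # derivative peaks at 0.25 near x = 0

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.round(sigmoid(x), 4))       # [0.     0.2689 0.5    0.7311 1.    ]
print(np.round(sigmoid_grad(x), 4))  # [0.     0.1966 0.25   0.1966 0.    ]  -> flat tails = vanishing gradients
```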

Use cases:

  1. Binary classification: Excellent for problems where the output must have a probability between 0 and 1 (example: predicting whether an image is cat or dog).
  2. Early Layers in Shallow Networks: Historically, sigmoid was used throughout shallow networks; in modern deep networks its role in hidden layers has largely been taken over by ReLU-style functions because of the problems below.

Advantages:

  • Probabilistic Output: Outputs can be interpreted directly as probabilities.
  • Differentiable: This ensures a well-behaved gradient during the training process, allowing the network to learn effectively.

Disadvantages:

  • Vanishing Gradients: When input values are very large or very small, the curve becomes almost flat. In these regions, the gradient is close to zero, making updates to the network’s weights extremely slow and hindering learning, especially in deep networks.
  • Not Zero-Centered: The output of the logistic function is not symmetric around zero, so all neuron outputs share the same sign. This can slow convergence and make training more difficult and unstable in some network architectures.

2.4 Hyperbolic Tangent (Tanh):

Think of the tanh function as a more dramatic version of the sigmoid. It stretches the S-shaped curve and shifts it downwards, producing an output range between -1 and 1. Like the sigmoid, it compresses very large and very small inputs toward the limits of its range while staying highly sensitive to values near zero.

Math:

f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Tanh Function (Source: v7labs)
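A quick NumPy sketch, which also confirms that tanh is just a stretched and shifted sigmoid (tanh(x) = 2*sigmoid(2x) - 1):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)         # equivalent to (e^x - e^-x) / (e^x + e^-x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.round(tanh(x), 4))                           # [-0.9999 -0.7616  0.      0.7616  0.9999]

# tanh is a rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(np.allclose(tanh(x), 2 * sigmoid(2 * x) - 1))   # True
```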

Use cases:

  • Zero-Centered Outputs Are Important: In certain neural network designs, having outputs balanced around zero can improve convergence speed.
  • Preference Over Sigmoid (Sometimes): Because tanh can mitigate some of the vanishing gradient issues of sigmoid, it can be a better choice for hidden layers in some cases.

Advantages:

  • Zero-Centered: The output of the tanh activation function is zero-centered, so we can easily interpret output values as strongly negative, neutral, or strongly positive.
  • Stronger Gradients (than sigmoid): While gradients can still become saturated, they don’t saturate as quickly as with the sigmoid.

Disadvantages:

  • Vanishing Gradients Still Exist: The problem isn’t eliminated, just reduced compared to the sigmoid.
  • More Computationally Expensive: The calculation of tanh is slightly more involved than the sigmoid calculation.

2.5 ReLU (Rectified Linear Unit):

ReLU is less like a bouncer and more like a strict traffic officer. It has a simple rule: if the input is positive, let it pass through unchanged. If the input is negative, block it completely by setting the output to zero. This creates a characteristic ‘hinge’ shape in the graph of the function.

The main catch is that ReLU does not activate all the neurons at the same time: a neuron is deactivated only when the output of its linear transformation is less than 0.

Math:

f(x) = max(0, x)

ReLU activation function (source: v7labs)
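A minimal NumPy sketch of the traffic-officer rule and the gradient behaviour behind the advantages and disadvantages below:

```python
import numpy as np

def relu(x):
    """Pass positive inputs through unchanged, block negatives at zero."""
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 2.0, 7.0])
print(relu(x))   # [0. 0. 0. 2. 7.]
# The gradient is 1 for x > 0 and 0 for x < 0, so neurons that only ever
# see negative inputs stop updating ("dying ReLU").
```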

Use cases:

  • Hidden Layers in Deep Networks: ReLU is the go-to for hidden layers in most deep neural networks (think convolutional neural networks for image processing, recurrent neural networks for text/sequence data).
  • Computer Vision Tasks: Convolutional Neural Networks (CNNs) are the backbone of modern computer vision, from image classification to object detection.
  • Natural Language Processing (NLP): Recurrent Neural Networks (RNNs), including variants like LSTMs and GRUs, are commonly used for text-based tasks. While these architectures introduce their own mechanisms to combat vanishing gradients, ReLU can still be used in some layers depending on the architecture.

Advantages:

  • Sparsity: Generating outputs of zero introduces sparsity into the network, which can sometimes have beneficial effects in terms of representation.
  • No Vanishing Gradient (for positive inputs): The gradient is always 1 for positive inputs, ensuring consistent updates and faster learning.

Disadvantages:

  • “Dying ReLU”: Neurons with consistently negative inputs can get permanently stuck at zero output, effectively making them useless.
  • Output Isn’t Zero-Centered: This can slightly slow down convergence in some cases.

To address these disadvantages of ReLU, several variants of the linear unit were introduced (a short sketch of all of them follows this list):

  • Leaky ReLU: Instead of outputting a hard zero, Leaky ReLU gives negative inputs a small, fixed, non-zero slope (e.g. f(x) = max(0.01x, x)), which addresses the Dying ReLU issue.
Leaky ReLU (source: v7labs): See how, for negative inputs, the function outputs a small non-zero value instead of a flat zero.
  • Parametric ReLU (PReLU): PReLU takes the flexibility of Leaky ReLU a step further. Instead of a fixed slope for negative inputs, the slope itself becomes a learnable parameter during training. Mathematically:
  • f(x) = max(ax, x) // where 'a' is learned by the network
  • This allows the network to fine-tune the slope for different neurons, potentially leading to even better performance.
  • Randomized ReLU (RReLU): RReLU introduces an element of randomness to the Leaky ReLU concept. The slope for negative inputs (‘a’) is randomly sampled from a uniform distribution during training and fixed (typically to the distribution’s mean) at inference time. This can have a regularization effect, helping to prevent overfitting in some cases.
  • Exponential Linear Unit (ELU): ELU aims to push the output mean closer to zero, which can accelerate learning. It uses a smooth, curve-like function for negative inputs.
ELU (source: v7labs)
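Here is a minimal NumPy sketch of these variants (the slope of 0.01 and alpha of 1.0 are common defaults assumed for illustration):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """Small fixed slope 'a' for negative inputs instead of a hard zero."""
    return np.where(x > 0, x, a * x)

def prelu(x, a):
    """Same shape as Leaky ReLU, but 'a' is a parameter learned during training."""
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    """Smooth exponential curve for negative inputs, pushing the mean output toward zero."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))        # [-0.03 -0.01  0.    2.  ]
print(prelu(x, a=0.2))      # [-0.6 -0.2  0.   2. ]
print(np.round(elu(x), 3))  # [-0.95  -0.632  0.     2.   ]
```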

2.6 Softmax:

Imagine your neural network is a fancy coffee expert, tasked with deciding not just whether a picture is of coffee, but the exact type: latte, espresso, cappuccino. Softmax steps in as a translator, taking the raw outputs from your network and converting them into a probability distribution across all those choices.

It takes each raw output from your network (let’s call them ‘z’) and raises the base ‘e’ (the mathematical constant, roughly 2.718) to the power of that output (e^z). This makes all outputs positive and amplifies differences between them. It then sums up all those exponentiated values and divides each individual exponentiated output by this sum. As a result, each output now represents a probability between 0 and 1, and all probabilities add up to 100%.

Math:

Softmax(z_i) = e^(z_i) / (sum of e^(z_j) for all j)

Probability graph for Softmax (source: v7labs)
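A short NumPy sketch of the recipe described above; subtracting the maximum before exponentiating is a standard numerical-stability trick I've added, not something specific to the article:

```python
import numpy as np

def softmax(z):
    """Exponentiate each score, then normalise so the outputs sum to 1."""
    z = z - np.max(z)            # stability trick: shifting all scores leaves the result unchanged
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])     # raw scores for, say, latte / espresso / cappuccino
probs = softmax(logits)
print(np.round(probs, 3))              # [0.659 0.242 0.099]
print(probs.sum())                     # ~1.0
```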

Use cases:

  • Output Layer for Multi-Class Classification: When your neural network has to choose among several categories (cat, dog, bird, etc.), Softmax can be really helpful. It ensures the final outputs are valid probabilities, making your network’s prediction interpretable.

Advantages:

  • Clear Probabilities: Softmax makes your network’s confidence crystal clear. An output of 0.8 for ‘latte’ means an 80% chance that’s what’s in the picture!
  • Differentiable: Just like our other activation functions, Softmax has a smooth gradient, allowing for efficient learning through backpropagation.

Disadvantages:

  • Sensitive to Outliers: If one raw output is significantly larger than the others, Softmax might exaggerate its probability, suppressing the smaller ones.
  • Computationally a Bit Heavier: All those exponentiations can make Softmax slightly more computationally demanding than some other functions.

That wraps up the theory behind activation functions: why they matter and the scenarios where each one fits. In the next part, I will cover why neural networks are difficult to train and share tips for choosing the right activation function.
If you like this blog, don’t forget to give it a like.

🔗Connect with me:

Linkedin: Aditya Jethani

mail: jethaniaditya7@gmail.com

