Activation Functions Cheat Sheet for Beginners in Machine Learning

Jeremy Verdo
8 min read · Jun 20, 2024


What Are Activation Functions?

At its core, an activation function in a neural network decides whether a neuron should be activated or not. This function takes the input signals (weighted sum of inputs) and processes them to determine the output. In simpler terms, it’s like a gatekeeper that controls the flow of information, ensuring that only the important signals get through, while filtering out the noise.

Without activation functions, a neural network would merely be a linear regression model, incapable of capturing the intricate, non-linear relationships within the data. Activation functions introduce non-linearity into the network, allowing it to learn and perform more complex tasks.
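
To make this concrete, here is a tiny NumPy sketch (with made-up random weights, purely for illustration) showing that two stacked linear layers with no activation in between collapse into a single equivalent linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # a small batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))   # weights of a first "layer"
W2 = rng.normal(size=(5, 2))   # weights of a second "layer"

# Two linear layers applied one after the other, with no activation in between...
two_layers = x @ W1 @ W2

# ...are exactly equivalent to one linear layer whose weights are W1 @ W2.
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True: no extra expressive power gained
```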

Why Do We Use Activation Functions?

To illustrate the importance of activation functions, let’s use an analogy. Think of a neural network as a sophisticated chef creating a gourmet dish. The ingredients (input data) are mixed and processed through various stages (layers of the network), and the activation function is like the tasting process. It helps the chef decide which flavors to enhance and which to suppress, resulting in a well-balanced, delicious final product. Similarly, activation functions help the network emphasize critical features and diminish less relevant information, leading to accurate and effective outputs.

Below is a comprehensive cheat sheet covering various activation functions, their properties, and typical use cases:

1. Sigmoid / Logistic Function

The sigmoid function is a widely used activation function that transforms input values into a smooth, S-shaped curve ranging between 0 and 1. This function is particularly useful in scenarios where outputs need to be constrained within this range.

Use Cases

  • Output Layer for Binary Classification: The sigmoid function is perfect for binary classification problems where the output needs to represent a probability, effectively distinguishing between two classes.
  • Intermediate Layers in Early Neural Networks: It was commonly used in the hidden layers of early neural network architectures. However, issues like the vanishing gradient problem, output saturation, and slow convergence have led to the preference for more efficient activation functions like ReLU and its variants.

Equation

σ(x) = 1 / (1 + e^(−x))

The sigmoid function squashes input values to the range (0, 1), is differentiable, and introduces non-linearity, making it suitable for binary classification tasks.
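
A minimal NumPy sketch of the sigmoid, just to show the formula in code rather than any framework-specific API:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))  # approximately [0.0067, 0.5, 0.9933]
```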

Pros

  • Useful for probabilistic outputs.
  • Simple and widely understood.

Cons

  • Suffers from the vanishing gradient problem.
  • Not zero-centered (causes gradient issues during backpropagation).

2. Hyperbolic Tangent (Tanh) Function

The tanh function, short for hyperbolic tangent, is another popular activation function in neural networks. It transforms input values into a smooth, S-shaped curve ranging from -1 to 1, making it particularly useful in scenarios where outputs need to be centered around zero. However, the tanh function is still susceptible to the vanishing gradient problem for very high or low input values.

Use Cases

  • Hidden Layers in Neural Networks: The tanh function is often used in the hidden layers of neural networks because it provides better gradient flow than the sigmoid function, helping to mitigate the vanishing gradient problem.
  • Normalization: Since the outputs of the tanh function are centered around zero, it helps normalize the inputs to the next layer, making learning more efficient.

Equation

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

The tanh function squashes the input to the range (−1, 1), is differentiable, and introduces non-linearity.
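
A quick NumPy sketch of tanh (NumPy already ships the function, so this is purely illustrative):

```python
import numpy as np

def tanh(x):
    # Equivalent to (e^x - e^(-x)) / (e^x + e^(-x)); NumPy provides it directly.
    return np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))  # approximately [-0.964, 0.0, 0.964]
```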

Pros

  • Zero-centered (helps with backpropagation).
  • Outputs can be positive or negative.

Cons

  • Suffers from the vanishing gradient problem.

3. Rectified Linear Unit (ReLU)

The ReLU (Rectified Linear Unit) function is a widely used activation function that transforms input values by outputting the input directly if it is positive, and zero if it is negative. This function is particularly useful in scenarios where fast and efficient training is required.

Use Cases

  • Hidden Layers in Deep Neural Networks: The ReLU function is perfect for hidden layers in deep neural networks due to its simplicity and efficiency, which allows for faster and more effective training.
  • Sparse Activation: ReLU promotes sparsity in the network by deactivating neurons for negative input values, leading to improved performance and reduced computational complexity.

However, while ReLU helps mitigate the vanishing gradient problem and accelerates convergence, it can suffer from the “dying ReLU” problem: neurons that consistently receive negative inputs always output zero, contribute no gradient, and effectively become inactive. Despite this, the benefits of ReLU and its variants make them the preferred choice in most modern neural network architectures.

Equation

ReLU(x) = max(0, x)

The ReLU function has a range of [0, ∞), is differentiable except at zero, introduces non-linearity, and passes positive values through unchanged while zeroing out negatives.
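
A minimal NumPy sketch of ReLU, for illustration only:

```python
import numpy as np

def relu(x):
    # Passes positive values through unchanged and zeroes out negatives.
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))  # [0.0, 0.0, 0.0, 2.0]
```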

Pros

  • Efficient computation.
  • Mitigates vanishing gradient problem.
  • Sparsity (many neurons output zero).

Cons

  • Suffers from the dying ReLU problem (neurons can get stuck at zero).

4. Leaky ReLU

The Leaky ReLU function is a variant of the popular ReLU activation function that allows a small, non-zero output for negative input values. This function is particularly useful in scenarios where mitigating the “dying ReLU” problem is crucial.

Use Cases

  • Hidden Layers in Deep Neural Networks: The Leaky ReLU function is perfect for hidden layers in deep neural networks as it prevents neurons from becoming inactive by allowing a small gradient for negative inputs.
  • Improved Gradient Flow: By addressing the issue of zero gradients for negative inputs, Leaky ReLU ensures better gradient flow during backpropagation, leading to more efficient and robust training.

Leaky ReLU improves upon the traditional ReLU by reducing the likelihood of inactive neurons while retaining the benefits of fast and effective training, making it a preferred choice in many modern neural network architectures.

Equation

LeakyReLU(x) = x if x > 0, otherwise αx

The Leaky ReLU function has a range of (−∞, ∞), is differentiable except at zero, introduces non-linearity, and allows a small, non-zero gradient for negative input values. The slope α is a small positive number, typically set to 0.01.
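
A small NumPy sketch of Leaky ReLU, with alpha exposed as a parameter (0.01 here is simply the conventional default mentioned above):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Keeps positive values unchanged and scales negatives by a small slope alpha.
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))  # [-0.03, -0.005, 0.0, 2.0]
```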

Pros

  • Mitigates the dying ReLU problem.
  • Retains the benefits of ReLU.

Cons

  • Introduces an additional hyperparameter (α).

5. Exponential Linear Unit (ELU)

The Exponential Linear Unit (ELU) is an activation function that transforms input values using an exponential function for negative inputs and a linear function for positive inputs. This function is particularly useful in scenarios where faster and more accurate learning is required.

Use Cases

  • Hidden Layers in Deep Neural Networks: The ELU function is ideal for hidden layers in deep neural networks because it helps in reducing the vanishing gradient problem and speeds up learning by maintaining mean activations close to zero.
  • Improved Learning Dynamics: ELU provides faster and more accurate learning compared to ReLU by introducing smooth and non-zero outputs for negative values, leading to improved gradient flow and convergence.

However, while ELU addresses some of the issues found in ReLU, such as the dying ReLU problem, its computational complexity is higher due to the exponential function.

Equation

ELU(x) = x if x > 0, otherwise α(e^x − 1)

The ELU function has a range of (−α, ∞), is differentiable (everywhere when α = 1), introduces non-linearity, and smooths negative values exponentially towards −α.
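
A short NumPy sketch of ELU, again purely illustrative:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for positive inputs; alpha * (e^x - 1) for negative inputs,
    # which saturates smoothly towards -alpha.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(elu(x))  # approximately [-0.950, -0.393, 0.0, 2.0]
```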

Pros

  • Reduces the vanishing gradient problem.
  • Improves learning characteristics.

Cons

  • Computationally expensive.
  • Introduces an additional hyperparameter (α).

6. Swish

The Swish function is a newer activation function that transforms input values by multiplying them with their sigmoid activation, resulting in a smooth, non-linear output that ranges between negative and positive values. This function is particularly useful in scenarios where maintaining a balance between simplicity and performance is crucial.

Use Cases

  • Hidden Layers in Neural Networks: The Swish function is perfect for hidden layers in neural networks due to its ability to preserve information flow and promote smoother gradients, leading to improved training performance and convergence.
  • Deep Learning Models: Swish is increasingly used in deep learning models where traditional activation functions like ReLU may fall short, providing better accuracy and efficiency in complex tasks.

However, while Swish addresses some limitations of older activation functions like the vanishing gradient problem and output saturation, its computational complexity can be higher due to the inclusion of the sigmoid component. Despite this, the balanced performance of Swish makes it a strong candidate in modern neural network architectures.

Equation

Swish(x) = x · σ(x) = x / (1 + e^(−x))

The Swish function is bounded below (its minimum is roughly −0.28) and unbounded above, is differentiable, introduces non-linearity, and produces a smooth output by retaining the input and scaling it by its sigmoid.
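
A minimal NumPy sketch of Swish, written directly in terms of the sigmoid rather than any framework-specific API:

```python
import numpy as np

def swish(x):
    # x multiplied by its own sigmoid: smooth, non-monotonic, and slightly
    # negative for moderate negative inputs before approaching zero.
    return x / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(swish(x))  # approximately [-0.033, -0.269, 0.0, 1.762]
```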

Pros

  • Outperforms ReLU in deep networks.
  • Less prone to the vanishing gradient problem.

Cons

  • Computationally more expensive than ReLU.

7. Softmax

The softmax function is a widely used activation function that transforms input values into a probability distribution, where each output value is between 0 and 1 and the sum of all outputs equals 1. This function is particularly useful in scenarios where the model needs to predict a multi-class probability distribution.

Use Cases

  • Output Layer for Multi-Class Classification: The softmax function is perfect for multi-class classification problems where the output needs to represent a probability distribution over multiple classes, effectively allowing the model to distinguish between more than two classes.
  • Neural Network Architectures: It is commonly used in the output layer of neural networks designed for tasks like image classification, natural language processing, and other applications where a clear, probabilistic interpretation of the output is required.

The softmax function provides a clear and interpretable output, making it essential for multi-class classification tasks. However, because it normalizes across all outputs and is more computationally intensive than simple element-wise activations, it is typically reserved for the output layer rather than hidden layers.

Equation

softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)

The softmax function is differentiable, introduces non-linearity, outputs values in the range (0, 1), and converts a vector of values into a probability distribution that sums to 1.
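
A small NumPy sketch of softmax; the max-subtraction step is a common numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(logits):
    # Subtracting the max avoids overflow in exp() for large logits
    # and leaves the resulting probabilities unchanged.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```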

Pros

  • Useful for multi-class classification.

Cons

  • Sensitive to outliers (a single large logit can dominate the distribution).

Conclusion

Activation functions are a fundamental component of neural networks, playing a critical role in enabling them to model complex, non-linear relationships within data. By introducing non-linearity, these functions allow neural networks to perform a wide range of sophisticated tasks, from image recognition to natural language processing. Each activation function has unique characteristics, advantages, and drawbacks, making them suitable for different scenarios and architectures.

As we explored, the sigmoid and tanh functions, though foundational, have limitations such as the vanishing gradient problem. In contrast, ReLU and its variants like Leaky ReLU and ELU address these issues, promoting faster and more efficient training. The newer Swish function offers a promising balance of performance and complexity, while the softmax function remains indispensable for multi-class classification tasks.

Choosing the right activation function depends on the specific requirements of your neural network model and the problem at hand. By understanding the strengths and limitations of each function, you can make informed decisions that enhance your network’s performance and accuracy.
