Neural Network 04 — Activation Functions

Tharanga Nandasena
5 min read · Nov 3, 2023


Welcome to my lesson #4. If you have been following along with my previous lessons, you should already have a solid, detailed understanding of the intuition behind Neural Network Representation.

If you do not yet have the basics of Neural Networks, or if you are interested in reading the earlier lessons, please visit the following links.

  1. Prerequisites
  2. Logistic Regression is a solid base
  3. Neural Network Representation

If you are good to go, let’s get started. 😎👍

What is an Activation Function?

The reason for the immense popularity of Neural Networks and Deep Learning in AI work is their accuracy and their outstanding ability to recognize patterns in data on their own. The activation function plays a crucial role in learning these complex patterns hidden inside large datasets.

If you recall the diagram we studied previously, you will see that the second half of the node represents applying a non-linearity to the weighted sum of the inputs and producing the output of that node.

The diagram above specifically shows the Sigmoid activation function, which is one of the many activation functions we are going to learn about very soon. 👇

As we know, if we consider two layers of a Neural Network, forward propagation computes the following equations.
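Written out in the standard notation (superscript [l] for the layer number, g for the activation function), these equations are:

z[1] = W[1]x + b[1]
a[1] = g[1](z[1])
z[2] = W[2]a[1] + b[2]
a[2] = g[2](z[2]) = ŷ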

Sigmoid activation function (Logistic function)

  • z is the weighted sum of the inputs plus the bias.
  • The output range is between 0 and 1.
  • There is an approximately linear region around z = 0.

The Sigmoid function squashes its outputs into the range between 0 and 1; it maps any real number to a value between 0 and 1. (Sounds like a probability, doesn't it? 😉)

Sigmoid has a smooth derivative, which allows efficient training with gradient-based optimization algorithms. However, it also suffers from the vanishing gradient problem as the network becomes deeper and deeper.

The Sigmoid activation function is usually used for binary classification problems.
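As a quick illustration, here is a minimal NumPy sketch (not tied to any framework) of the sigmoid σ(z) = 1 / (1 + e^(−z)):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) activation: 1 / (1 + e^(-z)), squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ≈ [0.0000454, 0.5, 0.9999546]
```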

Tanh activation function (Hyperbolic Tangent function)

  • z is the weighted sum of the inputs plus the bias.
  • The output range is between -1 and 1.
  • There is an approximately linear region around z = 0.

tanh is a shifted and rescaled version of Sigmoid (tanh(z) = 2σ(2z) − 1). In practice, tanh usually works better than Sigmoid in hidden layers, and it mitigates, to some extent, the vanishing gradient problem that Sigmoid suffers from.

The mean of the values coming out of the tanh activation function is closer to 0. Just as centering our training data around a mean of 0 helps learning, the roughly zero-centered outputs of tanh make learning easier for the next layer.

However, if the output layer should produce values between 0 and 1 (e.g., binary classification), we have to use Sigmoid at the output layer.
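For reference, a minimal NumPy sketch of tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)):

```python
import numpy as np

def tanh(z):
    """Hyperbolic tangent activation: zero-centered outputs in (-1, 1)."""
    return np.tanh(z)

print(tanh(np.array([-2.0, 0.0, 2.0])))  # ≈ [-0.964, 0.0, 0.964]
```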

Drawbacks of Sigmoid and tanh
For very large or very small z, the slope of both functions is close to 0 (for example, the sigmoid's slope at z = 10 is roughly 4.5 × 10⁻⁵). That can slow down gradient descent.

Rectified Linear Unit (ReLU) activation function 💜

  • The derivative is 1 when z > 0
  • The derivative is 0 when z < 0

The derivative at z = 0 is not well defined. But in practice z is almost never exactly 0, so we can simply treat the derivative there as 0 (or 1); it makes no practical difference.

If you check the ReLU function, you can see that the output a is 0 for all negative z values. That means the derivative (slope) is also 0 for all negative z values. In practice, however, the majority of hidden units will have z > 0, so the learning process can still be quite fast for most of the training samples.
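A minimal NumPy sketch of ReLU(z) = max(0, z):

```python
import numpy as np

def relu(z):
    """ReLU activation: passes positive values through unchanged, zeroes out negatives."""
    return np.maximum(0.0, z)

print(relu(np.array([-3.0, 0.0, 3.0])))  # [0. 0. 3.]
```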

There is another version of ReLU called Leaky ReLU. The idea behind Leaky ReLU is to avoid the output a being exactly 0 for negative z values, by giving the negative side a small slope instead.

Leaky ReLU often works better than ReLU, but both work really well, and ReLU is the most common and trusted one to use.
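A minimal NumPy sketch of Leaky ReLU; the negative-side slope of 0.01 below is just the commonly used default:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z when z > 0, alpha * z otherwise, so the slope is never exactly 0."""
    return np.where(z > 0, z, alpha * z)

print(leaky_relu(np.array([-3.0, 0.0, 3.0])))  # [-0.03  0.    3.  ]
```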

Why do we need a non-linear activation function? 🤔

A lot of people have this question in their heads, especially when getting into Neural Networks. So, why do we really need a non-linear activation function, and why is it such a crucial component of a neural network?

The answer to this question touches on the following ideas.

  1. The data around us does not always have linear relationships
  2. Adding non-linearity to the network
  3. Learning complex patterns
  4. Making the gradient flow smooth

So, I will explain the answer mathematically, which is always a good way to understand these things. 😎

We have the following four equations for our two-layer neural network.
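To see why non-linearity matters, suppose both layers used a linear (identity) activation, g(z) = z. Then a[1] = z[1] = W[1]x + b[1], and substituting into the second layer gives:

ŷ = a[2] = W[2]a[1] + b[2]
         = W[2](W[1]x + b[1]) + b[2]
         = (W[2]W[1])x + (W[2]b[1] + b[2])
         = W′x + b′

So the whole network collapses into a single linear function of the input. No matter how many layers we stack, without a non-linear activation function the network can never compute anything more interesting than a linear function of x (and with a sigmoid only at the output, it is no more expressive than plain logistic regression). The hidden layers become useless.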

However…
when we are solving a regression problem, we might have to use a linear activation function at the output layer. In that case ŷ ∈ ℝ, i.e., −∞ < ŷ < ∞. The hidden layers, however, should still use non-linear activations.

Derivatives of Activation Functions

(These are the derivatives needed for back propagation.)
Note that the step-by-step derivations are not worked out here; you can try them yourself if you like. The derivations themselves are not super important, as long as you know the resulting formulas.

Sigmoid activation function

Pay attention to the colored areas of the graph and the related notes below it.
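The key result for the sigmoid is the well-known formula: for g(z) = σ(z) = 1 / (1 + e^(−z)),

g′(z) = σ(z)(1 − σ(z)) = a(1 − a)

At z = 0 the slope reaches its maximum of 0.25, and for very large or very negative z it approaches 0, which is exactly the vanishing-gradient behaviour described earlier.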

Tanh activation function

Pay attention to the colored areas of the graph and the related notes below it.
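The corresponding formula for tanh: for g(z) = tanh(z),

g′(z) = 1 − tanh²(z) = 1 − a²

At z = 0 the slope is 1, and for large |z| it again approaches 0.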

ReLU activation function
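For g(z) = max(0, z):

g′(z) = 1 if z > 0
g′(z) = 0 if z < 0

(The derivative at z = 0 is undefined; in practice we simply pick 0 or 1.)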

Leaky ReLU activation function
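For g(z) = max(0.01z, z), using the commonly chosen slope of 0.01 for the negative side:

g′(z) = 1 if z > 0
g′(z) = 0.01 if z < 0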

Awesome! 😎 You have completed another super important lesson in your Neural Network journey. There is a lot more to learn, so let's dive in step by step. See you in the next lesson. Good luck!!! Keep learning!!! 🫡
