Neural Networks Part - 3
This article is part of a series of articles on neural networks.
- Neural Networks intro
- Loss function
- Activation function (this article)
In the previous part, we discussed the use of a loss function to measure our model’s performance.
In this part, we’ll discuss activation functions. You must have come across these before as a function applied to the weighted sum of a neuron’s inputs plus its bias, whose result becomes the output of the neuron. You must also have observed that there are multiple activation functions to choose from, and that some perform better than others at a particular task.
Motivation for activation functions
Let’s see what a neural network would look like without an activation function. Consider the 4-layer network we introduced in the first part.
The first layer is the input layer which feeds all the features into every node in layer 2. Let’s denote the output of the jᵗʰ node in the iᵗʰ layer as Aᵢⱼ.
Here the output of each neuron is calculated as the weighted sum of the outputs of the previous layer, plus its bias. It can be written mathematically as

Aᵢⱼ = Σₖ Wₖ·A₍ᵢ₋₁₎ₖ + b
Here n is the number of neurons in the previous layer (the index k runs over them), W is the set of weights, and b is the bias for the current neuron. Using the above equation, we can calculate the outputs of every layer based on the outputs of its previous layer. Hence, the whole network can be written as a mathematical function mapping x (features) to y (labels). Let’s see how that happens.
NOTE: Array indices here are 1-based for simplicity.
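To make the equation concrete, here is a minimal NumPy sketch of one layer’s output; the layer sizes, weights and values are made up for illustration (and NumPy arrays are 0-indexed, unlike the math above):

```python
import numpy as np

# Illustrative values only: 3 neurons in the previous layer, 2 in the current one.
prev_activations = np.array([0.5, -1.2, 3.0])   # outputs A of the previous layer

# One row of weights per neuron in the current layer
weights = np.array([[0.1, -0.4, 0.25],
                    [0.8,  0.3, -0.1]])
biases = np.array([0.05, -0.2])

# Output of the current layer: weighted sum of previous outputs plus bias
# (no activation function yet)
layer_output = weights @ prev_activations + biases
print(layer_output)   # one value per neuron in the current layer
```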
Consider layer 4. The output of layer 4 is our y. Writing Wᵢ and bᵢ for the full set of weights and biases of layer i, y can be expressed as

y = A₄ = W₄·A₃ + b₄ = W₄·(W₃·A₂ + b₃) + b₄ = W₄·(W₃·(W₂·x + b₂) + b₃) + b₄
As we can see, no term of x or A is raised to any power; multiplying the expression out gives y = (W₄W₃W₂)·x + (W₄W₃·b₂ + W₄·b₃ + b₄), so y is just a linear combination of x. This means that we can only approximate linear functions, and our predictions will be good only if y happens to be some linear function of x.
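Here is a minimal NumPy sketch of that collapse, using two layers without activations and made-up shapes and values; composing the linear layers gives exactly the same result as a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=3)                                   # input features
W2, b2 = rng.normal(size=(4, 3)), rng.normal(size=4)     # weights/bias of one layer
W3, b3 = rng.normal(size=(2, 4)), rng.normal(size=2)     # weights/bias of the next layer

# Forward pass through both layers with no activation function
y_stacked = W3 @ (W2 @ x + b2) + b3

# The same mapping collapsed into a single linear layer
W_eff = W3 @ W2
b_eff = W3 @ b2 + b3
y_collapsed = W_eff @ x + b_eff

print(np.allclose(y_stacked, y_collapsed))   # True: the stack is just one linear function of x
```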
Problem 1: Only linear functions can be modeled. Non-linear functions cannot be modeled accurately.
We can also see that the output values have no restriction on them and can potentially take any value between -infinity and +infinity. If this happens, our network may converge very slowly or not at all.
Problem 2: Output values of neurons are not bounded.
A solution to these problems is to pass the output of every neuron through a function that makes the mapping non-linear and produces an output within a certain range (bounded). This function is called an activation function, and the output it produces is called the activation value of the neuron.
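As a minimal sketch of the idea (illustrative values only; tanh is used here purely as an example of a non-linear, bounded function, and the common choices are discussed next):

```python
import numpy as np

# One neuron, illustrative values; tanh stands in for "some non-linear, bounded function"
prev_activations = np.array([0.5, -1.2, 3.0])
weights = np.array([0.1, -0.4, 0.25])
bias = 0.05

z = weights @ prev_activations + bias   # the linear part, exactly as before
activation_value = np.tanh(z)           # non-linear and bounded in (-1, 1)
print(z, activation_value)
```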
There are several options to choose from for this function. Here is a great explanation of each activation function. In practice, the most popular activation functions are ReLU (Rectified Linear Unit) and the sigmoid function. Let’s have a look at one of these functions.
Sigmoid function
The sigmoid function is mathematically defined as

σ(x) = 1 / (1 + e⁻ˣ)
As we can see from its graph, it solves both of the problems mentioned above. It is an S-shaped curve, clearly non-linear, which solves our first problem, and its output is bounded between 0 and 1, which solves our second problem.
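A minimal NumPy sketch of the sigmoid (standard definition), showing that every output stays strictly between 0 and 1:

```python
import numpy as np

def sigmoid(x):
    # Standard sigmoid definition: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))
# roughly [0.00005 0.119 0.5 0.881 0.99995]: every output lies strictly between 0 and 1
```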
Pros:
- Observe the curve in the region (-2, 2): it is quite steep there, which means that tiny changes in the value of x change the value of y significantly. This helps the network converge faster.
- Since the values are bounded in the (0, 1) range, it is a good option for the output layer when we expect probabilities in classification problems.
- The derivative of this function is very easy to calculate, and this helps in the backpropagation step (which we’ll discuss in a future part); a short sketch follows this list.
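Here is a small sketch of that derivative; it is the standard identity σ'(x) = σ(x)·(1 - σ(x)), which lets us reuse the value already computed in the forward pass:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # Standard identity: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_derivative(0.0))   # 0.25, the steepest point of the curve
```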
Cons:
- In the regions towards either end of the function, the y values change very slowly with respect to x, so the gradient (slope) there is very low. This leads to the “vanishing gradient” problem. It can also cause the outputs of consecutive layers to be pushed even further towards the ends of the curve, which results in very slow convergence.
- The output values of this function are not zero-centered, which means the outputs are always positive. This is a bad thing because, in the backpropagation step, the gradients for a layer’s weights will be either all positive or all negative, limiting the degrees of freedom for the weight updates.
These problems are solved by other activation functions such as tanh and ReLU. Most applications nowadays use ReLU as the activation function due to its simplicity and efficiency. This video by Siraj Raval is a great explanation of activation functions and which ones to use.
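For reference, here is a quick sketch of those two alternatives (standard definitions): tanh is zero-centered and bounded in (-1, 1), while ReLU is cheap to compute and does not saturate for positive inputs.

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

tanh_out = np.tanh(z)           # zero-centered, bounded in (-1, 1)
relu_out = np.maximum(0.0, z)   # ReLU: clips negatives to 0, cheap to compute

print(tanh_out)   # roughly [-0.96 -0.46  0.    0.46  0.96]
print(relu_out)   # [0.  0.  0.  0.5 2. ]
```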
In the next part, we’ll have a look at the most important step in the training process of a neural network, called backpropagation. Stay tuned :)