Paula Errasti
Bedrock — Human Intelligence
6 min read · Aug 9, 2022


Have you ever stopped to think about why the ReLU function works so well?

In our previous article in this series we presented two techniques to improve the optimisation of a neural network, both adaptive: adapting the learning rate α and adapting the gradient descent formula by including new terms in the equation.

In this article we are going to focus on another essential element of Artificial Neural Networks (ANNs), the activation function, and we will see what limitations it has and how they can be overcome.

The activation function

We already mentioned the activation function in our introductory article on ANNs, but what exactly does it do and how can we know which one to use?

The nodes of an ANN are characterised by performing an operation on the information they receive through this activation function, which we’ll denote by φ. Thus, the output of a neuron j, yⱼ, is the result of the activation function φ applied to the information that neuron j receives from its m predecessor neurons:
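In symbols, assuming the usual weighted sum with weights ωᵢⱼ and a bias term bⱼ (the bias is our assumption; some formulations omit it):

yⱼ = φ(vⱼ), with vⱼ = Σᵢ ωᵢⱼ · yᵢ + bⱼ

where the sum runs over the m predecessor neurons i of neuron j.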

Conventionally, φ was chosen to be the sigmoid function or the hyperbolic tangent, but these functions have several drawbacks that can hinder optimisation.

Why are these functions not always perfect?

The sigmoid activation function takes the input v and transforms it into a value between 0 and 1, while the hyperbolic tangent does it between -1 and 1.

Left: Plot of sigmoid function. Right: Plot of tanh function
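For reference, the standard definitions of the two functions are:

σ(v) = 1 / (1 + e⁻ᵛ)    tanh(v) = (eᵛ − e⁻ᵛ) / (eᵛ + e⁻ᵛ)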

The problem is that the neurons can saturate: arbitrarily large input values will always return 1, and very negative values 0 (or -1 in the case of tanh). Therefore, these functions are only sensitive to changes when their output is near 0.5 and 0 respectively, that is, when the input vⱼ is close to 0. Once the neurons saturate, it becomes very difficult for the algorithm to adapt the weights and improve the performance of the model.
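A quick numerical check makes the saturation visible. The following NumPy sketch (our own illustration, not from the original article) evaluates the sigmoid and its derivative at increasingly large inputs:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_grad(v):
    s = sigmoid(v)
    return s * (1.0 - s)  # standard derivative of the sigmoid

for v in [0.0, 2.0, 5.0, 10.0]:
    print(f"v = {v:5.1f}   sigmoid = {sigmoid(v):.6f}   gradient = {sigmoid_grad(v):.6f}")
```

At v = 10 the output is already 0.999955 and the gradient has collapsed to about 4.5 · 10⁻⁵, so weight updates flowing through this neuron are essentially zero.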

In addition, deep networks, those with many hidden layers, can be difficult to train because of the way the gradients of the first layers are related to those in the final layers. The magnitude of the gradient can decrease exponentially with each additional layer we add, to the point where the algorithm can no longer tell how to adjust the parameters of the early layers to improve the cost function. This is the well-known vanishing gradient problem.

Understanding the vanishing gradient problem

Let us look at this problem in a little more detail. Suppose we have a network with m hidden layers, each containing a single neuron, and let us denote the weights between layers by ω⁽¹⁾, ω⁽²⁾, …, ω⁽ᵐ⁾. Suppose too that the activation function of each layer is the sigmoid and that the weights have been randomly initialised so that their expected absolute value equals 1. Let x be the input vector, y⁽ᵗ⁾ the hidden value of each layer t and φ’(v⁽ᵗ⁾) the derivative of the activation function in hidden layer t. From the backpropagation algorithm we know the expression:
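In this notation, the expression is the chain-rule recurrence that links the gradient of the cost J at layer t to the gradient one layer later (written out here from the setup above):

∂J/∂y⁽ᵗ⁾ = ω⁽ᵗ⁺¹⁾ · φ’(v⁽ᵗ⁺¹⁾) · ∂J/∂y⁽ᵗ⁺¹⁾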

What happens if we explicitly calculate the value of the derivative φ’(x) for the case of a sigmoid function?
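Writing the sigmoid as φ(x) = 1/(1 + e⁻ˣ), a short computation gives

φ’(x) = φ(x) · (1 − φ(x)),

which is largest at x = 0, where φ(0) = 0.5 and therefore φ’(0) = 0.5 · 0.5 = 0.25.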

We can see that the value of φ’(x) is at most 0.25. Since the expected absolute value of ω⁽ᵗ⁺¹⁾ is 1, the backpropagation equation tells us that each step backwards multiplies the gradient by at most 0.25 in expectation: ∂J/∂y⁽ᵗ⁾ will be less than 0.25 · (∂J/∂y⁽ᵗ⁺¹⁾). Therefore, when we move backwards r layers, the gradient shrinks below 0.25ʳ of its original value. If r equals, say, 10, the gradient updates fall to roughly 10⁻⁶ of the value they originally had. Consequently, the first layers of the network receive much smaller updates than those closest to the output layer.

Left: Plot of the derivative of the sigmoid function. Right: Plot of the derivative of the tanh function
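To make the decay concrete, here is a minimal NumPy sketch (our own illustration, under the assumptions above) that accumulates the per-layer factors ω⁽ᵗ⁾ · φ’(v⁽ᵗ⁾) backwards through r = 10 layers:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

grad = 1.0                           # gradient at the output layer
for t in range(10):                  # walk back through r = 10 layers
    w = rng.choice([-1.0, 1.0])      # weight with expected absolute value 1
    s = sigmoid(rng.normal())        # sigmoid at a random pre-activation
    grad *= w * s * (1.0 - s)        # chain-rule factor, at most 0.25
print(abs(grad))                     # typically around 1e-7: vanished
```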

This implies that the parameters of the last layers undergo large variations with each update, while those of the first layers barely change. As a result, it is common to find that, even after training for a long time, the optimum cannot be reached. In general, unless each weight of each junction between layers is initialised so that the product ω ⋅ φ’ is exactly one, we will run into this problem.

The ReLU function

This is still an open problem, and several solutions are known that can be applied to this particular issue. One proposed in recent years is the Rectified Linear Unit or, as it is known in Data Science circles, the ReLU function. It is defined by the expression:
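φ(v) = max(0, v)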

We see that the function acts as the identity if the input is positive and returns zero otherwise. With this function the vanishing gradient problem occurs in fewer situations, since the derivative is exactly 1 for every neuron with a positive input. When ReLU was first used, between 2009 and 2011, it provided better performance in networks that had previously been trained with the sigmoid or hyperbolic tangent functions.

Some benefits of the ReLU to highlight are the following:

Computational simplicity: Unlike the other functions, it does not require the calculation of an exponential function.

Representational sparsity: A great advantage of this function is that it can return an exact zero. This allows hidden layers to contain one or more nodes that are completely “off”. This is called a sparse representation, and it is a desirable property since it speeds up and simplifies the model, as the sketch below illustrates.
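As a small illustration of that sparsity (a NumPy sketch of our own, not from the article), compare the outputs of ReLU and the sigmoid on the same inputs:

```python
import numpy as np

v = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out = np.maximum(0.0, v)            # exact zeros for non-positive inputs
sigmoid_out = 1.0 / (1.0 + np.exp(-v))   # small, but never exactly zero

print(relu_out)     # [0.  0.  0.  0.5 2. ]
print(sigmoid_out)  # ≈ [0.119 0.378 0.5 0.622 0.881]
```

ReLU switches the first three nodes fully off, while the sigmoid keeps every node emitting a small nonzero value.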

However, recent work has found that the use of this function can cause another type of problem: the death of certain neurons. Imagine a neuron whose input is always positive and whose weights, by chance, have been initialised with negative values. The output of that neuron will then always be zero, and so will its gradient, which means the weights can never update and the information the neuron receives is lost. To solve this, variants such as the leaky ReLU are currently being studied:
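A common parameterisation, with a small fixed slope α (e.g. α = 0.01; the exact value is a design choice):

φ(v) = v if v > 0, and φ(v) = α · v otherwise

so that negative inputs still pass a small gradient α backwards, which keeps the neuron trainable instead of permanently “dead”.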

Conclusion

When a neural network we are training fails to converge, the cause may be this vanishing gradient problem, which is especially relevant when we have several hidden layers. It is important to keep in mind that machine learning algorithms are based on mathematical rules; they are not magic tricks, and if we pull back the curtain we can see how they work. Sometimes we can improve their training simply by understanding what is going on behind the scenes.

A possible solution is the one proposed in this article, but there are others worth trying. There will be trade-offs: some alternatives may give better results, but at the cost of being less efficient. We must evaluate each problem and try various alternatives. One size does not fit all.

