[DL] 4. More about Gradient Descent and Activation Functions

jun94 · Published in jun-devpBlog · Mar 7, 2020

1. Recap

Please read the previous chapters (Chapters 2 and 3) before going through this one.

Stationary Points

Gradient descent tells us to move x by a small step in the direction of the negative gradient f'(x). In most cases, however, there are several points whose gradient is 0, and there are three types of such points: (1) a minimum, (2) a maximum, (3) a saddle point. Among them, we want to find points of type (1).

Figure 1. From Goodfellow
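As a minimal illustration of this update rule (a sketch added here, not from the original article; the example function and step size are arbitrary choices), the following Python snippet moves x along the negative gradient until it reaches a stationary point:

```python
# Minimal gradient descent sketch; the function and step size are arbitrary choices.
def gradient_descent(f_prime, x0, learning_rate=0.1, tol=1e-6, max_steps=1000):
    """Move x by a small step in the negative gradient direction until f'(x) ~ 0."""
    x = x0
    for _ in range(max_steps):
        grad = f_prime(x)
        if abs(grad) < tol:        # stationary point: minimum, maximum, or saddle point
            break
        x -= learning_rate * grad  # step in the negative gradient direction
    return x

# Example: f(x) = (x - 3)^2 with f'(x) = 2(x - 3); its (global) minimum is at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # ~3.0
```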

Global Minimum vs Local minimum

The global minimum is the point with the absolute lowest value of f(x).

While training a learning algorithm (a neural network), we work with an error function E(w) that often has multiple local minima and saddle points. This makes the optimization, i.e. minimizing E(w), very challenging, because the learning algorithm stops when it reaches any point where the gradient is zero, whether it is the global minimum, a local minimum, or a saddle point. Therefore it is hard for the algorithm to move out of such points.

Figure 2. From Goodfellow

So, we often settle for finding a local minimum x' whose value f(x') is reasonably low. Note that this x' does not have to be the global minimum.

2. Directional Derivatives.

Intuitively, the directional derivative of f at a point x = (x0, y0) represents the rate of change of f as x moves in the direction of the unit vector u.

Figure 3. Definition of the Directional Derivative

The directional derivative has the following significant property.

Figure 4. The property and proof of the directional derivative

With this property, we can understand why we take the direction of the negative gradient when updating the network parameters (the weights w).
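Since Figures 3 and 4 are images, here is the standard definition and property they refer to, written out for reference (for a unit vector u, with θ the angle between ∇f and u):

```latex
% Definition of the directional derivative of f at x in the direction of a unit vector u
D_{\mathbf{u}} f(\mathbf{x})
  = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{u}) - f(\mathbf{x})}{h}
  = \nabla f(\mathbf{x}) \cdot \mathbf{u}

% Property: with \theta the angle between \nabla f(\mathbf{x}) and \mathbf{u}
% (and \lVert \mathbf{u} \rVert = 1),
D_{\mathbf{u}} f(\mathbf{x}) = \lVert \nabla f(\mathbf{x}) \rVert \cos\theta ,

% which is most negative when \theta = \pi, i.e. when \mathbf{u} points along
% -\nabla f(\mathbf{x}). This is why gradient descent steps along the negative gradient.
```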

see here for more detail

3. More Activation Functions in the Neural Network

(1) Soft-plus function

As an alternative to the sigmoid function, which saturates for inputs x whose absolute value is larger than approximately 6, we can choose the soft-plus function. As we can see in Figure 5, unlike the sigmoid function, whose gradients are zero on both sides of the graph (marked yellow), soft-plus has a gradient of zero only on the left side of the graph, meaning there is less chance of saturation; a quick numerical check of this follows below.

Figure 5. sigmoid(left) and soft-plus(right) from Goodfellow
  • cf.) Why is the sigmoid function not desirable with the Mean-Squared-Error function?
see here
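As a rough numerical check of this saturation behavior (a sketch added here, not part of the original article), the following compares the gradients of sigmoid and soft-plus at large inputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # close to 0 for large |x|: saturates on both sides

def softplus_grad(x):
    return sigmoid(x)      # d/dx log(1 + e^x) = sigmoid(x): close to 1 for large positive x

for x in (-6.0, 0.0, 6.0):
    print(x, sigmoid_grad(x), softplus_grad(x))
# The sigmoid gradient is ~0 at both x = -6 and x = 6,
# while soft-plus only saturates on the negative side.
```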

(2) Rectified Linear Unit(ReLU)

In modern neural networks, ReLU is the most commonly chosen activation function for hidden units.

Figure 6. ReLU function h: h(a) = max(0, a)

After taking a look at the graph of the ReLU function, one might notice that it is not differentiable at the point a = 0. In general, an activation function should be differentiable at every point in its domain. In practice, however, this is not a reason to stop us from using ReLU: when we train neural networks, gradient descent reduces the cost function dramatically but often does not arrive at a point where the gradient is exactly 0. Therefore it is acceptable to use ReLU even though the resulting cost (error) function is not differentiable everywhere.

On top of that, most popular software implementations return either 0 or 1, which are the left and right derivatives of ReLU at the point a = 0, respectively, instead of returning an undefined gradient (NaN).
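A minimal sketch of ReLU and this derivative convention (which one-sided value is returned at a = 0 varies between implementations; returning 0 here is just one option):

```python
def relu(a):
    return max(0.0, a)

def relu_grad(a):
    # ReLU is not differentiable at a = 0; instead of returning NaN,
    # implementations return one of the one-sided derivatives (0 or 1).
    return 1.0 if a > 0 else 0.0  # here a = 0 maps to the left derivative, 0
```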

  • Variants of ReLU

Since ReLU has a gradient of 0 on the range (-∞, 0], there are some variants of ReLU whose gradient is not 0 there, as shown in Figure 7.

Figure 7. Generalized ReLU h

By setting α to different values, we get different ReLU functions, as follows (a small code sketch of this generalized form appears after the list).

Absolute Value Rectification: fixes 𝛂 = -1 to get h(a) = |a|

Leaky ReLU: fixes 𝛂 to a very small value.

Parametric ReLU: treats 𝛂 as a learnable parameter
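Here is the sketch of the generalized form h(a) = max(0, a) + α·min(0, a) from Figure 7; the leaky slope of 0.01 is a common choice used for illustration, not a value from the article:

```python
def generalized_relu(a, alpha):
    """Generalized ReLU: h(a) = max(0, a) + alpha * min(0, a)."""
    return max(0.0, a) + alpha * min(0.0, a)

def absolute_value_rectification(a):
    return generalized_relu(a, alpha=-1.0)  # h(a) = |a|

def leaky_relu(a, alpha=0.01):              # alpha fixed to a small value
    return generalized_relu(a, alpha)

# Parametric ReLU uses the same formula, but alpha is a learnable parameter
# that is updated by gradient descent together with the weights.
```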

4. Learning Rate

Choosing an appropriate learning rate can also be challenging. If one chooses a learning rate that is too small, convergence to the minimum can be very slow, as depicted in (a). On the other hand, if the learning rate is too large, convergence might not happen at all, as shown in (b).

Figure 8. generated by steepestDescentDemo.

Therefore, we often choose a method that begins with a relatively large learning rate and adapts it to smaller values as learning proceeds, e.g. Adam. We will talk about those methods in later chapters.
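As a toy illustration of shrinking the learning rate over time (the 1/(1 + decay·step) schedule below is an arbitrary choice for illustration; Adam itself adapts per-parameter step sizes and is covered in a later chapter):

```python
def decayed_learning_rate(initial_lr, step, decay_rate=0.01):
    """Simple 1 / (1 + decay * step) schedule: larger steps early, smaller steps later."""
    return initial_lr / (1.0 + decay_rate * step)

# Gradient descent on f(x) = x^2, whose minimum is at x = 0, with a decaying learning rate.
x, f_prime = 5.0, (lambda x: 2 * x)
for step in range(100):
    lr = decayed_learning_rate(initial_lr=0.9, step=step)
    x -= lr * f_prime(x)
print(x)  # close to 0
```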

5. Derivatives of Error Functions

Figure 9. Deriving the derivative of the cross-entropy error
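Since Figure 9 is an image, the standard result it derives can be summarized as follows, assuming the binary cross-entropy error with a sigmoid output y = σ(a) and target t (the notation here may differ from that in the figure):

```latex
E = -\bigl[\, t \ln y + (1 - t) \ln (1 - y) \,\bigr], \qquad
y = \sigma(a) = \frac{1}{1 + e^{-a}}

% Using \frac{dy}{da} = y(1 - y):
\frac{\partial E}{\partial a}
  = \frac{\partial E}{\partial y}\,\frac{dy}{da}
  = \left( -\frac{t}{y} + \frac{1 - t}{1 - y} \right) y (1 - y)
  = y - t
```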

6. Reference

[1] Bishop. Pattern Recognition and Machine Learning. Springer, 2006

[2] Goodfellow, Bengio, and Courville. Deep Learning. MIT Press, 2016

[3] https://en.wikipedia.org/wiki/Directional_derivative

Any corrections, suggestions, and comments are welcome.

The contents of this article are based on Bishop [1] and Goodfellow [2].
