Deep Learning Best Practices: Activation Functions & Weight Initialization Methods — Part 1

Niranjan Kumar
May 4 · 17 min read

One of the reasons that deep learning has become more popular in the past decade is better learning algorithms, which have led to faster convergence and better performance of neural networks in general. Along with better learning algorithms, the introduction of better activation functions and better initialization methods helps us to create better neural networks.

Note: This article assumes that the reader has a basic understanding of neural networks, weights, biases, and backpropagation.

Citation Note: The content and the structure of this article is based on the deep learning lectures from One-Fourth Labs — PadhAI.

In this article, we discuss some of the commonly used activation functions and weight initialization methods while training a deep neural network. To be more specific, we will be covering the following.

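Under Activation functions,

  • Why are activation functions important?
  • Logistic function
  • Tanh function
  • ReLU
  • Leaky ReLU

Under Weight Initialization Methods,

  • Why not initialize all weights to zero?
  • Random initialization with small weights
  • Random initialization with large weights
  • Xavier initialization
  • He initialization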

The main reason that we use a multi-layered network of neurons is that it can handle non-linearly separable data. The layers between the input and output layers are called hidden layers. The hidden layers handle the complex, non-linearly separable relations between the input and the output by applying a non-linear function called an activation function.

PS: If you are not familiar with multi-layered networks or feed-forward neural networks, there is no need to spend time searching Google for a relevant article; kindly go through my recent blog post on feed-forward neural networks. The link is at the end of this article.

Why are Activation Functions Important?

Before we discuss the different activation functions, let’s see why activation functions are important in deep neural networks. The activation function is the non-linear function that we apply to the pre-activation of a particular neuron; its output is sent as input to the neurons in the next layer.

Let’s assume we have a simple network of neurons with two hidden layers (in blue) and each hidden layer has 3 sigmoid neurons. We have three inputs going into the network and there is one single neuron in the output layer.

For each of these neurons, two things happen:

  1. Pre-activation, represented by ‘a’: the weighted sum of the inputs plus the bias.
  2. Activation, represented by ‘h’: the result of applying the non-linear activation function to the pre-activation.

The pre-activation at each layer is the weighted sum of the inputs from the previous layer plus the bias. Writing Wᵢ and bᵢ for the weights and bias of layer ‘i’ and hᵢ₋₁ for the output of the previous layer (with h₀ = x), the pre-activation at layer ‘i’ is aᵢ = Wᵢhᵢ₋₁ + bᵢ.
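To make the two steps concrete, here is a minimal NumPy sketch of a forward pass through the network described above (three inputs, two hidden layers of three sigmoid neurons, one output neuron). The random weights, the zero biases, the example input and the use of a logistic output neuron are assumptions made purely for illustration.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# 3 inputs -> 3 hidden -> 3 hidden -> 1 output (sizes taken from the text).
layer_sizes = [3, 3, 3, 1]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    h = x                      # h_0 is the input itself
    for W, b in zip(weights, biases):
        a = W @ h + b          # pre-activation: weighted sum plus bias
        h = logistic(a)        # activation: non-linear function of a
    return h

x = np.array([0.2, 0.5, 0.1])  # illustrative input, not from the article
y_hat = forward(x)
print(y_hat)
```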

What happens if there are no non-linear activation functions in the network?

Imagine that instead of applying the non-linear activation function, we pass the pre-activation through unchanged (or apply only a linear transformation to it). Since no non-linear transformation is applied to the output of any neuron in the network, the final output of the network is just the input multiplied by the product of all the weight matrices in the network, plus a constant offset from the biases.

Even if we use a very, very deep neural network, without the non-linear activation function we will just learn ‘y’ as a linear transformation of ‘x’. Such a network can only represent linear relations between ‘x’ and ‘y’. In other words, we are constrained to learning linear decision boundaries and cannot learn arbitrary non-linear decision boundaries.
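This collapse is easy to check numerically. The sketch below, using arbitrary random weights chosen only for illustration, stacks three linear layers and compares the result with a single linear map built from the same parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)
W3, b3 = rng.normal(size=(1, 3)), rng.normal(size=1)

def deep_linear(x):
    # Three layers, but no non-linearity between them.
    h1 = W1 @ x + b1
    h2 = W2 @ h1 + b2
    return W3 @ h2 + b3

# The same function written as one linear layer.
W_eff = W3 @ W2 @ W1
b_eff = W3 @ (W2 @ b1 + b2) + b3

x = rng.normal(size=3)
same = np.allclose(deep_linear(x), W_eff @ x + b_eff)
print(same)
```

However many linear layers are stacked, the same construction yields a single weight matrix and bias, which is why depth alone adds no representational power.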

More importantly, the Universal Approximation Theorem, which describes the representational power of deep neural networks, no longer applies. The representational power of a deep neural network comes from its non-linear activation functions. This is why we need non-linear activation functions: to learn the complex non-linear relationship between the input and the output.

Some of the commonly used activation functions,

  • Logistic
  • Tanh
  • ReLU
  • Leaky ReLU

Let’s discuss the merits and demerits of each of these functions.

Logistic Function

The sigmoid function is a non-linear function that takes a real-valued number as input and compresses it to the range (0, 1), so its output can be interpreted as a probability. There are many functions with the characteristic “S”-shaped curve, known as sigmoid functions. The most commonly used one is the logistic function. In the logistic function, a small change in the input causes only a small change in the output, as opposed to the abrupt jump of a step function. Hence the output is much smoother than the step function output.

The mathematical form of the logistic function and its derivative is f(x) = 1 / (1 + e⁻ˣ) and f′(x) = f(x)(1 − f(x)).
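In code, the logistic function and its derivative could be written as follows. This is a minimal NumPy sketch; the function names are chosen here for illustration.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_grad(x):
    # The derivative can be expressed through the function value itself.
    s = logistic(x)
    return s * (1.0 - s)

xs = np.array([-10.0, 0.0, 10.0])
print(logistic(xs))
print(logistic_grad(xs))
```

The derivative peaks at 0.25 for an input of 0 and shrinks towards zero as the input moves far away in either direction.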

The sigmoid function is continuous and easily differentiable, so we can readily use a logistic function to update weights during backpropagation. However, the logistic function has become less popular in recent years because of its drawbacks:

Vanishing Gradient — Saturated Sigmoid Neurons:

A logistic neuron is said to be saturated when its output reaches its maximum or minimum value. In the logistic function formula, when you plug in a large positive number the output approaches 1, and when you plug in a large negative number the output approaches 0.

When the function has reached either its maximum or minimum value, we say that the logistic function has saturated. As a result, the derivative of the logistic function is close to zero at the saturated point. To understand the implications of a saturated logistic neuron, we will take a simple neural network as shown below,

In this thin but deep network, suppose you are interested in computing the gradient of the loss function with respect to the weight w₂. The pre-activation and post-activation for the neuron present in the third hidden layer are given by,

Assuming that you already know the chain rule for computing the gradients of the weight parameter,

If the post-activation value ‘h₃’ is close to either 0 or 1, then the corresponding derivative in our chain rule will be close to zero, and so will the whole product. As a result, the weights are barely updated, which leads to the vanishing gradient problem.

Saturated neurons cause the gradients to vanish
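The effect can be reproduced on a toy version of such a thin network. The sketch below is an illustration under stated assumptions rather than the exact network in the figure: each layer has a single logistic neuron without bias, the loss is a squared error, and the weights are indexed from zero, so `w[1]` plays the role of the second weight.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w):
    """Pre-activations a[k] = w[k] * h[k] and activations h[k + 1] = logistic(a[k])."""
    a_list, h_list = [], [x]
    for w_k in w:
        a = w_k * h_list[-1]
        a_list.append(a)
        h_list.append(logistic(a))
    return a_list, h_list

def grad_second_weight(x, w, target):
    """dL/dw[1] for L = 0.5 * (y - target)**2, built term by term with the chain rule."""
    _, h = forward(x, w)
    grad = h[-1] - target                          # dL/dy
    for k in range(len(w) - 1, 1, -1):             # layers above the one fed by w[1]
        grad *= h[k + 1] * (1.0 - h[k + 1]) * w[k] # dh/da times da/dh_prev
    grad *= h[2] * (1.0 - h[2])                    # dh/da at the layer fed by w[1]
    return grad * h[1]                             # da/dw[1] is that layer's input

grad_moderate = grad_second_weight(1.0, [0.5, 0.5, 0.5, 0.5], target=0.0)
# One large weight pushes the next neuron deep into saturation.
grad_saturated = grad_second_weight(1.0, [0.5, 0.5, 20.0, 0.5], target=0.0)
print(grad_moderate, grad_saturated)
```

A single saturated neuron on the path is enough to shrink the gradient of every weight that sits before it.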

Why would the logistic function saturate?

We have seen that a saturated logistic function causes a problem, but why would the logistic function saturate? The sigmoid function takes the pre-activation as input, which is nothing but the weighted sum of the inputs plus the bias.

When would the logistic function saturate? It saturates when the aggregation is a large positive or a large negative number, which can happen in one of two ways: either the input ‘xᵢ’ is very large or the weight ‘wᵢ’ is very large. Since we usually normalize the data before feeding it into a neural network, the inputs typically lie in the range 0 to 1, so the weights are the more likely cause.

Suppose we happen to initialize all the weights to large values (large positive or large negative). The aggregation then becomes very large in magnitude, and the logistic function saturates. More generally, at any point during initialization or training, if the weights take on very large positive or very large negative values, the summation can blow up in either direction and the logistic function can hit saturation.

Zero centered functions

Logistic function is not zero-centered

The logistic function is not zero-centered. What I mean by that is that the value of the logistic function always lies between 0 and 1, so its average cannot be 0; it will always be a value above zero. A zero-centered function is one whose output is sometimes greater than 0 and sometimes less than 0.

Let’s see what the problem is with a function that is not zero-centered by taking a simple neural network. For this discussion, consider only the final layer and the second-to-last layer in the network. The pre-activation for the second-to-last layer, ‘a₃’, is equal to the weighted sum of its inputs.

Now, to apply the gradient descent rule and update the parameters of the neuron in the second-to-last layer, we need to compute the gradients of the loss function with respect to ‘w₁’ and ‘w₂’. Assuming that you know the chain rule,

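The terms in red are common to both weights; only the blue terms differ between the two chain rules. The value of a₃ is given by a₃ = w₁h₂₁ + w₂h₂₂.

Substituting ‘a₃’ into the above chain rule, the blue terms become ∂a₃/∂w₁ = h₂₁ and ∂a₃/∂w₂ = h₂₂, so each gradient is the common red quantity multiplied by h₂₁ or h₂₂ respectively.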

Remember that h₂₁ and h₂₂ are outputs of the logistic function, so both of them are always positive. If the red quantity in the above figure is negative, then both of these gradients are negative; similarly, if the red quantity is positive, then both of these gradients are positive. Essentially, either all the gradients of the weights connected to the same neuron are positive or all of them are negative.

This restricts the possible update directions, i.e. the update vector can only point into the first quadrant or the third quadrant. Because the algorithm is not allowed to move in certain directions, it can take a lot of time to converge.
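The sign restriction can be checked with a quick simulation. In the sketch below, the shared upstream term and the inputs are random numbers invented for illustration; only the structure (gradient = common term × input to the weight) follows the argument above.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
common = rng.normal(size=(1000, 1))            # shared upstream term, either sign

# Inputs produced by logistic neurons are always in (0, 1).
h2_logistic = logistic(rng.normal(size=(1000, 2)))
grads = common * h2_logistic                   # columns: dL/dw1 and dL/dw2
same_sign = np.sign(grads[:, 0]) == np.sign(grads[:, 1])
print(same_sign.all())

# Inputs produced by a zero-centered function (tanh) can have either sign.
h2_tanh = np.tanh(rng.normal(size=(1000, 2)))
grads_tanh = common * h2_tanh
mixed_sign = np.sign(grads_tanh[:, 0]) != np.sign(grads_tanh[:, 1])
print(mixed_sign.any())
```

With logistic inputs the two gradients never disagree in sign, whereas zero-centered inputs allow updates into the second and fourth quadrants as well.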

Computationally expensive

The logistic function is computationally expensive because of the exponential term in the function.

Tanh Function

Tanh is a non-linear activation function that compresses its input to the range (-1, 1). The mathematical form of the Tanh activation function and its derivative is tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ) and tanh′(x) = 1 − tanh²(x).

Tanh is similar to the logistic function: it saturates at large positive or large negative values, and the gradient still vanishes at saturation. But the Tanh function is zero-centered, so the gradients are not restricted to move in certain directions. Like sigmoid, Tanh is also computationally expensive because of eˣ.
In practice, Tanh is preferred over the logistic function.
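As a small illustration of both properties, the following sketch evaluates tanh and the logistic function on a symmetric grid of inputs (the grid itself is an arbitrary choice) and compares their average outputs, then looks at the tanh derivative near zero and at saturation.

```python
import numpy as np

def tanh_grad(x):
    # Derivative of tanh expressed through the function value.
    return 1.0 - np.tanh(x) ** 2

xs = np.linspace(-5.0, 5.0, 101)               # inputs symmetric around zero
tanh_mean = np.tanh(xs).mean()                 # close to 0: zero-centered
logistic_mean = (1.0 / (1.0 + np.exp(-xs))).mean()  # close to 0.5: not zero-centered
print(tanh_mean, logistic_mean)

print(tanh_grad(np.array([0.0, 5.0])))         # large at 0, tiny at saturation
```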

ReLU — Rectified Linear Unit

ReLU, a non-linear activation function, was introduced in the context of convolutional neural networks. Unlike the Tanh function, ReLU is not zero-centered.

If the input is positive, the function outputs the value itself; if the input is negative, the output is zero, i.e. f(x) = max(0, x). In fact, we can combine two ReLU units to recover a piecewise linear approximation of a logistic function.
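One way to see the two-ReLU construction is the sketch below. The particular slope (0.25, the slope of the logistic function at zero) and the shifts are choices made here for illustration; other choices give other piecewise linear approximations.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_relu_sigmoid(x):
    # Difference of two shifted ReLUs: 0 for x <= -2, 1 for x >= 2,
    # and a straight line with slope 0.25 in between.
    return relu(0.25 * x + 0.5) - relu(0.25 * x - 0.5)

xs = np.linspace(-6.0, 6.0, 241)
max_gap = np.max(np.abs(two_relu_sigmoid(xs) - logistic(xs)))
print(max_gap)
```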

Advantages:

  • Doesn’t saturate in the positive region, which helps avoid the vanishing gradient problem
  • Computationally efficient.
  • In practice, it converges much faster than logistic/Tanh.

Problem with ReLU — Dead Neurons

Let’s take a simple neural network, where the activation of the first layer, h₁, is obtained by applying the ReLU function to the pre-activation a₁.

The value of h₁ is given by,

Let’s assume that the parameter ‘b’ takes on a large negative value due to a large negative update at some point while training, then the value of a₁ changes to,

If we apply the ReLU function to a₁, which is less than zero, then h₁ is also zero, which means the neuron outputs zero.

Not only is the output equal to zero; during backpropagation, the derivative of h₁ with respect to a₁ also evaluates to zero. The weights w₁, w₂ and the bias b₁ will not get updated because there is a zero term in the chain rule, and the neuron can stay dead indefinitely. This problem is known as the Dying ReLU.

That means no gradients flow back, and none of the weights connected to that neuron get updated. In practice, when you train a network with ReLU, you may observe that a large fraction of neurons die. To avoid this problem, we can use variants of ReLU such as Leaky ReLU, or we can initialize the bias to a positive value. With a positive bias, even if a large negative update flows through the network, there is a better chance that the pre-activation stays above zero and the neuron keeps receiving gradient.
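A dead neuron is easy to reproduce. In the sketch below, the inputs, the weights and the bias value of -10 are made up for illustration, and the upstream gradient is taken to be 1 so that only the ReLU term of the chain rule matters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 2))   # inputs normalised to [0, 1]
w = np.array([0.5, -0.3])
b = -10.0                                   # bias after a large negative update

a1 = X @ w + b                              # pre-activation, negative for every input
h1 = np.maximum(0.0, a1)                    # ReLU output
relu_grad = (a1 > 0).astype(float)          # derivative of ReLU with respect to a1

# Gradients averaged over the batch (upstream gradient assumed to be 1).
grad_w = (relu_grad[:, None] * X).mean(axis=0)
grad_b = relu_grad.mean()
print(h1.max(), grad_w, grad_b)
```

Because the pre-activation is negative for every input in the data, both the output and all parameter gradients are zero, so gradient descent has no way to move the neuron back into the active region.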

Leaky ReLU

Leaky ReLU is a variant of ReLU. Instead of producing zero for inputs less than zero, as ReLU does, it produces a small value proportional to the input, i.e. 0.01x. The mathematical form of Leaky ReLU is f(x) = max(0.01x, x); its derivative is 1 for positive inputs and 0.01 for negative inputs.

Because of the small slope (0.01) applied to negative inputs, the gradient does not saturate. If the input is negative, the gradient is 0.01, which ensures that neurons don’t die.

Advantages of Leaky ReLU:

  • Doesn’t saturate in the positive or negative region
  • Neurons will not die (0.01x ensures that at least small gradient will flow through)
  • Easy to compute
  • Close to zero-centered outputs
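A minimal NumPy sketch of Leaky ReLU and its derivative is shown below. The parameter name `alpha` for the negative slope is a naming choice made here; its default of 0.01 is the value used in the text.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Identity for positive inputs, small slope alpha for the rest.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # The gradient never becomes exactly zero, so the neuron cannot die.
    return np.where(x > 0, 1.0, alpha)

xs = np.array([-2.0, 3.0])
print(leaky_relu(xs))
print(leaky_relu_grad(xs))
```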

Weight Initialization

When we train deep neural networks, weights and biases are usually initialized with random values. If those random values are chosen poorly, we might encounter problems like vanishing or exploding gradients, and as a result the network would take a long time to converge. In this section, we will discuss some of the standard weight initialization techniques.

Why not initialize all weights to zero?

Let’s look at the naive method of initializing weights, i.e. initializing all the weights to zero. Once again, I have taken the simple neural network shown below; let’s focus only on the pre-activation terms a₁₁ and a₁₂.

We know that the pre-activation is equal to the weighted sum of the inputs plus the bias; for simplicity, we ignore the bias term in the equation.

If all our weights are initialized to zero, then the above two equations evaluate to zero. That means all the neurons in the first layer get the same post-activation value, irrespective of the non-linear activation function used.

Because every neuron in the network computes the same output, they all receive the same gradient during backpropagation and undergo exactly the same parameter updates.

In other words, the weights start off with the same value, receive the same gradient update, and therefore remain equal to each other after every update. Once you initialize the weights to zero, in all subsequent iterations the weights remain identical (they move away from zero, but they stay equal to each other); this symmetry never breaks during training. Hence the weights connected to the same neuron should never be initialized to the same value. This phenomenon is known as the symmetry breaking problem.

The key takeaways from our discussion on symmetry breaking problem,

  • Never initialize all the weights to zero
  • Never initialize all the weights to the same value
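The second takeaway can be verified on a tiny network. The sketch below is an illustration under assumptions not taken from the article: two inputs, two logistic hidden neurons, one logistic output, no biases, a squared-error loss, made-up data, and every weight initialized to the same constant 0.5.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.uniform(size=(8, 2))           # toy inputs
y = X[:, :1] * X[:, 1:]                # toy targets in [0, 1]

W1 = np.full((2, 2), 0.5)              # column j holds the weights of hidden neuron j
W2 = np.full((2, 1), 0.5)              # hidden -> output weights
lr = 0.5

for _ in range(100):
    H = logistic(X @ W1)                           # hidden activations
    y_hat = logistic(H @ W2)                       # network output
    d_out = (y_hat - y) * y_hat * (1.0 - y_hat)    # gradient at output pre-activation
    d_hid = (d_out @ W2.T) * H * (1.0 - H)         # gradient at hidden pre-activations
    W2 -= lr * H.T @ d_out / len(X)
    W1 -= lr * X.T @ d_hid / len(X)

print(W1)
```

After training, the weights have moved away from 0.5, but the two columns of `W1` are still equal to each other: the two hidden neurons have learned exactly the same thing.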

Random Initializing — Small Weights

We have seen that initializing weights with zeros or with equal values is not good. Let’s see whether initializing the weights randomly, but with small values, is any better.

Let’s assume that we have a deep neural network with 5 layers and the values of activation output for these 5 layers (left to right) is given below,

We can see from the above figure that the output of the Tanh activation function in all the hidden layers except the first is very close to zero. The gradient of each weight is proportional to the activation feeding into it, and the backpropagated signal is itself multiplied by the small weights at every layer, so the gradients become tiny, the weights barely get updated, and the network learns very little. Here, we are facing the vanishing gradient problem. This problem is not specific to the Tanh activation function; it can be observed with other non-linear activation functions as well.

In the case of a sigmoid (logistic) function, the pre-activations are close to zero, so the output values are centered around 0.5 (the value of the logistic function at 0) and hardly vary from neuron to neuron. The local derivative there is at most 0.25 and the backpropagated signal is again scaled by the small weights at every layer, so the logistic function also suffers from the vanishing gradient problem.
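The shrinking activations can be reproduced with a short simulation. The layer width of 500, the batch of 1000 standard-normal inputs and the weight scale of 0.01 are assumptions chosen here for illustration, not values taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_layers = 500, 5
h = rng.normal(size=(1000, n_units))            # a batch of standardised inputs

stds = []
for _ in range(n_layers):
    W = rng.normal(size=(n_units, n_units)) * 0.01   # small random weights
    h = np.tanh(h @ W)                               # activations of this layer
    stds.append(h.std())

print(stds)   # spread of the activations, layer by layer
```

The spread of the activations drops at every layer, so by the last layer almost every neuron outputs a value indistinguishable from zero.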

Random Initializing — Large Weights

Let’s try large random values for initializing weights and analyze whether it will cause any problem or not.

If the weights are large, the pre-activation sums (a₁₁ and a₁₂) can take on large values, especially if there are many input neurons.

If we pass a large aggregation value to either a logistic or a tanh activation function, the function hits saturation. As a result, the weights are barely updated because the gradient is zero (or close to zero), which leads to the vanishing gradient problem.
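A quick simulation shows how severe the saturation can be. As before, the layer width, the batch of inputs and the choice of unit-variance weights as the "large" case are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_layers = 500, 5
h = rng.normal(size=(1000, n_units))            # a batch of standardised inputs

saturated_fraction = []
mean_local_grad = []
for _ in range(n_layers):
    W = rng.normal(size=(n_units, n_units))     # unit variance: large for 500 inputs
    h = np.tanh(h @ W)
    saturated_fraction.append(np.mean(np.abs(h) > 0.99))   # share of outputs near -1 or 1
    mean_local_grad.append(np.mean(1.0 - h ** 2))          # average tanh derivative

print(saturated_fraction)
print(mean_local_grad)
```

Most neurons sit at the flat ends of tanh in every layer, so the local derivative that enters the chain rule is close to zero almost everywhere.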

Xavier initialization

So far we have seen that initializing the weights to zero is not good, and that initializing them to large or small random values is not a good method either. Now we will discuss some of the standard initialization methods.

If you look at the pre-activation for the second layer, ‘a₂’, it is a weighted sum of the inputs from the previous layer (the post-activation outputs of the first layer) plus the bias. If the number of inputs to the second layer is very large, there is a possibility that the aggregation ‘a₂’ blows up. So it makes sense that the scale of these weights should be inversely related to the number of input neurons in the previous layer.

If the weights are scaled inversely with the number of input neurons, then when the number of input neurons is very large, which is common in a deep neural network, all these weights take on small values. Hence the pre-activation aggregation stays in a moderate range instead of blowing up. This method of initialization is known as Xavier Initialization.

Xavier Initialization initializes the weights in your network by drawing them from a distribution with zero mean and a specific variance,
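As a minimal sketch, the variant below draws weights from a normal distribution with variance 1 / n_in, which matches the "inversely proportional to the number of input neurons" argument above. This choice is an assumption: the formulation in Glorot and Bengio's paper uses 2 / (n_in + n_out) instead, and the function name `xavier_normal` is simply a label chosen here.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Zero-mean normal weights with variance 1 / n_in (assumed variant)."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
n_units = 500
h = rng.normal(size=(1000, n_units))       # a batch of standardised inputs

stds = []
for _ in range(5):
    h = np.tanh(h @ xavier_normal(n_units, n_units, rng))
    stds.append(h.std())

print(stds)   # spread of tanh activations, layer by layer
```

With this scaling, the tanh activations neither collapse to zero nor pile up at -1 and 1 over the five layers of this toy setup.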

As a rule of thumb, we use Xavier Initialization for Tanh and logistic activation functions. Don’t shoot the messenger here: I am not going into the details of how this formula is derived or what assumptions the derivation makes. If you are interested in going deeper into this concept, read this awesome blog post by andy.

He (He-et-al) Initialization

Pronounced as Hey Initialization. It was introduced in 2015 by He et al. and is similar to Xavier Initialization. In He-Normal Initialization, the weights in your network are drawn from a normal distribution with zero mean and a variance that carries an extra factor of two compared with Xavier Initialization,

NumPy implementation of He-Initialization,
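A minimal sketch of such an implementation is given below, assuming a fully connected layer and the usual He-Normal variance of 2 / n_in. The function name `he_normal` and the layer sizes in the usage line are hypothetical, chosen here only for illustration.

```python
import numpy as np

def he_normal(n_in, n_out, rng=None):
    """Zero-mean normal weights with variance 2 / n_in."""
    rng = np.random.default_rng() if rng is None else rng
    # Scale standard-normal samples by sqrt(2 / n_in).
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

# Example: weights for a layer with 512 inputs and 256 outputs.
W = he_normal(512, 256, np.random.default_rng(0))
print(W.shape, W.std())
```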

He-Initialization is mostly used in the context of ReLU and Leaky ReLU activations. If you want to check out the proof of He-Initialization, go through the blog from mc.ai.

Best Practices

There is no rule written in stone for choosing the right activation function and weight initialization method, so we will just go by the rule of thumb,

  • Xavier initialization is mostly used with tanh and logistic activation functions.
  • He-initialization is mostly used with ReLU or its variants, such as Leaky ReLU.

Conclusion

In this post, we discussed the need for non-linear activation functions in deep neural networks and then we went on to see the merits & demerits of commonly used non-linear activation functions. After that, we looked at different ways of how not to initialize the weights. We then discussed two standard initialization methods. Finally, we have seen the industry best practices on the usage of activation functions and weight initialization methods.

In my next post, we will discuss how to implement these activation functions & weight initialization methods and analyze, how the choice of activation function and weight initialization method will have an effect on accuracy and the rate at which we reduce our loss in a deep neural network using a non-linearly separable toy data set. So make sure you follow me on medium to get notified as soon as it drops.

Until then Peace :)

NK.


Niranjan Kumar is Retail Risk Analyst at HSBC Analytics division. He is passionate about deep learning and AI. He is one of the top writers at Medium in Artificial Intelligence. Connect with me on LinkedIn or follow me on twitter for updates about upcoming articles on deep learning and Artificial Intelligence. Currently, I am looking for opportunities either full-time or freelance projects, in the field of Machine Learning and Deep Learning. Feel free to drop me a message on LinkedIn or you can reach me through email as well. I would love to discuss.

Data Driven Investor

from confusion to clarity, not insanity
