Understanding Deep Neural Networks for Beginners - Part 1

Chamuditha Kekulawala · Published in The Deep Hub

In the previous article we introduced artificial neural networks and trained our first deep neural networks (DNNs). But they were very shallow nets, with just a few hidden layers. What if you need to tackle a very complex problem, such as detecting hundreds of types of objects in high-resolution images?

The challenges of DNNs

Now you need to train a much deeper DNN, perhaps with 10 or more layers, each containing hundreds of neurons, connected by thousands of connections. This would not be a piece of cake:

  1. You would face the vanishing gradients problem (or the related exploding gradients problem), which makes the lower layers very hard to train.
  2. You might not have enough training data for such a large network, or it might be too difficult to label.
  3. Training might be extremely slow.
  4. A model with millions of parameters would severely risk overfitting the training set, especially if there aren’t enough training instances, or they are too noisy.

In this article, we’ll go through all of these problems and present techniques to solve each of them.

Vanishing/Exploding Gradients Problems

As we discussed in the previous article, the backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient along the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step.
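
To make that update step concrete, here is a minimal NumPy sketch; the parameter values, gradients, and learning rate below are made-up placeholders, not taken from any real network:

```python
import numpy as np

# Hypothetical weights of one layer and their gradients from backpropagation
theta = np.array([0.5, -1.2, 0.8])    # current connection weights
grad = np.array([0.1, -0.3, 0.05])    # gradient of the cost w.r.t. each weight
learning_rate = 0.01

# One Gradient Descent step: nudge each parameter against its gradient
theta = theta - learning_rate * grad
print(theta)  # [ 0.499  -1.197   0.7995]
```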

Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem. In some cases, the opposite can happen: the gradients can grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem, which is mostly encountered in Recurrent Neural Networks (RNNs).
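
To build some intuition for the vanishing case, here is a small, purely illustrative NumPy sketch (not from the article, and it ignores the connection weights, chaining only the activation derivatives): the derivative of the logistic function is at most 0.25, so multiplying it across many layers shrinks the gradient exponentially:

```python
import numpy as np

def sigmoid_derivative(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z)), at most 0.25."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Best case: every layer's pre-activation is 0, where the derivative peaks at 0.25
factor = 1.0
for layer in range(1, 11):
    factor *= sigmoid_derivative(0.0)
    print(f"gradient factor after {layer} layers: {factor:.2e}")

# After 10 layers the factor is 0.25**10 ~ 9.5e-07: almost nothing reaches the lower layers
```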

More generally, DNNs suffer from unstable gradients; different layers may learn at widely different speeds. Although this unfortunate behavior had been empirically observed for quite a while, and was one of the reasons DNNs were mostly abandoned for a long time, it was only around 2010 that significant progress was made in understanding it. A 2010 paper by Xavier Glorot and Yoshua Bengio, titled “Understanding the Difficulty of Training Deep Feedforward Neural Networks”, found one of the causes to be the combination of the logistic sigmoid activation function and the weight initialization technique that was most popular at the time.

They showed that with this combination of activation function and initialization technique, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates in the top layers. This is made worse by the fact that the logistic function has a mean of 0.5, not 0.

Looking at the logistic activation function, you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0.
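
A quick NumPy check (assuming the standard logistic function σ(z) = 1 / (1 + e⁻ᶻ)) makes this saturation visible numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(f"z = {z:6.1f}   sigmoid = {sigmoid(z):.5f}   derivative = {sigmoid_derivative(z):.5f}")

# z =  -10.0   sigmoid = 0.00005   derivative = 0.00005
# z =    0.0   sigmoid = 0.50000   derivative = 0.25000
# z =   10.0   sigmoid = 0.99995   derivative = 0.00005
```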

So when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.

Glorot and He Initialization

We can significantly reduce this problem when the signal flows properly in both directions: forward when making predictions, and in reverse when backpropagating gradients. We don’t want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, we need the variance of the outputs of each layer to be equal to the variance of its inputs.

Here’s an analogy: if you set a microphone amplifier’s knob too close to zero, people won’t hear your voice, but if you set it too close to the max, your voice will be saturated and people won’t understand what you are saying. Now imagine a chain of such amplifiers: they all need to be set properly in order for your voice to come out loud and clear at the end of the chain. Your voice has to come out of each amplifier at the same amplitude as it came in.

We also need the gradients to have equal variance before and after flowing through a layer in the reverse direction. It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the fan-in and fan-out of the layer), but a good compromise that has proven to work very well in practice is to initialize the connection weights of each layer randomly as follows:

Normal distribution with mean 0 and variance σ² = 1 / fanₐᵥ𝓰

Uniform distribution between −r and +r, with r = √(3 / fanₐᵥ𝓰)

Here fanₐᵥ𝓰 = (fanᵢₙ + fanₒᵤₜ) / 2 is the average of the fan-in and fan-out. This initialization strategy is called Glorot initialization (or Xavier initialization). If you just replace fanₐᵥ𝓰 with fanᵢₙ, you get the LeCun initialization strategy proposed in 1990. It is equivalent to Glorot initialization when fanᵢₙ = fanₒᵤₜ. It took over a decade for researchers to realize just how important this trick really is. Using Glorot initialization can speed up training considerably, and it is one of the tricks that led to the current success of Deep Learning.
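
As a concrete (and purely illustrative) sketch, here is how you could implement both Glorot variants yourself in NumPy for a single layer's weight matrix; the fan-in and fan-out values are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

def glorot_normal(fan_in, fan_out):
    """Weights ~ Normal(0, sigma^2) with sigma^2 = 1 / fan_avg."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = np.sqrt(1.0 / fan_avg)
    return rng.normal(loc=0.0, scale=sigma, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out):
    """Weights ~ Uniform(-r, +r) with r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = np.sqrt(3.0 / fan_avg)
    return rng.uniform(low=-r, high=r, size=(fan_in, fan_out))

# Example: a layer with 300 inputs (fan-in) and 100 neurons (fan-out)
W = glorot_normal(300, 100)
print(W.shape, W.std())  # the standard deviation should be close to sqrt(1/200) ~ 0.0707
```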

There are similar strategies for different activation functions. The initialization strategy for the ReLU activation function (and its variants, like the ELU activation) is called He initialization. Another activation function called SELU should be used with LeCun initialization (preferably with a normal distribution).

Here’s a summary of which initialization to use with which activation function (σ² is the variance of the normal distribution):

  1. Glorot initialization: no activation, tanh, logistic, softmax; σ² = 1 / fanₐᵥ𝓰
  2. He initialization: ReLU and its variants; σ² = 2 / fanᵢₙ
  3. LeCun initialization: SELU; σ² = 1 / fanᵢₙ
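
In Keras, for example (used here only as an illustrative choice of framework), you can pick these strategies per layer through the kernel_initializer argument; Glorot initialization with a uniform distribution is the default:

```python
from tensorflow import keras

# Glorot (uniform) initialization is the Keras default, so stating it is optional:
dense_glorot = keras.layers.Dense(100, activation="sigmoid",
                                  kernel_initializer="glorot_uniform")

# He initialization, for ReLU and its variants:
dense_he = keras.layers.Dense(100, activation="relu",
                              kernel_initializer="he_normal")

# LeCun initialization, for SELU:
dense_selu = keras.layers.Dense(100, activation="selu",
                                kernel_initializer="lecun_normal")
```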

In the next article we’ll talk about non-saturating activation functions! Thanks for reading 🎉
