Tuning Neural Networks Part II

Considerations For Initialization

Lance Galletti
Oct 30, 2021

This series aims to provide a deep understanding of neural networks by examining how tuning parameters can affect what and how they learn. The content assumes some prior knowledge of neural networks which you can acquire by reading this series.

Part I: The Importance of Normalizing Your Data

Part III: What Activation Functions Allow You To Learn

Let’s start with the following network, which has one hidden layer containing six neurons, all activated with ReLU. Let’s call this Network A:

To keep things simple, throughout this article we will activate hidden neurons (shown above in green) using the ReLU activation function, because it makes activation easy to reason about: a neuron is active if the input to the activation is greater than 0 and inactive otherwise.

Non-Centered Data VS Non-Centered Weights

When all the data is positive and the weights are centered around 0, whether a neuron activates (i.e. whether WX > 0) depends entirely on the weights. Since all the data is positive, we can expect a neuron to be activated about 50% of the time. The simulations in Part I showed that each neuron tends to be activated by either all or none of the data, and rarely by anything in between.
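To make this concrete, here is a minimal numpy sketch of the idea (the 10 input features, 1,000 samples, and 6 neurons are illustrative choices, not taken from the figures): it draws all-positive data and zero-centered weights and measures what fraction of the samples activates each ReLU neuron.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_neurons = 1000, 10, 6

# All-positive data, zero-centered weights, no bias.
X = rng.uniform(0.0, 1.0, size=(n_samples, n_features))
W = rng.normal(0.0, 1.0, size=(n_features, n_neurons))

# A ReLU neuron is "active" for a sample when its pre-activation is > 0.
active = (X @ W) > 0

# Fraction of the samples that activates each neuron. These fractions tend
# to cluster toward 0 and 1 rather than spread evenly across (0, 1).
print(np.round(active.mean(axis=0), 2))
```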

If the neuron activates, the gradient of a positive weight will always be positive (since the data is also always positive) and the gradient of a negative weight will always be negative. This severely limits the network’s ability to learn since the direction a weight can move in during training is fixed by its initial value.

When the data is centered around 0 but the weights are all positive, whether a neuron activates (i.e. whether WX > 0) depends entirely on the data. Since all the weights are positive, we can expect every neuron to activate on roughly the same 50% chunk of the data. This means that, for a given input, either all the neurons will be activated or none will be. It’s rare to find a combination of weights that makes only a few neurons activate.

This also limits the network’s ability to learn because the direction of the gradient is determined only by the data.

As long as both the data and the weights take values on both sides of 0, whether a neuron activates depends on both the weights and the data. This means the gradients can update the weights more freely in different directions, since they can be positive or negative regardless of initialization.
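One rough way to see the difference between these regimes is to compare how correlated the neurons’ activation patterns are when only the data is centered versus when both the data and the weights are. Below is a numpy sketch of that comparison (dimensions are again illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features, n_neurons = 1000, 10, 6

# Zero-centered data.
X = rng.normal(0.0, 1.0, size=(n_samples, n_features))

# Case 1: all-positive weights (activation driven by the data alone).
W_pos = rng.uniform(0.0, 1.0, size=(n_features, n_neurons))
# Case 2: zero-centered weights (activation depends on both weights and data).
W_cen = rng.normal(0.0, 1.0, size=(n_features, n_neurons))

for name, W in [("all-positive weights", W_pos), ("centered weights", W_cen)]:
    active = (X @ W) > 0                       # per-sample, per-neuron activity
    frac = active.mean(axis=0)                 # each neuron fires on ~50% of data
    corr = np.corrcoef(active.T.astype(float))
    off_diag = corr[~np.eye(n_neurons, dtype=bool)]
    print(name, "| fractions:", np.round(frac, 2),
          "| mean pairwise correlation:", round(float(off_diag.mean()), 2))

# With all-positive weights the neurons' activation patterns are clearly
# correlated (they tend to switch on and off on the same samples), while with
# zero-centered weights the patterns are close to uncorrelated.
```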

Why Include A Bias Term

Even when both weights and data are centered, having neurons activate on a random 50% portion of the data still seems rigid. The more neurons a network contains, the more overlap and duplication we might expect in the learned features.

We may be able to learn more effectively by further dividing and conquering. What if each neuron gets a random proportion of the data?

But how?

Let’s include a bias term!

Suddenly the proportion of data that makes a given neuron activate is more flexible, falling anywhere between 0 and 100% of the data.

Shifting the bias away from 0 leads to systematic activation or deactivation of neurons in the network:

This is something to watch out for as neurons will be frozen if the bias is too large or too small on average.
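Both effects are easy to simulate. In the numpy sketch below (dimensions and bias scales are illustrative), a zero-centered random bias spreads the per-neuron activation fractions away from 50%, while a bias whose mean sits far below 0 leaves most neurons almost never active:

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features, n_neurons = 1000, 10, 6

X = rng.normal(0.0, 1.0, size=(n_samples, n_features))    # centered data
W = rng.normal(0.0, 1.0, size=(n_features, n_neurons))     # centered weights
pre = X @ W

for label, b in [
    ("no bias",               np.zeros(n_neurons)),
    ("centered bias (std 3)", rng.normal(0.0, 3.0, size=n_neurons)),
    ("bias centered at -5",   rng.normal(-5.0, 1.0, size=n_neurons)),
]:
    frac = ((pre + b) > 0).mean(axis=0)  # fraction of data activating each neuron
    print(f"{label:23s}", np.round(frac, 2))

# Without a bias every neuron fires on roughly 50% of the samples; a centered
# random bias lets that fraction land anywhere between 0% and 100%; a bias
# centered far below 0 freezes most neurons in the "off" state.
```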

What about the Variance?

So far we’ve talked about centering the data, weights, and bias before training. Let’s see what happens as the variance of each of their distributions changes.

Changing the variance of the distribution of the bias (thus changing the range of possible values it can take) allows us to adjust the tail end of the distribution:

Even when centering at 0, we need to be careful of the range of possible values of the data, weights and biases.

If the variance (range) of the data and the weights at initialization is too large, the gradients may be too large and the network may overshoot the minimum of the cost. This resembles choosing a step size that is too large during gradient descent.

Conversely, if the range is too small, the gradients may be very small, impeding the network’s ability to learn, or the weight initialization may be nearly constant (zero initialization, in this case), which would freeze the model because all the gradients would be the same.
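A quick way to see both failure modes is to push zero-centered data through a deep stack of ReLU layers initialized at different scales and watch the size of the activations. The numpy sketch below uses an illustrative 10-layer, 100-unit network; the middle scale, He initialization, is a standard reference point for ReLU networks rather than something discussed in this article.

```python
import numpy as np

rng = np.random.default_rng(3)
depth, width, n_samples = 10, 100, 512

X = rng.normal(0.0, 1.0, size=(n_samples, width))

for label, std in [("std too small (0.01)", 0.01),
                   ("He init sqrt(2/width)", np.sqrt(2.0 / width)),
                   ("std too large (1.0)", 1.0)]:
    h = X
    for _ in range(depth):
        W = rng.normal(0.0, std, size=(width, width))
        h = np.maximum(0.0, h @ W)             # ReLU layer, no bias
    print(f"{label:22s} final-layer activation std: {h.std():.3g}")

# A tiny weight scale shrinks the activations (and with them the gradients)
# layer after layer, a large scale blows them up exponentially with depth,
# and a scale matched to the layer width keeps them in a reasonable range.
```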

If the variance (range) of the bias is too large, we will see neurons switch on or off completely. It will be rare for a neuron to activate on about 50% of the data at initialization. As the variance is reduced, the initialization of the bias gets closer to constant, which is not as detrimental to learning as constant initialization is for the weights.

Many Hidden Layers

Now, consider the following network:

This network is just Network A from the previous example, but with an added layer before it. The intuition we gained above for Network A still applies, but now the “data” for the Network A portion of the network is the activated output of the first layer.

The “data” for the second hidden layer (with six neurons) is the output of the first hidden layer (with two neurons). If that “data” is not zero-centered, we would encounter the same issues we learned about in Part I.

The sigmoid and ReLU outputs, for example, are not zero-centered: sigmoid is always positive and ReLU is never negative. Therefore, using the sigmoid or ReLU activation in hidden layers is not recommended, as it may freeze neurons in deeper layers of the network at initialization because the learned features would be effectively constant. Sigmoid also suffers from the vanishing gradient problem in deeper layers.
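A minimal numpy check of this point: feed zero-centered data through a single hidden layer with centered weights and compare the mean of the pre-activations with the mean of the ReLU and sigmoid outputs (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(1000, 10))       # zero-centered "data"
W = rng.normal(0.0, 1.0, size=(10, 6))          # zero-centered weights
pre = X @ W                                     # roughly zero-centered

relu_out = np.maximum(0.0, pre)
sigmoid_out = 1.0 / (1.0 + np.exp(-pre))

print("pre-activation mean:", round(float(pre.mean()), 3))          # close to 0
print("ReLU output mean:   ", round(float(relu_out.mean()), 3))     # clearly > 0
print("sigmoid output mean:", round(float(sigmoid_out.mean()), 3))  # close to 0.5

# Even though this layer's inputs and pre-activations are centered, its
# activated outputs are not, so the next layer receives non-centered "data".
```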

Below are the activations of a network after training. The data, weights, and biases are all centered at initialization and all hidden layers are ReLU activated.

The first layer seems to activate somewhat randomly. Then, as we progress through the network, neurons become increasingly frozen. In this case, using fewer layers would have been better for ReLU.
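The trained network from the figure is not reproduced here, but a diagnostic along these lines (a numpy sketch that assumes the trained layers are available as (W, b) pairs) is one way to check a ReLU network of your own for the same effect:

```python
import numpy as np

def relu_activation_report(layers, X):
    """For each hidden layer, print the fraction of samples that activates
    each ReLU unit and count the units that are frozen on this data
    (never active or always active).

    layers: list of (W, b) pairs, W of shape (fan_in, fan_out)
    X:      inputs of shape (n_samples, fan_in of the first layer)
    """
    h = X
    for i, (W, b) in enumerate(layers):
        h = np.maximum(0.0, h @ W + b)
        frac = (h > 0).mean(axis=0)            # per-unit activation fraction
        frozen = int(np.sum((frac == 0.0) | (frac == 1.0)))
        print(f"layer {i}: fractions {np.round(frac, 2)}, {frozen} frozen unit(s)")

# Example with a random (untrained) 2 -> 2 -> 6 ReLU network:
rng = np.random.default_rng(5)
layers = [(rng.normal(0, 1, (2, 2)), np.zeros(2)),
          (rng.normal(0, 1, (2, 6)), np.zeros(6))]
relu_activation_report(layers, rng.normal(0, 1, (1000, 2)))
```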

Conclusion

Even with zero-centered data, the network may still be frozen at initialization if:

  1. The weights and bias are not zero-centered.
  2. The variance (range) of the data, weights, or bias is too large.
  3. Non-zero-centered activation functions are used in deep hidden layers of the network.

Consider therefore:

  1. Including a bias term to randomize which portion of the data each neuron activates on, and possibly reduce duplicated work in the network.
  2. Using a zero-centered activation in hidden layers to help counteract the loss of normalization caused by previous layers, or to preserve normalization throughout.
  3. That the data, weights, and bias are all interlinked and must be carefully examined together at initialization.
  4. Using Batch Normalization when using multiple hidden layers (a sketch follows below).
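On point 4, a minimal PyTorch sketch of what this looks like in practice; the layer sizes (2 inputs, hidden layers of 2 and 6 units, 1 output) only echo the diagrams in this article and are not taken from its code:

```python
import torch.nn as nn

# Each BatchNorm1d re-centers and re-scales its layer's outputs, so the next
# layer receives approximately normalized "data" even though ReLU outputs on
# their own are not zero-centered.
model = nn.Sequential(
    nn.Linear(2, 2),
    nn.BatchNorm1d(2),
    nn.ReLU(),
    nn.Linear(2, 6),
    nn.BatchNorm1d(6),
    nn.ReLU(),
    nn.Linear(6, 1),
)
```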

Part III: What Activation Functions Allow You To Learn

Acknowledgments

Thank you to Yijin Yang, Cameron Garrison, Maria Shevchuk, James Kunstle, Anqi Lin, Christina Xu, Mingcheng Xu for their contributions.

