Tuning Neural Networks Part I

The Importance of Normalizing Your Data

Lance Galletti
6 min read · Oct 10, 2021

This series aims to provide a deep understanding of neural networks by examining how tuning parameters can affect what and how they learn. The content assumes some prior knowledge of neural networks which you can acquire by reading this series.

Part II: Considerations For Initialization

Part III: What Activation Functions Allow You To Learn

“Divide and conquer” is at the core of how Neural Networks learn. Increasing the number of nodes in the network should theoretically (according to the universal approximation theorem) result in a better model of the target (ignoring overfitting).

Each learned feature is an activation function applied to a weighted sum of the nodes in the previous layer.

Without an activation function, the learned features simplify to a single linear function of the input. This means that changing the architecture of the network (adding layers and nodes) cannot undo this simplification, and we can no longer “divide and conquer”.
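As a quick sanity check of that collapse (the layer sizes here are arbitrary), two stacked linear layers compute exactly the same thing as a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with no activation function: h = W1 @ x, y = W2 @ h
W1 = rng.uniform(-1, 1, size=(4, 3))   # 3 inputs -> 4 hidden nodes
W2 = rng.uniform(-1, 1, size=(2, 4))   # 4 hidden nodes -> 2 outputs
x = rng.uniform(-1, 1, size=3)

y_stacked = W2 @ (W1 @ x)      # output of the two-layer network
y_single = (W2 @ W1) @ x       # a single linear layer with weights W2 @ W1

print(np.allclose(y_stacked, y_single))   # True
```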

However, using non-linear activation functions is still not sufficient to ensure that the Neural Network can divide and conquer in practice.

When does a node activate?

To understand why, let’s look at the ReLU activation function (we look at activation functions in depth in Part III):
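As a reminder, ReLU just clips negative values to zero:

```python
import numpy as np

def relu(z):
    # 0 for negative inputs (not activated), the input itself otherwise (activated)
    return np.maximum(0, z)
```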

ReLU simplifies reasoning about node activation: a node is either at zero (not activated) or a linear function of the input (activated).

If the weighted sum of the nodes in the previous layer is always greater than zero (i.e. for all data points) then the node is no different than a node that has no activation function.

So what makes the weighted sum of the nodes in the previous layer always be greater than zero?

Reading the Keras documentation we see that the default initialization method is to generate weights uniformly at random on the interval [-L, L] where L is some parameter.

Let’s assume that all weights come from U(-1, 1) and that each input X_i also comes from U(-1, 1). Ignoring the bias for now, how often is WX > 0 for a given hidden neuron in the network?

The distribution of the proportion of our data that makes a given neuron activate at initialization is tough to compute so let’s run a simulation (again ignoring the bias term for now):
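Here is a minimal version of that simulation (the input dimension, number of neurons, and sample size used for the original figure aren’t stated, so they are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_inputs, n_neurons = 1000, 10, 500   # assumed sizes

# Weights and data both drawn from U(-1, 1); change the data range to, say,
# U(8, 10) to reproduce the shifted experiment described below
X = rng.uniform(-1, 1, size=(n_points, n_inputs))
W = rng.uniform(-1, 1, size=(n_inputs, n_neurons))

pre_activations = X @ W                            # weighted sums WX, no bias
proportions = (pre_activations > 0).mean(axis=0)   # fraction of points that activate each neuron

print(proportions.mean(), proportions.std())       # distribution centered near 0.5
```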

So we have a good chance that about half our data will activate a given neuron after initialization — each neuron getting a random 50% chunk of our data. This seems to be due to the symmetry of our data around 0.

What happens if our data is not centered around 0? Here is an animation of the distribution of the proportion of data responsible for activating a given neuron as our data shifts from U(-1, 1) to U(8, 10) while the weights remain U(-1, 1).

As our data shifts farther from 0, each neuron activates on either all of the data or none of it at initialization. This is exactly the situation we want to avoid.

In the case of ReLU, if a neuron never activates, then it is a constant 0 feature and is essentially useless. If the neuron activates on all our data, then it is learning a global linear transformation of the data, which is also not helpful in the context of neural networks.

This is the case for all neurons, so adding more layers and neurons won’t help us.

Does centering around 0 affect the distribution of the number of activated neurons? Again, fixing the weight distribution to be U(-1, 1) but varying the distribution of the data:
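A sketch of this experiment, reusing the same assumed sizes as above: for each data point, count the fraction of neurons it activates as the data distribution shifts away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_inputs, n_neurons = 1000, 10, 500          # assumed sizes
W = rng.uniform(-1, 1, size=(n_inputs, n_neurons))     # weights stay U(-1, 1)

for low, high in [(-1, 1), (4, 6), (8, 10)]:           # data shifts away from 0
    X = rng.uniform(low, high, size=(n_points, n_inputs))
    frac_active = ((X @ W) > 0).mean(axis=1)           # fraction per data point
    print((low, high), round(frac_active.mean(), 3))   # stays near 0.5 every time
```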

So we can still expect about half the neurons to be activated at any given time — but as the data shifts away from 0, those neurons will be activated by all or none of the data.

This makes intuitive sense: if all values of X are positive, then the sign of WX is determined entirely by the weights, and since W’s distribution is centered at zero, WX > 0 about half the time.

Let’s visualize this phenomenon on an actual Neural Network, one that learns to distinguish the blue points from the green points below:

For each data point we feed to our network, we can visualize the activations of each neuron. On the left are the activations right after initialization (as above, the hidden layers — in green — use the ReLU activation and the output layer uses sigmoid) and on the right are the activations after training:

Left = at initialization, Right = after training | Green/Blue = activated, White= not activated

Node values close to 0 are in white and node values far from 0 are Green / Blue.
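The exact architecture behind these figures isn’t given, but a Keras sketch of this kind of network (the layer sizes are assumptions) might look like:

```python
import numpy as np
from tensorflow import keras

# Assumed architecture: 2D input, ReLU hidden layers (green above), sigmoid output
inputs = keras.Input(shape=(2,))
h1 = keras.layers.Dense(8, activation="relu")(inputs)
h2 = keras.layers.Dense(8, activation="relu")(h1)
outputs = keras.layers.Dense(1, activation="sigmoid")(h2)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A second model exposing every layer lets us read off the activations
# for a single data point, as in the visualizations
activation_model = keras.Model(inputs, [h1, h2, outputs])
activations = activation_model.predict(np.array([[0.5, -0.5]]), verbose=0)
print([a.shape for a in activations])
```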

We see a lot of movement and variability in these activations: every input data point makes a very different number of nodes activate or stay inactive.

Here is the animation of the learning process:

Now take a look at what happens when the data is not centered around 0! For data centered at [10, 10], you will see on the left the network after initialization and on the right after training:

Left = at initialization, Right = after training | Green/Blue = activated, White= not activated

The network seems frozen. The features are either global linear transformations of the data or constantly 0. And after epochs of training (image on the right) this does not improve, because improving would require finding weights that activate on only a portion of the data, which we’ve established is very unlikely. Here is the animation of the learning process:

So is this just a ReLU issue? Or can we use a different activation function to bypass the learning dead-end that our unnormalized data has created?

It depends on how far away from zero the data has shifted. Consider the sigmoid function:
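Sigmoid squashes its input into (0, 1), and it saturates quickly once the input is more than a few units away from zero:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# With inputs around 10 and weights of order 1, WX easily reaches magnitudes of ~20
for z in [-20, -10, 10, 20]:
    print(z, sigmoid(z))   # ~2e-9, ~4.5e-5, ~0.99995, ~1 - 2e-9
```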

If the data is around [10,10] as above, then since WX can take on very large magnitudes, we can expect 𝞂(WX) to be either effectively 0 or effectively 1 most of the time. In both cases the neuron has learned an effectively constant feature which does not help the learning process.

So we can expect the same behavior from other activation functions.

Here are the activations after initialization (on the left) and after training (on the right) for the same Network as above but using the sigmoid activation for all the hidden layers:

Left = at initialization, Right = after training | Green/Blue = activated, White= not activated

There doesn’t seem to be much of a difference between the activations at initialization and the activations after learning. We can plot the decision boundary through the learning process to see what was learned:

It looks like our model is still just doing Logistic Regression (i.e. the features learned are just linear transformations of the input).

Conclusion

In order for the model to learn effectively, it needs to divide and conquer. Each neuron should learn a transformation that activates or deactivates on a small portion of the data. These local transformations can then be aggregated to accomplish the global learning task.

So look at the data before trying all sorts of activation functions, initialization methods, architectures, etc. Normalizing your data is not just good practice in order to leverage the power of Neural Networks — it’s a requirement.
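A minimal sketch of that normalization step (standardizing each feature to zero mean and unit variance, with the statistics computed on the training data only):

```python
import numpy as np

def standardize(X_train, X_test):
    # Compute the statistics on the training split only, then apply to both splits
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8   # guard against constant features
    return (X_train - mean) / std, (X_test - mean) / std

rng = np.random.default_rng(0)
X_train = rng.uniform(8, 10, size=(100, 2))   # data far from 0, as in the example above
X_test = rng.uniform(8, 10, size=(20, 2))
X_train, X_test = standardize(X_train, X_test)   # now roughly centered at 0
```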

Part II: Considerations For Initialization

Part III: What Activation Functions Allow You To Learn

Acknowledgments

Thank you to James Kunstle, Yijin Yang, Cameron Garrison, Maria Shevchuk, Anqi Lin, Christina Xu, Mingcheng Xu for their contributions.
