Artificial Neural Networks- An intuitive approach Part 4

Niketh Narasimhan
Published in Analytics Vidhya
Jul 25, 2020

A continuation of an earlier article.

Please find the link to the earlier article below:

https://medium.com/@nikethnarasimhan/artificial-neural-networks-an-intuitive-approach-part-3-a5888af9ac0

Contents

  1. Weights Initialization
  2. Xavier initialization
  3. He initialization
  4. Normalization methods

Weights Initialization

The importance of effective initialization

To build a machine learning algorithm, usually you’d define an architecture (e.g. Logistic regression, Support Vector Machine, Neural Network) and train it to learn parameters. Here is a common training process for neural networks:

  1. Initialize the parameters
  2. Choose an optimization algorithm
  3. Repeat these steps:
     • Forward propagate an input
     • Compute the cost function
     • Compute the gradients of the cost with respect to the parameters using backpropagation
     • Update each parameter using the gradients, according to the optimization algorithm

Then, given a new data point, you can use the model to predict its class.
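
To make these steps concrete, here is a deliberately minimal NumPy sketch of the loop for a single sigmoid unit trained with plain gradient descent; the shapes, learning rate and iteration count are illustrative choices:

    import numpy as np

    def train(X, Y, n_iters=1000, lr=0.1):
        # 1. Initialize the parameters (a single sigmoid unit, for brevity)
        W = np.random.randn(X.shape[0], 1) * 0.01
        b = np.zeros((1, 1))
        m = X.shape[1]
        for _ in range(n_iters):
            # Forward propagate an input
            Z = np.dot(W.T, X) + b
            A = 1.0 / (1.0 + np.exp(-Z))              # sigmoid activation
            # Compute the cost function (binary cross-entropy)
            cost = -np.mean(Y * np.log(A + 1e-8) + (1 - Y) * np.log(1 - A + 1e-8))
            # Compute the gradients of the cost w.r.t. the parameters (backpropagation)
            dZ = A - Y
            dW = np.dot(X, dZ.T) / m
            db = np.mean(dZ, keepdims=True)
            # Update each parameter (plain gradient descent as the optimizer)
            W -= lr * dW
            b -= lr * db
        return W, b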

The initialization step can be critical to the model’s ultimate performance, and it requires the right method. To illustrate this, consider the three-layer neural network below. You can try initializing this network with different methods and observe the impact on learning.

Case 1: A too-large initialization leads to exploding gradients

If the weights are initialized with very large values, the term np.dot(W,X)+b becomes significantly larger. If an activation function like sigmoid() is then applied, it maps these values close to 1, where the slope of the gradient changes slowly and learning takes a long time. When these activations are used in backward propagation, this leads to the exploding gradient problem: the gradients of the cost with respect to the parameters become too big, which causes the cost to oscillate around its minimum value.

Case 2: A too-small initialization leads to vanishing gradients

If the weights are initialized with very small values, the term np.dot(W,X)+b stays close to 0 and the activations get mapped to values near 0.

When these activations are used in backward propagation, this leads to the vanishing gradient problem. The gradients of the cost with respect to the parameters are too small, leading to convergence of the cost before it has reached the minimum value.

Intuitively, we can use the following reasoning to understand the points mentioned above:

  • When your weights, and hence your gradients, are close to zero, the gradients in your upstream layers vanish because you are multiplying small values, e.g. 0.1 x 0.1 x 0.1 x 0.1 = 0.0001. Hence, it becomes difficult to find an optimum, since your upstream layers learn slowly.
  • The opposite can also happen. When your weights, and hence your gradients, are greater than 1, the multiplications become very large: 10 x 10 x 10 x 10 = 10,000. The gradients may therefore explode, causing number overflows in your upstream layers and rendering them “untrainable” (even killing off the neurons in those layers).

Thus we can conclude that we should initialize the weights so that the variance of the activations stays approximately equal to 1 across all layers.
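
A toy experiment makes this concrete. The sketch below (assuming a 10-layer tanh network with 512 units per layer, purely for illustration) forward-propagates one input with a “too small” and a “too large” weight scale:

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(512, 1)                 # one input vector

    for scale in (0.01, 1.0):                   # "too small" vs "too large" init
        a = x
        for _ in range(10):                     # 10 tanh layers, just for illustration
            W = np.random.randn(512, 512) * scale
            a = np.tanh(np.dot(W, a))
        print("init scale", scale, "-> std of activations after 10 layers:", a.std())

With the small scale the activations shrink toward 0 (the vanishing regime), while with the large scale the pre-activations become huge and the units saturate, which is the unstable regime described above.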

Xavier initialization

We need to pick the weights from a Gaussian distribution with zero mean and a variance of 1/N, where N is the number of input neurons.

This strategy still initializes the weights randomly, e.g. from a normal distribution, but with a specific variance chosen so that the variance of each layer’s outputs stays close to 1. Xavier initialization is typically used with the tanh activation function.
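
A minimal sketch of Xavier initialization for a fully connected layer (the 784 and 256 layer sizes are just an example):

    import numpy as np

    def xavier_init(n_in, n_out):
        # Draw weights from N(0, 1/n_in) so the variance of the layer's output stays ~1
        return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

    W1 = xavier_init(784, 256)   # e.g. a 784 -> 256 tanh layer
    print(W1.var())              # close to 1/784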

He initialization

When your neural network is ReLU-activated, He initialization is one of the methods to choose. Mathematically, it attempts to do the same thing as Xavier initialization.

The difference is related to the nonlinearity of the ReLU activation function, which makes it non-differentiable at x = 0; at other values its derivative is either 0 or 1, as shown in the image above. The best weight initialization strategy here is to initialize the weights randomly from a Gaussian with zero mean and a variance of 2/N, where N is the number of input neurons.
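
A corresponding sketch for He initialization, again with illustrative layer sizes:

    import numpy as np

    def he_init(n_in, n_out):
        # Draw weights from N(0, 2/n_in); the factor 2 compensates for ReLU
        # setting roughly half of the activations to zero.
        return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)

    W1 = he_init(784, 256)
    print(W1.var())              # close to 2/784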

Normalization methods:

Let us recall the meanings of normalization and standardization in their most basic form.

A typical normalization process consists of scaling numerical data down to be on a scale from zero to one, and a typical standardization process consists of subtracting the mean of the dataset from each data point, and then dividing that difference by the data set’s standard deviation.

This forces the standardized data to take on a mean of zero and a standard deviation of one. In practice, this standardization process is often just referred to as normalization as well.
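
For a tiny illustrative vector, the two operations look like this:

    import numpy as np

    x = np.array([2.0, 10.0, 4.0, 8.0, 6.0])

    # Normalization: scale the data onto a [0, 1] range
    x_norm = (x - x.min()) / (x.max() - x.min())

    # Standardization: subtract the mean, divide by the standard deviation
    x_std = (x - x.mean()) / x.std()

    print(x_norm)                        # values between 0 and 1
    print(x_std.mean(), x_std.std())     # approximately 0 and 1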

In general, this all boils down to putting our data on some type of known or standard scale. Why do we do this?

Well, if we didn’t normalize our data in some way, we might have some numerical data points in our data set that are very high, and others that are very low.

For now, understand that this imbalanced, non-normalized data may cause problems with our network that make it drastically harder to train. Additionally, non-normalized data can significantly decrease our training speed.

The larger data points in Non-Normalized data cause instability in neural networks because:

  1. The large inputs can cascade down through the layers in the network, causing imbalanced gradients, which may lead to the exploding gradient problem.
  2. Non-normalized data significantly decreases the training speed.
  3. If the data is not normalized, small changes in the weights can shift the decision boundary drastically.

Let us introduce a new and important concept:

Internal Covariate shift:

Let’s say you have a goal to reach: which is easier, a fixed goal or one that keeps moving about? Clearly, a static goal is much easier to reach than a dynamic one.

Each layer in a neural net has a simple goal: to model the input from the layer below it, so each layer tries to adapt to its input. For hidden layers, things get a bit more complicated: the statistical distribution of their input changes after a few iterations. If the input distribution keeps changing (this is called internal covariate shift), the hidden layers keep trying to adapt to the new distribution, which slows down convergence. It is like a goal that keeps changing for the hidden layers.

Very deep models involve the composition of several functions or layers. The gradient tells how to update each parameter, under the assumption that the other layers do not change. In practice, we update all of the layers simultaneously.

Because all layers are changed during an update, the update procedure is forever chasing a moving target.

For example, the weights of a layer are updated given an expectation that the prior layer outputs values with a given distribution. This distribution is likely changed after the weights of the prior layer are updated.

So the batch normalization (BN) algorithm (covered ahead in depth) tries to normalize the inputs to each hidden layer so that their distribution is fairly constant as training proceeds. This improves convergence of the neural net.

Let us quickly sum up why normalization is essential:

  1. Every feature is normalized and thus transformed onto the same scale, so each feature’s contribution remains unbiased regardless of whether its raw values are high or low.
  2. It reduces internal covariate shift, the change in the distribution of network activations due to the change in network parameters during training. To improve training, we seek to reduce this shift.
  3. Batch norm is known to make the loss surface smoother, giving a better-defined decision boundary.
  4. It makes optimization faster, because normalization tackles the exploding gradient problem and keeps the weights on a similar scale.

Let us deep dive into the various Normalization techniques:

Batch Normalization:

Batch normalization is a normalization method that normalizes the activations in a network across the mini-batch. For each feature, it computes the mean and variance of that feature over the mini-batch, then subtracts the mean and divides the feature by its mini-batch standard deviation.

Note: gamma and beta in the formulas above are learnable parameters. They can be used to scale and shift the normalized activations back toward the original distribution if that is what the network needs.

Batch normalization Steps
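
As a rough sketch, these steps for a mini-batch X of shape (batch, features) can be written in NumPy as follows (epsilon, gamma and beta are illustrative values):

    import numpy as np

    def batch_norm(X, gamma, beta, eps=1e-5):
        # X has shape (batch, features); statistics are taken over the batch axis
        mu = X.mean(axis=0)                  # per-feature mean over the mini-batch
        var = X.var(axis=0)                  # per-feature variance over the mini-batch
        X_hat = (X - mu) / np.sqrt(var + eps)
        return gamma * X_hat + beta          # learnable scale and shift

    X = np.random.randn(32, 4) * 5 + 3       # a toy mini-batch
    out = batch_norm(X, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0), out.std(axis=0))  # ~0 and ~1 per feature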

Batch Normalization advantages:

  1. Batch normalization mitigates the exploding gradient problem.
  2. Batch normalization makes the loss surface “easier to navigate”, making optimization easier, enabling the use of higher learning rates, and improving model performance across multiple tasks.

Batch Normalization disadvantages:

  1. Dependence on batch size → If we have a batch size of 1, the variance becomes 0, so batch norm doesn’t work. When the mini-batch is too small, the statistics become too noisy and training is affected.
  2. Recurrent neural networks → Batch norm doesn’t work well with RNNs. In an RNN, the recurrent activations at each time-step have different statistics, which means we would have to fit a separate batch normalization layer for each time-step. This makes the model more complicated and, more importantly, forces us to store the statistics for each time-step during training.

Weight Normalization:

Weight normalization is a method that normalizes the weights instead of the mini-batches.

Weight normalization re-parameterizes each weight vector as w = (g / ‖v‖) · v, where g is a learnable scalar that carries the norm and v is a learnable vector of the same dimension as w that carries the direction.

Similar to batch normalization, weight normalization does not reduce the expressive power of the network. What it does is separate the norm of the weight vector from its direction. It then optimizes both g and v using gradient descent.
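
A minimal NumPy sketch of this re-parameterization (the vector size and the value of g are arbitrary illustrations; in practice both g and v are learned):

    import numpy as np

    def weight_norm(v, g):
        # Re-parameterize the weight vector: w = g * v / ||v||
        # g (a scalar) carries the norm, v carries only the direction.
        return g * v / np.linalg.norm(v)

    v = np.random.randn(10)      # direction parameter
    g = 2.5                      # norm parameter
    w = weight_norm(v, g)
    print(np.linalg.norm(w))     # equals g, independent of the scale of v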

In practice, mean-only batch normalization is combined with weight normalization to get the desired behaviour even with small mini-batches. This means the mean of the mini-batch is subtracted out, but the activations are not divided by the variance; weight normalization takes the place of dividing by the variance.

Advantages

  1. The mean and variance of the activations are independent of the batch.
  2. Weight normalization is often much faster than batch normalization. In convolutional neural networks, the number of weights tends to be far smaller than the number of inputs, so weight normalization is computationally cheaper than batch normalization. Batch normalization requires a pass through all the elements of the input, which can be extremely expensive, especially when the dimensionality of the input is high, as in the case of images. Convolutions use the same filter at multiple locations, so a pass through the weights is a lot faster.

“Mean-only batch Normalization” with weight normalization.

This method is the same as batch normalization except it does not divide the inputs by the standard deviation or rescale them. Though this method counteracts some of the computational speed-up of weight normalization, it is cheaper than batch-normalization since it does not need to compute the standard deviations. This method provides the following benefits:

1. It makes the mean of the activations independent of the weights

Weight normalization on its own cannot isolate the mean of the activations from the weights of the layer, which causes high-level dependencies between the means of each layer. Mean-only batch normalization resolves this problem.

2. It adds “gentler noise” to the activations

One of the side-effects of batch normalization is that it adds some stochastic noise to the activations, as a result of using noisy estimates computed on the mini-batches. This has a regularization effect in some applications but can be harmful in noise-sensitive domains like reinforcement learning. The noise caused by the mean estimates, however, is “gentler”, since by the law of large numbers the mean of the activations is approximately normally distributed.

Layer Normalization

A mini-batch consists of multiple examples with the same number of features. Mini-batches are matrices — or tensors if each input is multi-dimensional — where one axis corresponds to the batch and the other axis — or axes — correspond to the feature dimensions.

Batch normalization normalizes the input features across the batch dimension. The key feature of layer normalization is that it normalizes the inputs across the features.

It is very similar to batch normalization; however, the difference is easiest to see in the diagrams below:

In batch normalization, the statistics are computed across the batch and are the same for each example in the batch. In contrast, in layer normalization, the statistics are computed across each feature and are independent of other examples.
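
A minimal sketch of layer normalization over a (batch, features) input (gamma and beta are shown here as simple vectors):

    import numpy as np

    def layer_norm(X, gamma, beta, eps=1e-5):
        # X has shape (batch, features); statistics are taken over the feature axis,
        # so each example is normalized independently of the rest of the batch.
        mu = X.mean(axis=1, keepdims=True)
        var = X.var(axis=1, keepdims=True)
        X_hat = (X - mu) / np.sqrt(var + eps)
        return gamma * X_hat + beta

    X = np.random.randn(3, 6)                      # works even with a batch of 1
    out = layer_norm(X, gamma=np.ones(6), beta=np.zeros(6))
    print(out.mean(axis=1), out.std(axis=1))       # ~0 and ~1 per example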

Advantages:

  • Layer norm works better than batch norm for RNNs.
  • The independence between inputs means that each input has its own normalization operation, allowing arbitrary mini-batch sizes to be used.

Instance (or Contrast) Normalization

Instance normalization is similar to layer normalization but goes one step further: it computes the mean and standard deviation and normalizes across each channel in each training example.
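
A rough sketch for a batch of images of shape (batch, channels, height, width):

    import numpy as np

    def instance_norm(X, eps=1e-5):
        # X has shape (batch, channels, height, width); each (example, channel)
        # pair is normalized over its own spatial dimensions only.
        mu = X.mean(axis=(2, 3), keepdims=True)
        var = X.var(axis=(2, 3), keepdims=True)
        return (X - mu) / np.sqrt(var + eps)

    X = np.random.randn(2, 3, 8, 8)   # 2 images, 3 channels, 8x8 pixels
    print(instance_norm(X).mean(axis=(2, 3)))  # ~0 for every example/channel pair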

The advantages of instance normalization are mentioned below:

  • This normalization simplifies the learning process of a model.
  • Instance normalization can be applied at test time.
  • However, it is specific to images and not trivially extendable to RNNs.
  • Experimental results show that instance normalization performs well on style transfer when replacing batch normalization.
  • Recently, instance normalization has also been used as a replacement for batch normalization in GANs.

Group Normalization

Group normalization, as its name suggests, computes the mean and standard deviation over groups of channels for each training example. In a way, group normalization is a combination of layer normalization and instance normalization: when we put all the channels into a single group, group normalization becomes layer normalization, and when we put each channel into its own group, it becomes instance normalization.

Group normalization can be seen as an alternative to batch normalization. It works by dividing the channels into groups and computing the mean and variance within each group for normalization, i.e. normalizing the features within each group. Unlike batch normalization, group normalization is independent of the batch size, and its accuracy is stable across a wide range of batch sizes.
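
A sketch of group normalization over a (batch, channels, height, width) input; the group count below is an illustrative choice (one group per channel recovers instance normalization, a single group recovers layer normalization):

    import numpy as np

    def group_norm(X, num_groups, eps=1e-5):
        # X has shape (batch, channels, height, width); channels are split into
        # groups and statistics are computed per example, per group.
        N, C, H, W = X.shape
        Xg = X.reshape(N, num_groups, C // num_groups, H, W)
        mu = Xg.mean(axis=(2, 3, 4), keepdims=True)
        var = Xg.var(axis=(2, 3, 4), keepdims=True)
        Xg = (Xg - mu) / np.sqrt(var + eps)
        return Xg.reshape(N, C, H, W)

    X = np.random.randn(2, 6, 8, 8)
    out = group_norm(X, num_groups=3)
    print(out.shape)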

Advantages

The advantages of group normalization are mentioned below:

  • It has the ability to replace batch normalization in a number of deep learning tasks
  • It can be easily implemented in modern libraries with just a few lines of code

Visually, we can understand the different normalization methods by going through the diagram below:
