Batch Normalization in Deep Learning
The behavior of machine learning algorithms can change when the input distribution changes: if the training and test sets come from entirely different sources (e.g., training images collected from the web while test images are pictures taken on an iPhone), their distributions will differ.
A conventional neural network model consists of interconnected nodes called neurons, each producing a real-valued activation. Activation starts at the input neurons, and the neurons in each subsequent layer are activated through weighted connections from the previously activated neurons.
In the context of deep learning, we are particularly concerned with the change in the distribution of the inputs to the inner nodes within a network. A neural network changes the weights of each layer over the course of training. This means that the activations of each layer change as well. Since the activations of a previous layer are the inputs of the next layer, each layer in the neural network is faced with a situation where the input distribution changes with each step. This is problematic because it forces each intermediate layer to continuously adapt to its changing inputs.
(Internal Covariate Shift: the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.)
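To make internal covariate shift concrete, here is a minimal sketch (an illustration only; the large weight perturbation stands in for the cumulative effect of gradient updates): when an earlier layer’s weights change, the distribution of inputs seen by the next layer changes with them.

import torch
import torch.nn as nn

torch.manual_seed(0)
layer1 = nn.Linear(100, 50)
batch = torch.randn(256, 100)

before = torch.relu(layer1(batch))                 # what the next layer currently sees
print(before.mean().item(), before.std().item())

# simulate a training step that changes layer1's weights
with torch.no_grad():
    layer1.weight.add_(0.5 * torch.randn_like(layer1.weight))

after = torch.relu(layer1(batch))                  # same data, shifted distribution
print(after.mean().item(), after.std().item())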
Getting normalization right can be a crucial factor in getting your model to train effectively.
Batch Normalization
is a technique for improving the speed, performance, and stability of artificial neural networks. It is used to normalize layer inputs by re-centering and re-scaling the activations.
For each feature, batch normalization computes the mean and variance of that feature over the mini-batch. It then subtracts the mean and divides the feature by its mini-batch standard deviation, transforming the inputs to have zero mean and unit variance.
Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout.[*]
The mean and standard deviation are calculated per-dimension over the mini-batches, and γ and β are learnable parameter vectors of size C (where C is the input size). These two parameters act as a new mean and standard deviation: β shifts and γ scales the standardized layer inputs, so the network can learn the appropriate scale and shift automatically.
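Putting the pieces together, the full transform is y = γ · (x − μ_B) / √(σ²_B + ε) + β, where μ_B and σ²_B are the mini-batch mean and variance of a feature and ε is a small constant for numerical stability. A minimal sketch (an illustration, not library code) that reproduces the computation by hand and checks it against PyTorch’s nn.BatchNorm1d:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 4)                   # mini-batch of 32 samples, 4 features (C = 4)

bn = nn.BatchNorm1d(4)                   # gamma starts at 1, beta at 0
y_bn = bn(x)                             # training mode: uses this batch's statistics

# the same transform computed by hand
mu = x.mean(dim=0)                       # per-feature mini-batch mean
var = x.var(dim=0, unbiased=False)       # per-feature (biased) mini-batch variance
y_manual = bn.weight * (x - mu) / torch.sqrt(var + bn.eps) + bn.bias

print(torch.allclose(y_bn, y_manual, atol=1e-6))   # True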
In practice with PyTorch:
- The normalization functionality lives in torch.nn, so it is imported with the rest of the dependencies:
import torch
import torch.nn as nn
import torch.nn.functional as F
- An example of code showing how it is used when creating a model (source):
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=10,
                               kernel_size=5,
                               stride=1)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_bn = nn.BatchNorm2d(20)  # *
        self.dense1 = nn.Linear(in_features=320, out_features=50)
        self.dense1_bn = nn.BatchNorm1d(50)  # *
        self.dense2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_bn(self.conv2(x)), 2))  # **
        x = x.view(-1, 320)  # reshape to (batch_size, 320) before the dense layers
        x = F.relu(self.dense1_bn(self.dense1(x)))  # **
        x = F.relu(self.dense2(x))
        return F.log_softmax(x, dim=1)
* In these lines we define the normalization layers and choose the necessary parameters.
** In these lines we apply the normalization layers defined in the first part of the model. Normalization is only applied to the second convolutional layer and the first fully connected layer (dense1).
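Continuing from the imports and the Net definition above, we can sanity-check the shapes by pushing a dummy batch through the model (a sketch; the 320-unit reshape implies MNIST-sized 1×28×28 inputs, which is an assumption here):

model = Net()
dummy = torch.randn(64, 1, 28, 28)   # batch of 64 grayscale 28x28 images
out = model(dummy)
print(out.shape)                     # torch.Size([64, 10]): log-probabilities over 10 classes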
The batch normalization function has several other parameters with default values; in the previous code only one was set, num_features (the size of the input, C as previously mentioned):
torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
Please refer to the official documentation for further details about the impact of default parameters.
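One behavior these defaults control is worth illustrating: in training mode, BatchNorm normalizes with the current mini-batch statistics while updating running estimates (governed by momentum and track_running_stats); after calling .eval(), it normalizes with the stored running estimates instead. A minimal sketch:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)               # 3 features, default eps and momentum
x = torch.randn(8, 3) * 2 + 5        # mini-batch with non-zero mean and non-unit variance

bn.train()
y_train = bn(x)                      # normalized with this batch's own statistics
print(bn.running_mean)               # running estimates have moved toward the batch mean

bn.eval()
y_eval = bn(x)                       # normalized with the stored running statistics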
References
- Official PyTorch documentation website
- Related research paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
- Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models, 2017.
- [*] Prof. Andrew Ng, YouTube video: Why Does Batch Norm Work? (C2W3L06)
- fast.ai course (v3), Part 2, Lesson 10 notes