Deep Learning | Batch Normalization

Chris
Aug 9, 2022

When we train a neural network, we usually standardize or normalize our dataset in the preprocessing step so it is ready for training. Normalization and standardization both have the goal of transforming the data onto the same scale.

For example, we might need to transform data in the range 1 ~ 1000 into the range 0 ~ 1.

Image source: https://www.thoughtco.com/calculate-a-sample-standard-deviation-3126345

With standardization, we make the data have a mean of 0 and a standard deviation of 1.
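
Here is a minimal NumPy sketch of both transforms (the array values are just made-up examples):

import numpy as np

x = np.array([1.0, 250.0, 500.0, 750.0, 1000.0])   # made-up values in the 1 ~ 1000 range

# Min-max normalization: rescale into the range 0 ~ 1
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1
x_std = (x - x.mean()) / x.std()

print(x_norm)                        # values between 0 and 1
print(x_std.mean(), x_std.std())     # roughly 0 and 1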

Why we need normalization

If we don't normalize our data, some numerical features in our dataset may have very high values while others are very low; the data has a relatively wide range and is not on the same scale. This can cause instability in the neural network: relatively large inputs cascade down through the layers, which may produce imbalanced gradients and, in turn, the exploding gradient problem. This imbalance makes non-normalized data drastically harder to train on, and it can also significantly slow down training. Putting all the data on the same scale speeds up training and helps avoid problems like imbalanced gradients.
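
As a rough sketch of why scale matters (using NumPy and made-up numbers), passing raw values in the 1 ~ 1000 range through a single layer of random weights already produces outputs that are hundreds of times larger than the same data scaled to 0 ~ 1:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 16))                          # one layer of random weights

raw = rng.uniform(1, 1000, size=(8, 5))               # a batch of unscaled inputs
scaled = (raw - raw.min()) / (raw.max() - raw.min())  # the same batch rescaled to 0 ~ 1

print(np.abs(raw @ W).mean())                         # large, unstable pre-activations
print(np.abs(scaled @ W).mean())                      # small, stable pre-activations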

Why we need batch normalization

During training, one of the weights can become drastically larger than the others. This weight causes the output from its corresponding neuron to be extremely large, and the imbalance continues to cascade through the network, causing instability.

This is where batch normalization comes into play. Batch normalization is applied to the layers you choose within your network. When batch norm is applied to a layer, the output from that layer is first passed through its activation function, which transforms it depending on the function itself; the batch norm then normalizes this activation output before it is passed on to the next layer as input. Next, the batch norm multiplies the normalized output by an arbitrary parameter and adds another arbitrary parameter to the resulting product. This calculation with two arbitrary parameters sets a new standard deviation and mean for the data.

These two arbitrary parameters are called gamma (the scale) and beta (the shift), and applying them produces the output y.

Therefore, the batch norm works with four parameters per feature: the mean and standard deviation computed from the batch, plus the two arbitrarily set parameters gamma and beta. Gamma and beta are trainable, meaning they become optimized during the training process.
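
Here is a minimal NumPy sketch of the whole calculation for one batch (eps is the usual small constant added for numerical stability; the array values are made up):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x holds one batch of activations, shape (batch_size, features)
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize to mean 0, std 1
    return gamma * x_hat + beta               # scale and shift: new mean and std

x = np.random.randn(32, 4) * 10 + 3           # made-up activations on an arbitrary scale
gamma = np.ones(4)                            # trainable scale, initialized to 1
beta = np.zeros(4)                            # trainable shift, initialized to 0

out = batch_norm(x, gamma, beta)
print(out.mean(axis=0), out.std(axis=0))      # roughly 0 and 1 with these initial values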

This process keeps the weights within the network from becoming imbalanced with extremely high or low values, since the normalization is included in the gradient (backpropagation) process.

Adding batch norm to the model can greatly increase the speed at which training occurs and reduce the ability of outlying large weights to over-influence the training process.

Difference between normalization and batch norm

When we normalize our input data in the preprocessing step, before training occurs, the normalization happens to the data before it is passed into the input layer.

With batch normalization, we can also normalize the output of the activation functions for individual layers within our model. So we have normalized data coming in, and we also have normalized data within the model itself.
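
For example, the preprocessing side might look like this (a sketch assuming scikit-learn's StandardScaler; any equivalent scaling would do), while the in-model side is the BatchNormalization layer shown in the Code section below:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.uniform(1, 1000, size=(100, 5))    # made-up raw input data

# Normalization in the preprocessing step: applied once, before the input layer
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)          # each feature now has mean 0, std 1

# Batch normalization, by contrast, lives inside the model itself (see the
# BatchNormalization layer in the Keras code below), re-normalizing layer
# outputs on every batch during training.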


Code:

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

model = Sequential([
    Dense(16, input_shape=(1, 5), activation='relu'),
    Dense(32, activation='relu'),
    # Normalize the previous layer's output, then scale by gamma and shift by beta.
    # axis: the axis that should be treated as the features axis.
    # beta_initializer: initializer for the beta weight (defaults to zeros).
    # gamma_initializer: initializer for the gamma weight (defaults to ones).
    BatchNormalization(axis=1, beta_initializer='zeros', gamma_initializer='ones'),
    Dense(2, activation='softmax')
])
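
To see what the batch norm layer actually adds to the model, we can inspect its weights: gamma and beta show up as trainable weights, while the moving mean and variance it tracks are non-trainable:

model.summary()

# Inspect the batch norm layer's weights directly
bn_layer = model.layers[2]        # the BatchNormalization layer defined above
for w in bn_layer.weights:
    print(w.name, w.shape)        # gamma, beta, moving_mean, moving_variance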

Reference:

https://medium.com/@abheerchrome/batch-normalization-explained-1e78f7eb1e8a