# Batch Normalization

Before getting to our topic, let’s see why normalization is essential for data when it comes to analysis and prediction.

**What is Data Normalization?**

Today’s world runs on data from our day-to-day lives. Buying products on Amazon, commenting on products you have purchased, watching web series every day: all of these feed data collection. The same data is then used for forecasting and customer recommendations.

So why should data be normalized? In the database world, data should be normalized before it is used for analysis or business needs. Raw data often has a redundant structure, so we remove the redundancy and then define key mappings between tables to make the relationships clear. Once the data is normalized, queries across tables and application-level CRUD operations become much easier. Now let’s look at why normalization is needed in ML.

**Normalization in ML**

Consider a dataset with two kinds of variables, one measured in miles and the other in hours. Because the two features sit on very different scales, a model trained without normalization often gives poorer results, so we normalize the data before fitting the model. This is handled in a step called ‘data preprocessing’, using any of the following three techniques.

**Rescaling:** also known as “min-max normalization”, it is the simplest of the methods and is calculated as:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

**Mean normalization:** This method uses the mean of the observations in the transformation:

$$x' = \frac{x - \operatorname{mean}(x)}{\max(x) - \min(x)}$$

**Z-score normalization:** Also known as standardization, this technique uses the Z-score or “standard score”. It is widely used with machine learning algorithms such as SVM and logistic regression:

$$z = \frac{x - \mu}{\sigma}$$

Here, z is the standard score, x is the original value, µ is the population mean and σ is the population standard deviation.
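The three techniques above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a replacement for a preprocessing library: the function names and the `miles` sample values are made up for the example, and the code assumes the values are not all identical (otherwise the denominators would be zero).

```python
from statistics import mean, pstdev


def min_max_rescale(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mean_normalize(values):
    """Centre values on the mean and divide by the range."""
    mu, lo, hi = mean(values), min(values), max(values)
    return [(v - mu) / (hi - lo) for v in values]


def z_score(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)  # population mean and std
    return [(v - mu) / sigma for v in values]


# Hypothetical feature measured in miles
miles = [10.0, 20.0, 30.0, 40.0, 50.0]

print("min-max:   ", min_max_rescale(miles))
print("mean norm: ", mean_normalize(miles))
print("z-score:   ", z_score(miles))
```

Whichever method is used, the point is the same: after the transformation, a feature in miles and a feature in hours end up on comparable scales.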

With that understanding in place, let’s look at Batch Normalization in detail.

**Why is Batch Normalization needed?**

Batch Normalization applies to deep learning models, where the network is structured as a sequence of layers. Each layer has multiple neurons, and each neuron has weights assigned to it. A layer’s weighted output is passed as input to the next layer, which processes it with its own weights and feeds the result to the layer after that. The whole dataset is not passed through the network at once; instead it is fed in mini-batches.

As different mini-batches of data are loaded and passed through the network, the input distribution to the layers jumps around, making life harder for our layers to do their job. In addition to fitting the underlying distribution, the layer in question now also has to account for the drifts in the layer input distribution. This phenomenon of shifting input distributions is known as **internal covariate shift**. Batch Normalization is a technique designed to reduce this problem.
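To get a feel for how much mini-batch statistics can jump around, here is a small sketch. It only illustrates the sampling side of the story: the data, the batch size of 8 and the Gaussian parameters are all assumptions chosen for the example, not values from any real network.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical layer inputs: 1,000 values drawn from one fixed distribution
data = [random.gauss(5.0, 2.0) for _ in range(1000)]

# Split the data into mini-batches of 8 values each
batch_size = 8
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# Every mini-batch has its own mean, even though the source distribution is fixed
batch_means = [mean(b) for b in batches]

print(f"full-dataset mean:   {mean(data):.2f}")
print(f"smallest batch mean: {min(batch_means):.2f}")
print(f"largest batch mean:  {max(batch_means):.2f}")
```

Even with one fixed underlying distribution, each mini-batch shows the layer a slightly different picture of its input.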

# So, what is Internal Covariate Shift?

“Internal Covariate Shift is the change in the distribution of network activations due to the change in network parameters during training.”

The deeper your network, the more tangled a mess *internal covariate shift* can cause. Let’s remember that neural networks learn and adjust their weights through a mathematical game of telephone (the more people, or ‘layers’, put in the chain, the more messed up the message is going to get). As builders of neural networks, our job is to **stabilize** and improve the connection between our output layer’s results and each hidden layer’s nodes.

An internal covariate shift occurs when there is a change in the input distribution to the layers of our network. When the input distribution changes, hidden layers try to adapt to the new distribution. This slows down the training process, so the network takes longer to converge to a minimum. The problem occurs when the statistical distribution of a layer’s input is drastically different from the input it has seen before. Batch normalization and other normalization techniques can mitigate this problem.

Have you ever played the game telephone with cups and strings? Think of this as that, only you can tweak your phone to make it clearer (i.e., tune your parameters). The first guy tells the second guy, “go water the plants”, the second guy tells the third guy, “got water in your pants”, and so on until the last guy hears, “kite bang eat face monkey” or something totally wrong. Let’s say that the problems are entirely systemic and due entirely to faulty red cups.

Now, let’s say we can fix our cups (or get new ones) so that we pass messages better. We tell the last guy the right answer, and he fixes his cup a little bit and then tests it out by talking to the second-to-last guy through it. The second-to-last guy tells the third-to-last guy to fix something and all the way back to the first guy. Backpropagation, right?

The trouble is, everybody’s fixing stuff at the same time. So, when one guy tells the next guy stuff, he’s doing it with his new cup, i.e., parameters. And that’s bad because everyone is getting a new phone/cup based on what the guy after him told him… only the message changes because the cups change. To put it another way: your first layer’s parameters change, and so the distribution of the input to your second layer changes. By changing parameters around, you’re unavoidably causing something that Ioffe and Szegedy call “internal covariate shift”. Usually, it’s not a problem with only a few layers; it gets pretty hairy when you’ve got a truly deep neural network.
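The “new cup” effect can be sketched with a toy first layer. Everything here is hypothetical: a single neuron with ReLU stands in for a real layer, and the before/after weights are made-up numbers representing one training update. The inputs never change, yet the distribution that the second layer receives does.

```python
from statistics import mean, pstdev


def layer_one(inputs, weight, bias):
    """A one-neuron linear layer followed by ReLU (toy stand-in for a real layer)."""
    return [max(0.0, weight * x + bias) for x in inputs]


# The same fixed inputs are used before and after the parameter update
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

before = layer_one(inputs, weight=1.0, bias=0.0)  # parameters before a training step
after = layer_one(inputs, weight=2.5, bias=1.0)   # hypothetical parameters after it

# What the second layer sees as its input distribution
print(f"before update: mean {mean(before):.3f}, std {pstdev(before):.3f}")
print(f"after update:  mean {mean(after):.3f}, std {pstdev(after):.3f}")
```

The second layer has to keep re-adapting to these shifts, which is the burden batch normalization tries to lighten.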

That is not the end of it. Deep learning models also run into gradient problems. The exploding gradient and the vanishing gradient are two problems that become more severe as the network grows. When they occur, the model becomes less effective and training takes much longer. So how can we resolve this? Batch Normalization can help. Before getting to it, let’s introduce these gradient problems.

**Vanishing Gradient**

The vanishing gradient is a major problem when training neural networks. Each layer has multiple neurons, and each neuron has weights. During training, stochastic gradient descent (SGD) uses the gradient of the loss with respect to each weight in the network. Backpropagation builds that gradient by multiplying many factors together, and when those factors (weights and activation derivatives) are small, the gradient for the weights in the early layers becomes extremely small. In other words, the gradient vanishes, hence the name ‘vanishing gradient’. Does the problem show up as small gradients or as stuck weights? Let’s look at both to understand it better.

**Small Gradients**

- The gradient calculation often produces a very small value. Any value multiplied by a small value gives a small result, and the same principle applies here. The gradient is computed from the weights of the neurons, so small weights lead directly to small gradients.
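A rough sketch of why repeated multiplication shrinks the gradient is shown below. It assumes a toy network in which every layer uses a sigmoid activation, has a weight of 0.5 and a pre-activation of 0 (all hypothetical choices), so each layer scales the backpropagated gradient by the same factor.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)


# Assumed toy setup: every layer has weight 0.5 and pre-activation 0.0,
# where the sigmoid derivative takes its maximum value of 0.25.
weight = 0.5
local_factor = weight * sigmoid_derivative(0.0)  # 0.5 * 0.25 = 0.125 per layer

# The gradient reaching a layer is scaled by one factor per layer above it
for depth in (1, 5, 10, 20):
    gradient_scale = local_factor ** depth
    print(f"{depth:>2} layers -> gradient scaled by {gradient_scale:.3e}")
```

With a per-layer factor below 1, the scale drops off exponentially with depth, which is why the earliest layers suffer most.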

**Stuck Weights**

- We already know the gradient is very small, and multiplying it by the learning rate gives an even smaller value. The new weight is calculated by subtracting this product (learning rate times gradient) from the old weight, so the weight barely changes. The weight does not move; it is stuck and no longer learning. The new weight is approximately equal to the old weight, so there is no improvement.
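The stuck-weight effect can be sketched with the usual gradient descent update, new weight = old weight − learning rate × gradient. The numbers below (learning rate, starting weight and a constant vanished gradient of 1e-9) are assumptions picked purely for illustration.

```python
learning_rate = 0.01
gradient = 1e-9   # assumed vanished gradient reaching an early layer
weight = 0.5      # hypothetical starting weight

history = [weight]
for step in range(1000):
    # Standard gradient descent update
    weight = weight - learning_rate * gradient
    history.append(weight)

total_change = history[0] - history[-1]
print(f"weight after 1000 steps: {history[-1]:.10f}")
print(f"total change:            {total_change:.2e}")
```

A thousand updates later the weight is still essentially where it started, which is what “not in the learning phase” looks like in practice.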

**Exploding Gradient**

As the term suggests, the exploding gradient is the opposite case. Earlier we saw how the gradient can become very small; here, large weights result in a large gradient. The model then performs poorly and accuracy drops. The weight update overshoots the optimum: with every new update the value grows and is further multiplied by the learning rate, so the new weight becomes very large. The weight makes enormous jumps without ever settling into a local minimum.
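Here is a small sketch of the same mechanism running in the other direction. It assumes each layer multiplies the backpropagated gradient by 1.5 and uses the toy loss f(w) = w², whose minimum sits at w = 0. The helper name `exploded_update` and all the numbers are hypothetical.

```python
# Assumed toy setup: each layer multiplies the backpropagated gradient by 1.5
local_factor = 1.5

for depth in (1, 10, 20, 50):
    print(f"{depth:>2} layers -> gradient scaled by {local_factor ** depth:.3e}")


def exploded_update(weight, learning_rate, gradient_scale):
    """One gradient descent step on f(w) = w**2 with an inflated gradient."""
    gradient = 2.0 * weight * gradient_scale  # true gradient 2w, inflated by the scale
    return weight - learning_rate * gradient


# Start close to the minimum and apply a few updates through 10 such layers
weight = 1.0
for step in range(5):
    weight = exploded_update(weight, learning_rate=0.1, gradient_scale=local_factor ** 10)
    print(f"step {step + 1}: weight = {weight:.3e}")
```

Instead of moving towards 0, the weight flips sign and grows in magnitude on every step, the “quick high jump” described above.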

**Batch Normalization**

Batch normalization is a layer that allows every layer of the network to learn more independently. It normalizes the output of the previous layer, so the next layer receives inputs on a consistent scale. With batch normalization, learning becomes more efficient, and it also has a regularizing effect that can help reduce overfitting. The layer is added to a sequential model to standardize the inputs or outputs of a layer. The technique helps reduce ICS and the gradient problems described above. In simple terms, it normalizes a layer’s activations to a mean of 0 and a standard deviation of 1.

**How does Batch Normalization work?**

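We now introduce the concept of Batch Normalization, which, in effect, normalizes the output activations of a layer and then does something more. Here’s a precise description.

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \tag{2}$$

$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \tag{3}$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{4}$$

$$y_i = \gamma \hat{x}_i + \beta \tag{5}$$

Here $x_1, \dots, x_m$ are the values of one activation over a mini-batch of size $m$, and $\epsilon$ is a small constant added for numerical stability.

The above equations describe what a batch norm layer does. Equations 2 and 3 calculate the mean and variance of each activation across a mini-batch. Equation 4 then subtracts the mean to zero-center the activations and divides by the standard deviation, so that the standard deviation of each activation across the mini-batch becomes 1.

Notice that the mean and the variance calculated here are the mean and the variance across the mini-batch.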

Equation 5 is where the real magic happens. γ and β are learnable parameters of the batch normalization layer (they are trained along with the weights, not set by hand like hyperparameters). The output of equation 5 has a mean of β and a standard deviation of γ. In effect, a batch normalization layer helps our optimization algorithm control the mean and the variance of the output of the layer.
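The four steps (mini-batch mean, mini-batch variance, normalization, then scale and shift by γ and β) can be sketched in plain Python for a single activation. This is a minimal forward pass under stated assumptions, not how a framework implements it: the function name, the sample activations and the `eps` default of 1e-5 are choices made for this example, and the running statistics used at inference time are left out.

```python
import math
from statistics import mean, pstdev


def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize one activation across a mini-batch, then scale and shift it."""
    m = len(batch)
    mu = sum(batch) / m                                        # mini-batch mean
    var = sum((x - mu) ** 2 for x in batch) / m                # mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]   # normalize
    return [gamma * xh + beta for xh in x_hat]                 # scale and shift


# Hypothetical activations of one neuron over a mini-batch of 6 examples
activations = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

normalized = batch_norm_forward(activations, gamma=1.0, beta=0.0)
shifted = batch_norm_forward(activations, gamma=3.0, beta=10.0)

print(f"gamma=1, beta=0  -> mean {mean(normalized):.4f}, std {pstdev(normalized):.4f}")
print(f"gamma=3, beta=10 -> mean {mean(shifted):.4f}, std {pstdev(shifted):.4f}")
```

The output mean follows β and the output standard deviation follows γ (up to the tiny effect of `eps`), regardless of the scale of the incoming activations.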

When we add a batch normalization layer between layers, the statistics of a layer’s output are controlled by just two learnable parameters, γ and β. Our optimization algorithm now has to adjust only these two parameters to control the statistics of a layer, rather than all the weights in the previous layer.

**Creating Batch Normalization in Python**

There are two approaches to implementing batch normalization. Let’s look at each in detail.

**Approach 1**

Here, batch normalization is applied after the activation function.

    import tensorflow as tf

    # Batch normalization follows each ReLU-activated Dense layer
    LAYERS_BN = [
        tf.keras.layers.Flatten(input_shape=[28, 28]),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(300, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ]

    model = tf.keras.models.Sequential(LAYERS_BN)

**Approach 2**

Here, batch normalization is applied before the activation function. Notice that use_bias is set to False in the Dense layers: the BN layer has its own learnable offset (β) that plays the role of the bias, so a separate bias in the preceding layer is redundant.

    # Dense (no bias) -> BatchNormalization -> activation
    LAYERS_BN_BIAS_FALSE = [
        tf.keras.layers.Flatten(input_shape=[28, 28]),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(300, use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.Dense(100, use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ]

    model = tf.keras.models.Sequential(LAYERS_BN_BIAS_FALSE)

Next, compile and fit the model as follows:

    # Model compilation
    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
        metrics=["accuracy"],
    )

    # Model fitting (X_train, y_train, X_valid and y_valid must already be prepared)
    history = model.fit(
        X_train, y_train,
        epochs=10,
        validation_data=(X_valid, y_valid),
    )

I hope this article gives you a better understanding of batch normalization.

Thank you for reading this article. See you soon in my next post :)