# NLP Zero to One: Deep Learning Training Procedure (Part 4/30)

## Back-propagation, Loss Functions and BatchNorm

# Introduction..

In the last blog, we discussed how the perceptron algorithm works as the basic computation block called a node/neuron, and how these neurons come together to form deep, fully connected, layered neural networks. We also discussed different activation functions. In this blog, we will concentrate on the learning/training procedure of neural networks.

# Training a Deep Neural Net..

A neural net's job is to accurately predict “y” for a given feature input “X”. The goal of training is to learn the parameters, *“Weights”* Wᵢ and *“Biases”* bᵢ, for all nodes in all layers, so that the predicted output for each training observation is as close as possible to the true value. Let's denote all of the network's learnable parameters (Wᵢ, bᵢ of all nodes) as θ.

Generally in machine learning we follow two important steps in this training procedure:

1. Define a **loss function** that measures the distance between the predicted output and the actual output.

2. Find the parameters that minimise this loss function, i.e. minimise the distance between the actual and predicted outputs. We'll use the gradient descent algorithm to optimise for these parameters (*in our example, the parameters are the “Weights” and “Biases” of each node*).

In traditional machine learning models like logistic regression, calculating these gradients is straightforward. But in neural networks, where the weights of one neuron do not directly impact the loss, it is much harder to compute the partial derivative of a weight in layer i when the loss is attached to a much later layer.

To find the gradient for a given weight, we need to **back-propagate the error** all the way from the output node to the node of interest. Once we obtain the gradients with respect to all learnable parameters in the network, we can update the parameters with an optimisation step. Let's look at each step in detail:

# Loss Calculation

The loss computation step tells us how well our network predicted on the feature input X. So for a given input X, we must devise a loss function **L(prediction, actual)**. The loss function is defined depending on whether the task is classification or regression. Let's look at some popular loss functions:

Once an output prediction ŷ is computed for the input X and the network parameters θ, we can calculate the **loss/error L(ŷ, y)**, where “y” is the true value for input X.
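As a minimal sketch of two popular loss functions, here are mean squared error (typical for regression) and categorical cross-entropy (typical for classification) in NumPy. The function names and the `eps` clipping constant are illustrative choices, not from the original:

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # Mean squared error: average squared distance between prediction and truth
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy_loss(y_pred, y_true, eps=1e-12):
    # Categorical cross-entropy: y_pred are predicted probabilities,
    # y_true are one-hot labels; clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred)) / y_true.shape[0]

# Regression example: two predictions vs. two targets
print(mse_loss(np.array([2.5, 0.0]), np.array([3.0, -0.5])))  # 0.25

# Classification example: one observation, three classes
probs  = np.array([[0.7, 0.2, 0.1]])
onehot = np.array([[1.0, 0.0, 0.0]])
print(cross_entropy_loss(probs, onehot))  # -log(0.7) ≈ 0.357
```

The smaller the loss, the closer ŷ is to y; training drives this number down.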

# Error Back-Propagation

To improve our prediction, we can use SGD to decrease the error of the whole network. To calculate the gradient (derivative *d*L(ŷ,y)/*d*θᵢ) for each parameter θᵢ, we can use the chain rule of calculus.

Before looking at an example of how to calculate gradients, let's briefly recall the chain rule:

For f(x) = u(v(w(x))), the derivative is **df(x)/dx = du/dv · dv/dw · dw/dx**, i.e. f′(x) = u′(v(w(x))) · v′(w(x)) · w′(x).

Let's look at an example of how to use the chain rule to calculate the gradient for a parameter θᵢ:
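A minimal worked sketch for a single sigmoid neuron (the specific values x, y, w, b and the squared-error loss are illustrative assumptions): the gradient of the loss with respect to the weight is the product of the local derivatives along the path from the loss back to the weight, which we can verify numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One-neuron network: z = w*x + b, y_hat = sigmoid(z), L = (y_hat - y)^2
x, y = 2.0, 1.0   # illustrative input and true label
w, b = 0.5, 0.1   # illustrative current parameters

# Forward pass
z = w * x + b
y_hat = sigmoid(z)
L = (y_hat - y) ** 2

# Backward pass via the chain rule:
# dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dyhat = 2 * (y_hat - y)
dyhat_dz = y_hat * (1 - y_hat)   # derivative of the sigmoid
dz_dw = x
dL_dw = dL_dyhat * dyhat_dz * dz_dw

# Numerical check: perturb w slightly and measure the change in L
eps = 1e-6
L_plus = (sigmoid((w + eps) * x + b) - y) ** 2
numeric = (L_plus - L) / eps
print(dL_dw, numeric)  # the two values agree closely
```

In a deep network, back-propagation repeats exactly this multiplication of local derivatives, layer by layer, from the loss back to the parameter of interest.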

# Parameter Update..

After obtaining the gradients with respect to all learnable parameters θᵢ in the network, we can update the parameters for each layer according to the learning rate α.

The amount by which the weights are updated during **training** is referred to as the step size or the **“learning rate”**. If the learning rate is high, the parameter updates will be larger. What we have discussed so far for training a neural network is based on **gradient descent methods, specifically stochastic gradient descent**. The idea of stochastic gradient descent is very simple: it updates the set of weights θ in the direction opposite to the gradient *d*L(ŷ,y)/*d*θ, to reduce the error **L(ŷ,y)**.
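The update rule θ ← θ − α · *d*L/*d*θ can be sketched on a toy one-parameter loss (the quadratic L(θ) = (θ − 3)², learning rate, and step count below are illustrative assumptions):

```python
# One-parameter gradient descent on the toy loss L(theta) = (theta - 3)^2
def grad(theta):
    return 2 * (theta - 3.0)   # dL/dtheta

theta = 0.0    # initial parameter value
alpha = 0.1    # learning rate

for _ in range(100):
    theta -= alpha * grad(theta)   # theta <- theta - alpha * dL/dtheta

print(theta)  # converges toward the minimiser, 3.0
```

Each step moves θ against the gradient; a larger α would take bigger steps (and can overshoot), a smaller α would converge more slowly.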

There are other optimisation methods besides stochastic gradient descent; the most popular are *Adagrad, RMS-Prop and ADAM*. Let's briefly discuss them:

**Adagrad:** an adaptive gradient-based optimisation method. It adapts the learning rate for each parameter in the network, making larger updates for infrequent parameters and smaller updates for frequent ones.

**RMS-Prop:** it divides the learning rate by an average of squared gradients and decays this average exponentially.

**ADAM:** adaptive moment estimation, like Adagrad, is an adaptive optimisation method. It also computes learning rates for each parameter and, in addition, incorporates an average of past gradients[1].
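The three update rules above can be sketched for a single parameter as follows. This is a simplified per-parameter sketch with commonly used default hyperparameters (the function names, learning rates, and decay constants are illustrative assumptions, not a definitive implementation):

```python
import numpy as np

def adagrad_step(theta, g, cache, lr=0.1, eps=1e-8):
    # Adagrad: accumulate all past squared gradients; frequently-updated
    # parameters get an ever-smaller effective learning rate
    cache += g ** 2
    theta -= lr * g / (np.sqrt(cache) + eps)
    return theta, cache

def rmsprop_step(theta, g, cache, lr=0.01, decay=0.9, eps=1e-8):
    # RMS-Prop: exponentially decayed average of squared gradients,
    # so old gradients fade out instead of accumulating forever
    cache = decay * cache + (1 - decay) * g ** 2
    theta -= lr * g / (np.sqrt(cache) + eps)
    return theta, cache

def adam_step(theta, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # ADAM: keeps an average of past gradients (m, first moment) and of
    # past squared gradients (v, second moment), with bias correction
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Demo: minimise the toy loss L(theta) = (theta - 3)^2 with ADAM
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2 * (theta - 3.0)
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)  # approaches 3.0
```

The common idea in all three is dividing by a running statistic of squared gradients, so each parameter effectively gets its own learning rate.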

# Note..

**Mini-Batch Gradient Descent:** Neither the whole dataset nor a single data point is used for each model update. Mini-batch gradient descent splits the dataset into batches, and the error for a single update is calculated over one mini-batch. This gives faster training and quicker convergence.

**More Topics:** Vanishing gradients, regularisation (L1, L2), drop-out and batch normalisation are other important concepts in deep learning. We will discuss these topics in detail in the coming blogs.
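The mini-batch procedure can be sketched on a toy linear-regression problem (the data, learning rate, batch size of 32, and epoch count are illustrative assumptions): shuffle the data each epoch, then compute the gradient and update once per batch rather than once per example or once per full pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(256, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=256)

w, b = 0.0, 0.0
alpha, batch_size = 0.1, 32

for epoch in range(200):
    idx = rng.permutation(len(X))             # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # one mini-batch of indices
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb
        # Gradients of mean squared error over this mini-batch only
        w -= alpha * np.mean(2 * err * xb)
        b -= alpha * np.mean(2 * err)

print(w, b)  # close to the true values 2.0 and 1.0
```

Each epoch performs several cheap updates (one per batch), which is the middle ground between full-batch gradient descent and one-example SGD.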

Previous: **NLP Zero to One: Deep Learning Theory Basics (Part 3/30)**

Next: **NLP Zero to One: Dense Representations, Word2Vec (Part 5/30)**