NLP Zero to One: Deep Learning Training Procedure (Part 4/30)

Back-propagation, Loss Functions and BatchNorm

Kowshik chilamkurthy
Mar 2 · 5 min read


Training a Deep Neural Net

A neural net's job is to accurately predict “y” for a given feature input “X”. The goal of training is to learn the parameters, the “Weights, Wᵢ” and “Biases, bᵢ” for all nodes in all layers, so that the predicted output for each training observation is as close as possible to the true value. Let's denote all the learnable parameters of the network (Wᵢ, bᵢ of all nodes) as θ.

Generally in machine learning we follow 2 important steps to facilitate this training procedure:
1. Define a loss function that calculates the distance between the predicted output and the actual output.
2. Find the parameters that minimise this loss function, i.e. minimise the distance between actual and predicted. We'll use the gradient descent algorithm to optimise these parameters (in our example, the “Weights” and “Biases” of each node).
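
The two steps above can be sketched on a deliberately tiny problem. This is only an illustration under stated assumptions: the toy data, the one-parameter model ŷ = w·x, and the names `loss`, `grad` and `fit` are hypothetical and not taken from the article.

```python
# Hypothetical toy data generated by y = 3x (for illustration only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def loss(w):
    """Step 1: mean squared distance between predictions w*x and targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    """Derivative of the loss above with respect to the single parameter w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def fit(w=0.0, learning_rate=0.05, steps=100):
    """Step 2: repeatedly move w against the gradient to reduce the loss."""
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

w_fit = fit()
print(w_fit, loss(w_fit))  # w_fit ends up very close to 3, loss close to 0
```

A real network has many parameters instead of one, but the loop has the same shape: compute the loss, compute its gradient, take a small step.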

In traditional machine learning models like logistic regression, calculating these gradients is straightforward. But in neural networks, the weights of a neuron in an early layer do not directly impact the loss, so it is a lot harder to compute the partial derivative with respect to a weight in layer i when the loss is attached to some much later layer.
To find the gradient for a given weight, we need to back-propagate the error all the way from the output node to the node of interest. Once we have the gradients with respect to all learnable parameters in the network, we can update the parameters with an optimisation method. Let's look at each step in detail:

Loss Calculation

Different Loss functions, generated by author

Once an output prediction ŷ is computed for the input X and the network parameters θ, we can calculate the loss/error L(ŷ,y), where “y” is the true value for input X.
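
As a rough sketch of what L(ŷ,y) can look like in code, here are two commonly used losses. Which losses the figure above showed is not recoverable from the text, so the choice of mean squared error and binary cross-entropy is an assumption, and the function names are hypothetical.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference, typically used for regression targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```

Both functions return a single number that shrinks as the predictions get closer to the true values, which is the property the training procedure relies on.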

Error Back-Propagation


For f(x) = u(v(w(x))),
df(x)/dx can be computed using the chain rule: df/dx = (du/dv) · (dv/dw) · (dw/dx).
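
To make the chain rule concrete, the sketch below picks three hypothetical functions for u, v and w (the article does not specify them), multiplies the three local derivatives together, and compares the result with a finite-difference estimate.

```python
import math

# Hypothetical choices for the nested functions: f(x) = exp(sin(x^2))
def w(x):
    return x ** 2

def v(w_val):
    return math.sin(w_val)

def u(v_val):
    return math.exp(v_val)

def f(x):
    return u(v(w(x)))

def df_dx(x):
    """Chain rule: df/dx = du/dv * dv/dw * dw/dx."""
    w_val = w(x)
    v_val = v(w_val)
    du_dv = math.exp(v_val)   # derivative of exp(v) with respect to v
    dv_dw = math.cos(w_val)   # derivative of sin(w) with respect to w
    dw_dx = 2 * x             # derivative of x^2 with respect to x
    return du_dv * dv_dw * dw_dx

def numeric_derivative(func, x, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (func(x + h) - func(x - h)) / (2 * h)

print(df_dx(0.7), numeric_derivative(f, 0.7))  # the two values should agree closely
```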

Let’s look at an example of how to use the chain rule to calculate the gradient for a parameter θᵢ:

Example generated by author
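
As a separate illustration (not a reproduction of the author's figure), here is a minimal sketch of back-propagation on a hypothetical network with one input, one hidden sigmoid unit and one linear output, trained with squared error. The network shape, the parameter values and all function names are assumptions made for this example. Each gradient is obtained by applying the chain rule backwards from the loss, and a finite-difference check confirms the result.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta, x):
    """Forward pass: z1 = w1*x + b1, a1 = sigmoid(z1), y_hat = w2*a1 + b2."""
    w1, b1, w2, b2 = theta
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    y_hat = w2 * a1 + b2
    return z1, a1, y_hat

def loss(theta, x, y):
    _, _, y_hat = forward(theta, x)
    return (y_hat - y) ** 2

def backward(theta, x, y):
    """Back-propagate the error from the output towards the first layer."""
    w1, b1, w2, b2 = theta
    _, a1, y_hat = forward(theta, x)
    dL_dyhat = 2 * (y_hat - y)          # loss -> output
    dL_dw2 = dL_dyhat * a1              # output -> w2
    dL_db2 = dL_dyhat                   # output -> b2
    dL_da1 = dL_dyhat * w2              # output -> hidden activation
    dL_dz1 = dL_da1 * a1 * (1 - a1)     # through the sigmoid
    dL_dw1 = dL_dz1 * x                 # hidden pre-activation -> w1
    dL_db1 = dL_dz1                     # hidden pre-activation -> b1
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]

def numeric_grads(theta, x, y, h=1e-6):
    """Finite-difference gradients, used only to check the analytic ones."""
    grads = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * h))
    return grads

theta = [0.5, -0.3, 1.2, 0.1]  # hypothetical values for [w1, b1, w2, b2]
x, y = 2.0, 1.0
analytic = backward(theta, x, y)
numeric = numeric_grads(theta, x, y)
print(analytic)
print(numeric)
```

Note how the gradient for w1, a weight in the first layer, reuses quantities already computed for the later layer. That reuse is what makes back-propagation efficient.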

Parameter Update

Source [1]

The amount by which the weights are updated during training is controlled by the step size, or “learning rate.” If the learning rate is high, the parameter updates will be larger. What we have discussed so far for training neural networks is based on gradient descent methods, specifically stochastic gradient descent. The idea of stochastic gradient descent is very simple: it updates the set of weights θ in the direction opposite to the gradient dL(ŷ,y)/dθ, to reduce the error L(ŷ,y).
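
A minimal sketch of this update rule, θ ← θ − learning_rate · dL/dθ, applied one training example at a time. The linear model ŷ = w·x + b, the toy data and the helper names are hypothetical and chosen only to keep the example short.

```python
import random

def sgd_step(theta, grads, learning_rate):
    """Move each parameter against its gradient, scaled by the learning rate."""
    return [t - learning_rate * g for t, g in zip(theta, grads)]

def example_grads(theta, x, y):
    """Gradients of the squared error (w*x + b - y)^2 for a single example."""
    w, b = theta
    error = w * x + b - y
    return [2 * error * x, 2 * error]

def train(data, learning_rate, epochs=200, seed=0):
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    theta = [0.0, 0.0]         # [w, b]
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)      # "stochastic": visit the examples in random order
        for x, y in data:
            theta = sgd_step(theta, example_grads(theta, x, y), learning_rate)
    return theta

# Hypothetical toy data generated by y = 2x + 1
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(data, learning_rate=0.05)
print(w, b)  # should end up close to 2 and 1
```

Doubling `learning_rate` in `sgd_step` doubles the size of every update, which is the sense in which a high learning rate means larger parameter changes.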

There are other optimisation methods besides plain stochastic gradient descent; the most popular are Adagrad, RMS-Prop and ADAM. Let’s briefly discuss them:

Adagrad: An adaptive gradient-based optimization method. It adapts the learning rate to each of the parameters in the network, making larger updates to infrequently updated parameters and smaller updates to frequently updated ones.
RMS-Prop: It divides the learning rate by the square root of an exponentially decaying average of squared gradients.
ADAM: Adaptive moment estimation is, like Adagrad, an adaptive optimization method. It also computes a learning rate for each parameter. In addition, it incorporates an exponentially decaying average of past gradients [1].
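
The three update rules can be sketched for a single scalar parameter as follows. These are the commonly stated textbook forms, written from scratch rather than taken from any library. The hyperparameter defaults, the `state` dictionary and the toy objective L(θ) = (θ − 3)² are assumptions chosen for illustration.

```python
import math

def adagrad_step(theta, grad, state, lr=0.5, eps=1e-8):
    """Divide by the root of the running SUM of squared gradients."""
    state["sum_sq"] = state.get("sum_sq", 0.0) + grad ** 2
    return theta - lr * grad / (math.sqrt(state["sum_sq"]) + eps)

def rmsprop_step(theta, grad, state, lr=0.05, decay=0.9, eps=1e-8):
    """Divide by the root of an exponentially decaying AVERAGE of squared gradients."""
    state["avg_sq"] = decay * state.get("avg_sq", 0.0) + (1 - decay) * grad ** 2
    return theta - lr * grad / (math.sqrt(state["avg_sq"]) + eps)

def adam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Combine decaying averages of gradients (m) and of squared gradients (v)."""
    t = state["t"] = state.get("t", 0) + 1
    m = state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    v = state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

def minimise(step_fn, steps=500):
    """Minimise the toy objective L(theta) = (theta - 3)^2 starting from 0."""
    theta, state = 0.0, {}
    for _ in range(steps):
        grad = 2 * (theta - 3.0)
        theta = step_fn(theta, grad, state)
    return theta

results = {fn.__name__: minimise(fn) for fn in (adagrad_step, rmsprop_step, adam_step)}
print(results)  # each value should end up near 3
```

In a real network each parameter keeps its own copy of this state, which is what gives every parameter its own effective learning rate.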


Generated by author

Previous: NLP Zero to One: Deep Learning Theory Basics (Part 3/30)
Next: NLP Zero to One: Dense Representations, Word2Vec (Part 5/30)

Nerd For Tech

