# Training a Deep Neural Net

A neural network’s job is to accurately predict the output “y” for a given feature input “X”. The goal of training is to learn the parameters — weights “Wᵢ” and biases “bᵢ” — for all nodes in all layers so that the predicted output for each training observation is as close as possible to the true value. Let’s denote all of the network’s learnable parameters (the Wᵢ and bᵢ of all nodes) as θ.

Generally in machine learning we follow two important steps to carry out this training procedure:
1. Define a loss function that measures the distance between the predicted output and the actual output.
2. Find the parameters that minimise this loss function, i.e. minimise the distance between the actual and predicted values. We’ll use the gradient descent algorithm to optimise for these parameters (in our example, the weights and biases of each node).
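The two steps above can be sketched on a toy problem. This is a minimal illustration of my own (fitting a single weight and bias with plain gradient descent), not the article’s network; the names `w`, `b`, and `lr` are illustrative.

```python
import numpy as np

# Toy data: the true relationship is y = 3.0 * x + 0.5 (no noise).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X + 0.5

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    y_hat = w * X + b                  # predicted output
    loss = np.mean((y_hat - y) ** 2)   # step 1: loss = distance(predicted, actual)
    dw = np.mean(2 * (y_hat - y) * X)  # step 2: gradient of the loss w.r.t. w
    db = np.mean(2 * (y_hat - y))      # gradient of the loss w.r.t. b
    w -= lr * dw                       # move each parameter against its gradient
    b -= lr * db

print(round(w, 2), round(b, 2))  # → 3.0 0.5
```

With enough iterations the parameters recover the true weight and bias, because each update moves θ in the direction that decreases the loss.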

In traditional machine learning models like logistic regression, calculating these gradients is straightforward. But in neural networks, where the weight of one neuron does not directly affect the loss, it is much harder to compute the partial derivative of a weight in layer i when the loss is attached to a much later layer.
To find the gradient for a given weight, we need to back-propagate the error all the way from the output node to the node of interest. Once we have the gradients with respect to all learnable parameters in the network, we can update the parameters with an optimisation step. Let’s look at each step in detail:

# Loss Calculation

Once an output prediction ŷ is computed from the input X and the network parameters θ, we can calculate the loss/error L(ŷ, y), where “y” is the true value for input X.
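Two common choices for L(ŷ, y) can be sketched as follows. The function names `mse` and `binary_cross_entropy` are my own labels, not from the article, but the formulas are the standard ones.

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error, typical for regression."""
    return np.mean((y_hat - y) ** 2)

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Cross-entropy for binary classification; eps avoids log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# A perfect prediction gives zero loss:
print(mse(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # → 0.0
```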

# Error Back-Propagation

For f(x) = u(v(w(x))), the derivative df(x)/dx can be computed with the chain rule:

df/dx = u′(v(w(x))) · v′(w(x)) · w′(x)

Let’s look at an example of how to use the chain rule to calculate the gradient for a parameter θᵢ:
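As a concrete check of the chain rule, here is a small sketch with illustrative choices of my own (u = sin, v(t) = t², w(x) = 2x + 1 — not the article’s figure), comparing the analytic chain-rule derivative against a numerical finite difference.

```python
import math

def w_fn(x): return 2 * x + 1       # innermost function w(x)
def v_fn(t): return t ** 2          # middle function v(t)
def u_fn(s): return math.sin(s)     # outermost function u(s)

def f(x):
    return u_fn(v_fn(w_fn(x)))      # f(x) = u(v(w(x)))

def f_prime(x):
    # Chain rule: df/dx = u'(v(w(x))) * v'(w(x)) * w'(x)
    return math.cos(v_fn(w_fn(x))) * (2 * w_fn(x)) * 2

x0, h = 0.3, 1e-6
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)  # central finite difference
print(abs(numeric - f_prime(x0)) < 1e-6)     # → True
```

Back-propagation applies exactly this factorisation layer by layer, multiplying local derivatives from the output back to the parameter of interest.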

# Parameter Update

The amount by which the weights are updated during training is referred to as the step size or the “learning rate.” If the learning rate is high, the parameter updates will be larger. What we have discussed so far for training a neural network is based on gradient descent methods, specifically stochastic gradient descent. The idea of stochastic gradient descent is very simple: it updates the set of weights θ against the gradient dL(ŷ,y)/dθ to reduce the error L(ŷ,y).
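The SGD update rule described above, θ ← θ − η · dL/dθ, is a one-liner in code. This is a minimal sketch; the names `sgd_step` and `learning_rate` are illustrative.

```python
import numpy as np

def sgd_step(theta, grad, learning_rate=0.01):
    """Move the parameters against the gradient to reduce the loss."""
    return theta - learning_rate * grad

theta = np.array([1.0, -2.0])
grad = np.array([0.5, -0.5])
# Each parameter steps opposite to its own gradient component: [0.95, -1.95]
theta = sgd_step(theta, grad, learning_rate=0.1)
```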

RMSProp: it divides the learning rate by an exponentially decaying average of the squared gradients, so each parameter gets its own adapted step size.
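A single RMSProp step matching that description might look like the sketch below. The hyperparameter names (`decay`, `eps`) and values are the conventional ones, not taken from the article.

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update: decayed average of g**2 scales the learning rate."""
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2    # running average of squared gradients
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)  # per-parameter adapted step
    return theta, avg_sq

theta = np.array([1.0])
avg_sq = np.zeros(1)  # the average starts at zero and is updated each step
theta, avg_sq = rmsprop_step(theta, grad=np.array([2.0]), avg_sq=avg_sq, lr=0.1)
print(round(theta[0], 4))  # → 0.6838
```

Because the divisor grows with the magnitude of recent gradients, parameters with consistently large gradients take smaller effective steps than they would under plain SGD.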


## Nerd For Tech

From Confusion to Clarification

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To know more about us, visit https://www.nerdfortech.org/.

Written by

## Kowshik chilamkurthy

RL | ML | ALGO TRADING | TRANSPORTATION | GAME THEORY