Deep Neural Network for Classification from scratch using Python
In this article I explain what a multi-layered neural network is and how to build one from scratch using Python. The focus is mainly on neural networks for multi-class classification. You can check my total work here.
Below are the three main steps in developing a neural network. I will explain each step in detail below.
- Defining Neural Network Structure
- Initializing Weights for Network
- Train network using Gradient descent methods to update weights
1. Neural Network Structure:
As shown in the figure above, a multi-layered network contains an input layer, two or more hidden layers (the figure above has 2) and an output layer. Each hidden layer contains n hidden units. The input to the network is an m-dimensional vector. The output layer contains p neurons corresponding to the p classes. Each neuron in a hidden layer or the output layer can be split into two parts: pre-activation (Zᵢ) and activation (Aᵢ). I will discuss pre-activation and activation functions in more detail in the forward propagation step below. Each layer contains a trainable weight matrix (Wᵢ) and a bias vector (bᵢ), and we need to initialize them. To train these weights we will use variants of gradient descent (with forward and backward propagation). So to build a neural network we first need to specify the number of hidden layers, the number of hidden units in each layer, the input dimensions and the weight initialization. After this we need to train the network. So a typical implementation of a neural network contains the steps below:
- Defining the neural network structure
- Initializing model parameters
- Training neural network ( Forward and Backward propagation)
2. Initializing the model parameters:
Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations. Moreover, training deep models is a sufficiently difficult task that most algorithms are strongly affected by the choice of initialization. — Deep Learning book.org
If all units in a hidden layer start with the same parameters, they all receive the same updates and still compute the same output at the end of training. The initial parameters therefore need to break the symmetry between different units in a hidden layer, so we initialize the weights randomly. Large weights may lead to exploding values in forward or backward propagation and can also saturate the activation function, so try to initialize with small weights. Typically we initialize randomly from a Gaussian or uniform distribution. The choice of Gaussian or uniform distribution does not seem to matter much but has not been exhaustively studied. Some heuristics are available for initializing weights; some of them are listed below.
he_normal → N(0, σ) with standard deviation σ = sqrt(2/fan-in)
he_uniform → Uniform(-sqrt(6/fan-in), sqrt(6/fan-in))
xavier_normal → N(0, σ) with standard deviation σ = sqrt(2/(fan-in + fan-out))
xavier_uniform → Uniform(-sqrt(6/(fan-in + fan-out)), sqrt(6/(fan-in + fan-out)))
Here fan-in is the number of inputs the layer takes and fan-out is the number of outputs the layer gives.
I implemented a weights_init function that takes three parameters as input (layer_dims, init_type, seed) and returns a dictionary ‘parameters’.
layer_dims → Python list containing the dimensions of each layer in our network, e.g. [no. of input features, no. of neurons in hidden layer 1, ..., no. of neurons in hidden layer n, no. of outputs]
init_type → he_normal, he_uniform, xavier_normal, xavier_uniform
seed → random seed to generate weights
parameters → Python dictionary containing the parameters “W1”, “b1”, …, “WL”, “bL”, where Wl is a weight matrix of shape (layer_dims[l], layer_dims[l-1]) and bl is a bias vector of shape (layer_dims[l], 1)
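A minimal sketch of such a function, assuming NumPy and the signature described above. It is an illustration rather than the original listing; in particular, starting the biases at zero is an assumption, since the text does not say how they are initialized.

```python
import numpy as np

def weights_init(layer_dims, init_type="he_normal", seed=None):
    """Return {'W1', 'b1', ..., 'WL', 'bL'} for the given layer sizes."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        fan_in, fan_out = layer_dims[l - 1], layer_dims[l]
        shape = (fan_out, fan_in)  # (layer_dims[l], layer_dims[l-1])
        if init_type == "he_normal":
            W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)
        elif init_type == "he_uniform":
            limit = np.sqrt(6.0 / fan_in)
            W = rng.uniform(-limit, limit, size=shape)
        elif init_type == "xavier_normal":
            W = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=shape)
        elif init_type == "xavier_uniform":
            limit = np.sqrt(6.0 / (fan_in + fan_out))
            W = rng.uniform(-limit, limit, size=shape)
        else:
            raise ValueError(f"unknown init_type: {init_type}")
        parameters[f"W{l}"] = W
        # Assumption: biases start at zero; the random weights already break symmetry.
        parameters[f"b{l}"] = np.zeros((fan_out, 1))
    return parameters

# 3 input features, one hidden layer of 4 units, 2 output classes
params = weights_init([3, 4, 2], init_type="he_uniform", seed=0)
for name, value in params.items():
    print(name, value.shape)
```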
In the above code we loop through the list (each layer) and initialize the weights. I will discuss the weight dimensions, and why we get those shapes, in the forward propagation step. A sample output ‘parameters’ dictionary is shown below.
3. Training Neural Network:
To train the neural network we approximate y as a function of the input x, which is called forward propagation; we then compute the loss and adjust the weights (the function) using a gradient method, which is called back propagation.
Forward propagation is nothing but a composition of functions.
Let’s take a network with 1 hidden layer as shown above. In forward propagation, the first layer calculates an intermediate state a = f(x); this intermediate value is passed to the output layer, and y is calculated as y = g(a) = g(f(x)). At each layer we apply a function to the previous layer’s output, so the final output y is a composite function of x. As discussed earlier, the function f(x) has two parts (pre-activation and activation). The pre-activation part applies a linear transformation and the activation part applies a nonlinear transformation using some activation function. Let’s consider a 1-hidden-layer network as shown below.
It has 3 input features x1, x2, x3. Each input is connected to all hidden layer units. So the total number of weights required for W1 is 3*4 = 12 (the number of connections), and for W2 it is 3*2 = 6.
The first unit in the hidden layer takes input from all 3 features, so we can compute its pre-activation as z₁₁ = w₁₁.x₁ + w₁₂.x₂ + w₁₃.x₃ + b₁, where w₁₁, w₁₂, w₁₃ are the weights of the edges connected to the first unit in the hidden layer. We can write the same type of pre-activation output for all hidden units, as shown below.
We can vectorize all of the above equations as below.
Here m is the number of data samples. So we can write Z1 = W1.X + b1.
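To make the shapes concrete, here is a small NumPy sketch of Z1 = W1.X + b1. The sizes (3 features, 4 hidden units, 5 samples) and the random values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, m = 3, 4, 5               # illustrative sizes

X = rng.normal(size=(n_features, m))            # one column per data sample
W1 = rng.normal(size=(n_hidden, n_features))    # one row per hidden unit
b1 = np.zeros((n_hidden, 1))                    # broadcast across the m columns

Z1 = W1 @ X + b1
print(Z1.shape)  # (4, 5)

# The first row of Z1 is the scalar formula for the first hidden unit,
# applied to every sample at once.
z11 = W1[0, 0] * X[0] + W1[0, 1] * X[1] + W1[0, 2] * X[2] + b1[0, 0]
print(np.allclose(Z1[0], z11))  # True
```

Storing one sample per column is what lets a single matrix product replace the loop over samples and units.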
After pre-activation we apply a nonlinear function called the activation function. There are many activation functions; I am not going deep into them here, and you can check these blogs for more: blog1, blog2. Below are the implementations of those activation functions.
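As a rough sketch, common activation functions can be written in NumPy as follows. The function names, and treating the alpha of relu as a leaky slope, are assumptions made for this illustration; columns are taken to be data samples.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def tanh(Z):
    return np.tanh(Z)

def relu(Z, alpha=0.0):
    # alpha = 0 is the plain ReLU; alpha > 0 gives a leaky ReLU (assumption)
    return np.where(Z > 0, Z, alpha * Z)

def elu(Z, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(Z > 0, Z, alpha * (np.exp(np.minimum(Z, 0.0)) - 1.0))

def softmax(Z):
    # Subtract the column-wise max for numerical stability, then
    # normalise each column (one sample) so that it sums to 1.
    expZ = np.exp(Z - Z.max(axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)

Z = np.array([[1.0, -2.0],
              [0.0, 3.0]])
print(relu(Z))
print(softmax(Z).sum(axis=0))  # each column sums to 1
```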
So our first hidden layer output is A1 = g(W1.X + b1).
If we apply the same formulation to the output layer of the above network we get Z2 = W2.A1 + b2 and y = g(Z2), where g is the activation function.
So if we implement this for 2 hidden layers, our equations are:
Z1 = W1.X + b1
A1 = g1(Z1)
Z2 = W2.A1 + b2
A2 = g2(Z2)
Z3 = W3.A2 + b3
y = A3 = g3(Z3)
Similarly we can implement this for n layers.
There is another concept called dropout, a regularization technique used in deep neural networks. Dropout refers to dropping out units in a neural network, that is, ignoring some units during the training phase as shown below. You can check this paper for full reference.
In this implementation I used inverted dropout. Below are the steps to implement it.
- initialize keep_prob with a probability value to keep that unit
- Generate random numbers of shape equal to that layer activation shape and get a boolean vector where numbers are less than keep_prob
- Multiply activation output and above boolean vector
- divide activation by keep_prob (scale up during training so that we don’t have to do anything special in the test phase)
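The four steps above can be sketched in a few lines of NumPy. The helper name inverted_dropout and the use of a NumPy Generator are assumptions made for this illustration.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    # Steps 1-2: random numbers with the activation's shape, turned into a boolean mask
    mask = rng.random(A.shape) < keep_prob
    # Step 3: zero out the dropped units
    # Step 4: scale the survivors up so the expected activation is unchanged
    A_dropped = (A * mask) / keep_prob
    return A_dropped, mask

rng = np.random.default_rng(0)
A = np.ones((4, 10000))
A_dropped, mask = inverted_dropout(A, keep_prob=0.8, rng=rng)
print(mask.mean())       # fraction of units kept, close to 0.8
print(A_dropped.mean())  # close to 1.0, the mean before dropout
```

Because of the division by keep_prob, the test-time forward pass can simply skip dropout without any rescaling.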
In my implementation, at every step of forward propagation I save the input activation, parameters and pre-activation output ((A_prev, parameters[‘Wl’], parameters[‘bl’]), Z) for use in back propagation. Forward propagation takes five input parameters, as below:
X → input data of shape (no. of features, no. of data points)
hidden layers → list of hidden layer activations; for relu and elu you can give the alpha value as a tuple, and the final layer must be softmax. Ex: [‘relu’, (‘elu’, 0.4), ‘sigmoid’, …, ‘softmax’]
parameters → dictionary that we got from weights_init
keep_prob → probability of keeping a neuron active during dropout, in [0, 1]
seed → random seed to generate random numbers
At every layer we take the previous layer’s activation as input and compute ZL and AL, and we collect the cache ((A_prev, WL, bL), ZL) into one list to use in back propagation.
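A simplified sketch of such a forward pass is shown below. It is not the original implementation: it only accepts plain activation names (no (name, alpha) tuples), and it does not store the dropout masks, which a backward pass with dropout would also need.

```python
import numpy as np

def softmax(Z):
    expZ = np.exp(Z - Z.max(axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)

# Only a few activations are supported in this sketch.
ACTIVATIONS = {
    "relu": lambda Z: np.maximum(Z, 0.0),
    "sigmoid": lambda Z: 1.0 / (1.0 + np.exp(-Z)),
    "tanh": np.tanh,
    "softmax": softmax,
}

def forward_propagation(X, hidden_layers, parameters, keep_prob=1.0, seed=None):
    rng = np.random.default_rng(seed)
    caches = []
    A = X
    L = len(hidden_layers)
    for l, name in enumerate(hidden_layers, start=1):
        A_prev = A
        W, b = parameters[f"W{l}"], parameters[f"b{l}"]
        Z = W @ A_prev + b                       # pre-activation
        A = ACTIVATIONS[name](Z)                 # activation
        if l < L and keep_prob < 1.0:            # inverted dropout, hidden layers only
            mask = rng.random(A.shape) < keep_prob
            A = A * mask / keep_prob
        caches.append(((A_prev, W, b), Z))       # saved for back propagation
    return A, caches

rng = np.random.default_rng(0)
parameters = {
    "W1": rng.normal(size=(4, 3)) * 0.1, "b1": np.zeros((4, 1)),
    "W2": rng.normal(size=(2, 4)) * 0.1, "b2": np.zeros((2, 1)),
}
X = rng.normal(size=(3, 5))                      # 3 features, 5 samples
AL, caches = forward_propagation(X, ["relu", "softmax"], parameters)
print(AL.shape, len(caches))  # (2, 5) 2
```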
I used cross entropy as the loss function.
What is Cross Entropy?
We can write the information content of an event a as -log₂(p(a)) and the expectation as E[x] = ∑pᵢxᵢ. Entropy is the expected information content, i.e. -∑pᵢlog(pᵢ).
Entropy = Expected Information Content = -∑pᵢlog(pᵢ)
Let’s take ‘p’ as the true distribution and ‘q’ as the predicted distribution. According to our prediction, the information content of event i is -log(qᵢ), but these events actually occur with the distribution ‘pᵢ’, so the expectation has to be computed over ‘pᵢ’, i.e. Expectation = -∑pᵢlog(qᵢ).
Cross Entropy = -∑pᵢlog(qᵢ)
pᵢ = True Distribution
qᵢ = Predicted Distribution
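A small worked example may help. The distributions below are made up, and the natural logarithm is used (an assumption; a different base only rescales the value).

```python
import numpy as np

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = np.array([0.0, 1.0, 0.0])        # true distribution: one-hot, class 1
q_good = np.array([0.1, 0.8, 0.1])   # confident, correct prediction
q_bad = np.array([0.6, 0.2, 0.2])    # mostly wrong prediction

print(cross_entropy(p, q_good))  # -log(0.8) ≈ 0.223
print(cross_entropy(p, q_bad))   # -log(0.2) ≈ 1.609
```

With a one-hot true distribution only the predicted probability of the correct class contributes, so the loss grows as that probability shrinks.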
I implemented a compute_cost function that takes the inputs below:
A → Predicted Distribution
Y → True Distribution
parameters → W and b values for L1 and L2 regularization
lambda → Regularization strength
penality → ‘L1’ or ‘L2’ or None
cost = -(1/m).∑ Y.log(A) + λ.||W||ₚ, where p = 2 for L2 and p = 1 for L1
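A minimal sketch of such a cost function follows. Several details are assumptions of this illustration: the argument is named lambd because lambda is a reserved word in Python, the L2 penalty uses the sum of squared weights, the penalty is not divided by m (a common variant scales it by 1/m or 1/(2m)), and a small epsilon guards against log(0).

```python
import numpy as np

def compute_cost(A, Y, parameters, lambd=0.0, penality=None):
    m = Y.shape[1]                       # number of data samples (columns)
    eps = 1e-12                          # avoids log(0)
    cost = -np.sum(Y * np.log(A + eps)) / m

    # Only the weight matrices are penalised, not the biases.
    weights = [v for k, v in parameters.items() if k.startswith("W")]
    kind = penality.lower() if penality else None
    if kind == "l1":
        cost += lambd * sum(np.abs(W).sum() for W in weights)
    elif kind == "l2":
        cost += lambd * sum(np.square(W).sum() for W in weights)
    return float(cost)

Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])               # one-hot labels, 2 samples
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])               # predicted probabilities
parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.zeros((1, 1))}

print(compute_cost(A, Y, parameters))
print(compute_cost(A, Y, parameters, lambd=0.1, penality="l2"))
```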
Backpropagation is a method used to calculate the gradient that is needed to update the weights. The goal of backpropagation is to adjust each weight in the network in proportion to how much it contributes to the overall error.
So the main aim is to find the gradient of the loss with respect to the weights, as shown below. This is done using the chain rule.
Let’s take the same 1-hidden-layer network used in forward propagation; its forward propagation equations are shown below.
Z1 = W1.X + b1
A1 = g1(Z1)
Z2 = W2.A1 + b2
A2 = g2(Z2)
Let’s write the chain rule for computing the gradient with respect to the weights.
We can observe a pattern in the above 2 equations: both need the gradient with respect to Z.
The pattern is that once we compute the first derivative dl/dz2, we can get the previous layers’ gradients easily.
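Written out for the 1-hidden-layer network, the chain rule gives the expressions below. This is a sketch under two assumptions: the cost is the average of the loss over the m samples, and dZ_k is shorthand for the matrix of per-sample gradients of the loss with respect to Z_k.

```latex
\begin{aligned}
dW_2 &= \frac{1}{m}\, dZ_2\, A_1^{T}, &
db_2 &= \frac{1}{m} \sum_{i=1}^{m} dZ_2^{(i)}, \\
dZ_1 &= \left(W_2^{T}\, dZ_2\right) \odot g_1'(Z_1), \\
dW_1 &= \frac{1}{m}\, dZ_1\, X^{T}, &
db_1 &= \frac{1}{m} \sum_{i=1}^{m} dZ_1^{(i)}.
\end{aligned}
```

Every weight gradient has the form "gradient with respect to Z of that layer, times the input of that layer", which is why dZ_2 is the only quantity that has to be derived by hand.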
Our final layer is a softmax layer, so if we get the softmax layer’s derivative with respect to Z then we can find all the gradients as shown above. The figure below shows how to compute the softmax layer gradient. In the figure below, a_Li corresponds to Z in the above equations.
so dl/dz2 = -(Y-A2) = A2-Y
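This result can be checked numerically. The sketch below compares A - Y with a central finite-difference estimate of the gradient for a single sample; the values of z and y are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    # cross entropy of softmax(z) against the one-hot label y
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])   # pre-activations of the softmax layer
y = np.array([0.0, 0.0, 1.0])    # one-hot true label

analytic = softmax(z) - y        # the closed form A - Y

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```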
Our back propagation takes the following inputs:
AL → probability vector, output of the forward propagation
Y → true “label” vector (True Distribution)
caches → list of caches
hidden_layers → hidden layer names
keep_prob → probability for dropout
penality → regularization penality ‘l1’ or ‘l2’ or None
First we initialize the gradients dictionary and get the number of data samples (m), as shown below.
Next I start back propagation with the final softmax layer and compute the last layer’s gradients as discussed above.
After that I loop over all the layers backward and calculate the gradients. Check the code below.
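A reduced sketch of this backward loop is shown below. It is an illustration, not the original code: it leaves out dropout and the regularization penalty (so keep_prob and penality are not arguments), supports only a few hidden activations, and assumes each cache has the form ((A_prev, W, b), Z) with a softmax output layer.

```python
import numpy as np

def activation_grad(name, Z):
    """Derivative g'(Z) for the hidden activations this sketch supports."""
    if name == "relu":
        return (Z > 0).astype(float)
    if name == "sigmoid":
        s = 1.0 / (1.0 + np.exp(-Z))
        return s * (1.0 - s)
    if name == "tanh":
        return 1.0 - np.tanh(Z) ** 2
    raise ValueError(f"unsupported activation: {name}")

def backward_propagation(AL, Y, caches, hidden_layers):
    grads = {}                     # gradients dictionary
    L = len(caches)                # number of layers
    m = Y.shape[1]                 # number of data samples

    dZ = AL - Y                    # softmax + cross entropy: dl/dZ_L = AL - Y
    for l in range(L, 0, -1):      # loop over the layers backward
        (A_prev, W, b), Z = caches[l - 1]
        grads[f"dW{l}"] = dZ @ A_prev.T / m
        grads[f"db{l}"] = dZ.sum(axis=1, keepdims=True) / m
        if l > 1:
            dA_prev = W.T @ dZ
            Z_prev = caches[l - 2][1]
            dZ = dA_prev * activation_grad(hidden_layers[l - 2], Z_prev)
    return grads

# Usage on a single softmax layer (no hidden layers), with made-up numbers.
X = np.array([[1.0, 2.0], [0.5, -1.0]])          # 2 features, 2 samples
W, b = np.zeros((2, 2)), np.zeros((2, 1))
Z = W @ X + b
AL = np.full((2, 2), 0.5)                        # softmax of an all-zero Z
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
grads = backward_propagation(AL, Y, [((X, W, b), Z)], ["softmax"])
print(grads["dW1"].shape, grads["db1"].shape)    # (2, 2) (2, 1)
```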
I am not going deep into these optimization methods; I will give some intuitive explanations.
SGD: We update normally, i.e. W_new = W_old - learning_rate*gradient
SGD With Momentum:
Let’s think of it this way: if I am repeatedly being asked to move in the same direction, then I should probably gain some confidence and start taking bigger steps in that direction. This is the main idea of momentum-based SGD. So we calculate an exponentially weighted average of the gradients.
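The two update rules so far can be sketched as below. The function names, beta = 0.9 and the toy objective f(w) = w² are assumptions chosen for illustration.

```python
import numpy as np

def sgd_update(w, grad, learning_rate):
    # W_new = W_old - learning_rate * gradient
    return w - learning_rate * grad

def momentum_update(w, grad, velocity, learning_rate, beta=0.9):
    # Exponentially weighted average of past gradients
    velocity = beta * velocity + (1.0 - beta) * grad
    return w - learning_rate * velocity, velocity

# Toy problem: minimise f(w) = w^2, whose gradient is 2w.
w_sgd = w_mom = 5.0
velocity = 0.0
for _ in range(200):
    w_sgd = sgd_update(w_sgd, 2 * w_sgd, learning_rate=0.1)
    w_mom, velocity = momentum_update(w_mom, 2 * w_mom, velocity, learning_rate=0.1)

print(w_sgd, w_mom)  # both end up close to the minimum at 0
```

The same functions work unchanged on NumPy arrays, so they can be applied to each W and b in the parameters dictionary.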
RMSprop:
Here we decay the learning rate for each parameter in proportion to its update history. This update history is calculated as an exponentially weighted average.
Adam:
It is RMSprop plus a cumulative history of gradients.
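The two adaptive updates described above are commonly known as RMSprop and Adam, and can be sketched as follows. The hyperparameter defaults are commonly used values rather than the ones from the original implementation, and the bias correction in the Adam step follows the standard formulation of that method.

```python
import numpy as np

def rmsprop_update(w, grad, s, learning_rate=0.01, beta=0.9, eps=1e-8):
    # s: exponentially weighted average of squared gradients (the update history)
    s = beta * s + (1.0 - beta) * grad ** 2
    w = w - learning_rate * grad / (np.sqrt(s) + eps)
    return w, s

def adam_update(w, grad, m, v, t, learning_rate=0.01,
                beta1=0.9, beta2=0.999, eps=1e-8):
    # m: average of gradients (momentum part); v: average of squared gradients
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction; t is the 1-based step count
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimise f(w) = w^2, whose gradient is 2w.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_update(w, 2 * w, m, v, t)
print(w)  # far closer to the minimum at 0 than the starting point 5.0
```

Dividing by the square root of the squared-gradient average makes the effective step size roughly independent of the gradient's scale, so each parameter gets its own learning rate.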
If we put it all together we can build a deep neural network for multi-class classification. You can check my total work at my GitHub.
Hope you like this article!
References:
1. Deeplearning.ai Course
2. Forward Propagation
3. Back Prop
4. Dropout
5. Applied ai course
6. ML Cheat Sheet
7. CS7015 - Deep Learning by IIT Madras
8. Dropout: A Simple Way to Prevent Neural Networks from Overfitting (paper)
9. https://www.deeplearningbook.org/