Chapter 7: Artificial Neural Networks with Math

I have been talking about machine learning for a while, and now I want to talk about deep learning, as I got bored of ML.

So in this article we will talk about neural networks, which are part of deep learning, which in turn is part of machine learning. Let’s get started!

Note: Some understanding of the math from the previous article on gradient descent (GD) is required.

As usual, before I dive into deep learning, I want to point this out.

Why deep learning?

We already have a lot of algorithms in machine learning, many of which we still don’t understand, so why take the pain of learning a new thing called “deep learning”?

Well, researchers and other scientists give plenty of reasons. As a machine learning scientist, I believe in a few, which are below.

  1. Deep learning is preferred over shallow learning when you have an enormous amount of data (either labeled or not).
  2. State-of-the-art performance in tasks involving text, sound, or images, with many advances in computer vision, NLP, and speech recognition.
  3. Feature representation (abstract representation): the network learns useful features itself, so we don’t need to spend much time on feature engineering.

And so on; there are actually a lot more.

Note: Just because DL is cool does not mean that we don’t need ML techniques. I choose models, algorithms, frameworks, and tools based on the data and the problem.

Don’t get caught up!

Okay man, got it, now go ahead and tell us about neural networks.

Hmm, let’s first understand the neuron.

A neuron is a computational unit which takes the input(s), does some calculations, and produces the output. That’s it, no big deal.

Above, the 2nd picture is the one we use in neural networks: we have the inputs and we have some weights (parameters). We take the dot product of these two vectors to produce the result, which is a continuous value anywhere from -infinity to +infinity.
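To make the dot product concrete, here is a minimal sketch in Python with NumPy. The input and weight values are made up for illustration; the article does not give specific numbers.

```python
import numpy as np

# Hypothetical example values (not from the article).
inputs = np.array([1.0, 0.0])      # two input features
weights = np.array([0.4, -0.7])    # one weight per input

# The neuron's raw result is the dot product of inputs and weights.
# Nothing bounds it yet, so it can be any real number.
raw_output = np.dot(inputs, weights)
print(raw_output)
```

With these values only the first input contributes, because the second input is 0.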

If we want to restrict the output values, we use an activation function.

The activation function squashes the output value and produces a value within a range (which depends on the type of activation function).

We often use these three: Sigmoid (range 0 to 1), Tanh (range -1 to 1), and ReLU (range 0 to +infinity).
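The three activation functions can be sketched in a few lines of NumPy. The function names and the sample values in `z` are my own choices for illustration.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real number into the range (-1, 1).
    return np.tanh(z)

def relu(z):
    # Keeps positive values as they are and replaces negatives with 0.
    return np.maximum(0.0, z)

z = np.array([-5.0, 0.0, 5.0])   # sample raw neuron outputs
print(sigmoid(z))
print(tanh(z))
print(relu(z))
```

Note how the same raw values land in a different range depending on which function is applied.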

That’s the neuron.

A neural network is a set of layers (each layer is a set of neurons) stacked together sequentially.

The output of one layer becomes the input of the next layer.

Here we have three layers:

  1. Input layer: A set of input neurons, where each neuron represents one feature in our dataset. It takes the inputs and passes them to the next layer.
  2. Hidden layer: A set of n neurons, where each neuron has weights (parameters) assigned to it, one per input. It takes the input from the previous layer, computes the dot product of inputs and weights, applies the activation function (as we have seen above), and passes the result to the next layer.
  3. Output layer: The same as a hidden layer, except that it gives the final result (outcome/class/value).

Note: We can have any number of hidden layers in between (for the sake of understanding, let’s take only one hidden layer).

So how do we define the number of neurons in each layer and in the whole network?

Well, the number of neurons in the input layer is based on the number of features in the dataset.

N_i/p_neurons = N_Features + 1 (bias)

For the hidden layers, we can define as many neurons/layers as we wish (it depends on the data and the problem), but as a rule of thumb it is good to use more neurons than features and to give all hidden layers the same number of neurons.


The number of neurons in the output layer is based on the type of problem and its outcomes.

For regression, use 1 neuron; for binary classification, you can have 1 or 2 neurons; and for multi-class classification, more than 2 neurons.

Note: There is no bias here, as it is the last layer in the network.

We now have a basic understanding of neural networks, so let’s go deeper.

Let’s understand how neural networks work.

Once you have the dataset and have identified the problem, you can follow the steps below:

1. Pick the network architecture (initialize with random weights).
2. Do a forward pass (forward propagation).
3. Calculate the total error (we need to minimize this error).
4. Back propagate the error and update the weights (back propagation).
5. Repeat the process (2-4) for a number of epochs or until the error is minimal.

There are 2 algorithms in neural networks:

1. Forward propagation.

2. Back propagation.

1. Pick the network architecture

Let’s take a toy dataset (XOR) and pick an architecture with 2 inputs, 2 outputs, and 1 hidden layer of 3 neurons.
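Here is a minimal sketch of what "picking the architecture" can look like in code: one weight matrix per pair of neighbouring layers, filled with random values. The variable names, the seed, and the range of the random values are my own assumptions, and bias terms are left out to keep it short.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # the seed is an arbitrary choice

n_inputs, n_hidden, n_outputs = 2, 3, 2

# One weight per connection: a (2 x 3) matrix from the inputs to the
# hidden neurons and a (3 x 2) matrix from the hidden neurons to the outputs.
W_hidden = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))
W_output = rng.uniform(-1.0, 1.0, size=(n_hidden, n_outputs))

print(W_hidden.shape, W_output.shape)  # (2, 3) (3, 2)
```

Storing the weights as matrices means a whole layer can be computed with a single matrix product.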

2. Forward propagation

This is a simple process: we feed the inputs forward through each layer in the network, and the outputs from the previous layer become the inputs to the next layer (the very first inputs are our data).

First we provide the inputs (one example) from our dataset.

Dataset (XOR table):

X1 X2 y
1  1  0
1  0  1
0  1  1
0  0  0

Take the first row, X1=1 and X2=1. Then H1 = Sigmoid(X1*w1 + X2*w2) = 0.5 (an assumed value, with random weights), and similarly for H2, H3 and O1, O2.
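The forward pass for the first row of the XOR table can be sketched like this. The weight values are made up (fixed rather than random so the run is repeatable) and bias terms are left out, so the numbers will not match the 0.5 assumed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, not taken from the article.
W_hidden = np.array([[0.1, 0.2, 0.3],
                     [0.4, 0.5, 0.6]])    # shape (2 inputs, 3 hidden)
W_output = np.array([[0.7, -0.1],
                     [-0.2, 0.3],
                     [0.5, -0.4]])         # shape (3 hidden, 2 outputs)

x = np.array([1.0, 1.0])   # first row of the XOR table: X1=1, X2=1

# Each layer: dot product with the weights, then the activation.
hidden = sigmoid(x @ W_hidden)        # H1, H2, H3
output = sigmoid(hidden @ W_output)   # O1, O2
print(hidden, output)
```

The hidden values feed straight into the output layer, which is exactly the "output of one layer is the input of the next" idea.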

3. Calculate the total error

With the assumed random weights and the activations (A1, A2, …), we get the error for each neuron.

sum = inputs * weights and A = activation(sum), here Sigmoid(sum).

Our cost function is the one from Andrew Ng.

Note: We take the partial derivative w.r.t. the result (using the chain rule from calculus).
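As a rough sketch of this step, the code below uses a squared-error cost as a stand-in, which may differ from the cost function referred to above. The output values are made up, and the one-hot target (class 0 as [1, 0], class 1 as [0, 1]) is my own assumption for a network with 2 output neurons.

```python
import numpy as np

# Hypothetical network outputs (O1, O2) for one example.
output = np.array([0.6, 0.3])
# Assumed one-hot target for class 0.
target = np.array([1.0, 0.0])

# Squared error per output neuron (an assumed cost), summed into the total.
errors = 0.5 * (target - output) ** 2
total_error = errors.sum()

# Partial derivative of the total error w.r.t. each result.
d_error_d_output = output - target

print(total_error)
print(d_error_d_output)
```

The derivative is negative for O1 (it should go up towards 1) and positive for O2 (it should go down towards 0).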

4. Back propagation

Trust me, it’s easy! Or I will make it easy.

The main goal of backpropagation is to update each of the weights in the network so that they bring the predicted output closer to the target output, thereby minimizing the error for each output neuron and for the network as a whole.

So far we have the total error, which is to be minimized.

If you know how gradient descent works, the rest is pretty easy. If you don’t, see my earlier article about gradient descent.

We need to calculate the terms below:

  1. How much does the total error change with respect to the result? (We already did this in the picture above.)
  2. Next, how much does the result change with respect to its sum?
  3. Finally, how much does the sum change with respect to the weights?

Multiplying these three terms together (the chain rule) tells us how much the total error changes with respect to each weight, and we then move each weight a small step in the opposite direction.
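The three terms can be checked numerically for a single sigmoid neuron. This is only a sketch under assumptions: the weights and target are made up, and the error is a squared error, which may not be the cost function used in the article. The finite-difference check at the end is an extra I added to confirm the chain-rule result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([1.0, 1.0])      # X1=1, X2=1
weights = np.array([0.3, -0.2])    # hypothetical weights
target = 0.0                       # XOR of (1, 1)

def error_for(w):
    # Assumed squared-error cost for one neuron.
    return 0.5 * (target - sigmoid(np.dot(inputs, w))) ** 2

total = np.dot(inputs, weights)    # the "sum"
result = sigmoid(total)            # the "result"

d_error_d_result = result - target          # term 1
d_result_d_sum = result * (1.0 - result)    # term 2: derivative of sigmoid
d_sum_d_weights = inputs                    # term 3

# Chain rule: multiply the three terms.
gradient = d_error_d_result * d_result_d_sum * d_sum_d_weights

# Finite-difference check: nudge each weight and watch the error change.
eps = 1e-6
numeric = np.zeros_like(weights)
for i in range(len(weights)):
    step = np.zeros_like(weights)
    step[i] = eps
    numeric[i] = (error_for(weights + step) - error_for(weights - step)) / (2 * eps)

print(gradient, numeric)
```

If the two printed vectors agree, the chain-rule gradient is the real slope of the error with respect to each weight.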

Well, that’s it.

5. Repeat the process (2-4) for a number of epochs or until the error is minimal.

We repeat the process of forwarding the inputs (FP) and updating the weights (BP) for a number of epochs or until we reach the minimum error.
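Putting steps 2 to 4 in a loop gives a rough training sketch for the 2-3-2 network on XOR. Several things here are my own assumptions and not from the article: the squared-error cost, the one-hot targets, the separate bias vectors `b1` and `b2`, the learning rate, the number of epochs, and the seed. How well it fits XOR depends on those choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR table; one-hot targets (assumed): y=0 -> [1, 0], y=1 -> [0, 1].
X = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
T = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

# Step 1: pick the architecture and initialize with random weights.
rng = np.random.default_rng(seed=42)          # arbitrary seed
W1 = rng.uniform(-1.0, 1.0, size=(2, 3))      # inputs -> hidden
b1 = np.zeros(3)                              # hidden biases (assumed)
W2 = rng.uniform(-1.0, 1.0, size=(3, 2))      # hidden -> outputs
b2 = np.zeros(2)                              # output biases (assumed)

learning_rate = 0.5    # assumed value
n_epochs = 5000        # assumed value
history = []           # total error per epoch

for epoch in range(n_epochs):
    # Step 2: forward propagation over all four examples at once.
    H = sigmoid(X @ W1 + b1)
    O = sigmoid(H @ W2 + b2)

    # Step 3: total error (squared error, an assumed cost).
    history.append(0.5 * np.sum((T - O) ** 2))

    # Step 4: back propagation via the chain rule.
    delta_out = (O - T) * O * (1.0 - O)             # dE/dsum at the outputs
    delta_hid = (delta_out @ W2.T) * H * (1.0 - H)  # dE/dsum at the hidden layer

    # Gradient descent update of every weight and bias.
    W2 -= learning_rate * (H.T @ delta_out)
    b2 -= learning_rate * delta_out.sum(axis=0)
    W1 -= learning_rate * (X.T @ delta_hid)
    b1 -= learning_rate * delta_hid.sum(axis=0)

print("first error:", history[0], "last error:", history[-1])
```

Comparing the first and last entries of `history` is a quick way to see whether the error went down over the epochs.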

Once the training process is done, we can make predictions by feeding the input forward through the trained network. That’s it.

I hope it’s not confusing. If you are not good at derivatives, let me know and I can help, but I am sure this will make sense as you go through it again and again.

I put a lot of effort into adding the math and diagrams, as I feel pictures are more awesome than words, so please let me know if it helps.

Suggestions and questions are welcome.

So that’s it for this story. In the next story I will build the neural network from scratch using the above steps and the same math.

Until then,

See ya!