Neural Network approach to classify handwritten numerals using Softmax layer

This is my first post on deep learning. I have just taken Andrew Ng’s Deep Learning Specialization, which I would recommend to anyone looking for a kick start in this beautiful field. I will be writing these posts on a regular basis, so stay tuned. This post is for people with some background in data science, but beginners should not find it difficult to get a sense of it.

In this post I am going to build a classifier using a vanilla neural net. There are lots of frameworks available, like TensorFlow and MXNet, that let you build a network in a few steps, but to understand how neural nets work it is necessary to first build one from scratch.

The MNIST dataset is very popular, and one can look it up on Wikipedia. It has about 55k handwritten numerals, covering the digits 0 to 9. A sample image is shown below.

MNIST dataset image for numeral 6

Each image is 28x28 pixels, making the total number of features 784.
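To make the 784 features concrete, here is a minimal sketch of flattening images into columns and one-hot encoding the labels. The helper names `flatten_images` and `one_hot` are my own, the division by 255 is an assumed pixel scaling, and the data is random stand-in pixels rather than real MNIST.

```python
import numpy as np

def flatten_images(images):
    """Reshape (m, 28, 28) images into a (784, m) matrix, one column per example."""
    m = images.shape[0]
    # Scaling pixels to [0, 1] is an assumption, not something the post specifies
    return images.reshape(m, -1).T.astype(np.float64) / 255.0

def one_hot(labels, num_classes=10):
    """Turn integer labels of shape (m,) into a (num_classes, m) one-hot matrix."""
    m = labels.shape[0]
    Y = np.zeros((num_classes, m))
    Y[labels, np.arange(m)] = 1.0
    return Y

# Stand-in data: random pixels and made-up labels, not real MNIST
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28))
labels = np.array([6, 0, 9, 3, 6])

X = flatten_images(images)
Y = one_hot(labels)
print(X.shape, Y.shape)
```

Storing one example per column is a choice made here so that the cost can later be summed over classes along axis 0.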

Let’s introduce some vocab:

  1. NN: Neural Network
  2. MCC: multi class classification
  3. np: numpy
  4. back prop: back propagation

First let’s look at the basics of a neural network:

Courtesy: Wikipedia

It consists of input, hidden and output layers. In a fully connected NN, every node is connected to each node of the adjacent layers. In forward propagation, we propagate the input values through the network, multiplying them by the weights between nodes, and then calculate the cost (the error) at the output layer. Then we do back propagation, where the gradients of the cost with respect to the weights are calculated and the weights are updated accordingly. We repeat this process for multiple iterations, or epochs, and once the cost converges we have our final parameters, which can be used to predict values with a forward pass. Without going into much of the mathematics, I will dive straight into the implementation.

The neural net I am going to use has 3 layers, with sizes [784 (In), 50, 20, 10 (Out)]. Though a larger [784, 800, 10] network gives better accuracy, this smaller one is fine for demonstration. I have borrowed the approach heavily from what I learnt in Andrew Ng’s specialization. However, I have not used any framework, and this is a vanilla neural network. The approach for any neural network is:

  1. load train and test dataset
  2. initialize the params.
  3. define function for forward propagation
  4. define function for backpropagation
  5. define function to update parameters
  6. using these functions, do forward and backward pass multiple times.

Initialize parameters:

Let’s initialize the parameters. We are concerned with W and b, where W holds the weights between nodes and b holds the biases of the nodes. We do a random initialization of the weights using numpy, scaled down so that the starting values are small:

W1 = np.random.randn(784, 50)*0.01

Here 784 and 50 are the number of nodes in layers 1 and 2 respectively. We can define the other weights similarly.
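Here is a minimal sketch of initializing every layer of the [784, 50, 20, 10] network in one go. The function name `initialize_parameters`, the dictionary layout, the zero biases and the small 0.01 scale are all assumptions on my part, chosen to follow the (inputs, outputs) weight shape used for W1 above.

```python
import numpy as np

def initialize_parameters(layer_dims, scale=0.01, seed=0):
    """Create W (random, small) and b (zeros) for each pair of adjacent layers."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_in, n_out = layer_dims[l - 1], layer_dims[l]
        # W is stored as (inputs, outputs), matching W1 of shape (784, 50)
        params[f"W{l}"] = rng.standard_normal((n_in, n_out)) * scale
        # One bias per node, as a column so it broadcasts over examples
        params[f"b{l}"] = np.zeros((n_out, 1))
    return params

params = initialize_parameters([784, 50, 20, 10])
for name, value in params.items():
    print(name, value.shape)
```

Random weights break the symmetry between nodes, while the biases can safely start at zero.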

Forward Prop

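In the forward pass, we calculate the output by passing the input values through the weights. For example:

Z1 = np.dot(W1.T, X) + b1

where Z1 holds the values at the nodes of hidden layer 1 before activation, and X has one column per example. W1 is stored with shape (784, 50), hence the transpose. Then we transform Z1 with relu to get the output of hidden layer 1:

A1 = relu(Z1)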

In the last layer, we use the softmax function, which is used in multi class classification. For one example, it is given by:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

Here C is the number of classes, which in our case is 10 (the digits 0 to 9).
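A minimal sketch of softmax in numpy, assuming Z has shape (C, m) with one column per example. Subtracting the column maximum before exponentiating is a standard trick I have added to avoid overflow; it does not change the result.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (C, m)."""
    # Shift each column so its largest entry is 0; exp() can then never overflow
    shifted = Z - np.max(Z, axis=0, keepdims=True)
    exp_Z = np.exp(shifted)
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

# Two examples with C = 3 classes; the second has very large scores
Z = np.array([[1.0, 1000.0],
              [2.0, 1000.0],
              [3.0, 1000.0]])
A = softmax(Z)
print(A)
print(A.sum(axis=0))  # each column sums to 1
```

Each column of the output can be read as the predicted probabilities of the classes for that example.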

For two class classification we use the sigmoid function, and softmax works on the same basis, extending it to more than two classes. So the output layer value is given by:

A3 = softmax(Z3)
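Putting the pieces together, a full forward pass through the three layers could look like the sketch below. It assumes weights stored as (inputs, outputs), a parameter dictionary with keys W1..W3 and b1..b3, and random stand-in inputs; none of these details come from the original code.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def softmax(Z):
    exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

def forward_propagation(X, params):
    """relu -> relu -> softmax; returns the output and a cache for back prop."""
    Z1 = np.dot(params["W1"].T, X) + params["b1"]
    A1 = relu(Z1)
    Z2 = np.dot(params["W2"].T, A1) + params["b2"]
    A2 = relu(Z2)
    Z3 = np.dot(params["W3"].T, A2) + params["b3"]
    A3 = softmax(Z3)
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2, "Z3": Z3, "A3": A3}
    return A3, cache

# Hypothetical parameters and inputs, only to show the shapes
rng = np.random.default_rng(0)
dims = [784, 50, 20, 10]
params = {}
for l in range(1, len(dims)):
    params[f"W{l}"] = rng.standard_normal((dims[l - 1], dims[l])) * 0.01
    params[f"b{l}"] = np.zeros((dims[l], 1))

X = rng.random((784, 4))  # 4 stand-in examples, one per column
A3, cache = forward_propagation(X, params)
print(A3.shape)
```

The cache keeps the intermediate values so that the backward pass does not have to recompute them.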

Cost computation

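It is important to understand the cost for softmax, since it is different from that of a binary classifier. For a particular example, we take the sum of the cross entropy over all classes as the cost for that example, and then sum it over all examples. Note the minus sign: the log of a probability is negative, so it makes the cost positive.

cost(i) = -np.sum(np.log(A3)*Y, axis=0)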
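A minimal sketch of the cost as a function, assuming A3 and Y both have shape (10, m). Two things here are my own choices rather than the post's: the cost is averaged over the examples instead of summed, and a tiny `eps` is added inside the log to guard against log(0).

```python
import numpy as np

def compute_cost(A3, Y, eps=1e-12):
    """Cross-entropy cost, averaged over the m examples (columns)."""
    m = Y.shape[1]
    # Sum over classes (axis 0) gives one cost per example
    per_example = -np.sum(Y * np.log(A3 + eps), axis=0)
    return np.sum(per_example) / m

# A network that predicts 0.1 for every class is as uncertain as it can be,
# so the cost should be close to ln(10)
A3 = np.full((10, 4), 0.1)
Y = np.eye(10)[:, [6, 0, 9, 3]]  # one-hot columns for made-up labels
cost = compute_cost(A3, Y)
print(cost, np.log(10))
```

Averaging keeps the cost comparable across batch sizes; dropping the division by m gives the summed version.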

Back Propagation

Back propagation is the critical step. For the hidden layers it works as usual, but for the output layer the gradients are calculated as:

dZ3 = A3 - Y  # simply the difference between the calculated and observed values

dZ3 is actually d(cost)/d(Z3). You can do the differentiation yourself to find this, but as per this link, it comes out simply as A3 - Y.
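For anyone who wants to see where A3 - Y comes from, here is a short derivation for a single example. The symbols are my notation: a_k are the entries of A3, y_k the entries of the one-hot Y, z_k the entries of Z3, and δ is the Kronecker delta. The last step uses the fact that the one-hot entries sum to 1.

```latex
\begin{aligned}
L &= -\sum_{k=1}^{C} y_k \log a_k,
  \qquad a_k = \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}} \\
\frac{\partial a_k}{\partial z_i} &= a_k\,(\delta_{ki} - a_i) \\
\frac{\partial L}{\partial z_i}
  &= -\sum_{k=1}^{C} \frac{y_k}{a_k}\, a_k\,(\delta_{ki} - a_i)
   = -y_i + a_i \sum_{k=1}^{C} y_k
   = a_i - y_i
\end{aligned}
```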

dW3 = np.dot(A2, dZ3.T)  # matrix product, giving the same shape as W3
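Here is a sketch of the whole backward pass for the three layers. It assumes weights stored as (inputs, outputs), a cost averaged over the m examples (hence the division by m, which drops out if you sum instead), and a tiny made-up network so it runs quickly. The function and key names are my own.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def softmax(Z):
    exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

def forward_propagation(X, params):
    """Forward pass; returns the values needed by back prop."""
    Z1 = np.dot(params["W1"].T, X) + params["b1"]
    A1 = relu(Z1)
    Z2 = np.dot(params["W2"].T, A1) + params["b2"]
    A2 = relu(Z2)
    Z3 = np.dot(params["W3"].T, A2) + params["b3"]
    return {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2, "A3": softmax(Z3)}

def backward_propagation(X, Y, params, cache):
    """Gradients of the averaged cross-entropy cost for every W and b."""
    m = X.shape[1]
    dZ3 = cache["A3"] - Y                      # output layer: softmax + cross entropy
    dW3 = np.dot(cache["A2"], dZ3.T) / m
    db3 = np.sum(dZ3, axis=1, keepdims=True) / m
    # Hidden layers: push the error back through W, then through relu
    dZ2 = np.dot(params["W3"], dZ3) * (cache["Z2"] > 0)
    dW2 = np.dot(cache["A1"], dZ2.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(params["W2"], dZ2) * (cache["Z1"] > 0)
    dW1 = np.dot(X, dZ1.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2, "dW3": dW3, "db3": db3}

# Tiny stand-in network and data (hypothetical sizes, not MNIST)
rng = np.random.default_rng(1)
dims = [6, 5, 4, 3]
params = {}
for l in range(1, len(dims)):
    params[f"W{l}"] = rng.standard_normal((dims[l - 1], dims[l])) * 0.5
    params[f"b{l}"] = np.zeros((dims[l], 1))
X = rng.standard_normal((6, 8))
Y = np.eye(3)[:, rng.integers(0, 3, size=8)]

cache = forward_propagation(X, params)
grads = backward_propagation(X, Y, params, cache)
for name in sorted(grads):
    print(name, grads[name].shape)
```

Every gradient has the same shape as the parameter it belongs to, which is a quick sanity check worth doing before training.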

The backpropagation function returns the gradients, which we use to update the parameters with gradient descent:

W1 = W1 - learning_rate*dW1
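The same update applies to every W and b. A minimal sketch, assuming the parameters and gradients live in dictionaries where the gradient of `W1` is stored under `dW1` (a naming convention I have picked for illustration):

```python
import numpy as np

def update_parameters(params, grads, learning_rate=0.01):
    """One gradient descent step: move each parameter against its gradient."""
    updated = {}
    for name, value in params.items():
        updated[name] = value - learning_rate * grads["d" + name]
    return updated

# Made-up parameters and gradients, only to show the mechanics
params = {"W1": np.ones((2, 3)), "b1": np.zeros((3, 1))}
grads = {"dW1": np.full((2, 3), 0.5), "db1": np.ones((3, 1))}

new_params = update_parameters(params, grads, learning_rate=0.1)
print(new_params["W1"][0, 0], new_params["b1"][0, 0])
```

Returning a new dictionary instead of modifying `params` in place makes it easier to compare the values before and after a step.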

Simple, and now our model is ready.

The model is nothing but the repeated execution of forward and backward passes. Below is the pseudo code for the model:

initialize params

for i in range(num_iterations):
    do forward propagation
    compute cost
    do back prop
    update parameters
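The pseudo code above can be turned into a small runnable sketch. To keep it short and fast, it makes several assumptions that differ from my actual run: plain full-batch gradient descent instead of Adam with mini batches, a made-up 3-class dataset of Gaussian blobs instead of MNIST, an averaged cost, and hypothetical function names and hyperparameters.

```python
import numpy as np

def softmax(Z):
    exp_Z = np.exp(Z - Z.max(axis=0, keepdims=True))
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)

def forward(X, Ws, bs):
    """Return the list of activations [X, A1, ..., A_out]."""
    acts = [X]
    for l, (W, b) in enumerate(zip(Ws, bs)):
        Z = W.T @ acts[-1] + b
        is_output = l == len(Ws) - 1
        acts.append(softmax(Z) if is_output else np.maximum(0, Z))
    return acts

def compute_cost(A_out, Y):
    return -np.sum(Y * np.log(A_out + 1e-12)) / Y.shape[1]

def backward(Y, Ws, acts):
    """Gradients for every layer, starting from dZ = A_out - Y."""
    m = Y.shape[1]
    dZ = acts[-1] - Y
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        dWs[l] = acts[l] @ dZ.T / m
        dbs[l] = dZ.sum(axis=1, keepdims=True) / m
        if l > 0:
            # relu output is positive exactly where its input was positive
            dZ = (Ws[l] @ dZ) * (acts[l] > 0)
    return dWs, dbs

def train(X, Y, dims, num_iterations=300, learning_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_layers = len(dims) - 1
    Ws = [rng.standard_normal((dims[l], dims[l + 1])) * 0.3 for l in range(n_layers)]
    bs = [np.zeros((dims[l + 1], 1)) for l in range(n_layers)]
    costs = []
    for i in range(num_iterations):
        acts = forward(X, Ws, bs)                    # forward propagation
        costs.append(compute_cost(acts[-1], Y))      # compute cost
        dWs, dbs = backward(Y, Ws, acts)             # back prop
        Ws = [W - learning_rate * dW for W, dW in zip(Ws, dWs)]  # update
        bs = [b - learning_rate * db for b, db in zip(bs, dbs)]
    return Ws, bs, costs

# Made-up data: three Gaussian blobs in 4 dimensions
rng = np.random.default_rng(42)
m, n_features, n_classes = 150, 4, 3
labels = np.arange(m) % n_classes
centers = rng.standard_normal((n_classes, n_features)) * 2.0
X = (centers[labels] + 0.5 * rng.standard_normal((m, n_features))).T
Y = np.eye(n_classes)[:, labels]

Ws, bs, costs = train(X, Y, dims=[n_features, 8, 6, n_classes])
print(f"first cost {costs[0]:.4f}, last cost {costs[-1]:.4f}")
```

For MNIST the same loop would be called with dims [784, 50, 20, 10], though plain gradient descent will need many more iterations there.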


Using the above pseudo code, we should see the cost decreasing as the iterations progress. For me (using Adam optimization with mini batches rather than plain gradient descent), the cost per epoch is given below. With plain gradient descent, the NN converges slowly and can take up to 1000 epochs.

Cost after epoch 0: 7.634752
Cost after epoch 10: 4.133708
Cost after epoch 20: 9.008277
Cost after epoch 30: 1.945165
Cost after epoch 40: 3.374583
Cost after epoch 50: 0.216257
Cost after epoch 60: 0.603547
Cost after epoch 70: 0.476524
Cost after epoch 80: 0.193259
Cost after epoch 90: 0.025203
Cost after epoch 100: 0.327139
Cost after epoch 110: 0.264492
Cost after epoch 120: 1.689923
Cost after epoch 130: 0.035752
Cost after epoch 140: 0.082334
Cost after epoch 150: 0.010703
Cost after epoch 160: 0.081757
Cost after epoch 170: 0.095393
Cost after epoch 180: 0.053605
Cost after epoch 190: 0.041417
Accuracy Train: 1.0
Accuracy Test: 0.9688
Accuracy on the test data is ~97%, which is not the best, but I hope you now have the approach for building a multi class classifier using neural networks. I have not added any source code to this post, but interested people can contact me to get it and get a real feel for how the code works. The difference between my approach and the one taught by Andrew Ng for MCC is that he used TensorFlow for classification, while I have not used any framework.
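For completeness, accuracy figures like the ones above can be computed by picking the most probable class for each example and comparing it with the true label. This is a minimal sketch with made-up predictions; the function name `accuracy` and the (classes, examples) layout are assumptions.

```python
import numpy as np

def accuracy(A3, Y):
    """Fraction of examples whose most probable class matches the one-hot label."""
    predictions = np.argmax(A3, axis=0)
    targets = np.argmax(Y, axis=0)
    return np.mean(predictions == targets)

# Made-up softmax outputs for 4 examples and 3 classes
A3 = np.array([[0.7, 0.2, 0.1, 0.6],
               [0.2, 0.5, 0.3, 0.3],
               [0.1, 0.3, 0.6, 0.1]])
# True labels are 0, 1, 2, 1, so only the last example is misclassified
Y = np.array([[1, 0, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

acc = accuracy(A3, Y)
print(acc)
```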

Let me know of any questions or doubts, and stay tuned for the next post.