Learn how to Build Neural Networks from Scratch in Python for Digit Recognition

Andrew Ng’s machine learning course continues to be a stepping stone and a gateway for thousands of aspiring data scientists. But the question of how to implement his teachings in modern-day languages has often been an obstacle for people.

And that’s why I continue penning down my thoughts on each week’s lesson and how to implement all his teachings in Python.

In my last post, we saw how a simple algorithm like logistic regression can be used to recognize handwritten digits. We got an accuracy of 95.08%! However, keep in mind that Logistic Regression is a linear classifier and hence cannot form complex boundaries.

What we’ll cover in this post

So in this blog post, we will learn how a neural network can be used for the same task. As neural networks can fit more complex non-linear boundaries, we should see an increase in our classifier accuracy too.

This part of the post is based on Andrew Ng’s Machine Learning course week 5 content. You can access the programming exercises and the dataset here.

Here I am not going to explain the concepts of Backpropagation and the like, because it would deviate from the goal of providing you the Pythonic translation of the course. Each concept can quickly become a blog post in itself and I honestly think Andrew Ng has done a very good job at explaining the concepts.

Before starting on the programming exercise, I strongly recommend watching the video lectures and completing the review questions for the associated topics.

1. Feedforward Propagation

We first implement feedforward propagation for the neural network using the weights we are given. Then we will implement the backpropagation algorithm to learn the parameters ourselves. Here we use the terms weights and parameters interchangeably.

1.1 Visualizing the data

Each training example is a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location. The 20 by 20 grid of pixels is “unrolled” into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix X. This gives us a 5000 by 400 matrix X where every row is a training example for a handwritten digit image. The second part of the training set is a 5000-dimensional vector y that contains labels for the training set.

from scipy.io import loadmat
import numpy as np
import scipy.optimize as opt
import pandas as pd
import matplotlib.pyplot as plt
# reading the data
data = loadmat('ex4data1.mat')
X = data['X']
y = data['y']
# visualizing the data
_, axarr = plt.subplots(10,10,figsize=(10,10))
for i in range(10):
    for j in range(10):
        # show a randomly chosen example, reshaped back into a 20x20 image
        axarr[i,j].imshow(X[np.random.randint(X.shape[0])].\
        reshape((20,20), order = 'F'))
        axarr[i,j].axis('off')

1.2 Model Representation

Our neural network has 3 layers: an input layer, a hidden layer and an output layer. Recall that the inputs are 20 x 20 grayscale images “unrolled” to form 400 input features, which we feed into the neural network. So our input layer has 400 neurons. The hidden layer has 25 neurons, and the output layer has 10 neurons corresponding to the 10 digits (or classes) our model predicts. The +1 in the above figure represents the bias term.

We have been provided with a set of already trained network parameters. These are stored in ex4weights.mat and will be loaded into theta1 and theta2 followed by unrolling into a vector nn_params. The parameters have dimensions that are sized for a neural network with 25 units in the second layer and 10 output units (corresponding to the 10 digit classes).

weights = loadmat('ex4weights.mat')
theta1 = weights['Theta1'] #Theta1 has size 25 x 401
theta2 = weights['Theta2'] #Theta2 has size 10 x 26
nn_params = np.hstack((theta1.ravel(order='F'), theta2.ravel(order='F')))    #unroll parameters
# neural network hyperparameters
input_layer_size = 400
hidden_layer_size = 25
num_labels = 10
lmbda = 1
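
Because the weights are unrolled column by column (order='F'), they have to be reshaped with the same order to get the matrices back. Here is a small sketch of that round trip with toy matrices. The names toy_theta1 and toy_theta2 and their shapes are assumptions made for illustration, not the real weights:

```python
import numpy as np

# Toy stand-ins for Theta1 (2 x 3) and Theta2 (1 x 3); the values are arbitrary
toy_theta1 = np.arange(6, dtype=float).reshape(2, 3)
toy_theta2 = np.arange(6, 9, dtype=float).reshape(1, 3)

# Unroll column by column into one flat vector, as done for nn_params
toy_params = np.hstack((toy_theta1.ravel(order='F'), toy_theta2.ravel(order='F')))

# Roll back: the first 2*3 entries belong to theta1, the rest to theta2
split = toy_theta1.size
rolled_theta1 = toy_params[:split].reshape(toy_theta1.shape, order='F')
rolled_theta2 = toy_params[split:].reshape(toy_theta2.shape, order='F')
```

Reshaping with a different order than the one used for unrolling would silently scramble the weights, so the two calls need to agree.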

1.3 Feedforward and cost function

First we will implement the cost function, followed by the gradient for the neural network (for which we use the backpropagation algorithm). Recall that the cost function for the neural network with regularization is

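$$J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}\log\left(h(x^{(i)})_k\right) - \left(1-y_k^{(i)}\right)\log\left(1-h(x^{(i)})_k\right)\right] + \frac{\lambda}{2m}\left[\sum_{j=1}^{25}\sum_{k=1}^{400}\left(\Theta^{(1)}_{j,k}\right)^2 + \sum_{j=1}^{10}\sum_{k=1}^{25}\left(\Theta^{(2)}_{j,k}\right)^2\right]$$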

where h(x^(i)) is computed as shown in Figure 2 and K = 10 is the total number of possible labels. Note that h(x^(i)) = a^(3) is the vector of activations of the output units. Also, whereas the original labels (in the variable y) were 1, 2, …, 10, for the purpose of training a neural network we need to recode the labels as vectors containing only the values 0 or 1, such that

y = [1, 0, 0, …, 0] for label 1, y = [0, 1, 0, …, 0] for label 2, and so on up to y = [0, 0, 0, …, 1] for label 10.

This process is called one-hot encoding. We do this using the get_dummies function from the pandas library.
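
To make the encoding concrete, here is a minimal sketch on a toy label column. The array toy_y and its three classes are assumptions made for illustration; the real y has 5000 rows and 10 classes:

```python
import numpy as np
import pandas as pd

# Toy label column shaped like y in the dataset (m x 1), with labels 1..3
toy_y = np.array([[1], [3], [2], [3]])

# get_dummies creates one indicator column per distinct label, in sorted order
y_onehot = pd.get_dummies(toy_y.flatten()).to_numpy(dtype=float)
print(y_onehot)
```

Note that get_dummies only creates columns for labels that actually occur, so this approach relies on every class appearing at least once in y.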

sigmoid function

def sigmoid(z):
    return 1/(1+np.exp(-z))

cost function
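
Here is a minimal sketch of what nnCostFunc could look like. It assumes the parameters were unrolled column by column (order='F') as above and that the labels run from 1 to num_labels. For self-containment, the one-hot step is done with a plain NumPy comparison instead of get_dummies:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def nnCostFunc(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lmbda):
    # Roll the flat vector back into the two weight matrices (column-major)
    split = hidden_layer_size * (input_layer_size + 1)
    theta1 = nn_params[:split].reshape((hidden_layer_size, input_layer_size + 1), order='F')
    theta2 = nn_params[split:].reshape((num_labels, hidden_layer_size + 1), order='F')

    m = X.shape[0]
    ones = np.ones((m, 1))

    # Feedforward: prepend the bias unit before each layer
    a1 = np.hstack((ones, X))
    a2 = np.hstack((ones, sigmoid(a1 @ theta1.T)))
    h = sigmoid(a2 @ theta2.T)  # m x num_labels

    # One-hot labels (assumes labels are 1..num_labels)
    y_onehot = (np.reshape(y, (-1, 1)) == np.arange(1, num_labels + 1)).astype(float)

    # Cross-entropy summed over examples and classes
    cost = -np.sum(y_onehot * np.log(h) + (1 - y_onehot) * np.log(1 - h)) / m

    # Regularization skips the bias column of each matrix
    reg = (lmbda / (2 * m)) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
    return cost + reg
```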

Calling nnCostFunc with the given weights gives us the cost.

nnCostFunc(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lmbda)

You should see that the cost is about 0.383770.

2. Backpropagation

In this part of the exercise, you will implement the backpropagation algorithm to compute the gradients for the neural network. Once you have computed the gradient, you will be able to train the neural network by minimizing the cost function using an advanced optimizer such as fmincg.

2.1 Sigmoid gradient

We will first implement the sigmoid gradient function. The gradient of the sigmoid function can be computed as g′(z) = g(z)(1 − g(z)), where g(z) is the sigmoid function:

def sigmoidGrad(z):
    return np.multiply(sigmoid(z), 1-sigmoid(z))

2.2 Random initialization

When training neural networks, it is important to randomly initialize the parameters for symmetry breaking. Here we randomly initialize parameters named initial_theta1 and initial_theta2 corresponding to hidden layer and output layer and unroll into a single vector as we did earlier.
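
Here is a minimal sketch of the initialization. The helper name randInitializeWeights and the range epsilon = 0.12 are assumptions made here; any small symmetric range around zero should serve the same purpose:

```python
import numpy as np

def randInitializeWeights(L_in, L_out, epsilon=0.12):
    # Uniform values in [-epsilon, epsilon); the extra column is for the bias unit
    return np.random.rand(L_out, L_in + 1) * 2 * epsilon - epsilon

input_layer_size, hidden_layer_size, num_labels = 400, 25, 10

initial_theta1 = randInitializeWeights(input_layer_size, hidden_layer_size)  # 25 x 401
initial_theta2 = randInitializeWeights(hidden_layer_size, num_labels)        # 10 x 26

# Unroll into a single vector, column by column as before
nn_initial_params = np.hstack((initial_theta1.ravel(order='F'),
                               initial_theta2.ravel(order='F')))
```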

2.3 Backpropagation

Backpropagation is not such a complicated algorithm once you get the hang of it.
I strongly urge you to watch Andrew’s videos on backprop multiple times.

In summary, we do the following by looping through every training example:
1. Run forward propagation to get the output activation a3.
2. Calculate the error term d3 by subtracting the actual output (the one-hot label) from our calculated output a3.
3. For the hidden layer, calculate the error term d2 as d2 = (theta2ᵀ · d3) * sigmoidGrad(z2), where * is the element-wise product and the component belonging to the bias unit is dropped.
4. Accumulate the gradients in delta1 and delta2.
5. Obtain the gradients for the neural network by dividing the accumulated gradients (of step 4) by m.
6. Add the regularization terms to the gradients.
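
The steps above can be sketched as follows. This is one possible nnGrad, assuming column-major unrolling and labels 1..num_labels, with the one-hot step done in plain NumPy; it is meant as an illustration rather than a tuned implementation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoidGrad(z):
    s = sigmoid(z)
    return s * (1 - s)

def nnGrad(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lmbda):
    split = hidden_layer_size * (input_layer_size + 1)
    theta1 = nn_params[:split].reshape((hidden_layer_size, input_layer_size + 1), order='F')
    theta2 = nn_params[split:].reshape((num_labels, hidden_layer_size + 1), order='F')

    m = X.shape[0]
    y_onehot = (np.reshape(y, (-1, 1)) == np.arange(1, num_labels + 1)).astype(float)
    delta1 = np.zeros(theta1.shape)
    delta2 = np.zeros(theta2.shape)

    for i in range(m):
        # Step 1: forward propagation for one example
        a1 = np.concatenate(([1.0], X[i]))
        z2 = theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(theta2 @ a2)
        # Step 2: output error
        d3 = a3 - y_onehot[i]
        # Step 3: hidden-layer error (bias column of theta2 is skipped)
        d2 = (theta2[:, 1:].T @ d3) * sigmoidGrad(z2)
        # Step 4: accumulate
        delta1 += np.outer(d2, a1)
        delta2 += np.outer(d3, a2)

    # Step 5: average over the training set
    grad1 = delta1 / m
    grad2 = delta2 / m
    # Step 6: regularize everything except the bias column
    grad1[:, 1:] += (lmbda / m) * theta1[:, 1:]
    grad2[:, 1:] += (lmbda / m) * theta2[:, 1:]

    return np.hstack((grad1.ravel(order='F'), grad2.ravel(order='F')))
```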

By the way, the for-loop in the above code can be eliminated by using a fully vectorized implementation. But for those who are new to backprop, it is okay to use a for-loop to gain a much better understanding. Running the above function with the initial parameters gives nn_backprop_Params, which we will use while performing gradient checking.

nn_backprop_Params = nnGrad(nn_initial_params, input_layer_size, hidden_layer_size, num_labels, X, y, lmbda)

2.4 Gradient checking

Why do we need gradient checking? To make sure that our backprop algorithm has no bugs in it and works as intended. We can approximate the derivative of our cost function with the two-sided difference (J(θ + ε) − J(θ − ε)) / (2ε), where ε is a small value (for example 10⁻⁴).

The gradients computed using backprop and numerical approximation should agree to at least 4 significant digits to make sure that our backprop implementation is bug free.
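
Here is a minimal sketch of the numerical approximation, written as a generic helper so that it can be tried on a toy function. The name numerical_gradient and its cost_fn argument are hypothetical and introduced only for illustration:

```python
import numpy as np

def numerical_gradient(cost_fn, params, indices, eps=1e-4):
    """Central-difference estimate of d cost / d params[i] for each i in indices."""
    estimates = []
    for i in indices:
        # Perturb only the i-th parameter
        step = np.zeros_like(params, dtype=float)
        step[i] = eps
        cost_high = cost_fn(params + step)
        cost_low = cost_fn(params - step)
        estimates.append((cost_high - cost_low) / (2 * eps))
    return np.array(estimates)

# Toy check on J(p) = sum(p^2), whose exact gradient is 2p
toy_params = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(lambda p: np.sum(p ** 2), toy_params, indices=[0, 1, 2])
```

For the network, cost_fn could be a lambda that calls nnCostFunc with the remaining arguments fixed, and the estimates would then be compared against the matching entries of nn_backprop_Params. Checking a handful of randomly chosen indices rather than all parameters keeps the check fast.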

checkGradient(nn_initial_params,nn_backprop_Params,input_layer_size, hidden_layer_size, num_labels,X,y,lmbda)
outputs of gradient check

2.5 Learning parameters using fmincg

After you have successfully implemented the neural network cost function and gradient computation, the next step is to use fmincg (in Python, scipy.optimize.fmin_cg fills this role) to learn a good set of parameters for the neural network. theta_opt contains the unrolled parameters that we just learnt, which we roll back to get theta1_opt and theta2_opt.
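
Here is a sketch of this training step using scipy.optimize.fmin_cg. The wrapper train_network and the choice of maxiter=50 are assumptions made here. The cost and gradient functions are passed in as arguments and are expected to have the same signatures as nnCostFunc and nnGrad:

```python
import numpy as np
import scipy.optimize as opt

def train_network(cost_fn, grad_fn, initial_params, input_layer_size,
                  hidden_layer_size, num_labels, X, y, lmbda, maxiter=50):
    # Extra arguments forwarded to both the cost and the gradient function
    args = (input_layer_size, hidden_layer_size, num_labels, X, y, lmbda)
    theta_opt = opt.fmin_cg(f=cost_fn, x0=initial_params, fprime=grad_fn,
                            args=args, maxiter=maxiter, disp=False)

    # Roll the optimized vector back into the two weight matrices
    split = hidden_layer_size * (input_layer_size + 1)
    theta1_opt = theta_opt[:split].reshape(
        (hidden_layer_size, input_layer_size + 1), order='F')
    theta2_opt = theta_opt[split:].reshape(
        (num_labels, hidden_layer_size + 1), order='F')
    return theta1_opt, theta2_opt
```

With the functions from this post, the call would look like train_network(nnCostFunc, nnGrad, nn_initial_params, input_layer_size, hidden_layer_size, num_labels, X, y.flatten(), lmbda). More iterations generally fit the training set more closely, at the cost of a longer run.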

2.6 Prediction using learned parameters

It’s time to see how well our newly learned parameters perform by calculating the accuracy of the model. Recall that when we used a linear classifier like Logistic Regression, we got an accuracy of 95.08%. The neural network should give us better accuracy.
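
Here is a minimal sketch of a predict function matching the call below. It assumes the same feedforward pass as in the cost function and uses y only to read off the number of examples:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(theta1, theta2, X, y):
    m = len(y)  # y is only used for the number of examples
    ones = np.ones((m, 1))
    # Feedforward pass with a bias unit prepended to each layer
    a1 = np.hstack((ones, X))
    a2 = np.hstack((ones, sigmoid(a1 @ theta1.T)))
    h = sigmoid(a2 @ theta2.T)
    # argmax is 0-based while the labels run from 1 to 10
    return np.argmax(h, axis=1) + 1
```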

pred = predict(theta1_opt, theta2_opt, X, y)
np.mean(pred == y.flatten()) * 100

This should give a value of about 96.5% (this may vary by about 1% due to the random initialization). Note that tweaking the hyperparameters may yield even better accuracy.

End Notes

We just saw how neural networks can be used to perform complex tasks like digit recognition, and in the process also got to know the backpropagation algorithm.

Thanks for making it this far. If you liked my work, give me a clap (or several claps).