Digit Recognition from 0–9 using Deep Neural Network from scratch

--

In machine learning, Artificial Neural Networks (ANNs) showcase the power of statistics and mathematics to solve complex, non-linear problems. ANNs have been part of machine learning and artificial intelligence for a long time, but recent improvements in computational power and the availability of big data are showing just how powerful they are. They can be applied to supervised and unsupervised problems, classification and regression, computer vision, and more. In this article, we shall implement an ANN from scratch and apply it to the simple problem of recognizing digits from 0–9.

A neural network is similar to logistic regression (a perceptron) but with more layers and hidden units. Here is what a single-layer perceptron (logistic regression) looks like:

Here x1, x2, x3 are the input features, w1, w2, w3 are the corresponding parameters, and b is the bias. Then z = w1·x1 + w2·x2 + w3·x3 + b, and g(z) is the activation function. There are many activation functions available, such as sigmoid, tanh, relu and leaky relu. If the problem is a classification problem, the most common choice for the output activation is sigmoid. The formulae and plots of these activation functions are:

Sigmoid: g(z) = 1 / (1 + e⁻ᶻ)

Tanh: g(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)

Relu (Rectified Linear Unit): g(z) = max(0, z)
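
To get a feel for these curves, here is a minimal sketch (assuming numpy and matplotlib, which are also used later in this article) that plots all three activations; it is only illustrative and not part of the model we build below.

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-5, 5, 200)
plt.plot(z, 1/(1 + np.exp(-z)), label='sigmoid')   # squashes z into (0, 1)
plt.plot(z, np.tanh(z), label='tanh')              # squashes z into (-1, 1)
plt.plot(z, np.maximum(0, z), label='relu')        # passes positive z, zeros out the rest
plt.legend()
plt.xlabel('z')
plt.ylabel('g(z)')
plt.show()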

Similar to the perceptron model, a simple one-layer neural network looks like this:

The input layer is where the input features (X) are fed in. Each hidden layer computes a weighted sum of its inputs with the parameters (W), adds a bias (b), and passes the result through an activation function; there is always a bias for each hidden layer. The output of the neural network is the weighted sum of the outputs of the last hidden layer, again followed by an activation function. For a two-class classification problem, the output layer gives the probability that the given training example belongs to the class (cᵢ).

Similarly, a 4 layered deep neural network looks like this:

Now for a single-layered neural network, at hidden layer:

Z₁ = W₁·X + b₁, where Z₁ is the weighted sum of the inputs and b₁ is the bias.

X is the input matrix where the training examples are stacked as columns, so the dimensions of X are (Nₓ, m), where Nₓ is the number of features and m is the number of training examples. If the number of units in the hidden layer is N₁, then the dimensions of W₁ are (N₁, Nₓ), and so the dimensions of Z₁ are (N₁, m). b₁ is a column vector of shape (N₁, 1); when we add it in Python, it is applied to every column by broadcasting. The output of the hidden layer is the result of passing Z₁ through the activation function. Similarly, for an N-layer neural network, the dimensions of Wₙ at any layer n are (Nₙ, Nₙ₋₁).
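
As a quick illustration of these shapes (the numbers here are made up, not the digits data), a short numpy check might look like this:

import numpy as np

X  = np.random.randn(64, 5)      # (Nx, m): 64 features, 5 training examples
W1 = np.random.randn(8, 64)      # (N1, Nx): a hidden layer with 8 units
b1 = np.zeros((8, 1))            # (N1, 1): added to every column by broadcasting
Z1 = np.dot(W1, X) + b1
print(Z1.shape)                  # (8, 5), i.e. (N1, m)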

Here g(Z) can be any activation function such as relu or sigmoid. In practice, we usually choose relu for the hidden layers and sigmoid for the output layer, because using sigmoid in the hidden layers leads to vanishing gradients: the derivative of sigmoid is dg(z) = g(z)·(1 − g(z)), so if z is a large positive number, g(z) tends to 1 and the derivative becomes 0; similarly, if z is a large negative number, g(z) tends to 0 and the derivative again becomes 0.
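
A tiny numeric demo of this saturation (illustrative only): the sigmoid derivative collapses towards 0 for large |z|, while the relu derivative stays 1 for any positive z.

import numpy as np

def sig(z):
    return 1/(1 + np.exp(-z))

for z in [-10.0, 0.0, 10.0]:
    # sigmoid derivative: ~4.5e-05, 0.25, ~4.5e-05 ; relu derivative: 0.0, 0.0, 1.0
    print(z, sig(z) * (1 - sig(z)), float(z > 0))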

Steps involved in Developing a NN model:

  1. Architecture: Choose the neural network architecture, i.e. the number of hidden layers and the number of units in each hidden layer. These two are hyperparameters that need to be tuned for better results and improved accuracy.
  2. Parameter Initialization: Initialize the parameters randomly. While implementing, we use dictionaries to store the parameters and gradients, i.e. W₁ is equivalent to parameters["W" + str(1)] and dW₁ is equivalent to grads["dW" + str(1)]. In Python, to initialize the parameters, we use:
parameters['W' + str(l)] = np.random.randn(layer_dims[l],layer_dims[l-1])*0.01
parameters['b' + str(l)] = np.zeros((layer_dims[l],1))

3. Forward propagation: Once the weights are initialized, the input X is propagated through the hidden units at each layer to produce the output yhat. This usually involves two steps:

3.a: Linear Forward: At any layer l,

Z[l] = W[l]·A[l-1] + b[l], where A[0] = X.

3.b: Linear-Activation Forward: The result of the Linear Forward step is passed through an activation function to produce the output of the layer. The activation function at the hidden layers is generally chosen as relu to avoid the vanishing gradient problem, and the activation function at the output layer is sigmoid, since its output always lies in [0, 1] and can be interpreted as a probability.

4. Compute Cost: We use the same cost function as in logistic regression (the cross-entropy cost). Once we have the final layer's output A[L],

Cost = (-1/m) · Σ ( y·log(A[L]) + (1 − y)·log(1 − A[L]) ), where y is the true output, A[L] is the output of the network, and m is the number of training examples over which the sum is taken.
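
As a quick sanity check of this formula, here is a toy example with two training examples and made-up values for y and A[L] (not taken from the digits data):

import numpy as np

Y  = np.array([[1., 0.],
               [0., 1.]])        # true one-hot labels, one column per example
AL = np.array([[0.9, 0.2],
               [0.1, 0.8]])      # made-up network outputs
m = Y.shape[1]
cost = -(1/m) * np.sum(Y*np.log(AL) + (1-Y)*np.log(1-AL))
print(cost)                      # roughly 0.33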

5. Backward propagation: This is the most important step in training, as it gives the gradients of the parameters and biases that are used in the gradient descent step to update the weights. This step again contains two substeps:

5.a. Linear backward: Suppose we have already calculated the derivative dZ[l]; we want to compute dW[l], db[l] and dA[l-1].

We use these three formulae:

dW[l] = (1/m) · dZ[l] · A[l-1]ᵀ
db[l] = (1/m) · Σ dZ[l] (the sum taken over the training examples)
dA[l-1] = W[l]ᵀ · dZ[l]

5.b. Linear-Activation backward: Now, to compute dZ[l], we use the following formula:

dZ[l] = dA[l] * g'(Z[l])

where g'(Z[l]) is the derivative of g(Z), which can be relu or sigmoid depending on the layer: if the layer for which we are computing dZ[l] is the final output layer, then g is sigmoid; for the other layers, g is relu.

6. Update Weights:

The final step of the training process is to update the weights using the gradients obtained during backward propagation. We use gradient descent, and the update equations are:

W[l] = W[l] − α·dW[l]
b[l] = b[l] − α·db[l]

where α is the learning rate, a hyperparameter that needs to be tuned properly to achieve good performance.

Implementation from Scratch:

Data preprocessing Steps:

  1. Importing libraries
import numpy as np
import matplotlib.pyplot as plt

2. Load the digits dataset from sklearn.datasets:

from sklearn.datasets import load_digits
digits=load_digits()

3. Let’s see the first image in the dataset:

import pylab as pl
pl.gray()
pl.matshow(digits.images[0])
pl.show()

The result will be:

4. Let’s see how our computer looks at this image:

digits.images[0]"""array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]]) """

So each image is an 8x8 pixel image.

5. The target digit of this image is:

digits.target[0]
"""

0
"""

6. Let’s look at first 15 digits images and their corresponding targets:

images_and_labels = list(zip(digits.images, digits.target))
plt.figure(figsize=(5, 5))
for index, (image, label) in enumerate(images_and_labels[:15]):
    plt.subplot(3, 5, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('%i' % label)

7. Let’s define some variables:

#Define variables
n_samples=len(digits.images)
print("Number of samples in the data set is :"+ str(n_samples))

x=digits.images.reshape((n_samples,-1))
print("Shape of input matrix x is : "+str(x.shape))
y=digits.target
print("Shape of target vector y is :"+str(y.shape))
"""
Number of samples in the data set is :1797
Shape of input matrix x is : (1797, 64)
Shape of target vector y is : (1797,)
"""

8. Split the data into training and testing sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

9. Feature Scaling:

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

10. Convert each training example to a column vector.

X_train=X_train.T
X_test=X_test.T
y_train=y_train.reshape(y_train.shape[0],1)
y_test=y_test.reshape(y_test.shape[0],1)
y_train=y_train.T
y_test=y_test.T

11. The output layer of our model will have 10 units; the predicted digit is the index of the unit with the highest value. For example, if the 7th unit in the output layer has the highest value, the predicted digit is 6 (as the 7th unit is at index 6). So we need to convert each training and testing label to a one-hot column vector of shape (10, 1):

Y_train_ = np.zeros((10, y_train.shape[1]))
for i in range(y_train.shape[1]):
    Y_train_[y_train[0, i], i] = 1
Y_test_ = np.zeros((10, y_test.shape[1]))
for i in range(y_test.shape[1]):
    Y_test_[y_test[0, i], i] = 1

With this step we are done with data preprocessing; now let's implement the artificial neural network from scratch.

Implementation:

  1. Initialize the parameters for Neural network: It includes all W’s and b’s for every hidden layer:
# initialize parameters for deep neural networks
def initialize_parameters_deep(layer_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))

        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))
    return parameters

Here layer_dims is a list. For example, if we want a network with one hidden layer of 6 units for this dataset, we would pass [64, 6, 10] as layer_dims: 64 is the number of input features, and the final 10 is the number of units in the output layer (one per digit).
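
With the dimensions used later in this article ([64, 60, 10, 10]), a quick check of the returned shapes might look like this:

params = initialize_parameters_deep([64, 60, 10, 10])
for name, value in sorted(params.items()):
    print(name, value.shape)
# W1 (60, 64), W2 (10, 60), W3 (10, 10)
# b1 (60, 1),  b2 (10, 1),  b3 (10, 1)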

2. Forward propagation without activation for a single layer (linear_forward):

def linear_forward(A, W, b):
    Z = np.dot(W, A) + b
    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)

    return Z, cache

3. Useful activation functions and their derivatives

# useful activation functions and their derivatives
def sigmoid_(Z):
    return 1/(1 + np.exp(-Z))

def relu_(Z):
    return Z * (Z > 0)

def drelu_(Z):
    return 1. * (Z > 0)

def dsigmoid_(Z):
    return sigmoid_(Z) * (1 - sigmoid_(Z))

# wrappers that also return Z as the activation cache
def sigmoid(Z):
    return sigmoid_(Z), Z

def relu(Z):
    return relu_(Z), Z

4. Forward propagation with activation for a single layer: if the activation argument is "sigmoid", it applies the sigmoid activation function; otherwise it applies relu.

def linear_activation_forward(A_prev, W, b, activation):
    if activation == "sigmoid":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)

    elif activation == "relu":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)

    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache

5. Put all together and implement forward propagation for all L layers in the network:

# implementation of forward propagation for an L layer neural network
def L_model_forward(X, parameters):
    caches = []
    A = X
    L = len(parameters) // 2
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)], "relu")
        caches.append(cache)
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], "sigmoid")
    caches.append(cache)
    return AL, caches

So L_model_forward takes the input X and the parameters as arguments, and outputs the final prediction vector AL of the output layer along with cache information that is used for backpropagation. For the first L-1 layers we use relu as the activation function, and for the last layer we use the sigmoid activation function.
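
For instance, a quick shape check on the preprocessed digits data (using the layer dimensions from later in this article) might look like this; AL should come out as (10, m), one score per digit class for each of the m training examples:

params = initialize_parameters_deep([64, 60, 10, 10])
AL, caches = L_model_forward(X_train, params)
print(AL.shape)       # (10, m)
print(len(caches))    # 3, one cache per layer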

6. Next step is to compute the cost function for the output AL:

# cost function
def compute_cost(AL, Y):
    m = Y.shape[1]
    cost = -(1/m) * np.sum(Y*np.log(AL) + (1-Y)*np.log(1-AL))
    cost = np.squeeze(cost)
    assert(cost.shape == ())
    return cost

7. Now let’s move to the final part which is backpropagation. Similar to forward propagation, let’s go step by step starting from linear backward for one particular layer:

def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = (1/m) * np.dot(dZ, A_prev.T)
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)

    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)

    return dA_prev, dW, db

8. To compute dZ in the above function, we need activation function specific backward propagation method:

For relu layer:

def relu_backward(dA, activation_cache):
    return dA * drelu_(activation_cache)

Similarly for sigmoid layer:

def sigmoid_backward(dA, activation_cache):
    return dA * dsigmoid_(activation_cache)

9. For one particular layer, overall backward propagation function is implemented as:

def linear_activation_backward(dA, cache, activation):
    linear_cache, activation_cache = cache
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)

    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db

Here dA is the derivative of a particular layer's activation. The linear_activation_backward function takes dA and the corresponding cache from linear_activation_forward, and outputs the derivative of the previous layer's activations along with the gradients dW and db. These dW and db are then used to adjust the parameters W and b using gradient descent.

10. Now for L layers, backward propagation is computed as follows:

# back propagation for L layers
def L_model_backward(AL, Y, caches):
    grads = {}
    L = len(caches)
    m = AL.shape[1]

    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    current_cache = caches[L-1]
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, "sigmoid")

    for l in reversed(range(L-1)):
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l+1)], current_cache, "relu")
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
    return grads

11. With backpropagation successfully implemented, we need to adjust the weights using the computed derivatives (dW and db).

#update parameters
def update_parameters(parameters, grads, learning_rate):
    L = len(parameters) // 2
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
    return parameters

12. Now, putting all the above steps together, we shall build a neural network with two hidden layers (60 and 10 units) plus a 10-unit output layer, and train the model.

# N layer neural network
n_x = X_train.shape[0]   # number of input features (64)
layers_dims = [n_x, 60, 10, 10]

def L_layer_model(X, Y, layers_dims, learning_rate=0.005, num_iterations=3000, print_cost=False):
    np.random.seed(1)
    costs = []

    parameters = initialize_parameters_deep(layers_dims)

    for i in range(0, num_iterations):
        AL, caches = L_model_forward(X, parameters)
        cost = compute_cost(AL, Y)
        grads = L_model_backward(AL, Y, caches)
        parameters = update_parameters(parameters, grads, learning_rate)
        if print_cost and i % 1000 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)

    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per thousands)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

Training, Prediction, and Visualization:

We have successfully implemented all the required functions: forward propagation, the cost function, backward propagation, and the parameter update. Now let's train the model and check how the cost decreases with every iteration.

parameters = L_layer_model(X_train, Y_train_, layers_dims, num_iterations = 50000, print_cost = True)

We can see that the cost is decreasing after every 1000 iterations. This shows that our model is training well, but let's see how it predicts unseen data and go over some implementation tips.

Prediction:

def predict_L_layer(X, parameters):
    AL, caches = L_model_forward(X, parameters)
    prediction = np.argmax(AL, axis=0)
    return prediction.reshape(1, prediction.shape[0])

Let's test on the training examples to check the training accuracy.

predictions_train_L = predict_L_layer(X_train, parameters)
print("Training Accuracy : " + str(np.sum(predictions_train_L == y_train) / y_train.shape[1] * 100) + " %")
""" Output
Training Accuracy : 100.0 %
"""

Wow, so it looks like our model has been trained well enough to recognize all the training set images. Now let's try our luck with the testing set images to see how accurately our model predicts unseen images.

predictions_test_L=predict_L_layer(X_test,parameters)
print("Testing Accuracy : "+ str(np.sum(predictions_test_L==y_test)/y_test.shape[1] * 100)+" %")
""" Output
Testing Accuracy : 97.22222222222221 %
"""

Hmm, not bad: our testing accuracy is about 97%, which is quite good. Let's visualize some of the predictions from the testing set and look at how accurately our model recognizes the digits from 0–9.

import random
for j in range(15):
    i = random.randint(0, n_samples - 1)
    pl.gray()
    pl.matshow(digits.images[i])
    pl.show()
    img = digits.images[i].reshape((64, 1)).T
    img = sc.transform(img)
    img = img.T
    predicted_digit = predict_L_layer(img, parameters)
    print('Predicted digit is : ' + str(predicted_digit))
    print('True digit is: ' + str(y[i]))

Great, we have implemented and tested a machine learning model that recognizes even low-quality, blurry images with good accuracy. This is a difficult task, as even humans sometimes fail to recognize some of these blurry images, and yet our model gives promising results.

Some implementation tips for training neural networks:

  1. You will probably not get these results on the very first attempt; when I trained this model for the first time, it was nowhere near the current one. So don't lose hope, and try different values of the learning rate, the number of layers, the number of units in each hidden layer, etc. (a rough learning-rate sweep is sketched after this list).
  2. In this article, we used two hidden layers with 60 and 10 units, followed by a 10-unit output layer. It's important to know how to choose the number of hidden layers and the number of units in each.
  3. Always monitor the cost function. If your cost is not decreasing with iterations, then there is surely a problem with your model.
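
Here is a rough sketch of such a sweep, reusing the functions defined above; the candidate learning rates and the reduced iteration budget are assumptions for illustration, not tuned recommendations:

for lr in [0.001, 0.005, 0.01]:
    # retrain from scratch with this learning rate and compare training accuracy
    params = L_layer_model(X_train, Y_train_, layers_dims,
                           learning_rate=lr, num_iterations=5000,
                           print_cost=True)
    preds = predict_L_layer(X_train, params)
    acc = np.sum(preds == y_train) / y_train.shape[1] * 100
    print("learning rate %.3f -> training accuracy %.2f %%" % (lr, acc))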

In fact, all these issues and implementation tips fall under the topic of "Hyperparameter Tuning and Optimization", which will be discussed in upcoming articles. There are a lot of hyperparameters that need tuning, such as the learning rate (α), the number of units in each hidden layer, the number of hidden layers, the regularization parameter (λ), etc.

This is one of the simplest applications of deep learning, and we shall cover many more in upcoming articles.

Thank you.

Full code can be found here.
