Digit Classifier using Neural Networks

Jagajith · Published in CodeX · Oct 1, 2021

Hey all! In this post, I’ll show you how to build a beginner-friendly framework for neural networks in Python. The primary objective of this code is to help novices learn the fundamentals of neural networks, and we will use it to recognize handwritten digits. Neural networks can represent complex models that form non-linear hypotheses. If this doesn’t make sense to you yet, don’t worry, this post will help you understand. If you’re unfamiliar with neural networks, read my earlier post first to learn the fundamental ideas. (Click here to navigate to my previous post.)

Model Representation

Source: Author

Our neural network is shown above. It has 3 layers → an input layer, a hidden layer and an output layer. Since we are working with pictures, our neural network cannot accept an image directly as input; instead, we must provide the pixels of the image as input (note: images are made up of pixels). To ensure that all of the pictures are the same size, we scale them to 20x20 pixels. Unrolling each image into a 1D array gives us a 400-dimensional vector, which acts as the input layer of our neural network (excluding the extra bias unit, which always outputs +1).
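To make this concrete, here is a tiny sketch of how a single 20x20 image unrolls into a 400-dimensional input vector, using a random array as a stand-in for a real image:

import numpy as np

# Hypothetical 20x20 grayscale image (random values as placeholders)
image = np.random.rand(20, 20)

# Unroll into a 400-dimensional vector -- this is one input example
x = image.reshape(-1)               # shape: (400,)

# The bias unit (+1) is prepended later, giving 401 values per example
x_with_bias = np.hstack(([1], x))   # shape: (401,)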

Let’s import the required modules and load our dataset:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
import matplotlib.image as img

# Load the MATLAB-format dataset (5000 examples of 20x20 digit images)
mat = loadmat('ex4data1.mat')
X = mat['X']
y = mat['y']
X.shape, y.shape

Let’s visualize our dataset with the following code:

# Plot a 10x10 grid of randomly chosen training images
fig, axis = plt.subplots(10, 10, figsize=(8, 8))
for i in range(10):
    for j in range(10):
        axis[i, j].imshow(
            X[np.random.randint(0, 5000), :].reshape(20, 20, order='F'), cmap='gray')
        axis[i, j].axis('off')
Examples from dataset

Sigmoid

We spoke about this at length in the earlier post, so I will skip the explanation here. Basically, sigmoid is an activation function that takes a real-valued input and squashes it to a range between 0 and 1.

def sigmoid(z):
    return 1/(1+np.exp(-z))

Forward Propagation and Cost Function

Source: Antonio Rafael Sabino Parmezan from Researchgate

The above picture shows forward propagation for one layer of a neural network. The formula for forward propagation is as follows:

Forward propagation for a 3-layer neural network

We set x (the input) as a¹, then we multiply a¹ by θ¹ (i.e., the weights w¹ depicted in the picture above) and add the bias (i.e., b, or θ₀¹). Finally, we pass the resulting weighted sum into an activation function, in our case the sigmoid. This is repeated for all 400 values in the input layer and for all the units in the hidden layer. To find good parameters, the cost function below is used:

Cost Function

The cost function looks similar to Logistic Regression’s cost function, but with an extra regularization term that penalizes large weights and helps the model generalize instead of overfitting. Minimizing this cost function is how we learn good parameters.
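To make the shapes concrete, here is a minimal sketch of one forward pass and the cost for a single example, using our layer sizes (400 → 25 → 10) with random placeholder weights and a made-up label just to illustrate the computation:

# Placeholder weights; in the real network these are learned
theta1_demo = np.random.rand(25, 401) * 0.01   # hidden layer weights (incl. bias column)
theta2_demo = np.random.rand(10, 26) * 0.01    # output layer weights (incl. bias column)

x = np.random.rand(400)                        # one unrolled 20x20 image (placeholder)

a1 = np.hstack(([1], x))                       # add bias unit -> (401,)
a2 = sigmoid(theta1_demo @ a1)                 # hidden activations -> (25,)
a2 = np.hstack(([1], a2))                      # add bias unit -> (26,)
a3 = sigmoid(theta2_demo @ a2)                 # output layer -> (10,), one score per digit

y_onehot = np.zeros(10)
y_onehot[2] = 1                                # pretend the true digit is "3"
cost = np.sum(-y_onehot * np.log(a3) - (1 - y_onehot) * np.log(1 - a3))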

Backpropagation

Backpropagation is the technique used to update the weights and biases so that the neural network’s output becomes more accurate. We moved from left to right in forward propagation, but we move from right to left in backpropagation. Let us consider a simple neural network:

Source: Author

Backward propagation is just taking the derivatives of the forward function, but working from the right. If the derivation below doesn’t make sense to you, don’t worry, that’s definitely OK; it is meant for those who are familiar with calculus.

Source: Author

The sigmoid gradient is a helpful function for computing the gradient of the sigmoid, which is a(1-a). The backpropagation formulas for our neural network are:

Source: Author
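One note before the code: the costFunction implementation below calls a sigmoidGradient helper that is not defined anywhere above. Based on the a(1-a) formula just mentioned, a minimal version looks like this:

def sigmoidGradient(z):
    # g'(z) = g(z) * (1 - g(z)), where g is the sigmoid defined earlier
    a = sigmoid(z)
    return a * (1 - a)

With that in place, here is the full cost and gradient computation: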
def costFunction(nn_params, X, y, input_layer_size, hidden_layer_size, num_labels, Lambda):
    # Unroll the parameter vector back into the two weight matrices
    Theta1 = nn_params[:((input_layer_size+1) * hidden_layer_size)].reshape(hidden_layer_size, input_layer_size+1)
    Theta2 = nn_params[((input_layer_size+1) * hidden_layer_size):].reshape(num_labels, hidden_layer_size+1)

    # Feedforward and cost function
    m = X.shape[0]
    X = np.column_stack((np.ones((m, 1)), X))   # 5000 x 401
    a2 = sigmoid(X @ Theta1.T)                  # 5000 x 25
    a2 = np.hstack((np.ones((m, 1)), a2))       # 5000 x 26
    a3 = sigmoid(a2 @ Theta2.T)                 # 5000 x 10

    # One-hot encode the labels (label i -> column i-1)
    y_matrix = np.zeros((m, num_labels))        # 5000 x 10
    for i in range(1, num_labels+1):
        y_matrix[:, i-1] = np.where(y == i, 1, 0).ravel()

    J = np.sum(-y_matrix * np.log(a3) - (1 - y_matrix) * np.log(1 - a3))
    reg = Lambda/(2*m) * (np.sum(Theta1[:, 1:]**2) + np.sum(Theta2[:, 1:]**2))

    J = (1/m) * J
    reg_J = J + reg

    # Backpropagation: accumulate gradients one example at a time
    grad1 = np.zeros(Theta1.shape)
    grad2 = np.zeros(Theta2.shape)

    for i in range(m):
        xi = X[i, :]     # 1 x 401
        a2i = a2[i, :]   # 1 x 26
        a3i = a3[i, :]   # 1 x 10

        d3 = a3i - y_matrix[i, :]
        d2 = (Theta2.T @ d3.T) * sigmoidGradient(np.hstack((1, xi @ Theta1.T)))

        grad1 = grad1 + d2[1:][:, np.newaxis] @ xi[:, np.newaxis].T
        grad2 = grad2 + d3.T[:, np.newaxis] @ a2i[:, np.newaxis].T

    grad1 = 1/m * grad1
    grad2 = 1/m * grad2

    # Regularized gradients (the bias column is not regularized)
    grad1_reg = grad1 + Lambda/m * np.hstack((np.zeros((Theta1.shape[0], 1)), Theta1[:, 1:]))
    grad2_reg = grad2 + Lambda/m * np.hstack((np.zeros((Theta2.shape[0], 1)), Theta2[:, 1:]))

    return J, grad1, grad2, reg_J, grad1_reg, grad2_reg
input_layer_size = 400
hidden_layer_size = 25
num_labels = 10

# Theta1 and Theta2 here are pre-trained weights used only to check the cost.
# The original Coursera exercise ships them in 'ex4weights.mat' (an assumption --
# adjust the file name to wherever your weights are stored).
weights = loadmat('ex4weights.mat')
Theta1 = weights['Theta1']
Theta2 = weights['Theta2']

nn_params = np.append(Theta1.flatten(), Theta2.flatten())
# Pick out J (index 0) and reg_J (index 3) from the returned tuple
J, reg_J = costFunction(nn_params, X, y, input_layer_size, hidden_layer_size, num_labels, 1)[0:4:3]

print(f"Cost at parameters (non-regularized): {J}\nCost at parameters (regularized): {reg_J}")

Random Initialization

In neural networks we should not initialize the θ’s to zeros: that makes the network symmetric (i.e., every unit detects the same features), and when we multiply our input by θ (which is zero) we always get zeros as output. So, to break symmetry (i.e., so that every unit can learn to detect a different feature, like edges, horizontal lines, etc.), we initialize the θ’s randomly. One effective strategy is to pick values for θ uniformly in the range [-ϵᵢₙᵢₜ, ϵᵢₙᵢₜ]; the code below uses ϵᵢₙᵢₜ = √6/√(L_in + L_out), which works out to roughly 0.12 for our input-to-hidden layer.

def randomInitialization(L_in, L_out):
    # Epsilon based on the layer sizes, then sample uniformly in [-epsilon, epsilon]
    epi = np.sqrt(6)/np.sqrt(L_in + L_out)
    W = np.random.rand(L_out, L_in+1) * 2*epi - epi
    return W

initial_Theta1 = randomInitialization(input_layer_size, hidden_layer_size)
initial_Theta2 = randomInitialization(hidden_layer_size, num_labels)
initial_nn_params = np.append(initial_Theta1.flatten(), initial_Theta2.flatten())
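As a quick sanity check, the unrolled parameter vector should contain (400+1)·25 + (25+1)·10 = 10,285 values, matching the slicing used inside costFunction:

expected = (input_layer_size + 1) * hidden_layer_size + (hidden_layer_size + 1) * num_labels
print(initial_nn_params.shape, expected)   # -> (10285,) 10285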

Gradient Descent

Since we have both θ₁ and θ₂ to learn, the gradient descent algorithm differs slightly from the previous ones.

Source: Author
def gradientDescent(initial_nn_params, X, y, input_layer_size, hidden_layer_size, num_labels, alpha, num_iters, Lambda):
    # Unroll the initial parameters into the two weight matrices
    Theta1 = initial_nn_params[:((input_layer_size+1) * hidden_layer_size)].reshape(hidden_layer_size, input_layer_size+1)
    Theta2 = initial_nn_params[((input_layer_size+1) * hidden_layer_size):].reshape(num_labels, hidden_layer_size+1)

    m = len(y)
    J_history = []

    for i in range(num_iters):
        nn_params = np.append(Theta1.flatten(), Theta2.flatten())
        # Use the regularized cost and gradients returned by costFunction
        cost, grad1, grad2 = costFunction(nn_params, X, y, input_layer_size, hidden_layer_size, num_labels, Lambda)[3:]
        Theta1 = Theta1 - (alpha * grad1)
        Theta2 = Theta2 - (alpha * grad2)
        J_history.append(cost)

    nn_params_final = np.append(Theta1.flatten(), Theta2.flatten())
    return nn_params_final, J_history
nn_params, J_history = gradientDescent(initial_nn_params, X, y, input_layer_size, hidden_layer_size, num_labels, 0.8, 800, 1)
Theta1 = nn_params[:((input_layer_size+1) * hidden_layer_size)].reshape(hidden_layer_size, input_layer_size+1)
Theta2 = nn_params[((input_layer_size+1) * hidden_layer_size):].reshape(num_labels, hidden_layer_size+1)
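To check that gradient descent is actually converging, it helps to plot the cost history returned by gradientDescent; a quick sketch:

# The cost should decrease steadily over the 800 iterations
plt.plot(J_history)
plt.xlabel('Iterations')
plt.ylabel('Cost J')
plt.show()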

Predictions

We can get predictions by doing forward propagation once.

def predict(Theta1, Theta2, X):
    m = X.shape[0]
    X = np.hstack((np.ones((m, 1)), X))
    a2 = sigmoid(X @ Theta1.T)
    a2 = np.hstack((np.ones((m, 1)), a2))
    a3 = sigmoid(a2 @ Theta2.T)
    # Labels in this dataset run from 1 to 10, so shift the argmax by 1
    return np.argmax(a3, axis=1) + 1

pred = predict(Theta1, Theta2, X)
print(f"Accuracy = {np.mean(pred[:, np.newaxis] == y) * 100}%")

It should show an accuracy of around 95% on the training set, which is good for classifying handwritten digits.
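To see the classifier on a single example, we can pick a random image, display it, and compare the prediction with the true label (in this dataset the digit 0 is stored as label 10); a small sketch:

idx = np.random.randint(0, X.shape[0])   # pick one of the 5000 examples

plt.imshow(X[idx, :].reshape(20, 20, order='F'), cmap='gray')
plt.axis('off')
plt.show()

print(f"Predicted: {pred[idx]}, Actual: {y[idx, 0]}")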

Conclusion

Today, we looked under the hood of Neural Networks and saw how they actually work. Then we built one from scratch using Python’s numpy, pandas and matplotlib. The dataset and the final code are uploaded to GitHub.

Check it out here Neural Networks.

If you like this post, then check out my other posts in this series:

1. What is Machine Learning?

2. What are the Types of Machine Learning?

3. Uni-Variate Linear Regression

4. Multi-Variate Linear Regression

5. Logistic Regression

6. What are Neural Networks?

7. Image Compressing with K-means Clustering

8. Dimensionality Reduction on Face using PCA

9. Detect Failing Servers on a Network using Anomaly Detection

Last Thing

If you enjoyed my article, a clap 👏 and a follow would be ⚡neuralistic⚡, and they help Medium promote this article so that others may read it. I am Jagajith and I will catch you in the next one.
