MNIST (Handwritten Digit) Classification Using a Neural Network (Step by Step) From Scratch

6 min read · Aug 31, 2023

What Is a Neural Network?

Neural networks, also known as artificial neural networks (ANNs), are a class of machine learning models that power deep learning algorithms. Their name and form are inspired by the human brain, and they loosely mimic the way biological neurons signal one another. Neural networks rely on training data to learn and improve their accuracy over time. Once trained, they become effective tools in computer science and artificial intelligence.

Let’s explore how a neural network works.

Consider every node to be a separate linear regression model, with input data (X), weights (w), a bias (b), and an output (Z). The formula looks like this:

$$Z = wX + b$$

Forward propagation is the method by which a neural network produces an output for a specific input. The final layer’s output is also referred to as the neural network’s prediction. We will go through how we evaluate predictions later in this article. These evaluations tell us whether or not our neural network needs to be improved.
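As a rough illustration of one layer's forward step, here is a minimal NumPy sketch. The layer sizes, the random inputs, and the choice of a sigmoid activation are assumptions made for the example only.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy layer: 3 input features, 2 neurons, a batch of 4 samples (one per column).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # inputs
W = rng.normal(size=(2, 3))   # one row of weights per neuron
b = np.zeros((2, 1))          # one bias per neuron, broadcast across the batch

Z = W.dot(X) + b              # linear step: Z = wX + b
A = sigmoid(Z)                # activation applied to Z

print(Z.shape, A.shape)
```

Each column of A is the layer's output for one sample, and it becomes the input of the next layer.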

Right after the final layer produces its output, we compute the cost function. The cost function measures how far our neural network’s predictions are from the desired ones: its value reflects the discrepancy between the true value and the predicted value.

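Here is one such cost function, the categorical cross-entropy (CCE):

$$J = -\frac{1}{m} \sum_{i=1}^{m} y_i \log(\hat{y}_i)$$

Here, 1/m averages the loss over the m samples, y_i is the actual output, and ŷ_i is the predicted output, so log(ŷ_i) is its logarithm. The most interesting part of this loss function is the leading negative sign: it offsets the fact that the logarithm of a value between 0 and 1 is negative (and strongly negative for values close to 0), so the loss is never negative.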
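To make the behaviour of this loss concrete, here is a small sketch that compares a confident prediction with an unsure one. The function name, the toy probabilities, and the small `eps` added inside the log are illustrative assumptions.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-8):
    """Cross-entropy averaged over m samples (rows = samples, columns = classes)."""
    m = y_true.shape[0]
    # eps guards against log(0) when a predicted probability is exactly zero
    return -np.sum(y_true * np.log(y_pred + eps)) / m

# Two samples, three classes, one-hot labels
y_true = np.array([[1, 0, 0], [0, 1, 0]])
confident = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
unsure = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]])

loss_confident = categorical_cross_entropy(y_true, confident)
loss_unsure = categorical_cross_entropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The confident prediction yields the smaller loss, and both values are positive thanks to the leading minus sign.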

Backpropagation

As a machine-learning algorithm, backpropagation performs a backward pass to adjust a neural network model’s parameters, aiming to minimize the loss.

Let’s see how it computes gradients.

Computing the gradient of the loss with respect to w2 (image: MIT 6.S191)

Computing the gradient of the loss with respect to w1 (image: MIT 6.S191)
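The chain-rule idea can be sketched on a tiny scalar network and checked against finite differences. The network shape (x -> w1 -> sigmoid -> w2 -> output), the squared-error loss, and all numeric values are assumptions chosen to keep the example short; they are not the setup used later in this article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, x, y):
    """Scalar network: z1 = w1*x, a1 = sigmoid(z1), y_hat = w2*a1, J = 0.5*(y_hat - y)^2."""
    a1 = sigmoid(w1 * x)
    y_hat = w2 * a1
    return 0.5 * (y_hat - y) ** 2

x, y, w1, w2 = 1.5, 1.0, 0.3, -0.8

# Forward pass, keeping the intermediate values
z1 = w1 * x
a1 = sigmoid(z1)
y_hat = w2 * a1

# Backward pass: multiply local derivatives along the path from J to each weight
dJ_dyhat = y_hat - y
dJ_dw2 = dJ_dyhat * a1                          # dJ/dy_hat * dy_hat/dw2
dJ_dw1 = dJ_dyhat * w2 * a1 * (1 - a1) * x      # dJ/dy_hat * dy_hat/da1 * da1/dz1 * dz1/dw1

# Numerical check with central finite differences
h = 1e-6
num_dw2 = (loss(w1, w2 + h, x, y) - loss(w1, w2 - h, x, y)) / (2 * h)
num_dw1 = (loss(w1 + h, w2, x, y) - loss(w1 - h, w2, x, y)) / (2 * h)
print(dJ_dw2, num_dw2)
print(dJ_dw1, num_dw1)
```

The gradient for w1 reuses the factor dJ/dy_hat already needed for w2, which is why the pass runs backward from the output.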

Gradient Descent

The parameters of the neural network are updated using gradient descent. This algorithm modifies the weights and biases of each layer in the network according to how they affect the cost function. Backpropagation is used to calculate how much each weight and bias in the network contributes to the cost, which tells gradient descent how to adjust them.

Take the derivative of the loss function with respect to (w, b) to compute the gradients used for loss optimization. Here J is the loss function, and the gradient with respect to w gives the direction of steepest increase in parameter space, so we step in the opposite direction. The goal is to reach the global minimum, although in practice gradient descent may settle in a local one.

Update the parameters

$$w_{new} = w - \alpha \frac{dJ}{dw}$$

Here α is the learning rate, w is the old weight, and dJ/dw is the derivative of the loss with respect to the weight. The new bias is computed the same way.
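The update rule can be tried on a one-dimensional toy problem. The quadratic J(w) = (w - 3)^2, the starting point, and the step count below are assumptions picked so the minimum is known in advance.

```python
def gradient_descent(grad, w0, alpha=0.1, n_steps=100):
    """Repeatedly apply w <- w - alpha * dJ/dw, starting from w0."""
    w = w0
    for _ in range(n_steps):
        w = w - alpha * grad(w)
    return w

# J(w) = (w - 3)^2 has its minimum at w = 3, and dJ/dw = 2 * (w - 3)
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

With alpha = 0.1 the distance to the minimum shrinks by a constant factor at each step. A learning rate that is too large would overshoot and could diverge.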

Let’s implement these ideas in code.

Initialize the parameters for each layer of the network. You are free to draw the initial weights from different distributions, for example a uniform or a normal distribution. Here I use NumPy’s randn, which samples from the standard normal distribution, and scale each weight matrix by 1/sqrt(n), where n is the size of the previous layer.

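    def initialize_parameters(self):
        np.random.seed(42)  # fixed seed so that runs are reproducible

        for l in range(1, len(self.layers_size)):
            # Standard-normal weights scaled by 1/sqrt(size of the previous layer)
            self.parameters["W" + str(l)] = np.random.randn(
                self.layers_size[l], self.layers_size[l - 1]) / np.sqrt(self.layers_size[l - 1])
            # Biases start at zero, one per neuron in layer l
            self.parameters["b" + str(l)] = np.zeros((self.layers_size[l], 1))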
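To see which shapes this initialization produces, here is a standalone sketch of the same idea. It assumes the list of layer sizes already includes the input size (784 pixels for MNIST) as its first entry, and it uses NumPy's `default_rng` rather than the global seed; both are choices made for the example.

```python
import numpy as np

def initialize_parameters(layers_size, seed=42):
    """Return scaled-normal weights and zero biases for consecutive layer pairs."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layers_size)):
        fan_in = layers_size[l - 1]
        # Shape is (neurons in layer l, neurons in layer l-1)
        parameters["W" + str(l)] = rng.standard_normal((layers_size[l], fan_in)) / np.sqrt(fan_in)
        parameters["b" + str(l)] = np.zeros((layers_size[l], 1))
    return parameters

# Assumed layout: 784 inputs, one hidden layer of 10 units, 10 output classes
params = initialize_parameters([784, 10, 10])
for name, value in params.items():
    print(name, value.shape)
```

Dividing by sqrt(fan_in) keeps the spread of each neuron's weighted sum roughly independent of how many inputs it has.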

Forward Propagation

    def forward(self, X):
        store = {}  # cache of activations, weights, and pre-activations for backprop

        A = X.T  # one column per sample
        # Hidden layers: linear step followed by a sigmoid activation
        for l in range(self.L - 1):
            Z = self.parameters["W" + str(l + 1)].dot(A) + self.parameters["b" + str(l + 1)]
            A = self.sigmoid(Z)
            store["A" + str(l + 1)] = A
            store["W" + str(l + 1)] = self.parameters["W" + str(l + 1)]
            store["Z" + str(l + 1)] = Z

        # Output layer: linear step followed by softmax to get class probabilities
        Z = self.parameters["W" + str(self.L)].dot(A) + self.parameters["b" + str(self.L)]
        A = self.softmax(Z)
        store["A" + str(self.L)] = A
        store["W" + str(self.L)] = self.parameters["W" + str(self.L)]
        store["Z" + str(self.L)] = Z

        return A, store
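The forward pass calls `self.sigmoid` and `self.softmax`, and the backward pass calls `self.sigmoid_derivative`, none of which are shown in this article. Here is a minimal sketch of what such helpers might look like, written as standalone functions. This is an assumption, and the author's versions may differ. The softmax works column-wise, matching the one-column-per-sample layout used in `forward`.

```python
import numpy as np

def sigmoid(Z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoid_derivative(Z):
    """Derivative of the sigmoid with respect to its input Z."""
    s = sigmoid(Z)
    return s * (1.0 - s)

def softmax(Z):
    """Column-wise softmax; subtracting the column max avoids overflow in exp."""
    shifted = Z - np.max(Z, axis=0, keepdims=True)
    exp_Z = np.exp(shifted)
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

# Two samples (columns), three classes (rows); the second column has large values on purpose
Z = np.array([[1.0, 1000.0], [2.0, 1000.0], [3.0, 1000.0]])
probs = softmax(Z)
print(probs.sum(axis=0))
```

Each column of the softmax output sums to 1, so it can be read as a probability distribution over the classes.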

Backpropagation

    def backward(self, X, Y, store):
        derivatives = {}

        store["A0"] = X.T  # the input acts as the activation of layer 0

        # Output layer: with softmax and cross-entropy, the error is A - Y
        A = store["A" + str(self.L)]
        dZ = A - Y.T

        # self.batch is the mini-batch size set in fit
        dW = dZ.dot(store["A" + str(self.L - 1)].T) / self.batch
        db = np.sum(dZ, axis=1, keepdims=True) / self.batch
        dAPrev = store["W" + str(self.L)].T.dot(dZ)

        derivatives["dW" + str(self.L)] = dW
        derivatives["db" + str(self.L)] = db

        # Hidden layers, from the last one back to the first
        for l in range(self.L - 1, 0, -1):
            dZ = dAPrev * self.sigmoid_derivative(store["Z" + str(l)])
            dW = 1. / self.batch * dZ.dot(store["A" + str(l - 1)].T)
            db = 1. / self.batch * np.sum(dZ, axis=1, keepdims=True)
            if l > 1:
                dAPrev = store["W" + str(l)].T.dot(dZ)

            derivatives["dW" + str(l)] = dW
            derivatives["db" + str(l)] = db

        return derivatives
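The output-layer step `dZ = A - Y.T` relies on a standard result: for a softmax followed by cross-entropy, the gradient of the loss with respect to the pre-activation equals the predicted probabilities minus the one-hot label. The sketch below checks this numerically for a single sample; the logits and the label are arbitrary values chosen for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    """Loss for one sample, as a function of the logits z and one-hot label y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0])   # logits for 3 classes
y = np.array([0.0, 0.0, 1.0])    # true class is index 2

analytic = softmax(z) - y        # the shortcut used in backpropagation

# Central finite differences, one logit at a time
h = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = h
    numeric[i] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * h)

print(analytic)
print(numeric)
```

The two vectors agree to within numerical precision, which is a quick way to gain confidence in a hand-written backward pass.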

Update parameters with mini-batches

A mini-batch is a fixed-size group of training examples that is smaller than the full dataset. In each iteration we train the network on a different mini-batch until all samples of the dataset have been used, which completes one epoch. You are free to choose the batch size; powers of 2 such as 32 or 64 are a common choice.
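The training loop in this section calls a `create_mini_batches` helper that is not shown in the article. Here is one possible sketch of such a helper as a standalone function. The shuffling strategy, the optional `rng` parameter, and the decision to keep a smaller final batch are all assumptions.

```python
import numpy as np

def create_mini_batches(X, Y, batch_size, rng=None):
    """Shuffle the samples (rows) and split them into (X_batch, Y_batch) pairs."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(X.shape[0])
    # Apply the same permutation to X and Y so inputs stay paired with their labels
    X_shuffled, Y_shuffled = X[order], Y[order]
    return [
        (X_shuffled[start:start + batch_size], Y_shuffled[start:start + batch_size])
        for start in range(0, X.shape[0], batch_size)
    ]

# Toy data: 10 samples with 2 features each; the label of row i is i
X = np.arange(20).reshape(10, 2)
Y = np.arange(10).reshape(10, 1)
batches = create_mini_batches(X, Y, batch_size=4, rng=np.random.default_rng(0))
print([xb.shape[0] for xb, _ in batches])
```

When the dataset size is not a multiple of the batch size, the last batch in this sketch is smaller than the others. A variant could drop it so that every batch has the same size.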

    def fit(self, X, Y, learning_rate=1, n_iterations=10, batch=32):
        np.random.seed(1)
        self.batch = batch
        for loop in range(n_iterations):
            mini_batches = self.create_mini_batches(X, Y, self.batch)
            loss = 0
            acc = 0
            for mini_batch in mini_batches:
                X_mini, y_mini = mini_batch
                A, store = self.forward(X_mini)
                # CCE cost; A.T holds the predicted class probabilities.
                # np.mean averages over the classes as well as the samples.
                loss += -1 * np.mean(y_mini * np.log(A.T + 1e-8))
                derivatives = self.backward(X_mini, y_mini, store)

                # Gradient descent step for every layer
                for l in range(1, self.L + 1):
                    self.parameters["W" + str(l)] = self.parameters["W" + str(l)] - learning_rate * derivatives[
                        "dW" + str(l)]
                    self.parameters["b" + str(l)] = self.parameters["b" + str(l)] - learning_rate * derivatives[
                        "db" + str(l)]

                acc += self.predict(X_mini, y_mini)

            self.costs.append(loss)
            # Report the loss and accuracy averaged over the mini-batches of this epoch
            print("Epoch", loop + 1, "steps", len(mini_batches),
                  "Train loss:", "{:.4f}".format(loss / len(mini_batches)),
                  "Train acc:", "{:.4f}".format(acc / len(mini_batches)))

Run and see the output

train_x, test_x, train_y, test_y = load_mnist()

# Sizes of the layers after the input: one hidden layer of 10 units and 10 output classes
layers_dims = [10, 10]

ann = ANN(layers_dims, train_x.shape[1])
ann.fit(train_x, train_y, learning_rate=.1, n_iterations=100, batch=64)

I am showing only the last few steps of the training output, as the run has a large number of iterations.

After completing the training, let’s plot the loss to see how it changes over each epoch.

Sanity check with a single image prediction

Our model performs pretty well: as you can see, it reaches 92.3% test accuracy with a train accuracy of 95.45%. There is a lot of scope to improve this model and get higher accuracy. Let me know your approach to getting better performance.

Thank you for reading. Here is the full code.

References

MIT Deep Learning 6.S191: http://introtodeeplearning.com

What is Gradient Descent? | IBM: www.ibm.com

Backpropagation (Wikipedia): https://en.wikipedia.org/wiki/Backpropagation

CS230 Deep Learning: cs230.stanford.edu