MNIST (Handwritten Digit) Classification Using a Neural Network (Step by Step) From Scratch

Koushik
Aug 31, 2023


MNIST digit classification

What is a Neural Network?

Neural networks, also known as artificial neural networks (ANNs), are a class of machine learning models that underpin deep learning algorithms. Their name and structure are inspired by the human brain, and they loosely mimic the way biological neurons signal one another. Neural networks rely on training data to learn and improve their accuracy over time. Once trained, they become effective tools in computer science and artificial intelligence.

Let’s explore how a neural network works.

Consider every node to be a separate linear regression model, with input data (X), weights (w), a bias (b), and an output (Z). The formula looks like this:

$$Z = w \cdot X + b$$

Compute Z for each perceptron

Forward propagation is the method by which a neural network produces an output for a specific input. The final layer’s output is also referred to as the neural network’s prediction. We will go through how we evaluate predictions later in this article. These evaluations can be used to determine whether or not our neural network needs to be improved.
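The per-node computation described above can be sketched for a whole layer at once. This is a minimal illustration, not the article's full model: the sizes (3 input features, 2 nodes, 4 examples) and the variable names are assumptions chosen only to keep the shapes easy to follow.

```python
import numpy as np

# Hypothetical sizes for illustration: 3 input features, 2 nodes, 4 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # inputs, one column per example
W = rng.normal(size=(2, 3))   # one row of weights per node
b = np.zeros((2, 1))          # one bias per node

# Each node computes a weighted sum of its inputs plus its bias.
Z = W.dot(X) + b
print(Z.shape)  # (2, 4): one value per node per example
```

Stacking the weights of all nodes in a layer into one matrix lets a single matrix product replace a loop over nodes.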

We compute the cost function immediately after the final layer generates its output. The cost function measures how far our neural network's predictions are from the desired ones: its value reflects the discrepancy between the true value and the predicted value.

Here is one such cost function:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k} y_{ik} \log(\hat{y}_{ik})$$

Categorical Cross Entropy

Here, 1/m averages the loss over the m examples, y_{ik} is the actual output (1 if example i belongs to class k, otherwise 0), and \hat{y}_{ik} is the predicted probability for that class. The most interesting thing about this loss function is the negative sign at the start: because the logarithm of a probability between 0 and 1 is negative, the sign flips the result so that the loss is non-negative, and the loss grows as the predicted probability of the true class approaches 0.
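To make the behaviour of this loss concrete, here is a small sketch that evaluates it on two hand-written sets of predictions. The function name `categorical_cross_entropy`, the `eps` constant, and the example arrays are illustrative assumptions, not part of the article's code.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-8):
    """Mean over examples of -sum_k y_k * log(yhat_k)."""
    # y_true: one-hot labels, shape (m, classes)
    # y_pred: predicted probabilities, shape (m, classes)
    # eps guards against log(0).
    per_example = -np.sum(y_true * np.log(y_pred + eps), axis=1)
    return per_example.mean()

y_true = np.array([[1, 0, 0], [0, 1, 0]])
confident = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
unsure = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]])

loss_confident = categorical_cross_entropy(y_true, confident)
loss_unsure = categorical_cross_entropy(y_true, unsure)
print(loss_confident, loss_unsure)  # the confident predictions give the smaller loss
```

Note that the training loop later in this article averages over every entry with `np.mean`, so its reported loss differs from this version by a constant factor equal to the number of classes.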

Backpropagation

Backpropagation performs a backward pass through the network to compute the gradient of the loss with respect to every parameter. Those gradients are then used to adjust the parameters so that the loss is minimized.

Let’s see how it computes gradients.

Compute the gradient of the loss with respect to w2 (image: MIT 6.S191)
Compute the gradient of the loss with respect to w1 (image: MIT 6.S191)
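The two gradients named in the captions follow from the chain rule. As a sketch, assume the simplest possible network, a chain of single units x → z1 → ŷ → J with weights w1 and w2 (this chain and its symbol names are an assumption made for illustration):

```latex
\frac{\partial J}{\partial w_2}
  = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}
\qquad
\frac{\partial J}{\partial w_1}
  = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1}
    \cdot \frac{\partial z_1}{\partial w_1}
```

The gradient for w1 reuses the factor already computed for w2, which is why the backward pass works from the output layer towards the input.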

Gradient Descent

The parameters of the neural network are updated using gradient descent. This algorithm modifies the weights and biases of each layer in the direction that reduces the cost function. Backpropagation is used to calculate how much each weight and bias in the network contributes to the cost.

We take the derivative of the loss function with respect to (w, b) and use these gradients to optimize the loss. Here J is the loss function, and the gradient with respect to w points in the direction of steepest increase of J in parameter space, so we step in the opposite direction. The goal is to reach the global minimum, although in practice gradient descent may settle in a local one.

Gradient descent

Update gradients

$$w_{new} = w - \alpha \frac{\partial J}{\partial w}, \qquad b_{new} = b - \alpha \frac{\partial J}{\partial b}$$

Here alpha is the learning rate, w is the old weight, and ∂J/∂w is the derivative of the loss with respect to the weight. The new bias is computed in the same way.

Update w, b according to the loss function
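The update rule is easiest to see on a one-parameter toy problem. The sketch below is an assumption made for illustration, not the article's network: it minimizes J(w) = (w - 3)^2, whose minimum is known to be at w = 3.

```python
# Minimize J(w) = (w - 3)^2, whose gradient is dJ/dw = 2 * (w - 3).
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0        # initial weight
alpha = 0.1    # learning rate

for step in range(100):
    w = w - alpha * gradient(w)  # w_new = w_old - alpha * dJ/dw

print(round(w, 4))  # close to the minimum at w = 3
```

With a learning rate that is too large (for this function, anything above 1.0), the same loop overshoots the minimum and diverges, which is why alpha usually needs tuning.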

Let’s implement these ideas in code

Initialize the parameters for each layer of the network. You are free to draw the initial weights from different distributions, for example a uniform or a normal distribution. Here I use NumPy's randn, which samples from the standard normal distribution, and scale each weight matrix by the square root of the number of inputs to its layer.

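import numpy as np

def initialize_parameters(self):
    np.random.seed(42)  # fixed seed so that runs are reproducible

    for l in range(1, len(self.layers_size)):
        # Standard normal weights scaled by 1/sqrt(inputs to the layer); biases start at zero.
        self.parameters["W" + str(l)] = np.random.randn(
            self.layers_size[l], self.layers_size[l - 1]) / np.sqrt(self.layers_size[l - 1])
        self.parameters["b" + str(l)] = np.zeros((self.layers_size[l], 1))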
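To see what shapes this initialization produces, here is a standalone sketch of the same idea outside the class. The layer sizes `[784, 10, 10]` are an assumption (784 is the number of pixels in a 28×28 MNIST image), and the function takes the sizes as an argument instead of reading them from `self`.

```python
import numpy as np

def initialize_parameters(layers_size, seed=42):
    """Scaled standard-normal weights and zero biases for each layer."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layers_size)):
        fan_in = layers_size[l - 1]  # number of inputs feeding layer l
        parameters["W" + str(l)] = rng.standard_normal((layers_size[l], fan_in)) / np.sqrt(fan_in)
        parameters["b" + str(l)] = np.zeros((layers_size[l], 1))
    return parameters

params = initialize_parameters([784, 10, 10])
for name, value in params.items():
    print(name, value.shape)
```

Dividing by the square root of the number of inputs keeps the spread of Z roughly independent of layer width, which helps the sigmoid avoid saturating at the start of training.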

Forward Propagation

def forward(self, X):
    store = {}  # cache of activations, weights and pre-activations for backprop

    A = X.T  # shape (features, examples)
    # Hidden layers: linear step followed by a sigmoid activation.
    for l in range(self.L - 1):
        Z = self.parameters["W" + str(l + 1)].dot(A) + self.parameters["b" + str(l + 1)]
        A = self.sigmoid(Z)
        store["A" + str(l + 1)] = A
        store["W" + str(l + 1)] = self.parameters["W" + str(l + 1)]
        store["Z" + str(l + 1)] = Z

    # Output layer: linear step followed by softmax to get class probabilities.
    Z = self.parameters["W" + str(self.L)].dot(A) + self.parameters["b" + str(self.L)]
    A = self.softmax(Z)
    store["A" + str(self.L)] = A
    store["W" + str(self.L)] = self.parameters["W" + str(self.L)]
    store["Z" + str(self.L)] = Z

    return A, store
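The forward pass relies on `self.sigmoid` and `self.softmax`, which are not shown in this article. A minimal sketch of how they might look as standalone functions follows; the max-subtraction trick in `softmax` is a common numerical-stability choice and an assumption here, not necessarily what the full code does.

```python
import numpy as np

def sigmoid(Z):
    """Element-wise logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-Z))

def softmax(Z):
    """Column-wise softmax for Z of shape (classes, examples)."""
    # Subtracting the column maximum avoids overflow and leaves the result unchanged.
    shifted = Z - np.max(Z, axis=0, keepdims=True)
    exp_Z = np.exp(shifted)
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

Z = np.array([[2.0, 1000.0],
              [1.0, 1000.0],
              [0.1, 999.0]])
probs = softmax(Z)
print(probs.sum(axis=0))  # each column sums to 1
```

Without the shift, `np.exp(1000.0)` would overflow to infinity and the second column would come out as NaN.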

Backpropagation

def backward(self, X, Y, store):

    derivatives = {}

    store["A0"] = X.T  # treat the input as the activation of "layer 0"

    # Output layer: for softmax with cross-entropy, dL/dZ simplifies to A - Y.
    A = store["A" + str(self.L)]
    dZ = A - Y.T

    dW = dZ.dot(store["A" + str(self.L - 1)].T) / self.batch
    db = np.sum(dZ, axis=1, keepdims=True) / self.batch
    dAPrev = store["W" + str(self.L)].T.dot(dZ)

    derivatives["dW" + str(self.L)] = dW
    derivatives["db" + str(self.L)] = db

    # Hidden layers, from the last one back to the first.
    for l in range(self.L - 1, 0, -1):
        dZ = dAPrev * self.sigmoid_derivative(store["Z" + str(l)])
        dW = 1. / self.batch * dZ.dot(store["A" + str(l - 1)].T)
        db = 1. / self.batch * np.sum(dZ, axis=1, keepdims=True)
        if l > 1:
            # Gradient flowing into the previous layer's activations.
            dAPrev = store["W" + str(l)].T.dot(dZ)

        derivatives["dW" + str(l)] = dW
        derivatives["db" + str(l)] = db

    return derivatives
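Two pieces of the backward pass deserve a closer look: the shortcut `dZ = A - Y.T` at the output layer, and `self.sigmoid_derivative`, which is not shown in this article. The sketch below defines a plausible `sigmoid_derivative` and checks the shortcut against a finite-difference gradient. All function names and example values here are illustrative assumptions.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoid_derivative(Z):
    """d/dZ sigmoid(Z) = sigmoid(Z) * (1 - sigmoid(Z))."""
    s = sigmoid(Z)
    return s * (1.0 - s)

def softmax(Z):
    exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

def loss(Z, Y):
    """Cross-entropy of one example, summed over its classes."""
    return -np.sum(Y * np.log(softmax(Z)))

# One example with 3 classes; the true class is index 1.
Z = np.array([[0.5], [-1.2], [2.0]])
Y = np.array([[0.0], [1.0], [0.0]])

analytic = softmax(Z) - Y  # the shortcut used for the output layer

# Central finite differences as an independent check.
numeric = np.zeros_like(Z)
h = 1e-6
for i in range(Z.shape[0]):
    step = np.zeros_like(Z)
    step[i, 0] = h
    numeric[i, 0] = (loss(Z + step, Y) - loss(Z - step, Y)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-6))
```

A numerical check like this is a cheap way to catch sign or transpose mistakes when writing backpropagation by hand.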

Update parameters with mini-batches

A mini-batch is a fixed number of training examples that is smaller than the full dataset. In each iteration we train the network on a different group of samples, until all samples in the dataset have been used. You are free to choose the batch size. By convention it is often a power of 2, such as 32 or 64.
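The training code below calls `self.create_mini_batches`, which is not shown in this article. One way it could work is sketched here as a standalone function; the shuffling step and the handling of a smaller final batch are assumptions about the design, not the author's confirmed implementation.

```python
import numpy as np

def create_mini_batches(X, Y, batch_size):
    """Shuffle the data and split it into (X_batch, Y_batch) pairs."""
    m = X.shape[0]
    order = np.random.permutation(m)  # new random order on every call
    X_shuffled, Y_shuffled = X[order], Y[order]

    mini_batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size
        mini_batches.append((X_shuffled[start:end], Y_shuffled[start:end]))
    return mini_batches

# 10 examples with 4 features each, one-hot labels over 3 classes.
X = np.arange(40).reshape(10, 4)
Y = np.eye(3)[np.arange(10) % 3]
batches = create_mini_batches(X, Y, batch_size=4)
print([xb.shape[0] for xb, _ in batches])  # [4, 4, 2]
```

In this sketch the last batch is smaller when the dataset size is not a multiple of the batch size. Since the backward pass in this article divides by `self.batch`, gradients from such a short batch would be scaled slightly low; dropping the remainder is one alternative.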

def fit(self, X, Y, learning_rate=1, n_iterations=10, batch=32):
    np.random.seed(1)
    self.batch = batch
    for loop in range(n_iterations):

        mini_batches = self.create_mini_batches(X, Y, self.batch)
        loss = 0
        acc = 0
        for mini_batch in mini_batches:
            X_mini, y_mini = mini_batch
            A, store = self.forward(X_mini)
            # CCE cost; A.T holds the predicted probabilities, 1e-8 avoids log(0).
            # np.mean averages over every entry (examples x classes).
            loss += -1 * np.mean(y_mini * np.log(A.T + 1e-8))
            derivatives = self.backward(X_mini, y_mini, store)

            # Gradient descent step for every layer.
            for l in range(1, self.L + 1):
                self.parameters["W" + str(l)] = self.parameters["W" + str(l)] - learning_rate * derivatives[
                    "dW" + str(l)]
                self.parameters["b" + str(l)] = self.parameters["b" + str(l)] - learning_rate * derivatives[
                    "db" + str(l)]

            acc += self.predict(X_mini, y_mini)

        self.costs.append(loss)  # loss summed over this epoch's mini-batches
        print("Epoch", loop + 1, "steps", len(mini_batches), "Train loss:", "{:.4f}".format(loss / len(mini_batches)),
              "Train acc:", "{:.4f}".format(acc / len(mini_batches)))

Run and see output

train_x, test_x, train_y, test_y = load_mnist()

layers_dims = [10, 10]

ann = ANN(layers_dims, train_x.shape[1])
ann.fit(train_x, train_y, learning_rate=.1, n_iterations=100, batch=64)
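The backward pass subtracts `Y.T` from the softmax output, so the labels passed to `fit` need to be one-hot encoded. Whether `load_mnist` already does this is not shown here; if it returned integer labels, a helper along these lines could convert them. The name `one_hot` is hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels of shape (m,) into a one-hot matrix of shape (m, num_classes)."""
    encoded = np.zeros((labels.shape[0], num_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1.0  # set one entry per row
    return encoded

labels = np.array([3, 0, 9])
Y = one_hot(labels)
print(Y.shape)  # (3, 10)
```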

I am showing only the last few epochs of the training output, since the run has a large number of iterations.

Model output over each epoch

After completing the training, let's plot the loss to see how it changes over each epoch.

Loss curve

Sanity check with a single image prediction

Image prediction

Our model performs pretty well: as you can see, it reaches 92.3% test accuracy with a train accuracy of 95.45%. There is a lot of scope to improve this model and get higher accuracy. Let me know your approach to getting better performance.

Thank you for reading. Here is the full code.
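Accuracy figures like these are usually computed by comparing the most probable predicted class with the true label. The article's `predict` method is not shown, so the function below is only a sketch of that idea under assumed shapes, and it returns a fraction between 0 and 1 instead of a percentage.

```python
import numpy as np

def accuracy(probabilities, Y):
    """Fraction of examples whose most probable class matches the label."""
    # probabilities: shape (classes, examples), as produced by a softmax output layer
    # Y: one-hot labels, shape (examples, classes)
    predicted = np.argmax(probabilities, axis=0)
    actual = np.argmax(Y, axis=1)
    return np.mean(predicted == actual)

# Three examples, one per column; the second one is misclassified.
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.2, 0.5, 0.3],
                          [0.1, 0.3, 0.6]])
Y = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 0, 1]])
score = accuracy(probabilities, Y)
print(score)  # 2 of 3 examples are correct
```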

References

MIT Deep Learning 6.S191: http://introtodeeplearning.com

Backpropagation (Wikipedia): https://en.wikipedia.org/wiki/Backpropagation
