Handwritten digit recognition using CNN

Arnab Das
7 min read · May 6, 2022


[Cover photo by Arnab Das, Sikkim, India]

Handwritten digit recognition on the MNIST dataset is the 'Hello World' program of CNNs, and a very good starting point for learning how they work.

About the MNIST Dataset :

The MNIST dataset consists of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Of these 70,000 images, 60,000 are training images and 10,000 are testing images. Every image is 28*28 pixels. The dataset is so famous that whenever a new classification algorithm is invented, people apply it to MNIST to see how the algorithm performs.

Objective :

Our objective is to build a deep learning model using a CNN to classify handwritten digits. Since there are 10 digits (0 to 9) to predict, this is a multiclass classification problem.

How it works :

A deep learning model does not understand images or text; it only understands numbers in the form of vectors, matrices, or tensors. So to build the model we simply use the pixel values of the images. The pixel values are preprocessed into a certain n-dimensional form (discussed later).

The main goal is to extract useful features or patterns from these images so that, based on those features, we can train the model and get the expected output. So how do we extract features? This is where the CNN comes in. To extract features (such as vertical edges, horizontal edges, and so on) we apply filters, a.k.a. kernels, to the images.

How are filters applied? A filter or kernel is nothing but an n*n matrix with a particular combination of numbers. This matrix is slid over the image's pixel matrix and, at each position, an element-wise multiply-and-sum is computed. The result is a new matrix that captures useful information about the image. This type of layer is called Conv2D.
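
As a toy illustration (not part of the original article's code), here is what "sliding a kernel over an image" looks like in plain NumPy, using a made-up 5*5 image with a bright vertical stripe and a 3*3 vertical-edge kernel:

import numpy as np

# a tiny 5x5 "image": bright vertical stripe in the middle column
image = np.array([
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
], dtype=float)

# a 3x3 vertical-edge kernel
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

# slide the kernel over the image (stride 1, no padding)
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out)   # [[-3. 0. 3.], ...] : large magnitudes mark the stripe's edges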

We use such kernels multiple times to extract features. Because an image is a large grid of pixels, training on the full resolution is costly. To reduce the size of the feature matrix while keeping most of the information intact, pooling is used. Pooling is also a sliding window that shrinks the matrix (image) while preserving most of the information. This type of layer is called MaxPooling2D (or AveragePooling2D, and so on).
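
A minimal sketch of 2*2 max pooling on a small made-up feature map, again only for intuition:

import numpy as np

feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [3, 4, 1, 8],
], dtype=float)

# 2x2 max pooling with stride 2: keep only the largest value in each 2x2 block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6. 4.]
                #  [7. 9.]]  -> 4x4 reduced to 2x2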

After that we can apply some regularization so that the model does not overfit. Dropout is one such regularization technique, and it is implemented by a Dropout layer.
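
For intuition, a Dropout layer randomly zeroes a fraction of its inputs during training and rescales the survivors. A tiny sketch, assuming TensorFlow 2.x is available:

import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)          # drop roughly half the activations
x = tf.ones((1, 8))
# training=True enables dropout; surviving values are scaled by 1/(1 - 0.5)
print(drop(x, training=True).numpy())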

How many filter, pooling, and regularization layers there should be, and what the kernel sizes should be, is decided by hyperparameter tuning.

That is all for the convolutional part. The resulting matrix is then flattened and fed into fully connected (ANN) layers for training. The last layer should be a softmax layer with 10 outputs, since we are classifying 10 digit classes.
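
The softmax layer turns the 10 raw scores into probabilities that sum to 1; the class with the highest probability is the predicted digit. A quick NumPy sketch with made-up scores:

import numpy as np

# toy "scores" for the 10 digit classes coming out of the last hidden layer
logits = np.array([0.5, 1.2, -0.3, 2.0, 0.0, 3.1, 0.7, -1.0, 0.2, 1.5])

# softmax: exponentiate and normalize so the values sum to 1
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(3), probs.sum())
print("predicted digit:", probs.argmax())    # index of the highest probability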

Outline of this Project :

  1. Import the necessary libraries
  2. Load the dataset
  3. Explore the dataset
  4. Preprocess the data for CNN model
  5. Create a CNN model architecture
  6. Compile and fit the data to the model
  7. Evaluate the model
  8. Save the model and make a prediction

Importing the necessary Libraries :

# importing python libraries

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import BatchNormalization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Load the dataset :

The dataset is available in Keras, so we can import it from there. As discussed earlier, the training and testing data are already divided.

# importing the MNIST data module from tensorflow.keras
mnist = tf.keras.datasets.mnist
# Load data set
(X_train,y_train), (X_test,y_test) = mnist.load_data()

Explore the dataset :

Let's check the size of the dataset.

print(f"X_train_shape : {X_train.shape} \ny_train.shape {y_train.shape}")

Output :

X_train_shape : (60000, 28, 28)
y_train.shape (60000,)

X_train.shape is (60000, 28, 28). It indicates that there are 60,000 training instances and each instance is 28*28.

Now let's plot an image from this dataset.

plt.imshow(X_train[0])
print("label for 0th item is ", y_train[0])

Output :

label for 0th item is 5

Preprocess the data for CNN model :

Unlike classical machine learning, a CNN requires the color-channel information to be provided. Since these are greyscale images, we reshape each image to 28*28*1, because a greyscale image has only one color channel.

# reshape the dataset to row*col*1 because it is greyscale so only 1 channel is needed

X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
print(f"train-shape = {X_train.shape}\ntest-shape = {X_test.shape}")

Output :

train-shape = (60000, 28, 28, 1)
test-shape = (10000, 28, 28, 1)

It is preferable to convert all the values to float for more flexibility. In deep learning it is essential to normalize the input values so that all inputs lie within a given range; otherwise the model's performance will suffer. So let's normalize the data.

# convert int to float32
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
# normalize the input to 0-1
X_train = X_train/255.0
X_test = X_test/255.0

This is a multiclass classification problem, so we have to one-hot encode the labels as follows.

# one-hot-encoding
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)
y_train.shape

Output :

(60000, 10)

Create a CNN model architecture :

Now the main part. Writing the code to create a model is quite easy, but it is also the trickiest part, because there is no fixed rule about the number of layers, the number of neurons per layer, or the regularization. So we have to fine-tune the hyperparameters using some technique (such as grid search).
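
As a rough sketch of what such a search could look like (the filter counts and dropout rates below are illustrative choices, not values tuned for this article), we can loop over a small grid of hyperparameters and keep the combination with the best validation accuracy:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

def build_model(n_filters, dropout_rate):
    # same overall architecture, parameterised by the values being tuned
    model = Sequential([
        Conv2D(n_filters, (5, 5), input_shape=(28, 28, 1), padding='same', activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(dropout_rate),
        Flatten(),
        Dense(128, activation='relu'),
        Dense(10, activation='softmax')
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

best = None
for n_filters in [16, 32, 64]:               # candidate filter counts
    for dropout_rate in [0.2, 0.4]:           # candidate dropout rates
        model = build_model(n_filters, dropout_rate)
        hist = model.fit(X_train, y_train, epochs=3, batch_size=200,
                         validation_data=(X_test, y_test), verbose=0)
        val_acc = hist.history['val_accuracy'][-1]
        if best is None or val_acc > best[0]:
            best = (val_acc, n_filters, dropout_rate)

print("best (val_accuracy, n_filters, dropout_rate):", best)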

Here Conv2D and MaxPooling2D are used for feature extraction. Dropout is used for regularization, which restricts overfitting of the model. Flatten then reduces the matrices to a 1-D array so it can be fed as input to a dense layer, which uses the ReLU activation function and is followed by batch normalization. The final dense layer is the output layer: since 10 digits have to be classified, 10 neurons are used with the softmax activation function.

[Model block diagram]

# create sequential cnn model

total_classes = 10   # digits 0-9

model = Sequential([
    Conv2D(32, (5,5), input_shape=(28,28,1), padding='same', activation='relu'),
    MaxPooling2D(pool_size=(2,2), padding='same'),
    Dropout(0.2),
    Flatten(),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dense(total_classes, activation='softmax')
])

Compile and fit the data to the model :

Let's compile and fit the model. Training is a time-consuming process and is also computationally expensive. Since this is multiclass classification, loss = categorical_crossentropy is used. The Adam optimizer and the accuracy metric are used for optimization and for measuring accuracy, respectively.

# compile the model
# loss = categorical_crossentropy
# optimization technique = adam
# metrics = accuracy

model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics=['accuracy'])
print(model.summary())

Output :

Model: "sequential_26"
_________________________________________________________________
 Layer (type)                                Output Shape        Param #
=================================================================
 conv2d_27 (Conv2D)                          (None, 28, 28, 32)  832

 max_pooling2d_14 (MaxPooling2D)             (None, 14, 14, 32)  0

 dropout_21 (Dropout)                        (None, 14, 14, 32)  0

 flatten_21 (Flatten)                        (None, 6272)        0

 dense_44 (Dense)                            (None, 128)         802944

 batch_normalization_1 (BatchNormalization)  (None, 128)         512

 dense_45 (Dense)                            (None, 10)          1290

=================================================================
Total params: 805,578
Trainable params: 805,322
Non-trainable params: 256
_________________________________________________________________
None

Fit the data after compiling.

# model fit
history = model.fit(X_train, y_train, epochs=10, batch_size=200,validation_data=(X_test, y_test)).history

Output :

Epoch 1/10
285/285 [==============================] - 24s 83ms/step - loss: 0.0043 - accuracy: 0.9987 - val_loss: 0.0392 - val_accuracy: 0.9913
Epoch 2/10
285/285 [==============================] - 24s 83ms/step - loss: 0.0035 - accuracy: 0.9990 - val_loss: 0.0513 - val_accuracy: 0.9907
Epoch 3/10
285/285 [==============================] - 42s 146ms/step - loss: 0.0038 - accuracy: 0.9988 - val_loss: 0.0445 - val_accuracy: 0.9920
Epoch 4/10
285/285 [==============================] - 44s 155ms/step - loss: 0.0025 - accuracy: 0.9994 - val_loss: 0.0454 - val_accuracy: 0.9913
Epoch 5/10
285/285 [==============================] - 41s 142ms/step - loss: 0.0025 - accuracy: 0.9992 - val_loss: 0.0494 - val_accuracy: 0.9900
Epoch 6/10
285/285 [==============================] - 52s 182ms/step - loss: 0.0028 - accuracy: 0.9992 - val_loss: 0.0517 - val_accuracy: 0.9893
Epoch 7/10
285/285 [==============================] - 27s 96ms/step - loss: 0.0030 - accuracy: 0.9991 - val_loss: 0.0475 - val_accuracy: 0.9897
Epoch 8/10
285/285 [==============================] - 47s 166ms/step - loss: 0.0032 - accuracy: 0.9990 - val_loss: 0.0482 - val_accuracy: 0.9923
Epoch 9/10
285/285 [==============================] - 45s 159ms/step - loss: 0.0036 - accuracy: 0.9992 - val_loss: 0.0454 - val_accuracy: 0.9987
Epoch 10/10
285/285 [==============================] - 24s 83ms/step - loss: 0.0028 - accuracy: 0.9994 - val_loss: 0.0393 - val_accuracy: 0.9989

Evaluate :

Since the validation and training accuracy are nearly the same, the model is not overfitting. The accuracy is above 99%, so the model is not underfitting either, and no further hyperparameter tuning is needed. Let's look at the accuracy plot for the training and validation sets.

plt.plot(history['accuracy'])
plt.plot(history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Output : [accuracy plot for the training and validation sets]
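
For a single summary number on the test set (not shown in the original run), Keras's model.evaluate can also be used:

# overall loss and accuracy on the held-out test set
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"test loss = {test_loss:.4f}, test accuracy = {test_acc:.4f}")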

Save the model and prediction :

The accuracy is good for both the training and testing datasets, so we finalize this model and save it.

#save the model
model.save('cnn_mnist_digit.h5')
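
As an optional sanity check, the saved model can be reloaded and inspected:

# reload the saved model and confirm the architecture round-trips
loaded_model = tf.keras.models.load_model('cnn_mnist_digit.h5')
loaded_model.summary()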

Now it is time for prediction.

# model predict

img = X_train[0]
img = img.reshape(1,28,28,1)   # add a batch dimension: (1, 28, 28, 1)

predict_value = model.predict(img)
np.argmax(predict_value)

Output :

5
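
The image above came from the training set; as an extra check (not in the original article), we can also predict an unseen test image and compare it with its true label:

# predict a test-set image and compare with the ground-truth label
img = X_test[0].reshape(1, 28, 28, 1)
predicted_digit = np.argmax(model.predict(img))
actual_digit = np.argmax(y_test[0])     # y_test is one-hot encoded
print("predicted:", predicted_digit, " actual:", actual_digit)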

That is all. We can now recognize handwritten digits.


Arnab Das

MTech student of Artificial Intelligence at NIT Agartala.