Handwritten Digit Recognition using Machine Learning

Himanshu Beniwal
17 min read · Dec 22, 2018


Machine learning and deep learning play an important role in computer technology and artificial intelligence. With deep learning and machine learning, human effort can be reduced in recognition, learning, prediction, and many other areas. This article presents recognition of handwritten digits (0 to 9) from the famous MNIST dataset, comparing classifiers such as KNN, SVM, neural networks, and convolutional neural networks on the basis of performance, accuracy, time, sensitivity, specificity, and positive predictive value, using different parameters with each classifier.

Sample Digits from MNIST dataset

Handwritten digit recognition has gained enormous popularity, from aspiring beginners in machine learning and deep learning to experts who have been practicing for years. Developing such a system requires a machine to understand and classify images of handwritten digits into the 10 digit classes (0–9). Handwritten digits from the MNIST database have been famous in the community for decades now, as the error rate has been driven down with different classifiers, parameters, and preprocessing techniques, from a 12% error rate with a linear classifier (1-layer NN) to a 0.23% error rate with a committee of 35 convolutional neural networks [Yann LeCun, MNIST database of handwritten digits]. The scope of this article is to compare different classifiers with different parameters and try to achieve near-human performance.

Digit Recognition System

A digit recognition system is a machine trained to recognize digits from different sources such as emails, bank cheques, papers, and images, and in different real-world scenarios: online handwriting recognition on computer tablets, reading vehicle number plates, processing bank cheque amounts, reading numeric entries in forms filled in by hand (say, tax forms), and so on.

Problems with handwritten digits

Handwritten digits are not always of the same size, width, or orientation, nor are they justified to margins, as writing differs from person to person. The general problem in classifying the digits is the similarity between digits such as 1 and 7, 5 and 6, 3 and 8, 2 and 5, 2 and 7, etc. This problem becomes worse when many people write a single digit with a variety of handwritings. Lastly, the uniqueness and variety in the handwriting of different individuals also influence the formation and appearance of the digits. Now we introduce the concepts and algorithms of deep learning and machine learning.

MNIST Dataset

The MNIST (Modified National Institute of Standards and Technology) dataset contains a total of 70,000 images of handwritten digits: 60,000 examples in the training set and 10,000 examples in the testing set, both with labeled images from the 10 digit classes (0 to 9). It is a small segment of the much wider NIST collection, in which each digit was size-normalized to fit a 20*20 pixel box without altering the aspect ratio and then centered in a 28*28 image. Each handwritten digit is therefore a 28*28 grayscale image, stored as a row of pixel intensities with the first column being the label (0 to 9) for that image. The same holds for the testing set: 10,000 images, each with a label from 0 to 9.

Yann LeCun, Corinna Cortes, and Christopher Burges developed the MNIST dataset for evaluating and improving machine learning models on the handwritten digit classification problem. It was built from two NIST special databases: Special Database 3 (written by United States Census Bureau employees) and Special Database 1 (written by high school students), which consist of binary images of handwritten digits. Earlier, SD-3 (Special Database 3) was used as the training set and SD-1 (Special Database 1) as the testing set, but SD-3 turned out to be much easier to recognize. Therefore, to keep the task challenging, disjoint, and fair among different learning classifiers, the NIST data was mixed: the MNIST training set takes 30,000 samples from SD-3 and 30,000 samples from SD-1 (from approximately 250 writers), and the test set takes 5,000 samples from SD-3 and 5,000 samples from SD-1. The digit images were scanned, size-normalized, and centered. This makes it an excellent dataset for evaluating models and lets the machine learning aspirant focus on machine learning and deep learning with very little data cleaning.

As for a newer, extended version similar to the standard MNIST: EMNIST (Extended MNIST) emerged in 2017, with 240,000 handwritten digit images in the training set and 40,000 in the testing set.

Available files in the dataset

Before going deeper into this topic, it is a good idea to get familiar with the provided dataset. The following points hold for both the training and the testing set, along with their image and label files:

  1. Pixels are arranged row-wise, with grayscale values ranging from 0 to 255.
  2. The background is white (value 0) and the foreground is black (value 255).
  3. Labels classify the digits from 0 to 9.

The four training and testing files are:

  1. Training set image file (train-images-idx3-ubyte)
  2. Training set label file (train-labels-idx1-ubyte)
  3. Test set image file (t10k-images-idx3-ubyte)
  4. Test set label file (t10k-labels-idx1-ubyte)

Understanding the dataset better

The MNIST dataset is provided in the IDX format. IDX is a simple file format that comes in handy when working with vectors and high-dimensional matrices of different numerical types. The file starts with a magic number, an integer stored MSB (most significant byte) first, whose first 2 bytes are always zero. This gives us the following information:

  1. The 0000 (2 bytes) marks the beginning of the file.
  2. The third byte, 0x08, tells us that the data is of unsigned byte type.
  3. The fourth byte gives the number of dimensions: 0x03 means a three-dimensional matrix (the image files), while 0x01 means a one-dimensional vector (the label files).

In other words, the third byte encodes whether the data is an integer, float, short, long, or unsigned byte type, and the fourth byte gives the dimensionality of the vector or matrix: if it is equal to 1 the data is a vector, otherwise it is a matrix. The dimension sizes that follow are also stored MSB first.
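
To make this concrete, here is a minimal Python sketch (an illustration, not part of the official MNIST tooling) that parses the IDX header described above and loads the images into a NumPy array; the file name is one of those listed earlier and is assumed to sit in the working directory.

import struct
import numpy as np

def read_idx_images(path="train-images-idx3-ubyte"):
    with open(path, "rb") as f:
        # Magic number: 2 zero bytes, a type code (0x08 = unsigned byte),
        # and the number of dimensions, all stored MSB (big-endian) first.
        zero, dtype, ndims = struct.unpack(">HBB", f.read(4))
        # Each dimension size is a 4-byte integer, also MSB first.
        shape = struct.unpack(">" + "I" * ndims, f.read(4 * ndims))
        # The remaining bytes are the pixel intensities (0 to 255), row-wise.
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)

images = read_idx_images()
print(images.shape)  # (60000, 28, 28)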

Changing from IDX to simpler CSV

As our dataset is available in IDX format [Yann LeCun, the MNIST database of Handwritten Digits], we can convert it into CSV format using a simple conversion script [Joseph Chet Redmon, Algorithm to change idx into csv]. To better understand the CSV:

  1. The first value in each row is the “label”, that is, the actual digit the handwriting is supposed to represent, such as a “7” or a “9”. It is the correct answer the classifier is aspiring to predict.
  2. The remaining comma-separated values are the pixel intensities of the handwritten digit, ranging from 0 to 255. The size of an image is 28 by 28, so there are 784 (28*28) pixel values following each label. [Joseph Chet Redmon, Algorithm to change idx into csv]
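
The conversion itself can be sketched in a few lines of Python; this is a hedged illustration in the spirit of the referenced script rather than a copy of it, and the output file name mnist_train.csv is an assumption.

import struct
import numpy as np

def read_idx(path):
    # Works for both the image files (3 dimensions) and the label files (1 dimension).
    with open(path, "rb") as f:
        _, _, ndims = struct.unpack(">HBB", f.read(4))
        shape = struct.unpack(">" + "I" * ndims, f.read(4 * ndims))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)

def idx_to_csv(image_path, label_path, csv_path):
    images = read_idx(image_path).reshape(-1, 28 * 28)  # flatten to 784 pixels per row
    labels = read_idx(label_path)
    with open(csv_path, "w") as out:
        for label, pixels in zip(labels, images):
            # Label first, then the 784 comma-separated pixel intensities.
            out.write(str(label) + "," + ",".join(map(str, pixels)) + "\n")

idx_to_csv("train-images-idx3-ubyte", "train-labels-idx1-ubyte", "mnist_train.csv")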

CLASSIFIERS

In this section, we discuss various machine learning and deep learning algorithms used for making predictions, and compare their accuracy. Classifiers in machine learning:

KNN (K nearest neighbors)

KNN is a non-parametric method or classifier used for classification as well as regression problems. It is a lazy learning algorithm, where all computation is deferred until the classification stage, and an instance-based learning algorithm, where the approximation takes place locally. Being the simplest and easiest to implement, it has no explicit training phase, and the algorithm does not perform any generalization of the training data.

When to use it? The straightforward answer: when there are nonlinear decision boundaries between classes, or when the amount of data is large enough. Input features can be both qualitative and quantitative in nature, whereas the output is a categorical value, one of the classes seen in the data.

KNN predicts the categorical value using a majority vote of the K nearest neighbors, where the value of K can differ; on changing the value of K, the outcome of the vote can also change.

Assumption

  1. Being non-parametric, the algorithm does not make any assumption about the underlying data distribution.
  2. The parameter K is selected based on the data.
  3. A distance metric is required to define proximity between any two data points. This distance can be computed with the Euclidean distance, Mahalanobis distance, Hamming distance, etc.

Algorithm

  1. Compute the distance metric between the test data point and all labeled data points.
  2. Order the labeled data points in increasing order of this distance.
  3. Select the top K labeled data points and look at their class labels.
  4. Find the class label that the majority of these K labeled data points have and assign it to the test data point (see the sketch below).
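
These four steps translate almost directly into NumPy; the following is a minimal sketch (not the evaluation code behind the results later), assuming X_train holds flattened digit images, y_train their labels, and x_test a single flattened test image.

import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    # 1. Compute the (Euclidean) distance from the test point to every labeled point.
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # 2-3. Order by distance and take the labels of the top K nearest points.
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # 4. Assign the majority class among those K neighbors.
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]

# Tiny made-up demo with two features and two classes.
X = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([5., 4.])))  # prints 1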

Things to consider –

  1. Parameter selection: the best choice of K depends on the data. A larger value of K reduces the effect of noise on classification but makes the decision boundaries between classes less distinct, while a smaller value of K gives clearer separation between classes but is more easily affected by noise.
  2. Presence of noise.
  3. Feature selection and scaling: it is important to remove irrelevant features. When the number of features is too large and the features are suspected to be highly redundant, feature extraction is required. If the features are carefully chosen, the classification is expected to be better.
  4. Curse of dimensionality.
Case 1 (K = 3) and Case 2 (K = 5)

To understand this better, let's look at different values of K. Say in case 1 the value of K is 3; then the class of the test data point would be red, out of the red and blue classes. For K = 5 in case 2, the predicted class would be blue. So on changing the value of K, the output for the same test data point can vary, which is why it is necessary to choose the value of K wisely. A large value of K can reduce the overall noise, but there is no guarantee.

Distance Functions

The distance functions commonly used in KNN are:

  1. Euclidean function
  2. Manhattan function
  3. Minkowski
  4. Hamming distance
  5. Mahalanobis distance
Distance functions in KNN
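
For reference, the first four of these can be written as small Python functions; a minimal sketch, assuming x and y are 1-D NumPy arrays of pixel intensities.

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def minkowski(x, y, p=3):
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2).
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def hamming(x, y):
    # Fraction of positions at which the two vectors differ.
    return np.mean(x != y)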

Command to perform KNN

We can vary the parameters of the classifier, observe how the resulting classifier changes, and compare how well and how efficiently it works with different parameters and hyperparameters.

class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
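
A hedged usage sketch follows; the parameter values are illustrative choices rather than tuned settings, and it assumes a recent scikit-learn where MNIST can be fetched via fetch_openml as an alternative to the IDX/CSV files above.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3, weights="distance", n_jobs=-1)
knn.fit(X_train, y_train)
# KNN defers all work to prediction time, so scoring the full test set is slow;
# a slice keeps this sketch quick.
print("Test accuracy:", knn.score(X_test[:1000], y_test[:1000]))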

SVM (Support Vector Machine)

SVM falls into the category of supervised learning, and it handles classification as well as regression problems. In general, SVM draws an optimal hyperplane that separates the data into different categories. In two-dimensional space, we start by plotting the data points of the independent variables against the dependent variable (class), and then look for the hyperplane, or any linear or nonlinear boundary, that differentiates the two classes best.

Algorithm

First understand, in the case of binary classification:

  1. Identify the hyperplane that segregates the two classes best.
  2. Look for the maximum distance between the nearest data points (of either class) and the hyperplane; this distance is called the margin. So look for the hyperplane whose margin is maximal and equal on both sides. A hyperplane with a higher margin is more robust, whereas a low margin carries a higher chance of misclassification.
  3. SVM selects the classifier that maximizes the margin.
  4. SVM is a robust classifier and is able to ignore outliers while looking for the hyperplane with maximum margin.
SVM in classifying two classes (red and blue)

Tuning parameters

  1. Kernel: linear algebra plays a role here; the kernel transforms how the hyperplane is learned. For a linear SVM, the prediction for a new input x uses the inner products of x with the support vectors xi:

f(x) = B0 + sum(ai * (x · xi))

2. Linear kernel: the kernel is the dot product, K(x, xi) = sum(x * xi). The kernel is a similarity or distance measure between a new data point and the support vectors. The kernel trick lets the separating boundary be computed in a higher dimension, via polynomial and exponential kernels.

3. Polynomial kernel: same idea as the linear kernel, but a degree d is specified; for d = 1 it reduces to the linear kernel.

K(x, xi) = (1 + sum(x * xi))^d

4. Radial kernel: a more complex kernel is the radial (RBF) kernel.

K(x, xi) = exp(-gamma * sum((x - xi)^2))

Here gamma is a parameter of the algorithm; a commonly used value is 0.1, with gamma typically lying between 0 and 1. The radial kernel can carve out complex regions within the feature space, such as closed polygons in two-dimensional space.

5. Margin: the margin should be maximized and kept equidistant from both classes.

Computational command

class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
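
As with KNN, here is a hedged usage sketch of SVC with the radial (RBF) kernel; the C and gamma values are illustrative, pixel values are scaled to [0, 1], and training is restricted to a 10,000-image subset because kernel SVMs scale poorly with the number of samples.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X / 255.0, y, test_size=10000, random_state=42)

svm = SVC(kernel="rbf", C=5.0, gamma=0.05)
svm.fit(X_train[:10000], y_train[:10000])
print("Test accuracy:", svm.score(X_test, y_test))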

NN (Neural networks)

Neural networks mimic the way our brain works. They have grown tremendously with the advancements in computational power.

Neural networks with input, output, and hidden layers

Deep learning refers to neural networks with many layers, i.e. networks connected across multiple layers. The layers are composed of nodes. A node is just a perceptron: it takes an input, performs some computation, and then passes the result through the node's activation function, which determines to what extent the signal progresses further through the network to perform the classification.

Algorithm

  1. Initialize the weights randomly (not by keeping them zero).
  2. Implement forward propagation to compute h_Θ(x(i)) for each training example x(i).
  3. Compute the cost function.
  4. Implement backpropagation to compute the partial derivatives, and use gradient checking to confirm that backpropagation is working correctly. Then disable gradient checking.
  5. Use gradient descent or any built-in optimization function to minimize the cost function with respect to the weights Θ. A minimal sketch of these steps follows.
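
Below is a compact NumPy sketch of these five steps for a single hidden layer with a sigmoid activation and a softmax output (gradient checking is omitted for brevity); the layer sizes and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 784, 100, 10

# 1. Initialize the weights randomly (not zero).
W1 = rng.normal(0, 0.01, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.01, (n_hidden, n_out)); b2 = np.zeros(n_out)

def forward(X):
    # 2. Forward propagation: sigmoid hidden layer, softmax output.
    h = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
    scores = h @ W2 + b2
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)

def train_step(X, Y, lr=0.5):  # X: (n, 784) scaled images, Y: (n, 10) one-hot labels
    global W1, b1, W2, b2
    h, p = forward(X)
    # 3. Cross-entropy cost.
    cost = -np.mean(np.sum(Y * np.log(p + 1e-12), axis=1))
    # 4. Backpropagation: partial derivatives of the cost.
    d_scores = (p - Y) / len(X)
    dW2, db2 = h.T @ d_scores, d_scores.sum(axis=0)
    d_h = (d_scores @ W2.T) * h * (1 - h)
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    # 5. Gradient descent update of the weights.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return cost

# train_step would be called repeatedly on mini-batches of training images.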

Putting it together

Choose the layout of the neural network: the number of hidden units in each layer and the total number of layers.

  1. The number of input units equals the dimension of the features x(i).
  2. The number of output units equals the number of classes.
  3. For the number of hidden units per layer, usually more is better (but this must be balanced against the cost of computation, which increases with more hidden units).
  4. Default: 1 hidden layer; if there is more than 1 hidden layer, use the same number of units in every hidden layer.

Classifier command

MLP stands for multi-layer perceptron; here we use scikit-learn's MLPClassifier with different parameters.

class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
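
A hedged usage sketch with one hidden layer of 100 ReLU units; as before, the data is fetched through fetch_openml and the parameter values are illustrative.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X / 255.0, y, test_size=10000, random_state=42)

mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu", solver="adam",
                    max_iter=50, random_state=42)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))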

CNN (Convolutional Neural Network)

Now let's discuss convolutional neural networks, which have become famous in recent times. A CNN is a deep, feed-forward artificial neural network that can perform a variety of tasks with even better time and accuracy than other classifiers, in applications such as image and video recognition, recommender systems, and natural language processing.

Arrangements of neurons in CNN

The use of CNNs has spread widely: Facebook uses neural nets for its automatic tagging algorithms, Google for photo search, Amazon for product recommendations, Pinterest for home feed personalization, and Instagram for its search infrastructure. Image classification or object recognition is the problem of taking an image as input and predicting whether a condition is satisfied or not (cat or not, dog or not), or the probability of the most likely class for the image. Humans, in contrast, are able to quickly recognize patterns and generalize from prior information and knowledge.

Difference between what we see and what the system sees

Inputs and output

When a computer or system takes an image, it just sees an array of pixel values, say 480*480*3, where 480*480 is the size and 3 refers to the RGB channels. Each of these numbers is assigned a value from 0 to 255, the pixel intensity at that point. Based on this array as input, the computer system predicts, as output, the probability of the image belonging to each class (say 0.90 for class 1, 0.96 for class 2, 0.4 for class 3).

Algorithm

The steps a system performs to make a prediction can be described as follows:

  1. Break the image into small image tiles: similar to a sliding window, we pass a window over the entire large image and save each result separately, as a tiny picture tile of the large image.
  2. Feed each tiny tile into a small neural network: the same network weights are used for every tile, and if a tile contains something of interest, we mark that tile as interesting.
  3. Save the results from each small tile into a new array: we would not like to lose the arrangement of the original image, so we place the results in a grid with the same arrangement as the original image.
  4. Downsampling: to reduce the size of the new array, downsampling is applied via max-pooling.
CNN architecture in MNIST dataset

Layers of Convolutional neural network

Multiple occurrences of these layers show how deep our network is, and this formation is known as a deep neural network.

  1. Input: raw pixel values are provided as input.
  2. Convolutional layer: computes the outputs of neurons connected to local regions of the input. We need to specify the filters to be used; each filter may be, say, a 5*5 window that slides over the input data and produces an activation map.
  3. Rectified linear unit [ReLU] layer: applies an elementwise activation function to the output of the convolution, leaving the size of the volume unchanged, and keeps gradients well-behaved during backpropagation.
  4. Pooling layer: performs a down-sampling operation on the volume along the spatial dimensions (width, height).
  5. Fully connected layer: computes the class scores, and the input digit is assigned the class with the maximum score.

As we go deeper into the layers, the complexity increases a lot. It may be worth it, as accuracy may increase, but unfortunately time consumption increases as well.
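
To make the layer stack concrete, here is a minimal Keras sketch of such a network on MNIST (convolution, ReLU, pooling, fully connected). It assumes TensorFlow/Keras is installed, and the filter counts, sizes, and epochs are illustrative choices, not the exact architecture behind the results reported later.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (5, 5), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),            # down-sampling
    tf.keras.layers.Conv2D(64, (5, 5), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),   # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax")  # one score per digit class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
x_tr, x_te = x_tr[..., None] / 255.0, x_te[..., None] / 255.0
model.fit(x_tr, y_tr, epochs=3, validation_data=(x_te, y_te))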

PERFORMANCE MEASURES

In machine learning and deep learning, the performance or efficiency of a classifier is described by various measures which tell how well the particular classifier works. As the name suggests, these are the measurements or values used to compare the performance of classifiers.

Confusion matrix

This is also called the error matrix. The confusion matrix easily shows what percentage of the predictions made by our classifier were correct and where the classifier found it difficult to predict the actual class. A confusion matrix is usually presented as a table. For our digits, the confusion matrix has 10 classes, i.e. 10 rows and 10 columns, where every digit is compared with every other digit, so we can easily show where our classifier predicted wrongly and where it predicted correctly, along with the total counts.
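
With scikit-learn, such a 10*10 matrix can be produced directly from the true and predicted labels; the labels below are made-up values purely for illustration.

from sklearn.metrics import confusion_matrix

y_true = [3, 8, 3, 5, 1, 7, 1, 2, 5, 3]   # actual digits (illustrative)
y_pred = [3, 3, 3, 6, 1, 1, 1, 2, 5, 8]   # classifier output (illustrative)
# Rows correspond to the true digit, columns to the predicted digit.
print(confusion_matrix(y_true, y_pred, labels=list(range(10))))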

Terminologies used are

  1. TP = True positive
  2. TN = True negative
  3. FP = False positive
  4. FN = False negative

TP is a correct identification of a positive label, TN is a correct identification of a negative label, FP is a negative instance incorrectly identified as positive, and FN is a positive instance incorrectly identified as negative.

Starting with the statistical formulation.

Accuracy

The overall effectiveness of a classifier is best described by accuracy, the proportion of true results (true positives and true negatives) out of the total.

Accuracy = (TP + TN) / N, where N is sum of TP, TN, FN, FP.

The maximum value accuracy can achieve is 1. This happens when the classifier separates the two groups exactly (i.e. FP = 0 and FN = 0). Remember that the total number of actual positives is TP + FN, and the total number of actual negatives is TN + FP.

Sensitivity

Sensitivity can be defined as the effectiveness of the classifier at identifying positive labels. This is also known as recall.

Sensitivity = (TP)/ (TP+FN)

Specificity

This is defined as the effectiveness of the classifier at correctly identifying negative labels.

Specificity = (TN) / (FP + TN)

Both sensitivity and specificity lie between 0 and 1, with 1 being the ideal value for each. Balanced accuracy is calculated as the average of sensitivity and specificity.

Prevalence

Well, how often does the “yes” condition actually occur in our sample?

Prevalence = (TP + FN) / N

N is the sum of all conditions i.e. TP, FN, FP, TN.

Positive predicted values

The proportion of correct results among the labels identified as positive.

Positive_predicted_value = (Sensitivity * Prevalence) / ((Sensitivity * Prevalence) + (1 - Specificity) * (1 - Prevalence))

Negative predicted values

The proportion of correct results in the label identified as negative.

Negative_predicted_value = (Specificity * (1 - Prevalence)) / (((1 - Sensitivity) * Prevalence) + (Specificity * (1 - Prevalence)))

Detection rate

Detection rate is the number of true positives divided by the total number of conditions.

DR = TP / N

Expected accuracy

Also considered the accuracy expected from random chance, given the class counts.

Expected_accuracy = ( (TP + FN) * (TP + FP) + (FP + TN) * (FN + TN) ) / N²

where N is the sum of all conditions, i.e. TP, FN, FP, and TN.

Kappa statistic

The Kappa statistic (or value) is a metric that compares an observed accuracy with an expected accuracy (say random chance).

Kappa = (Observed_accuracy - Expected_accuracy) / (1 - Expected_accuracy)
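
All of the measures above can be computed from the four counts of a binary (one-vs-rest) confusion matrix; a minimal sketch follows, with made-up counts used only to show the calculation.

def performance_measures(TP, TN, FP, FN):
    N = TP + TN + FP + FN
    accuracy    = (TP + TN) / N
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    prevalence  = (TP + FN) / N
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        (1 - sensitivity) * prevalence + specificity * (1 - prevalence))
    detection_rate = TP / N
    expected_acc = ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) / (N * N)
    kappa = (accuracy - expected_acc) / (1 - expected_acc)
    return {"accuracy": accuracy, "sensitivity": sensitivity, "specificity": specificity,
            "PPV": ppv, "NPV": npv, "detection_rate": detection_rate, "kappa": kappa}

print(performance_measures(TP=950, TN=8900, FP=80, FN=70))  # illustrative counts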

There are many performance metrics that show how well a classifier performs in these statistical situations. How fast the classifier predicts also contributes to its overall performance, and we then call it a fast classifier. Throughout these examples, the focus is on describing the breakdown of errors in the predictions on an unseen dataset.

Result

For research purposes, or when applying the classifiers to real-world problems, accuracy and recognition speed are considered the most useful measures. Let us now look at the different classifiers one by one.

Overall comparison results

The comparison below shows accuracy, time, specificity, sensitivity, and other measures across the different classifiers on the training and test data.

Classifier comparison on training and test data

Conclusion and Future work

Using machine learning algorithms such as KNN, SVM, and neural networks along with different parameters and feature scaling, we saw a comparison among the classifiers in terms of the most important measures, accuracy and time. Accuracy can vary depending on the split between training and testing data, and it can be further improved if more training and testing data are provided; there is always a chance of improving accuracy as the size of the data increases. Every classifier has its own accuracy and time consumption. We can also add that if the computation moves from CPU to GPU, the classifiers can achieve better accuracy in less time, and better results can be observed.

The performance of a classifier can be measured in terms of its ability to identify a condition properly (sensitivity), the proportion of true results (accuracy), the proportion of positive classification results that are actually positive (positive predictive value), and the ability to exclude a condition correctly (specificity). In this article, we saw a brief comparison of machine learning and deep learning classifiers.

So far, deep learning algorithms have performed best on the handwritten digit recognition task.

Future studies might use the convolutional network architecture that gave the best result on the MNIST database and implement the proposed recognition system on handwritten digits. Similar systems can be designed for handwritten character recognition, object recognition, image segmentation, handwriting recognition, and text language recognition, and future studies might also consider a hardware implementation of an online digit recognition system with better performance and efficiency, producing live results from live test scenarios.

Thank you for reading my article.
Read my publication on Handwritten Digit Recognition using Machine Learning, published in the International Journal of Computer Science and Engineering in June 2018, here: https://www.ijcseonline.org/spl_pub_paper/IJCSE-ETACIT-2K18-019%20GEU.pdf


Himanshu Beniwal

Computer 💻 Vision | Machine Learning | Data Science | Social Network Analysis |