Pratyush Gautam
Aug 22, 2021

Handwritten digit recognition has long been a challenging task. Many real-life applications require classifying handwritten text or numbers: digit recognition is used in postal mail sorting, bank check processing, and form data entry, among other things.

Consider the ZIP codes on mail and the automation required to recognise these five digits at the post office. To sort mail mechanically and efficiently, these codes must be recognised reliably. OCR (Optical Character Recognition) software is one of the many applications that may spring to mind: it must read handwritten text as well as pages from printed books with well-defined characters.

Hypothesis :

The Scikit-learn library ships with a number of built-in data sets that can be used to test various data analysis and prediction challenges; the Digits data set is one of them. According to certain researchers, a model trained on it predicts the digit correctly 95% of the time. We will analyze the data to see whether this hypothesis holds.

Prerequisites :

Sklearn

Matplotlib

Basics of Machine learning


Dataset :

We’re going to use the Handwritten Digits dataset from the Sklearn library for this project. Using the code below, we can import the dataset.

import matplotlib.pyplot as plt  # used later for plotting
from sklearn import datasets

digits = datasets.load_digits()

I have loaded the dataset with load_digits() and created an instance of it called ‘digits’. Let’s find out more about this dataset using its DESCR attribute.

print(digits.DESCR)

output:
.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 1797
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.
Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

.. topic:: References

- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
Graduate Studies in Science and Engineering, Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionalityreduction using relevance weighted LDA. School of
Electrical and Electronic Engineering Nanyang Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.

This dataset contains 1797 grayscale images of handwritten digits. Each image is an 8x8 matrix with no null values. The whole dataset is a dictionary-like object with several keys: the images key holds the 8x8 image matrices, the data key holds the same pixels flattened into 64-element rows, the target key holds each image’s label, and target_names holds the list of possible labels.

Checking all the keys of this dataset:

digits.keys()

output:
dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
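As a quick sanity check (not part of the original write-up), we can confirm that each row under the data key is just the corresponding 8x8 image flattened to 64 values:

import numpy as np

# each row of digits.data is the matching 8x8 image from digits.images, flattened
print(np.array_equal(digits.images[0].ravel(), digits.data[0]))  # expected: True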

The images of the handwritten digits are contained in the digits.images array. Each element of this array is an image represented by an 8x8 matrix of numerical values that correspond to a grayscale running from white, with a value of 0, to black, with a value of 16.

digits.images[0]

output:
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])

Our dataset is stored in digits. The command below renders a digit as a grayscale image. Now, let us plot the first five images.

for i in range(5):
    plt.matshow(digits.images[i], cmap=plt.cm.gray_r)

Let us now view the targets and their size.

digits.target[0:5]

OUTPUT:
array([0, 1, 2, 3, 4])

digits.target.size
OUTPUT:
1797
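To see how those 1797 samples are spread across the ten classes, one optional check (a small sketch using NumPy, which sklearn already depends on) is:

import numpy as np

# count how many samples belong to each digit class 0-9
counts = np.bincount(digits.target)
for label, count in zip(digits.target_names, counts):
    print(label, count)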

Visualizing the images and labels in our Dataset.

plt.subplot(321)
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(322)
plt.imshow(digits.images[1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(323)
plt.imshow(digits.images[2], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(324)
plt.imshow(digits.images[3], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(325)
plt.imshow(digits.images[4], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(326)
plt.imshow(digits.images[5], cmap=plt.cm.gray_r, interpolation='nearest')

OUTPUT:

[Figure: visualizing the images of six digits]
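The same grid can be produced more compactly with a loop; here is a minimal sketch that also puts each image’s label in the subplot title:

# same 3x2 grid, written as a loop, with the true label shown above each image
for i in range(6):
    plt.subplot(3, 2, i + 1)
    plt.imshow(digits.images[i], cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Label: %d' % digits.target[i])
plt.tight_layout()
plt.show()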

Splitting the data into train and test sets using the train_test_split function:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2)
X_train.shape, X_test.shape

OUTPUT:
((1437, 64), (360, 64))
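Note that the split is random, so the accuracies below will vary slightly from run to run. If you want reproducible numbers, one optional tweak (not used in the original run) is to fix the random_state:

from sklearn.model_selection import train_test_split

# fixing random_state makes the split, and hence the scores, reproducible
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)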

Model Planning:

To see how different models perform on this data, we use three classifiers: Support Vector Classifier, Decision Tree Classifier, and Random Forest Classifier.

  1. Support Vector Classifier :

The support vector machine algorithm’s goal is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly separates the data points of different classes.

Code :

from sklearn import svm

svc = svm.SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
              degree=3, gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
              random_state=None, shrinking=True, tol=0.001, verbose=False)
svc.fit(X_train, y_train)

OUTPUT:
SVC(C=100.0, gamma=0.001)
predictions1 = svc.predict(X_test)
predictions1

OUTPUT:

# Case 1
pred1 = svc.predict(digits.data[0:5])
pred1

OUTPUT:
array([0, 1, 2, 3, 4])

# Case 2
pred2 = svc.predict(digits.data[1791:1796])
pred2

OUTPUT:
array([4, 9, 0, 8, 9])

# Case 3
pred3 = svc.predict(digits.data[700:710])
pred3

OUTPUT:
array([2, 0, 1, 2, 6, 3, 3, 7, 3, 3])

digits.target[700:710]

OUTPUT:
array([2, 0, 1, 2, 6, 3, 3, 7, 3, 3])

# We use accuracy_score as the classification metric
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions1)

OUTPUT:
0.9833333333333333
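A single accuracy number hides which digits are confused with which. As an optional follow-up, sklearn’s confusion_matrix breaks the errors down per class (the exact matrix depends on the random split, so no output is shown here):

from sklearn.metrics import confusion_matrix

# rows are true digits, columns are predicted digits;
# off-diagonal entries count the misclassifications
print(confusion_matrix(y_test, predictions1))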

2. Decision Tree Classifier :

A simple and extensively used classification technique is the Decision Tree Classifier. It solves the classification problem with a simple concept. The Decision Tree Classifier asks a series of well-crafted queries about the test record’s properties. Each time it receives a response, it asks a follow-up question until it reaches a conclusion regarding the record’s class label.

Code :

# import the classifier
from sklearn.tree import DecisionTreeClassifier

# Instantiate the model
# (criterion='entropy' leads to nearly the same result)
dt = DecisionTreeClassifier(criterion='gini')

# fit the model on the training data
dt.fit(X_train, y_train)

# predict on the test data
predictions2 = dt.predict(X_test)
accuracy_score(y_test, predictions2)

OUTPUT:
0.8527777777777777

3. Random Forest Classifier :

Random forest is a supervised learning algorithm. It can be used for both classification and regression, and it is among the most flexible and easy-to-use algorithms.

Code :

from sklearn.ensemble import RandomForestClassifier

rc = RandomForestClassifier(n_estimators=150)
rc.fit(X_train, y_train)
predictions3 = rc.predict(X_test)
accuracy_score(y_test, predictions3)

OUTPUT:
0.9694444444444444
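To compare the three models at a glance, here is a small sketch that prints all three test accuracies side by side:

# side-by-side comparison of the three classifiers on the same test split
for name, preds in [('SVC', predictions1),
                    ('Decision Tree', predictions2),
                    ('Random Forest', predictions3)]:
    print(name, accuracy_score(y_test, preds))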

Conclusion :

From this article, we can see how easily we can import a dataset, build a model using Scikit-Learn, train it, make predictions, and measure the accuracy of those predictions. The Support Vector Classifier performed best of the three, at about 98.3%, ahead of the Random Forest (96.9%) and the Decision Tree (85.3%), which supports our hypothesis that the digit can be predicted correctly at least 95% of the time.

I am thankful to the mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com