Handwritten Digits Prediction Analysis

Aditya Jetely
5 min read · Oct 13, 2020


A brief analysis of handwritten digit prediction results.

Photo by Antoine Dautry on Unsplash

Introduction

In this notebook I analyze the accuracy of handwritten digit recognition. The dataset used here ships with the scikit-learn library, so it does not need to be downloaded separately.

Problem Statement

The scikit-learn library provides several datasets that are useful for testing data analysis and prediction problems, and the Digits dataset is one of them. Some scientists claim that a model can predict the digit correctly 95% of the time. Perform a data analysis to accept or reject this hypothesis.

Importing Libraries

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Loading Data

from sklearn import datasets
digits = datasets.load_digits()
digits.data
array([[ 0., 0., 5., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 10., 0., 0.],
[ 0., 0., 0., ..., 16., 9., 0.],
...,
[ 0., 0., 1., ..., 6., 0., 0.],
[ 0., 0., 2., ..., 12., 0., 0.],
[ 0., 0., 10., ..., 12., 1., 0.]])

Understanding the data

digits.data.shape
(1797, 64)

We can see that our dataset has 1797 images, each stored as a flattened vector of 64 pixel values, which corresponds to an 8x8 image.

digits.target.shape
(1797,)

We have 1797 labels in the target, one per image, with values from 0 to 9.
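To make the relationship between the 64-value rows and the 8x8 images concrete, here is a minimal sketch. It assumes only the `digits` object loaded above; the variable names `first_image`, `same_pixels`, and `class_counts` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Each row of digits.data is one 8x8 image flattened to 64 values.
first_flat = digits.data[0]
first_image = first_flat.reshape(8, 8)

# digits.images holds the same pixels in their original 8x8 layout.
same_pixels = np.array_equal(first_image, digits.images[0])
print(first_image.shape, same_pixels)

# Number of samples per digit class (index = digit, value = count).
class_counts = np.bincount(digits.target)
print(class_counts)
```

Checking the per-class counts is a quick way to see whether the ten digits are roughly balanced before reading too much into a single accuracy number.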

Let's see some Images and labels

plt.figure(figsize=(15,9))
for index, (image, label) in enumerate(zip(digits.data[0:6], digits.target[0:6])):
    plt.subplot(1, 6, index + 1)
    plt.imshow(np.reshape(image, (8,8)), cmap='gray')
    plt.title(f'Training: {label}')

[Figure: the first six digit images with their labels]

Splitting Data into Training and Test Sets

Let's split our data into a training set for fitting the model and a test set for evaluating it later.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=42)
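As a quick sanity check on the split, the sketch below repeats the same call and prints the resulting shapes. It uses the same arguments as above and introduces nothing new beyond the `print` calls.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# With 1797 samples and test_size=0.2, the test set is rounded up to 360 rows
# and the remaining 1437 rows form the training set.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Fixing `random_state` keeps the split reproducible, so the accuracy reported later refers to one specific set of 360 test images.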

Let’s take a quick look at the training data.

X_train
array([[ 0.,  0.,  3., ..., 13.,  4.,  0.],
[ 0., 0., 9., ..., 3., 0., 0.],
[ 0., 0., 0., ..., 6., 0., 0.],
...,
[ 0., 0., 9., ..., 16., 2., 0.],
[ 0., 0., 1., ..., 0., 0., 0.],
[ 0., 0., 1., ..., 1., 0., 0.]])
y_train
array([6, 0, 0, ..., 2, 7, 1])

Modelling our dataset

from sklearn.linear_model import LogisticRegression  # importing the classifier
model = LogisticRegression()  # creating an instance of the logistic regression class
model.fit(X_train, y_train)  # fitting the model on the training data
LogisticRegression()

Making individual predictions.

model.predict(X_test[0].reshape(1,-1))
array([6])
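The `reshape(1, -1)` is needed because `predict` expects a 2D array with one row per sample. The sketch below repeats the single prediction and, as an addition not in the original notebook, also looks at the per-class probabilities from `predict_proba`. The names `sample`, `probabilities`, and `most_likely` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Same default model as in the article (it may emit a convergence warning).
model = LogisticRegression()
model.fit(X_train, y_train)

# predict expects shape (n_samples, n_features), so a single
# 64-value image is reshaped to (1, 64).
sample = X_test[0].reshape(1, -1)
predicted = model.predict(sample)

# predict_proba returns one probability per class for each sample.
probabilities = model.predict_proba(sample)
most_likely = model.classes_[np.argmax(probabilities, axis=1)]
print(predicted, most_likely, probabilities.shape)
```

Looking at the probabilities can help show how confident the model is in a given prediction, which a bare class label does not reveal.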

Making predictions on a few slices of the test set.

model.predict(X_test[0:6])
array([6, 9, 3, 7, 2, 1])
model.predict(X_test[15:21])
array([2, 3, 7, 8, 8, 4])
model.predict(X_test[45:51])
array([1, 3, 0, 6, 5, 5])

Let’s visualize some of the test images alongside their true labels.

plt.figure(figsize=(15,9))
for index, (image, label) in enumerate(zip(X_test[0:6], y_test[0:6])):
    plt.subplot(1, 6, index + 1)
    plt.imshow(np.reshape(image, (8,8)), cmap='gray')
    plt.title(f'Actual: {label}')  # these are test images, so show the true label

[Figure: test images 0 to 5 with their true labels]

plt.figure(figsize=(15,9))
for index, (image, label) in enumerate(zip(X_test[15:20], y_test[15:20])):
    plt.subplot(1, 5, index + 1)
    plt.imshow(np.reshape(image, (8,8)), cmap='gray')
    plt.title(f'Actual: {label}')  # these are test images, so show the true label

[Figure: test images 15 to 19 with their true labels]

Making predictions on the entire test set

predictions = model.predict(X_test)
predictions
array([6, 9, 3, 7, 2, 1, 5, 2, 5, 2, 1, 9, 4, 0, 4, 2, 3, 7, 8, 8, 4, 3,
9, 7, 5, 6, 3, 5, 6, 3, 4, 9, 1, 4, 4, 6, 9, 4, 7, 6, 6, 9, 1, 3,
6, 1, 3, 0, 6, 5, 5, 1, 3, 5, 6, 0, 9, 0, 0, 1, 0, 4, 5, 2, 4, 5,
7, 0, 7, 5, 9, 5, 5, 4, 7, 0, 4, 5, 5, 9, 9, 0, 2, 3, 8, 0, 6, 4,
4, 9, 1, 2, 8, 3, 5, 2, 9, 0, 4, 4, 4, 3, 5, 3, 1, 3, 5, 9, 4, 2,
7, 7, 4, 4, 1, 9, 2, 7, 8, 7, 2, 6, 9, 4, 0, 7, 2, 7, 5, 8, 7, 5,
7, 5, 0, 6, 6, 4, 2, 8, 0, 9, 4, 6, 9, 9, 6, 9, 0, 5, 5, 6, 6, 0,
6, 4, 3, 9, 3, 8, 7, 2, 9, 0, 6, 5, 3, 6, 5, 9, 9, 8, 4, 2, 1, 3,
7, 7, 2, 2, 3, 9, 8, 0, 3, 2, 2, 5, 6, 9, 9, 4, 1, 2, 4, 2, 3, 6,
4, 8, 5, 9, 5, 7, 8, 9, 4, 8, 1, 5, 4, 4, 9, 6, 1, 8, 6, 0, 4, 5,
2, 7, 1, 6, 4, 5, 6, 0, 3, 2, 3, 6, 7, 1, 9, 1, 4, 7, 6, 5, 8, 5,
5, 1, 5, 2, 8, 8, 9, 9, 7, 6, 2, 2, 2, 3, 4, 8, 8, 3, 6, 0, 9, 7,
7, 0, 1, 0, 4, 5, 1, 5, 3, 6, 0, 4, 1, 0, 0, 3, 6, 5, 9, 7, 3, 5,
5, 9, 9, 8, 5, 3, 3, 2, 0, 5, 8, 3, 4, 0, 2, 4, 6, 4, 3, 4, 5, 0,
5, 2, 1, 3, 1, 4, 1, 1, 7, 0, 1, 5, 2, 1, 2, 8, 7, 0, 6, 4, 8, 8,
5, 1, 8, 4, 5, 8, 7, 9, 8, 6, 0, 6, 2, 0, 7, 9, 8, 9, 5, 2, 7, 7,
1, 8, 7, 4, 3, 8, 3, 5])
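A long array of predictions is hard to read on its own. One possible way to inspect it, sketched here as an addition to the original notebook, is to compare the predictions with `y_test` and build a confusion matrix using `sklearn.metrics.confusion_matrix`. The names `wrong` and `cm` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Same default model as in the article (it may emit a convergence warning).
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Indices of test images where the prediction and the true label disagree.
wrong = np.flatnonzero(predictions != y_test)
print(f"{len(wrong)} of {len(y_test)} test images misclassified")

# Rows are true digits, columns are predicted digits.
cm = confusion_matrix(y_test, predictions)
print(cm)
```

The off-diagonal entries of the matrix show which digits get confused with each other, something a single accuracy value cannot tell us.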

Measuring Model Performance

score = model.score(X_test, y_test)
score
0.9694444444444444

Our model has an accuracy of around 97% on the test set, which is quite good.
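Since the goal is to accept or reject a 95% claim, it may help to attach an uncertainty range to the measured accuracy. The sketch below is an addition to the original analysis: it computes a normal-approximation (Wald) 95% confidence interval, assuming a test set of 360 images (20% of 1797, rounded up) and treating each test prediction as an independent trial.

```python
import math

# Values taken from the article: the reported test accuracy and the
# assumed test-set size (20% of 1797 samples, rounded up to 360).
accuracy = 0.9694444444444444
n_test = 360
claimed = 0.95

# Normal-approximation (Wald) 95% confidence interval for a proportion.
z = 1.96
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
lower = accuracy - z * std_err
upper = accuracy + z * std_err

print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
print(f"Lower bound exceeds the claimed {claimed:.0%}: {lower > claimed}")
```

With these numbers the lower bound comes out only slightly above 0.95, so the margin is narrow and the result depends on this particular split. Repeating the evaluation with cross-validation would give a more robust estimate.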

Conclusion

  • The accuracy obtained for the model on the test set is above 95%, so the hypothesis in the problem statement holds true.
  • The accuracy is high partly because this is a small, clean dataset of low-resolution 8x8 images, which makes it easier than most real-world handwriting data.
  • An SVM classifier could also be trained on the same split and its results compared with these.
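As a starting point for the SVM comparison mentioned in the last bullet, here is a minimal sketch using `sklearn.svm.SVC` with its default settings on the same split. It is not part of the original analysis, the hyperparameters are untuned, and the names `svm_model` and `svm_score` are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# SVC with default settings (RBF kernel); no hyperparameter tuning is done.
svm_model = SVC()
svm_model.fit(X_train, y_train)

# Mean accuracy on the same held-out test images.
svm_score = svm_model.score(X_test, y_test)
print(f"SVM test accuracy: {svm_score:.4f}")
```

Using the same `random_state` keeps the comparison fair, since both models are then evaluated on identical test images.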


Aditya Jetely

Final Year Electronics and Communication Engineering Student with a keen interest in data science and open source. https://www.linkedin.com/in/aditya-jetely