Digit Prediction using Multinomial Naive Bayes in Python
The full code can be found on GitHub.
We import the data:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
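Before going further, it is worth sanity-checking the loaded frames; a quick sketch, assuming the Kaggle Digit Recognizer CSV layout (a label column followed by 784 pixel columns):
print(train.shape)  # expected: (42000, 785) for the Kaggle training set
print(test.shape)   # expected: (28000, 784); the test set has no label column
print(train['label'].value_counts().sort_index())  # rough class balance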
The goal is to classify handwritten digits using the famous MNIST data. Each image is 28x28 pixels (height by width), and each pixel holds an integer intensity between 0 and 255. We flatten each image into a single array of 784 pixel values. Visually, if we omit the “pixel” prefix from the column names, the pixels make up the image like this:
000 001 002 003 ... 026 027
028 029 030 031 ... 054 055
056 057 058 059 ... 082 083
| | | | ... | |
728 729 730 731 ... 754 755
756 757 758 759 ... 782 783
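Equivalently, the flat index of the pixel at (row, col) is row * 28 + col; a quick check against the diagram:
# The pixel at (row, col) in the 28x28 grid sits at flat index row * 28 + col.
row, col = 2, 3
print(row * 28 + col)  # 59, matching the third row of the diagram above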
We can use the following code to display a sample digit (features_train is extracted in the next step):
# features_train[0] is a flat array of 784 values; reshape it back to 28x28.
B = np.reshape(features_train[0], (28, 28))
plt.imshow(B, cmap=plt.cm.gray_r, interpolation='nearest')

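To eyeball more than one example, a small grid of the first ten digits can be drawn straight from the DataFrame; a sketch (the layout choices here are my own):
# Show the first 10 digits with their labels as titles.
fig, axes = plt.subplots(2, 5, figsize=(8, 4))
for i, ax in enumerate(axes.flat):
    img = train.iloc[i, 1:].to_numpy().reshape(28, 28)
    ax.imshow(img, cmap=plt.cm.gray_r)
    ax.set_title(train.iloc[i, 0])
    ax.axis('off')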
We extract the features and labels of the data, using the first 2,000 rows for training and the following rows for testing.
# Column 0 holds the label; columns 1-784 hold the pixel values.
features_train = train.iloc[0:2000, 1:785].to_numpy()
labels_train = train.iloc[0:2000, 0:1].to_numpy()
features_test = train.iloc[2001:3000, 1:785].to_numpy()
labels_test = train.iloc[2001:3000, 0:1].to_numpy()
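As an alternative to this fixed slice, scikit-learn's train_test_split gives a shuffled split over the full labelled set; a minimal sketch (the variable names here are my own):
from sklearn.model_selection import train_test_split

# Shuffled 75/25 split instead of a fixed row range.
X = train.iloc[:, 1:].to_numpy()
y = train.iloc[:, 0].to_numpy()
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)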
We then apply multinomial Naive Bayes.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(features_train, labels_train.ravel())
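Since MultinomialNB models each pixel intensity as a count, the fitted classifier can also report per-class probabilities, which is handy for inspecting uncertain predictions:
# Per-class probabilities for the first test image
# (one row per sample, one column per digit 0-9).
probs = clf.predict_proba(features_test[:1])
print(probs.round(3))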
We then predict labels for the test data.
y_pred = clf.predict(features_test)
y_true = labels_test.ravel()
We evaluate the accuracy of our model.
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
which comes out to around 0.8328. Not bad, but there is plenty of room for tuning.
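To see which digits the model confuses, a confusion-matrix heatmap (putting the seaborn import at the top to use) is a natural next step; a minimal sketch:
from sklearn.metrics import confusion_matrix

# Rows are true digits, columns are predicted digits.
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted digit')
plt.ylabel('True digit')
plt.show()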