Learn to do Image Classification using Stochastic Gradient Descent and Random Forest Classifier

Gurupratap S Matharu
11 min read · Mar 2, 2019


Introduction

  • We train a machine learning classification model on the MNIST images dataset from the mldata.org repository.
  • We are doing supervised learning here, and our aim is image classification and noise reduction.
  • Along the way we’ll pick up the important tools needed to develop a powerful ML model.
  • We’ll play with tools like Stochastic Gradient Descent, Random Forest, the confusion matrix, precision, recall, ROC curves, area under the curve, and cross-validation to reach our goal.
  • We’ll evaluate the performance of each of our classifiers using precision and recall scores, and tune hyper-parameters to further optimize our models.
  • We’ll validate our predictions against our test dataset and conclude our learning.
An end-to-end Machine Learning project

To do an end-to-end Machine Learning project we need to do the following steps

1. Understand the requirements of the business.

2. Acquire the data set.

3. Visualize the data to understand it better and develop our intuition.

4. Pre-process the data to make it ready to feed to our ML model.

5. Try various models and train them. Select one that we find best.

6. Fine-tune our model by tuning hyper-parameters

7. Present our solution to the team.

8. Launch, monitor, and maintain our system.

1. Understand the requirements of the business

We are enthusiastic data scientists, and before starting we need to ask some fundamental questions:

Why does our organisation need this classifier or machine learning model?

  • possibly we have a software product, and adding image recognition capabilities could be a great advantage
  • the organisation will use this data to feed another machine learning model
  • the current process is good but manual and time consuming
  • our organisation wants an edge over the competition
  • we want to reduce noise from existing corrupted images, and this data is valuable

Acquire the Dataset

import numpy as np
from six.moves import urllib
from sklearn.datasets import fetch_mldata

try:
    mnist = fetch_mldata('MNIST original')
except urllib.error.HTTPError as ex:
    print("Could not download MNIST data from mldata.org, trying alternative...")
    # Alternative method to load MNIST, if mldata.org is down
    from scipy.io import loadmat
    mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
    mnist_path = "./mnist-original.mat"
    response = urllib.request.urlopen(mnist_alternative_url)
    with open(mnist_path, "wb") as f:
        content = response.read()
        f.write(content)
    mnist_raw = loadmat(mnist_path)
    mnist = {
        "data": mnist_raw["data"].T,
        "target": mnist_raw["label"][0],
        "COL_NAMES": ["label", "data"],
        "DESCR": "mldata.org dataset: mnist-original",
    }
print("Success!")

This might take some time. If it doesn’t work then download the dataset here https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat

Save it in your current directory and then run the following code.

from scipy.io import loadmat

mnist_path = './mnist-original.mat'
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}
print("Success!")

Let’s check our dataset

mnist
{'data': array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
'target': array([0., 0., 0., ..., 9., 9., 9.]),
'COL_NAMES': ['label', 'data'],
'DESCR': 'mldata.org dataset: mnist-original'}

The MNIST dataset is already divided into data and target labels. Let’s extract it now

X, y = mnist["data"], mnist["target"]
X.shape
# (70000, 784)
# so our data has 70000 instances (rows) and 784 features (columns)
y.shape
# (70000,) our target labels

Let’s import Matplotlib and analyze an image

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)  # 784 pixels = a 28 x 28 image
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")
plt.show()

# Let's cross-check it
y[36000]
# 5.0

We divide the dataset into 60,000 instances for training and the remaining 10,000 for testing, like this…

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Next comes shuffling: randomizing the order of the training set so that every cross-validation fold gets a similar mix of digits. This is recommended. Let’s do it

shuffle_index = np.random.permutation(60000)  # a random permutation of the indices 0..59999
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

Simple Binary Classification

For simplicity we’ll build a classifier that only detects whether an image is a 5 or not. Since the answer is True or False, this is binary classification.

y_train_5 = (y_train == 5)  # True for 5s, False for all other digits
# Remember y_train holds all our target labels;
# y_train_5 reduces them to booleans ("is it a 5?")
# could be tricky but a nice shortcut
# we repeat the same thing for our test dataset
y_test_5 = (y_test == 5)

Stochastic Gradient Descent Classifier

This is a good classifier to start with, as it handles large datasets efficiently and trains on instances independently, one at a time

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)  # instantiate
sgd_clf.fit(X_train, y_train_5)           # train the classifier
sgd_clf.predict([some_digit])             # make it predict our digit
# array([ True])

For me the prediction is correct and the classifier seems to work. What is your result? Shall we test a few more digits?

for index in range(1000, 70000, 1000):
    digit = X[index]
    predicted = sgd_clf.predict([digit])
    if predicted:  # only print when the classifier claims "this is a 5"
        print("{} == {}".format(predicted, y[index]), end=",")

# prints
[ True] == 1.0
[ True] == 1.0
[ True] == 5.0
[ True] == 5.0
[ True] == 5.0
[ True] == 5.0
[ True] == 5.0
[ True] == 5.0
[ True] == 8.0
[ True] == 5.0
# Left: our classifier claims the digit is a 5
# Right: actual value of the digit
# So there are failures. Let's investigate further

Performance Measures

Let’s evaluate our SGD classifier by using K-fold Cross Validation. We’ll use 3 folds and we’ll get a score for each fold.

from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# array([0.96485, 0.95   , 0.9494 ])
# accuracy for each fold: 96.49%, 95%, 94.94%

This is not the best way to evaluate a classifier, though: accuracy is a misleading metric on a skewed dataset like ours, where only about 10% of the images are 5s.
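
Here is a minimal sketch to prove the point (the Never5Classifier name is ours, purely for illustration): a classifier that never predicts a 5 still scores about 90% accuracy.

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score

class Never5Classifier(BaseEstimator):
    """A do-nothing classifier that always answers 'not a 5'."""
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros((len(X),), dtype=bool)

cross_val_score(Never5Classifier(), X_train, y_train_5, cv=3, scoring="accuracy")
# roughly array([0.909, 0.909, 0.909]), since only about 10% of the images are 5s

So accuracy alone can’t separate a real 5-detector from a dumb one. Luckily we have a more powerful tool with us called…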

Confusion Matrix

  • It is a grid of all labels against all labels for our classifier
  • Helps us identify which labels our classifier is predicting wrong
  • To build it we need predictions for the whole training set, so let's generate them
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
confusion_matrix(y_train_5, y_train_pred)
# Remember our sgd_clf is a binary (5 / not-5) classifier,
# so we get a 2x2 array:
array([[53361,  1218],
       [ 1497,  3924]])
# an ideal confusion matrix has non-zero elements only on the
# diagonal, like this...
confusion_matrix(y_train_5, y_train_5)
# array([[54579,     0],
#        [    0,  5421]])
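
To read the matrix: each row is an actual class and each column a predicted class, so for our binary case the four cells are

# [[TN, FP],   row 0, actual non-5s: 53361 correctly rejected, 1218 false alarms
#  [FN, TP]]   row 1, actual 5s:     1497 missed,              3924 correctly detected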

Precision and Recall

These are fundamental tools to evaluate and fine tune a classifier. Consider this example

  • If our classifier claims that “these 10 images have an apple in it” but in reality only 8 images contain apples then the precision is 0.8 or 80%
  • If we give 10 images with apples to our classifier but it recognizes only 7 and rejects 3 then its recall is 0.7 or 70%
  • If we aim for higher precision we compromise on recall, and vice versa
  • Ideally we want both high precision and high recall, but in practice we pick a threshold value that gives us the trade-off we need
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)
# 0.7631271878646441
recall_score(y_train_5, y_train_pred)
# 0.7238516878804648
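
These scores follow directly from the confusion matrix; a quick hand-check:

# precision = TP / (TP + FP), recall = TP / (TP + FN)
tp, fp, fn = 3924, 1218, 1497  # counts from the confusion matrix above
print(tp / (tp + fp))  # 0.7631..., matches precision_score
print(tp / (tp + fn))  # 0.7238..., matches recall_score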

F1 Score

It is the harmonic mean of precision and recall, and a convenient single number for comparing classifiers: it is high only when both precision and recall are high. Let’s calculate it

from sklearn.metrics import f1_score

print("My dear the F1 score is = ", f1_score(y_train_5, y_train_pred))
# My dear the F1 score is =  0.74297074694689
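
Since F1 is the harmonic mean, we can verify it from the precision and recall we computed earlier:

# F1 = 2 * precision * recall / (precision + recall)
p, r = 0.7631271878646441, 0.7238516878804648
print(2 * p * r / (p + r))  # 0.74297..., matches f1_score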

So how do we know which threshold to use for our classifier?

  1. First we get decision scores from our classifier by asking cross_val_predict for the decision_function instead of predictions
  2. Then we plot precision and recall against the threshold
  3. From that plot we can pick a threshold score
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
# array([-1128277.51063884,  -166965.96182519, -1236785.06540279, ...,
#         -334287.64724002,  -289424.98623176,  -982906.79459895])

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

It is convenient to write a small Matplotlib function to plot the curves

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

Let’s plot it

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
Precision and Recall Vs Threshold

See that intersection point? That is our key to choosing a threshold value. Let’s play with it

y_scores = sgd_clf.decision_function([some_digit])
y_scores
# array([151793.48081401])

# say we set the threshold to zero
threshold = 0
# let's ask for a prediction
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
# array([ True])

# now raise the threshold to 200000, above our score of ~151793
threshold = 200000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
# array([False])

We should also plot precision directly against recall, with the threshold left implicit, for a better understanding. Let’s do it

plt.plot(precisions, recalls, "b--", label="Precision vs Recall")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc="upper left")
plt.show()
Precision Vs Recall

If someone says “let’s reach 99% precision,” you should ask, “at what recall?”

See how the recall falls beyond 80% precision? Let’s aim for 90% precision by setting the threshold to 70,000 and see what precision and recall we get.

y_train_pred_90 = (y_scores > 70000)

precision_score(y_train_5, y_train_pred_90)
# 0.8659205116491548
recall_score(y_train_5, y_train_pred_90)
# 0.6993174691016417

So at a threshold of 70,000 we get 86.6% precision with a recall of 70%; pushing the threshold a bit higher would get us to 90% precision at a still lower recall. This way we can tune the model to our needs. It’s like a knob you turn.

ROC Curve

  • ROC stands for Receiver Operating Characteristic
  • Sensitivity = Recall = True Positive Rate
  • Specificity = True Negative Rate
  • False Positive Rate = 1 - Specificity
  • The larger the area under the curve, the better our classifier

The ROC curve plots the True Positive Rate (recall) against the False Positive Rate (1 - specificity), one point per threshold

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
# fpr = false positive rate = (1 - true negative rate)
# tpr = true positive rate

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal = a purely random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

plot_roc_curve(fpr, tpr, "ROC Curve")
plt.show()
ROC curve

We can calculate the area under this curve (AUC) like this

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)
# 0.9624496555967156

Now it’s time to train another classifier and compare the results

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
y_probas_forest
# array([[1. , 0. ],
#        [0.9, 0.1],
#        [1. , 0. ],
#        ...,
#        [1. , 0. ],
#        [1. , 0. ],
#        [1. , 0. ]])

The Random Forest classifier gives us an array of probabilities. Rows are instances and columns are classes (not-5, 5)

y_scores_forest = y_probas_forest[:, 1]  # the column at index 1 holds the probability
                                         # of the positive class (5), so we use it as the score

Now we plot the ROC curves for this classifier. All steps are similar

fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, 
y_scores_forest)
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()
ROC Curve
  • See how the Random Forest curve hugs the top-left corner, covering more area?
  • So it performs better than the Stochastic Gradient Descent classifier
  • Let’s find the area under the curve
roc_auc_score(y_train_5, y_scores_forest)
# 0.9931243366003829
# for SGD this was 0.96

Multi Class Classification

  • Up till now we did 5 or not-5, i.e. binary classification
  • When there are more than two classes it’s a multi-class (multinomial) classification
  • In our case we have the digits 0, 1, 2, …, 9, i.e. 10 classes in all
  • Random Forest and Naive Bayes classifiers can handle multiple classes directly
  • Support Vector Machine (SVM) classifiers and linear classifiers (like our SGD) are strictly binary

One Vs All Strategy

  • For our 10 digits we can train 10 binary classifiers
  • One classifier per digit: a 1-detector, a 2-detector, a 3-detector, and so on
  • Then we take the best decision score among the classifiers, and that decides the digit
  • So we can do multi-class classification with multiple binary classifiers, as the sketch below shows
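
If we want this strategy explicitly, scikit-learn ships a OneVsRestClassifier wrapper; a minimal sketch:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
ovr_clf.fit(X_train, y_train)
len(ovr_clf.estimators_)  # 10, one binary detector per digit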

One Vs One Strategy

  • Here we train a classifier for every pair of digits
  • A 1 vs 2 classifier, a 1 vs 3 classifier, a 2 vs 3 classifier, and so on
  • For N digits we end up with N * (N - 1) / 2 classifiers; in our case that’s 10 * 9 / 2 = 45!
  • So basically we’ll have to train 45 binary classifiers
  • The advantage is that each classifier only needs the training data for its own pair of digits (see the sketch below)
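
The matching sketch for this strategy uses OneVsOneClassifier, which trains all the pairwise classifiers for us:

from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import SGDClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
len(ovo_clf.estimators_)  # 45 = 10 * 9 / 2 pairwise classifiers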

So let’s make our Stochastic Gradient classifier multi-class now

sgd_clf.fit(X_train, y_train)  # y_train, not y_train_5!
sgd_clf.predict([some_digit])
# array([5.])

Remember, in the background scikit-learn automatically trained 10 binary classifiers for us (the One vs All strategy). Let’s double check this

some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores
# array([[-311402.62954431, -363517.28355739, -446449.5306454 ,
#         -183226.61023518, -414337.15339485,  161855.74572176,
#         -452576.39616343, -471957.14962573, -518542.33997148,
#         -536774.63961222]])

We get an array of 10 numbers, one score per classifier. Look at the score at index 5: it’s the only positive one!

sgd_clf.classes_
# array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
# the 10 classes created by scikit-learn and SGD

We can improve the accuracy of our SGD classifier by scaling the dataset like this…

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
# array([0.91011798, 0.90874544, 0.906636  ])
# so for our 3 folds we get an accuracy of roughly 91%, 90.9%, 90.7%
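
For reference, StandardScaler standardizes each feature (pixel) column to zero mean and unit variance, which helps a gradient-based learner like SGD. A quick sanity check of that claim:

# x_scaled = (x - column_mean) / column_std, fitted on the training set
print(np.abs(X_train_scaled.mean(axis=0)).max())  # ~0: every column is centered
print(X_train_scaled.std(axis=0).max())           # ~1 (all-zero border pixels stay 0)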

Similarly we can make our random forest classifier multi-class

forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
forest_clf.predict_proba([some_digit])
# array([[0.1, 0. , 0. , 0.1, 0. , 0.8, 0. , 0. , 0. , 0. ]])

Error Analysis with Confusion Matrix

  • The confusion matrix is a great tool to pin-point where our classifier is going wrong
  • It is a grid that shows exactly which digits our classifier gets wrong

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
Confusion Matrix
  • Rows are actual classes
  • Columns are predicted classes
  • We are not seeing the errors yet
  • In this raw plot brighter simply means more images, so the bright diagonal just says most digits are classified correctly

To see only the errors, we normalize each row and black out the diagonal:

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums  # normalize so frequent classes don't dominate
np.fill_diagonal(norm_conf_mx, 0)  # zero the diagonal to keep only the errors

plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
Confusion Matrix
  1. The diagonal is pitch black, as we wanted
  2. Brighter cells represent larger error rates
  3. See columns 8 and 9? Lots of bright patches: many digits get mis-classified as 8 or 9
  4. See rows 8 and 9? Again bright patches: 8s and 9s are often mis-classified as other digits

There are many ways to reduce these errors, such as augmenting the training set with shifted or rotated variants of the images, or reducing noise, which we can only hint at here.
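
As a small taste of the shifting idea (the shift_digit helper below is ours, built on scipy.ndimage.shift), one could augment the training set with shifted copies of each image:

import numpy as np
from scipy.ndimage import shift

def shift_digit(image_784, dx, dy):
    """Shift a flattened 28x28 MNIST image by (dx, dy) pixels, padding with black."""
    shifted = shift(image_784.reshape(28, 28), [dy, dx], cval=0)
    return shifted.reshape(784)

# four shifted copies (one pixel in each direction) of the first 1000 images
X_extra = np.array([shift_digit(img, dx, dy)
                    for img in X_train[:1000]
                    for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]])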

Summary

  1. We were able to load a dataset, shuffle it and split it into test and training parts
  2. We built binary classifiers using SGD and RandomForest and checked their accuracy
  3. We understood the concepts of precision, recall, ROC curves, and area under the curve
  4. F1 score, threshold values further helped us to measure the performance of our classifiers
  5. Confusion matrix helped us understand where the classifier goes wrong exactly
  6. OvO and OvA are strategies for building multi-class classifiers out of binary ones, and we saw how they work
  7. We turned our models from binary to multi-class classifiers

Give yourself a pat on the back as you just did a full fledged machine learning project!

Note of thanks: there is a great book by Aurélien Géron called Hands-On Machine Learning with Scikit-Learn and TensorFlow. I highly recommend it if you want to learn ML.


Gurupratap S Matharu

Data scientist & Machine learning engineer, Student of astronomy at FCAGLP