Learn to do Image Classification using Stochastic Gradient Descent and Random Forest Classifier

Gurupratap S Matharu
11 min read · Mar 2, 2019


Introduction

  • We train a machine learning classification model on the MNIST images dataset from the mldata.org repository.
  • We are doing supervised learning here, and our aim is image classification and noise reduction.
  • Along the way we’ll pick up the important tools needed to develop a powerful ML model.
  • We’ll play with tools like Stochastic Gradient Descent, Random Forest, the confusion matrix, precision, recall, ROC curves, area under the curve, and cross-validation to reach our goal.
  • We’ll evaluate the performance of each of our classifiers using precision and recall scores, and tune hyper-parameters to further optimize our models.
  • We’ll validate our predictions against our test dataset and conclude our learning.
An end-to-end Machine Learning project

To do an end-to-end Machine Learning project we need to do the following steps

1. Understand the requirements of the business.

2. Acquire the data set.

3. Visualize the data to understand it better and develop our intuition.

4. Pre-process the data to make it ready to feed to our ML model.

5. Try various models and train them. Select one that we find best.

6. Fine-tune our model by tuning hyper-parameters

7. Present our solution to the team.

8. Launch, monitor, and maintain our system.

1. Understand the requirements of the business

We are enthusiastic data scientists, and before starting we need to ask some fundamental questions:

Why does our organisation need this classifier or machine learning model?

  • possibly we have a software product, and adding image recognition capabilities could be a great advantage
  • the organisation will use this data to feed another machine learning model
  • the current process is good but manual and time consuming
  • our organisation wants an edge over the competition
  • we want to reduce noise from existing corrupted images, and this data is valuable

Acquire the Dataset

import numpy as np
from six.moves import urllib
from sklearn.datasets import fetch_mldata

try:
    mnist = fetch_mldata('MNIST original')
except urllib.error.HTTPError as ex:
    print("Could not download MNIST data from mldata.org, trying alternative...")
    # Alternative method to load MNIST, if mldata.org is down
    from scipy.io import loadmat
    mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
    mnist_path = "./mnist-original.mat"
    response = urllib.request.urlopen(mnist_alternative_url)
    with open(mnist_path, "wb") as f:
        content = response.read()
        f.write(content)
    mnist_raw = loadmat(mnist_path)
    mnist = {
        "data": mnist_raw["data"].T,
        "target": mnist_raw["label"][0],
        "COL_NAMES": ["label", "data"],
        "DESCR": "mldata.org dataset: mnist-original",
    }
print("Success!")

This might take some time. If it doesn’t work then download the dataset here https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat

Save it in your current directory and then run the following code.

from scipy.io import loadmat

mnist_path = './mnist-original.mat'
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}
print("Success!")

Let’s check our dataset

mnist
{'data': array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
'target': array([0., 0., 0., ..., 9., 9., 9.]),
'COL_NAMES': ['label', 'data'],
'DESCR': 'mldata.org dataset: mnist-original'}

The MNIST dataset is already divided into data and target labels. Let’s extract it now

X, y = mnist["data"], mnist["target"]
X.shape
# (70000, 784)
# so our data has 70000 instances (rows) and 784 features (columns)
y.shape
# (70000,) our target labels

Let’s import Matplotlib and analyze an image

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)  # 784 pixels = a 28 x 28 image
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")
plt.show()

# Let's cross-check it
y[36000]
# 5.0

We divide the dataset into 60,000 instances for training and the remaining 10,000 for testing, like this…

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Next comes shuffling: randomizing the order of the training set so that every cross-validation fold gets a similar mix of digits. This is recommended. Let’s do it

shuffle_index = np.random.permutation(60000)  # a random permutation of the indices 0..59999
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

Simple Binary Classification

For simplicity we’ll build a classifier that only detects whether an image is a 5 or not. Since the answer is True or False, this is binary classification.

y_train_5 = (y_train == 5)  # True for 5s, False for all other digits
# Remember y_train holds all our target labels;
# y_train_5 reduces them to booleans ("is it a 5?")
# could be tricky but a nice shortcut
# we repeat the same thing for our test dataset
y_test_5 = (y_test == 5)

Stochastic Gradient Descent Classifier

This is a good classifier to start with, as it handles large datasets efficiently and trains on instances independently, one at a time

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)  # instantiate
sgd_clf.fit(X_train, y_train_5)           # train the classifier
sgd_clf.predict([some_digit])             # make it predict our digit
# array([ True])

For me the prediction is correct and the classifier seems to work. What is your result? Shall we test a few more digits?

for index in range(1000, 70000, 1000):
    digit = X[index]
    predicted = sgd_clf.predict([digit])
    if predicted:  # only print when the classifier claims "this is a 5"
        print("{} == {}".format(predicted, y[index]), end=",")

# prints
[ True] == 1.0
[ True] == 1.0
[ True] == 5.0
[ True] == 5.0
[ True] == 5.0
[ True] == 5.0
[ True] == 5.0
[ True] == 5.0
[ True] == 8.0
[ True] == 5.0
# Left: our classifier claims the digit is a 5
# Right: actual value of the digit
# So there are failures. Let's investigate further

Performance Measures

Let’s evaluate our SGD classifier by using K-fold Cross Validation. We’ll use 3 folds and we’ll get a score for each fold.

from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# array([0.96485, 0.95   , 0.9494 ])
# accuracy for each fold: 96.49%, 95%, 94.94%

This is not the best way to evaluate a classifier, though: accuracy is a misleading metric on a skewed dataset like ours, where only about 10% of the images are 5s.
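
Here is a minimal sketch to prove the point (the Never5Classifier name is ours, purely for illustration): a classifier that never predicts a 5 still scores about 90% accuracy.

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score

class Never5Classifier(BaseEstimator):
    """A do-nothing classifier that always answers 'not a 5'."""
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros((len(X),), dtype=bool)

cross_val_score(Never5Classifier(), X_train, y_train_5, cv=3, scoring="accuracy")
# roughly array([0.909, 0.909, 0.909]), since only about 10% of the images are 5s

So accuracy alone can’t separate a real 5-detector from a dumb one. Luckily we have a more powerful tool with us called…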

Confusion Matrix

  • It is a grid of all labels against all labels for our classifier
  • Helps us identify which labels our classifier is predicting wrong
  • To build it we need predictions for the whole training set, so let's generate them
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
confusion_matrix(y_train_5, y_train_pred)
# Remember our sgd_clf is a binary (5 / not-5) classifier,
# so we get a 2x2 array:
array([[53361,  1218],
       [ 1497,  3924]])
# an ideal confusion matrix has non-zero elements only on the
# diagonal, like this...
confusion_matrix(y_train_5, y_train_5)
# array([[54579,     0],
#        [    0,  5421]])
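
To read the matrix: each row is an actual class and each column a predicted class, so for our binary case the four cells are

# [[TN, FP],   row 0, actual non-5s: 53361 correctly rejected, 1218 false alarms
#  [FN, TP]]   row 1, actual 5s:     1497 missed,              3924 correctly detected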

Precision and Recall

These are fundamental tools to evaluate and fine tune a classifier. Consider this example

  • If our classifier claims that “these 10 images have an apple in it” but in reality only 8 images contain apples then the precision is 0.8 or 80%
  • If we give 10 images with apples to our classifier but it recognizes only 7 and rejects 3 then its recall is 0.7 or 70%
  • If we aim for higher precision we compromise on recall, and vice versa
  • Ideally we want both high precision and high recall, but in practice we pick a threshold value that gives us the trade-off we need
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)
# 0.7631271878646441
recall_score(y_train_5, y_train_pred)
# 0.7238516878804648
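
These scores follow directly from the confusion matrix; a quick hand-check:

# precision = TP / (TP + FP), recall = TP / (TP + FN)
tp, fp, fn = 3924, 1218, 1497  # counts from the confusion matrix above
print(tp / (tp + fp))  # 0.7631..., matches precision_score
print(tp / (tp + fn))  # 0.7238..., matches recall_score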

F1 Score

It is the harmonic mean of precision and recall, and a convenient single number for comparing classifiers: it is high only when both precision and recall are high. Let’s calculate it

from sklearn.metrics import f1_score

print("My dear the F1 score is = ", f1_score(y_train_5, y_train_pred))
# My dear the F1 score is =  0.74297074694689
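
Since F1 is the harmonic mean, we can verify it from the precision and recall we computed earlier:

# F1 = 2 * precision * recall / (precision + recall)
p, r = 0.7631271878646441, 0.7238516878804648
print(2 * p * r / (p + r))  # 0.74297..., matches f1_score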

So how do we know which threshold to use for our classifier?

  1. First we get decision scores from our classifier by asking cross_val_predict for the decision_function instead of predictions
  2. Then we plot precision and recall against the threshold
  3. From that plot we can pick a threshold score
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
# array([-1128277.51063884,  -166965.96182519, -1236785.06540279, ...,
#         -334287.64724002,  -289424.98623176,  -982906.79459895])

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

It is convenient to write a small Matplotlib function to plot the curves

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

Let’s plot it

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
Precision and Recall Vs Threshold

See that intersection point? That is our key to choosing a threshold value. Let’s play with it

y_scores = sgd_clf.decision_function([some_digit])
y_scores
# array([151793.48081401])

# say we set the threshold to zero
threshold = 0
# let's ask for a prediction
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
# array([ True])

# now raise the threshold to 200000, above our score of ~151793
threshold = 200000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
# array([False])

We should also plot precision directly against recall, with the threshold left implicit, for a better understanding. Let’s do it

plt.plot(precisions, recalls, "b--", label="Precision vs Recall")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc="upper left")
plt.show()
Precision Vs Recall

If someone says “let’s reach 99% precision,” you should ask, “at what recall?”

See how the recall falls beyond 80% precision? Let’s aim for 90% precision by setting the threshold to 70,000 and see what precision and recall we get.

y_train_pred_90 = (y_scores > 70000)

precision_score(y_train_5, y_train_pred_90)
# 0.8659205116491548
recall_score(y_train_5, y_train_pred_90)
# 0.6993174691016417

So at a threshold of 70,000 we get 86.6% precision with a recall of 70%; pushing the threshold a bit higher would get us to 90% precision at a still lower recall. This way we can tune the model to our needs. It’s like a knob you turn.

ROC Curve

  • ROC stands for Receiver Operating Characteristic
  • Sensitivity = Recall = True Positive Rate
  • Specificity = True Negative Rate
  • False Positive Rate = 1 - Specificity
  • The larger the area under the curve, the better our classifier

The ROC curve plots the True Positive Rate (recall) against the False Positive Rate (1 - specificity), one point per threshold

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
# fpr = false positive rate = (1 - true negative rate)
# tpr = true positive rate

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal = a purely random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

plot_roc_curve(fpr, tpr, "ROC Curve")
plt.show()
ROC curve

We can calculate the area under this curve (AUC) like this

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)
# 0.9624496555967156

Now it’s time to train another classifier and compare the results

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
y_probas_forest
# array([[1. , 0. ],
#        [0.9, 0.1],
#        [1. , 0. ],
#        ...,
#        [1. , 0. ],
#        [1. , 0. ],
#        [1. , 0. ]])

The Random Forest classifier gives us an array of probabilities. Rows are instances and columns are classes (not-5, 5)

y_scores_forest = y_probas_forest[:, 1]  # the column at index 1 holds the probability
                                         # of the positive class (5), so we use it as the score

Now we plot the ROC curves for this classifier. All steps are similar

fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, 
y_scores_forest)
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()
ROC Curve
  • See how the Random Forest curve hugs the top-left corner, covering more area?
  • So it performs better than the Stochastic Gradient Descent classifier
  • Let’s find the area under the curve
roc_auc_score(y_train_5, y_scores_forest)
# 0.9931243366003829
# for SGD this was 0.96

Multi Class Classification

  • Up till now we did 5 or not-5, i.e. binary classification
  • When there are more than two classes it’s a multi-class (multinomial) classification
  • In our case we have the digits 0, 1, 2, …, 9, i.e. 10 classes in all
  • Random Forest and Naive Bayes classifiers can handle multiple classes directly
  • Support Vector Machine (SVM) classifiers and linear classifiers (like our SGD) are strictly binary

One Vs All Strategy

  • For our 10 digits we can train 10 binary classifiers
  • One classifier per digit: a 1-detector, a 2-detector, a 3-detector, and so on
  • Then we take the best decision score among the classifiers, and that decides the digit
  • So we can do multi-class classification with multiple binary classifiers, as the sketch below shows
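
If we want this strategy explicitly, scikit-learn ships a OneVsRestClassifier wrapper; a minimal sketch:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
ovr_clf.fit(X_train, y_train)
len(ovr_clf.estimators_)  # 10, one binary detector per digit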

One Vs One Strategy

  • Here we train a classifier for every pair of digits
  • A 1 vs 2 classifier, a 1 vs 3 classifier, a 2 vs 3 classifier, and so on
  • For N digits we end up with N * (N - 1) / 2 classifiers; in our case that’s 10 * 9 / 2 = 45!
  • So basically we’ll have to train 45 binary classifiers
  • The advantage is that each classifier only needs the training data for its own pair of digits (see the sketch below)
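
The matching sketch for this strategy uses OneVsOneClassifier, which trains all the pairwise classifiers for us:

from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import SGDClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
len(ovo_clf.estimators_)  # 45 = 10 * 9 / 2 pairwise classifiers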

So let’s make our Stochastic Gradient classifier multi-class now

sgd_clf.fit(X_train, y_train)  # y_train, not y_train_5!
sgd_clf.predict([some_digit])
# array([5.])

Remember, in the background scikit-learn automatically trained 10 binary classifiers for us (the One vs All strategy). Let’s double check this

some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores
# array([[-311402.62954431, -363517.28355739, -446449.5306454 ,
#         -183226.61023518, -414337.15339485,  161855.74572176,
#         -452576.39616343, -471957.14962573, -518542.33997148,
#         -536774.63961222]])

We get an array of 10 numbers, one score per classifier. Look at the score at index 5: it’s the only positive one!

sgd_clf.classes_
# array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
# the 10 classes created by scikit-learn and SGD

We can improve the accuracy of our SGD classifier by scaling the dataset like this…

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
# array([0.91011798, 0.90874544, 0.906636  ])
# so for our 3 folds we get an accuracy of roughly 91%, 90.9%, 90.7%
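
For reference, StandardScaler standardizes each feature (pixel) column to zero mean and unit variance, which helps a gradient-based learner like SGD. A quick sanity check of that claim:

# x_scaled = (x - column_mean) / column_std, fitted on the training set
print(np.abs(X_train_scaled.mean(axis=0)).max())  # ~0: every column is centered
print(X_train_scaled.std(axis=0).max())           # ~1 (all-zero border pixels stay 0)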

Similarly we can make our random forest classifier multi-class

forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
forest_clf.predict_proba([some_digit])
# array([[0.1, 0. , 0. , 0.1, 0. , 0.8, 0. , 0. , 0. , 0. ]])

Error Analysis with Confusion Matrix

  • The confusion matrix is a great tool to pin-point where our classifier is going wrong
  • It is a grid that shows exactly which digits our classifier gets wrong

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
Confusion Matrix
  • Rows are actual classes
  • Columns are predicted classes
  • We are not seeing the errors yet
  • In this raw plot brighter simply means more images, so the bright diagonal just says most digits are classified correctly

To see only the errors, we normalize each row and black out the diagonal:

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums  # normalize so frequent classes don't dominate
np.fill_diagonal(norm_conf_mx, 0)  # zero the diagonal to keep only the errors

plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
Confusion Matrix
  1. The diagonal is pitch black, as we wanted
  2. Brighter cells represent larger error rates
  3. See columns 8 and 9? Lots of bright patches: many digits get mis-classified as 8 or 9
  4. See rows 8 and 9? Again bright patches: 8s and 9s are often mis-classified as other digits

There are many ways to reduce these errors, such as augmenting the training set with shifted or rotated variants of the images, or reducing noise, which we can only hint at here.
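
As a small taste of the shifting idea (the shift_digit helper below is ours, built on scipy.ndimage.shift), one could augment the training set with shifted copies of each image:

import numpy as np
from scipy.ndimage import shift

def shift_digit(image_784, dx, dy):
    """Shift a flattened 28x28 MNIST image by (dx, dy) pixels, padding with black."""
    shifted = shift(image_784.reshape(28, 28), [dy, dx], cval=0)
    return shifted.reshape(784)

# four shifted copies (one pixel in each direction) of the first 1000 images
X_extra = np.array([shift_digit(img, dx, dy)
                    for img in X_train[:1000]
                    for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]])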

Summary

  1. We were able to load a dataset, shuffle it and split it into test and training parts
  2. We built binary classifiers using SGD and RandomForest and checked their accuracy
  3. We understood the concepts of precision, recall, ROC curves, and area under the curve
  4. F1 score, threshold values further helped us to measure the performance of our classifiers
  5. Confusion matrix helped us understand where the classifier goes wrong exactly
  6. OvO and OvA are strategies for building multi-class classifiers out of binary ones, and we saw how they work
  7. We turned our models from binary to multi-class classifiers

Give yourself a pat on the back as you just did a full fledged machine learning project!

Note of thanks: there is a great book by Aurélien Géron called Hands-On Machine Learning with Scikit-Learn and TensorFlow. I highly recommend it if you want to learn ML.


Gurupratap S Matharu

Data scientist & Machine learning engineer, Student of astronomy at FCAGLP