Confusion Matrix Simplified!

Mukesh Kumar
Published in Accredian
Jun 3, 2022 · 14 min read

Preface

Photo by bruce mars on Unsplash

Have you ever been confused by the confusion matrix while evaluating the performance of classification algorithms? Do not worry. I will walk you through the entire mathematics, with and without code, alongside one application to help you understand it better. I will also show its extension to multiple classes, for example, three. Once you learn the logic and the maths, you will be able to answer questions for N classes.

About the Confusion Matrix & its Elements

A confusion matrix is a tabular summary of how well a supervised classification algorithm performs. It compares the counts of the actual labels against the counts of the labels predicted by the algorithm. A general binary-classification confusion matrix is built from the elements listed below (a small code sketch follows the list):

  • True Negatives (TN): The negative data points the classifier labeled correctly.
  • True Positives (TP): The positive data points the classifier labeled correctly.
  • False Positives (FP): The negative data points that got misclassified as positive. They are also known as Type I errors.
  • False Negatives (FN): The positive data points that got misclassified as negative. They are also known as Type II errors.
  • Total Actual Negatives (N): The total number of actual negative labels in the dataset.
  • Total Actual Positives (P): The total number of actual positive labels in the dataset.
  • Total Predicted Negatives (N’): The total number of predicted negative labels.
  • Total Predicted Positives (P’): The total number of predicted positive labels.
  • Total Labels ((P + N) or (P’ + N’)): The total number of labeled data points in the dataset.
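To make these definitions concrete, here is a minimal sketch (not part of the original article) that pulls the four counts out of scikit-learn's confusion_matrix for a tiny, made-up set of labels:

# A tiny illustrative example (made-up labels, not the article's data)
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])   # actual labels
y_pred = np.array([0, 1, 1, 1, 0, 0, 0, 1])   # predicted labels

# For binary labels, confusion_matrix lays the counts out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)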

Evaluation Metrics for Classification Problems

NOTE: You can skip this section if it feels too maths heavy. To grasp the concepts, look at these formulas side by side with the confusion matrix given in the section above. In the next section, I will walk you through the maths for the binary-class problem.

The confusion matrix by itself only presents counts. However, we can use it to derive some essential metrics for evaluating the performance of algorithms with respect to each class. Let’s go through these metrics one by one:

1. The accuracy or Overall Recognition Rate

This is the percentage of test set data points correctly predicted by the classifier: Accuracy = (TP + TN) / (P + N).

2. Error Rate or Misclassification Rate

This is the percentage of test set data points incorrectly classified by the classifier: Error Rate = (FP + FN) / (P + N) = 1 - Accuracy. If the model’s error rate is estimated on the training set instead, this quantity is known as the resubstitution error.

3. Precision

Precision is a measure of exactness. In other words, it is the accuracy of the positive (or negative) predictions.

  • Precision for Positive Class → TP / (TP + FP): What percentage of records labeled as positive are actually positive?
  • Precision for Negative Class → TN / (TN + FN): What percentage of records labeled as negative are actually negative?

4. Recall

Recall is a measure of completeness. In other words, it is the proportion of positive (or negative) data points that the classifier correctly detects.

  • Recall for Positive Class (also called Sensitivity or True Positive Rate) → TP / (TP + FN): What percentage of positive data points got labeled as such?
  • Recall for Negative Class (also called Specificity or True Negative Rate) → TN / (TN + FP): What percentage of negative data points got labeled as such?

5. F1-Score

It is the harmonic mean of precision and recall: F1 = (2 × Precision × Recall) / (Precision + Recall). The regular arithmetic mean treats all values equally, whereas the harmonic mean gives much more weight to low values. As a result, the F1-score is high only when both precision and recall are high.
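As a quick, made-up illustration of why the harmonic mean is the stricter choice, consider a classifier with high precision but very poor recall:

# Toy precision and recall values (illustrative only, not from the article)
precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2              # 0.5, looks acceptable
f1 = 2 * precision * recall / (precision + recall)      # 0.18, exposes the weak recall

print('Arithmetic mean:', arithmetic_mean)
print('F1 (harmonic mean):', round(f1, 2))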

6. Fβ Score

It is a weighted measure of precision and recall: Fβ = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall). It gives β times as much weight to recall as to precision. The most commonly used values of β are 0.5 and 2: the higher β is, the more the score leans towards recall, and vice versa.
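A small sketch with the same toy precision and recall as above (illustrative values only, not from the article) shows how β shifts the score:

# F-beta computed directly from the formula above
def f_beta(precision, recall, beta):
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# precision = 0.9, recall = 0.1 from the previous toy example
print('F0.5:', round(f_beta(0.9, 0.1, 0.5), 3))  # leans towards precision, so it comes out higher
print('F2.0:', round(f_beta(0.9, 0.1, 2.0), 3))  # leans towards recall, so the weak recall drags it down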

7. Macro Average

It averages the precision (or recall) of the positive and negative classes without considering class weights.

  • Macro Average for Precision → (Precision(+) + Precision(-)) / 2
  • Macro Average for Recall → (Recall(+) + Recall(-)) / 2

8. Weighted Average

It averages the precision (or recall) of the positive and negative classes weighted by their support counts (see the scikit-learn sketch after this list).

  • Weighted Average for Precision → (P × Precision(+) + N × Precision(-)) / (P + N)
  • Weighted Average for Recall → (P × Recall(+) + N × Recall(-)) / (P + N)
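scikit-learn exposes both averaging schemes through precision_recall_fscore_support (the same function imported later in this article). A minimal sketch on made-up labels:

# Macro vs. weighted averaging on a tiny made-up label set (illustrative only)
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

for average in ('macro', 'weighted'):
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average = average)
    print(f'{average:>8}: precision = {p:.2f}, recall = {r:.2f}, f1 = {f:.2f}')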

About the Dataset and the Application

Animation by Shubham Kumar

I will use the MNIST dataset of handwritten-digit images, which contains 70,000 records with 784 pixel features and one target column. Although the data covers the digits 0 to 9, we will transform it into a binary classification problem: does an image contain the digit five (5) or not?

Simple Mathematics vs Classification Report

Next, I will show the manual execution and the library-based working of the evaluation metrics. But before that, let’s import some important libraries that we need for the implementation part.

Importing of Libraries

# For numerical python functions
import numpy as np
# To load the MNIST and Iris data
from sklearn.datasets import fetch_openml
from sklearn.datasets import load_iris
# To plot pretty figures
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# To perform model development and data splitting
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
# Evaluation metrics for classification-based problems
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

Loading, Splitting & Filtering of the Data

We will load the MNIST data and split it into the training and the testing set. We will take 60000 records for training and 10000 for testing the model.

# Download the MNIST data of specific version
mnist = fetch_openml(name = 'mnist_784', version = 1)
# Split the data into input and output
X, y = mnist['data'], mnist['target']
# Typecast output as integer type
y = y.astype(int)
# Display the shape of the downloaded dataset
print('MNIST Shape:', X.shape, y.shape)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# Transform outputs to implement binary-class problem
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
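A quick sanity check (a small addition, not in the original code) makes the class imbalance visible; the test set contains 892 fives, which is the positive-class support we will meet again in the metrics below:

# Count the positive class (digit 5) in each split to see the imbalance
print('Fives in training set:', int(y_train_5.sum()), 'out of', len(y_train_5))
print('Fives in test set    :', int(y_test_5.sum()), 'out of', len(y_test_5))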

Model Development & Confusion Matrix

We will use a Stochastic Gradient Descent (SGD) classifier to develop the model because it is quick and easy to train on a large dataset. Along with it, we will also plot the confusion matrix on the test data.

# Initialize the stochastic gradient descent model
sgd_clf = SGDClassifier(random_state = 42)
# Fit the model
sgd_clf.fit(X_train, y_train_5)
# Predict the results
y_pred = sgd_clf.predict(X_test)
# Initialize a figure of size 10 X 7 inches
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = [10, 7])
# Call confusion matrix method on sgd classifier
plot_confusion_matrix(estimator = sgd_clf, X = X_test, y_true = y_test_5, values_format = '.5g', ax = ax, cmap = 'PuBuGn')
# Add some cosmetics to the figure
ax.set_xticklabels(labels = ['Negative', 'Positive'], size = 14)
ax.set_yticklabels(labels = ['Negative', 'Positive'], size = 14)
ax.set_xlabel(xlabel = 'Predicted', size = 14)
ax.set_ylabel(ylabel = 'Actual', size = 14)
# Display the figure
plt.show()
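If you prefer raw numbers over the plot, you can also call confusion_matrix directly (a small addition to the article's code); in this run the four cells are the TN, FP, FN, and TP counts of 8707, 401, 107, and 785 that drive every manual calculation below:

# Unpack the binary confusion matrix; rows are actual labels, columns are predictions,
# so ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_test_5, y_pred).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)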

Manual Estimation of Evaluation Metrics

  • The accuracy or Overall Recognition Rate:
Input:
# Calculate accuracy (TP + TN)/(TP + TN + FP + FN)
accuracy = (8707 + 785) / (8707 + 785 + 107 + 401)
# Display the accuracy
print('Accuracy:', np.round(accuracy, decimals = 2))
Output:
Accuracy: 0.95
  • Error Rate or Misclassification Rate:
Input:
# Calculate error rate (FP + FN)/(TP + TN + FP + FN)
error_rate = (401 + 107) / (8707 + 785 + 107 + 401)
# Display the error rate
print('Error Rate:', np.round(error_rate, decimals = 2))
Output:
Error Rate: 0.05
  • Precision (for positive and negative class):
Input:
# Calculate precision for positive class TP/(TP + FP)
precision_positive = 785 / (785 + 401)
# Calculate precision for negative class TN/(TN + FN)
precision_negative = 8707 / (8707 + 107)
# Display the precision for positive and negative class
print('Precision (+):', np.round(precision_positive, decimals = 2))
print('Precision (-):', np.round(precision_negative, decimals = 2))
Output:
Precision (+): 0.66
Precision (-): 0.99
  • Recall (for positive and negative class):
Input:
# Calculate recall for positive class TP/(TP + FN)
recall_positive = 785 / (785 + 107)
# Calculate recall for negative class TN/(TN + FP)
recall_negative = 8707 / (8707 + 401)
# Display the recall for positive and negative class
print('Recall (+):', np.round(recall_positive, decimals = 2))
print('Recall (-):', np.round(recall_negative, decimals = 2))
Output:
Recall (+): 0.88
Recall (-): 0.96
  • F1-score (for positive and negative classes):
Input:
# Calculate f1-score for positive class
f1_positive = 2 * (precision_positive * recall_positive) / (precision_positive + recall_positive)
# Calculate f1-score for negative class
f1_negative = 2 * (precision_negative * recall_negative) / (precision_negative + recall_negative)
# Display the f1-score for positive and negative class
print('F1-score (+):', np.round(f1_positive, decimals = 2))
print('F1-score (-):', np.round(f1_negative, decimals = 2))
Output:
F1-score (+): 0.76
F1-score (-): 0.97
  • Macro Average (for precision, recall, and F1-score):
Input:
# Calculate macro average for precision
macro_avg_precision = (precision_positive + precision_negative) / 2
# Calculate macro average for recall
macro_avg_recall = (recall_positive + recall_negative) / 2
# Calculate macro average for f1-score
macro_avg_f1 = (f1_positive + f1_negative) / 2
# Display the macro averages for precision, recall, and f1-score
print('Macro Average Precision:', np.round(macro_avg_precision, decimals = 2))
print('Macro Average Recall:', np.round(macro_avg_recall, decimals = 2))
print('Macro Average F1-score:', np.round(macro_avg_f1, decimals = 2))
Output:
Macro Average Precision: 0.82
Macro Average Recall: 0.92
Macro Average F1-score: 0.86
  • Weighted Average (for precision, recall, and F1-score):
Input:
# Calculate weighted average for precision
weighted_avg_precision = (precision_positive * 892 + precision_negative * 9108) / (9108 + 892)
# Calculate weighted average for recall
weighted_avg_recall = (recall_positive * 892 + recall_negative * 9108) / (9108 + 892)
# Calculate weighted average for f1-score
weighted_avg_f1 = (f1_positive * 892 + f1_negative * 9108) / (9108 + 892)
# Display the weighted averages for precision, recall, and f1-score
print('Weighted Average Precision:', np.round(weighted_avg_precision, decimals = 2))
print('Weighted Average Recall:', np.round(weighted_avg_recall, decimals = 2))
print('Weighted Average F1-score:', np.round(weighted_avg_f1, decimals = 2))
Output:
Weighted Average Precision: 0.96
Weighted Average Recall: 0.95
Weighted Average F1-score: 0.95

Classification Report Results

# Display classification report using actual and predicted values
print(classification_report(y_test_5, y_pred, target_names=['Negative', 'Positive']))
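If you want the averaged numbers without parsing the text report, precision_recall_fscore_support (already imported above) returns them directly. This is a small addition, not in the original code, that cross-checks the weighted averages computed by hand:

# Cross-check the hand-computed weighted averages on the binary problem
p, r, f, _ = precision_recall_fscore_support(y_test_5, y_pred, average = 'weighted')
print('Weighted:', np.round(p, 2), np.round(r, 2), np.round(f, 2))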

Extraction of Evaluation Metrics for 3 Classes

Now that you are aware of the mathematics and execution of the evaluation metrics for the binary classification problem, I will explain how they work for a three-class problem. This will give you the intuition to extract metrics for any number of classes.

About the Dataset and the Application

I will use the very famous Iris dataset. It contains measurements of three different species of the iris flower, which I will use to explain the extraction of evaluation metrics. The data has 150 rows and five (5) columns, of which four are features and one is the target.

Loading and Splitting of the Data

We will load the Iris data and split it into the training (75%) and the testing set (25%).

# Load the Iris data
iris = load_iris()
# Split the data into input and output
X, y = iris.data, iris.target
# Display the shape of the downloaded dataset
print('Iris Shape:', X.shape, y.shape)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)
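Because the split is stratified, each class keeps roughly the same share in the test set; a one-line check (not in the original code) confirms the supports of 12, 13, and 13 that appear in the calculations below:

# Count test samples per class; stratification keeps the three species balanced
print('Test class counts:', np.bincount(y_test))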

Model Development & Confusion Matrix

We will again use Stochastic Gradient Descent for developing a model. Along with it, we will also plot the confusion matrix on the test data.

# Initialize the stochastic gradient descent model
sgd_clf = SGDClassifier(random_state = 42)
# Fit the model
sgd_clf.fit(X_train, y_train)
# Predict the results
y_pred = sgd_clf.predict(X_test)
# Initialize a figure of size 10 X 7 inches
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = [10, 7])
# Call confusion matrix method on sgd classifier
plot_confusion_matrix(estimator = sgd_clf, X = X_test, y_true = y_test, values_format = '.5g', ax = ax, cmap = 'PuBuGn')
# Add some cosmetics to the figure
ax.set_xticklabels(labels = ['Class 0', 'Class 1', 'Class 2'], size = 14)
ax.set_yticklabels(labels = ['Class 0', 'Class 1', 'Class 2'], size = 14)
ax.set_xlabel(xlabel = 'Predicted', size = 14)
ax.set_ylabel(ylabel = 'Actual', size = 14)
# Display the figure
plt.show()
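Again, the raw 3 x 3 matrix can be printed directly (a small addition to the article's code); the manual calculations below read their cell counts of 11, 1, 13, 6, and 7 from this table:

# Rows are actual classes 0-2, columns are predicted classes 0-2
print(confusion_matrix(y_test, y_pred))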

Manual Estimation of Evaluation Metrics

The only difference in assessing the algorithm’s performance lies in how the precision and recall values are estimated.

  • The accuracy or Overall Recognition Rate:
Input:
# Calculate accuracy: sum of diagonal values / total sum
accuracy = (11 + 13 + 7) / (11 + 13 + 7 + 1 + 6)
# Display the accuracy
print('Accuracy:', np.round(accuracy, decimals = 2))
Output:
Accuracy: 0.82
  • Error Rate or Misclassification Rate:
Input:
# Calculate error rate: sum of values except diagonal / total sum
error_rate = (1 + 0 + 0 + 0 + 0 + 6) / (11 + 13 + 7 + 1 + 6)
# Display the error rate
print('Error Rate:', np.round(error_rate, decimals = 2))
Output:
Error Rate: 0.18
  • Precision (for classes 0, 1, and 2): Let’s say you are interested in estimating the precision of class 2. Locate class 2 as both a row and a column; in our scenario, that is the third row and third column, and the cell [2, 2], which contains the value 7, will stay fixed. First, add the class 0 and class 1 columns together: the matrix shrinks to a 3 x 2 shape with values [[12, 0], [13, 0], [6, 7]]. Next, add the class 0 and class 1 rows together, which shrinks the matrix to a 2 x 2 shape. This is exactly what we need: the entire 3 x 3 matrix has been transformed into the 2 x 2 matrix [[25, 0], [6, 7]]. We can treat this as a binary confusion matrix and read off the precision and recall values for class 2 (a reusable sketch of this collapsing step appears near the end of the article).
Input:
# Calculate precision for class 0
class_0_precision = 11 / (11 + 0 + 0)
# Calculate precision for class 1
class_1_precision = 13 / (13 + 1 + 6)
# Calculate precision for class 2
class_2_precision = 7 / (7 + 0 + 0)
# Display the precision for class 0, 1, and 2
print('Precision (0):', np.round(class_0_precision, decimals = 2))
print('Precision (1):', np.round(class_1_precision, decimals = 2))
print('Precision (2):', np.round(class_2_precision, decimals = 2))
Output:
Precision (0): 1.00
Precision (1): 0.65
Precision (2): 1.00
  • Recall (for classes 0, 1, and 2): Similarly, we can estimate the recall for all the classes. Transform the matrix from 3 X 3 to 2 X 2 shape, and the rest of the maths is the same.
Input:
# Calculate recall for class 0
class_0_recall = 11 / (11 + 1 + 0)
# Calculate recall for class 1
class_1_recall = 13 / (13 + 0 + 0)
# Calculate recall for class 2
class_2_recall = 7 / (7 + 6 + 0)
# Display the recall for class 0, 1, and 2
print('Recall (0):', np.round(class_0_recall, decimals = 2))
print('Recall (1):', np.round(class_1_recall, decimals = 2))
print('Recall (2):', np.round(class_2_recall, decimals = 2))
Output:
Recall (0): 0.92
Recall (1): 1.00
Recall (2): 0.54
  • F1-score (for classes 0, 1, and 2):
Input:
# Calculate f1-score for class 0
class_0_f1 = (2 * class_0_precision * class_0_recall) / (class_0_precision + class_0_recall)
# Calculate f1-score for class 1
class_1_f1 = (2 * class_1_precision * class_1_recall) / (class_1_precision + class_1_recall)
# Calculate f1-score for class 2
class_2_f1 = (2 * class_2_precision * class_2_recall) / (class_2_precision + class_2_recall)
# Display the f1-score for class 0, 1, and 2
print('F1-score (0):', np.round(class_0_f1, decimals = 2))
print('F1-score (1):', np.round(class_1_f1, decimals = 2))
print('F1-score (2):', np.round(class_2_f1, decimals = 2))
Output:
F1-score (0): 0.96
F1-score (1): 0.79
F1-score (2): 0.70
  • Macro Average (for precision, recall, and F1-score):
Input:
# Calculate macro average for precision
macro_avg_precision = (class_0_precision + class_1_precision + class_2_precision) / 3
# Calculate macro average for recall
macro_avg_recall = (class_0_recall + class_1_recall + class_2_recall) / 3
# Calculate macro average for f1-score
macro_avg_f1 = (class_0_f1 + class_1_f1 + class_2_f1) / 3
# Display the macro averages for precision, recall, and f1-score
print('Macro Average Precision:', np.round(macro_avg_precision, decimals = 2))
print('Macro Average Recall:', np.round(macro_avg_recall, decimals = 2))
print('Macro Average F1-score:', np.round(macro_avg_f1, decimals = 2))
Output:
Macro Average Precision: 0.88
Macro Average Recall: 0.82
Macro Average F1-score: 0.81
  • Weighted Average (for precision, recall, and F1-score):
Input:
# Calculate weighted average for precision
weighted_avg_precision = (12 * class_0_precision + 13 * class_1_precision + 13 * class_2_precision) / (12 + 13 + 13)
# Calculate weighted average for recall
weighted_avg_recall = (12 * class_0_recall + 13 * class_1_recall + 13 * class_2_recall) / (12 + 13 + 13)
# Calculate weighted average for f1-score
weighted_avg_f1 = (12 * class_0_f1 + 13 * class_1_f1 + 13 * class_2_f1) / (12 + 13 + 13)
# Display the weighted averages for precision, recall, and f1-score
print('Weighted Average Precision:', np.round(weighted_avg_precision, decimals = 2))
print('Weighted Average Recall:', np.round(weighted_avg_recall, decimals = 2))
print('Weighted Average F1-score:', np.round(weighted_avg_f1, decimals = 2))
Output:
Weighted Average Precision: 0.88
Weighted Average Recall: 0.82
Weighted Average F1-score: 0.81

Classification Report Results for 3 Classes

# Display classification report using actual and predicted values
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1', 'Class 2']))

Please note that you can collapse the confusion matrix for N classes in the same way and generate the metrics accordingly: you only need to transform the N x N matrix into a 2 x 2 one for the class of interest. In interviews, many interviewers ask how to extract precision and recall values for more than two classes. If you have understood the logic and the maths, you will be good to go.
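As a sketch of that idea, here is a small helper of my own (not from the original article) that collapses an N x N confusion matrix into the 2 x 2 form [[TN, FP], [FN, TP]] for any chosen class; running it on the Iris matrix above reproduces the [[25, 0], [6, 7]] table we built by hand for class 2:

import numpy as np

def collapse_to_binary(cm, class_index):
    """Collapse an N x N confusion matrix into [[TN, FP], [FN, TP]],
    treating `class_index` as the positive class."""
    cm = np.asarray(cm)
    tp = cm[class_index, class_index]        # correctly predicted positives
    fn = cm[class_index, :].sum() - tp       # positives predicted as some other class
    fp = cm[:, class_index].sum() - tp       # other classes predicted as the positive class
    tn = cm.sum() - tp - fn - fp             # everything else
    return np.array([[tn, fp], [fn, tp]])

# The 3 x 3 Iris matrix from the example above, with class 2 as the positive class
iris_cm = [[11, 1, 0],
           [0, 13, 0],
           [0, 6, 7]]
print(collapse_to_binary(iris_cm, 2))   # -> [[25  0]
                                        #     [ 6  7]]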

And that’s it. I hope you liked this elaborate article on the confusion matrix and learned something valuable.

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while pursuing their journey in Computer Science, Data Science, and AI. If you are one of them and are looking for a way to fill those gaps, then Follow me and Subscribe for more forthcoming articles related to Python, Computer Science, Data Science, Machine Learning, and Artificial Intelligence.

If you found this read helpful, then hit the Clap👏. Your encouragement keeps me going and inspires me to develop more content like this.


Mukesh Kumar
Accredian

Data Scientist with a robust math background, skilled in predictive modeling, data processing, and mining strategies to solve challenging business problems.