Confusion Matrix

Nikita Malviya
7 min read · Apr 19, 2023


A confusion matrix is a table used to describe the performance of a classification model on a set of data for which the true values are known. It summarizes the types and quantities of errors the model makes and provides insight into the model’s strengths and weaknesses.

  • The rows represent the actual values of the target variable
  • The columns represent the predicted values of the target variable

(This is the convention followed by scikit-learn and by the code examples later in this post.)

A typical confusion matrix has four main components:

True Positives (TP):

These are the cases where the model correctly predicted the positive class. The actual value is Positive and the predicted value is also Positive.

For example, if a model is trained to classify emails as spam or non-spam, and it correctly predicts 100 emails as spam out of 150 actual spam emails, then TP would be 100.

True Negatives (TN):

These are the cases where the model correctly predicted the negative class. The actual value is Negative and the predicted value is also Negative.

For example, if the model correctly predicted 800 non-spam emails out of 850 actual non-spam emails, then TN would be 800.

False Positives (FP):

These are the cases where the model predicted the positive class incorrectly. The predicted value is Positive but the actual value is Negative. Also known as the Type 1 Error.

For example, if the model predicted 50 non-spam emails as spam, then FP would be 50.

False Negatives (FN):

These are the cases where the model predicted the negative class incorrectly. The predicted value is Negative but the actual value is Positive. Also known as the Type 2 Error.

For example, if the model missed 50 of the 150 actual spam emails (predicting them as non-spam even though they were spam), then FN would be 50.
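Putting the spam-filter numbers from the examples above together (TP = 100, TN = 800, FP = 50, FN = 50), here is a minimal sketch of the resulting 2×2 matrix, laid out the same way as the scikit-learn code later in this post (rows = actual, columns = predicted):

import numpy as np

# Rows are the actual classes [non-spam, spam];
# columns are the predicted classes [non-spam, spam]
cm = np.array([
    [800,  50],   # actual non-spam: 800 TN, 50 FP
    [ 50, 100],   # actual spam:      50 FN, 100 TP
])
print(cm)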

NOTES:

  • A good model has high TP and TN counts and low FP and FN counts.
  • If you are working with an imbalanced dataset, the confusion matrix (and the metrics derived from it) is a better basis for evaluating an ML model than accuracy alone.

Confusion Matrix Analysis:

Diagonal Elements: The diagonal elements of the confusion matrix represent the counts of correctly predicted samples for each class. For example, the value at position (0, 0) represents the count of samples correctly predicted as class 0. The higher the values on the diagonal, the better the model’s performance for those particular classes.

Off-Diagonal Elements: The off-diagonal elements represent the misclassifications made by the model. Each element at position (i, j) indicates the count of samples from class i misclassified as class j. Lower values for off-diagonal elements indicate fewer misclassifications.
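As a small illustration, here is a sketch of how these elements can be read off with NumPy (cm stands for any square confusion matrix, such as the 2×2 spam-filter matrix above):

import numpy as np

cm = np.array([[800, 50],
               [50, 100]])

correct_per_class = np.diag(cm)                          # diagonal elements
misclassified_per_class = cm.sum(axis=1) - np.diag(cm)   # off-diagonal counts per actual class

print(correct_per_class)        # [800 100]
print(misclassified_per_class)  # [50 50]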

Performance / Evaluation Metrics

The confusion matrix can be used to calculate various performance metrics such as accuracy, precision, recall, and F1-score, which help in evaluating the model’s effectiveness and identifying areas for improvement.

Accuracy:

The measure of how correctly a model predicts the class labels or outcomes of a set of samples. It is calculated as the ratio of the number of correct predictions to the total number of predictions, and is often expressed as a percentage. It is often used for balanced datasets where classes are equally represented.

The ratio of the number of correct predictions to the total number of predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN).
  • Accuracy alone may not always be a reliable metric, especially when dealing with imbalanced datasets where the classes are not equally represented. High accuracy can be achieved even if the model performs poorly on the minority class.
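As a quick worked example with the spam-filter counts from earlier (TP = 100, TN = 800, FP = 50, FN = 50):

Accuracy = (100 + 800) / (100 + 800 + 50 + 50) = 900 / 1000 = 0.90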

Precision:

The measure of how accurately a model predicts the positive class or the class of interest. It is calculated as the ratio of the number of true positive predictions (i.e., samples predicted as positive class correctly) to the total number of positive predictions (i.e., sum of true positive and false positive predictions).

Precision is a useful metric in cases where False Positives are a bigger concern than False Negatives.

For example, in spam detection, where a false positive means a legitimate email ends up in the spam folder.

The ratio of correctly predicted positive samples to the total predicted positive samples: Precision = TP / (TP + FP).
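With the same spam-filter counts (TP = 100, FP = 50):

Precision = 100 / (100 + 50) ≈ 0.67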

Recall (Sensitivity / True Positive Rate):

It measures how many of the actual positive observations are predicted correctly, i.e. how many observations of the positive class are actually predicted as positive.

It is also known as Sensitivity. Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.

Recall is important when minimizing False Negatives is more critical than minimizing False Positives.

For example, while detecting diseases or identifying fraudulent transactions, where false negatives can result in missed opportunities or serious consequences.

The ratio of correctly predicted positive samples to the total actual positive samples: Recall = TP / (TP + FN).
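With the same spam-filter counts (TP = 100, FN = 50):

Recall = 100 / (100 + 50) ≈ 0.67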

Specificity (True Negative Rate):

It represents the ability of a model or classifier to identify the true negative cases: out of all the cases that are actually negative, it measures how many the model correctly identifies as negative.

A higher specificity value indicates that the model is more precise in identifying the true negative cases and has a lower chance of misidentifying negative cases as positive, which is important for ensuring accuracy in certain applications.

For example, in a medical diagnosis scenario where the goal is to identify whether a patient has a particular disease or not, specificity would indicate how well the model is able to correctly identify the patients who do not have the disease, avoiding false positives.

The ratio of true negatives to the sum of true negatives and false positives: Specificity = TN / (TN + FP).
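With the same spam-filter counts (TN = 800, FP = 50):

Specificity = 800 / (800 + 50) ≈ 0.94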

F1-Score

  • The F1-score is the harmonic mean of precision and recall, and provides a balanced measure that combines both metrics.
  • It is a good choice when both precision and recall are equally important, and it helps in finding a trade-off between precision and recall.
  • F1-score is often used when there is an uneven class distribution or class imbalance, and it provides a single value that summarizes the model’s performance.

If either precision or recall is low, the F1-score will also be low.

There will be cases where it is not clear whether Precision or Recall is more important. In those cases, we combine them!

The F1-score should be as high as possible (ideally 1).

Harmonic mean of precision and recall: F1-Score = 2 × (Precision × Recall) / (Precision + Recall).
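With the precision and recall computed above (both ≈ 0.67):

F1-Score = 2 × (0.67 × 0.67) / (0.67 + 0.67) ≈ 0.67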

Gist:

  • Accuracy is a better metric for Balanced Data.
  • Whenever False Positives are the bigger concern, use Precision.
  • Whenever False Negatives are the bigger concern, use Recall.
  • F1-Score is used when both False Negatives and False Positives are important.
  • F1-Score is a better metric for Imbalanced Data.
  • The ideal value for Accuracy, Precision, Recall (Sensitivity), Specificity, F1 Score would be 1.0 (or 100%).

SOURCE CODE:

Confusion matrix for binary classification:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score

# Extract predicted labels and true labels from DataFrame
y_pred = df_output["predicted_label"]
y_true = df_output["true_label"]

# Create a confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Plot confusion matrix using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d", cbar=False)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()


# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)

# Calculate precision, recall, and F1-score
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')

# Print evaluation metrics
print("Accuracy: {:.4f}".format(accuracy))
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
print("F1-score: {:.4f}".format(f1))

# Create a classification report
report = classification_report(y_true, y_pred)
# Print classification report
print("Classification Report:\n", report)

Confusion matrix for multi-class classification:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

# Generate a sample confusion matrix (replace with your own data)
cm = confusion_matrix(y_true, y_pred)

# Define the class labels
class_labels = ['Class 1', 'Class 2', 'Class 3', 'Class 4']

# Create a heatmap of the confusion matrix with values inside it
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, ax=ax)
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_ylabel('True Label', fontsize=12)
ax.set_xticklabels(class_labels, rotation=45, ha='right', fontsize=10)
ax.set_yticklabels(class_labels, rotation=0, ha='right', fontsize=10)
ax.set_title('Confusion Matrix', fontsize=14)
plt.tight_layout()
plt.show()

# Create a classification report
report = classification_report(y_true, y_pred)
print("Classification Report:\n", report)

Confusion matrix in tabular form:

import numpy as np
from sklearn.metrics import confusion_matrix
from tabulate import tabulate

# Generate a sample confusion matrix (replace with your own data)
cm = confusion_matrix(y_true, y_pred)

# Define the class labels
class_labels = ['Class 1', 'Class 2', 'Class 3', 'Class 4']

# Convert the confusion matrix to a list of lists
cm_list = cm.tolist()

# Add the class labels as a header row and as the first column of each data row
cm_table = [[""] + class_labels]
for label, row in zip(class_labels, cm_list):
    cm_table.append([label] + row)

# Print the confusion matrix as a table with lines around it
print(tabulate(cm_table, headers='firstrow', tablefmt='grid'))

Hope this blog helped you get a better understanding of the Confusion Matrix. If you liked it, support it with a clap. Happy Learning… :)
