Confusion Matrix, Precision, Recall and F1-Score

Temiloluwa Awoyele · Analytics Vidhya · Dec 23, 2020 · 7 min read

Why do we have and use different machine learning metrics?

Some of us, myself included, have at some point asked this question: “Why are there various ML metrics? Why don’t we just pick one and stick to it?”

Think of it this way: just as we can’t judge a fish by its ability to fly, a bird by its ability to swim, a snail by its ability to gallop, or a horse by its ability to climb a tree, each of these animals has strong points where it performs exceptionally well and weak points where it doesn’t. The same applies to machine learning metrics, and it is left to us as Data Scientists or Machine Learning Engineers to decide which metric best fits the domain we are working on.

For example, say we’re training a fraud detection model and, after several meetings with the stakeholders, we conclude that each customer is precious to the company and that the company wants to keep all of them happy without our model causing problems. As Data Scientists we would then focus on minimizing false positives, because if our model flags a legitimate transaction as fraudulent and blocks the user, that user gets annoyed and moves to another service provider, bank, or whatever field our company operates in… but wait, what is a false positive? Don’t worry, you’ll understand it in this post.

The Confusion Matrix, Precision score, Recall score and F1 score are all classification metrics. I remember the very first time I heard about the Confusion Matrix; the word “confusion” in the name primed my mind for the thought “it’s going to take a while before I figure this out”. If you’re like me, just blank out the “confusion” in the name, because we’ll be demystifying it… haha 😃

Confusion Matrix

So to get started, we’ll be explaining what I call… the 4 Pillars of the Confusion Matrix.

Let’s go back to binary classification problems, where we predict fraud or not fraud, spam or ham, churn or stay, 0’s or 1’s, and lots of other possibilities. We’ll use these as the base for the four pillars, which are:

  1. True Positive
  2. False Positive
  3. True Negative
  4. False Negative
[Image: the four pillars laid out as a 2×2 confusion matrix]

Know that positives are 1’s and negatives are 0’s, so let’s dive into the 4 building blocks of the confusion matrix.

Pro Tip:

A good trick I've employed to understand immediately what these four pillars stand for, and not get confused by how they sound, is this: the first part, i.e. the part with the True or False, tells us the "validity of the second part", while the second part, i.e. the part with the Positive or Negative, tells us "what the model predicts".

So if we hear about a False Positive, we know that "the model predicts positive", i.e. 1, but the validity of that is False, meaning what the model predicts is wrong. Likewise, a True Negative means that our model predicts a negative, i.e. 0, and the validity of that is True, meaning what our model predicts is correct.

True Positive: A prediction is a true positive when our model predicts positive (i.e. 1) and that prediction is correct.

False Positive: A prediction is a false positive when our model predicts positive (i.e. 1) and that prediction is wrong.

True Negative: A prediction is a true negative when our model predicts negative (i.e. 0) and that prediction is correct.

False Negative: A prediction is a false negative when our model predicts negative (i.e. 0) and that prediction is wrong.
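
To make these four pillars concrete, here is a minimal sketch; the labels and predictions below are made up purely for illustration, and the snippet simply tags each prediction with the pillar it belongs to:

actual = [1, 0, 1, 0, 1]      # made-up ground-truth labels
predicted = [1, 1, 0, 0, 1]   # made-up model predictions

for y, y_hat in zip(actual, predicted):
    if y == 1 and y_hat == 1:
        pillar = "True Positive"
    elif y == 0 and y_hat == 1:
        pillar = "False Positive"
    elif y == 0 and y_hat == 0:
        pillar = "True Negative"
    else:                      # y == 1 and y_hat == 0
        pillar = "False Negative"
    print(f"actual={y}, predicted={y_hat} -> {pillar}")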

Next, we’ll talk about the Rates, which include:

  • True Positive Rate
  • False Positive Rate
  • True Negative Rate
  • False Negative Rate

But before then, let’s really grasp what Positives & Negatives mean:

Positives = True Positives + False Negatives

Negatives = True Negatives + False Positives

You’ll find this very intuitive: a prediction that is a False Negative means that data point is actually a positive data point, and a prediction that is a False Positive means that data point is actually a negative data point.

True Positive Rate:

TPR = TP / (TP + FN)

False Positive Rate:

FPR = FP / (FP + TN)

True Negative Rate:

TNR = TN / (TN + FP)

False Negative Rate:

FNR = FN / (FN + TP)

So a good classifier should have high TPR , high TNR, low FPR and low FNR.
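
As a quick sanity check, here is a minimal sketch (with made-up counts, not from any real model) showing how each rate is computed from the four pillars:

# Made-up counts, purely to show how the rates are computed
TP, FP, TN, FN = 40, 10, 45, 5

TPR = TP / (TP + FN)   # True Positive Rate
FPR = FP / (FP + TN)   # False Positive Rate
TNR = TN / (TN + FP)   # True Negative Rate
FNR = FN / (FN + TP)   # False Negative Rate

print(f"TPR={TPR:.2f}, FPR={FPR:.2f}, TNR={TNR:.2f}, FNR={FNR:.2f}")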

Know that the confusion matrix is not restricted to binary classification problems; it extends to multi-class problems as well. In a multi-class problem, the numbers on the “principal diagonal” are what we want to be high, while the numbers on the “off-diagonal” are what we want to keep as close to zero as possible.

In the image below, picture a line drawn from the top-left corner to the bottom-right corner: that line is the principal diagonal, and every cell not on the line is the off-diagonal.

Below is what a Confusion Matrix looks like for multi-class problems:

[Image: a multi-class confusion matrix, with correct predictions on the principal diagonal]
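
To see this in code, here is a small sketch with a made-up 3-class example: the principal diagonal counts the correct predictions per class, and everything off the diagonal is a mistake.

import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up 3-class labels and predictions, purely for illustration
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 0, 2, 2, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
print("Correct (principal diagonal):", np.trace(cm))
print("Mistakes (off-diagonal):", cm.sum() - np.trace(cm))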

Pros:

  1. Handles class imbalance well, since it shows the raw counts for every class rather than a single summary number.

Cons:

  1. Does not take prediction probabilities into account, only the final class labels.

Confusion Matrix from Scratch

###############################
# Code Input                  #
###############################
import numpy as np
from sklearn.metrics import confusion_matrix

np.random.seed(0)

targets = np.random.randint(low=0, high=2, size=100)
y_hats = np.random.randint(low=0, high=2, size=100)

print("Sklearn Confusion Matrix:", confusion_matrix(targets, y_hats), sep="\n")

def customConfusionMatrix(targets, preds):
    TP = 0
    FP = 0
    TN = 0
    FN = 0
    for y, y_hat in zip(targets, preds):
        if y == 1 and y_hat == 1:
            TP += 1
        elif y == 0 and y_hat == 0:
            TN += 1
        elif y == 1 and y_hat == 0:
            FN += 1
        elif y == 0 and y_hat == 1:
            FP += 1
    # Same layout as sklearn: rows are actual labels, columns are predictions
    return np.array([[TN, FP],
                     [FN, TP]])

print("Custom Confusion Matrix:", customConfusionMatrix(targets, y_hats), sep="\n")
###############################
# Output                      #
###############################
Sklearn Confusion Matrix:
[[24 20]
 [31 25]]
Custom Confusion Matrix:
[[24 20]
 [31 25]]
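
If you would rather look at a plot than a raw array, newer versions of scikit-learn (1.0 and above) also provide a ConfusionMatrixDisplay helper. A minimal sketch, assuming matplotlib is installed:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

np.random.seed(0)
targets = np.random.randint(low=0, high=2, size=100)
y_hats = np.random.randint(low=0, high=2, size=100)

# Draws the same 2x2 matrix from above as a colour-coded plot
ConfusionMatrixDisplay.from_predictions(targets, y_hats)
plt.show()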

Precision and Recall

Precision and Recall are very good metrics for information retrieval problems. They both care more about the positive class and aren’t concerned with the negative class.

Precision (also called Positive Predictive Value):

Precision intuitively answers: of all the points the model has classified or declared as positive, what percentage are actually positive?

Precision = TP / (TP + FP)

Recall (Sensitivity):

Recall, on the other hand, answers: of all the points that are actually positive, what percentage was the model able to detect or predict? You can see that Recall is the same as the True Positive Rate we talked about in the Confusion Matrix section, since the actual positives are the True Positives plus the False Negatives.

Recall = TP / (TP + FN)

Recall tells us how sensitive our model is to the positive class, which is why it is also referred to as Sensitivity.

Precision and Recall both lie between 0 and 1, and the higher, the better.

The precision and recall metrics can be imported from scikit-learn using:

###############################
# Code Input                  #
###############################
from sklearn.metrics import precision_score, recall_score
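
Before computing them from scratch, here is a small made-up example of why we need both numbers: a model that simply predicts “positive” for everything catches every actual positive (perfect Recall) but raises mostly false alarms (poor Precision).

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up data: 10 actual positives out of 100 points
y_true = np.array([1] * 10 + [0] * 90)
# A lazy "model" that predicts positive every single time
y_pred = np.ones(100, dtype=int)

print("Recall = ", recall_score(y_true, y_pred))        # 1.0: every positive was caught
print("Precision = ", precision_score(y_true, y_pred))  # 0.1: 90 of the 100 alarms are false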

Precision and Recall from Scratch

###############################
# Code Input                  #
###############################
import numpy as np
from sklearn.metrics import precision_score, recall_score

np.random.seed(0)

targets = np.random.randint(low=0, high=2, size=100)
y_hats = np.random.randint(low=0, high=2, size=100)

sklearn_precision = precision_score(targets, y_hats)
print("Sklearn Precision = ", sklearn_precision)

sklearn_recall = recall_score(targets, y_hats)
print("Sklearn Recall = ", sklearn_recall)

def customPrecision(targets, preds):
    TP = 0
    FP = 0
    for y, y_hat in zip(targets, preds):
        if y == 1 and y_hat == 1:
            TP += 1
        elif y == 0 and y_hat == 1:
            FP += 1
    # Precision = TP / (TP + FP)
    return TP / (TP + FP)

print("Custom Precision = ", customPrecision(targets, y_hats))

def customRecall(targets, preds):
    TP = 0
    FN = 0
    for y, y_hat in zip(targets, preds):
        if y == 1 and y_hat == 1:
            TP += 1
        elif y == 1 and y_hat == 0:
            FN += 1
    # Recall = TP / (TP + FN)
    return TP / (TP + FN)

print("Custom Recall = ", customRecall(targets, y_hats))
###############################
# Output                      #
###############################
Sklearn Precision =  0.5555555555555556
Sklearn Recall =  0.44642857142857145
Custom Precision =  0.5555555555555556
Custom Recall =  0.44642857142857145

F1 Score

The F1 score is a metric that combines Precision and Recall into a single number by taking their harmonic mean.

The f1-score metric can be imported from scikit-learn using:

###############################
# Code Input                  #
###############################
from sklearn.metrics import f1_score

The f1-score also lies between 0 and 1, and the higher, the better.

The formula for the F1 Score is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
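
Because F1 is a harmonic mean rather than an ordinary average, it punishes a big gap between Precision and Recall. A quick made-up arithmetic check:

# Made-up scores: great precision, terrible recall
precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = 2 * (precision * recall) / (precision + recall)

print("Arithmetic mean:", arithmetic_mean)  # 0.5, looks deceptively fine
print("F1 score:", f1)                      # ~0.18, reflects the weak recall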

F1 Score from Scratch

###############################
# Code Input                  #
###############################
import numpy as np
from sklearn.metrics import f1_score

np.random.seed(0)

targets = np.random.randint(low=0, high=2, size=100)
y_hats = np.random.randint(low=0, high=2, size=100)

sklearn_f1_score = f1_score(targets, y_hats)

def customF1Score(targets, preds):
    def customPrecision(targets, preds):
        TP = 0
        FP = 0
        for y, y_hat in zip(targets, preds):
            if y == 1 and y_hat == 1:
                TP += 1
            elif y == 0 and y_hat == 1:
                FP += 1
        return TP / (TP + FP)

    def customRecall(targets, preds):
        TP = 0
        FN = 0
        for y, y_hat in zip(targets, preds):
            if y == 1 and y_hat == 1:
                TP += 1
            elif y == 1 and y_hat == 0:
                FN += 1
        return TP / (TP + FN)

    precision = customPrecision(targets, preds)
    recall = customRecall(targets, preds)

    # F1 is the harmonic mean of precision and recall
    return 2 * (precision * recall) / (precision + recall)

print("Sklearn F1_Score = ", sklearn_f1_score)
print("Custom F1_Score = ", customF1Score(targets, y_hats))
###############################
# Output                      #
###############################
Sklearn F1_Score =  0.4950495049504951
Custom F1_Score =  0.4950495049504951

Thanks for reading! I hope I’ve given you some understanding of these classification metrics. A little bit of motivation will be appreciated, and you can do that by giving a clap 👏. I am also open to questions and suggestions. You can share this with friends or post it on your favorite social media platform so someone who needs it might stumble on it.

You can reach me on:

LinkedIn: https://www.linkedin.com/in/temiloluwa-awoyele/

Twitter: https://twitter.com/temmyzeus100

Github: https://github.com/temmyzeus
