Should you always rely on accuracy for evaluating your model?

Maru Tech 🐞 · Data And Beyond · Nov 30, 2022 · 5 min read

HELLO community ^^!

I hope everything’s groovy

Well, as I was scrolling through my LinkedIn feed, I suddenly found this meme,

which actually inspired me to write my first ever article, with the aim of sharing my humble knowledge so that people can avoid the mistakes I made since my beginnings in this field. Well, let's begin 🎉

As we know, in most studies related to machine learning techniques, the size of the data plays a major role alongside its quality in obtaining impressive results. A lack of data, or classes present in uneven proportions with different distributions, may negatively, if not dangerously, affect revenues and decisions, especially in the medical field, where it may lead to unforgivable errors against humans.

For this reason, in this post I wanted to shed light on one of the most important mistakes made during the application and selection of machine learning models: the problem of the imbalanced dataset.

What is an imbalanced dataset?

Similarly to what you see in the picture above, the ID (imbalanced dataset) phenomenon is the presence of uneven and varying quantities across dataset classes, i.e. one class label has a very high number of observations while the other has a very low number of observations.

For example, if we take the problem of classifying pneumonia chest X-ray images, we will find that the number of patients suffering from pneumonia is much smaller than the number of healthy people. In this case, it is difficult to collect data for both classes in identical proportions and with sufficient quantity and quality, which leads to a shortage in one of the classes, and thus an imbalanced dataset.
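Before training anything, it is worth quickly checking the class distribution of your labels. Here is a minimal Python sketch (the label counts below are made up to match the example that follows):

```python
from collections import Counter

# Made-up label list: 1 = pneumonia, 0 = healthy (matches the example below)
labels = [1] * 90 + [0] * 1000

counts = Counter(labels)
total = sum(counts.values())
for cls, n in sorted(counts.items()):
    print(f"class {cls}: {n} samples ({n / total:.1%})")
# class 0: 1000 samples (91.7%)
# class 1: 90 samples (8.3%)  -> clearly imbalanced
```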

To give you an idea about its consequences, let's look at a concrete example.

Let's take a dataset containing 90 images labeled as pneumonia (the positive class) and 1000 images labeled as not pneumonia (the negative class), for a total of 1090 instances in the whole dataset.

And when we calculate the accuracy = (tp + tn) / total, we find it equal to about 0.91... so good!!! Isn't it?

.

.

.

Well, not at all. Actually, we can't even tell until we look more deeply inside the box; in other words, we must look at the confusion matrix.

What is a confusion matrix?

A confusion matrix has four entries:

(TP) The cases in which the patients did have pneumonia and our model also predicted that they had it.

(TN) The cases in which the patients did not have pneumonia and our model also predicted that they did not have it.

(FP) The cases in which the patients did not have pneumonia but our model predicted that they had it.

(FN) The cases in which the patients did have pneumonia but our model predicted that they did not have it.

From this matrix we can extract a variety of metrics (F1 score, precision, recall, ...). In this article we will focus particularly on accuracy and recall.
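If you work with scikit-learn, you can read these four entries directly from its confusion_matrix function; here is a small sketch with toy label arrays (the arrays themselves are just an illustration, not real data):

```python
from sklearn.metrics import confusion_matrix

# Toy ground-truth and predicted labels (1 = pneumonia, 0 = not pneumonia)
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

# For binary labels {0, 1}, ravel() returns the entries as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```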

Imagine we have these values given by our confusion matrix:

tp = 5, fn = 85, fp = 10, tn = 990, total = 1090

We've already calculated accuracy ≈ 0.91. Now let's calculate recall with its formula, recall = tp / (tp + fn).

Wait, what?!! We get a miserable 5 / 90 ≈ 0.06!

What a jump! 🤯

So notice that there is a huge difference between these two metrics (accuracy and recall). That is because accuracy is influenced mostly by the correctly classified instances and neglects the number of misclassified ones, whereas recall puts a huge focus on the false negatives, as shown in its formula above: the more the false negatives decrease, the higher the overall recall, and this is what we are looking for, to a large extent, in medical diagnosis systems.
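To make this concrete, here is a small sketch that recomputes these metrics from the confusion-matrix counts above using plain arithmetic (precision and the F1 score are included as well, since they come up just below):

```python
tp, fn, fp, tn = 5, 85, 10, 990
total = tp + fn + fp + tn  # 1090

accuracy = (tp + tn) / total                         # (5 + 990) / 1090 ≈ 0.91
recall = tp / (tp + fn)                              # 5 / 90 ≈ 0.056
precision = tp / (tp + fp)                           # 5 / 15 ≈ 0.33
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.10

print(f"accuracy  = {accuracy:.3f}")
print(f"recall    = {recall:.3f}")
print(f"precision = {precision:.3f}")
print(f"f1 score  = {f1:.3f}")
```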

So, in a nutshell, if you are more interested in the positive class, which is predominantly the rarest one, try to avoid using accuracy and go for either recall or the F1 score, which is the harmonic mean of precision and recall.

If, on the other hand, you are concerned with both classes and you don't know what costs lie behind low recall and low precision, I suggest using the MCC metric (Matthews correlation coefficient; perhaps I will dedicate another post to explaining it). Here is its formula:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

Its key advantage is that it extends its consideration to all four entries of the confusion matrix, whereas the F1 score ignores the count of true negatives.
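As a sketch, the metric is available in scikit-learn as matthews_corrcoef, and it can also be computed by hand from the four counts; the label arrays below are just a compact way to rebuild our example, not real data:

```python
from math import sqrt

from sklearn.metrics import matthews_corrcoef

tp, fn, fp, tn = 5, 85, 10, 990

# By hand, straight from the formula above
mcc_manual = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)  # ≈ 0.11 for our example

# With scikit-learn: rebuild label arrays that match the same counts
y_true = [1] * (tp + fn) + [0] * (tn + fp)
y_pred = [1] * tp + [0] * fn + [0] * tn + [1] * fp
mcc_sklearn = matthews_corrcoef(y_true, y_pred)

print(f"MCC (manual)  = {mcc_manual:.3f}")
print(f"MCC (sklearn) = {mcc_sklearn:.3f}")
```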

Nevertheless, for mitigating the imbalanced dataset problem itself, I suggest oversampling, which tries to balance the dataset by increasing the number of samples of the rare class.
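Here is a minimal sketch of random oversampling using the imbalanced-learn library (this assumes imbalanced-learn is installed; the feature matrix and labels are made up for illustration):

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler

# Made-up feature matrix and imbalanced labels: 90 positives vs 1000 negatives
rng = np.random.default_rng(42)
X = rng.normal(size=(1090, 4))
y = np.array([1] * 90 + [0] * 1000)

# Randomly duplicate minority-class samples until both classes have the same size
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

print(np.bincount(y))            # [1000   90]
print(np.bincount(y_resampled))  # [1000 1000]
```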

Another option is clustering the abundant class: an elegant approach proposed by Sergey on Quora [2]. Instead of relying on random samples to cover the variety of the training samples, he suggests clustering the abundant class into r groups, with r being the number of cases in the rare class. For each group, only the medoid (centre of the cluster) is kept. The model is then trained on the rare class and the medoids only, as sketched below.
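Here is one way this idea could be sketched with scikit-learn's KMeans; since KMeans returns centroids rather than medoids, I approximate each medoid by the real sample closest to its centroid (the data and variable names are made up, and the use of KMeans is my own assumption, not part of the original proposal):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

# Made-up data: X_abundant holds the majority class, X_rare the minority class
rng = np.random.default_rng(0)
X_abundant = rng.normal(size=(1000, 4))
X_rare = rng.normal(loc=2.0, size=(90, 4))

# r = number of rare-class samples; cluster the abundant class into r groups
r = len(X_rare)
kmeans = KMeans(n_clusters=r, n_init=10, random_state=0).fit(X_abundant)

# Approximate each cluster's medoid by the abundant sample closest to its centroid
medoid_idx = pairwise_distances_argmin(kmeans.cluster_centers_, X_abundant)
X_medoids = X_abundant[medoid_idx]

# Train on the rare class plus one representative per cluster
X_train = np.vstack([X_rare, X_medoids])
y_train = np.array([1] * len(X_rare) + [0] * len(X_medoids))
print(X_train.shape, np.bincount(y_train))  # (180, 4) [90 90]
```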

Finally, we have reached the end of this article.

I hope this gave you a brief understanding of the issue if you are a beginner; if you are already at an advanced level, feel free to critique or correct any of my insights.

Thank you all for your attention and for reading.

Have a good day 😊

References :

https://towardsdatascience.com/matthews-correlation-coefficient-when-to-use-it-and-when-to-avoid-it-310b3c923f7e

https://medium.com/analytics-vidhya/imbalanced-classification-for-medical-diagnosis-75dfcaa783d3

https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
