Is Accuracy a reliable measure of model performance for an imbalanced dataset?
The most common evaluation metric used for measuring the performance of a classification model is Accuracy. While Accuracy is an important metric, it is not reliable when working with imbalanced datasets. In this article, I will explain why Accuracy should not be the primary metric for evaluating a model's performance on imbalanced data.
In practical scenarios, the classification datasets we deal with are generally imbalanced, i.e. there is a huge difference in the volume of events and non-events. These appear in problems like spam detection, fraud detection, and churn detection, to name a few. Since Accuracy looks at the total predictions made correctly out of the entire population, it does not focus on the minority class. The majority class drives the Accuracy up to a high value, which gives the impression that our model is working great.
Suppose a dataset contains 1,000 samples, out of which 140 are fraud customers and 860 are genuine customers. Let's say we got the below confusion matrix from our initial prediction on this dataset:
Below is the interpretation of the confusion matrix:
Accuracy is defined as Total Correct Predictions / Total Predictions.
The above matrix gives an Accuracy of 84%. Sounds great, right?
But if we look into the matrix further, we see that only 29% of fraud cases are detected correctly, while 71% of fraud customers are incorrectly predicted as non-fraud.
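The counts below are not given explicitly in the article; they are inferred from the stated figures (1,000 samples, 140 frauds, 84% accuracy) and are used here only to illustrate the arithmetic:

```python
# Confusion-matrix counts consistent with the article's figures
# (inferred, illustrative values; not the author's exact matrix)
tp, fp = 40, 60    # frauds caught / genuine customers flagged as fraud
fn, tn = 100, 800  # frauds missed / genuine customers correctly cleared

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"Accuracy: {accuracy:.0%}")  # 84%, despite most frauds being missed
```

The 800 true negatives dominate the numerator, which is exactly why Accuracy looks healthy while the fraud class is largely missed.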
Let's see which other evaluation metrics would work better.
Precision: Out of all positive predictions made, how many were actually positive. This is very useful for the business when focusing on the model's accuracy in classifying samples as positive, i.e. the "fraud customers" in the above example.
Precision = TP / Predicted Positives = TP / (TP + FP)
The above confusion matrix gives a Precision of 40%, which indicates that 40% of all predicted fraud customers were actually fraud, while 60% were actually non-fraud.
Recall: Out of all actual positive samples, how many were predicted correctly.
Recall = TP / Actual Positives = TP / (TP + FN)
A Recall of 29% indicates that the model correctly identifies only 29% of all fraud customers; it was not able to detect the remaining 71% of actual frauds.
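Both formulas applied to the same inferred counts (TP=40, FP=60, FN=100, derived from the article's stated percentages, not given explicitly):

```python
# Illustrative counts inferred from the article's 40% precision / 29% recall
tp, fp, fn = 40, 60, 100

precision = tp / (tp + fp)  # of the customers flagged as fraud, how many were real frauds
recall = tp / (tp + fn)     # of the real frauds, how many were flagged

print(f"Precision: {precision:.0%}")  # 40%
print(f"Recall:    {recall:.0%}")     # 29%
```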
Now, let's look at two scenarios:
What if I don't develop a model at all and simply assign all predictions as non-fraud for the above dataset, i.e. Y = 0? The matrix would look like this:
What if I don't develop a model at all and simply assign all predictions as fraud for the above dataset, i.e. Y = 1? The matrix would look like this:
An efficient model should have Precision and Recall values somewhere between these two scenarios, beating Y = 0, Y = 1, and random chance.
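The two degenerate baselines can be computed directly from the dataset's class counts (140 frauds, 860 genuine). Note that Precision is undefined for the all-negative baseline, since nothing is predicted positive:

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall); precision is NaN when no positives are predicted."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp) if (tp + fp) else float("nan")
    rec = tp / (tp + fn)
    return acc, prec, rec

# Y = 0: predict every customer as non-fraud
print(metrics(tp=0, fp=0, fn=140, tn=860))    # accuracy 86%, precision undefined, recall 0%

# Y = 1: predict every customer as fraud
print(metrics(tp=140, fp=860, fn=0, tn=0))    # accuracy 14%, precision 14%, recall 100%
```

Notice that the do-nothing Y = 0 baseline scores 86% Accuracy, higher than the model's 84%, which is the whole argument against Accuracy in one number.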
This can be done by optimizing the threshold used for classifying events and non-events. After decreasing the threshold for the predictions on the above dataset, we get the below values:
As you can see, Accuracy is still roughly the same; however, there is an improvement in both Precision and Recall. The above matrix gives a Precision of 42%, which indicates that 42% of all predicted fraud customers were correct. The model is also now able to correctly predict 36% of all fraud customers. Definitely an improvement!
The above confusion matrix is just one illustration; in practice, we should iterate over the threshold in the same way to arrive at a balanced Precision and Recall.
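A threshold sweep can be sketched as below. The scores here are synthetic, made up purely for illustration, so they will not reproduce the article's exact 42%/36% figures; the point is the mechanics of trading precision against recall:

```python
import numpy as np

# Hypothetical predicted fraud probabilities: 140 frauds, 860 genuine customers
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.45, 0.15, 140),   # frauds tend to score higher
                         rng.normal(0.25, 0.15, 860)])  # genuine customers score lower
y_true = np.r_[np.ones(140), np.zeros(860)].astype(int)

def precision_recall(threshold):
    """Precision and recall when flagging every score >= threshold as fraud."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    prec = tp / (tp + fp) if (tp + fp) else float("nan")
    rec = tp / (tp + fn)
    return prec, rec

# Lowering the threshold flags more customers: recall rises, precision typically falls
for t in (0.5, 0.4, 0.3):
    p, r = precision_recall(t)
    print(f"threshold={t:.1f}  precision={p:.0%}  recall={r:.0%}")
```

Lowering the threshold can only add positive predictions, so recall never decreases as the threshold drops; the cost is usually paid in precision.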
We can also apply under-sampling and over-sampling techniques to deal with imbalanced data, and multiple algorithms can be compared to arrive at the most suitable outcome.
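As a minimal sketch of over-sampling, the minority class can be randomly duplicated until the classes are balanced. This uses plain NumPy on made-up data; in practice, dedicated libraries such as imbalanced-learn offer more sophisticated methods (e.g. SMOTE):

```python
import numpy as np

# Made-up feature matrix and labels matching the article's 140/860 split
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.r_[np.ones(140), np.zeros(860)].astype(int)

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Draw minority indices with replacement until both classes have 860 samples
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([majority, minority, extra])

X_bal, y_bal = X[keep], y[keep]
print(np.bincount(y_bal))  # [860 860]
```

Over-sampling should be applied only to the training split, after the train/test split, so duplicated minority samples never leak into the evaluation set.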
I hope this article was useful for you.