Looking Beyond Accuracy — A Holistic Evaluation Guide for Binary Classification Models in Machine Learning

Hands on AI · Published in The Startup · 7 min read · Nov 19, 2020

Data science and ML practitioners constantly work on classification problems across industries and applications. In many cases, accuracy is not the best metric for judging a model. For example, on a class-imbalanced dataset a model can score high accuracy simply by predicting the majority class, yet fail badly on unseen data. There are plenty of scenarios that call for additional metrics to judge and compare models, so it is wise to be familiar with all-seasons metrics (by all-seasons I mean any dataset) that let you analyze models thoroughly.

Accuracy alone may or may not be the best parameter for selecting a model that will be used on unseen or new data. A well-calibrated model with a better AUC (even if its accuracy is lower) will often generalize better to unseen data.

AUC is short for area under the curve; the curve in question is the ROC (Receiver Operating Characteristic) curve.

Imagine a simple scenario to understand ROC.

Objects are passed to you one by one, and you, the receiver, have to identify whether each object is a book or a notebook. A simple classification. But the catch is that you are blindfolded. You can feel, touch, smell, and analyze shape and thickness to predict whether the object you received is a book or a notebook. Remember, we log each of your predictions along with the actual object received for scoring and evaluation.

After several attempts, though blindfolded, you will have developed an intuition and can easily characterize each object you receive as a book or a notebook. The probability of you correctly characterizing the object is known as the receiver operating characteristic.

Understand AUC-ROC as the ability of the model to distinguish between classes.

(Figure: ROC curves before vs. after preprocessing. Before: using the data as-is. After: normalization, transformation, binning, removing multicollinearity, tuning the model hyperparameters, and calibrating the model.)

The ROC curve plots the true positive rate against the false positive rate across classification thresholds. An AUC of 0.5 is as useless as guessing without knowing anything about the object passed to you, just like calling heads or tails in a coin toss.

A greater area under the ROC curve indicates a greater capacity of the model to distinguish between the two classes, i.e., the better the model is at predicting YESs as YESs and NOs as NOs.

The AUC score can also be read as a consolidated summary of the confusion matrix across all thresholds. It ranges from 0 to 1, and higher is better.
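To make this concrete, here is a minimal sketch of plotting a ROC curve and computing the AUC with scikit-learn. The synthetic data, the `LogisticRegression` model, and all variable names are illustrative stand-ins, not the model from the loan example later in this post; subsequent snippets reuse these names.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced binary data as a stand-in for real loan data
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, _ = roc_curve(y_test, y_scores)   # TPR and FPR at every threshold
auc = roc_auc_score(y_test, y_scores)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], "--", label="random guess (AUC = 0.5)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```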

Why is an AUC-ROC of 0.5 useless?

Binary classification is not so tricky: the answer is either yes or no, 1 or 0, A or B. A blind guess is therefore right about half the time, which is exactly what an AUC of 0.5 represents. The effort goes into pushing the AUC above that baseline, as close to 1 as possible, 1 being absolute certainty.

Relatedly, 0.5 is the default value for the discrimination threshold. Threshold means exactly the same in machine learning as in its literal sense: "the level at which something starts to happen."

It can be adjusted to increase or decrease the sensitivity to false positives.

What does this mean?

Let us say you are a loan officer in a multinational bank. Your bank is offering SME loans of up to INR 50 lakhs to young entrepreneurs under the Make in India benefits scheme.

For this, a website is set up as a landing page for anyone who clicks on the banner ads. On the website there are various fields an applicant can fill in, expecting to receive a call from a loan officer if eligible. The landing page is backed by a machine learning algorithm for risk classification, trained on the organization's historical data along with third-party credit scoring and delinquency data. When a user submits their details, the algorithm sends a risk analysis report to the loan officer together with the applicant's details.

Because the default threshold of a classification model is 0.5, the loan officer only sees applicants who cross this threshold. The loan officer can ask the data science team to lower the threshold to, say, 0.35 so that he or she can examine marginal applicants in detail. Similarly, the threshold can be raised for specific uses, for example when there is an excess of applicants.
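In code, that request amounts to thresholding the predicted probabilities yourself. A minimal sketch, reusing `y_scores` from the ROC snippet above (the 0.35 cutoff is just the example figure from this scenario):

```python
# Default decision rule: positive if the predicted probability >= 0.5
default_preds = (y_scores >= 0.5).astype(int)

# Lowered threshold surfaces marginal applicants for manual review
lenient_preds = (y_scores >= 0.35).astype(int)

print("Flagged at 0.50:", default_preds.sum())
print("Flagged at 0.35:", lenient_preds.sum())
```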

Feature importance can help loan officers understand why and how the machine learning algorithm reached a particular decision: which parameters the model weighted most heavily when making it.
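For instance, tree-based models in scikit-learn expose a `feature_importances_` attribute. A minimal sketch on the toy training split from earlier (a linear model would use `coef_` instead):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Rank features by impurity-based importance, highest first
order = np.argsort(forest.feature_importances_)[::-1]
for i in order[:5]:
    print(f"feature_{i}: {forest.feature_importances_[i]:.3f}")
```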

Every model is trained on some pre-existing historical data, whether the organization's own data, data acquired from a third party, or both. Often, initial analysis of the data reveals a class imbalance, i.e., the 1s far outnumber the 0s or vice versa. In such cases the model can become biased in favor of the majority class. How do you train and evaluate a model on such a dataset?

One way is to synthesize new minority-class samples in the dataset using the SMOTE (Synthetic Minority Over-sampling Technique).
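A minimal sketch using the `imbalanced-learn` package, again assuming the toy training split from earlier:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("Before:", Counter(y_train))  # imbalanced class counts
print("After: ", Counter(y_res))    # minority class upsampled to parity
```

Note that SMOTE should only ever touch the training set; the test set must stay as-is so evaluation reflects the real class distribution.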

Another way is to go ahead and train the model on the imbalanced data and use the precision-recall curve to evaluate it.

High precision means few false positives, and high recall means few false negatives. A system with both high precision and high recall returns many results and labels nearly all of them correctly.
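A minimal precision-recall curve sketch with scikit-learn, reusing `y_test` and `y_scores` from the earlier snippets:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, _ = precision_recall_curve(y_test, y_scores)
ap = average_precision_score(y_test, y_scores)  # area under the PR curve

plt.plot(recall, precision, label=f"model (AP = {ap:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```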

Apart from handling imbalanced data, there are a few other things to attend to before putting a dataset into training. These include, but are not limited to, normalizing, categorizing/binning, and transforming the data. Another pre-training step is to identify columns in the dataset that move together or depend on each other; in more familiar words, removing multicollinearity to boost performance. Even so, some dependencies and collinearity may remain in the model. Recursive feature elimination (RFE) helps prune redundant features that carry overlapping information.
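Before reaching for RFE, a quick first pass is to flag pairs of highly correlated columns directly. A minimal sketch on the toy data, with the 0.9 cutoff being an arbitrary illustrative choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(X_train)
corr = df.corr().abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Highly correlated features to consider dropping:", to_drop)
```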

RFE with cross-validation (RFECV) plots the number of features in the model against the cross-validated test score and its variability, highlighting the selected number of features. The RFECV curve tells us the number of features that are just enough to reach maximum cross-validated performance in binary classification; the remaining features can be safely discarded without hurting model accuracy.
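A minimal RFECV sketch with scikit-learn, again on the toy data (the `cv_results_` attribute assumes scikit-learn >= 1.0; older versions exposed `grid_scores_` instead):

```python
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5, scoring="roc_auc")
selector.fit(X_train, y_train)
print("Optimal number of features:", selector.n_features_)

# Cross-validated score for each candidate feature count
scores = selector.cv_results_["mean_test_score"]
plt.plot(range(1, len(scores) + 1), scores)
plt.xlabel("Number of features")
plt.ylabel("Mean cross-validated AUC")
plt.show()
```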

Why is this important?

For two reasons:

  1. For more intuitive models: all final features and model parameters play a role in decision making, giving a holistic approach to decisions on unseen datasets. It also lets you spot factors, if any, that were assumed important but statistically are not.
  2. For better interpretation: ML applications cannot be black boxes. Fewer features make the algorithm's decision making easier to understand, and decision boundary charts help visualize it.

Learning-curve graphs show how quickly your model reaches its top score as it sees more training data; the less data or fewer iterations needed, the better. They show the change in learning performance, which can be used to diagnose under- or overfitting and to judge how relevant the data is to the problem at hand.
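A minimal learning-curve sketch on the toy data, plotting score against training set size:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc",
    train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation score")
plt.xlabel("Training set size")
plt.ylabel("AUC")
plt.legend()
plt.show()
```

A large, persistent gap between the two curves suggests overfitting; two low, converged curves suggest underfitting.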

When predicting classes, it is useful to print the probability alongside the label. Anything above 0.5 is a YES or 1, but by how much? So the output generally looks like 1, 0.75, meaning class 1 with a probability of 0.75. This gives a degree of confidence in the prediction.
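In scikit-learn this just means pairing `predict` with `predict_proba`, reusing the toy `clf` and `X_test` from earlier:

```python
probs = clf.predict_proba(X_test)   # per-class probabilities, one row per sample
labels = clf.predict(X_test)        # hard labels at the default 0.5 threshold

# Print each predicted label with the probability of that predicted class
for label, p in list(zip(labels, probs.max(axis=1)))[:5]:
    print(f"class {label} with probability {p:.2f}")
```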

But how can you trust probabilities?

By plotting calibration curves. Also known as the reliability curve: a well-calibrated model produces a curve that hugs the 45-degree diagonal from the origin.
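A minimal sketch with scikit-learn's `calibration_curve`, reusing `y_test` and `y_scores`:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

frac_pos, mean_pred = calibration_curve(y_test, y_scores, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], "--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```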

Similarly, the validation curve shows the gap between the training score and the cross-validation score as a hyperparameter varies. A good model keeps this gap small.
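A minimal validation-curve sketch, sweeping the regularization strength `C` of the toy logistic regression (the choice of parameter and range is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

param_range = np.logspace(-3, 2, 6)
train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=1000), X, y,
    param_name="C", param_range=param_range, cv=5, scoring="roc_auc")

plt.semilogx(param_range, train_scores.mean(axis=1), label="training score")
plt.semilogx(param_range, val_scores.mean(axis=1), label="cross-validation score")
plt.xlabel("C (inverse regularization strength)")
plt.ylabel("AUC")
plt.legend()
plt.show()
```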

Lastly, gain and lift charts measure the effectiveness of a classification model as the ratio between the results obtained with and without the model. A good model's gain curve sits well above the random baseline, and its lift stays comfortably above 1.
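Cumulative gain and lift can be computed by hand: sort the test samples by predicted probability and track what fraction of all positives is captured as you move down the list. A minimal sketch, reusing `y_test` and `y_scores`:

```python
import numpy as np
import matplotlib.pyplot as plt

order = np.argsort(y_scores)[::-1]   # highest-risk samples first
y_sorted = np.asarray(y_test)[order]

cum_gain = np.cumsum(y_sorted) / y_sorted.sum()               # positives captured so far
pct_samples = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
lift = cum_gain / pct_samples                                 # ratio vs. random targeting

plt.plot(pct_samples, cum_gain, label="model gain")
plt.plot([0, 1], [0, 1], "--", label="random baseline")
plt.xlabel("Fraction of samples targeted")
plt.ylabel("Fraction of positives captured")
plt.legend()
plt.show()
```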

Conclusion

Taking a holistic approach to data preparation and model training leads to predictions with high accuracy and precision. The ability to consistently produce correct results in high numbers is the goal of any classification model.

Almost all of these metrics can be visualized. A great model will also produce pleasing visuals, with curves that tend to follow recognizable healthy shapes.

One last thing: depending on your application, make sure to assign costs to false positives, true positives, false negatives, and true negatives to minimize the negative impact of misclassification. This helps in adjusting the threshold to mitigate risk. For example, the cost of a false negative in a banking environment may differ greatly from that in a clinical environment.
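A minimal sketch of picking a cost-minimizing threshold on the toy data; the per-outcome costs below are purely illustrative assumptions, not figures from any real lending or clinical setting:

```python
import numpy as np

# Hypothetical costs: a missed defaulter (false negative) is assumed to
# hurt ten times more than a needless manual review (false positive)
COST_FP, COST_FN = 1.0, 10.0

thresholds = np.linspace(0.01, 0.99, 99)
costs = []
for t in thresholds:
    preds = (y_scores >= t).astype(int)
    fp = ((preds == 1) & (y_test == 0)).sum()
    fn = ((preds == 0) & (y_test == 1)).sum()
    costs.append(COST_FP * fp + COST_FN * fn)

print(f"Cost-minimizing threshold: {thresholds[np.argmin(costs)]:.2f}")
```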

And of course, most of these metrics, if not all, can also be applied to assess models for multi-class classification.

GITHUB: https://github.com/rajatscibi/bfsi_loan

#machinelearning #datascience #statistics
