DATA SCIENCE THEORY | CLASS IMBALANCE | KNIME ANALYTICS PLATFORM

99% Accuracy! Great…right?

Creating Better Models by Dealing with Class Imbalance in KNIME

Tosin Adekanye
Low Code for Data Science

--

Accuracy Can be Misleading

You’ve tested several models for your fraud detection project. You choose the one that boasts an accuracy of 99%. With great enthusiasm, you send your project off to the Senior Data Scientist, but she is unimpressed.

You have 500,000 transactions in your dataset. 495,000 of your transactions are normal, while 5,000 are fraudulent. Now, if your model classifies 499,915 of those transactions as normal, it would achieve 99% accuracy while missing the vast majority of your 5,000 fraud cases.

In cases of class imbalance, accuracy is not a good measure of model performance if the imbalance has not been addressed. In such cases, sensitivity and specificity are more reliable measures of a model’s performance.

Confusion Matrix of our super dramatic and fictitious model.

Sensitivity evaluates how many of the actually true cases were correctly identified by the model: the number of correctly predicted true cases divided by the number of actually true cases in your dataset.

Sensitivity = correctly predicted true cases / actually true cases

Since we are trying to detect fraud, true (1) equates to a fraudulent transaction, while false (0) equates to a normal transaction.

Sensitivity in the image above = 35/5,000 or TP/(TP + FN) = 0.7%

Despite the high accuracy of this model, it is very bad at detecting fraud cases!

Specificity evaluates how many of the actually false cases were correctly identified by your model: the number of correctly predicted false cases divided by the number of actually false cases in your dataset.

Specificity = correctly predicted false cases / actually false cases

Specificity in the confusion matrix above = 494,950/495,000 or TN/(TN + FP) = 99.99%
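
If you want to reproduce these numbers yourself, here is a minimal Python sketch. The TP/FN/FP/TN counts are the ones implied by the fictitious confusion matrix above (FP = 50 follows from 495,000 − 494,950):

    # Counts implied by the fictitious confusion matrix above
    tp = 35         # fraudulent transactions correctly flagged as fraud
    fn = 4_965      # fraudulent transactions missed by the model
    tn = 494_950    # normal transactions correctly classified as normal
    fp = 50         # normal transactions incorrectly flagged as fraud

    accuracy = (tp + tn) / (tp + tn + fp + fn)    # ~99%
    sensitivity = tp / (tp + fn)                  # 0.7%
    specificity = tn / (tn + fp)                  # ~99.99%

    print(f"Accuracy: {accuracy:.2%}")
    print(f"Sensitivity: {sensitivity:.2%}")
    print(f"Specificity: {specificity:.2%}")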

The specificity is high, but this is no cause for rejoicing. Since the main purpose of a fraud detection model is to predict fraud cases so that fraudsters can be stopped in their tracks, high sensitivity is critical here. This is not to say that specificity should be ignored, as we do not want to frustrate customers by misclassifying normal transactions as fraudulent.

For more information on interpreting the results of a classification model, please check out this article.

Why is a Balanced Dataset Important?

In cases of imbalanced classes, a model does not have enough information to train appropriately. This usually leads to less predictive ability for the minority class. In the case of our fraud detection example, 99% of our transactions are normal, while just 1% are fraudulent. Such massive class imbalance can make it harder for your model to correctly recognize and predict cases in the minority class.

There are three main approaches to dealing with class imbalance:

  1. Oversampling: Increasing the number of cases for the minority class.
  2. Undersampling: Reducing the number of cases for the majority class.
  3. Getting more real data for the minority class, where possible.

I’ll only be covering the first two methods in this article, as the third is not possible with the example dataset used here.

Handling Class Imbalance in KNIME

Now that you have some background on measures of model performance, let’s head over to KNIME and look at all of these ideas in action.

Using an anonymized credit card transactions dataset found on Kaggle, we can build a model to detect whether a transaction is fraudulent. For this example, I am using only half the number of normal transactions provided in the dataset so that I can meet the upload size limit for the KNIME Hub.

Despite using only half the number of normal transactions, the dataset is still massively imbalanced. There are 142,157 normal transactions and 492 fraudulent transactions in the dataset.

I’m going to be showing you three different ways to handle class imbalance in KNIME. The first method employs undersampling, leveraging the Equal Size Sampling node.

In the configuration dialogue, you indicate your class column and KNIME reduces the number of rows in your majority class to match the number in your minority class. You can enable a static seed to draw the same sample each time you run the node.
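
If you prefer to see the idea in code, here is a rough Python/pandas sketch of what the Equal Size Sampling node does. The DataFrame df and its Class column are assumptions based on the dataset used in this article:

    import pandas as pd

    # df is assumed to have a "Class" column: 1 = fraud, 0 = normal
    fraud = df[df["Class"] == 1]
    normal = df[df["Class"] == 0]

    # Randomly keep only as many normal rows as there are fraud rows;
    # random_state plays the role of the node's static seed
    normal_down = normal.sample(n=len(fraud), random_state=42)

    # Recombine and shuffle to get the undersampled, balanced dataset
    balanced = pd.concat([fraud, normal_down]).sample(frac=1, random_state=42)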

For oversampling, you can either use the SMOTE node to create synthetic data for your minority class, or you could duplicate rows in your minority class.

The SMOTE node uses k-nearest neighbours (k-NN) to create new data points from your existing data.

In the configuration dialogue, you specify your class column and how many nearest neighbours to use, and select the option to oversample the minority class.
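
For readers who also work in code, the imbalanced-learn library provides a SMOTE implementation based on the same k-NN idea. This is only an illustrative sketch; the df variable and Class column are assumptions:

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    X = df.drop(columns=["Class"])   # feature columns
    y = df["Class"]                  # 1 = fraud, 0 = normal

    # k_neighbors corresponds to the "nearest neighbours" setting in the node
    smote = SMOTE(k_neighbors=5, random_state=42)
    X_res, y_res = smote.fit_resample(X, y)

    print(Counter(y))      # before: heavily imbalanced
    print(Counter(y_res))  # after: both classes equally represented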

Another approach to oversampling is duplicating the fraud cases. This can be achieved in a number of ways, but here is the approach I’d take:

Essentially, I am copying and pasting the data for the fraudulent cases over and over again…291 times! This runs remarkably fast, and it makes the number of fraud cases roughly equal to the number of normal cases, resulting in a more balanced dataset.
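
In code, the same duplication trick is just a couple of pandas lines; again, a sketch assuming a df with a Class column:

    import pandas as pd

    fraud = df[df["Class"] == 1]    # 492 rows
    normal = df[df["Class"] == 0]   # 142,157 rows

    # 292 copies of the fraud rows in total (the original plus 291 duplicates),
    # which brings the fraud count close to the normal count
    oversampled = pd.concat([normal] + [fraud] * 292, ignore_index=True)

    print(oversampled["Class"].value_counts())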

Evaluating the Impact of Treating Class Imbalance

To test how balancing the data affected my model performance, I utilized different techniques of dealing with imbalance and compared the results of these to each other, and to the untreated dataset.

I did not apply class balancing to the test set, as real credit card transactions are massively imbalanced. Thankfully, fraud cases are rare! I wanted to preserve this natural imbalance when testing the model.

It was important, however, to apply class balancing to the training data so that the model could train appropriately. You can download and explore the workflow on the KNIME Hub here.
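
The important pattern here, splitting first and then balancing only the training partition, looks roughly like this in Python. SMOTE is shown as one example of a balancing step, and XGBClassifier stands in for the KNIME XGBoost nodes; the df variable and column names are assumptions:

    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE
    from xgboost import XGBClassifier

    X = df.drop(columns=["Class"])
    y = df["Class"]

    # A stratified split keeps the natural imbalance in the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Balance ONLY the training data; the test set stays untouched
    X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

    model = XGBClassifier()
    model.fit(X_train_bal, y_train_bal)
    y_pred = model.predict(X_test)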

In all cases where class balancing was employed, the XGBoost model was able to detect more fraud cases than when the class imbalance was left untreated. Note that I used the exact same test data in all cases, to keep the comparison fair. The test data contains 102 fraud transactions and 28,428 normal transactions.

Reading the confusion matrix

In the confusion matrices, the rows represent the actual class and the columns represent the predicted class.

In the table above, the cell with row-column combination 1,1 represents True Positives (TP): the number of fraudulent cases correctly identified by the model.

The cell with row-column combination 0,1 represents False Positives (FP): normal transactions incorrectly classified as fraud by the model. These are false alarms.

The cell with row-column combination 1,0 represents False Negatives (FN): fraudulent transactions incorrectly classified as normal by the model. These are missed fraud cases.

The cell with row-column combination 0,0 represents True Negatives (TN): normal transactions correctly classified as normal by the model.
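
If you compute the confusion matrix in code, scikit-learn follows the same convention (rows = actual, columns = predicted), so the four cells can be unpacked directly. Continuing the illustrative sketch above:

    from sklearn.metrics import confusion_matrix

    # labels=[0, 1] fixes the order: row/column 0 = normal, row/column 1 = fraud
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"Sensitivity: {sensitivity:.2%}, Specificity: {specificity:.2%}")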

Without addressing the class imbalance:

Sensitivity = 83/102 = 81.37%

Specificity = 28425/28428 = 99.99%

I am using the formidable XGBoost model, so even without addressing class imbalance we are already capturing the majority of fraud cases and recognizing almost all normal transactions. But can we do better?

With undersampling:

Sensitivity = 94/102 = 92.16%

Specificity = 27252/28428 = 95.86%

With undersampling, the model is able to catch even more fraud cases, but we now have a lot more normal transactions classified as fraudulent.

In this situation, a False Negative (predicting no fraud when there is fraud) might be worse than a False Positive (predicting fraud when there is no fraud). However, would we want to stress a good number of customers over a benign transaction? 🤔

With oversampling by duplication:

Sensitivity = 87/102 = 85.29%

Specificity = 28422/28428 = 99.98%

Compared to no class imbalance handling, here we are able to catch about 4% more fraud cases, without significant misclassification of normal transactions.

With oversampling by creation:

Sensitivity = 90/102 = 88.24%

Specificity = 28414/28428 = 99.95%

With SMOTE oversampling we catch approximately 7% more fraud cases than if we do not handle class imbalance, and we do not misclassify a high number of normal transactions as fraud.

So which method should we go with?

It’s always a trade-off: which metric is more important for your use case, and how many false negatives and false positives are you willing to tolerate?

Comparing the results of the undersampling method to the SMOTE method, the Financial Institution (FI) would have to consider whether it is willing to tolerate about 4% more misclassified normal transactions in order to capture about 4% more fraud transactions.

If there is a system in place to quickly deal with an incorrect normal transaction classification, tolerating more false positives might be feasible. For example, when using my favorite card, my payment sometimes gets rejected and I immediately receive a notification/text asking if I was the one who performed the transaction.

Once I confirm it was me, I can complete the transaction. As a customer, I am unbothered by this. With such a system in place, an FI could go for the undersampling method, which is more inclined to flag transactions as fraud. A more conservative FI might prefer the outcome of the SMOTE method.

Final words

Dealing with class imbalance is just one of many steps to building better models. Other steps involve thorough data cleaning and understanding, feature engineering, feature selection, cross validation, testing multiple models, etc. Moreover, the real test of model performance is how it does in production. Your test data metrics are simply the baseline 😉

And that’s how you deal with class imbalance in KNIME!
