Handling Imbalanced Datasets with Classifiers

Photo by Ibrahim Rifath on Unsplash

One of the secrets of successful living is found in the word balance, referring to the avoidance of harmful extremes.

James C. Dobson

Introduction

Getting a balanced dataset to train machine learning models remains a challenge. However, there is no shortage of methods and theories in the research community for addressing it effectively. Data from the financial domain are a prime example: when training models for credit risk assessment, almost all of the data belong to the genuine class, leaving only a minuscule fraction for the fraudulent class. This article discusses a couple of classifiers for dealing with such an extremely imbalanced credit dataset.

We will start by importing the required libraries in Python.

from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

Now we will import the data.

df = pd.read_csv('creditcard.csv')

In the credit card dataset, the target values are heavily skewed: the overwhelming majority of samples belong to the non-fraud (genuine) class, with only a handful of fraudulent transactions.
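If the Kaggle credit card dataset is not at hand, a similarly skewed dataset can be synthesized with make_classification, which we imported above. This is a minimal sketch: the sample size and class proportions are illustrative, not taken from the original dataset, and X_syn, y_syn can stand in for X and y below.

# optional: synthesize an imbalanced two-class dataset similar in spirit
# to creditcard.csv (about 0.2% positives); proportions are illustrative
X_syn, y_syn = make_classification(n_samples=10000, n_features=30,
                                   n_informative=10, weights=[0.998, 0.002],
                                   random_state=1)
print(Counter(y_syn))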

We will now count the number of instances in the genuine and fraudulent classes and plot their distribution.

X = np.asarray(df.iloc[:, :-1])
y = np.asarray(df['Class'])
counter = Counter(y)
# plotting pie chart of the class distribution (class 0 = genuine, class 1 = fraudulent)
plt.pie([counter[0], counter[1]], labels=['genuine', 'fraudulent'])
Pie chart representing the class distribution [genuine, fraudulent]

Algorithm level methods

In algorithm-level methods, the classifier itself is adapted to the imbalanced nature of the dataset. The class weights in the model are adjusted according to the number of instances in each class. Support Vector Machines (SVM) and decision trees, two of the most commonly used classifiers, can be adapted to imbalanced datasets in this way.

SVM

Before learning to handle imbalanced data with SVM, let us understand how SVM classifies data points.

SVM is a classification model that separates data points into two classes with a hyperplane. The main focus of SVM is to find the hyperplane that differentiates the classes most effectively, which is done by maximizing the margin between the hyperplane and the nearest points of each class.
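To make the margin idea concrete, here is a minimal sketch on a synthetic two-dimensional dataset (the toy data and variable names are illustrative): for a linear SVM with weight vector w, the margin width is 2/||w||.

# toy 2-D data to illustrate the separating hyperplane (illustrative only)
X_toy, y_toy = make_classification(n_samples=100, n_features=2,
                                   n_informative=2, n_redundant=0,
                                   n_clusters_per_class=1, random_state=1)
svm_toy = SVC(kernel='linear').fit(X_toy, y_toy)
w, b = svm_toy.coef_[0], svm_toy.intercept_[0]  # hyperplane: w.x + b = 0
print('margin width: %.3f' % (2 / np.linalg.norm(w)))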

Let us first measure the performance (mean ROC AUC) of the SVM model on the imbalanced data.

model = SVC(gamma='scale')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

Handling imbalance with SVM

The penalty for misclassifying each class is controlled by a hyperparameter called class_weight in SVM. By default, all classes carry equal weight, but when the distribution of classes is unequal, one possible approach is to assign weights to the classes inversely proportional to their frequencies. This is done by explicitly defining a weights dictionary and passing it to the SVM model.

# class 0 (genuine) has 284315 samples and class 1 (fraudulent) has 492;
# interchanging the counts weights the minority class up
weights = {0: 492, 1: 284315}
model = SVC(gamma='scale', class_weight=weights)

In Python, this can also be done directly by passing class_weight='balanced' to the SVM model. Both approaches weight the classes in the same inverse proportion and so yield the same result.
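To inspect what 'balanced' implies, scikit-learn exposes compute_class_weight, which applies the formula n_samples / (n_classes * n_samples_per_class). A quick sketch:

from sklearn.utils.class_weight import compute_class_weight

# weights that class_weight='balanced' would assign to each class
balanced = compute_class_weight(class_weight='balanced',
                                classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), balanced)))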

model = SVC(class_weight='balanced')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

If we compare the ROC AUC of the classifier before and after handling the imbalance, we can see that the score degrades by about 1%. This is because SVM is not particularly well suited to handling imbalance of this severity.

Decision Trees

A similar approach is followed when handling the imbalanced nature of datasets with decision trees.

A decision tree is a classification model that classifies data points by inferring a sequence of decisions from the features during training. The decisions are made by minimizing the Gini impurity of each split. Gini impurity is the probability of a data point being misclassified if it were labeled at random according to the class distribution at a node, and it is influenced by the weight assigned to each class.
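To make this concrete, here is a small worked computation of Gini impurity, 1 - Σ p_i², for the credit card class counts, with and without the interchanged class weights used below (the counts come from the dataset; the rest is an illustrative sketch):

# Gini impurity of the root node: 1 - sum(p_i^2)
n0, n1 = 284315, 492                      # genuine, fraudulent counts
p0 = n0 / (n0 + n1)
print('unweighted Gini: %.5f' % (1 - (p0**2 + (1 - p0)**2)))  # ~0.00345, nearly "pure"
# with interchanged class weights {0: 492, 1: 284315}, the weighted
# class masses become equal, so the weighted Gini rises to 0.5
wp0 = 492 * n0 / (492 * n0 + 284315 * n1)
print('weighted Gini: %.5f' % (1 - (wp0**2 + (1 - wp0)**2)))  # 0.50000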

Let us see the performance (mean ROC AUC) of the model without handling the imbalance in the data.

model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

Handling imbalance with decision trees

The imbalance is handled by explicitly defining the weights of the classes. One possible choice is to interchange the counts of data points in the minority and majority classes.

weights = {0: 492, 1: 284315}
model = DecisionTreeClassifier(class_weight=weights)

In Python, this can also be done directly by passing class_weight='balanced' to the decision tree classifier.
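Re-running the same evaluation with the weighted tree (a sketch mirroring the earlier snippets):

model = DecisionTreeClassifier(class_weight='balanced')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))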

As we compare the ROC AUC of the decision tree model before and after handling the imbalance, we can see that the score improves once the class weights are applied.

In this way, we can handle imbalance using algorithm-level methods! The entire code for these approaches can be found here.

Takeaways

We covered beginner-friendly algorithm-level approaches to handling imbalanced datasets. We encourage readers to explore hybrid approaches that combine the strengths each method offers.

Do you have any questions?

Kindly ask your questions via email or comments and we will be happy to answer :)


Insights on Modern Computation
Perspectives on data science

A communal initiative by Meghana Kshirsagar (BDS | Lero | UL, Ireland) and Gauri Vaidya (Intern | BDS). Each concept is accompanied by sample datasets and Python code.