Handling Imbalanced Datasets With imblearn Library

Pranjal Soni
Published in TheCyPhy · Oct 10, 2020

In machine learning, we often encounter imbalanced datasets. Problems like fraud detection, claim prediction, churn prediction, anomaly detection, and outlier detection are examples of classification problems that often involve imbalanced datasets.

In this article, I am going to discuss a simple approach to dealing with an imbalanced dataset using the imblearn Python library, which is specially designed for this purpose. The dataset I am using here is taken from the MachineHack Detecting Anomalies in Wafer Manufacturing hackathon and has a binary target.

Outline of the article :

  1. What is an Imbalanced Dataset
  2. Imblearn Library
  3. Dealing with Imbalanced dataset
  4. Balanced Random Forest
  5. Conclusion

1. What is an Imbalanced Dataset:

A dataset is imbalanced when one class forms a large majority, for example, when one class carries more than 90% of the weight and the other less than 10%. Such a dataset is highly biased towards the majority class, which creates a problem when training a machine learning model: the model fails to identify the minority class correctly.

The Wafer Manufacturing dataset consists of only two classes: class 0 if the row is not an anomaly and class 1 if it is.
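As a quick sanity check, we can look at the class distribution directly. Here is a minimal sketch, assuming the training data sits in a CSV file called train.csv with the label in a column named Class (both names are assumptions, not the hackathon's actual file layout):

```python
import pandas as pd

# Load the training data (file and column names are assumptions)
df = pd.read_csv("train.csv")

# Count the rows in each class; a heavily skewed count signals imbalance
print(df["Class"].value_counts())
print(df["Class"].value_counts(normalize=True))  # same counts as proportions
```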

2. Imblearn Library :

The imblearn library is specifically designed to deal with imbalanced datasets. It provides various methods, like undersampling, oversampling, and SMOTE, to handle and remove the imbalance from a dataset. The library also includes ensemble methods, such as bagging classifiers, balanced random forests, and boosting classifiers, that can be used to train accurate models on imbalanced datasets.
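The package is published on PyPI under the name imbalanced-learn (it is imported as imblearn):

```
pip install imbalanced-learn
```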

3. Dealing with Imbalanced dataset :

For building a good machine learning model, it is necessary to have enough sample points for each class. To achieve that, we can use undersampling, oversampling, or SMOTE, according to our problem and dataset requirements.

3.1 Undersampling :

Undersampling is one of the techniques designed to make an imbalanced dataset balanced. It eliminates data points of the majority class to bring the majority and minority classes to an equal ratio. There are various methods for undersampling, such as Tomek’s links, EditedNearestNeighbours, CondensedNearestNeighbour, and InstanceHardnessThreshold, all of which can be implemented using the imblearn library.
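As a minimal sketch of undersampling with imblearn, here is random undersampling alongside Tomek’s links; X and y are assumed to be the feature matrix and label array already loaded from the dataset:

```python
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Randomly discard majority-class rows until both classes are equal in size
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)

# Alternatively, remove only the majority-class points that form
# Tomek's links (pairs of opposite-class nearest neighbours)
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X, y)
```

Note that cleaning methods like Tomek’s links only remove points near the class boundary, so they do not necessarily produce a perfectly balanced dataset.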

3.2 Oversampling :

Oversampling is also quite helpful for handling an imbalanced dataset. Instead of removing majority-class points, it randomly duplicates existing data points of the minority class, increasing their share in the dataset until the ratio of majority and minority classes is equalized.
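A minimal sketch of random oversampling, again assuming X and y hold the features and labels:

```python
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class rows until the class counts match
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)
```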

3.3 SMOTE :

SMOTE stands for Synthetic Minority Oversampling Technique. It is also an oversampling technique and is widely used to handle imbalanced datasets. SMOTE selects data points of the minority class in feature space, draws a line between neighbouring points, and generates new points along that line. In this way, the technique synthesizes new data points for the minority class rather than merely duplicating existing ones, and it is often considered one of the most effective oversampling methods.
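A minimal SMOTE sketch; the k_neighbors value shown is the library default rather than a value tuned for this dataset:

```python
from imblearn.over_sampling import SMOTE

# For each minority sample, pick one of its k nearest minority-class
# neighbours and synthesize a new point along the line between them
smote = SMOTE(k_neighbors=5, random_state=42)
X_sm, y_sm = smote.fit_resample(X, y)
```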

4. Balanced Random Forest :

Balanced random forest is a very effective ensemble algorithm for dealing with an imbalanced dataset. It works in the following manner:

i. For each iteration in the random forest algorithm, draw a bootstrap sample from the minority class and randomly draw the same number of data points, with replacement, from the majority class.

ii. Induce a classification tree from the sample using the CART algorithm, grown to its maximum size without pruning, searching at each node among a set of randomly selected variables for the optimal split.

iii. Repeat the two steps above the desired number of times, and aggregate the predictions of the ensemble to make the final prediction.

Before implementing this algorithm, I split the dataset into 10 folds using StratifiedKFold and then apply BalancedRandomForestClassifier on each fold, as sketched below.
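A sketch of that setup, assuming X and y are NumPy arrays of features and labels; the number of trees is illustrative, not the value used in the original experiment:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from imblearn.ensemble import BalancedRandomForestClassifier

# Stratified folds preserve the class ratio in every split
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    clf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X[train_idx], y[train_idx])

    # Evaluate each fold with ROC AUC on its held-out split
    proba = clf.predict_proba(X[val_idx])[:, 1]
    auc = roc_auc_score(y[val_idx], proba)
    scores.append(auc)
    print(f"Fold {fold}: ROC AUC = {auc:.4f}")

print(f"Mean ROC AUC = {np.mean(scores):.4f}")
```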

In the image below, we can see the ROC AUC score for each fold after applying the balanced random forest classifier.

5. Conclusion :

In this article, I discussed imbalanced datasets and how we can handle them. There are various libraries available to tackle the problem, and imblearn is one of the most feature-rich among them. Here I touched only a few of the most commonly used techniques, which you can try on your own machine learning problems. For a more in-depth intuition of the topics mentioned above, see the links below.

References :

https://imbalanced-learn.readthedocs.io/en/stable

https://rikunert.com/SMOTE_explained

https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
