Using Under-Sampling Techniques for Extremely Imbalanced Data

Published in

Dataman in AI

9 min read · Aug 10, 2018

Class imbalance can bias a classifier heavily towards the majority class, reducing classification performance and increasing the number of false negatives. How can we alleviate the issue? The most commonly used techniques are data resampling: under-sampling the majority class, over-sampling the minority class, or a mix of both. Applied carefully, resampling can improve classification performance on the minority class. In this article, I will explain what imbalanced data is, why the ROC curve can fail to measure performance correctly on such data, and the techniques to attack the issue. I highly recommend reading the second article, “Using Over-Sampling Techniques for Extremely Imbalanced Data”. In both articles, I include the Python code for those who are interested. The code is also available as a Python notebook.

What is imbalanced data?

The definition of imbalanced data is straightforward. A dataset is imbalanced if at least one of the classes constitutes only a very small minority. Imbalanced data are common in banking, insurance, engineering, and many other fields. In fraud detection, an imbalance on the order of 100 to 1 is common.
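
To make the definition concrete, here is a minimal sketch that measures the imbalance ratio of a label list and then applies plain random under-sampling to the majority class. It is an illustration under stated assumptions, not the code from the notebook: the toy data (1,000 legitimate transactions and 10 frauds) and the helper names `imbalance_ratio` and `random_under_sample` are hypothetical, and only the Python standard library is used.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return (majority count / minority count, per-class counts)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values()), counts


def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class
    has as many rows as the smallest class."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is cut down to the size of the smallest class.
    target = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):  # sampling without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 0 = legitimate, 1 = fraud (a 100-to-1 imbalance).
labels = [0] * 1000 + [1] * 10
rows = list(range(len(labels)))  # stand-in for real feature rows

ratio, counts = imbalance_ratio(labels)
balanced_rows, balanced_labels = random_under_sample(rows, labels)
balanced_counts = Counter(balanced_labels)

print(f"imbalance ratio: {ratio:.0f} to 1")
print("before:", dict(counts))
print("after: ", dict(balanced_counts))
```

Note the trade-off this sketch makes visible: balancing the classes this way discards 990 of the 1,000 majority rows, so random under-sampling buys balance at the cost of information from the majority class.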

Written by Chris Kuo/Dr. Dataman