Unbalanced Datasets & What To Do About Them

German Lahera
Strands Tech Corner
8 min read · Jan 22, 2019

Unbalanced datasets are common across many fields and sectors, and financial services is no exception. From fraud to non-performing loans, data scientists come across them in many contexts. The challenge appears when machine learning algorithms try to identify these rare cases in large datasets. Because the classes of the target variable are so unevenly represented, the algorithm tends to assign observations to the class with more instances, the majority class, while giving a false sense of a highly accurate model. Both the inability to predict rare events, the minority class, and the misleading accuracy detract from the predictive models we build.
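To make the misleading-accuracy point concrete, here is a minimal sketch in plain Python. The 990-to-10 class split and the "always predict the majority class" model are illustrative assumptions, not figures from a real dataset.

```python
from collections import Counter

# Hypothetical labels (an assumption for illustration):
# 990 legitimate transactions (0) and 10 fraud cases (1).
y_true = [0] * 990 + [1] * 10

# A naive "model" that always predicts the most frequent class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

# Accuracy: share of all observations predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: share of actual fraud cases the model catches.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy: {accuracy:.2%}")
print(f"minority recall: {minority_recall:.2%}")
```

Under these assumptions the model scores 99% accuracy without detecting a single fraud case, which is why accuracy alone says little about performance on the minority class.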

The imbalance between the majority and minority classes is frustrating, but not unexpected. We will now discuss the main techniques and methods available for dealing with this type of data. At the end of this post, you will find common Python and R libraries and packages used to address this issue.

What Does an Unbalanced Dataset Mean?

In simple terms, an unbalanced dataset is one in which the target variable has more observations in one specific class than…
