Don't worry about Data Imbalance, balance your data using databalancer

Pradeep Vasudev
4 min read · Sep 30, 2021


Class imbalance is one of the major problems that data scientists and machine learning practitioners face when training classification models. An imbalanced classification problem is one where the distribution of examples across the known classes is skewed. The skew can range from a slight bias to a severe imbalance, with one minority-class example for every hundred, thousand, or million majority-class examples. A model trained on such a dataset is likely to carry that skew and bias into its predictions at inference time in production. Removing the class imbalance from the dataset is therefore one of the first steps data scientists should focus on before model building.

Image source: https://dev.to/bmor2552/binary-classification-problem-random-forest-onehot-encoder-34cg

A dataset is imbalanced when one class has significantly more or fewer examples than the other classes. Most ML/DL classification algorithms aren’t equipped to handle imbalanced classes and tend to become biased towards the majority classes.
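To make the skew concrete, here is a minimal sketch that counts the examples per class and reports the ratio between the largest and the smallest class. The label values and the helper names `class_counts` and `imbalance_ratio` are made up for illustration; they are not part of databalancer.

```python
from collections import Counter


def class_counts(labels):
    """Return a mapping of class name to number of examples."""
    return Counter(labels)


def imbalance_ratio(labels):
    """Return the majority-class size divided by the minority-class size."""
    counts = class_counts(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


# Hypothetical labels: 8 "sports" examples and 2 "politics" examples.
labels = ["sports"] * 8 + ["politics"] * 2
print(class_counts(labels))
print(imbalance_ratio(labels))
```

A ratio of 1.0 means the classes are perfectly balanced; the larger the ratio, the more the majority class dominates.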

There are many tools and techniques for handling imbalanced datasets. If you want to explore the different methods, please check this article. Unfortunately, these techniques are often not enough when we are dealing with text classification problems.

Cool… Don't worry🥳. There is now a new library, databalancer, that helps you resolve the class imbalance issue in text classification problems.

databalancer

Databalancer is a Python library used in machine learning applications to balance imbalanced text classification datasets before model training.


Features

  • Databalancer can balance any imbalanced text classification dataset
  • While balancing an imbalanced dataset, no existing data is removed; instead, new data is generated and added to the dataset
  • For a particular class, the newly generated data are paraphrases of the existing data in that class
  • By default, these paraphrases are generated using the ramsrigouthamg/t5_paraphraser model (you can read more about the model in the official Hugging Face documentation)
  • Databalancer also provides a function called classCountVisualization to show the class count distribution of a dataset
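The balancing idea in the list above can be sketched roughly as follows. This is not databalancer's actual implementation: `balance_by_paraphrasing` and `toy_paraphrase` are hypothetical names, and the toy paraphraser only stands in for a real model such as t5_paraphraser so that the example runs without downloading anything.

```python
from collections import defaultdict
from itertools import cycle


def balance_by_paraphrasing(rows, paraphrase):
    """Grow every class to the size of the largest class.

    rows: list of (text, label) pairs.
    paraphrase: callable taking (text, n) and returning a new text. This
        interface is an assumption, not databalancer's API.
    Original rows are never removed; new rows are only appended.
    """
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append(text)
    if not by_label:
        return []

    target = max(len(texts) for texts in by_label.values())
    balanced = list(rows)  # keep every original row
    for label, texts in by_label.items():
        missing = target - len(texts)
        # Cycle through the class's own texts so new rows stay in-class.
        for n, source in zip(range(missing), cycle(texts)):
            balanced.append((paraphrase(source, n), label))
    return balanced


def toy_paraphrase(text, n):
    """Placeholder for a real paraphrase model (illustration only)."""
    return f"{text} (paraphrase {n + 1})"


rows = [
    ("great match last night", "sports"),
    ("the team won again", "sports"),
    ("a thrilling final", "sports"),
    ("parliament passed the bill", "politics"),
]
for text, label in balance_by_paraphrasing(rows, toy_paraphrase):
    print(label, "|", text)
```

Passing the paraphraser in as a callable keeps the balancing logic separate from the model, so a real paraphrase model could be swapped in without touching the loop.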

Installation

Install the databalancer package with pip. Check the PyPI documentation for more details.

pip install databalancer

Compatibility

Databalancer is only compatible with Python 3.6 or above.
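If you are unsure which interpreter your environment uses, a small check like the one below can be run before installing. It is only a sketch: `python_is_supported` is a hypothetical helper, and the minimum version is taken from the statement above rather than read from the package metadata.

```python
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in this article


def python_is_supported(version_info=None):
    """Return True if the interpreter meets the stated minimum version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    raise SystemExit("databalancer needs Python 3.6 or above")
print("Python version OK:", sys.version.split()[0])
```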

Quick Start

The databalancer library provides two functions.

1 — classCountVisualization

2 — balanceDataset

Dataset

Consider a text classification dataset with 5 classes. Please check this link to access the sample dataset.

Figure: first five lines of the dataset

classCountVisualization

# Import classCountVisualization from the 'databalancer' module
from databalancer import classCountVisualization

# Pass the dataset file name (here dataset.csv) to the function
classCountVisualization("dataset.csv")

Figure: imbalanced dataset class distribution
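For a quick look at the class distribution without a plotting backend, the counts can also be rendered as text bars. The sketch below is a stand-in, not how classCountVisualization works internally. It assumes the CSV has a label column named `label`, which may differ in your dataset, and the demo data is made up.

```python
import csv
import os
import tempfile
from collections import Counter


def count_classes(csv_path, label_column="label"):
    """Count rows per class; the column name 'label' is an assumption."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row[label_column] for row in csv.DictReader(handle))


def text_bar_chart(counts, width=40):
    """Render counts as text bars; the largest class fills `width` chars."""
    if not counts:
        return ""
    largest = max(counts.values())
    name_width = max(len(name) for name in counts)
    lines = []
    for name, count in Counter(counts).most_common():
        bar = "#" * max(1, round(width * count / largest))
        lines.append(f"{name:<{name_width}} {bar} {count}")
    return "\n".join(lines)


# Write a tiny made-up dataset to a temporary file for the demo.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "label"])
        writer.writerows([("a", "sports")] * 6 + [("b", "politics")] * 2)
    demo_counts = count_classes(demo_path)

print(text_bar_chart(demo_counts))
```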

balanceDataset

# Import balanceDataset from the 'databalancer' module
from databalancer import balanceDataset

# Pass the name of the dataset to be balanced (here dataset.csv) to the balanceDataset function
balanceDataset("dataset.csv")

The code above balances the dataset and saves the balanced dataset as ‘balanced_data.csv’ on the local machine.

To show the balanced dataset class count distribution, run the code below.

from databalancer import classCountVisualization

classCountVisualization("balanced_data.csv")

Figure: balanced dataset class distribution
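Besides looking at the plot, the result can be checked programmatically: every class should end up with the same count, and no original row should be missing. The sketch below assumes the CSV columns are named `text` and `label`; those names, and the helpers `read_rows` and `check_balanced`, are assumptions for illustration and not part of databalancer. The demo uses made-up rows written to temporary files.

```python
import csv
import os
import tempfile
from collections import Counter


def read_rows(csv_path, text_column="text", label_column="label"):
    """Read (text, label) pairs; both column names are assumptions."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [(r[text_column], r[label_column]) for r in csv.DictReader(handle)]


def check_balanced(original_rows, balanced_rows):
    """Return (classes_equal, originals_kept) for two lists of rows."""
    counts = Counter(label for _, label in balanced_rows)
    classes_equal = len(set(counts.values())) <= 1
    # Multiset difference: anything left over was dropped during balancing.
    dropped = Counter(original_rows) - Counter(balanced_rows)
    return classes_equal, not dropped


def write_rows(csv_path, rows):
    """Write rows with a header, mirroring the assumed column names."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "label"])
        writer.writerows(rows)


original = [("good game", "sports"), ("nice goal", "sports"), ("new law", "politics")]
balanced = original + [("a fresh law", "politics")]

with tempfile.TemporaryDirectory() as folder:
    before_path = os.path.join(folder, "before.csv")
    after_path = os.path.join(folder, "after.csv")
    write_rows(before_path, original)
    write_rows(after_path, balanced)
    result = check_balanced(read_rows(before_path), read_rows(after_path))

print(result)
```

With real files, the same check would be `check_balanced(read_rows("dataset.csv"), read_rows("balanced_data.csv"))`, after adjusting the column names to match your data.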

Please go through this Colab notebook to see the whole code.

Summary

So that's all about the new tool. Thanks to Ramsri Goutham, the developer of the t5_paraphraser model, which acts as the backbone of databalancer. In upcoming versions of the library, I will try to add more features such as speed improvements, multi-label dataset balancing, and multi-language support.

So buddy, don’t be late, let's balance…🤗

Thanks for reading. Cheers!!😊

If you enjoyed reading this article, give it a like and a share. Follow me for more!
