5 Methods To Deal With Imbalanced Data In Machine Learning

Sankalp Shrivastava
Oct 7, 2022 · 5 min read


In machine learning practice, it is very common to face imbalanced datasets while doing a classification task. An imbalanced dataset is one where one of the target class labels has significantly fewer (or more) samples than the other class labels. Such a dataset can easily mislead your machine-learning model into wrong results. Let's understand this with an example. Assume you are working on a fraud detection model that needs to decide whether a transaction is fraudulent or not, based on previous records of fraudulent transactions. You get a dataset with 10 thousand samples, in which 500 transactions are fraudulent and the rest are not. In such a scenario, your machine-learning model can become biased, simply predict every sample as non-fraudulent, and still report an impressive accuracy of 95 percent.

Let's take another scenario. This time you are in the medical field, working on a model that will predict whether a patient has a rare disease or not, based on their symptoms. You get the records of previous patients who were tested for this disease, which you will use to train your model. You examine the data and find that there are 5,000 samples overall, in which only 50 people have the rare disease. If you feed this data to your machine-learning model, your model can predict every sample as no-disease and still reach an accuracy of 99 percent.

But both of these models will fail miserably in the test environment. Hence, it is essential to balance such data before feeding it to the machine-learning model. It is also good practice to use evaluation metrics like precision, recall, and F1-score in place of accuracy to get a true picture of the model's performance.
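To see why accuracy is misleading here, consider a quick sketch with scikit-learn; the labels below are hypothetical and simply mimic the 95/5 fraud split described above.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical ground truth: 95 non-fraud (0) and 5 fraud (1) samples.
y_true = [0] * 95 + [1] * 5
# A naive model that predicts "non-fraud" for everything.
y_pred = [0] * 100

# Accuracy looks great even though the model never catches a single fraud.
print("Accuracy:", accuracy_score(y_true, y_pred))            # 0.95
# Precision, recall, and F1 for the fraud class are all 0.
print(classification_report(y_true, y_pred, zero_division=0))
```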

Now, how do we deal with such data? Before jumping in, let's understand two terms: the majority class and the minority class.

Majority Class

The majority class is simply the class with the larger number of samples in the dataset. In the above two examples, the 'no-fraud' and 'no-disease' classes were the majority classes of those datasets.

Minority Class

The minority class is the class with fewer samples in the dataset. In the above two examples, the 'fraud' and 'disease' classes were the minority classes of those datasets.

When working on an imbalanced classification problem, ML models are biased toward the majority class, yet the minority class is usually the one of most interest, like the 'disease' and 'fraud' classes in the previous examples.
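Before applying any technique, it helps to check how skewed the labels actually are. Below is a minimal sketch using pandas; the column name is_fraud and the counts are hypothetical.

```python
import pandas as pd

# Hypothetical fraud dataset: 9,500 normal and 500 fraudulent transactions.
df = pd.DataFrame({"is_fraud": [0] * 9500 + [1] * 500})

# value_counts shows the majority (0) and minority (1) classes at a glance.
print(df["is_fraud"].value_counts())
print(df["is_fraud"].value_counts(normalize=True))  # as proportions
```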

Now that you know these terms, let's look at how to deal with imbalanced data. The five techniques we are going to learn about are:

  • Under-Sampling the Majority Class
  • Over-Sampling the Minority Class
  • Over-Sampling the Minority Class using SMOTE
  • Ensemble Method
  • Focal Loss

Let's go through them one by one. Trust me, they are easier than you think.

Under-Sampling the Majority Class

In this method, we shrink the majority class to the size of the minority class. To do this, we randomly delete samples from the majority class until it matches the minority class, which gives us a balanced dataset. This method is called under-sampling.

[Image: Under-sampling]
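Here is a minimal sketch of random under-sampling using the imbalanced-learn (imblearn) package; the dataset is a synthetic, hypothetical one generated only for illustration.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Hypothetical imbalanced dataset: roughly 95% majority, 5% minority.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))

# Randomly drop majority-class samples until both classes are equal.
undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X, y)
print("After:", Counter(y_resampled))
```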

This is definitely not the best approach to balancing a dataset, because we lose so much data. Let's look at its counterpart, which is called over-sampling.

Over-Sampling the Minority Class

In simple over-sampling, we use duplication to grow the minority class to the size of the majority class. We randomly duplicate minority-class samples until they match the majority class. This method is called over-sampling.

[Image: Over-sampling]
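A minimal sketch of random over-sampling with imblearn, using the same kind of synthetic, hypothetical dataset as in the under-sampling sketch.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Hypothetical imbalanced dataset, as in the under-sampling sketch.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Randomly duplicate minority-class samples until both classes are equal.
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)
print("After:", Counter(y_resampled))
```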

There is an improved version of over-sampling, which is called SMOTE. Let’s look into it.

Over-Sampling the Minority Class using SMOTE

The most common technique for over-sampling the minority class is called SMOTE, which stands for Synthetic Minority Over-sampling Technique. In simple over-sampling, duplicating samples doesn't add any new information about the minority class. SMOTE instead creates synthetic minority-class samples: it uses the k-nearest-neighbors algorithm to pick a minority sample, choose one of its nearest minority neighbors at random, and generate a new sample along the line between the two.

[Image: SMOTE]
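Here is a minimal sketch of SMOTE with imblearn, again on a synthetic, hypothetical dataset.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced dataset, as in the earlier sketches.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# SMOTE synthesizes new minority samples from the k nearest minority neighbors
# (k_neighbors=5 by default) instead of duplicating existing ones.
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("After:", Counter(y_resampled))
```

Whichever resampling method you pick, apply it only to the training split; the test data should stay untouched so that evaluation reflects the real class distribution.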

The Python package imblearn (imbalanced-learn), used in the sketches above, provides SMOTE along with the other resampling methods. Another approach to dealing with imbalanced data is the ensemble method. Let's look into it.

Ensemble Method

In this method, we train several models on the same minority class paired with different subsets of the majority class. In each sub-model, the majority-class subset and the minority class are kept in balance. We then run these models on the test data, let them vote, and declare the final prediction to be the label that receives the most votes.

[Image: Ensemble method]
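Here is a hand-rolled sketch of this idea using scikit-learn; the choice of three sub-models, logistic regression, and labels encoded as 0 (majority) and 1 (minority) are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def balanced_ensemble_predict(X_train, y_train, X_test, n_models=3, seed=42):
    """Train n_models, each on the full minority class plus a random,
    equally sized subset of the majority class, then majority-vote."""
    rng = np.random.default_rng(seed)
    minority_idx = np.where(y_train == 1)[0]
    majority_idx = np.where(y_train == 0)[0]

    votes = []
    for _ in range(n_models):
        # Draw a majority-class subset the same size as the minority class.
        subset = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        idx = np.concatenate([minority_idx, subset])
        model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        votes.append(model.predict(X_test))

    # Majority vote: a sample is labeled 1 if most sub-models predict 1.
    return (np.mean(votes, axis=0) >= 0.5).astype(int)
```

imblearn also ships ready-made ensemble estimators (for example BalancedBaggingClassifier) that automate this pattern.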

For a better understanding, let's run through an example. Assume you have 3,000 samples in your majority class and 1,000 samples in your minority class. We train 3 models, each on a different random subset of 1,000 majority-class samples plus the full minority class, so that in each model the majority and minority classes are balanced. After training, we let the models vote and choose the majority vote as the final label. This method is very similar to the Random Forest algorithm, where we take the majority vote of many decision trees. Let's talk about our last technique for the day, which is focal loss.

Focal Loss

Focal loss is mostly used in object detection tasks, where the imbalance between background and object samples is extreme. It is very useful for training on imbalanced datasets in such scenarios. Focal loss doesn't manipulate the data; instead, during loss calculation it down-weights the easy majority-class samples and gives more weight to the minority-class samples.
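As a rough illustration, here is a minimal NumPy sketch of the binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); the values gamma=2 and alpha=0.25 are common defaults from the original focal loss paper, not something specific to this article.

```python
import numpy as np

def binary_focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma > 0 shrinks the loss for easy, well-classified samples
    (typically the majority class); alpha re-weights the positive class.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)

    # p_t is the predicted probability of the true class.
    p_t = np.where(y_true == 1, y_prob, 1 - y_prob)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)

    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))

# An easy majority sample (true 0, predicted 0.1) barely contributes,
# while a hard minority sample (true 1, predicted 0.1) dominates the loss.
print(binary_focal_loss([0, 1], [0.1, 0.1]))
```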

There is a great article by Yash Marathe on focal loss. If you want to dive deeper into it, you can read it here.

That's it for this article. Now you know what data imbalance is and the methods for dealing with it.

Happy Learning.
