Class Imbalance Problems in Deep Learning

Ijaz Khan · Published in unpack · Nov 9, 2020

In the data science world, when we train models, we face severe issues with data cleaning. Data cleaning takes almost 90 percent of our time, and after everything is set and the model is ready for training, we may still find that the results are not satisfactory. These unsatisfactory results make us curious about what's going on, and we want to explore every minor detail to make our model perfect. One problem we face much of the time is that the dataset we are dealing with is not balanced. In other words, there is a class imbalance problem in the data: we have more data for some classes and less for others, which can bias predictions toward the classes with more data. There are two main methods to handle class imbalance problems, i.e.,

1. Data-Level methods
2. Algorithm-Level methods

Data-Level methods:

These methods work on the dataset itself to remove the imbalance so that the classifier can show better results on imbalanced data as well. They include Under-Sampling and Over-Sampling techniques. Under-Sampling removes instances of the majority class to restore balance, while Over-Sampling is the opposite: it replicates instances of the minority class, either randomly or with a specific custom algorithm.
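As a minimal sketch of both techniques (the dataset here is hypothetical, and scikit-learn's `resample` utility stands in for whatever sampling routine you prefer):

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 1000 majority (class 0), 100 minority (class 1)
rng = np.random.default_rng(42)
X = rng.normal(size=(1100, 5))
y = np.array([0] * 1000 + [1] * 100)

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Over-Sampling: replicate minority instances (with replacement) up to majority size
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)
X_over = np.vstack([X_maj, X_min_up])
y_over = np.concatenate([y_maj, y_min_up])

# Under-Sampling: drop majority instances down to minority size
X_maj_down, y_maj_down = resample(X_maj, y_maj, replace=False,
                                  n_samples=len(y_min), random_state=42)
X_under = np.vstack([X_maj_down, X_min])
y_under = np.concatenate([y_maj_down, y_min])

print(np.bincount(y_over))   # [1000 1000]
print(np.bincount(y_under))  # [100 100]
```

Note that random Over-Sampling simply duplicates minority rows; dedicated algorithms such as SMOTE instead synthesize new minority examples.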

Many studies have found that Over-Sampling performs better than Under-Sampling.

Algorithm-Level methods:
Algorithm-Level methods fall into two subcategories: cost-sensitive methods and hybrid methods.
Cost-sensitive methods assign a higher weight to the learner or the instance when a misclassification happens. For example, a false positive may be assigned a lower weight than a false negative if the latter corresponds to the class or label of interest.
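As a minimal sketch of the cost-sensitive idea, assuming PyTorch and a hypothetical two-class setup, per-class weights can be passed to the loss so that misclassifying the minority class (the label of interest) costs more:

```python
import torch
import torch.nn as nn

# Hypothetical setup: class 0 is the majority, class 1 the minority label of interest.
# Weight each class inversely to its frequency so minority errors cost more.
class_counts = torch.tensor([1000.0, 100.0])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)  # [0.55, 5.5]

# CrossEntropyLoss accepts a per-class weight tensor
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)          # dummy model outputs for a batch of 8
labels = torch.randint(0, 2, (8,))  # dummy ground-truth labels
loss = criterion(logits, labels)    # minority-class errors are up-weighted
print(loss.item())
```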
Hybrid methods are designed to handle the problems that arise from data-sampling methods, cost-sensitive methods, basic learning algorithms (e.g., Naive Bayes), or feature selection methods.

In some cases, the sub-groups of Algorithm-Level and Data-Level methods are combined to solve the class imbalance problem, as in the sketch below.
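As a rough sketch of such a combination, again assuming PyTorch and hypothetical data, one can over-sample the minority class at the data level with a weighted sampler and also up-weight it at the algorithm level in the loss:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# Hypothetical imbalanced dataset: 1000 majority, 100 minority examples
X = torch.randn(1100, 5)
y = torch.cat([torch.zeros(1000, dtype=torch.long),
               torch.ones(100, dtype=torch.long)])
dataset = TensorDataset(X, y)

# Data-Level: sample each example with probability inverse to its class frequency,
# so batches are roughly balanced (minority instances get replicated)
class_counts = torch.bincount(y).float()
sample_weights = (1.0 / class_counts)[y]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Algorithm-Level: additionally up-weight the minority class in the loss
criterion = torch.nn.CrossEntropyLoss(weight=class_counts.sum() / (2 * class_counts))

xb, yb = next(iter(loader))
print(torch.bincount(yb))  # roughly balanced batch
```

In practice the two corrections should be tuned together rather than stacked blindly, since each one alone already shifts the effective class balance.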

There are many resources available for understanding class imbalance problems and the techniques to solve them. Below are some links that are useful for a quick start.

  1. https://www.kaggle.com/tanlikesmath/oversampling-mnist-with-fastai#Creating-imbalanced-dataset
    This tutorial is based on Fastai V1 and solves the class imbalance problem using an imbalanced MNIST dataset. However, Fastai V2 is totally independent of Fastai V1, so porting it to Fastai V2 will take additional work.
  2. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-018-0151-6
    This is the survey paper “A survey on addressing high-class imbalance in big data”. It is a very detailed study of the class imbalance problem, and I recommend it for a clear understanding of the problem.

Hope you guys find this article helpful and show your appreciation with claps. Peace :)
