Tips for Handling the Class Imbalance Problem

Gowthami Wudaru
School of ML

An imbalanced classification problem is a classification problem in which the distribution of examples across the known classes is biased or skewed. The imbalance can range from a slight bias to a severe skew where there is one example in the minority class for every hundreds, thousands, or millions of examples in the majority class or classes.

Imbalanced classifications pose a challenge for predictive modeling because most machine learning algorithms used for classification were designed around the assumption of an equal number of examples per class. This results in models with poor predictive performance, especially for the minority class. That matters because the minority class is typically the more important one, so the problem is more sensitive to classification errors on the minority class than on the majority class.

This problem is predominant in scenarios where anomaly detection is crucial, such as electricity pilferage, fraudulent transactions in banks, identification of rare diseases, claim prediction, churn prediction, spam detection, outlier detection, intrusion detection, etc. These types of problems can generally be classified as rare event prediction, extreme event prediction, or severe class imbalance.

We define the majority class as the class (or classes) in an imbalanced classification predictive modeling problem that has many examples, and the minority class as the class that has few examples.

We are going to discuss two ways to deal with the class imbalance problem: weighted loss and re-sampling.

Weighted Loss:

Deep learning has been applied to prognostics and health management of automotive and aerospace systems with promising results. The literature in this area shows that most contributions focus on the model’s architecture, while contributions that improve other aspects of deep learning, such as custom loss functions for prognostics and health management, are scarce. There is, therefore, an opportunity to improve the effectiveness of deep learning for system prognostics and diagnostics without modifying the models’ architectures. To address this gap, different weighted loss functions are being investigated.

In this method, we change the loss function to ensure “equity” between the classes. The binary cross-entropy loss function is as follows:

L(X, y) = -log(P(Y=1|X)) if y = 1

L(X, y) = -log(P(Y=0|X)) if y = 0

If we change it so that

L′(X, y) = -w1 · log(P(Y=1|X)) if y = 1

L′(X, y) = -w2 · log(P(Y=0|X)) if y = 0

where w1 = (class0 cases)/(total cases) and w2 = (class1 cases)/(total cases).

Here class0 and class1 represent the majority and minority classes, respectively. Note that each class’s loss term is weighted by the other class’s fraction: the rarer the class, the larger the weight on its errors.
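As a minimal sketch, this weighted loss could look like the following in NumPy (the names weighted_bce and class_weights are ours, for illustration, not from any library):

import numpy as np

def class_weights(y_true):
    # w1 = (class0 cases)/(total cases), w2 = (class1 cases)/(total cases)
    n = len(y_true)
    w1 = np.sum(y_true == 0) / n
    w2 = np.sum(y_true == 1) / n
    return w1, w2

def weighted_bce(y_true, y_pred, w1, w2, eps=1e-7):
    # Weighted binary cross-entropy: w1 scales the loss on class-1 (minority)
    # examples, w2 scales the loss on class-0 (majority) examples.
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_example = -(w1 * y_true * np.log(y_pred)
                    + w2 * (1 - y_true) * np.log(1 - y_pred))
    return per_example.sum()

If you would rather not hand-roll the loss, mainstream frameworks offer similar hooks, for example the class_weight argument of Keras’s model.fit or the pos_weight argument of PyTorch’s BCEWithLogitsLoss.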

We can check this with the following example.

Suppose there are 8 records, each predicted with a probability of 0.5 (random guessing), of which 6 belong to class0 (the majority class). Each record then has a loss of -log(0.5) ≈ 0.3 (using base-10 logarithms). Thus, in the final loss:

L(due to class0) = 0.3 × 6 = 1.8

L(due to class1) = 0.3 × 2 = 0.6

Since we are trying to correctly predict class1 cases, this is problematic: the error due to class1 is much smaller than the error due to class0, so the model has little incentive to fix minority-class mistakes.

According to the formulae, w1 = 6/8 and w2 = 2/8. Then:

L′(due to class0) = (2/8) × 0.3 × 6 = 0.45

L′(due to class1) = (6/8) × 0.3 × 2 = 0.45

The error from both classes is now the same under random guessing.
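We can verify this arithmetic with a small hypothetical NumPy snippet (base-10 logs, to match the ≈0.3 per-record loss above):

import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # 6 class0 records, 2 class1 records
p = np.full(len(y), 0.5)                 # random guessing: P(Y=1|X) = 0.5

# per-record loss; -log10(0.5) ~= 0.3 for every record
loss = -(y * np.log10(p) + (1 - y) * np.log10(1 - p))

w1 = np.sum(y == 0) / len(y)  # 6/8, weights the class1 terms
w2 = np.sum(y == 1) / len(y)  # 2/8, weights the class0 terms
weighted = np.where(y == 1, w1, w2) * loss

print(loss[y == 0].sum(), loss[y == 1].sum())          # ~1.8 and ~0.6
print(weighted[y == 0].sum(), weighted[y == 1].sum())  # ~0.45 and ~0.45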

Re-sampling:

The idea is as follows: suppose we have a data set with 10 rows, of which 7 (rows 1, 2, 3, 5, 6, 8, 10) belong to the majority class and 3 (rows 4, 7, 9) to the minority class. We can re-sample the data by taking 5 rows from the majority class (say rows 3, 6, 1, 8, 2) and 5 from the minority class (say rows 4, 7, 9, 7, 4). By doing this, we lose some majority-class data and gain duplicates of minority-class data. There are two ways to achieve this:

  1. You can add copies of instances from the under-represented class, called over-sampling (or, more formally, sampling with replacement), or
  2. You can delete instances from the over-represented class, called under-sampling; a sketch of both appears below.
[Figure: under-sampling and over-sampling (re-sampling)]
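As a minimal sketch of both ideas together (the helper resample_balanced is hypothetical, assuming labels are 0 for the majority class and 1 for the minority class):

import numpy as np

rng = np.random.default_rng(42)

def resample_balanced(X, y, n_per_class):
    # Under-sample the majority class (without replacement) and
    # over-sample the minority class (with replacement, so duplicates appear).
    idx_major = np.flatnonzero(y == 0)
    idx_minor = np.flatnonzero(y == 1)
    keep_major = rng.choice(idx_major, size=n_per_class, replace=False)
    keep_minor = rng.choice(idx_minor, size=n_per_class, replace=True)
    keep = np.concatenate([keep_major, keep_minor])
    return X[keep], y[keep]

# The 10-row example above: rows 4, 7, 9 (1-based) are the minority class.
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 0, 1, 0, 1, 0])
X_bal, y_bal = resample_balanced(X, y, n_per_class=5)
print(y_bal)  # 5 zeros and 5 ones

For real projects, the imbalanced-learn library provides RandomUnderSampler and RandomOverSampler (and smarter variants such as SMOTE) that implement the same ideas.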

Each approach has disadvantages. Under-sampling can discard potentially useful information that could be important for building rule classifiers, and the sample chosen by random under-sampling may be biased, so it will not be an accurate representative of the population, leading to inaccurate results on the actual test data set. Over-sampling increases the likelihood of over-fitting, since it replicates the minority-class events.
