Class Imbalance Problem and Ways To Handle It.

Sandhya Krishnan · Published in Nerd For Tech · Sep 26, 2021 · 6 min read

Classification in machine learning refers to a predictive modeling problem where a class label is predicted for a given input. The label or target may belong to two classes or to more than two classes.

Class imbalance occurs when most of the data belongs to one class label. It can occur in both two-class and multiclass classification. Many machine learning algorithms assume the data is equally distributed across classes, so when there is a class imbalance the classifier tends to be biased towards the majority class and classifies the minority class poorly. This happens because the conventional cost function optimizes quantities such as the overall error rate without taking the class distribution into account.

Check out the Python code to analyze a class-imbalanced dataset here.

How imbalanced is the dataset?

Two metrics, balanced_accuracy_score and accuracy_score, can be compared to gauge how imbalanced the classes are.

balanced_accuracy_score computes the balanced accuracy, which is suited to imbalanced datasets in both binary and multiclass classification problems. It is defined as the average of the recall obtained on each class.

accuracy_score, on the other hand, is the plain classification accuracy. For a balanced dataset the difference between these two scores is zero, that is, balanced_accuracy_score becomes equivalent to accuracy_score; a large gap between them is a sign of imbalance.
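For instance, here is a minimal sketch using scikit-learn (the labels below are made up purely for illustration) of how the gap between the two scores exposes an imbalance that plain accuracy hides:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 90 negatives, 10 positives,
# and a classifier that always predicts the majority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))           # 0.90, looks good
print(balanced_accuracy_score(y_true, y_pred))  # 0.50, exposes the imbalance
```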

Ways to handle Imbalanced Class

1. Changing the Performance Metric:

For an imbalanced dataset, a machine learning model can predict the majority class for all inputs and still achieve high classification accuracy, even though it is a poor classifier for the minority class. This is called the accuracy paradox.

To overcome this, other performance metrics should be considered for evaluation, such as the confusion matrix, precision, recall, F1 score, and Area Under the ROC Curve.

The confusion matrix is used for summarizing the performance of a classification algorithm. It contains:

  • True Positive: Positive outcome is correctly predicted as positive.
  • True Negative: Negative outcome is correctly predicted as negative.
  • False Positive: Negative outcome is wrongly predicted as positive.
  • False Negative: Positive outcome is wrongly predicted as negative.
(Figure: Confusion matrix)

A Type I error (false positive), also known as an error of the first kind, is the mistaken rejection of the null hypothesis as the result of a test procedure, that is, a negative outcome wrongly predicted as positive. A Type II error (false negative), also known as an error of the second kind, is the mistaken acceptance of the null hypothesis, that is, a positive outcome wrongly predicted as negative.

Precision tells us, when we predict a positive outcome, how confident we can be that it is a true positive. Mathematically, it is the proportion of true positives among all positive predictions.

Recall is the proportion of true positives among all actual positive elements. Recall is also known as the true positive rate.

The F1 score is the harmonic mean of precision and recall.

Area Under ROC Curve: The Receiver Operating Characteristic (ROC) curve summarizes the performance of a classifier over a range of trade-offs between the true positive rate and the false positive rate. The Area Under the ROC Curve (AUC) is used as a single performance metric; it represents the likelihood that the model can distinguish observations from the two classes.
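As a rough sketch of how these metrics can be computed with scikit-learn (the labels and probabilities below are made up purely for illustration):

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Hypothetical true labels, predicted labels, and predicted probabilities.
y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
y_proba = [0.1, 0.2, 0.15, 0.3, 0.4, 0.6, 0.8, 0.7, 0.45, 0.35]

print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_proba))    # area under the ROC curve
```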

2. Random Resampling:

It consists of oversampling the minority class and undersampling the majority class.

Suppose our target has 20,000 records, of which 19,900 belong to the majority class and 100 belong to the minority class.

In oversampling of the minority class, more records are added to the minority class until it equals the size of the majority class. In our example, once oversampling is done the minority class will have 19,900 records, the same as the original majority class.

It is recommended when the dataset is not too large. The main disadvantage of this method is that it can lead to overfitting, because the minority records are simply duplicated.

In undersampling of the majority class, records from the majority class are randomly removed. In our example, once undersampling is done the majority class will have 100 records, the same as the original minority class. Undersampling therefore leads to loss of information, so it is recommended for large datasets, where losing some records is less of a problem.

Moreover, undersampling can also lead to underfitting and poor generalization on the test set.
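A minimal sketch of both resampling strategies, assuming the imbalanced-learn package (imblearn) is available; the synthetic dataset only approximates the 19,900/100 split used above:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic data with a heavy class imbalance (roughly 99.5% vs 0.5%).
X, y = make_classification(n_samples=20_000, weights=[0.995], random_state=42)
print(Counter(y))

# Oversampling: duplicate minority records until the classes are balanced.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_over))

# Undersampling: randomly drop majority records until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))
```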

3. SMOTE: Synthetic Minority Over-sampling TEchnique:

SMOTE creates synthetic samples for the minority class, based on the records it already has, in order to reach an equal balance between the minority and majority classes. It randomly picks a point from the minority class and then computes the k nearest neighbors of that point.

Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen. The synthetic points are added between the chosen point and its neighbors.

SMOTE retains more information than random undersampling, since no records are removed, but the model will require more time to train.
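A short sketch of SMOTE, again assuming the imbalanced-learn package is installed; k_neighbors is the number of nearest minority neighbours used to interpolate each synthetic point:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=20_000, weights=[0.995], random_state=42)
print(Counter(y))

# Each synthetic point is interpolated between a minority record and one of
# its k nearest minority-class neighbours (5 is the library default).
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))
```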

4. Algorithmic Ensemble Techniques:

Here, n different classifiers are trained on the same problem and their predictions are aggregated. An ensemble typically achieves higher accuracy than the individual classifiers it is built from. A random forest, for example, consists of many decision trees and utilizes ensemble learning.

The main objective of the algorithmic ensemble technique is to improve performance and provide solutions to complex problems.
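As one illustrative sketch (not the only possible ensemble), a random forest in scikit-learn aggregates the votes of many decision trees trained on bootstrap samples of the same data; the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

X, y = make_classification(n_samples=20_000, weights=[0.995], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 100 decision trees, each trained on a bootstrap sample; predictions are aggregated.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(balanced_accuracy_score(y_test, forest.predict(X_test)))
```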

5. Use Tree-Based Algorithms:

Decision tree learning is a predictive modeling approach used to address classification problems in statistics, data mining, and machine learning. A decision tree has an upside-down tree-like structure that represents decisions and decision-making. Tree-based models often perform well on imbalanced datasets because their hierarchical splitting allows them to learn from both classes.
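A minimal sketch with scikit-learn's DecisionTreeClassifier; note that class_weight='balanced' is an extra option (not discussed above) that re-weights each class inversely to its frequency:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score

X, y = make_classification(n_samples=20_000, weights=[0.995], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' keeps the splits from being dominated by the majority class.
tree = DecisionTreeClassifier(class_weight="balanced", random_state=42)
tree.fit(X_train, y_train)
print(balanced_accuracy_score(y_test, tree.predict(X_test)))
```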

6. XGBoost — Extreme Gradient Boosting

XGBoost is a short form of Extreme Gradient Boosting.

Gradient boosting is a powerful ensemble machine learning algorithm which combines many weak classifiers to provide high performance and solve complex problems. XGBoost is a refined and customized version of a gradient-boosted decision tree system. It implements parallel processing and, as a result, has a high execution speed. It also has an inbuilt mechanism to handle missing data.

In gradient boosting, decision trees are fitted one at a time by minimizing the error gradient. A standard implementation stops splitting a node as soon as it encounters a negative loss, whereas XGBoost splits up to the maximum depth specified and then prunes the tree backward to remove redundant comparisons or subtrees.

Extreme gradient boosting can be done using the XGBoost package in R and Python.
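A minimal Python sketch, assuming the xgboost package is installed; scale_pos_weight (the ratio of negative to positive examples) is one common way to make XGBoost pay more attention to the minority class:

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

X, y = make_classification(n_samples=20_000, weights=[0.995], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Up-weight the minority (positive) class by the negative/positive ratio.
ratio = (y_train == 0).sum() / (y_train == 1).sum()
model = XGBClassifier(n_estimators=200, max_depth=4, scale_pos_weight=ratio,
                      eval_metric="logloss", random_state=42)
model.fit(X_train, y_train)
print(balanced_accuracy_score(y_test, model.predict(X_test)))
```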

Thanks for reading!!!! If this article was helpful to you, feel free to clap, share and respond.
