Three ways to handle imbalanced data

clarence wu
Nov 6, 2022

When dealing with classification problems, one situation many people may encounter is imbalanced data.

What is imbalanced data?

Imbalanced data means the target class labels are unequally distributed across the dataset. For example, out of billions of financial transactions, only a few are identified as fraud. Fraud events may account for a very small proportion, such as 0.05% or 1%, which is far smaller than the proportion of regular transactions. Classes that make up a large proportion of the dataset are called majority classes. Those that make up a smaller proportion are minority classes.

The degree of imbalance depends on the proportion of the minority classes: the smaller that proportion, the more severe the imbalance.
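To make the definition concrete, here is a minimal sketch that measures class proportions with NumPy. The helper name class_proportions and the toy labels are my own assumptions for illustration, not part of any library.

```python
import numpy as np

def class_proportions(y):
    """Return {label: fraction of samples} for a 1-D array of integer labels."""
    labels, counts = np.unique(y, return_counts=True)
    return {int(label): float(count) / len(y) for label, count in zip(labels, counts)}

# Hypothetical toy labels: 5 fraud cases (label 1) among 1000 transactions.
y_toy = np.array([1] * 5 + [0] * 995)

props = class_proportions(y_toy)
minority_label = min(props, key=props.get)  # label with the smallest share

print(props)
print("minority class:", minority_label)
```

The smaller the minority share, the more attention the rest of this article deserves.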

Why do we need to handle imbalanced data?

When we feed imbalanced data to an algorithm, training is dominated by the majority classes and the model learns little from the minority classes, effectively “ignoring the minority classes”. If we do nothing about this, the model may wrongly classify minority samples as the majority class, for example by failing to recognize fraud.

Three ways to handle an imbalanced dataset

Now that we understand what an imbalanced dataset is and why we need to deal with it, I will introduce three ways to handle it: undersampling, oversampling and adjusting class weights. I will use some sample data to show how to use these three methods.

Create an imbalanced dataset

First, let us create an imbalanced dataset using make_classification from Scikit-learn.

import pandas as pd
from sklearn.datasets import make_classification

# 5000 samples; class 0 is the ~1% minority, class 1 the ~99% majority
X, y = make_classification(
    n_samples=5000, weights=[0.01, 0.99],
    random_state=0, n_clusters_per_class=1)
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})
df['target'].value_counts(normalize=True)
# 1 0.9842
# 0 0.0158
# Name: target, dtype: float64

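The code above generates 5000 data points. We asked for 99% of them to belong to class 1 and 1% to class 0; the observed split is 98.42% versus 1.58% because make_classification randomly flips a small fraction of labels by default (its flip_y parameter).

Then, split the dataset into train and test sets. The parameter stratify=y means the data is split in a stratified fashion, so the train and test sets keep the same class proportions as y.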

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
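To see what stratify does, here is a small sketch on made-up data (X_demo and y_demo are hypothetical names, separate from the dataset used in this article). It compares the minority share in the train and test parts after a stratified split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical demo data: 20 minority samples (label 0) out of 1000.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))
y_demo = np.array([0] * 20 + [1] * 980)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# With stratify, both parts keep the original 2% minority share.
print("minority share in train:", (y_tr == 0).mean())
print("minority share in test: ", (y_te == 0).mean())
```

Without stratify, a random split of such a small minority class could by chance leave the test set with very few minority samples, or none at all.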

Then, we will train a RandomForestClassifier directly on the imbalanced dataset and do nothing about the imbalance. This run can be regarded as a ‘control group’. We can then change the way we deal with the imbalance and compare the performance with the ‘control group’.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

The next step is to choose the right metrics to evaluate the performance of the model. For classification tasks, accuracy, F1-score, the ROC curve and the confusion matrix are all options. However, for an imbalanced dataset we should be careful, because some metrics such as F1-score and accuracy may “tell a lie”. Let me show you!

from sklearn.metrics import f1_score, accuracy_score

print("F1 Score is ", f1_score(y_test, clf.predict(X_test)))
print("Accuracy Score is ", accuracy_score(y_test, clf.predict(X_test)))
# F1 Score is 0.9919354838709677
# Accuracy Score is 0.984

The performance seems great: the F1 score is 0.992, and the accuracy score is 0.984. Both of them are close to 1, which suggests the model predicts nearly all test samples correctly. But does the model really perform as well as the metrics show?

Let us use a confusion matrix to evaluate the model.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.show()

From the confusion matrix, every class 0 sample is wrongly classified as class 1, so the recall for class 0 is 0. The F1 score still looks high because f1_score scores the positive label 1 by default, and here label 1 is the majority class. Such misclassifications are highly detrimental when detecting rare fraud or predicting uncommon malignant diseases. So for an imbalanced dataset, we cannot simply rely on the F1-score or accuracy score to evaluate the model. Now that we understand the metric problem, let us try the three methods for handling an imbalanced dataset and see how they reduce the misclassification.
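As a rough illustration of metrics that do expose this failure, the sketch below scores a classifier that always predicts the majority class. The labels are hypothetical (16 minority and 984 majority samples, chosen to resemble the test set above), and the choice of minority recall and balanced accuracy is my suggestion rather than the only option.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, recall_score)

# Hypothetical test labels: 16 minority (0) and 984 majority (1) samples.
y_true = np.array([0] * 16 + [1] * 984)
# A model that ignores the minority class and always predicts 1.
y_pred = np.ones_like(y_true)

acc = accuracy_score(y_true, y_pred)
f1_majority = f1_score(y_true, y_pred)  # default pos_label=1 (the majority here)
f1_minority = f1_score(y_true, y_pred, pos_label=0, zero_division=0)
recall_minority = recall_score(y_true, y_pred, pos_label=0)
balanced_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall

print("accuracy:          ", acc)
print("F1 (majority as +):", f1_majority)
print("F1 (minority as +):", f1_minority)
print("recall (minority): ", recall_minority)
print("balanced accuracy: ", balanced_acc)
```

Accuracy and the default F1 stay close to 1, while the minority-focused scores drop to 0 and balanced accuracy to 0.5, which matches what the confusion matrix tells us.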

Undersampling

The first method is undersampling, which means removing samples from the majority class until the class labels are balanced.

RandomUnderSampler, provided by imbalanced-learn, is a fast and easy way to implement undersampling by randomly selecting a subset of data for the targeted classes. Let us try it.

from imblearn.under_sampling import RandomUnderSampler

under_sampler = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = under_sampler.fit_resample(X_train, y_train)
print("Number of records for X_train is ", X_train.shape)
print("Number of records for X_resampled undersampling is ",X_resampled.shape)
# Number of records for X_train is (4000, 20)
# Number of records for X_resampled undersampling is (126, 20)

Let’s check the target class distribution after resampling.

df = pd.DataFrame({'label':y_resampled})
df.value_counts(normalize=True)
# label
# 0 0.5
# 1 0.5
# dtype: float64

We can see that after undersampling, the size of the training dataset dropped from 4000 to 126, and the proportion of each target class is 0.5.

from sklearn.metrics import ConfusionMatrixDisplay

# train on the undersampled data; evaluate on the untouched test set
uclf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_resampled, y_resampled)
print("F1 Score is ", f1_score(y_test, uclf.predict(X_test)))
print("Accuracy Score is ", accuracy_score(y_test, uclf.predict(X_test)))
# F1 Score is 0.9369951534733441
# Accuracy Score is 0.883
# ConfusionMatrixDisplay replaces plot_confusion_matrix (removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(uclf, X_test, y_test)

We can see that only 3 minority samples are wrongly classified, which is a huge improvement compared with our ‘control group’. However, the model performs worse on the majority class: 114 majority samples are incorrectly predicted. The reason is that undersampling throws away most of the majority samples, so we lose too much information about the majority class.
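To show the idea behind random undersampling, here is a minimal NumPy sketch. It is my own simplified version, not the imbalanced-learn implementation, and the function name random_undersample and the toy data are assumptions for illustration.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep a random subset of every class, sized to the smallest class."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        # sample without replacement: no row is kept twice
        rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
        for label in labels
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Hypothetical toy data: 3 minority rows (label 0) and 17 majority rows (label 1).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 3 + [1] * 17)

X_small, y_small = random_undersample(X_toy, y_toy)
print(X_small.shape, np.bincount(y_small))  # (6, 2) [3 3]
```

In this toy case 14 of the 17 majority rows are discarded, which is exactly the information loss described above.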

Oversampling

Oversampling refers to creating artificial or duplicate data points of the minority class to balance the class labels.

RandomOverSampler over-samples the minority class(es) by picking samples at random with replacement.

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
print("Number of records for X_train is ", X_train.shape)
print("Number of records for X_resampled oversampling is ", X_resampled.shape)
# Number of records for X_train is (4000, 20)
# Number of records for X_resampled oversampling is (7874, 20)
# check target distribution
df = pd.DataFrame({'target':y_resampled})
df.value_counts(normalize=True)
# target
# 0 0.5
# 1 0.5
# dtype: float64

The size of the training dataset goes from 4000 to 7874 by over-sampling. Let us train and evaluate the model with the new dataset.

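from sklearn.metrics import ConfusionMatrixDisplay

# train on the oversampled data; evaluate on the untouched test set
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_resampled, y_resampled)
print("F1 Score is ", f1_score(y_test, clf.predict(X_test)))
print("Accuracy Score is ", accuracy_score(y_test, clf.predict(X_test)))
# F1 Score is 0.990311065782764
# Accuracy Score is 0.981
# ConfusionMatrixDisplay replaces plot_confusion_matrix (removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)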

After oversampling, only 6 class 0 samples are incorrectly classified. Besides, the error rate for class 1 is also lower than with undersampling. This method seems great; however, oversampling may lead to overfitting, because the model sees exact copies of the same few minority samples many times.
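The overfitting risk is easier to see in a minimal NumPy sketch of random oversampling with replacement. This is a simplified version written for illustration, not the imbalanced-learn implementation; random_oversample and the toy data are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Grow every class to the size of the largest one by re-drawing its rows."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_target = counts.max()
    parts = []
    for label, count in zip(labels, counts):
        idx = np.flatnonzero(y == label)
        # draw the missing rows with replacement, so rows repeat
        extra = rng.choice(idx, size=n_target - count, replace=True)
        parts.append(np.concatenate([idx, extra]))
    idx_all = np.concatenate(parts)
    return X[idx_all], y[idx_all]

# Hypothetical toy data: 3 minority rows (label 0) and 17 majority rows (label 1).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 3 + [1] * 17)

X_big, y_big = random_oversample(X_toy, y_toy)
n_unique_minority = len(np.unique(X_big[y_big == 0], axis=0))

print(X_big.shape, np.bincount(y_big))  # (34, 2) [17 17]
print("distinct minority rows:", n_unique_minority)  # 3
```

The minority class now has 17 rows, but still only 3 distinct ones. No new information was added, so a flexible model can memorize those few points.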

Adjust class weight

The last way is to adjust the class weight. As I mentioned earlier, the model effectively treats the majority class as more important than the minority class. To balance this unequal importance, we can adjust the weights and give more importance to the minority class. After applying a weight to each class in the loss function (or, for tree-based models such as random forests, in the split criterion), errors on the minority class count more than errors on the majority class, so that both classes end up with a similar overall influence during learning. Let us try it.

from sklearn.metrics import ConfusionMatrixDisplay

# each class 0 (minority) sample counts 40 times as much as a class 1 sample
weighted_clf = RandomForestClassifier(max_depth=2, random_state=0, class_weight={0: 40, 1: 1})
weighted_clf.fit(X_train, y_train)
print("F1 Score is ", f1_score(y_test, weighted_clf.predict(X_test)))
print("Accuracy Score is ", accuracy_score(y_test, weighted_clf.predict(X_test)))
# F1 Score is 0.9964556962025316
# Accuracy Score is 0.993
# ConfusionMatrixDisplay replaces plot_confusion_matrix (removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(weighted_clf, X_test, y_test)
plt.show()

The model performs well on both the majority and minority classes. The one extra job is choosing appropriate weights, and we can use grid search to do it. One recommended way is to start with the inverse of each class's proportion. For example, if the proportion of the minority class is 0.1, then its weight is 1/0.1 = 10; the proportion of the majority class is 0.9, so its weight is 1/0.9 ≈ 1.11.
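The inverse-proportion rule can be sketched as follows. The toy labels are hypothetical. For comparison, I also compute scikit-learn's 'balanced' weights, which use n_samples / (n_classes * count) and therefore differ from the plain inverse proportions only by a constant factor.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 10% minority (label 0), 90% majority (label 1).
y_toy = np.array([0] * 100 + [1] * 900)

labels, counts = np.unique(y_toy, return_counts=True)
proportions = counts / counts.sum()

# Inverse-proportion weights, as described above: 1/0.1 and 1/0.9.
inverse_weights = {int(label): float(1.0 / p) for label, p in zip(labels, proportions)}

# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * count).
balanced = compute_class_weight(class_weight="balanced", classes=labels, y=y_toy)

print("inverse-proportion weights:", inverse_weights)
print("sklearn 'balanced' weights:", dict(zip(labels.tolist(), balanced.tolist())))
```

Either dictionary can be passed as class_weight to RandomForestClassifier, and passing class_weight='balanced' lets scikit-learn compute the second set itself. Treat these values as a starting point for a grid search rather than a final answer.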

Conclusion

In this article, I introduced what imbalanced data is and why we need to handle it. I also introduced three ways to deal with this problem, along with their drawbacks. Be careful to choose the right metrics to evaluate the model in this situation. Besides these three, SMOTE (Synthetic Minority Oversampling Technique) is another popular method, which generates synthetic data points for the minority class.


clarence wu

data science master graduated from Glasgow University