Imbalanced Classes

Anjali Dharmik
Feb 1, 2023 · 8 min read


When the number of observations in one class is much higher than in the other classes, a class imbalance exists.

When the imbalance is caused by sampling bias or measurement error, it can often be corrected by improved sampling methods and/or by correcting the measurement error, because in those cases the training dataset is not a fair representation of the problem domain being addressed.

A classification data set with biased or skewed class proportions is called imbalanced. Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes.

What counts as imbalanced? The answer could range from mild to extreme, as the breakdown below shows.

Degree of imbalance, as a proportion of the data set in the minority class:

Mild: 20–40% of the data set

Moderate: 1–20% of the data set

Extreme: <1% of the data set

Example: detecting mobile application popularity in a marketplace. In this data set, around 1,450 apps are labeled high popularity compared with around 525 labeled low popularity.
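That imbalance is easy to see in a bar chart of class counts. Below is a minimal sketch that uses the approximate counts quoted above rather than the real data file:

# a minimal sketch of visualizing class balance; the counts are the
# approximate figures quoted above, not values loaded from the dataset
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.Series({'high': 1450, 'low': 525}, name='popularity')
counts.plot(kind='bar', title='Mobile app popularity')
plt.ylabel('Number of apps')
plt.show()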

Problem with Class Imbalance in Machine Learning

Class imbalance is a common problem in machine learning, especially in classification. Imbalanced data can seriously hurt model quality: with so few positives relative to negatives, the model will spend most of its training time on negative examples and not learn enough from positive ones.

Most machine learning algorithms work best when the number of samples in each class is about equal. This is because most algorithms are designed to maximize accuracy and reduce errors.

However, if the data set is imbalanced, you can get a pretty high accuracy just by always predicting the majority class, while failing to capture the minority class, which is most often the point of creating the model in the first place.
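To see this "accuracy paradox" concretely, here is a minimal sketch in which a classifier that always predicts the majority class scores 90% accuracy while recalling none of the minority class (the data is synthetic and purely illustrative):

# a 90/10 toy problem: always predicting the majority class gives
# 90% accuracy but 0% recall on the minority class
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 900 + [1] * 100)   # 90% negatives, 10% positives
X = np.zeros((1000, 1))               # features are irrelevant here

clf = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = clf.predict(X)

print('Accuracy:', accuracy_score(y, pred))  # 0.9
print('Recall:', recall_score(y, pred))      # 0.0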

Handling Class Imbalance

If you have an imbalanced data set, first try training on the true distribution. If the model works well and generalizes, you're done! If not, try the resampling and downsampling-with-upweighting techniques described below.

Resampling Technique

One of the most widely adopted techniques for dealing with highly imbalanced datasets is resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples to the minority class (over-sampling).

The simplest implementation of over-sampling is to duplicate random records from the minority class, which can cause overfitting.

In under-sampling, the simplest technique involves removing random records from the majority class, which can cause loss of information.

1. Random Under-Sampling

Undersampling can be defined as removing some observations of the majority class. This is done until the majority and minority class are balanced out.

Undersampling can be a good choice when you have a ton of data (think millions of rows). But a drawback is that we are removing information that may be valuable.

2. Random Over-Sampling

Oversampling can be defined as adding more copies to the minority class. Oversampling can be a good choice when you don’t have a ton of data to work with.

A con to consider with oversampling is that duplicating minority records can cause overfitting and poor generalization to your test set.
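For intuition, here is a minimal pandas-only sketch of both random strategies on a toy frame (the column names are hypothetical); the imblearn implementations follow in the next section:

# toy data: 900 majority (0) rows and 100 minority (1) rows
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'feature': rng.normal(size=1000),
                   'label': [0] * 900 + [1] * 100})

df_majority = df[df['label'] == 0]
df_minority = df[df['label'] == 1]

# random under-sampling: keep only as many majority rows as minority rows
df_under = pd.concat([df_majority.sample(n=len(df_minority), random_state=42),
                      df_minority])

# random over-sampling: duplicate minority rows by sampling with replacement
df_over = pd.concat([df_majority,
                     df_minority.sample(n=len(df_majority), replace=True,
                                        random_state=42)])

print(df_under['label'].value_counts())
print(df_over['label'].value_counts())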

Implementation of resampling techniques

All of the examples below use the imbalanced-learn (imblearn) library.

1. Random under-sampling with imblearn

RandomUnderSampler is a fast and easy way to balance the data by randomly selecting a subset of data for the targeted classes. Under-sample the majority class(es) by randomly picking samples with or without replacement.

2. Random over-sampling with imblearn

One way to fight imbalanced data is to generate new samples in the minority classes. The most naive strategy is to generate new samples by randomly sampling with replacement from the currently available samples. The RandomOverSampler offers such a scheme.

3. Under-sampling: Tomek links

Tomek links are pairs of very close instances but of opposite classes. Removing the instances of the majority class of each pair increases the space between the two classes, facilitating the classification process.

A Tomek link exists when two samples of opposite classes are each other's nearest neighbor.
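Here is a minimal sketch of that test, using scikit-learn's NearestNeighbors on four made-up 1-D points:

# two points form a Tomek link if they are mutual nearest neighbors
# and carry opposite labels; the points below are made up
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([0, 1, 0, 0])

nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)   # idx[:, 0] is the point itself, idx[:, 1] its neighbor

for i in range(len(X)):
    j = idx[i, 1]
    if idx[j, 1] == i and y[i] != y[j] and i < j:
        print(f'Tomek link: points {i} and {j}')   # prints: points 0 and 1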

4. Synthetic Minority Oversampling Technique (SMOTE)

This technique generates synthetic data for the minority class.

SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and computing the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors.

SMOTE algorithm works in 4 simple steps:

1. Choose a minority class sample as the input vector

2. Find its k nearest neighbors (k_neighbors is specified as an argument in the SMOTE() function)

3. Choose one of these neighbors and place a synthetic point anywhere on the line joining the point under consideration and its chosen neighbor

4. Repeat the steps until data is balanced
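The interpolation in step 3 reduces to one line of arithmetic. Here is a minimal numpy sketch of generating a single synthetic point (the two sample points are made up for illustration):

import numpy as np

rng = np.random.default_rng(42)

x_i = np.array([1.0, 2.0])    # a minority class sample
x_nn = np.array([2.0, 3.0])   # one of its k nearest minority neighbors

lam = rng.uniform(0, 1)       # random interpolation factor in [0, 1]
x_synthetic = x_i + lam * (x_nn - x_i)
print(x_synthetic)            # lies on the segment between x_i and x_nn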

5. NearMiss

NearMiss is an under-sampling technique. Instead of resampling the minority class, it uses distance criteria to decide which majority-class samples to keep, shrinking the majority class until it matches the minority class.

Performance Metrics

Accuracy is not the best metric to use when evaluating imbalanced datasets as it can be misleading.

Metrics that can provide better insight are:

1. Confusion Matrix: a table showing correct predictions and types of incorrect predictions.

2. Precision: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.

3. Recall: the number of true positives divided by the number of positive values in the test data. The recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness. Low recall indicates a high number of false negatives.

4. F1 score: the harmonic mean of precision and recall.

5. Area Under ROC Curve (AUROC): AUROC represents the likelihood of your model distinguishing observations from two classes.
In other words, if you randomly select one observation from each class, what’s the probability that your model will be able to “rank” them correctly?
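Here is a minimal sketch computing all of these metrics with scikit-learn; the labels and scores below are hypothetical:

from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 0.9, 0.8, 0.4, 0.3]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall:', recall_score(y_true, y_pred))
print('F1:', f1_score(y_true, y_pred))
print('AUROC:', roc_auc_score(y_true, y_score))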

Penalize Algorithms (Cost-Sensitive Training)

The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on the minority class.

A popular algorithm for this technique is Penalized-SVM.

During training, we can pass the argument class_weight='balanced' to penalize mistakes on the minority class by an amount proportional to how under-represented it is.

We also want to include the argument probability=True if we want to enable probability estimates for SVM algorithms.

Change the algorithm

While in every machine learning problem, it’s a good rule of thumb to try a variety of algorithms, it can be especially beneficial with imbalanced datasets.

Decision trees frequently perform well on imbalanced data. In modern machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform individual decision trees, so we'll jump right into those:

Tree-based algorithms work by learning a hierarchy of if/else questions, which can force both classes to be addressed.
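As an illustration, here is a minimal sketch of a class-weighted random forest on a synthetic imbalanced dataset (everything here is generated for demonstration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# synthetic data with a 90/10 class split
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

rf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                            random_state=42)
rf.fit(X_train, y_train)
print('F1 score:', f1_score(y_test, rf.predict(X_test)))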

Advantages and Disadvantages of Under-sampling

Advantages

It can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge.

Disadvantages

1. It can discard potentially useful information which could be important for building rule classifiers.

2. The sample chosen by random under-sampling may be biased and not an accurate representation of the population, leading to inaccurate results on the actual test data set.

Advantages and Disadvantages of Over-sampling

Advantages

1. Unlike under-sampling, this method leads to no information loss.

2. It often outperforms under-sampling, since no data is discarded.

Disadvantages

1. It increases the likelihood of overfitting since it replicates the minority class events.

Why Downsample and Upweight?

It may seem odd to add example weights after downsampling. We were trying to make our model improve on the minority class, so why would we upweight the majority? These are the resulting changes:

1. Faster convergence: During training, we see the minority class more often, which will help the model converge faster.

2. Disk space: By consolidating the majority class into fewer examples with larger weights, we spend less disk space storing them. This savings allows more disk space for the minority class, so we can collect a greater number and a wider range of examples from that class.

3. Calibration: Upweighting ensures our model is still calibrated; the outputs can still be interpreted as probabilities.
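Here is a minimal sketch of the recipe, assuming a binary problem with class 0 as the majority and a downsampling factor of 10 (the data is random and purely illustrative of the mechanics):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (rng.uniform(size=10_000) < 0.05).astype(int)   # ~5% minority class

factor = 10
maj = np.flatnonzero(y == 0)
keep = rng.choice(maj, size=len(maj) // factor, replace=False)  # downsample
idx = np.concatenate([keep, np.flatnonzero(y == 1)])

# upweight the downsampled class by the same factor to keep the model calibrated
weights = np.where(y[idx] == 0, factor, 1)
model = LogisticRegression().fit(X[idx], y[idx], sample_weight=weights)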

Practice on a Real Dataset

# Objective: Handle Imbalance Data
#Importing Libraries

import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from collections import Counter
import joblib

from sklearn import preprocessing
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score

%matplotlib inline

# Data Acquisition
#define path of data folder
data_path = 'D:/JOB PREP/Practice and Projects/MobilePopularity/data/'
data = pd.read_csv(data_path+"train.csv")

#sample data
data.head()
# Shape of dataset
data.shape

def cat_analysis(x):
    count = x.value_counts()

    # horizontal bar chart of the 8 most frequent categories
    ax = count.head(8).plot(kind='barh', title=x.name)
    ax.set_xlabel("Count of " + str(x.name))
    ax.set_ylabel(x.name)
    plt.show()

    percent = x.value_counts(normalize=True)
    print(pd.concat([count, percent], axis=1, keys=['counts', '%']))

    return ''

data[['popularity']].apply(cat_analysis)

popularity is the target variable, with two categories: low and high. 73.42% of apps belong to the high category, which makes the data imbalanced.

# Feature Handling
### Data Encoding
# Import label encoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# label-encode every column (the encoder is re-fit on each column)
data_encoded = data.apply(label_encoder.fit_transform)

data_encoded.shape

# Define independent and target features
x = data_encoded.iloc[:, :-1]
y = data_encoded.iloc[:, -1]

# 1. Random under-sampling with imblearn
# import library
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42, replacement=True)

# fit predictor and target variable
x_rus, y_rus = rus.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_rus))

# 2. Random over-sampling with imblearn
# import library
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

# fit predictor and target variable
x_ros, y_ros = ros.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_ros))

# 3. Under-sampling: Tomek links
# import library
from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='majority')

# fit predictor and target variable
x_tl, y_tl = tl.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_tl))

# 4. Synthetic Minority Oversampling Technique (SMOTE)
# import library
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# fit predictor and target variable
x_smote, y_smote = smote.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_smote))

# 5. NearMiss
from imblearn.under_sampling import NearMiss

nm = NearMiss()

x_nm, y_nm = nm.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_nm))

# Penalize Algorithms (Cost-Sensitive Training)
# split data into train and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# load library
from sklearn.svm import SVC

# we can add class_weight='balanced' to penalize mistakes on the minority class
svc_model = SVC(class_weight='balanced', probability=True)

svc_model.fit(x_train, y_train)

svc_predict = svc_model.predict(x_test)

# check performance

print('ROCAUC score:',roc_auc_score(y_test, svc_predict))
print('Accuracy score:',accuracy_score(y_test, svc_predict))
print('F1 score:',f1_score(y_test, svc_predict))
ROCAUC score: 0.6639082462253194
Accuracy score: 0.549367088607595
F1 score: 0.5265957446808511

