Balancing Imbalanced Data: Undersampling and Oversampling Techniques in Python

Daniele Santiago
3 min read · Jun 5, 2023


A dataset is considered imbalanced when its classes are represented in significantly unequal proportions, for example when one class accounts for the vast majority of entries. Most machine learning algorithms perform better on balanced datasets, since they aim to optimize overall classification accuracy or related measures. With imbalanced data, the decision boundaries the algorithms learn tend to favor the majority class, leading to incorrect classification of the minority class.

To address this problem, it is necessary to use evaluation metrics that take the imbalance into account and to apply techniques that rebalance the data. Sampling techniques such as undersampling and oversampling are standard methods for dealing with class imbalance. This article presents an approach to implementing these techniques in Python.
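As a quick illustration of why imbalance-aware metrics matter (this toy example is not part of the article's pipeline), consider a classifier that always predicts the majority class. Plain accuracy looks excellent, while balanced accuracy and F1 expose the failure on the minority class:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# A degenerate "classifier" that always predicts the majority class (0)
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))           # 0.9 - looks good
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 - no better than chance per class
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0 - minority class never found
```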

In general, under-sampling involves removing examples from the majority class to make the class proportions more balanced. On the other hand, over-sampling involves generating new examples for the minority class to increase its representation in the dataset.
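The core idea of both random strategies can be sketched in plain Python (a simplified illustration with made-up counts, not the library implementation):

```python
import random

random.seed(0)
majority = list(range(900))   # 900 majority-class examples
minority = list(range(10))    # 10 minority-class examples

# Undersampling: randomly keep only as many majority examples as there are minority ones
under_majority = random.sample(majority, len(minority))

# Oversampling: randomly duplicate minority examples up to the majority size
over_minority = random.choices(minority, k=len(majority))

print(len(under_majority), len(over_minority))  # 10 900
```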

If you’re interested, a more detailed explanation of the differences between these techniques is provided in this article. Here, we will only cover their practical use. However, before we proceed with balancing, it is necessary to split the dataset into training and testing sets, which can be done as follows.

# Import the split function
from sklearn.model_selection import train_test_split

# Split variables into X and y
X = df_clean.drop('Class', axis=1)
y = df_clean['Class']  # target variable

# Split the dataset into training and test sets, preserving class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, shuffle=True, test_size=0.15
)

Undersampling

To apply the undersampling technique, we will use the RandomUnderSampler algorithm, which randomly removes instances. It is available in the imbalanced-learn library.


# Import the necessary libraries
from imblearn.under_sampling import RandomUnderSampler

Now we will create a RandomUnderSampler object with the random_state set to 42. This ensures that the random selection is reproducible when running the code multiple times. We can also set the sampling strategy as ‘majority’, where only the majority class will have instances removed. The various available parameters can be found in the documentation.

# Create a RandomUnderSampler object
rus = RandomUnderSampler(random_state=42, sampling_strategy='majority')

Finally, we will apply undersampling to the training data. The fit_resample method fits the RandomUnderSampler object to the data and returns the balanced data.

# Balancing the data
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

Oversampling

For oversampling, we will use SMOTE, a widely used technique in classification problems where the minority class is significantly smaller than the majority class. The technique works by selecting an example from the minority class and finding its k nearest neighbors. It then creates new synthetic examples by randomly interpolating between the selected example and one of its neighbors, and adds them to the dataset.
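The interpolation step can be sketched in a few lines (a simplified illustration of the idea, not imbalanced-learn's actual implementation; the helper name smote_point is our own):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_point(x, neighbors):
    """Create one synthetic example between x and a randomly chosen neighbor:
    x_new = x + lam * (neighbor - x), with lam drawn from [0, 1)."""
    nn = neighbors[rng.integers(len(neighbors))]
    lam = rng.random()
    return x + lam * (nn - x)

# Toy minority-class point and its (pre-computed) nearest neighbors
x = np.array([1.0, 2.0])
neighbors = np.array([[1.5, 2.5], [0.5, 1.5]])
x_new = smote_point(x, neighbors)
print(x_new)  # lies on the segment between x and one of its neighbors
```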

To use it, we will import the necessary libraries.

# Import the necessary libraries
from imblearn.over_sampling import SMOTE

Similarly, we will create an instance of the SMOTE object, which will be applied to the training data to perform the oversampling and balance the data.

# Creating an instance of SMOTE (random_state again makes the result reproducible)
smote = SMOTE(random_state=42)

# Balancing the data
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

As a result, we will have balanced data with new instances added. More information about parameters can be found in the documentation.

Conclusion

Data balancing is one of the steps in data preprocessing that aims to improve model performance. In this article, we demonstrated two algorithms, RandomUnderSampler and SMOTE, one for each technique, and their implementation in Python. This article aimed to introduce two simple techniques, but it is important to note that several algorithms are available for each, such as ClusterCentroids, ADASYN, or even combinations of undersampling and oversampling, which can be studied and applied to further improve your results.

To better understand the theory behind sampling and other techniques, be sure to check out the latest article published: Effective Strategies for Dealing with Imbalanced Datasets.

Did you find this article helpful?

Follow me on social media:
