Exploring Pyspark.ml for Machine Learning: Handling Class Imbalances Part 2/2

Sze Zhong LIM
Published in Data And Beyond
5 min read · Nov 17, 2023

Dealing with class imbalances in datasets is a common and critical challenge. In this article, I will provide a detailed and easy-to-understand guide on how to handle class imbalances within a dataset using PySpark and pandas. This article aims to equip you with the knowledge and practical code examples to effectively address this issue.

This article is a continuation of Part 1.

AI Generated Photo by Adobe Firefly. Prompt used: Class Imbalance.

Important Note before Data Resampling

There are 2 important things to note before any data resampling:

  1. Split the data into train / test sets BEFORE any resampling is done.
  2. Only the training data is resampled; the test data is left untouched. The model is trained on resampled data, but evaluation must be done on data that reflects the original class distribution.

Oversampling the Minority

Below is the Jupyter notebook for oversampling the minority class using PySpark.

The thing to note in the oversample method I provided above is that the oversampling fraction depends on the initial ratio. The quality of the minority class also matters: if the minority samples are of poor quality, oversampling them again and again will just lead to a bad model that cannot predict future data correctly.

oversampled_minority = df.filter(F.col('class') == 1) \
    .sample(withReplacement=True, fraction=(balance_ratio / ratio), seed=88)

I want to highlight the difference between undersampling and oversampling. In oversampling, we set withReplacement=True because a sampled row must remain eligible to be drawn again; that is how the minority class grows beyond its original size. This is very different from undersampling, where withReplacement=False because we do not want any majority row to be picked more than once.

The code I provided allows you to control the ratio of Majority:Minority in the training set.

The table below compares the baseline results against the undersampling and oversampling results.

Summary table extracted from the Jupyter Notebook

We can see that as the oversampling majority:minority ratio increased (i.e. the training set became less balanced), recall and the number of true positives went down. This is logical: the more balanced the representation of the minority class, the higher the recall tends to be. One thing to watch out for is overfitting, especially when the oversampling factor is large (i.e. a very small minority class is duplicated many times).

Note that this inference is only a heuristic; the effect may differ across datasets.

SMOTE (Synthetic Minority Over-Sampling Technique)

In simple terms, SMOTE is an oversampling technique that generates synthetic (made-up) samples for the minority class. SMOTE identifies minority-class examples that lie in the vicinity of one another and creates synthetic samples that are combinations of these nearby instances.

In a general sense, this is how the samples are created:
1) For each selected instance, create synthetic samples by interpolating between the instance and one of its k-nearest neighbors.
2) The interpolation picks a random value between 0 and 1 and moves that fraction of the way along the line segment from the instance to the chosen neighbor.
3) This process generates synthetic samples that resemble the minority class but introduce small variations.
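The interpolation step can be sketched in a few lines of plain Python (an illustrative toy, not the PySpark implementation used later):

```python
import random

def smote_sample(instance, neighbor, rng):
    """Create one synthetic point on the segment between an instance and its neighbor."""
    gap = rng.random()  # single random value in [0, 1)
    return [x + gap * (n - x) for x, n in zip(instance, neighbor)]

rng = random.Random(88)
# Each coordinate of the synthetic point lies between the two parent points.
synthetic = smote_sample([1.0, 2.0], [3.0, 4.0], rng)
```

Because the same `gap` is applied to every feature, the synthetic point sits on the straight line between the two parents rather than anywhere in the bounding box.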

A photo obtained from @parth dholakiya. Link here.

This algorithm helps to overcome the overfitting problem posed by random oversampling, in which the same samples are simply repeated again and again.

There are various SMOTE variants out there.

Some resources you may read up on:
1) SMOTE-RkNN: A hybrid re-sampling method based on SMOTE and reverse k-nearest neighbors
2) Imbalanced-Learn Section 2.1.4
3) SMOTE for Imbalanced Classification with Python by Machine Learning Mastery

In this article, I will purely be using the SMOTE implementation described by hwangdb. His article and code can be found here. There were also some modifications made by Inguelberth, which I found in the link here.

Note that SMOTE generally performs better when the dataset has fewer features (lower dimensionality); it may not work as well in high-dimensional spaces. However, it is best to test it directly on your dataset.

You may find the code for SMOTE for PySpark below.

You may play around with the configuration via bucketLength, "k", or the multiplier. More explanation below.

bucketLength is a parameter of Spark's BucketedRandomProjectionLSH, which this implementation uses for its approximate nearest-neighbour search. It controls the length of each hash bucket: a larger bucket length puts more points into the same bucket, increasing the chance that true neighbours collide (and are therefore found), at the cost of a less precise, slower search.

The multiplier determines how many synthetic samples to create per minority instance; in other words, it controls the total amount of oversampling.

“k” refers to the number of nearest neighbors considered when generating synthetic samples for the minority class. It plays a crucial role in determining the characteristics of the synthetic samples: a smaller k yields less diversity, and a larger k yields more diversity.

The table below compares the baseline results against the undersampling, oversampling, and SMOTE results.

Summary table extracted from the Jupyter Notebook

We can see that:
1) All data resampling methods (except undersampling 1:1) showed a significant improvement over the baseline model.
2) Undersampling 2:1 worked best for this dataset.

Feel free to play around with the whole code in the Jupyter notebook, which can be found here.

Additional Info

You may also try the project in this link, which covers handling class imbalance using pandas and scikit-learn instead.
