Exploring Pyspark.ml for Machine Learning: Handling Class Imbalances Part 1/2

Sze Zhong LIM
Published in Data And Beyond · 6 min read · Nov 11, 2023

AI-generated photo by Adobe Firefly. My idea of a model running on an obviously imbalanced weight, yet still appearing normal.

In most cases, we will deal with binary or multi-class imbalance. Below are some types of imbalances, their scenarios, and real-life examples.

For this article, we will be working on the simplest case: binary class imbalance, where there are only two classes and one significantly outnumbers the other.

Why Are Class Imbalances Challenging?

Handling class imbalances is challenging for three main reasons: biased models, evaluation metrics, and data resampling.

Biased Models — Models trained on imbalanced data tend to be biased towards the majority class and may misclassify the minority class more frequently. Later versions of pyspark.ml provide input arguments that let us assign different weights to the classes, although in my experiments this did not always improve results.

Evaluation Metrics — Traditional metrics such as accuracy can be misleading when imbalances exist. It is important to choose the right metric to judge the model by, depending on the project: precision, recall, F1-score, or the area under the receiver operating characteristic curve (AUROC).

Data Resampling — A common way to address imbalances is data resampling: oversampling the minority class, undersampling the majority class, or generating synthetic data (SMOTE). Deciding which technique to use is crucial and depends on the dataset at hand.

Image by RAFAEL ALENCAR (Kaggle)

For Data Resampling methods, I will be focusing on:

  • Undersampling the Majority Class
  • Oversampling the Minority Class
  • Synthetic Minority Over-Sampling Technique (SMOTE)

Data Resampling Comparison and Use Cases

The common question is when to use which resampling method. Trying all three can be computationally expensive (for big datasets and multiple cross-validations), but more importantly, we want the good results we get to be explainable and not just due to luck.

In short:

Undersampling reduces the size of the majority class, potentially leading to a loss of information but is computationally efficient.

Oversampling increases the size of the minority class and is suitable when you have a small dataset but enough computational resources.

SMOTE generates synthetic samples by interpolating between existing minority-class samples, which introduces diversity and helps with complex decision boundaries.

Below is a table of the typical different use cases:

The table above is only a general guideline; the right choice depends heavily on the data and features. It is worth doing a deeper EDA on the features before deciding which method to use.

If you're not familiar with data resampling, some pitfalls worth looking further into are:

  • Overfitting
  • Loss of Information
  • Impact on Decision Boundaries
  • Data Leakage (resampling must be applied to the training dataset only)
  • Data Distribution Change

Hands On Experimentation with PySpark

As there are already many articles on handling class imbalances with pandas and scikit-learn, this article focuses on using PySpark. I will also keep things rudimentary, as I have had difficulties installing packages on legacy systems: minimizing reliance on external packages with hard-to-satisfy dependencies gives us greater control over the code's execution and keeps it compatible with existing infrastructure.

There is a guide here on how to install PySpark in your local Anaconda Jupyter Notebook if you are not already set up.

You may start a project by first setting up a PySpark dataset that has a binary class imbalance. I have created a Jupyter Notebook that builds a mock dataset of 100k samples with a severe class imbalance of 0.5%. (I included this part in case some people are interested in creating or converting their dataset from another format.)

Alternatively, you could use the creditcard.csv dataset I obtained from Kaggle at this link. For the remainder of this project, we will be using the Kaggle creditcard.csv dataset.

You will notice that I have split the data into train and test sets and double-checked that the minority fraction in the training and test sets is similar.

Baseline Model

I created a baseline Random Forest Classifier model so that we can compare its results against the data resampling methods. I have included some toPandas() code to tidy up the output and show what is in the table created.


The readings from the baseline model show:

  • AUROC = 0.928
  • Accuracy = 0.956
  • Recall = 0.048

with the confusion matrix as per below:

It can be observed that the recall is really low, and that the model is not able to correctly predict Class 1 (which is what we want).

We will explore three methods to see whether the model can be improved:

  • Undersampling the Majority
  • Oversampling the Minority
  • Synthetic Minority Over-Sampling Technique (SMOTE)

There are two important things to note before any data resampling:

  1. Split the data into train / test sets BEFORE the data is resampled.
  2. Only the train data will be resampled. The test data is left untouched, so that evaluation reflects the real-world class distribution.

Undersampling the Majority

Below is the Jupyter Notebook for undersampling the majority class using PySpark.

In my example above, I have shown that even when undersampling the majority class, we can choose the undersampling ratio we want; I used ratios that result in Majority:Minority ratios of 1:1, 2:1, and 5:1. Which ratio to use depends heavily on the algorithm you will be using and on the quality of the data and features.

From the table below, we can see the comparison between the baseline results and the Undersampling results.

Summary table extracted from the Jupyter Notebook

The 1:1 undersampling performed poorly, with a very low AUROC, recall, and accuracy. The 2:1 undersampling performed best, with the highest AUROC and a significantly high recall of 0.929.

Both the 2:1 and 5:1 undersampling have significantly better recall than the baseline model. That is what we want: a model able to correctly identify the true positives (TP).

For the record, I have also added the equivalent undersampling function in pandas.

# Binary Class Undersampling Majority Function for Pandas
def bin_class_undersample_majority(df_input, col_name, maj_min_ratio, rand_state=88):
    '''
    maj_min_ratio is the majority:minority ratio
    '''
    class_counts = df_input[col_name].value_counts()
    min_class = class_counts.idxmin()
    min_class_cnt = class_counts[min_class]
    ttl_cnt = len(df_input)
    maj_class_cnt = ttl_cnt - min_class_cnt
    max_maj_min_ratio = maj_class_cnt / min_class_cnt

    maj_idx = df_input[df_input[col_name] != min_class].index
    if maj_min_ratio > max_maj_min_ratio:
        # Requested ratio exceeds what the data allows; cap it.
        print(f'Majority:Minority Ratio input of {maj_min_ratio} is more than actual max ratio of {max_maj_min_ratio}')
        print(f'Maximum Majority:Minority Ratio of {max_maj_min_ratio} will be used.')
        undersampled_maj_idx = df_input.loc[maj_idx].sample(int(min_class_cnt * max_maj_min_ratio), random_state=rand_state).index
    else:
        undersampled_maj_idx = df_input.loc[maj_idx].sample(int(min_class_cnt * maj_min_ratio), random_state=rand_state).index
    min_idx = df_input[df_input[col_name] == min_class].index
    # Keep all minority rows plus the sampled majority rows.
    undersampled_idx = list(undersampled_maj_idx) + list(min_idx)

    undersampled_df = df_input.loc[undersampled_idx]

    return undersampled_df

We will cover Oversampling of Minority and SMOTE in Part 2 here.
