Handling imbalanced classes in machine learning classifiers

Haneul Kim
Published in Analytics Vidhya
Mar 13, 2021

Many people have learned that when classes are imbalanced, they should use performance metrics such as recall, precision, F1-score, or the ROC curve instead of the standard accuracy measure.

I’m not sure if it’s just me, but I didn’t know that training with imbalanced data influenced model performance. While studying deep learning concepts on YouTube (@4:10), I learned that imbalanced class labels affect how a model is trained, and that the best method is to oversample the class that has fewer samples.

This got me thinking: “What about machine learning algorithms? Why have I never tried to balance class labels?”

After some research, I found out that imbalanced class labels affect ML models as well, and that there are multiple techniques to address this issue. Today I will explain these techniques and run experiments to see whether handling class imbalance during training actually improves our ML models.

Here are some different techniques we will cover today:

  1. Oversampling
  2. Undersampling
  3. Adjusting class weights

We will be using Credit Card Fraud Detection data offered by Kaggle.

Let’s load dependencies and data, then split the data into train and test sets.
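The setup might look like the sketch below. The CSV filename, the 80/20 split, and the use of stratified splitting are my assumptions; a small synthetic stand-in with the same column layout is generated when the Kaggle file is absent, just so the sketch runs end to end.

```python
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

if Path("creditcard.csv").exists():
    # Kaggle's Credit Card Fraud Detection dataset
    df = pd.read_csv("creditcard.csv")
else:
    # tiny synthetic stand-in with the same layout, so the sketch still runs
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["V1", "V2", "Amount"])
    df["Class"] = [0] * 990 + [1] * 10

X = df.drop(columns=["Class"])
y = df["Class"]  # 0 = normal, 1 = fraudulent

# stratify keeps the same (highly skewed) class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```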

Our dataset is highly imbalanced, consisting of

  • 284315 normal transactions (class = 0)
  • 492 fraudulent transactions (class = 1)

which is about 577:1 ratio.

Now we will create result_df to store and keep track of the performance of each technique, which will be used later for comparison.
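A minimal version of that bookkeeping might look like this; the column names and the helper function are my assumptions, not necessarily the author’s exact code:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# one row of metrics per technique, indexed by technique name
result_df = pd.DataFrame(columns=["accuracy", "precision", "recall", "f1"])

def store_result(name, y_true, y_pred):
    """Compute the four metrics for a run and append them to result_df."""
    result_df.loc[name] = [
        accuracy_score(y_true, y_pred),
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred),
        f1_score(y_true, y_pred),
    ]
```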

Also, two functions will be created:

  1. One that trains a logistic regression on the given X_train, y_train and outputs predicted labels for newly seen X_test data
  2. One that oversamples or undersamples the training data
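Here is one way those two helpers could be written. The function names and the use of `sklearn.utils.resample` are my assumptions; the original gist may differ:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def train_predict(X_train, y_train, X_test):
    """Train a logistic regression and return predicted labels for X_test."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model.predict(X_test)

def random_sample(X_train, y_train, kind="over", random_state=42):
    """Randomly over- or undersample the training data to a 1:1 ratio."""
    train = pd.concat([X_train, y_train], axis=1)
    majority = train[y_train == 0]
    minority = train[y_train == 1]
    if kind == "over":
        # draw minority rows with replacement until they match the majority
        minority = resample(minority, replace=True,
                            n_samples=len(majority), random_state=random_state)
    else:
        # keep only len(minority) majority rows, without replacement
        majority = resample(majority, replace=False,
                            n_samples=len(minority), random_state=random_state)
    balanced = pd.concat([majority, minority])
    return balanced.drop(columns=[y_train.name]), balanced[y_train.name]
```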

Using the above functions, we will make predictions with a baseline Logistic Regression model, with randomly oversampled data, and with randomly undersampled data. All performances will be stored in result_df.

Next, we will randomly oversample our data so that it contains a 1:1 ratio of normal and fraudulent transactions.

Oversampling refers to randomly choosing samples from the minority class with replacement until the minority class is the same size as the majority class.

Below, we can confirm that there are indeed the same number of samples for normal and fraudulent transactions.
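The check might look like the following sketch; the toy 95:5 stand-in data is mine, used only to make the counts easy to verify:

```python
import pandas as pd
from sklearn.utils import resample

# toy stand-in for the training data: 95 normal rows, 5 fraudulent rows
train = pd.DataFrame({"Amount": range(100),
                      "Class": [0] * 95 + [1] * 5})
majority = train[train["Class"] == 0]
minority = train[train["Class"] == 1]

# sample the minority class with replacement up to the majority count
oversampled = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=42),
])
print(oversampled["Class"].value_counts())  # both classes now have 95 rows
```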

Using the above oversampled dataset, we will use logistic regression to classify transactions and store its performance in result_df.

Undersampling refers to randomly choosing only len(minority_class) samples from the majority class and discarding the rest.

We can see that there is a 1:1 ratio, but this time we’ve truncated the normal transaction data.
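Mirroring the oversampling check, an undersampling sketch on the same toy 95:5 stand-in (my data, not the article’s) would be:

```python
import pandas as pd
from sklearn.utils import resample

# toy stand-in for the training data: 95 normal rows, 5 fraudulent rows
train = pd.DataFrame({"Amount": range(100),
                      "Class": [0] * 95 + [1] * 5})
majority = train[train["Class"] == 0]
minority = train[train["Class"] == 1]

# keep only len(minority) majority rows, drawn without replacement
undersampled = pd.concat([
    resample(majority, replace=False, n_samples=len(minority), random_state=42),
    minority,
])
print(undersampled["Class"].value_counts())  # both classes now have 5 rows
```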

Using the above undersampled dataset, let’s make predictions with Logistic Regression and store its performance.

Instead of writing lines of code to oversample or undersample, sklearn provides a class_weight parameter for its classifier algorithms.

You can pass a dictionary indicating weights for each class, or “balanced” to automatically adjust weights using n_samples / (n_classes * np.bincount(y)).

Its default is None, which indicates that each class has a weight of 1. In our case, that assumes we have a 1:1 ratio of normal and fraudulent transactions.
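Both options can be sketched as follows; the toy 95:5 data here is mine, used only to make the weights easy to compute by hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy imbalanced data standing in for the real training set: 95 normal, 5 fraud
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)
X[y == 1] += 2.0  # shift the minority class so it is learnable

# "balanced" reweights each class by n_samples / (n_classes * np.bincount(y))
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# equivalently, pass an explicit dictionary of per-class weights
weights = {0: 100 / (2 * 95), 1: 100 / (2 * 5)}  # {0: ~0.53, 1: 10.0}
clf_dict = LogisticRegression(class_weight=weights, max_iter=1000).fit(X, y)
```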

Before we go on, I just want to check whether passing in “balanced” actually does a good job of adjusting weights.

The ratio of normal to fraudulent data is 213226:379 in our training dataset, which is about 562.6:1.

Using n_samples / (n_classes * np.bincount(y)) we get weights of about 0.5:281.8, which is roughly a 1:563 ratio, so we can now trust sklearn.
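Plugging the training counts into the formula confirms the arithmetic (I use 213226 normal rows, which is the count the stated 562.6:1 ratio implies):

```python
import numpy as np

# class counts in the training split described above: [normal, fraudulent]
counts = np.array([213226, 379])
n_samples, n_classes = counts.sum(), 2

# sklearn's "balanced" heuristic: n_samples / (n_classes * np.bincount(y))
weights = n_samples / (n_classes * counts)
print(weights)                   # roughly [0.5, 281.8]
print(weights[1] / weights[0])   # roughly 562.6, the raw class ratio
```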

Enough skepticism. Let’s increase the weight of the fraudulent class in steps of 20 until it reaches 562, so we can see what happens to performance as the class imbalance gets adjusted.
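The sweep might look like this sketch; the synthetic 2% positive data is a stand-in I made up so the loop is self-contained, not the fraud dataset itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# toy imbalanced data as a stand-in for the fraud dataset (~2% positives)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = (rng.random(2000) < 0.02).astype(int)
X[y == 1] += 1.5  # make the minority class separable enough to learn

results = {}
for w in range(1, 563, 20):  # minority-class weight: 1, 21, 41, ..., 561
    clf = LogisticRegression(class_weight={0: 1, 1: w}, max_iter=1000)
    clf.fit(X, y)
    pred = clf.predict(X)
    results[w] = {"recall": recall_score(y, pred),
                  "precision": precision_score(y, pred, zero_division=0)}
```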

Time to see everything we’ve stored in result_df.

Note that baseline = weight_adjusted_1to1 and weight_adjusted = weight_adjusted_563to1 (561 in our case but you get the point).

Now, let’s visualize what is happening to Logistic Regression model as our imbalanced training data gets balanced.

As imbalanced data becomes balanced here are three things we can see:

  1. Increase in recall
  2. Decrease in Precision
  3. Decrease in accuracy
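The three trends above could be plotted along these lines. Note that the metric values below are placeholders I invented for illustration; only the direction of each curve mirrors the observations, not the article’s actual numbers:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# placeholder values that only mirror the direction of each trend above
trend_df = pd.DataFrame(
    {"recall":    [0.60, 0.85, 0.90, 0.92],
     "precision": [0.85, 0.30, 0.12, 0.06],
     "accuracy":  [0.999, 0.99, 0.98, 0.97]},
    index=[1, 187, 374, 561],  # minority-class weight
)
ax = trend_df.plot(marker="o")
ax.set_xlabel("minority-class weight")
ax.set_ylabel("score")
plt.savefig("class_weight_trends.png")
```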

In conclusion, adjusting class imbalance before training does not seem to be a definitive answer to every problem; however, looking at our results, we can see that balancing the data is beneficial if our goal is to increase recall.

It would be important to understand why adjusting imbalanced data increases recall but decreases precision. I need to do more research and test it on my own before writing about it, which I will do soon (hopefully). For now, you can take a look here for a high-level explanation.

There are other techniques for adjusting imbalanced datasets. One very popular technique is SMOTE (Synthetic Minority Oversampling Technique), an oversampling method that, instead of duplicating existing samples, creates new, synthetic samples of the minority class.
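In practice you would reach for `imblearn.over_sampling.SMOTE` from the imbalanced-learn package, but the core idea fits in a short from-scratch sketch: interpolate between each minority sample and one of its nearest minority neighbors. The function name and toy points are mine:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, random_state=0):
    """Minimal SMOTE sketch: synthesize points by interpolating between
    a random minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(random_state)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                  # interpolation factor in [0, 1)
        new_points.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new_points)

# synthesize 10 new fraudulent-looking samples from 6 minority points
X_min = np.array([[0.0, 0], [1, 1], [2, 0], [0, 2], [1, 0], [2, 2]])
synthetic = smote_sketch(X_min, n_new=10)
```

Because every synthetic point lies on a segment between two real minority points, the new data stays inside the minority class’s region of feature space rather than being an exact copy.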

Thanks for reading and please comment if there is any misinformation :)
