Handling Imbalanced Classes!!!

Abhigyan
Published in Analytics Vidhya · Sep 13, 2020

One of the many problems in real-world machine learning classification is imbalanced data.
Data is imbalanced when the classes in our dataset are present in disproportionate amounts: one class (the majority class) dominates the dataset while the other (the minority class) appears only rarely.

Problems with imbalanced data?

  • During model training, the imbalance biases the model toward the majority class: it learns the features of the majority class very well but fails to capture the features of the minority class.
  • The accuracy score can be misleading (see the small sketch below).
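
For instance, on a dataset where 95% of the samples belong to one class, a model that always predicts the majority class reaches 95% accuracy while learning nothing useful. A minimal sketch (the class counts here are made up purely for illustration):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# toy labels: 950 negatives, 50 positives
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))  # features are irrelevant for this illustration

# a "model" that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, dummy.predict(X)))  # 0.95, yet it never catches a positive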

Ways to handle an imbalanced dataset!


→ Collect more data.

This method is often overlooked because it is not feasible for most use cases. However, collecting more data reduces the amount of bias introduced into the data by resampling techniques.

→ Use different performance metrics

Performance metrics play a major role in explaining how good our trained model really is. A few of the metrics that should be used are:
* Precision score
* Recall score
* F1-score
* Precision-recall curve
* Cohen's kappa

The ROC curve should not be used for imbalanced data, as it can suggest a misleadingly good result. To judge how well a model performs on imbalanced data, use the precision-recall curve instead.
This is because in the ROC curve the false positive rate (false positives / total real negatives) stays low when the total number of real negatives is huge, even if the model makes many false positives. Precision (true positives / (true positives + false positives)), on the other hand, is highly sensitive to false positives and is not diluted by a large number of real negatives.
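
As a rough sketch, these metrics can be computed with scikit-learn as follows (y_test, y_pred and y_scores are placeholder names for the true labels, the predicted labels and the predicted probabilities of the positive class):

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             cohen_kappa_score, precision_recall_curve)

# scores computed on held-out data
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("Kappa    :", cohen_kappa_score(y_test, y_pred))

# the precision-recall curve needs scores or probabilities,
# e.g. y_scores = model.predict_proba(x_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)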

→ Trying out different algorithms

Oftentimes, experimenting with different algorithms gives us the best results.
In the case of imbalanced data in particular, tree-based algorithms such as decision trees and random forests tend to perform well.
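
A minimal sketch of trying a random forest with scikit-learn (x_train, y_train, x_test and y_test are placeholder names for your own train/test split), evaluated with F1 rather than accuracy:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# fit a random forest and judge it with a metric suited to imbalanced data
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(x_train, y_train)
print(f1_score(y_test, rf.predict(x_test)))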

→ Resampling the data

It is one of the most widely used techniques for dealing with imbalanced data.
The data is randomly resampled to balance the classes.
This consists of removing samples from the majority class (under-sampling) and/or adding more examples of the minority class (over-sampling).

  • Random resampling provides a naive technique for rebalancing the class distribution for an imbalanced dataset.
  • Random oversampling duplicates examples from the minority class in the training dataset and can result in overfitting for some models.
  • Random undersampling deletes examples from the majority class and can result in losing information that may be valuable to the model.

Over-Sampling:
In this technique, minority-class samples are randomly duplicated until their count matches that of the majority class.

1. This technique can be effective for those machine learning algorithms that are affected by a skewed distribution and where multiple duplicate examples for a given class can influence the fit of the model.

2. This might include algorithms that iteratively learn coefficients, like artificial neural networks that use stochastic gradient descent. It can also affect models that seek good splits of the data, such as support vector machines and decision trees.

from imblearn.over_sampling import RandomOverSampler

# randomly duplicate minority-class samples until the classes are balanced
oversample = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample.fit_resample(X, y)

Refer to the imbalanced-learn documentation to know more about the over-sampling function.

Under-Sampling:
In this technique, the majority class is down-sampled to the size of the minority class by randomly deleting data points from the majority class.

  1. A limitation of undersampling is that examples from the majority class are deleted that may be useful, important, or perhaps critical to fitting a robust decision boundary.
  2. This approach may be more suitable for datasets where, despite the class imbalance, the minority class still contains enough examples that a useful model can be fit.

from imblearn.under_sampling import RandomUnderSampler

# randomly delete majority-class samples until the classes are balanced
undersample = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersample.fit_resample(X, y)

Refer to the imbalanced-learn documentation to know more about the under-sampling function.

Generating Synthetic Samples:

There are systematic algorithms that you can use to generate synthetic samples. The most popular of such algorithms is called SMOTE or the Synthetic Minority Over-sampling Technique.

SMOTE is an oversampling method. It works by creating synthetic samples from the minor class instead of creating copies.
The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time, by a random amount within the difference to its neighboring instances.

from imblearn.over_sampling import SMOTE

# generate synthetic minority-class samples instead of plain duplicates
sm = SMOTE(random_state=123)
X_train_res, Y_train_res = sm.fit_resample(X, Y.ravel())

Assigning Class Weights Manually:

One technique that avoids introducing any outside bias into the data is to assign weights to the classes manually.

We assign a higher weight to the minority class so that our model gives equal importance to both classes.

Most classification algorithms in scikit-learn have a hyperparameter named “class_weight”.
By default, when no value is passed, every class is assigned an equal weight of 1.

  • One common technique is to pass class_weight="balanced" when creating an instance of the algorithm.

from sklearn.linear_model import LogisticRegression

# "balanced" sets weights inversely proportional to class frequencies
Logistic_model = LogisticRegression(class_weight="balanced").fit(x_train, y_train)

  • The other technique is to assign weights manually to the different class labels using syntax such as class_weight={0:2, 1:1}. Class 0 is assigned a weight of 2 and class 1 is assigned a weight of 1.

# class_weights = {class_label: weight}
class_weights = {0: 2, 1: 1}
Logistic_model = LogisticRegression(class_weight=class_weights).fit(x_train, y_train)

We can use grid search to find the optimal weight values for training our model:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# candidate weights for class 0; class 1 gets the complementary weight
# (upper bound kept below 1 so both weights stay positive)
class_weight = np.linspace(0.05, 0.95, 20)
grid_para = {'class_weight': [{0: x, 1: 1.0 - x} for x in class_weight]}

gridsearch = GridSearchCV(estimator=LogisticRegression(),
                          param_grid=grid_para,
                          scoring='f1',
                          cv=3)
gridsearch.fit(x_train, y_train)
print(gridsearch.best_params_)

After finding the best set of weights, we can pass these weights to train our final model.

These are a few of the techniques that can be used to tackle the issue of imbalanced data.
No single method can be said to be the best; I highly recommend that you experiment and find out which one suits your data best.

HAPPY LEARNING!!!!!

Like my article? Do give me a clap and share it, as that will boost my confidence. Also, I post new articles every Sunday, so stay connected for future articles in my basics of data science and machine learning series.

Also, do connect with me on LinkedIn.

