Handling Imbalanced data sets in Machine Learning

ITBodhi
Jul 21, 2020


What are the Best Practices, Techniques and Tools to Build the Right Model with an Imbalanced Data Set?


Let’s Start

Suppose that you have to build a classification model on a given Covid-19 data set. You apply your favourite algorithm and achieve 94% accuracy in predicting Covid Positive or Covid Negative. But your boss is still not happy and throws your model into the trash. You are surprised: the model seems to work fine, with no visible defect and high accuracy.

After closer analysis you find that the Covid-19 infection rate is only 6–7%, which is why in the given data set the positive class makes up only about 6% and the negative class around 94%. That is a highly imbalanced data set, and the model behaves in a very interesting way: it predicts every case as negative, achieving accuracy as high as 94% while not predicting a single positive case correctly. That is actually a blunder, failing to flag a Covid-19 patient as positive when they are.

Why is my model behaving insanely?

No, the model is actually doing the right thing, but my way of training and evaluating it is wrong. I am focusing on the wrong thing!

What am I doing wrong?

My approach is biased by my knowledge of evaluation metrics, which are actually fooling me. ACCURACY is not the right metric when working with an imbalanced data set.

Let’s refresh the memory: Confusion Matrix, Precision, Recall and F1

The confusion matrix gives an interesting overview of how well a model is doing and is therefore a great starting point for any classification model evaluation. The most common metrics that can be derived from the confusion matrix are summarised below.

Let us give a short description of these metrics.

  • The accuracy of the model is the total number of correct predictions divided by the total number of predictions.
  • The precision of a class defines how trustworthy the result is when the model says that a point belongs to that class.
  • The recall of a class expresses how well the model is able to detect that class.
  • The F1 score of a class is the harmonic mean of precision and recall (2 × precision × recall / (precision + recall)); it combines the precision and recall of a class in one metric.

For a given class, the different combinations of recall and precision have the following meanings:

  • high recall + high precision: the class is perfectly handled by the model
  • low recall + high precision: the model can’t detect the class well, but is highly trustworthy when it does
  • high recall + low precision: the class is well detected, but the model also includes points of other classes in it
  • low recall + low precision: the class is poorly handled by the model

In our introductory example, we would get the following confusion matrix for 10,000 patients.

The accuracy is 94% as said earlier. The precision of the negative class is 94%, while the precision of the positive class is not computable (the model never predicts positive). The recall of the negative class is 1.0, which is perfect (all negative patients are labelled as such), but the recall of the positive class is 0.0, the worst case (no positive patients are detected). Thus, we can conclude that our model is doing nothing useful for this class. The F1 score is not computable for the positive class and is about 0.97 for the negative class. In this example, looking at the confusion matrix could have led us to rethink our model or our goal (as we will see in the following sections), and it could have prevented us from using a useless model.
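As a quick illustration, here is a minimal sketch with scikit-learn; the labels below are made up to mimic the all-negative predictor described above (9,400 negatives, 600 positives):

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels: ~6% positives, and a model that predicts "negative" for every patient
y_true = np.array([1] * 600 + [0] * 9400)   # 1 = positive, 0 = negative
y_pred = np.zeros_like(y_true)              # always predicts negative

print(confusion_matrix(y_true, y_pred))
# zero_division=0 avoids warnings for the undefined precision of the positive class
print(classification_report(y_true, y_pred, zero_division=0))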

Challenges with Imbalanced Data Sets

The conventional model evaluation methods do not accurately measure model performance when faced with imbalanced datasets.

Standard classifier algorithms like decision trees and logistic regression have a bias towards the classes with a larger number of instances. They tend to predict only the majority class. The features of the minority class are treated as noise and are often ignored. Thus, there is a high probability of misclassifying the minority class compared to the majority class.

Imbalanced datasets can be found for different use cases in various domains:

  • Finance: Fraud detection data sets commonly have a fraud rate of ~1–2%.
  • Ad serving: Click prediction data sets also don’t have a high click-through rate.
  • Transportation/Airline: Will an airplane failure occur?
  • Medical: Does a patient have cancer?
  • Content moderation: Does a post contain NSFW content?

Handling Imbalanced Data: Best Practices and Approaches

1. Collect More Data:

A larger dataset might expose a different and perhaps more balanced perspective on the classes.

2. Try Changing Your Performance Metric:

Accuracy is not the metric to use when working with an imbalanced dataset. We have seen that it is misleading.

Look at the following performance measures, which can give more insight into model performance than traditional classification accuracy:

  • Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (which classes the incorrect predictions were assigned to).
  • Precision: A measure of a classifier’s exactness.
  • Recall: A measure of a classifier’s completeness.
  • F1 Score (or F-score): A weighted average of precision and recall.
  • Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
  • Adjust the decision threshold (see the sketch after this list).
  • Adjust misclassification costs.
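For the last two points, a minimal sketch of adjusting the decision threshold (the model, the synthetic data and the 0.3 cut-off below are assumptions chosen purely for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, cohen_kappa_score

X, y = make_classification(weights=[0.94, 0.06], n_samples=5000, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Instead of the default 0.5 threshold, flag a case as positive at p >= 0.3
proba = clf.predict_proba(X)[:, 1]
y_pred = (proba >= 0.3).astype(int)

print(classification_report(y, y_pred, zero_division=0))
print('Cohen kappa:', cohen_kappa_score(y, y_pred))

Lowering the threshold trades precision for recall on the minority class, which is often the right trade-off when missing a positive case is costly.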

3. Cost-sensitive classifiers

These may be used for imbalanced data sets by assigning a high cost to the misclassification of minority class examples.
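In scikit-learn this can be approximated with the class_weight argument; a minimal sketch (the 1:10 cost ratio and the logistic regression model are arbitrary choices for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(weights=[0.9, 0.1], random_state=0)

# Treat a misclassified minority example (label 1) as 10x more costly than a majority one
clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)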

4. Boosting Algorithm

Boosting algorithms such as AdaCost, AdaBoost (available in WEKA), Gradient Boosting and XGBoost can help. XGBoost, for example, offers the scale_pos_weight parameter to balance positive and negative weights (https://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees).
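A minimal sketch with XGBoost (assuming the xgboost package is installed); a common heuristic is to set scale_pos_weight to the ratio of negative to positive examples:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(weights=[0.93, 0.07], n_samples=2000, random_state=0)

# Heuristic: number of negative examples divided by number of positive examples
ratio = np.sum(y == 0) / np.sum(y == 1)

clf = xgb.XGBClassifier(scale_pos_weight=ratio)
clf.fit(X, y)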

5. Weighting of examples

It involves the creation of specific weight vectors in order to improve minority class predictions.

Class-specific weights (the class_weight parameter) are calculated per class, whereas test-case-specific weights (sample_weight) are calculated for each single instance. See https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
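A minimal sketch of per-instance weighting using scikit-learn's compute_sample_weight helper (the decision tree is just an example estimator, not a recommendation):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(weights=[0.9, 0.1], random_state=0)

# 'balanced' assigns each instance a weight inversely proportional to its class frequency
weights = compute_sample_weight(class_weight='balanced', y=y)

clf = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=weights)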

6. Try Different Algorithms

Run a lot of tests with multiple models. Intuition can take you a long way in data science: if your gut tells you that an ensemble of classifiers will give you the best results, go ahead and try it.

7. Use Stratified CV
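Stratified cross-validation keeps the class ratio roughly the same in every fold, so the minority class never disappears from a split. A minimal sketch (the logistic regression model and F1 scoring are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(weights=[0.93, 0.07], n_samples=1000, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# F1 is a more informative scoring choice than accuracy for imbalanced data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring='f1')
print(scores)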

8. Penalized SVM

In SVMs, where it is desirable to give more importance to certain classes or certain individual samples, the parameters class_weight and sample_weight can be used.
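A minimal sketch along the lines of the scikit-learn example linked above (the 1:10 class_weight is an illustrative value, not a tuned one):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(weights=[0.9, 0.1], random_state=0)

# Errors on the minority class (label 1) are penalized ten times more heavily
weighted_svc = SVC(kernel='linear', class_weight={1: 10}).fit(X, y)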

9. Bagging may give interesting results.
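For instance, imbalanced-learn provides a BalancedBaggingClassifier that rebalances each bootstrap sample before fitting the base estimator; a minimal sketch (hyper-parameters below are defaults, not tuned values):

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(weights=[0.9, 0.1], random_state=0)

# Each bag is rebalanced by random under-sampling before the (default decision tree) base estimator is fit
clf = BalancedBaggingClassifier(n_estimators=10, random_state=0).fit(X, y)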

10. Data Level approach

Let’s apply some resampling techniques using the Python library imbalanced-learn. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.

For visualization, let’s create a small imbalanced sample dataset using the make_classification method:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import imblearn
from sklearn.datasets import make_classification

# Generate a 90/10 imbalanced binary classification dataset
X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=100, random_state=10
)

df = pd.DataFrame(X)
df['target'] = y
df.target.value_counts().plot(kind='bar', title='Count (target)');

We will also create a 2-dimensional plotting function, plot_2d_space, to see the data distribution. Because the dataset has many dimensions and our plots will be 2D, we first reduce the data to two dimensions using Principal Component Analysis (PCA).

def plot_2d_space(X, y, label='Classes'):
    colors = ['#1F77B4', '#FF7F0E']
    markers = ['o', 's']
    for l, c, m in zip(np.unique(y), colors, markers):
        plt.scatter(
            X[y == l, 0],
            X[y == l, 1],
            c=c, label=l, marker=m
        )
    plt.title(label)
    plt.legend(loc='upper right')
    plt.show()

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X = pca.fit_transform(X)

plot_2d_space(X, y, 'Imbalanced dataset (2 PCA components)')

Resampling

You can change the dataset that you use to build your predictive model to have more balanced data. This change is called sampling your dataset, and there are two main methods that you can use to even up the classes:

  • Over-sampling: You can add copies of instances from the under-represented class called over-sampling (or more formally sampling with replacement), or
  • Under-sampling: You can delete instances from the over-represented class, called under-sampling.
Sampling for imbalanced data set

A. Under-sampling

It aims to balance the class distribution by randomly eliminating majority class examples until the majority and minority classes are balanced.

Advantages

  • It can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge.

Disadvantages

  • It can discard potentially useful information which could be important for building rule classifiers.
  • The sample chosen by random under-sampling may be biased and not an accurate representation of the population, thereby resulting in inaccurate results on the actual test data set.

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X, y)

plot_2d_space(X_rus, y_rus, 'Random under-sampling')
Random Under Sampling

B. Random over-sampling

It increases the number of instances in the minority class by randomly replicating them, in order to give the minority class a higher representation in the sample.

Advantages

  • Unlike under-sampling, this method leads to no information loss.
  • It often outperforms under-sampling.

Disadvantages

  • It increases the likelihood of overfitting since it replicates the minority class events.
  • It increases the size of the training set and the time to build a classifier.
  • Furthermore, in case of decision tree learning, the decision region for the minority class becomes very specific through the replication of the minority class and this causes new splits in the decision tree, which can lead to overfitting.
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
X_ros, y_ros = ros.fit_resample(X, y)

print(X_ros.shape[0] - X.shape[0], 'new randomly picked points')

plot_2d_space(X_ros, y_ros, 'Random over-sampling')
Random Over Sampling

C. Under-sampling: Tomek links

Tomek links are pairs of very close instances of opposite classes. Removing the majority class instance of each pair increases the space between the two classes, facilitating better classification.

In this algorithm, we end up removing the majority element from the Tomek link, which provides a better decision boundary for a classifier.

from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X, y)

plot_2d_space(X_tl, y_tl, 'Tomek links under-sampling')
TOMEK Links Under Sampling

D. Under-sampling: Cluster Centroids

This technique performs under-sampling by generating centroids based on clustering methods: the data are first grouped by similarity in order to preserve information.

It makes use of K-means to reduce the number of samples.

from imblearn.under_sampling import ClusterCentroids

cc = ClusterCentroids()
X_cc, y_cc = cc.fit_resample(X, y)

plot_2d_space(X_cc, y_cc, 'Cluster Centroids under-sampling')
KMeans Under Sampling

E. Generate Synthetic Samples

A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.

E.1 Over-sampling: SMOTE

SMOTE (Synthetic Minority Oversampling Technique) consists of synthesizing new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class and computing its k nearest minority-class neighbours; synthetic points are then added between the chosen point and its neighbours.

Advantages

  • Mitigates the problem of overfitting caused by random oversampling as synthetic examples are generated rather than replication of instances
  • No loss of useful information

Disadvantages

  • While generating synthetic examples, SMOTE does not take neighbouring examples from other classes into consideration. This can increase the overlap between classes and introduce additional noise.
  • SMOTE is not very effective for high dimensional data
from imblearn.over_sampling import SMOTE

X_resampled, y_resampled = SMOTE().fit_resample(X, y)

plot_2d_space(X_resampled, y_resampled, 'SMOTE over-sampling')
SMOTE Over Sampling

E.2 Adaptive Synthetic Sampling (Borderline-SMOTE and ADA-SYN)

The essential idea of ADASYN is to use a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn compared to those minority examples that are easier to learn.

As a result, the ADASYN approach improves learning with respect to the data distributions in two ways: (1) reducing the bias introduced by the class imbalance, and (2) adaptively shifting the classification decision boundary toward the difficult examples.

from imblearn.over_sampling import ADASYN

X_resampled, y_resampled = ADASYN().fit_resample(X, y)

plot_2d_space(X_resampled, y_resampled, 'ADASYN over-sampling')
ADASYN Over sampling

F. Over-sampling followed by under-sampling

We previously presented SMOTE and showed that this method can generate noisy samples by interpolating new points between marginal outliers and inliers. This issue can be solved by cleaning the space resulting from over-sampling.

In this regard, Tomek links are a cleaning method that works well here.

from imblearn.combine import SMOTETomek

smt = SMOTETomek()
X_smt, y_smt = smt.fit_resample(X, y)

plot_2d_space(X_smt, y_smt, 'SMOTE + Tomek links')

Important Points to Note

  • Both SMOTE and ADASYN use the KNN algorithm to generate new samples.
  • The other SMOTE variants and ADASYN differ from each other in how they select the samples ahead of generating the new ones.
  • SVMSMOTE uses an SVM classifier to find support vectors and generates samples considering them. Note that the C parameter of the SVM classifier allows selecting more or fewer support vectors.
  • KMeansSMOTE applies a K-means clustering step before applying SMOTE. The clustering groups samples together, and new samples are generated depending on the cluster density.
  • All of these algorithms can be used for multi-class as well as binary classification.
  • When dealing with mixed data types, such as continuous and categorical features, none of the presented methods (apart from RandomOverSampler) can handle categorical features. SMOTENC is an extension of the SMOTE algorithm in which categorical features are treated differently (a minimal sketch follows this list).
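A minimal sketch of SMOTENC on a small made-up mixed-type dataset (the data and the column index passed to categorical_features are assumptions for illustration):

import numpy as np
from imblearn.over_sampling import SMOTENC

# Hypothetical data: columns 0-1 are continuous, column 2 is categorical (integer-encoded)
rng = np.random.RandomState(0)
X = np.hstack([rng.randn(100, 2), rng.randint(0, 3, size=(100, 1))])
y = np.array([0] * 90 + [1] * 10)

smote_nc = SMOTENC(categorical_features=[2], random_state=0)
X_res, y_res = smote_nc.fit_resample(X, y)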

11. Integration of sampling and boosting: SMOTEBoost, RUSBoost

SMOTEBoost is an over-sampling method based on the SMOTE algorithm (Synthetic Minority Oversampling Technique). SMOTE uses k-nearest neighbours to create synthetic examples of the minority class; SMOTEBoost then injects the SMOTE method at each boosting iteration. The advantage of this approach is that, while standard boosting gives equal weight to all misclassified data, SMOTEBoost adds more examples of the minority class at each boosting step.

RUSBoost achieves the same goal by performing random undersampling (RUS) at each boosting iteration instead of SMOTE.
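imbalanced-learn ships an implementation of RUSBoost; a minimal sketch (the hyper-parameters are defaults, not tuned values):

from imblearn.ensemble import RUSBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(weights=[0.93, 0.07], n_samples=2000, random_state=0)

# Random under-sampling of the majority class is applied at every boosting iteration
clf = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X, y)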

12. One interesting approach to the imbalance problem is to discard the minority examples during training and treat it as a single-class (or anomaly-detection) problem. Isolation Forests attempt to identify anomalies by learning random trees and measuring the average number of decision splits required to isolate each particular data point. The resulting number can be used to compute an anomaly score for each point, which can also be interpreted as the likelihood that the example belongs to the minority class. The authors tested this approach on highly imbalanced data and reported very good results. Nearest Neighbour Ensembles are a similar idea that overcomes several shortcomings of Isolation Forests.
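A minimal sketch of the anomaly-detection framing with scikit-learn's IsolationForest (the contamination value is an assumption, set to match the expected minority rate):

from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(weights=[0.93, 0.07], n_samples=2000, random_state=0)

# Treat the minority class as anomalies; contamination ~ expected minority fraction
iso = IsolationForest(contamination=0.07, random_state=0).fit(X)

# predict() returns -1 for anomalies (candidate minority cases) and 1 for inliers
pred = iso.predict(X)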

I hope this article helps you understand and apply best practices for handling imbalanced data sets.

Yet nothing is comparable to experience and learning: try different things, and you never know which will actually work for you.

Happy Reading and Keep Learning….
