Imbalanced Data
Some techniques to manage imbalanced data in Python
Introduction
More often than not, we are going to deal with data sets that have an unequal distribution between the classes, which is also known as imbalanced data. Why should we pay close attention to this? Because when we try to apply a classification model to this kind of data, the presence of underrepresented classes can hurt the performance of the classifier. In this article we are going to evaluate some of the methods used to deal with imbalanced data, such as evaluation metrics, sampling techniques and cost-sensitive methods.
By evaluation metrics we refer to methods that assess how well the model classifies the samples. The metrics we are going to apply and describe in more detail below are: accuracy, precision, recall, F1 score, G-mean and Matthews correlation coefficient.
On the other hand, by sampling methods we refer to techniques that modify the structure of the data by balancing the distribution of the classes. In other words, most of these techniques artificially increase the number of minority samples (over-sampling) or decrease the number of majority samples (under-sampling). The methods we are going to apply are: SMOTE, ADASYN, Neighbourhood Cleaning Rule, One Sided Selection, SMOTEENN and SMOTE+Tomek.
Last but not least, instead of modifying the class distribution of the training data, we are going to see the effect of changing how heavily a classifier is penalized each time it misclassifies a sample from the underrepresented class. This is also known as cost-sensitive learning: the model assigns a larger penalization weight to minority-class samples that are misclassified.
Data
The data set used is the Banking Marketing Data that can be found in UCI, and the full code used in this article is available on my GitHub.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_validate
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.metrics import make_scorer, fbeta_score, matthews_corrcoef
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline
from imblearn.metrics import geometric_mean_score

# Sampling methods
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import NeighbourhoodCleaningRule, OneSidedSelection
from imblearn.combine import SMOTEENN, SMOTETomek

data = pd.read_csv(r'bank-additional\bank-additional\bank-additional-full.csv', sep=";")
train, test = train_test_split(data, test_size=0.2, random_state=0, stratify=data[['y']])

# As suggested in https://archive.ics.uci.edu/ml/datasets/bank+marketing, the column "duration" is dropped.
train = train.drop(columns='duration')
test = test.drop(columns='duration')

train_x = train.iloc[:, 0:-1]
train_y = pd.DataFrame(train['y'])
train_y.y.replace('no', 0, inplace=True)
train_y.y.replace('yes', 1, inplace=True)

test_x = test.iloc[:, 0:-1]
test_y = pd.DataFrame(test['y'])
test_y.y.replace('no', 0, inplace=True)
test_y.y.replace('yes', 1, inplace=True)

numerical_ix = train_x.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = train_x.select_dtypes(include=['object', 'bool']).columns

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)

scoring = {'Acc': 'accuracy',
           'F1': 'f1',
           'Prec': 'precision',
           'Recall': 'recall',
           'MC': make_scorer(matthews_corrcoef),
           'GM': make_scorer(geometric_mean_score)}

rf = RandomForestClassifier(random_state=0)
Evaluation metrics
One of the metrics most used when working with classification problems is the accuracy, which is equal to the fraction of correct predictions over all predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN).
However, when we are dealing with imbalanced data, using accuracy to evaluate the performance of a classifier can be misleading because it provides an overoptimistic estimation of the model’s ability on the majority class.
To demonstrate this statement let’s visualize the confusion matrix after predicting the classes of the imbalanced dataset:
t0 = [('cat', OneHotEncoder(handle_unknown="ignore"), categorical_ix)]
col_transform0 = ColumnTransformer(transformers=t0)

pipeline = Pipeline(steps=[('prep', col_transform0), ('m', rf)])
cv_results_none = cross_validate(pipeline, train_x, train_y.values.ravel(), scoring=scoring, cv=kfold)

pipeline.fit(train_x, train_y.values.ravel())
predict_y = pipeline.predict(test_x)
plot_confusion_matrix(pipeline, test_x, test_y, display_labels=["No", "Yes"], cmap=plt.cm.Blues);
By looking at the confusion matrix above, we see that 96% of the majority class is correctly classified while 77% of the minority class is misclassified; yet the total accuracy is 88.68%.
Due to the drawbacks of accuracy, there are other popular evaluation metrics used when dealing with imbalanced data:
1. Precision: the percentage of correct predictions for the positive class. It is calculated as the number of correctly predicted positive examples divided by the total number of examples predicted as positive.
2. Recall: the percentage of actual positive examples that are correctly predicted. It is calculated as the number of correctly predicted positive examples divided by the total number of actual positive examples.
3. F1 score: it is calculated as the harmonic mean of precision and recall, giving each the same weighting. It allows a model to be evaluated taking both the precision and recall into account using a single score.
4. Geometric mean: this measure tries to maximize the accuracy on each of the classes while keeping these accuracies balanced. It is the square root of the product of the sensitivity (recall) and the specificity (true negative rate).
5. Matthews correlation coefficient: a correlation coefficient between the observed and predicted labels whose value lies between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 is no better than random prediction, and −1 indicates total disagreement between prediction and observation.
Methods/Techniques:
Below you will see a description of each method used in this article to deal with the imbalanced data set and the result of the evaluation metrics applied:
1. Classification with the original data set (imbalanced data): First of all, we are going to compute the evaluation metrics using the original data:
results = pd.DataFrame(columns=scoring.keys())
results.loc['No sampling']= [np.mean(cv_results_none['test_{}'.format(score)]) for score in scoring.keys()]
results
2. Synthetic Minority Oversampling Technique (SMOTE): This is an over-sampling technique which increases the number of minority-class samples by creating artificial examples using the K Nearest Neighbors (KNN) algorithm. For each minority sample, the method computes its K nearest minority neighbors (the default K is 5) and randomly chooses one of them. It then takes the difference between the sample under consideration and the chosen neighbor, multiplies this difference by a random number between 0 and 1, and adds the result to the original sample to produce a new synthetic example.
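The interpolation step described above can be sketched with plain NumPy (the 2-D minority points are made-up toy coordinates, and k is reduced to 2 for brevity; this is a sketch of the idea, not imblearn's implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy 2-D minority samples, purely for illustration.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 2.5]])

# Nearest minority neighbours of each sample (the first neighbour is the point itself).
nn = NearestNeighbors(n_neighbors=3).fit(minority)
_, idx = nn.kneighbors(minority)

synthetic = []
for i, neighbours in enumerate(idx):
    j = rng.choice(neighbours[1:])      # pick a random real neighbour
    gap = rng.random()                  # random number in [0, 1)
    # New point lies on the segment between the sample and its neighbour.
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))

synthetic = np.array(synthetic)
print(synthetic)
```

Each synthetic point is a convex combination of a real minority sample and one of its neighbours, so the new points always fall inside the region already occupied by the minority class.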
t = [('cat', OneHotEncoder(handle_unknown = "ignore"), categorical_ix), ('num', StandardScaler(), numerical_ix)]
col_transform = ColumnTransformer(transformers=t)

pipeline = Pipeline(steps=[('prep', col_transform), ('imbalance', SMOTE()), ('m', rf)])
cv_results_smote = cross_validate(pipeline, train_x, train_y.values.ravel(), scoring=scoring, cv=kfold)

results = pd.DataFrame(columns=scoring.keys())
results.loc['SMOTE']= [np.mean(cv_results_smote['test_{}'.format(score)]) for score in scoring.keys()]
results
3. Adaptive Synthetic (ADASYN): This is another oversampling technique that uses KNN. However, instead of creating an equal number of synthetic samples as SMOTE does, ADASYN computes a density distribution function and uses it as a criterion to decide the number of artificial samples that need to be generated for each minority sample of the original data. In other words, it applies a weighted distribution for different minority class examples according to their level of difficulty in learning compared to other minority samples.
pipeline = Pipeline(steps=[('prep', col_transform),('imbalance',ADASYN()) ,('m',rf)])
cv_results_adasyn = cross_validate(pipeline, train_x, train_y.values.ravel(), scoring=scoring, cv=kfold)

results = pd.DataFrame(columns=scoring.keys())
results.loc['ADASYN']= [np.mean(cv_results_adasyn['test_{}'.format(score)]) for score in scoring.keys()]
results
4. Neighbourhood Cleaning Rule: This is an under-sampling method that computes KNN for both the majority and minority classes. It drops an example of the majority class when its K nearest neighbors (the default K is 3) belong to the minority class. Furthermore, if the K neighbors computed for a minority sample belong to the majority class, those majority-class neighbors are removed.
pipeline = Pipeline(steps=[('prep', col_transform),('imbalance',NeighbourhoodCleaningRule()) ,('m',rf)])
cv_results_ncr = cross_validate(pipeline, train_x, train_y.values.ravel(), scoring=scoring, cv=kfold)

results = pd.DataFrame(columns=scoring.keys())
results.loc['NCR']= [np.mean(cv_results_ncr['test_{}'.format(score)]) for score in scoring.keys()]
results
5. One Sided Selection: This method combines two under-sampling techniques, Tomek Links and the Condensed Nearest Neighbor Rule, in order to remove majority-class samples that are either too close to or too far away from the decision borderline.
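The Tomek-link half of this method can be sketched in a few lines: a Tomek link is a pair of opposite-class points that are each other's nearest neighbour. The toy coordinates below are assumptions for illustration, not the banking data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy labelled points: class 0 is the majority, class 1 the minority.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.05, 1.0], [3.0, 3.0]])
y = np.array([0, 0, 0, 1, 0])

# Nearest neighbour of every point (column 0 is the point itself).
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)
nearest = idx[:, 1]

# Tomek link: opposite classes AND mutual nearest neighbours (i < j avoids duplicates).
links = [(i, int(nearest[i])) for i in range(len(X))
         if y[i] != y[nearest[i]] and nearest[nearest[i]] == i and i < nearest[i]]
print(links)  # the borderline pair of points 2 and 3
```

In the under-sampling step, the majority-class member of each such pair (here point 2) would be removed, cleaning the borderline.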
pipeline = Pipeline(steps=[('prep', col_transform),('imbalance',OneSidedSelection()) ,('m',rf)])
cv_results_oss = cross_validate(pipeline, train_x, train_y.values.ravel(), scoring=scoring, cv=kfold)

results = pd.DataFrame(columns=scoring.keys())
results.loc['OSS']= [np.mean(cv_results_oss['test_{}'.format(score)]) for score in scoring.keys()]
results
6. SMOTEENN: This technique combines an over-sampling and an under-sampling method. It begins by over-sampling with SMOTE and then applies the under-sampling technique called Edited Nearest Neighbor Rule (ENN), which removes any example whose class label differs from the class of at least two of its three nearest neighbors.
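The ENN cleaning rule can be sketched on a toy 1-D data set (the coordinates and labels below are made up for illustration; a stray class-0 point is planted inside a class-1 cluster):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [5.05]])
y = np.array([0, 0, 0, 1, 1, 1, 0])   # point 6 is a class-0 intruder near 5.05

nn = NearestNeighbors(n_neighbors=4).fit(X)   # self + 3 neighbours
_, idx = nn.kneighbors(X)

keep = []
for i, neighbours in enumerate(idx):
    votes = y[neighbours[1:]]                 # classes of the 3 nearest neighbours
    # ENN removes a sample when at least two of its three neighbours disagree with it.
    if np.sum(votes != y[i]) < 2:
        keep.append(i)

print(keep)  # point 6, surrounded by class-1 neighbours, is dropped
```

After SMOTE has generated synthetic minority points, this same rule prunes samples (of either class) that end up deep inside the opposite class, which is why SMOTEENN tends to produce cleaner class regions than SMOTE alone.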
pipeline = Pipeline(steps=[('prep', col_transform),('imbalance',SMOTEENN()) ,('m',rf)])
cv_results_smeenn = cross_validate(pipeline, train_x, train_y.values.ravel(), scoring=scoring, cv=kfold)

results = pd.DataFrame(columns=scoring.keys())
results.loc['SMEENN']= [np.mean(cv_results_smeenn['test_{}'.format(score)]) for score in scoring.keys()]
results
7. SMOTE + TOMEK: This is another combination of an over-sampling method, in this case SMOTE, and an under-sampling method, Tomek Links. After creating synthetic samples for the minority class, it removes the samples of the majority class that are too close to the borderline or to the minority samples.
pipeline = Pipeline(steps=[('prep', col_transform), ('imbalance', SMOTETomek()), ('m', rf)])
cv_results_smt = cross_validate(pipeline, train_x, train_y.values.ravel(), scoring=scoring, cv=kfold)

results = pd.DataFrame(columns=scoring.keys())
results.loc['SMT'] = [np.mean(cv_results_smt['test_{}'.format(score)]) for score in scoring.keys()]
results
8. Cost sensitive learning: Another approach to dealing with imbalanced data is changing the structure of the classifier itself in order to make it suitable for this kind of data. Since the aim of a model is to minimize the error, cost-sensitive learning introduces a higher cost for the misclassification of samples from the positive class (minority) with respect to the negative ones (majority).
In the code below you will see the hyperparameter class_weight='balanced'. The term "balanced" means that the model will adjust the weights of the values in "y" inversely proportional to their frequencies: n_samples / (n_classes * np.bincount(y)).
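To see what that formula produces, here is the heuristic applied to a toy label vector (9 negatives, 1 positive; the labels are made up for illustration):

```python
import numpy as np

y = np.array([0] * 9 + [1])          # toy labels: 9 majority, 1 minority
n_samples, n_classes = len(y), 2

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * np.bincount(y))
weights = n_samples / (n_classes * np.bincount(y))
print(weights)  # [0.5555... 5.0] — the minority class is weighted 9x heavier
```

Errors on the minority class therefore contribute nine times more to the loss than errors on the majority class, pushing the forest to pay attention to the rare "yes" labels.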
pipeline = Pipeline(steps=[('prep', col_transform0),('m',RandomForestClassifier(class_weight='balanced'))])
cv_results_cost = cross_validate(pipeline, train_x, train_y.values.ravel(), scoring=scoring, cv=kfold)

results = pd.DataFrame(columns=scoring.keys())
results.loc['cost_sen']= [np.mean(cv_results_cost['test_{}'.format(score)]) for score in scoring.keys()]
results
Final comparison among methods
In the table below you will see the evaluation metrics for all the methods explained above:
cv_result = [cv_results_none, cv_results_smote, cv_results_adasyn, cv_results_ncr,
             cv_results_oss, cv_results_smeenn, cv_results_smt, cv_results_cost]
methods = ["No sampling", "smote", "adasyn", "ncr", "oss", "smeenn", "smt", "Cost sensitive"]

results = pd.DataFrame(columns=scoring.keys())
for i, j in zip(methods, cv_result):
    results.loc[i] = [np.mean(j['test_{}'.format(score)]) for score in scoring.keys()]
results
Even though One Sided Selection produced the best accuracy score (89.11%), we can conclude that the method that best handled this imbalanced data set was SMOTEENN, since it achieved the highest F1 score (47.94%), the highest Matthews correlation coefficient (40.88%) and the highest geometric mean (72.56%).
This was just a simple introduction to some of the methods used to deal with imbalanced data. However, it is important to remember that there is no single ideal method for a machine learning problem, since it is mostly about trial and error. I really hope that you enjoyed and learned new things by reading this article as much as I did.
Reference
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). Learning from Imbalanced Data Sets. Springer International Publishing.
Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter, 6(1), 20–29.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence) (pp. 1322–1328). IEEE.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321–357.