Surviving in a Random Forest with Imbalanced Datasets
CONTRIBUTORS: Bilal Hussain, Kyoun Huh, Hon Wing Eric Chan, Sakina Patanwala
This blog is written and maintained by students in the Professional Master's Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.
Have you ever ended up with a machine learning model whose overall accuracy is a good 98% in the first go? How is that possible? Is it that easy? I wish. Most likely, that is the case because you are dealing with an imbalanced dataset, and you are not even aware of it. How do you deal with this kind of dataset? Which algorithm do you use and why? What if you are instructed to only use one algorithm in particular, and it happens to be a Random Forest Classifier?! Well, we have your back here.
Encountering imbalanced datasets in real-world machine learning problems is a norm, but what exactly is an imbalanced dataset?
Let us understand that with an example. Fraud detection is a recurring problem that people try to solve with machine learning. However, when training a fraud model, we discover that legitimate transactions far outnumber fraudulent ones, creating an imbalance in the dataset. Imbalance means that the number of data points for the different classes in the dataset differs. If there is a 1:9 imbalance ratio (IR) between the classes, the imbalance is high and badly affects the model.
All this hype about a random forest algorithm, but what exactly is it?
The random forest algorithm is one of the most popular and potent supervised machine learning algorithms, capable of performing both classification and regression tasks. The algorithm builds a forest of several decision trees; the more trees in the forest, the more robust the prediction and hence the higher the accuracy. Each tree is constructed with the usual decision-tree procedure, using a splitting criterion such as information gain or the Gini index. To classify a new object, the forest grows multiple trees instead of a single one: based on the object's attributes, each tree casts a vote for a class, and the forest chooses the class with the most votes as the prediction. In the case of regression, the forest takes the average of the outputs of the different trees.
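To make this concrete, here is a minimal, illustrative scikit-learn sketch (not part of the insurance analysis itself); the toy data and hyperparameters are assumptions chosen purely for demonstration.

```python
# Minimal sketch of a random forest classifier: each tree votes, majority wins.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data purely for illustration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,   # number of trees in the forest
    criterion="gini",   # splitting criterion; "entropy" uses information gain
    random_state=42,
)
forest.fit(X, y)
print(forest.predict(X[:5]))  # majority vote across the 100 trees

# For regression, RandomForestRegressor works the same way but averages
# the outputs of the individual trees instead of taking a vote.
```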
But why use imbalanced datasets along with a random forest classifier in the first place?
Most classification problems are imbalanced. Standard classification algorithms therefore do not work well, as they try to minimize the overall error rate rather than focus on the minority class, producing biased classifications. Random forest is well suited to dealing with extreme imbalance for two main reasons. Firstly, the ability to incorporate class weights into the random forest classifier makes it cost-sensitive; it penalizes misclassifications of the minority class more heavily. Secondly, it can combine sampling techniques with ensemble learning, downsampling the majority class and growing each tree on a more balanced sample.
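As a rough illustration of the first point, scikit-learn's RandomForestClassifier exposes a class_weight parameter; the values below are illustrative assumptions, not tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier

# "balanced" reweights classes inversely proportional to their frequencies;
# "balanced_subsample" recomputes those weights on each tree's bootstrap sample.
# An explicit dict such as {0: 1, 1: 10} also works for manual control.
weighted_rf = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced_subsample",
    random_state=42,
)
```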
There are several other advantages of using a random forest classifier with imbalanced datasets. The algorithm is a strong modelling technique and far more robust than a single decision tree. Aggregating several trees limits overfitting and errors due to bias and, in return, yields useful results. A random forest classifier handles missing values and maintains accuracy even when a large proportion of the data is missing, and it can handle large datasets with high dimensionality.
How about using and getting familiar with a real-world imbalanced dataset?
Let us use a real-life imbalanced dataset from an insurance company that provides health insurance to its customers. The goal is to predict whether an existing customer would be interested in purchasing vehicle insurance. The data contains information about customers' demographics, vehicles, and policies. The original dataset has an imbalance ratio of 1:5 between the target labels. After initial cleaning and preprocessing, we downsampled the data to increase the imbalance ratio to approximately 1:10.
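For reference, here is a rough sketch of how such a downsampling step might look. The file name ('aug_train.csv', from the Kaggle source) and the target column name ('Response') are assumptions, and the preprocessing in the original analysis may differ.

```python
import pandas as pd

df = pd.read_csv("aug_train.csv")  # assumed file name from the Kaggle dataset

majority = df[df["Response"] == 0]  # not interested in vehicle insurance
minority = df[df["Response"] == 1]  # interested (the minority class)

# Downsample the minority class so the ratio is roughly 1:10.
minority_down = minority.sample(n=len(majority) // 10, random_state=42)
data = pd.concat([majority, minority_down]).sample(frac=1, random_state=42)

print(data["Response"].value_counts(normalize=True))
```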
It is evident from the above bar chart that there is a high imbalance in the dataset as 91% of people are not interested in purchasing vehicle insurance.
Oh, to be able to find the most fitting evaluation for a machine learning model!!! Oh, to use K-Fold Cross Validation or Stratified K-Fold Cross Validation?!
Evaluating a classification model is challenging because there is no way to tell whether a model is a good fit until it is used, so estimating its performance on the data already available is vital. Commonly, if the dataset is large enough, a train/test split can be used, provided both splits contain the target classes in roughly the same proportions. Realistically, however, we rarely have datasets large enough for a simple train/test split to be reliable. To address this, we resort to resampling techniques like K-Fold Cross-Validation.
However, K-Fold Cross Validation is not suitable for imbalanced data because it divides the data into k folds at random: some folds may end up with negligible or no data from the minority class, resulting in a highly biased model.
The solution is stratified sampling, which still splits the data randomly but keeps the same (imbalanced) class distribution in each subset. This modified version of K-Fold, Stratified K-Fold Cross Validation, ensures that each split matches the class distribution of the complete training dataset.
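A minimal sketch of what this looks like in practice, assuming X is the feature matrix and y is a NumPy array of 0/1 labels from the insurance data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each fold keeps the same minority/majority proportions as the full dataset.
    print(f"Fold {fold}: train counts {np.bincount(y[train_idx])}, "
          f"test counts {np.bincount(y[test_idx])}")
```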
Building the model is still doable, but how do we evaluate if we are training and testing it correctly?
To evaluate the model’s performance on the imbalanced dataset, we use some commonly used metrics like confusion matrix, precision, recall, f1-score, and PRC (Precision-Recall Curve).
A confusion matrix is a table that summarizes a predictive model's performance: which classes are predicted correctly, which incorrectly, and what kinds of errors are being made. The matrix compares the actual target values with those predicted by the model.
The precision of a class characterizes how certain the model is when it claims a point belongs to that class. The recall of a class expresses how well the model can identify that class. The f1-score of a class, 2 × precision × recall / (precision + recall), is the harmonic mean of the class's precision and recall, combining both into a single metric. The PRC is a plot of precision vs. recall at various threshold values and concentrates on the classifier's performance on the smaller class.
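A short sketch of computing these metrics with scikit-learn; 'model', 'X_test', and 'y_test' are placeholders for a fitted classifier and a held-out split, not names from the original code.

```python
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, precision_recall_curve)

y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))        # rows: actual, columns: predicted
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1-score: ", f1_score(y_test, y_pred))  # 2*P*R / (P + R)

# Precision and recall at every probability threshold, for the PRC plot.
probs = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)
```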
Choosing the right metrics can be confusing when it comes to imbalanced data, and it is essential to choose the right ones for model evaluation. For further clarification, let us discuss why AUROC (Area Under the Receiver Operating Characteristic curve) and accuracy are not suitable metrics for imbalanced datasets.
AUROC is a scalar value between 0 and 1: the area under the ROC curve, which plots the True Positive Rate against the False Positive Rate at various threshold values. However, class imbalance strongly affects this value, especially when the minority class is small, so AUROC tends to be overly optimistic and is not suitable for imbalanced data.
Overall accuracy is also not a useful metric, because a trivial classifier that predicts every point as the majority class can achieve very high accuracy while being completely biased towards the larger class.
So which algorithms, in particular, are rescuing us from the imbalance these classes are causing? Here are three random forest models that we will analyze and implement for maneuvering around the disproportions between classes:
1. Standard Random Forest (SRF)
As discussed earlier, a random forest consists of numerous decision trees. Each decision tree in the forest is grown from a randomly drawn bootstrap sample, and the trees' predictions are aggregated, a procedure called 'bagging.' This introduces diversity among the trees, which substantially improves the forest's accuracy. Let us get to coding this model, shall we?
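Below is a sketch of how the SRF baseline and its stratified cross-validated scores can be computed; X and y stand for the preprocessed insurance features and target, and the hyperparameters are illustrative rather than the exact settings from the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

srf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Average precision, recall, and f1 over the stratified folds.
scores = cross_validate(srf, X, y, cv=cv, scoring=["precision", "recall", "f1"])
print("precision:", np.mean(scores["test_precision"]))
print("recall:   ", np.mean(scores["test_recall"]))
print("f1-score: ", np.mean(scores["test_f1"]))
```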
We applied Stratified K-Fold Cross Validation to evaluate the model, averaging the f1-score, recall, and precision across the folds. Stratified K-Fold Cross Validation also reduces the variance in the results caused by the randomness inherent in the algorithm. We then focus on achieving the right balance between recall and precision when comparing the following models.
For SRF, we get a 0.102 and 0.365 score for recall and precision, respectively. We see that SRF with the dataset has a low false positive rate and relatively high false negative rate. A recall score of 0.102 shows that SRF has failed to predict ~90% of the minority class (customers who will buy the insurance). The 0.365 precision illustrates that 36.5% of predictions of the minority class are correct. SRF does not perform well in general as the f1-score is as low as 0.160. Now let us dig deeper into the confusion matrix result to see where the problem lies.
Inspecting the confusion matrix, most of the predictions are false negatives and true negatives. The model is strongly biased towards predicting the majority class, since it has far more training samples than the minority class.
This is expected: like most commonly used classification algorithms, the random forest classifier is designed to minimize the overall error rate, so we cannot expect a decent result on the minority class with an imbalanced dataset.
How do we overcome this shortcoming? Hi, Balanced Random Forest!
2. Balanced Random Forest (BRF)
SRF has its limitations with imbalanced classes because it uses a bootstrap sample of the training set to grow each tree. With imbalanced data, the likelihood of a bootstrap sample containing few or no minority class instances increases notably, resulting in a model that is a poor predictor of the minority class. To overcome this limitation, it is crucial to make the class priors equal, either by downsampling or by oversampling. BRF does this by iteratively drawing a bootstrap sample with equal proportions of data points from the minority and the majority class. The code below shows the implementation of BRF for the imbalanced insurance dataset.
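This is a sketch of the BRF setup under the same assumptions as the SRF code above (placeholder X and y, illustrative hyperparameters); it requires the imbalanced-learn package.

```python
import numpy as np
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Each tree's bootstrap sample is balanced by downsampling the majority class.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(brf, X, y, cv=cv, scoring=["precision", "recall", "f1"])
print("precision:", np.mean(scores["test_precision"]))
print("recall:   ", np.mean(scores["test_recall"]))
print("f1-score: ", np.mean(scores["test_f1"]))
```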
Here, we used the ‘BalancedRandomForestClassifier’ class from the ‘imbalanced-learn’ library to down-sample the majority class.
Using BRF, we observe an improvement in the f1-score of around 0.34, and the recall has improved by about 0.8. The model now actually predicts customers in the minority class! Wait a minute. The precision dropped by 0.105?
From the confusion matrix, we notice that the number of false positives also increased by over 1000, implying that the model needs further improvement. Unfortunately, BRF was not very beneficial in addressing the class imbalance issue with this dataset in particular. Now let us get our hands even dirtier and try the SMOTE method with a random forest algorithm.
3. SMOTE (Synthetic Minority Oversampling Technique) using Standard Random Forest
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique that generates synthetic samples from the minority class, thereby avoiding the overfitting problem raised by random oversampling. SMOTE works by generating instances that are close in feature space, using interpolation between positive cases that lie near each other. It randomly selects a minority class instance and finds its nearest minority class neighbours. It then creates synthetic samples by randomly choosing one of those neighbours, forming a line segment between the two instances in feature space, and generating new points as convex combinations of the two. It is time to see SMOTE in action using SRF.
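Here is a sketch of SMOTE combined with a Standard Random Forest, again under the assumed X, y, and hyperparameters. The imbalanced-learn Pipeline applies SMOTE only to the training folds inside cross-validation, which avoids leaking synthetic points into the test folds.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

smote_srf = Pipeline([
    ("smote", SMOTE(random_state=42)),  # oversample the minority class
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(smote_srf, X, y, cv=cv, scoring=["precision", "recall", "f1"])
print("precision:", np.mean(scores["test_precision"]))
print("recall:   ", np.mean(scores["test_recall"]))
print("f1-score: ", np.mean(scores["test_f1"]))
```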
After using SMOTE to oversample the minority class, we trained our model in the same way as the Standard Random Forest. An f1-score of 0.939 is achieved, which is a good outcome because it reflects high values for both recall and precision. Let us check out the confusion matrix to see how the implementation of SMOTE has influenced it:
From the confusion matrix, we see that the true positives have improved radically: their number increased by 522 (from 72 to 594) without increasing the number of false positives. Unlike BRF, SMOTE creates a sufficient number of synthetic data points for the minority class (customers who will buy the insurance) before training the SRF. Therefore, it does not increase the number of false positives and produces a far less biased model.
Precision-Recall Curves? A good enough estimator for imbalanced classification?
Now, let us make a direct comparison between the three models above on the same imbalanced insurance dataset. Firstly, what is the blue dashed line at the bottom labelled 'no-skill'? It represents a classifier that cannot distinguish between the classes and predicts a class at random; its value depends on the proportion of positive cases in the dataset, so the no-skill line changes with the class distribution. The precision-recall curves show that SRF and BRF perform similarly, since their curves clearly overlap, with BRF getting slightly better precision for recall values between roughly 0.6 and 0.8. SMOTE with the Standard Random Forest has the best skill for the imbalanced dataset, as its precision-recall curve shows it makes almost no prediction errors for the positive class.
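For completeness, a sketch of how such a comparison plot can be produced; srf, brf, and smote_srf are assumed to be fitted versions of the models above, and X_test / y_test a held-out stratified split.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

plt.figure()
for name, model in [("SRF", srf), ("BRF", brf), ("SMOTE + SRF", smote_srf)]:
    probs = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, probs)
    plt.plot(recall, precision, label=name)

# No-skill baseline: precision equals the proportion of positive cases.
no_skill = float(np.mean(y_test))
plt.plot([0, 1], [no_skill, no_skill], linestyle="--", label="no-skill")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```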
Final thoughts?!
Overall, we have talked about the random forest algorithm, metrics for evaluating models on imbalanced datasets, and different models for dealing with an imbalanced dataset. The Standard Random Forest is not a suitable model when it comes to an imbalanced dataset. The Balanced Random Forest improved the prediction of the minority class but also increased the false positive rate. In the end, using SMOTE along with a Standard Random Forest gave us the best result among all the methods. Besides SMOTE, there are other tactics one can explore, but it all depends on the dataset and the use case.
In the world of imbalanced data, we hope to let Random Forest Classifiers pave the way for you and help you harness the algorithm’s true potential. Random forest is an exceptionally good algorithm to work with; knowing its usefulness with imbalanced data is undoubtedly an excellent skill to have for a data science enthusiast.
Thank you for your time to read! You can find the full code and execution steps with more details on Github. Looking forward to hearing feedback and discussing more in the responses. Feel free to connect with us on LinkedIn! :)