Addressing the Data Imbalance Problem in Healthcare
McKinsey estimates that $250B of US healthcare spending could go toward digitization. That is roughly 20% of all estimated Medicare, Medicaid, and commercial outpatient, office, and home health spending for 2020.
Although only some 7% of US healthcare and pharma companies have gone digital, the number of data records has already exploded: EHRs, physician referral documents, discharge summaries, and other clinical information. In response, healthcare service providers have begun investing significantly in data analytics and cognitive computing to put this digital data to work and provide personalized, effective patient care.
For example, predicting hospitalization and ICU readmission, and extracting clinical parameters from medical practitioners’ notes, are some of the use cases aimed at providing better healthcare. These use cases are typically ML classification problems, and such classification tasks often suffer from data imbalance.
Data Imbalance refers to a classification predictive modeling problem where the number of examples in the training dataset for each class label is not balanced. That is, where the class distribution is not equal or close to equal and is instead biased or skewed.
In plain English: consider a multispecialty hospital. It will naturally have far more records for common ailments such as colds and flu, allergies, and fever induced by lower respiratory infections than for rare ailments such as cardiac amyloidosis. When data scientists begin processing this data, they will therefore have far more records for the common ailments than for the rare ones. This is what the data science community calls data imbalance. The future is going to belong to those healthcare service providers who overcome this bias.
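A quick way to see how skewed a dataset is: count the records per class. The sketch below is only illustrative; the labels and the helper name `class_distribution` are made up for this example.

```python
from collections import Counter

def class_distribution(labels):
    """Return each class's share of the records and the majority-to-minority ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio

# Hypothetical toy labels: 95 records of a common ailment, 5 of a rare one.
labels = ["common"] * 95 + ["rare"] * 5
shares, ratio = class_distribution(labels)
print(shares)  # {'common': 0.95, 'rare': 0.05}
print(ratio)   # 19.0
```

A ratio close to 1 means the classes are balanced; the larger the ratio, the more a model is tempted to favor the majority class.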
We come across this problem very frequently while working on healthcare projects, and we use several techniques to deal with it. In this blog post, I have listed and explained the data science techniques we use, along with their advantages and disadvantages.
A model trained on imbalanced classes makes biased predictions that lean towards the majority class. Therefore, we will look at strategies to overcome this bias.
In random undersampling, majority class samples are randomly removed from the dataset until a favorable distribution of the majority and minority classes is achieved. The best ratio of majority to minority samples is found by trial and error.
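Random undersampling can be sketched in a few lines of plain Python. This is a minimal illustration on made-up data, not the code used in our projects; the function name `random_undersample` and the `keep_fraction` parameter are assumptions for this example.

```python
import random

def random_undersample(rows, labels, majority_label, keep_fraction, seed=0):
    """Randomly keep only `keep_fraction` of the majority-class rows; keep every other row."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    other_idx = [i for i, y in enumerate(labels) if y != majority_label]
    n_keep = int(len(majority_idx) * keep_fraction)
    kept = sorted(rng.sample(majority_idx, n_keep) + other_idx)  # preserve row order
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Hypothetical toy data: 90 majority-class rows (0) and 10 minority-class rows (1).
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10

# Trial and error: try several majority proportions and compare the class counts.
for keep_fraction in (1.0, 0.5, 0.25):
    _, y_res = random_undersample(rows, labels, majority_label=0, keep_fraction=keep_fraction)
    print(keep_fraction, y_res.count(0), y_res.count(1))
```

In practice, each candidate proportion would be used to retrain the model, and the proportion giving the best validation metrics would be kept.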
For one of our healthcare clients, the business objective was to predict which patients were prone to ICU readmission, which would directly improve workforce planning and optimize operations. Based on the ICU patient data, an AI model was built and the readmission rate was predicted. The accuracy of the model was 98% with the data as-is. However, the sensitivity was only 35%, which indicated that the model was missing most of the true positives (the patients who were actually readmitted). So we used random undersampling, undersampled the majority class (i.e., the data of patients who were not readmitted) by 50%, and rebuilt the AI model. The sensitivity of the model increased from 35% to 60%.
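High accuracy combined with low sensitivity is typical of imbalanced data. The small sketch below uses hypothetical confusion-matrix counts, chosen only to reproduce those two percentages (they are not the client's data), to show how both numbers can hold at once.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(tp, fn):
    """Share of actual positives that the model catches (also called recall)."""
    return tp / (tp + fn)

# Hypothetical counts: 1,000 patients, of whom only 20 were readmitted.
tp, fn = 7, 13    # readmitted patients: caught vs. missed
fp, tn = 7, 973   # non-readmitted patients: wrongly flagged vs. correctly cleared

print(accuracy(tp, fp, tn, fn))  # 0.98
print(sensitivity(tp, fn))       # 0.35
```

Because the 980 non-readmitted patients dominate the total, the model can miss 13 of the 20 readmissions and still look 98% accurate.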
Advantage: Random undersampling speeds up model training and uses less memory, because the training set is smaller. This is handy when the dataset is huge.
Disadvantage: Discarding many majority class samples can be highly problematic, because the lost data can make the decision boundary between minority and majority instances harder to learn, resulting in a loss of classification performance.
In random oversampling, the samples belonging to the majority class are retained as-is and the samples belonging to the minority class are duplicated several times. This gives the classes a more uniform distribution in the dataset, which is then used to train the model.
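Random oversampling by duplication can be sketched as follows. The data and the function name `random_oversample` are illustrative assumptions; this version simply grows every smaller class to the size of the largest one.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until it matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)  # originals are kept as-is
    for label, count in counts.items():
        pool = [row for row, y in zip(rows, labels) if y == label]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Hypothetical toy data with three classes of very different sizes.
rows = [[i] for i in range(60)]
labels = ["region_a"] * 40 + ["region_b"] * 15 + ["region_c"] * 5
X_res, y_res = random_oversample(rows, labels)
print(Counter(y_res))
```

Every added row is an exact copy of an existing one, which is why this approach loses no information but can encourage the model to memorize the duplicated records.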
One of our healthcare clients needed to identify diseases from patients’ health data. An exploratory analysis showed that the dataset contained diagnoses for 12 anatomical regions of the human body and that the diagnoses were highly skewed towards 3 of those regions. There were very few records for the remaining 9 regions, so the distribution of the data was far from uniform. We therefore decided to oversample the minority classes. We retained the records of the 3 dominant subclasses as-is and increased the number of records in the remaining subclasses so that the data was balanced across the subclasses.
With oversampling, the sensitivity increased from 42% for the model built on the data as-is to 75% for the model built on the oversampled data.
The advantage of this strategy is that it leads to no information loss, and it often outperforms the undersampling strategy.
The disadvantage is that the model can start memorizing the duplicated records, which increases the likelihood of overfitting and can degrade predictions on new data.
Synthetic Minority Oversampling Technique (SMOTE)
To overcome the disadvantage of the oversampling technique, a strategy called SMOTE was devised. In this method, samples from the minority class are taken as examples, and each synthetic sample is created by interpolating between a minority sample and one of its nearest minority class neighbors. These synthetic samples are then included in the dataset for the model to train on.
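The core idea of SMOTE can be sketched in plain Python: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the line between them. This is a simplified illustration on made-up 2-D points, not a replacement for a library implementation; the function name `smote_sketch` and its parameters are assumptions for this example.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic points by interpolating between a minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # how far to move from `base` towards `neighbour`
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class points with two numeric features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]]
new_points = smote_sketch(minority, n_synthetic=10)
print(len(new_points))  # 10
```

Unlike plain duplication, every generated point is new, but all of them lie between existing minority points, so the method cannot invent genuinely new regions of the feature space.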
A leading US healthcare provider had a problem identifying rare diseases. Based on EDA, we observed that only 1% of the records in the dataset fell under the minority class (rare disease), with the remaining records belonging to the majority class. In this scenario, the data imbalance was countered using the SMOTE strategy. The records belonging to the minority class were taken and synthetic samples were created several times over (15–20 times), depending on the proportionality of the dependent variable. Thus, the minority class proportion was increased from 2% to 20%. As a result, the sensitivity increased from 15% to 45%.
The advantages of this approach are that overfitting is mitigated, because we generate new synthetic data instead of replicating existing records as in oversampling, and that no information is lost.
The disadvantage of this approach is that it is not very effective for high-dimensional data. Also, while generating synthetic records, it does not consider neighboring examples from other classes, so the strategy is often prone to producing noise.
The methods above work on the data itself, either increasing the proportion of the minority class by oversampling or decreasing the proportion of the majority class by undersampling. An appropriate choice of modeling technique can also improve results. The modeling techniques that are often used in such cases are explained below.
Algorithmic Ensemble Techniques
The objective of ensemble techniques is to improve model performance by combining several models; in our case, to improve the outcomes of classification tasks. Now, let us explore the methods for ensembling.
Bagging (short for bootstrap aggregating) creates ’n’ different bootstrapped samples from the data, each drawn with replacement, and trains a model on each of these ’n’ samples. It then aggregates the models’ predictions at the end. Bagging reduces overfitting.
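The bagging procedure can be sketched with a deliberately simple base learner. Everything here is illustrative: the one-feature toy data, the threshold "stump" learner, and the function names are assumptions for this example.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Toy base learner: pick the threshold t with the fewest training errors for the rule 'predict 1 if x > t'."""
    def errors(t):
        return sum(int(x > t) != y for x, y in zip(xs, ys))
    return min(sorted(set(xs)), key=errors)

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrapped sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical 1-D toy data: values above 7 belong to the positive (minority) class.
xs = list(range(1, 11))
ys = [0] * 7 + [1] * 3
model = bagging_fit(xs, ys)
print(bagging_predict(model, 2), bagging_predict(model, 10))
```

Each stump sees a slightly different sample, so individual thresholds vary; the vote across all of them is what smooths out that variance.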
The advantages of the Bagging Technique are that it improves the model accuracy, reduces variance, and lowers the misclassification rate.
The disadvantage is that bagging bad classifiers can degrade the model’s performance.
Boosting is a technique that combines weak learners to form a strong learner that makes accurate predictions. It starts from a weak learner trained on the training data. Weak learners are models whose predictions are only slightly better than random guessing.
In each subsequent iteration, new learners place more emphasis on the examples that were classified incorrectly in the previous round. Let us look at commonly used boosting techniques.
Short for Adaptive Boosting, AdaBoost is a technique that creates an accurate prediction rule by combining many weak, relatively inaccurate rules. The classifiers are trained iteratively, with each round concentrating on the data that was misclassified in the previous round.
For a learned classifier to make a proper prediction, it should meet the following:
- The classifier should have low training error on the training instances.
- The classifier should be trained on adequate training data.
This technique assumes that each weak hypothesis has an accuracy better than a random guess. After each round, it focuses on the data that is hard to classify by reweighting the samples: the misclassified data get a higher weight and the correctly classified data get a lower weight.
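The reweighting step can be made concrete with the standard AdaBoost update for binary labels encoded as +1 and -1. The sketch below shows a single round on made-up predictions; the function name `adaboost_reweight` is an assumption, and it expects the input weights to sum to 1 and the weak learner to get at least one sample wrong and at least one right.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: raise the weights of misclassified samples and lower the rest."""
    # Weighted error of the current weak learner.
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    # The learner's say in the final vote: larger when its error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # y * p is +1 for a correct prediction and -1 for a wrong one.
    new = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return [w / total for w in new], alpha

# Hypothetical round: five equally weighted samples, the last one misclassified.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
print([round(w, 3) for w in new_weights])  # [0.125, 0.125, 0.125, 0.125, 0.5]
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.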
The advantage of this technique is that it is easy to implement and can be generalized to any type of classification problem. In practice, it is also often fairly resistant to overfitting.
The disadvantage of this technique is that it is sensitive to noisy data and outliers.
XGBoost is short for Extreme Gradient Boosting. It builds on gradient boosting, which uses gradient descent to minimize a loss function. The models are trained sequentially, each one chosen to reduce the loss left by the models before it, and decision trees are used as the weak learners.
Gradient boosting builds the first learner on the training data to predict outcomes, calculates the loss (a measure of the difference between the predicted and the actual outcomes), and uses it to build an improved learner in the next stage. In each step, the residuals are computed as the negative gradient of the loss function with respect to the current predictions, and these residuals become the target variable for the next iteration.
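The residual-fitting loop can be sketched for the simplest case, squared loss, where the negative gradient is just the actual value minus the current prediction. This is plain gradient boosting with one-split regression stumps on made-up data, not XGBoost itself, which adds regularization and other refinements; all names and parameter values are assumptions for this example.

```python
def fit_stump(xs, residuals):
    """Regression stump: choose the split with the lowest squared error and predict the mean residual on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # first learner: always predict the mean outcome
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # the next learner targets the residuals
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Hypothetical toy data: the outcome switches from 0 to 1 once x exceeds 4.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = gradient_boost(xs, ys)
print(round(model(2), 3), round(model(7), 3))
```

Each round removes a fraction of the remaining error, controlled by the learning rate, which is why many small steps are typically used rather than one large one.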
The advantages of this technique are numerous. It implements parallel processing and it gives Data Scientists the flexibility to choose the optimization objective and evaluation criteria. It also has a built-in method to handle missing values.
The disadvantage of this technique is that the model is harder to fit and tune than a random forest.
Model Concurrency is an effective method that combines multiple classifiers and class imbalance approaches, primarily to improve the classification performance on several applications.
For the rare disease prediction problem, SMOTE was used and sensitivity improved from 15% to 45%. With the sensitivity still low, we used the model concurrency approach and used decision trees and Multi-Layer Perceptrons as base classifiers to build an ensemble. We also used sub-sampling methods to deal with the class imbalance. This improved the sensitivity from 45% to 68%.
Python Library for Data Imbalance
Imbalanced-learn is a package in Python that helps overcome the data imbalance issue. The built-in methods for this package are listed below.
- Random majority under-sampling with replacement
- Extraction of majority-minority Tomek links 
- Under-sampling with Cluster Centroids
- NearMiss-(1 & 2 & 3) 
- Condensed Nearest Neighbour 
- One-Sided Selection 
- Neighborhood Cleaning Rule 
- Edited Nearest Neighbours 
- Instance Hardness Threshold 
- Repeated Edited Nearest Neighbours 
- AllKNN 
- Random minority over-sampling with replacement
- SMOTE — Synthetic Minority Over-sampling Technique 
- SMOTENC — SMOTE for Nominal Continuous 
- bSMOTE (1 & 2) — Borderline SMOTE of types 1 and 2 
- SVM SMOTE — Support Vectors SMOTE 
- ADASYN — Adaptive synthetic sampling approach for imbalanced learning 
- KMeans-SMOTE 
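- Over-sampling followed by under-sampling (SMOTE + Tomek links, SMOTE + ENN)
- Ensemble classifiers that use samplers internally (Easy Ensemble, Balanced Random Forest, Balanced Bagging, RUSBoost)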
In this article, we have described how we overcame the data imbalance problem, especially in healthcare AI use cases. We used sensitivity as the metric to gauge a model’s performance, and these techniques helped improve it significantly. However, a low number of false positives is also a necessity in this domain, and care must be taken to control the percentage of false positives a model predicts.
To handle data imbalance, we try out different sampling mechanisms and choose the best-performing one for the given healthcare use case. While trying out the sampling techniques mentioned above, we train multiple models with different combinations of majority and minority class data. From these models, we pick the best-performing one and deploy it in production. We need to be cognizant of the trade-off between sensitivity and the false positive rate while choosing the model.
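One simple way to encode that trade-off is to cap the false positive rate and pick the most sensitive model within the cap. The candidate names, confusion-matrix counts, and the 5% cap below are hypothetical and serve only to illustrate the selection rule.

```python
def false_positive_rate(c):
    return c["fp"] / (c["fp"] + c["tn"])

def sensitivity(c):
    return c["tp"] / (c["tp"] + c["fn"])

def pick_model(candidates, max_fpr):
    """Among candidates within the false positive budget, return the one with the highest sensitivity."""
    eligible = [c for c in candidates if false_positive_rate(c) <= max_fpr]
    return max(eligible, key=sensitivity, default=None)

# Hypothetical validation results for models trained on differently sampled data.
candidates = [
    {"name": "as_is", "tp": 7, "fn": 13, "fp": 7, "tn": 973},
    {"name": "undersample_50", "tp": 12, "fn": 8, "fp": 40, "tn": 940},
    {"name": "undersample_25", "tp": 16, "fn": 4, "fp": 150, "tn": 830},
]
chosen = pick_model(candidates, max_fpr=0.05)
print(chosen["name"])  # undersample_50
```

Here the most aggressive undersampling gives the highest sensitivity but flags too many healthy patients, so the middle candidate wins under this particular cap.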
In our experience with healthcare use cases, we find that under-sampling works better than over-sampling or SMOTE. That said, there is no one-size-fits-all model. You have to try different approaches for your use case and decide based on the different evaluation metrics.
We hope you found this article interesting and useful. If you have comments or suggestions or questions, please feel free to share them with us.