This article explains how you can reduce the bias of a neural network trained for medical diagnosis on a data set having a low prevalence of the disease.
In medical diagnosis studies, imbalanced classification is a common challenge. For almost any disease, a medical laboratory sees more patients without it than with it. If the training set keeps the naturally occurring prevalence of the disease of interest, any model based on a loss function will be skewed towards negative classification¹ because it does not receive enough loss from patients with the disease.
In the test set, the same patient distribution would cause the test to report an overly optimistic accuracy that is only slightly affected by the incorrect classification of patients with the disease. Out of all statistical measures² commonly used for measuring binary classification performance, only recall is not going to yield an overly optimistic value.
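To make the gap between accuracy and recall concrete, here is a small worked example. It assumes a hypothetical classifier that labels every patient as negative and uses the class counts of the urine-sample data set discussed in this article (141 positive, 663 negative). The helper name `binary_metrics` is made up for this illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, recall) computed from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Recall looks only at the patients who actually have the disease.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

# Class counts of the imbalanced data set: 141 positive, 663 negative samples.
n_pos, n_neg = 141, 663

# Hypothetical classifier that always answers "negative":
# every healthy patient is a true negative, every sick patient a false negative.
acc, rec = binary_metrics(tp=0, fp=0, tn=n_neg, fn=n_pos)
print(f"accuracy={acc:.3f} recall={rec:.3f}")  # accuracy=0.825 recall=0.000
```

A model that never detects a single infection still scores about 82% accuracy, while its recall is zero.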
The situation is also aggravated by the fact that we care more about eliminating Type II errors (false negatives) than Type I errors (false positives). In medical practice, it is almost always better to run some additional tests for a patient who doesn’t have the disease than to leave untreated a patient who does.
Here’s an example of an imbalanced dataset (after applying principal component analysis to its features) containing 141 urine samples for patients with Bacteriuria infection and 663 samples for patients not having the infection:
In this particular case, we want the classifier to narrow down the green cluster as much as possible, and we would tolerate a significant number of green dots outside of this cluster. But there are just not enough red dots in the borderline region for a model to make such a generalization.
In the picture below I outlined the presumed normal region in blue. We want all the samples outside this region to be classified as red and we can afford misclassification of all the green points outside this region. It is better to be extra cautious here because every misclassified red point is an untreated infection.
Such a classification would push the loss function above the optimal value the model could otherwise achieve, because there are more green than red dots in the borderline region.
Let’s train a simple dense neural network with 3 layers containing 3 tanh, 2 tanh, and 1 output sigmoid neurons. The model is going to use Binary Crossentropy as its loss function and we will train it using Stochastic Gradient Descent for 100 epochs on the data set above.
As expected, the neural network reports a relatively high accuracy value of 0.92, but a low recall value of just 0.60.
The high accuracy value must not deceive you. The neural network is doing a poor job: it misclassifies a lot of red dots. The high accuracy value can be attributed to the large number of properly classified green dots at the bottom of the normal region.
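The network described above (3 tanh, 2 tanh, and 1 sigmoid neurons, binary cross-entropy loss, stochastic gradient descent, 100 epochs) can be sketched in plain Python as follows. This is only an illustration under assumptions, not the code used for the reported numbers: the two input features, the learning rate, the weight initialization, and the synthetic toy data are all hypothetical, so the printed metrics will not match the values quoted above.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    # One weight row and one bias per neuron, small random start values.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return {"W": weights, "b": [0.0] * n_out}

def forward(layers, x):
    # Return the activations of every layer, starting with the input itself.
    acts = [x]
    for idx, layer in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + b
             for row, b in zip(layer["W"], layer["b"])]
        if idx == len(layers) - 1:
            acts.append([1.0 / (1.0 + math.exp(-v)) for v in z])  # sigmoid output
        else:
            acts.append([math.tanh(v) for v in z])  # tanh hidden layers
    return acts

def sgd_step(layers, x, target, lr):
    acts = forward(layers, x)
    # With a sigmoid output and binary cross-entropy, dLoss/dz is (p - target).
    delta = [acts[-1][0] - target]
    for idx in reversed(range(len(layers))):
        layer, a_in = layers[idx], acts[idx]
        # Gradient w.r.t. this layer's input, taken before the weights change.
        grad_in = [sum(layer["W"][j][k] * delta[j] for j in range(len(delta)))
                   for k in range(len(a_in))]
        for j, d in enumerate(delta):
            for k, a in enumerate(a_in):
                layer["W"][j][k] -= lr * d * a
            layer["b"][j] -= lr * d
        if idx > 0:
            # a_in is a tanh output, and tanh'(z) = 1 - tanh(z)^2.
            delta = [g * (1.0 - a * a) for g, a in zip(grad_in, a_in)]

def train(X, y, epochs=100, lr=0.05, seed=0):
    rng = random.Random(seed)
    sizes = [len(X[0]), 3, 2, 1]  # input -> 3 tanh -> 2 tanh -> 1 sigmoid
    layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples one at a time in random order
        for i in order:
            sgd_step(layers, X[i], y[i], lr)
    return layers

def predict_proba(layers, x):
    return forward(layers, x)[-1][0]

# Hypothetical toy data: 80 negatives and 20 positives in two dimensions.
data_rng = random.Random(1)
X = ([[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(80)]
     + [[data_rng.gauss(2.0, 1.0), data_rng.gauss(2.0, 1.0)] for _ in range(20)])
y = [0] * 80 + [1] * 20

layers = train(X, y)
predicted = [int(predict_proba(layers, x) >= 0.5) for x in X]
accuracy = sum(p == t for p, t in zip(predicted, y)) / len(y)
recall = sum(p == 1 and t == 1 for p, t in zip(predicted, y)) / sum(y)
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Reporting recall next to accuracy, as in the last lines, is what exposes the problem on an imbalanced data set.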
The way to approach this problem is called resampling and can be done by either undersampling the majority class or oversampling the minority class.
Random undersampling might not work well when there is not enough data for the model to generalize.
A method called Tomek links³ can be used to identify borderline samples, so that the remaining samples can be removed as they are less important for generalization and testing.
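A Tomek link is a pair of samples from opposite classes that are each other’s nearest neighbors. The sketch below only finds such pairs with a brute-force search; which samples to drop afterwards is left open. The function names and the tiny one-dimensional-like data set are hypothetical and serve only as an illustration.

```python
import math

def nearest_neighbor(i, X):
    # Index of the sample closest to X[i], excluding X[i] itself.
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Return index pairs (i, j) of opposite-class mutual nearest neighbors."""
    links = []
    for i in range(len(X)):
        j = nearest_neighbor(i, X)
        # i < j keeps each pair from being reported twice.
        if y[i] != y[j] and nearest_neighbor(j, X) == i and i < j:
            links.append((i, j))
    return links

# Hypothetical samples: three negatives (0) followed by two positives (1).
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.2, 0.0], [3.0, 0.0]]
y = [0, 0, 0, 1, 1]
print(tomek_links(X, y))  # [(2, 3)]
```

Samples 2 and 3 sit right at the class boundary, which is why they form the only link. The brute-force search is quadratic in the number of samples, so it is suitable only for small data sets like the one in this article.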
Another idea, called the SHRINK system⁴, classifies the overlapping regions of positive and negative classes as positive, which argues for undersampling by removing the negative examples from that region.
It works in some cases but might not work well with the Bacteriuria dataset above. As you can see from Picture #2, the high-density green region outlined in orange overlaps with a moderately sparse region of red points. Removing green points from this region won’t increase the model’s recall significantly.
Oversampling with replacement
Oversampling with replacement can be done by simply duplicating minority examples from the original data set until their number matches the number of cases in the majority class. The problem is that this approach doesn’t provide any new information to the model during training, and thus it doesn’t necessarily lead to better generalization.
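A minimal sketch of this duplication step, assuming a binary problem where the minority class is labeled 1. The function name and the toy data are hypothetical.

```python
import random

def oversample_with_replacement(X, y, minority_label=1, seed=0):
    """Duplicate random minority samples until both classes have equal size."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    # Draw with replacement, so the same sample may be copied several times.
    extra = [rng.choice(minority) for _ in range(n_majority - len(minority))]
    return X + extra, y + [minority_label] * len(extra)

# Hypothetical toy data: six negatives and two positives.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [2.0], [2.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = oversample_with_replacement(X, y)
print(y_res.count(0), y_res.count(1))  # 6 6
```

Every added row is an exact copy of an existing minority sample, which is why the model sees no new information.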
The idea of synthetic oversampling comes from the fact that for a model it could be easier to particularize than generalize. Thus, a simpler generative model can be built to provide the necessary diversity for a more complex classifying model.
The main challenge in synthetic oversampling methods is that even though a safe area can be detected inside the minority class region with no majority samples in it, too many new cases there can actually decrease the model’s accuracy for borderline cases. The model is not going to receive enough loss from the borderline cases, and thus it will be less accurate in classifying them. It is the opposite of our goal.
Adding just borderline cases with some randomization can also be tricky, as random noise can swap an example’s label according to some classifiers such as k nearest neighbors. Even though it cannot make an image of a cat look like a rarer iguana, random noise can easily make a borderline-sick patient look borderline-healthy in the doctor’s eyes.
One of the successful approaches in synthetic oversampling is called SMOTE⁵ (Synthetic Minority Over-sampling Technique), where new samples are generated along the line segments that join a minority sample to its k nearest minority-class neighbors. SMOTE shows improved results for medical diagnosis in several studies, for example, diagnosing diabetes⁶ and cancer⁷.
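The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration, not the reference algorithm: it picks base samples at random instead of iterating over every minority sample, uses a brute-force neighbor search, and the function name `smote_sketch` and the toy points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base sample (itself excluded).
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position on the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

# Hypothetical minority-class samples on the corners of a unit square.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))  # 6
```

Because each synthetic point lies between two real minority samples, it adds variety without leaving the region the minority class already occupies.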
SMOTE implementation⁸ is available in Python⁹ and it even allows combining¹⁰ synthetic oversampling with undersampling for better results.
Here’s the same Bacteriuria data set as above but after applying SMOTE to it:
The red cluster boundaries are now reinforced, and during training a model is going to receive enough loss from borderline red cases, leading to better generalization.
The neural network described above and trained on the data set with SMOTE samples reports a less optimistic accuracy value of 0.87 (previously 0.92) but a much better recall value of 0.75 (previously 0.60).
Another approach called Deep SMOTE¹¹ claims to improve SMOTE performance in terms of precision, F₁, and AUC by using a deep neural network for generating synthetic samples.
Imbalanced classification is an important problem that needs to be addressed in medical diagnosis studies. Fortunately, the problem is well studied, and various successful methods based on data interpolation, as well as on deep learning, have been proposed and implemented to address it.
¹ Nathalie Japkowicz; The Class Imbalance Problem: Significance and Strategies; 2000.
² Matthias Kohl; Performance Measures in Binary Classification; 2012.
³ Ivan Tomek; Two Modifications of CNN; 1976.
⁴ Miroslav Kubat, Stan Matwin; Addressing the Curse of Imbalanced Training Sets: One Sided Selection. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186; 1997.
⁵ Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer; SMOTE: Synthetic Minority Over-sampling Technique; 2002.
⁶ Manal Alghamdi, Mouaz Al-Mallah, Steven Keteyian, Clinton Brawner, Jonathan Ehrman, Sherif Sakr; Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford ExercIse Testing (FIT) project; 2017.
⁷ Sara Fotouhi, Shahrokh Asadi, Michael W Kattan; A Comprehensive Data Level Analysis for Cancer Diagnosis on Imbalanced Data; 2019.
⁸ Jason Brownlee; SMOTE for Imbalanced Classification with Python; 2020.
¹⁰ Jason Brownlee; How to Combine Oversampling and Undersampling for Imbalanced Classification; 2020.
¹¹ Hadi Mansourifar, Weidong Shi; Deep Synthetic Minority Over-Sampling Technique; 2020.