Imbalanced Classification for Medical Diagnosis

Volodymyr Frolov
Jul 7 · 6 min read

This article explains how you can reduce the bias of a neural network trained for medical diagnosis on a data set having a low prevalence of the disease.


If the test set has the same patient distribution, it will report an overly optimistic accuracy that is only slightly affected by misclassified patients with the disease. Of all the statistical measures² commonly used to evaluate binary classification performance, only recall is not going to yield an overly optimistic value.

Out of all common statistical measures only recall is not going to yield an overly optimistic value for the imbalanced diagnosis problem. TP — true positives, TN — true negatives, FP — false positives, FN — false negatives
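
To make the effect concrete, here is a minimal sketch that computes accuracy, precision, and recall from the four counts above. The `binary_metrics` helper and the 1,000-patient test set with 10% prevalence are hypothetical and chosen only for illustration.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Precision and recall are undefined for an empty denominator;
    # this sketch reports 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical classifier that labels every patient as healthy,
# evaluated on 100 sick and 900 healthy patients.
accuracy, precision, recall = binary_metrics(tp=0, tn=900, fp=0, fn=100)
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
# prints "accuracy=0.900 recall=0.000"
```

A classifier that never detects the disease still scores 90% accuracy here, while its recall of zero exposes the problem.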

The situation is aggravated by the fact that we care more about eliminating Type II errors (false negatives) than Type I errors (false positives). In medical practice, it is almost always better to run some additional tests for a patient who doesn’t have the disease than to leave a patient who does have it untreated.

Here’s an example of an imbalanced dataset (after applying principal component analysis to its features) containing 141 urine samples for patients with Bacteriuria infection and 663 samples for patients not having the infection:

Picture #1: Bacteriuria patients after applying PCA to the lab results

In this particular case, we want the classifier to narrow down the green (healthy) cluster as much as possible, and we can tolerate a significant number of green dots outside of this cluster. But there are just not enough red (infected) dots in the borderline region for a model to make such a generalization.

In the picture below I outlined the presumed normal region in blue. We want all the samples outside this region to be classified as red and we can afford misclassification of all the green points outside this region. It is better to be extra cautious here because every misclassified red point is an untreated infection.

Such a classification would push the loss above the minimum a model can reach, because there are more green than red dots in the borderline region. A model that simply minimizes the loss will therefore prefer to label that region green.

Picture #2: The normal region and the majority of cases highlighted

Let’s train a simple dense neural network with 3 layers containing 3 tanh, 2 tanh, and 1 sigmoid output neurons. The model uses binary cross-entropy as its loss function, and we will train it with stochastic gradient descent for 100 epochs on the data set above.
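
The article does not include its training code, so the following is only a dependency-free sketch of such a network. The two-feature toy data, the learning rate of 0.05, the uniform weight initialization, and all function names are assumptions made for illustration. In practice a framework such as Keras would be the more usual choice.

```python
import math
import random


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    # Each neuron stores n_in weights followed by one bias term.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layers, x):
    """Return the activations of every layer, starting with the input."""
    activations = [list(x)]
    for index, layer in enumerate(layers):
        inputs = activations[-1]
        is_output = index == len(layers) - 1
        outputs = []
        for neuron in layer:
            z = neuron[-1] + sum(w * a for w, a in zip(neuron, inputs))
            outputs.append(sigmoid(z) if is_output else math.tanh(z))
        activations.append(outputs)
    return activations


def sgd_step(layers, x, y, lr):
    """Update the weights in place from one sample and return its loss."""
    activations = forward(layers, x)
    p = activations[-1][0]
    # Sigmoid output with binary cross-entropy: dLoss/dz = p - y.
    deltas = [p - y]
    for index in range(len(layers) - 1, -1, -1):
        layer, inputs = layers[index], activations[index]
        prev_deltas = []
        if index > 0:  # inputs are tanh outputs of the previous layer
            prev_deltas = [
                sum(d * neuron[j] for d, neuron in zip(deltas, layer)) * (1.0 - a * a)
                for j, a in enumerate(inputs)
            ]
        for d, neuron in zip(deltas, layer):
            for j, a in enumerate(inputs):
                neuron[j] -= lr * d * a
            neuron[-1] -= lr * d
        deltas = prev_deltas
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def train(samples, labels, epochs=100, lr=0.05, seed=0):
    rng = random.Random(seed)
    sizes = [len(samples[0]), 3, 2, 1]  # 3 tanh, 2 tanh, 1 sigmoid
    layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    order = list(range(len(samples)))
    history = []
    for _ in range(epochs):
        rng.shuffle(order)
        total = sum(sgd_step(layers, samples[i], labels[i], lr) for i in order)
        history.append(total / len(samples))
    return layers, history


# Hypothetical imbalanced toy data: 160 healthy (0) and 40 sick (1) samples.
data_rng = random.Random(1)
healthy = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(160)]
sick = [[data_rng.gauss(2.0, 1.0), data_rng.gauss(2.0, 1.0)] for _ in range(40)]
X, y = healthy + sick, [0] * 160 + [1] * 40

layers, history = train(X, y)
# Scored on the training data for brevity; a real experiment needs a held-out set.
predictions = [int(forward(layers, x)[-1][0] >= 0.5) for x in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
recall = sum(p == 1 and t == 1 for p, t in zip(predictions, y)) / sum(y)
print(f"loss={history[-1]:.3f} accuracy={accuracy:.2f} recall={recall:.2f}")
```

Reporting recall next to accuracy is the point of the sketch: on imbalanced data the two can diverge sharply.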

As expected, the neural network reports a relatively high accuracy of 0.92 but a low recall of just 0.60.

The high accuracy value must not deceive you. The neural network is doing a poor job, misclassifying a lot of red dots. The high accuracy can be attributed to the large number of correctly classified green dots at the bottom of the normal region.

The usual way to approach this problem is called resampling. It can be done by either undersampling the majority class or oversampling the minority class.

Undersampling



A method called Tomek links³ can be used to identify borderline samples, so that the remaining samples, which matter less for generalization and testing, can be removed.
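
A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors, which is what marks them as borderline. The brute-force sketch below only identifies such pairs. The function names and the toy points are hypothetical, and which samples to drop afterwards is a separate design decision.

```python
import math


def nearest_neighbor(index, points):
    """Index of the point closest to points[index], excluding itself."""
    return min(
        (j for j in range(len(points)) if j != index),
        key=lambda j: math.dist(points[index], points[j]),
    )


def tomek_links(points, labels):
    """Return index pairs (i, j) with i < j that form Tomek links."""
    nearest = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and labels[i] != labels[j]
    ]


# Hypothetical 2-D toy data: label 0 is healthy, label 1 is sick.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (3.1, 3.2)]
labels = [0, 0, 0, 1, 1, 1]
links = tomek_links(points, labels)
print(links)  # prints [(2, 3)]
```

Only the pair at indices 2 and 3 sits on the class boundary. The other mutual nearest neighbors share a label and are therefore not links.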

Another idea, the SHRINK system⁴, labels the regions where the positive and negative classes overlap as positive, which argues for undersampling by removing the negative examples from those regions.

It works in some cases but might not work well with the Bacteriuria dataset above. As you can see from Picture #2, the dense green region outlined in orange overlaps with a moderately sparse region of red points. Removing green points from this region won’t increase the model’s recall significantly.

Oversampling with replacement
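
Oversampling with replacement duplicates randomly chosen minority samples until the classes are balanced. A minimal sketch, assuming a binary problem and a hypothetical helper name:

```python
import random


def oversample_with_replacement(samples, labels, minority_label, seed=0):
    """Duplicate random minority samples until both classes have equal size."""
    rng = random.Random(seed)
    minority = [s for s, label in zip(samples, labels) if label == minority_label]
    n_majority = len(samples) - len(minority)
    # Draw with replacement, so the same sample may be copied several times.
    extra = [rng.choice(minority) for _ in range(n_majority - len(minority))]
    return samples + extra, labels + [minority_label] * len(extra)


# Hypothetical toy data: four healthy samples (0) and one sick sample (1).
samples = [[0.1], [0.2], [0.3], [0.4], [2.0]]
labels = [0, 0, 0, 0, 1]
new_samples, new_labels = oversample_with_replacement(samples, labels, minority_label=1)
print(new_labels.count(0), new_labels.count(1))  # prints "4 4"
```

Because the added points are exact copies, this balances the loss without giving the model any new information about the borderline region.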

Synthetic Oversampling

The main challenge in synthetic oversampling methods is that even though a safe area can be detected inside the minority class region with no majority samples in it, too many new cases there can actually decrease the model’s accuracy for borderline cases. The model is not going to receive enough loss from the borderline cases, and thus it will be less accurate in classifying them. It is the opposite of our goal.

Adding just borderline cases with some randomization can also be tricky, because random noise can flip an example’s label according to some classifiers, such as k nearest neighbors. Even though it cannot make an image of a cat look like a rarer iguana, random noise can easily make a borderline-sick patient look borderline-healthy in the doctor’s eyes.

One of the successful approaches in synthetic oversampling is called SMOTE⁵ (Synthetic Minority Over-sampling Technique), where new samples are generated along the line segments that connect a minority sample to its k nearest minority-class neighbors. SMOTE shows improved results for medical diagnosis in several studies, for example, in diagnosing diabetes⁶ and cancer⁷.
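
The interpolation idea can be sketched in a few lines. This is a simplified illustration and not the reference algorithm: it picks the base sample at random instead of iterating over every minority sample, and the function name and parameters are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority samples and their neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # Keep the k minority samples closest to the base point.
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority samples at the corners of a unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=20, k=2)
print(len(new_points), new_points[0])
```

Every synthetic point lies on a segment between two existing minority samples, so the new cases stay inside the region the minority class already occupies instead of being scattered by random noise.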

A SMOTE implementation⁸ is available in Python⁹, and it even allows combining¹⁰ synthetic oversampling with undersampling for better results.

Here’s the same Bacteriuria data set as above but after applying SMOTE to it:

Picture #3: Data set from picture #1 after applying SMOTE

The red cluster boundaries are now reinforced, and during training a model is going to receive enough loss from borderline red cases, leading to better generalization.

The neural network described above and trained on the data set with SMOTE samples reports a less optimistic accuracy value of 0.87 (previously 0.92) but a much better recall value of 0.75 (previously 0.60).

Another approach called Deep SMOTE¹¹ claims to improve SMOTE performance in terms of precision, F₁, and AUC by using a deep neural network for generating synthetic samples.



² Matthias Kohl; Performance Measures in Binary Classification; 2012.

³ Ivan Tomek; Two Modifications of CNN; 1976.

⁴ Miroslav Kubat, Stan Matwin; Addressing the Curse of Imbalanced Training Sets: One Sided Selection. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186; 1997.

⁵ Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer; SMOTE: Synthetic Minority Over-sampling Technique; 2002.

⁶ Manal Alghamdi, Mouaz Al-Mallah, Steven Keteyian, Clinton Brawner, Jonathan Ehrman, Sherif Sakr; Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford ExercIse Testing (FIT) project; 2017.

⁷ Sara Fotouhi, Shahrokh Asadi, Michael W Kattan; A Comprehensive Data Level Analysis for Cancer Diagnosis on Imbalanced Data; 2019.

⁸ Jason Brownlee; SMOTE for Imbalanced Classification with Python; 2020.

⁹ scikit-learn-contrib / imbalanced-learn.

¹⁰ Jason Brownlee; How to Combine Oversampling and Undersampling for Imbalanced Classification; 2020.

¹¹ Hadi Mansourifar, Weidong Shi; Deep Synthetic Minority Over-Sampling Technique; 2020.

Analytics Vidhya
