Imbalanced Classification for Medical Diagnosis

Published in Analytics Vidhya, Jul 7, 2020

This article explains how you can reduce the bias of a neural network trained for medical diagnosis on a data set having a low prevalence of the disease.

Overview

In medical diagnosis studies, imbalanced classification is a common challenge. For almost any disease, a medical laboratory sees more patients without the disease than with it. If the training set keeps the naturally occurring prevalence of the disease of interest, any model trained by minimizing a loss function is skewed towards negative classification¹, because patients with the disease contribute too little to the total loss.
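To see where the skew comes from, here is a minimal sketch that scans constant predictions and picks the one with the lowest mean binary cross-entropy. The class counts (100 positives, 900 negatives) and the function name `constant_bce` are hypothetical, chosen only for illustration.

```python
import math


def constant_bce(p, n_pos, n_neg):
    """Mean binary cross-entropy when every sample gets the same probability p."""
    total = -(n_pos * math.log(p) + n_neg * math.log(1.0 - p))
    return total / (n_pos + n_neg)


# Hypothetical class counts: 10% of the patients have the disease.
n_pos, n_neg = 100, 900
prevalence = n_pos / (n_pos + n_neg)

# Scan constant predictions 0.01 .. 0.99 and keep the one with the lowest loss.
candidates = [i / 100 for i in range(1, 100)]
best_p = min(candidates, key=lambda p: constant_bce(p, n_pos, n_neg))

print(f"prevalence = {prevalence:.2f}, loss-minimizing constant = {best_p:.2f}")
```

The loss is minimized when the predicted probability equals the prevalence, so a model that learns nothing about the features still settles far below the usual 0.5 decision threshold and labels everyone as negative.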

In the test set, the same class distribution makes the reported accuracy overly optimistic, because it is only slightly affected by misclassified patients with the disease. Out of all statistical measures² commonly used for measuring binary classification performance, only recall is not going to yield an overly optimistic value.

Out of all common statistical measures, only recall is not going to yield an overly optimistic value for the imbalanced diagnosis problem. TP: true positives, TN: true negatives, FP: false positives, FN: false negatives
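As a small worked example of these measures, the sketch below computes accuracy, recall, and specificity from the four counts. The classifier is hypothetical: it labels every sample as negative on a set with the 141 / 663 class split of the Bacteriuria example discussed later in this article.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute common binary classification measures from confusion counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Recall: share of patients with the disease that were detected.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: share of healthy patients that were correctly cleared.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical classifier that predicts "negative" for everyone.
all_negative = binary_metrics(tp=0, tn=663, fp=0, fn=141)

for name, value in all_negative.items():
    print(f"{name}: {value:.2f}")
```

Accuracy stays above 0.8 even though not a single infection is detected, while recall drops to zero.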

The situation is also aggravated by the fact that we care more about eliminating Type II errors (false negatives) than Type I errors (false positives). In medical practice, it is almost always better to run some additional tests for a patient who doesn’t have the disease than to leave a patient who does have it untreated.

Here’s an example of an imbalanced dataset (after applying principal component analysis to its features) containing 141 urine samples for patients with Bacteriuria infection and 663 samples for patients not having the infection:

Picture #1: Bacteriuria patients after applying PCA to the lab results

In this particular case, green dots are patients without the infection and red dots are patients with it. We want the classifier to narrow down the green cluster as much as possible, and we would tolerate a significant number of green dots outside of this cluster. But there are just not enough red dots in the borderline region for a model to make such a generalization.

In the picture below I outlined the presumed normal region in blue. We want all the samples outside this region to be classified as red and we can afford misclassification of all the green points outside this region. It is better to be extra cautious here because every misclassified red point is an untreated infection.

Such a classification would push the loss above the optimal value a model can achieve, because the borderline region contains more green dots than red ones.

Picture #2: The normal region and the majority of cases highlighted

Let’s train a simple dense neural network with 3 layers containing 3 tanh, 2 tanh, and 1 output sigmoid neurons. The model is going to use Binary Crossentropy as its loss function and we will train it using Stochastic Gradient Descent for 100 epochs on the data set above.
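The original training code is not shown here, so the following is only a dependency-free sketch of the architecture described above. It assumes two input features (the two PCA components), a learning rate of 0.1, one sample per update, and small uniform random initial weights; `TinyNet`, `mean_bce`, and `train` are hypothetical names.

```python
import math
import random


class TinyNet:
    """Dense network: inputs -> 3 tanh -> 2 tanh -> 1 sigmoid."""

    def __init__(self, n_inputs=2, seed=0):
        rng = random.Random(seed)
        sizes = [n_inputs, 3, 2, 1]
        # weights[l][j][i] connects unit i of layer l to unit j of layer l + 1.
        self.weights = [
            [[rng.uniform(-0.5, 0.5) for _ in range(sizes[l])]
             for _ in range(sizes[l + 1])]
            for l in range(3)
        ]
        self.biases = [[0.0] * sizes[l + 1] for l in range(3)]

    def forward(self, x):
        """Return the activations of every layer, starting with the input."""
        acts = [list(x)]
        for l in range(3):
            z = [sum(w * a for w, a in zip(row, acts[-1])) + b
                 for row, b in zip(self.weights[l], self.biases[l])]
            if l == 2:
                acts.append([1.0 / (1.0 + math.exp(-v)) for v in z])  # sigmoid
            else:
                acts.append([math.tanh(v) for v in z])
        return acts

    def predict(self, x):
        """Predicted probability of the positive class."""
        return self.forward(x)[-1][0]

    def sgd_step(self, x, y, lr):
        """One stochastic gradient descent update on a single sample."""
        acts = self.forward(x)
        # With a sigmoid output and binary cross-entropy, dLoss/dz = p - y.
        delta = [acts[-1][0] - y]
        for l in (2, 1, 0):
            inputs = acts[l]
            back = []
            if l > 0:
                # Back-propagate through the old weights before updating them;
                # (1 - a^2) is the derivative of tanh at activation a.
                back = [
                    sum(self.weights[l][j][i] * delta[j]
                        for j in range(len(delta))) * (1.0 - inputs[i] ** 2)
                    for i in range(len(inputs))
                ]
            for j, d in enumerate(delta):
                for i, a in enumerate(inputs):
                    self.weights[l][j][i] -= lr * d * a
                self.biases[l][j] -= lr * d
            delta = back


def mean_bce(net, data):
    """Mean binary cross-entropy of the network over (features, label) pairs."""
    eps = 1e-12  # keeps log() finite for saturated predictions
    total = 0.0
    for x, y in data:
        p = net.predict(x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(data)


def train(net, data, epochs=100, lr=0.1, seed=0):
    """Plain SGD: shuffle every epoch, update after every sample."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            net.sgd_step(x, y, lr)
```

Calling `train(TinyNet(), data)` with `data` as a list of `((pc1, pc2), label)` pairs runs the 100 epochs. In practice a deep learning framework would replace all of this with a few lines; the sketch only makes the moving parts explicit.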

As expected, the neural network reports a relatively high accuracy value of 0.92, but a low recall value of just 0.60.

The high accuracy value must not deceive you. The neural network is doing a poor job: it misclassifies a lot of red dots. The high accuracy can be attributed to the large number of properly classified green dots at the bottom of the normal region.

One way to approach this problem is called resampling, and it can be done by either undersampling the majority class or oversampling the minority class.

Resampling

Undersampling

Random undersampling might not work well when there is not enough data for the model to generalize.
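For reference, random undersampling itself can be sketched in a few lines. The function name `random_undersample` and the integer stand-ins for samples are hypothetical; the class counts match the Bacteriuria example.

```python
import random


def random_undersample(samples, labels, majority_label=0, seed=0):
    """Randomly drop majority samples until both classes have the same size."""
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    kept_majority = rng.sample(majority, min(len(minority), len(majority)))
    keep = sorted(minority + kept_majority)  # preserve the original order
    return [samples[i] for i in keep], [labels[i] for i in keep]


# 141 positive and 663 negative samples; integers stand in for lab results.
labels = [1] * 141 + [0] * 663
samples = list(range(len(labels)))

under_samples, under_labels = random_undersample(samples, labels)
print(len(under_samples), under_labels.count(1), under_labels.count(0))
```

Balancing this way throws away 522 of the 663 negative samples, which is why it can hurt when the data set is already small.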

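A method called Tomek links³ can be used to identify borderline samples: a Tomek link is a pair of samples from opposite classes that are each other’s nearest neighbors. The remaining samples can then be removed, as they are less important for generalization and testing.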
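A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors. The sketch below only finds such pairs with a brute-force search; the function names and the toy points are hypothetical. Which samples are dropped afterwards is a separate policy choice and is not shown.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


# Toy one-dimensional layout: only points 2 and 3 sit on the class border.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0), (3.0, 0.0)]
labels = [0, 0, 0, 1, 1]

links = tomek_links(points, labels)
print(links)
```

Points 0 and 1 are also mutual nearest neighbors, but they share a class, so they do not form a link.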

Another idea, the SHRINK system⁴, labels the overlapping regions of the positive and negative classes as positive, which argues for undersampling by removing the negative examples from those regions.

It works in some cases but might not work well with the Bacteriuria dataset above. As you can see from Picture #2, the dense green region outlined in orange overlaps with a moderately sparse region of red points. Removing green points from this region won’t increase the model’s recall significantly.

Oversampling with replacement

Oversampling with replacement can be done by just duplicating minority examples from the original data set to match the number of cases in the majority class. The problem is that this approach doesn’t provide any new information to the model during training, and thus it doesn’t necessarily lead to a better generalization.
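A minimal sketch of oversampling with replacement, again with integers standing in for real samples and a hypothetical function name:

```python
import random


def random_oversample(samples, labels, minority_label=1, seed=0):
    """Duplicate random minority samples until both classes have the same size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    n_majority = len(labels) - len(minority)
    n_extra = max(0, n_majority - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]  # with replacement
    new_samples = list(samples) + [samples[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_samples, new_labels


# 141 positive and 663 negative samples; integers stand in for lab results.
labels = [1] * 141 + [0] * 663
samples = list(range(len(labels)))

over_samples, over_labels = random_oversample(samples, labels)
distinct_minority = {s for s, y in zip(over_samples, over_labels) if y == 1}
print(len(over_samples), over_labels.count(1), len(distinct_minority))
```

The classes are balanced at 663 each, yet the number of distinct minority samples is still 141, which is exactly the "no new information" limitation.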

Synthetic Oversampling

The idea of synthetic oversampling comes from the fact that for a model it could be easier to particularize than generalize. Thus, a simpler generative model can be built to provide the necessary diversity for a more complex classifying model.

The main challenge in synthetic oversampling methods is that even though a safe area can be detected inside the minority class region with no majority samples in it, too many new cases there can actually decrease the model’s accuracy for borderline cases. The model is not going to receive enough loss from the borderline cases, and thus it will be less accurate in classifying them. It is the opposite of our goal.

Adding just borderline cases with some randomization can also be tricky, as random noise can swap an example’s label according to some classifiers such as k nearest neighbors. Even though it cannot make an image of a cat look like a rarer iguana, random noise can easily make a borderline-sick patient look borderline-healthy in the doctor’s eyes.

One of the successful approaches in synthetic oversampling is called SMOTE⁵ (Synthetic Minority Over-sampling Technique), where new samples are generated along the line segments that join a minority sample to its k nearest minority-class neighbors. SMOTE shows improved results for medical diagnosis in several studies, for example, diagnosing diabetes⁶ and cancer⁷.
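The core interpolation step can be sketched as follows. This is a simplified illustration, not a library implementation: it picks the base sample at random instead of iterating over every minority sample, uses a brute-force neighbor search, and the function name `smote_sketch` and the toy points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points between minority samples and their neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority-class neighbors of the base sample.
        neighbors = sorted(
            (j for j in range(len(minority)) if j != base_idx),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbor = minority[rng.choice(neighbors)]
        t = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies instead of being scattered by unconstrained noise.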

A SMOTE implementation⁸ is available in Python⁹, and it even allows combining¹⁰ synthetic oversampling with undersampling for better results.

Here’s the same Bacteriuria data set as above but after applying SMOTE to it:

Picture #3: Data set from picture #1 after applying SMOTE

The red cluster boundaries are now reinforced, so during training a model is going to receive enough loss from borderline red cases, leading to better generalization.

The neural network described above and trained on the data set with SMOTE samples reports a less optimistic accuracy value of 0.87 (previously 0.92) but a much better recall value of 0.75 (previously 0.60).

Another approach called Deep SMOTE¹¹ claims to improve SMOTE performance in terms of precision, F₁, and AUC by using a deep neural network for generating synthetic samples.

Conclusions

Imbalanced classification is an important problem that needs to be addressed in medical diagnosis studies. Fortunately, the problem is well studied, and various successful methods based on data interpolation, as well as deep learning, have been proposed and implemented to address it.