Multi-label Classification of Thoracic Diseases

Sun Jie
Institute for Applied Computational Science
Dec 15, 2021 · 12 min read

This article was produced as part of the final project for Harvard’s AC215 Fall 2021 course within the Institute for Applied Computational Science.

Authors: Jie Sun, Rowana Ahmed, Nellie Ponarul, Saul Holding, Anna Wuest

All of the source code and data can be found on GitHub.

Please check our video we made for our project: https://youtu.be/8_IipEEkvfs.

Background

Chest x-rays are the most common imaging technology for screening and diagnosing lung diseases 1. While they are also used to assess heart problems and fractures, their primary use is identifying and evaluating a wide array of lung diseases. Indeed, approximately 1 in 7 older adults has some form of active lung disease 2. Radiologists are the experts who interpret imaging exams, including chest x-rays 3. With the development of artificial intelligence, however, researchers have explored using deep learning models to generate disease classifications, with the goal of streamlining the radiologist's diagnostic workflow, enabling more accurate screenings, and ultimately helping physicians make life-saving medical decisions. To date, no such model has been formally incorporated into the radiologist workflow, but that has not stopped researchers and companies from trying.

Deep learning classification is now being widely explored as a way to assist radiologists in the clinical diagnosis process, but integrating these predictive models into the clinical workflow faces significant challenges: industry-wide confusion about the usefulness of these models, uncertainty about how they would fit into the current workflow, and the lack of interpretability of model outputs, which is critical in healthcare 4.

However, previous work has demonstrated that deep learning models achieve high performance on chest x-rays 4, highlighting an opportunity to use computer vision to assist with chest radiograph diagnosis. One of the largest issues in the radiographic diagnosis of thoracic diseases is that these conditions have overlapping visual characteristics, which complicates distinguishing one from another, slowing diagnosis and increasing misdiagnoses. Our team believes deep learning can alleviate this problem: recent studies achieved sensitivities ranging from 87.9% to 93.6% when predicting lung cancer, tuberculosis, pneumonia, and pneumothorax from chest x-rays, exceeding the sensitivity achieved by on-call radiologists 4. These results have driven great interest in implementing deep learning models in the clinical workflow and ultimately motivate our work.

Moreover, utilizing a deployed deep learning model for chest radiograph diagnoses can be particularly helpful in low resource settings where specialists are not available to provide accurate diagnoses for lung diseases 5. As global health experts dedicate resources to mobilizing x-rays to rural and underserved areas, a diagnostic tool coupled with medical equipment availability can improve patient care 5, and save real lives.

Companies such as Portal Telemedicina, a startup founded in Brazil that provides a telehealth platform connecting patients in remote settings with specialized care, could potentially incorporate diagnostic applications into their existing platform 6. The startup trains non-specialized clinical staff in remote Brazilian communities to capture medical images with on-site equipment; the images upload seamlessly to the startup's data platform, where a radiologist, typically based in urban Brazil, analyzes them. But specialist availability is limited and variable, so diagnoses are often delayed, allowing disease progression and, in some cases, death. A diagnostic app could expedite this process: a deep learning model could provide an initial diagnosis that is then referred to a specialist for confirmation when necessary.

Data

To construct our model, we use data from the US Department of Health and Human Services' National Institutes of Health (ChestX-ray14) 7. It consists of 112,120 frontal-view chest radiographs (chest x-rays) from 30,805 unique patients, where each patient is either healthy or has at least one of 14 different pathologies 7. The pathologies are the following fourteen common thoracic diseases: Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural thickening, Cardiomegaly, Nodule, Mass, and Hernia 7. Our model predicts which combination of these thoracic diseases a patient may have, if any. Different pathologies manifest with different visual features in the x-ray, and different portions of the image are critical for determining each of the classes.
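Because a single image can carry several of the 14 pathologies, each image's label is naturally represented as a multi-hot vector. The sketch below shows one way to build that vector from a pipe-separated label string like the ones in the dataset's metadata file; the exact class-name spellings (e.g. `Pleural_Thickening`) are assumptions that should be checked against the metadata.

```python
import numpy as np

# The 14 thoracic disease classes in the ChestX-ray14 dataset
# (spellings assumed; verify against the metadata file).
CLASSES = [
    "Atelectasis", "Consolidation", "Infiltration", "Pneumothorax",
    "Edema", "Emphysema", "Fibrosis", "Effusion", "Pneumonia",
    "Pleural_Thickening", "Cardiomegaly", "Nodule", "Mass", "Hernia",
]

def encode_labels(finding_labels: str) -> np.ndarray:
    """Convert a pipe-separated label string (e.g. 'Effusion|Mass') into a
    14-dimensional multi-hot vector; 'No Finding' maps to all zeros."""
    vec = np.zeros(len(CLASSES), dtype=np.float32)
    if finding_labels != "No Finding":
        for label in finding_labels.split("|"):
            vec[CLASSES.index(label)] = 1.0
    return vec
```

A vector encoding (rather than a single class index) is what lets the model treat each disease as an independent yes/no question.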

Figure 1. Thoracic disease identification in x-ray imaging.

The labels for each image were constructed by applying a natural language processing model to the corresponding radiology reports, with a reported label-assignment accuracy of over 90% 1. Even so, some x-ray diagnoses may be misclassified, and indeed that is what our results suggest. We dive deeper into this later.

While the ChestX-ray14 dataset was advertised as having at least 90% accuracy for label assignment, further research indicates the labels may be less reliable than that. Oakden-Rayner determined that about 84% of the images come from patients who have more than one image in the dataset, which significantly reduces the number of independent samples compared with a one-image-per-patient design.

Furthermore, most of these patients are ICU patients who have chest x-rays recorded daily, which further reduces the diversity of the dataset.

Additionally, Oakden-Rayner conducted an expert review of a sample of labeled chest x-rays from the ChestX-ray14 dataset for each disease class, and calculated the positive predictive value for each class using his expert review as the ground truth and the dataset label as the prediction. Through this approach, he determined that the positive predictive value ranges from as low as 10% (for Emphysema) to 90% (for Pneumothorax). He also found a large number of false negatives in the dataset. For our purposes, we train our model using the existing dataset and labels, but understand that future training will be needed if more accurately labeled data becomes available 8.
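Positive predictive value here is just TP / (TP + FP), with the expert review supplying the ground truth. A minimal sketch, using a toy set of hypothetical labels:

```python
def positive_predictive_value(expert_labels, dataset_labels):
    """PPV of the dataset's labels for one disease class, treating the
    expert review as ground truth: TP / (TP + FP)."""
    tp = sum(e and d for e, d in zip(expert_labels, dataset_labels))
    fp = sum((not e) and d for e, d in zip(expert_labels, dataset_labels))
    return tp / (tp + fp) if (tp + fp) else float("nan")

# Toy example: the dataset flags 5 images positive, the expert confirms 4,
# so PPV = 4 / 5 = 0.8.
expert  = [1, 1, 1, 1, 0, 0, 0, 1]
dataset = [1, 1, 1, 1, 1, 0, 0, 0]
print(positive_predictive_value(expert, dataset))  # 0.8
```

Note that PPV says nothing about false negatives, which is why Oakden-Rayner's separate false-negative finding matters on its own.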

Exploratory Data Analysis

During our exploratory data analysis, we wanted to ensure that our training and testing datasets have similar distribution of attributes and are also representative of the general population to avoid unintended biases. Both the training and testing datasets have similar gender distributions, with males slightly overrepresented compared to females. Additionally, as shown below, the age distribution of the test set closely matches the age distribution of the training set. Most patients range in age from 35–65 years old, and the full age range represented in the dataset spans from 0 to 95 years.
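The train/test comparison above amounts to computing the gender proportions and age summary statistics of each split side by side. A minimal pandas sketch on toy data (the real metadata's column names, such as `Patient Gender` and `Patient Age`, are assumptions to verify):

```python
import pandas as pd

# Toy metadata frames standing in for the real train/test splits.
train = pd.DataFrame({"Patient Gender": ["M", "M", "F", "M", "F"],
                      "Patient Age":    [45, 60, 35, 52, 70]})
test  = pd.DataFrame({"Patient Gender": ["M", "F", "M"],
                      "Patient Age":    [50, 40, 65]})

for name, df in [("train", train), ("test", test)]:
    # Proportion of each gender, and the median age, per split.
    gender_pct = df["Patient Gender"].value_counts(normalize=True).round(2)
    print(f"{name}: gender {gender_pct.to_dict()}, "
          f"median age {df['Patient Age'].median()}")
```

On the real dataset the same two lines per split are enough to surface the slight male overrepresentation and matching age distributions described above.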

Table 1. Gender Balance across Datasets

Figure 2. Age Distributions Across Training and Test Sets

Figure 3. Percentage of Disease in the Training dataset after stratifying sex.

Figure 4. Percentage of Disease in the Test dataset after stratifying sex.

For most disease classes, the data is balanced across genders. However, there is limited data for cardiomegaly in male patients in both the training and testing dataset, and limited data for females with Consolidation in the training dataset. This may result in biased accuracy results, and we will need to explore these limitations in our final deployed model.

Model

We have trained three baseline models.

The architectures:

  1. CNN model with structure based on Abiyev et al. 9:
  • Layer 1: input layer, 32x32x1 image with zero-center normalization
  • Layer 2: hidden layer 1, Conv1 + BN + ReLU + Pool1, 16 feature maps of size 10x10, 2x2 kernel with stride 2
  • Layer 3: hidden layer 2, Conv2 + BN + ReLU + Pool2, 32 feature maps of size 10x10, 2x2 kernel with stride 2
  • Layer 4: hidden layer 3 (classification layer), Conv3 + BN + ReLU + FC + softmax, 64 feature maps of size 10x10, 2 fully connected layers of 2 units

2. DenseNet, pre-trained on ImageNet

  • In a DenseNet architecture, each layer is connected to every other layer, hence the name Densely Connected Convolutional Network.

3. MobileNet, pre-trained on ImageNet, with the top layer replaced by:

  • A flatten layer
  • A dense layer with ReLU activation
  • A dropout layer
  • Another dense layer with ReLU activation
  • A final dense layer with sigmoid activation
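The MobileNet variant can be sketched in Keras as follows. The layer widths (256, 128) and the dropout rate are illustrative assumptions, not the exact values we trained with, and the backbone is built with `weights=None` here only to keep the sketch offline (we used `weights="imagenet"`):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 14  # the 14 thoracic disease classes

# MobileNet backbone with its classification top removed.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights=None)
base.trainable = False  # freeze the backbone for the first training phase

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # width is illustrative
    layers.Dropout(0.5),                    # rate is illustrative
    layers.Dense(128, activation="relu"),
    # Sigmoid rather than softmax: each class is an independent yes/no,
    # so a single image can be positive for several diseases at once.
    layers.Dense(NUM_CLASSES, activation="sigmoid"),
])

# Binary cross-entropy scores each of the 14 labels independently,
# which is what multi-label classification requires.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["binary_accuracy"])
```

The sigmoid + binary cross-entropy pairing, rather than softmax + categorical cross-entropy, is what makes this a multi-label rather than multi-class model.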

Here are the loss and test accuracies of these models:

Table 2: Final Results of the Three Trained Models

From these results, we can see that DenseNet outperformed the model based on the Abiyev et al. 9 architecture; however, MobileNet performed best, although only slightly.

Next, we look at the training history of MobileNet, the best-performing model:

Figure 5: Training History of Mobilenet

We can see that the model trains well, and that fine-tuning improves overall binary accuracy. There are some signs of overfitting by the fifth epoch, but the accuracy indicates the model is still performing well.

Finally, since our dataset is imbalanced, we want to look at the individual class accuracy:

Figure 6. Accuracy by Class in the MobileNet model.

We can see that the model performs well at predicting the disease type when the patient is ill, using accuracy as the success metric. This is promising, as an undiagnosed disease has a disproportionately negative impact on the patient's well-being and can lead to later morbidity and death. However, the model has difficulty classifying the no-finding class. We suspect this is due to the relatively higher variability among images with no associated disease, and to the false-negative issue that Oakden-Rayner 8 illustrated. Finally, since our model can predict more than one disease per image, with a cutoff threshold of 0.5, some images are not classified as any disease type. After a thorough investigation, we believe this issue also arises from the data quality problems documented by Oakden-Rayner 8. While we look for a better-labeled dataset, we decided to assign an inconclusive label to these images and prompt the radiologist to investigate them further.
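The thresholding and inconclusive-label logic described above can be sketched as a small post-processing step on the model's 14 sigmoid outputs (class names and the `"Inconclusive"` tag are as used in this write-up):

```python
CLASSES = [
    "Atelectasis", "Consolidation", "Infiltration", "Pneumothorax",
    "Edema", "Emphysema", "Fibrosis", "Effusion", "Pneumonia",
    "Pleural_Thickening", "Cardiomegaly", "Nodule", "Mass", "Hernia",
]
THRESHOLD = 0.5  # cutoff applied to each sigmoid output independently

def interpret(probs) -> list:
    """Keep every class whose probability clears the threshold; if none
    does, flag the image as inconclusive for radiologist review."""
    labels = [c for c, p in zip(CLASSES, probs) if p >= THRESHOLD]
    return labels or ["Inconclusive"]
```

Because each class is thresholded independently, the function can return several diseases for one image, exactly one, or the inconclusive flag when nothing clears 0.5.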

Our Final Model & What It Is Looking At

We use MobileNet as our final model, as it has the highest validation and test binary accuracy.

Beyond binary accuracy and per-class accuracy, we also want to assess what our model is looking at when making its classification, and a saliency map is the perfect tool for the job. In Figure 7, the model looks at various areas across the patient's chest before classifying the image as "No Finding". This makes sense: we would expect a model to scan the whole chest for disease before concluding that the patient has none.

Figure 7. Saliency Map from MobileNet of No Finding

In contrast, in Figure 8 the model looks at a specific part of the patient's lung to diagnose Atelectasis, a partial or complete collapse of the lung. More importantly, the model focuses on the area where the collapse is present: the patient's lower right quadrant.

Figure 8. Saliency Map from MobileNet of Atelectasis Prediction
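A vanilla gradient saliency map, one common way to produce maps like the ones shown, is the gradient of the chosen class probability with respect to the input pixels. A minimal TensorFlow sketch (this illustrates the general technique, not necessarily our exact implementation):

```python
import numpy as np
import tensorflow as tf

def saliency_map(model, image, class_index):
    """Gradient of one class's predicted probability with respect to the
    input pixels; large magnitudes mark the regions the model relied on."""
    x = tf.convert_to_tensor(image[np.newaxis, ...])
    with tf.GradientTape() as tape:
        tape.watch(x)
        prob = model(x)[0, class_index]
    grads = tape.gradient(prob, x)[0]
    # Collapse the channel axis and normalize to [0, 1] for display.
    sal = tf.reduce_max(tf.abs(grads), axis=-1)
    return (sal / (tf.reduce_max(sal) + 1e-8)).numpy()
```

The resulting 2D array can be overlaid on the x-ray as a heatmap, which is how Figures 7 and 8 should be read.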

All in all, the model performs well in three specific areas. First, its binary accuracy is high, at 90.96%. Second, it does well at predicting disease types other than no-finding, which is critical in the medical context, where incorrect disease predictions can lead to morbidity and death. And finally, it looks at the correct area of the lungs when making its classifications. Taken together, these are clear signs that machine learning models can assist radiologists in their disease classification workflow. Although more remains to be done, from improved modeling to bias checks, a tool that is effectively deployed and integrated into the radiologists' workflow could transform how disease diagnoses are made.

Discussion

As mentioned in the previous section, there are powerful indicators that deep learning models can improve the radiological diagnosis workflow, but more work remains. For now, multi-label classification of thoracic diseases using chest x-rays proves challenging, even though the issue likely arises from the dataset's mislabeling. While we reach a high overall binary accuracy and high accuracy for patients with disease, the model struggles with classifying "No Finding" radiographs. Moreover, many images were classified as inconclusive because they were not classified as having any disease type. This result remained consistent regardless of the model that we used.

Ultimately, as mentioned before and thanks to Oakden-Rayner 8, we suspect that this largely comes down to data quality: the dataset we are using is now known to be significantly mislabeled, particularly with false negatives, leading to inaccuracies for the "No Finding" label. Furthermore, a large proportion of the dataset consists of daily chest x-rays from a smaller subset of patients, reducing its diversity and introducing dependence between images. Hence, we believe we were severely limited by our dataset, and our next critical step is to get access to better data. If we could retrain our model on a more reliable dataset, deploying it as a web application could significantly assist diagnostic workflows across the globe, and most critically in areas with little to no resources.

Deployment

Our app interface includes an image input method; given an uploaded x-ray, the app displays the model's predicted classification probabilities, the image itself for reference, and a saliency map of the image corresponding to the prediction with the highest probability.

Application Components

Our chest x-ray thorax disease diagnostic app consists of a React.js front end and a web API service built with FastAPI, and is deployed on a Kubernetes cluster.

  • React Front End: Although we also constructed a simple frontend using HTML/JavaScript, our final deployed application leverages the React framework 10. The interface has a front page that takes in a chest x-ray and displays thorax disease predictions below it, and a page that displays the saliency map of the inputted image to show which parts of the image the model deemed most important for its predictions.
  • API Service: Our model is served using FastAPI, and the predicted results are exposed to the front end via a REST API.

We first pushed our containers to the Google Container Registry, then created a Kubernetes cluster and deployed our React frontend and FastAPI backend from the registry to the cluster using Ansible playbooks run from a local Docker container.

Kubernetes Deployment of App

Bibliography

1. Wang, X. et al. ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 3462–3471 (IEEE, 2017). doi:10.1109/CVPR.2017.369.

2. About 1 in 7 Older Adults Has Some Form of Lung Disease: CDC — Consumer Health News | HealthDay. https://consumer.healthday.com/senior-citizen-information-31/misc-aging-news-10/about-1-in-7-older-adults-has-some-form-of-lung-disease-cdc-695233.html.

3. Chest X-rays — Mayo Clinic. https://www.mayoclinic.org/tests-procedures/chest-x-rays/about/pac-20393494.

4. Hwang, E. J. & Park, C. M. Clinical implementation of deep learning in thoracic radiology: potential applications and challenges. Korean J. Radiol. 21, 511–525 (2020).

5. Maru, D. S.-R. et al. Turning a blind eye: the mobilization of radiology services in resource-poor regions. Global. Health 6, 18 (2010).

6. Quem Somos | Portal Telemedicina. https://portaltelemedicina.com.br/quem-somos.

7. NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community | National Institutes of Health (NIH). https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community.

8. Oakden-Rayner, L. Exploring Large-scale Public Medical Image Datasets. Acad. Radiol. 27, 106–112 (2020).

9. Abiyev, R. H. & Ma’aitah, M. K. S. Deep convolutional neural networks for chest diseases detection. J. Healthc. Eng. 2018, 4168538 (2018).

10. Deployment | Create React App. https://create-react-app.dev/docs/deployment/.
