Project Roadmap for X-Ray Classifiers: MICCAI Educational Challenge

Ben D and Solace H
MICCAI Educational Initiative
17 min read · Nov 24, 2020

Introduction

Starting your first Machine Learning (ML) project can be a daunting and overwhelming task. This tutorial was made to help you develop your own ML project, specifically in the Medical Imaging field. We, Solace and Ben, are undergraduate students going into our third year studying Engineering Science at the University of Oxford. This summer we were given the opportunity to work for a month with our tutor Bartek Papiez, tasked with developing an X-Ray Classifier for Foreign Object Detection. By the end of this tutorial you will have all the tools necessary to build any sort of image classifier to participate in the next MICCAI, MIDL or ISBI Challenge (or for your very own application). We hope you will find you don’t need to be an expert to get some very good results!

This roadmap assumes that you have a moderate understanding of Python, including Object Oriented Programming, and some basic understanding of medical imaging. Our only experience before starting the project was a 5-day crash course in ML, with a small introduction to PyTorch, which gave us a good starting platform. Since not everybody has a tutor there to help them, we have created this roadmap to guide you through the process. Here we have compiled the whole journey from start to finish, combining information from the tutorials and other resources we used along the way with advice based on our own experience during the project.

First Steps

To give ourselves a better understanding of the fundamentals needed for our project, we started by following the tutorial: ‘Deep Learning with PyTorch: A 60 Minute Blitz’. It is important to try to understand the steps it takes, but not all elements are equally important to start with. For example, we believe it is important to understand what Autograd is and why it is used; however, the implementation details of how the dynamic computational graph is built are not essential to developing a classifier. PyTorch is great because there are so many code examples and tutorials, so you do not need to understand the ins and outs of each line of code to be able to produce a working program with good results.

The PyTorch documentation is very detailed, and any questions you have will most likely be already answered on StackExchange or other similar forums.

Recommended Software / IDE

For our project we used Google Colaboratory as our primary coding environment. Google Colaboratory (Colab) is an online Python environment; using it is very intuitive, and if used properly it can be a very powerful tool. Its main advantage is free access to a GPU, as most of us have limited computational power on our machines at home. GPUs are important because they are much more effective than CPUs for deep learning, especially for training, which can otherwise take a very long time to complete.
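In PyTorch, a minimal way to check whether Colab has actually given you a GPU, and to move your model and data onto it, looks like the snippet below; model, inputs and labels are placeholders for your own variables.

import torch
# use the Colab GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
# later, move the model and each batch of data onto the same device
# model = model.to(device)
# inputs, labels = inputs.to(device), labels.to(device)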

One of the first things you will realise once you start coding your classifier is that you will write a lot of different versions of your code. To avoid confusion, make sure you title your files clearly with the version of the code, so that you know exactly what you are opening. Also make sure your code can be understood by anyone and is easy to edit. Best practice is to keep your work well segmented, and each segment well labelled and commented, to minimise the amount of time spent searching for that one line of code.

The main advantage of using Colab is how simple it is to import your dataset and your results between your drive and your Colab file.

Importing Images and CSV Files

You can import files from, and save files to, your Google Drive by mounting it in your Colab notebook.

from google.colab import drive
drive.mount('/content/drive')

We organised our folders as follows:

# ─── DATA_DIR
# ├── train
# │   ├── #####.jpg
# │   └── …
# ├── dev
# │   ├── #####.jpg
# │   └── …
# ├── test
# │   ├── #####.jpg
# │   └── …
# ├── train.csv
# ├── dev.csv
# └── test.csv

Inside the CSV files were the image names and annotations of locations of foreign objects. To read in CSV files, you can use Pandas DataFrames. These can then be converted into dictionaries.

HOW TO IMPORT CSV FILES
import pandas as pd

#drive path to the DATA_DIR folder
data_dir = '/content/drive/My Drive/DATA_DIR/'
#import the csv files
labels_tr = pd.read_csv(data_dir + 'train.csv', na_filter=False)
labels_dev = pd.read_csv(data_dir + 'dev.csv', na_filter=False)
labels_test = pd.read_csv(data_dir + 'test.csv', na_filter=False)
#Convert DataFrames into dictionaries
img_class_dict_tr = dict(zip(labels_tr.image_name, labels_tr.annotation))
img_class_dict_dev = dict(zip(labels_dev.image_name, labels_dev.annotation))

Each of these three folders — train, dev (also called validation/val) and test — has a different purpose. Training data is what the model learns on and is used to fit the model. The validation data is used to evaluate the model during training. One way of doing this is, after each epoch, to re-save the model only if the current value of the evaluation metric is better than the previous best value. Evaluation metrics could be, for example, loss, AUC — more on this later — or accuracy, and are calculated on the validation data. You do this because you do not want your model to overfit to your training dataset. The test data is used to provide an unbiased evaluation of the final model. It is important that you do not involve your test data during training, and that you do not use your training dataset for testing. With the former, you will bias your model towards the test results, even though the test set is meant to be an independent way of evaluating your model, imitating use in real-world applications. With the latter, your model would appear to perform much better than it really does, because it is being tested on the exact data it was trained on. For further information, see this article.
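As a rough sketch, the ‘save only if better’ logic looks something like the snippet below; train_one_epoch and evaluate_auc are placeholders for your own training and evaluation functions, and model_dir is the drive path you save to (covered in the next section).

best_auc = 0.0
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)          # fit the model on the training data
    epoch_auc = evaluate_auc(model, dev_loader)   # evaluate on the validation (dev) data
    if epoch_auc > best_auc:                      # keep only the best model seen so far
        best_auc = epoch_auc
        torch.save(model.state_dict(), model_dir)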

Limitations

There are also a few sometimes frustrating limitations of Colab. Since Colab is a free service available to everyone, there are limits on the time, memory, and GPU usage.

Colab caps the length of an active runtime at 12 hours, which means that you can only train your network for a total of 12 hours at once. This can be a serious handicap if your dataset is large or you need to train the network multiple times. The way around this is to save your network while training, for example after each epoch. That way, if the runtime turns off, you can pick up where you left off: load the network back in and keep training.

HOW TO SAVE YOUR MODEL AFTER TRAINING AND LOAD YOUR SAVED MODEL
import torch

#saving your model to drive path model_dir
model_dir = '/content/drive/My Drive/.../classification_model.pt'
torch.save(model.state_dict(), model_dir)
#loading a saved model after having retrieved appropriate model architecture from torchvision
model.load_state_dict(torch.load(model_dir))
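If you also want to resume training exactly where it stopped, rather than just reloading the weights, you can save the optimizer state and the current epoch in the same checkpoint. A minimal sketch, assuming model, optimizer and epoch already exist in your training loop (the path here is just an example):

#saving a full checkpoint so training can resume after a runtime reset
checkpoint_dir = '/content/drive/My Drive/.../checkpoint.pt'
torch.save({'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict()}, checkpoint_dir)

#restoring it in a new runtime
checkpoint = torch.load(checkpoint_dir)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1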

When training particularly heavy classifiers you may get the error message “You cannot currently connect to a GPU due to usage limits in Colab”. Unfortunately, there is no way to keep training your network after receiving this message. You have two options: create a new Google account and share the project with that new account, or wait until the usage limit resets before training again. The time this takes varies depending on worldwide demand, but in our experience it can be around 12 hours. Memory is a bit easier: ‘Factory Reset Runtime’ (under the Runtime drop-down menu) and reload the page to clear the memory and get back to training your network!

Investigation of the Dataset

The dataset we used was from the Object-CXR Challenge: Automatic Detection of Foreign Objects on Chest X-rays. This consists of 10,000 images: 8,000 training, 1,000 validation and 1,000 test, with half of each of these splits containing images with foreign objects. Every image containing foreign objects is annotated with their coordinates.

So why is foreign object detection an important task for healthcare? This question is answered well in this article. One key example they give, relevant to our application, involves pneumothoraces. Many of the images in the pneumothorax dataset contain large bore chest drains. It is trivially true that many patients who have a large bore chest drain have a pneumothorax, as it is a method of treatment. For a clinically useful classifier, you want to be able to identify a pneumothorax so that patients who need it can be treated. If a classifier is trained on images, many of which contain chest drains, the model may learn the features of a chest drain rather than of a pneumothorax. Foreign object detection can be used to mitigate the risk of a model learning objects associated with a condition, such as a large bore chest drain, rather than the condition itself, the pneumothorax.

The quality of your training data is an important factor to consider. If your annotations are incorrect or mislabelled, then your results may be biased or inaccurate. It is therefore very important to investigate your dataset. Before training anything, display some images and see what the annotations look like.

Viewing Images with Annotations in Pytorch

HOW TO VIEW IMAGES WITH ANNOTATIONS
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw

ANNOTATION_SEP = ';'
OBJECT_SEP = ','
#Function to draw the annotations on images
def draw_annotation(im, anno_str, fill=(255, 63, 63, 40)):
    draw = ImageDraw.Draw(im, mode="RGBA")
    #if there are multiple annotations in the image, split them up then draw each one
    for anno in anno_str.split(OBJECT_SEP):
        anno = list(map(int, anno.split(ANNOTATION_SEP)))
        if anno[0] == 0:
            draw.rectangle(anno[1:], fill=fill)
        elif anno[0] == 1:
            draw.ellipse(anno[1:], fill=fill)
        else:
            draw.polygon(anno[1:], fill=fill)

#Create a subplot of 4 example images
fig, axs = plt.subplots(nrows=1, ncols=4, subplot_kw=dict(xticks=[], yticks=[]), figsize=(24, 6))
#Display images 00001.jpg, 00002.jpg, 00003.jpg, 00004.jpg
example_idxes = [0, 1, 2, 3]
for row, ax in zip(labels_tr.iloc[example_idxes].itertuples(index=False), axs):
    im = Image.open(data_dir + "train/" + row.image_name).convert("RGB")
    if row.annotation:
        draw_annotation(im, row.annotation)
    ax.imshow(im)
    ax.set_title(f"{row.image_name}")

In the Object-CXR dataset, we found 9 images out of the first 100 ‘dev’ images with partially missing annotations; one can be seen on the left of Figure 1. For binary classification this was not a problem, because the image was still labelled as containing foreign objects thanks to the other annotations in the image. However, there were 2 images which had no other annotations, as can be seen on the right of Figure 1, and so were effectively mislabelled.

Figure 1 — Images from the ‘dev’ dataset. Left — Partial missed annotations. Right — Full missed annotations.

Preprocessing of Data

Before training a classifier, it is important to implement two performance-improving techniques: data normalisation and data augmentation.

Data Normalisation

The dataset contained images of many different sizes, so we resized them all to 800x800 pixels. We chose this size because the input-size limits for the architecture used in the Baseline for the Object-CXR Challenge were a minimum of 800 and a maximum of 1333 pixels. Given the limitations in Colab, to maximise the number of images we could train on at once and minimise training time, we chose the smallest possible input size, 800x800. Since we observed good results with this image size we kept it the same across our architectures. We also normalised the pixel values to the standard ImageNet values, mean = [0.485, 0.456, 0.406] and standard deviation = [0.229, 0.224, 0.225], so that each input has a similar data distribution. See here for more information on why this is important.

Data Augmentation

Data augmentation is the process of increasing the effective amount of training data by applying different transformations to the existing data. We implemented several transformations to improve the accuracy of our model and prevent it from simply memorising the training dataset.

The PyTorch Transforms package comes with many augmentations that can be easily applied to your data. Random transformations are applied to individual images as the model trains on them. This means that when the model looks at the same image, in each epoch it will probably be transformed in a different way. This improves the performance of the network without requiring more data to be collected.

The augmentations we did were random horizontal flips with a probability of 0.5, random vertical flips with a probability of 0.5, random rotations between -15 and 15 degrees and brightness jitter between 0.85 and 1.05.

HOW TO IMPLEMENT DATA AUGMENTATION AND NORMALISATION
from torchvision import transforms

input_size = 800
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
        transforms.RandomRotation(15),
        transforms.Resize((input_size, input_size)),
        transforms.ColorJitter(brightness=(0.85, 1.05)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize((input_size, input_size)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'test': transforms.Compose([
        transforms.Resize((input_size, input_size)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])}

Validation and test data do not need to be augmented, as augmentation is only used to improve the training of your network.
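To actually apply these transforms, you need a Dataset that loads each image, looks up its label and runs the appropriate transform, which you then wrap in a DataLoader. The sketch below is a simplified illustration rather than our exact implementation (CXRDataset is a hypothetical name); it assumes the folder layout and the img_class_dict_tr dictionary from earlier, and treats any image with a non-empty annotation as class 1.

import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image

class CXRDataset(Dataset):
    def __init__(self, img_dir, img_class_dict, transform):
        self.img_dir = img_dir
        self.img_names = list(img_class_dict.keys())
        #label is 1 if the image has any annotation (a foreign object), 0 otherwise
        self.labels = [1 if anno else 0 for anno in img_class_dict.values()]
        self.transform = transform

    def __len__(self):
        return len(self.img_names)

    def __getitem__(self, idx):
        im = Image.open(self.img_dir + self.img_names[idx]).convert('RGB')
        return self.transform(im), self.labels[idx]

train_dataset = CXRDataset(data_dir + 'train/', img_class_dict_tr, data_transforms['train'])
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)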

Importing your network

For a task such as ours, we needed to classify the images into two groups: images with a foreign object and images without a foreign object. Since there are only two classes, this is called a binary classification problem. Thankfully, instead of having to code the full architecture of a classifier network yourself, PyTorch has a collection of cutting-edge classifier architectures available to use. These models are very simple to import, and you can follow this tutorial to learn how to use them.

The question now becomes which model to choose. There are so many different options, even within each general architecture. So how do you choose? Well, it is quite a complex choice that depends on many factors, but a good rule of thumb is as follows. If you have a small dataset, you will want a small model (e.g. ResNet18) to avoid the model overfitting to that particular dataset and losing generalisation; if you have a larger dataset and lots of time, choose a larger model (e.g. ResNet101, DenseNet or InceptionV3). If you are planning on developing a mobile app, SqueezeNet is the one for you. There are many different models, each with advantages and disadvantages; if you are not sure which one to use, ResNet50 is usually a great place to start.

Any classifier network has two main parts. The first part of the network is a convolutional neural network (CNN). CNNs are very flexible and have shown great performance on a variety of tasks, learning everything from very simple features such as edges and general shapes to specific features relevant to the dataset. The second part (the top layers) of the network is the fully connected network (FCN); these top layers are responsible for the final classification. When loading your model, the final FC layer is what you need to alter, since your model needs the appropriate number of outputs for your specific task; in our case this was 2.

Generic Neural Network, feature learning corresponds to the CNN and classification to the FCN (source)

Since PyTorch offers ‘pretrained’ networks, our question at this point was: what does a ‘pretrained’ model even mean? And why would we ever import a model that has been trained on a different dataset? The idea behind why this works is called transfer learning. Since the top layers classify an image based on the features extracted by the CNN, importing a pretrained model allows us to use the knowledge that the model has gained on a different dataset. Even if the dataset is not the same, the network will know more or less what sort of features to look for.

The last question is now which layers do you train? Well, this is an interesting question! Do you only update the top layer of the network that does the final classification, or do you fine-tune the entire model and update all layers? The pretrained model you load into your network was trained on the ImageNet dataset. While there is a lot to be gained from transfer learning, the images in ImageNet are very different from X-ray images, both in terms of features and colours. You may need to fine-tune all layers for the network to learn the features that matter in X-rays. Below you can find a code snippet to import a pretrained DenseNet network; if the variable ‘feature_extract’ is set to True, the pretrained layers are frozen and only the final classification layer is trained, whereas if it is set to False the whole network is fine-tuned.

IMPORTING A NETWORK AND DETERMINING WHICH LAYERS ARE TRAINED
import torch.nn as nn
from torchvision import models

#determine whether all or only some of the layers will be trained
def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False

#import DenseNet and edit the final layer to the correct number of outputs (num_classes), in our case num_classes = 2
def _get_model(num_classes, feature_extract):
    model = models.densenet121(pretrained=True)
    set_parameter_requires_grad(model, feature_extract)
    num_ftrs = model.classifier.in_features
    model.classifier = nn.Linear(num_ftrs, num_classes)
    return model
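Once the model is loaded, you also need to tell the optimizer which parameters to update: if feature_extract is True only the new classifier layer has requires_grad set, whereas if you fine-tune everything all parameters do. A short sketch of how this might look (the choice of Adam and the learning rate are just examples, not a recommendation):

import torch.optim as optim

model = _get_model(num_classes=2, feature_extract=False)
#collect only the parameters that will actually be trained
params_to_update = [param for param in model.parameters() if param.requires_grad]
optimizer = optim.Adam(params_to_update, lr=1e-4)

Note that different architectures name their final layer differently: ResNet models, for example, expose it as model.fc rather than model.classifier, so _get_model needs a small change for each architecture you try.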

Tensorboard

TensorBoard allows you to visualise your training process. You can plot graphs of loss, accuracy and AUC.

LAUNCHING THE TENSORBOARD
# Load the TensorBoard notebook extension
%load_ext tensorboard
# Define where to store the data (save_folder is the drive folder you are saving your model to)
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(save_folder + 'runs')
#Launch the TensorBoard. It opens all TensorBoard files inside the path given (even if in subfolders)
%tensorboard --logdir=drive/My\ Drive/.../model_v1

WRITING TO THE TENSORBOARD
#During training, add this line to your code to plot the loss
writer.add_scalars('Losses',{'Training':epoch_loss_tr,'Validation':epoch_loss}, epoch)
Figure 2: Losses in training dataset (blue) and validation dataset (red) over 20 epochs; Left — DenseNet; Right — ResNet101

You can see in Figure 2 that even though the loss on the training dataset is decreasing, the loss on the validation data is stable or increasing. This is because the network may be starting to memorise the training dataset rather than learning features of foreign objects. This is particularly evident in large networks such as ResNet101 and DenseNet, compared to smaller networks. Although we implemented data augmentation to mitigate this, it does not fully stop overfitting. This is why we save the model with the best validation AUC rather than the model with the best training loss.
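If you are also tracking AUC (see Evaluation below), it can be written to the same TensorBoard with one extra line in your training loop; epoch_auc here is a placeholder for your computed validation AUC.

#During training, add this line to plot the validation AUC per epoch
writer.add_scalar('AUC/validation', epoch_auc, epoch)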

TensorBoard can also be great for spotting other problems; for example, a network that is not learning anything may have a very noisy, non-decreasing loss curve. If this is the case, you may need to check your model and dataset again.

See this PyTorch tutorial for more details.

Evaluation

When studying binary classification in the medical field, a very important evaluation tool is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a plot of the true positive rate (tpr) against the false positive rate (fpr) as the decision threshold is varied; we evaluate our model by calculating the area under the curve (AUC). The higher the AUC the better, with a theoretical maximum of 1, and anything below 0.5 being less accurate than a coin toss — assuming you are doing binary classification.

CALCULATING AND PLOTTING ROC AUC
#calculating the AUC of your classifier, both gt (ground truth) and pred_prob (outputs from the model) must be in a list format
from sklearn.metrics import roc_auc_score, roc_curve, auc

fpr, tpr, _ = roc_curve(gt, pred_prob)
roc_auc = auc(fpr, tpr)

#the ROC curve can also be plotted using
import matplotlib.pyplot as plt

fig, ax = plt.subplots(subplot_kw=dict(xlim=[0, 1], ylim=[0, 1], aspect='equal'))
ax.plot(fpr, tpr, label=f'AUC: {roc_auc:.03}')
_ = ax.legend(loc="lower right")
_ = ax.set_title('ROC curve')

In addition to the ROC AUC, we can also use accuracy to get a general sense of the performance of our classifier. Accuracy is a simple way to test your model, and is especially useful when training a non-binary (multi-class) classifier, where the ROC AUC is less straightforward to apply.
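For example, with a binary classifier you can get predicted classes by thresholding the output probabilities at 0.5 and comparing them with the ground truth, here using the same gt and pred_prob lists as above:

from sklearn.metrics import accuracy_score

#threshold the probabilities at 0.5 to get hard predictions
preds = [1 if p > 0.5 else 0 for p in pred_prob]
accuracy = accuracy_score(gt, preds)
print(f'Accuracy: {accuracy:.3f}')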

Ensemble Models

Another technique used to improve the results of a classifier is to ensemble multiple models. Just as a panel of experts is better than a single expert, combining several independent models often achieves better results than any individual model.

The training process remains the same. Say you train two models, one with architecture A and the other with architecture B. Both models output probabilities of each image containing a foreign object. There are multiple ways that you can then combine the probabilities from model A and B in order to improve your final AUC value. The three we implemented were majority voting, average, and weighted averages based on a Softmax of each model’s individual AUC.

Through experimenting with different combinations, we found that the weighted average yielded the best results; it also allowed us to obtain results quickly using all 6 of the models we trained, rather than having to manually experiment with different permutations.

Majority Voting

ENSEMBLE A MODEL USING MAJORITY VOTING
import itertools
import collections

#Create empty dictionaries to store values
final_dict = collections.defaultdict(float)
vote_dict = collections.defaultdict(float)
novote_dict = collections.defaultdict(float)
vote_prob_dict = collections.defaultdict(float)
novote_prob_dict = collections.defaultdict(float)
counter_dict = collections.defaultdict(float)

# Iterating key, val with chain()
# Each model's dictionary contains the image names as keys and the probability of the image containing a foreign object as values.
# If the value is above 0.5 then store the probability in vote_prob_dict, if it is below store it in novote_prob_dict. Take the average of whichever has more entries per image.
for key, val in itertools.chain(vgg_dict.items(), squeezenet_dict.items(),
                                resnet101_dict.items(), resnet50_dict.items(),
                                inception_dict.items(), densenet_dict.items()):
    counter_dict[key] += 1
    if val > 0.5:
        vote_dict[key] += 1
        vote_prob_dict[key] += val
    else:
        novote_dict[key] += 1
        novote_prob_dict[key] += val
    if vote_dict[key] / counter_dict[key] > 0.5:
        final_dict[key] = vote_prob_dict[key] / vote_dict[key]
    else:
        final_dict[key] = novote_prob_dict[key] / novote_dict[key]

preds_prob = list(final_dict.values())

Average

ENSEMBLE A MODEL USING AVERAGING
import itertools
import collections

store_dict = collections.defaultdict(float)
numofmodels = 6
# iterating key, val with chain(). For each key (image_name), add the value (probability) from each dictionary
for key, val in itertools.chain(vgg_dict.items(), squeezenet_dict.items(),
                                resnet101_dict.items(), resnet50_dict.items(),
                                inception_dict.items(), densenet_dict.items()):
    store_dict[key] += val / numofmodels

preds_prob = list(store_dict.values())

Weighted Average

ENSEMBLE A MODEL USING WEIGHTED AVERAGING
import itertools
import collections
import math
import torch

#Manually input the AUC values. Make sure they are in the same order as you have listed them in itertools.chain(...)
AUC = torch.tensor([vgg_auc, squeezenet_auc,
                    resnet101_auc, resnet50_auc,
                    inception_auc, densenet_auc])
#Softmax of the scaled AUC values gives a weight for each model (the factor of 125 sharpens the weighting)
outputs = torch.nn.Softmax(dim=0)(AUC * 125)

store_dict = collections.defaultdict(float)
n = 0
#each model's dictionary holds one prediction per dev image (1000 images here)
num_images = len(vgg_dict)
# iterating key, val with chain()
for key, val in itertools.chain(vgg_dict.items(), squeezenet_dict.items(),
                                resnet101_dict.items(), resnet50_dict.items(),
                                inception_dict.items(), densenet_dict.items()):
    #Determine which model this prediction came from using n, as itertools goes through the dictionaries in order
    model = math.floor(n / num_images)
    n += 1
    store_dict[key] += val * outputs[model]

preds_prob_values = list(store_dict.values())
preds_prob = []
#Convert the list of tensors to a list of values
for i in range(len(preds_prob_values)):
    preds_prob.append(preds_prob_values[i].item())

After finding the new preds_prob, recalculate the AUC as described in ‘Evaluation’.

Final Comments

We hope this tutorial has shown you that you do not need to be an expert to successfully complete a Machine Learning classifier project in the Medical Imaging field. In fact, everything we have explained here is applicable to pretty much any classifier project you could dream of. Hopefully you now understand that the data you choose is important, that more iterations do not necessarily lead to better results, and that it is not as simple as minimising the loss function.

Over the 4 weeks we worked on the project, we achieved a score that would have placed us 10th in the Object-CXR Challenge; however, we started the project after the submission date had passed.

After training ResNet50, ResNet101, SqueezeNet, VGG, InceptionV3 and DenseNet networks, and trying all possible combinations to ensemble them together, these were our final results [Figures 3.1, 3.2].

Figure 3.1: Single Model AUC Results.

For the individual models our best results came from the ResNet101 model with a 0.948 validation AUC and 0.943 test AUC.

Figure 3.2: Ensemble Model AUC Results.

Overall, our best result came from combining all the models with a weighted average using Softmax: a validation AUC of 0.952 and a test AUC of 0.951.

In ‘Investigation of the Dataset’ we raised our concerns about certain images being mislabelled. Our ensemble model predicted that the probability of image 08026.jpg (the right image in Figure 1) containing a foreign object is 0.987. However, the foreign object in this image was not annotated in the dataset. Instances such as these will affect your results.

If you are interested in looking more closely at our code, it can be found in our GitHub repository.

Further materials

Surprise! This is not everything there is to know about machine learning within the medical field; here are a few further links we found helpful while making our own way through this mess:
