Methods for training a pretrained multimodal image classification model to detect skin malignancies

Dany TH
Institute for Applied Computational Science
Dec 15, 2020

Authors: Daniel Chen, Dany Thorpe Huerta, Priyanka Soni, Leina Essakallihoussaini

GitHub Repo

This article was produced as part of the final project for Harvard’s AC295 Fall 2020 course.

Introduction

Skin cancer is by far the most common form of cancer. By the age of 70, 20% of Americans will develop skin cancer. The 5-year survival rate for melanoma detected early is 99%. Since skin cancer is most often diagnosed visually at first, it presents an opportunity to develop AI that detects malignancy.

The goal of the SIIM-ISIC 2020 Challenge is to develop algorithms that can distinguish between an image taken from a patient with skin cancer (malignant) and one from a healthy patient (benign). We aimed to build a melanoma classification algorithm given dermoscopic images of potential melanomas and the image metadata in the form of tabular data. We approached this in two ways: in one, we created our own TFRecords to train the model; in the other, we used TFRecords already created by a Kaggle Grandmaster. In this Medium article, we explain, step by step, our approach to solving this Kaggle challenge.

The Data

The dataset was generated by the International Skin Imaging Collaboration (ISIC) and images are from the following sources: Hospital Clínic de Barcelona, Medical University of Vienna, Memorial Sloan Kettering Cancer Center, Melanoma Institute Australia, The University of Queensland, and the University of Athens Medical School.

The dataset was released by the International Skin Imaging Collaboration (ISIC) in partnership with the Society for Imaging Informatics in Medicine (SIIM) for a Kaggle competition in 2020. JPEG images and demographic data such as sex and age are available.

The total size of the dataset is 31 GB. The vast majority of images (more than 98%) are benign, so the dataset is highly imbalanced.

Our TFRecords

We removed any rows with missing values from our dataset so that we could perform a complete-case analysis. Additionally, we added the RGB values, height, width, and size of each image to the original Kaggle dataframe. Then, we resized all the images to 112 x 112 x 3. Once these preprocessing steps were complete, we split the data so that 60% of images went to a training directory and 40% to a validation directory, stratified by target.

Finally, we created TFRecords for all images and their associated information, with one TFRecord file for the training set and one for the validation set. The TFRecords contained the following features: the resized image, approximate patient age, patient sex, skin lesion anatomical site, image red value, image green value, image blue value, image width, image height, image size, and the target label encoding malignancy. We used all of this information when training our model.
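
As a rough sketch, writing one of these records might look like the following; the feature names, encodings, and the training_examples iterable are illustrative rather than our exact pipeline code:

```python
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_example(jpeg_bytes, row):
    # Resize the decoded image to 112 x 112 x 3 and re-encode it as JPEG.
    image = tf.image.decode_jpeg(jpeg_bytes, channels=3)
    image = tf.image.resize(image, [112, 112])
    image_bytes = tf.io.encode_jpeg(tf.cast(image, tf.uint8)).numpy()

    feature = {
        'image': _bytes_feature(image_bytes),
        'age_approx': _float_feature(row['age_approx']),
        'sex': _int64_feature(row['sex']),                  # label-encoded
        'anatom_site': _int64_feature(row['anatom_site']),  # label-encoded
        'red': _float_feature(row['red']),
        'green': _float_feature(row['green']),
        'blue': _float_feature(row['blue']),
        'width': _int64_feature(row['width']),
        'height': _int64_feature(row['height']),
        'size': _int64_feature(row['size']),
        'target': _int64_feature(row['target']),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# One TFRecord file per split, e.g. train.tfrec and val.tfrec.
with tf.io.TFRecordWriter('train.tfrec') as writer:
    for jpeg_bytes, row in training_examples:  # hypothetical iterable of (bytes, dict)
        writer.write(serialize_example(jpeg_bytes, row))
```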

Kaggle Grandmaster TFRecords

Kaggle Grandmaster Chris Deotte preprocessed the images so that they were (1) centered and cropped on the skin lesion, and (2) resized to 256 x 256 x 3. The data was written to 15 training TFRecords to be used with K-fold cross-validation. The TFRecords were triple-stratified, meaning they were created such that (1) all images from the same patient were kept within the same fold, so no patient is split across folds, (2) malignant images were spread evenly across folds, and (3) each fold contains a good mixture of patients with many images and patients with few images. The TFRecords contained the following features: preprocessed image, image name, patient ID, patient sex, patient approximate age, skin lesion anatomical site, image width, image height, diagnosis, and the target label encoding malignancy. We only used the image, sex, age, anatomical site, and target labels for our model.
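
Reading these records back only requires parsing the subset of features we used. A minimal sketch, assuming the feature names below match the record schema (the actual files may differ slightly):

```python
import tensorflow as tf

# Assumed feature names; the actual TFRecord schema may differ slightly.
FEATURE_SPEC = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'sex': tf.io.FixedLenFeature([], tf.int64),
    'age_approx': tf.io.FixedLenFeature([], tf.int64),
    'anatom_site_general_challenge': tf.io.FixedLenFeature([], tf.int64),
    'target': tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    image = tf.image.decode_jpeg(parsed['image'], channels=3)   # 256 x 256 x 3
    image = tf.cast(image, tf.float32) / 255.0
    meta = tf.stack([
        tf.cast(parsed['sex'], tf.float32),
        tf.cast(parsed['age_approx'], tf.float32),
        tf.cast(parsed['anatom_site_general_challenge'], tf.float32),
    ])
    return (image, meta), parsed['target']

files = tf.io.gfile.glob('train*.tfrec')   # the 15 training TFRecords
dataset = (tf.data.TFRecordDataset(files)
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))
```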

The Model

We utilized a multimodal, pretrained modeling approach. The inputs consist of images and metadata features: a pretrained CNN processes the image input, while the metadata features are fed into a Multi-Layer Perceptron (MLP). The outputs of the two branches are then concatenated and fed into a single dense layer. We tested various EfficientNet pretrained models as the image backbone for both TFRecord approaches.
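
A minimal sketch of this architecture in Keras, assuming a 256 x 256 x 3 image input, three metadata features, and EfficientNetB0 as the backbone (the exact variant, layer sizes, and optimizer settings here are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(image_shape=(256, 256, 3), n_meta_features=3):
    # Image branch: pretrained EfficientNet backbone with global average pooling.
    image_in = layers.Input(shape=image_shape, name='image')
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, weights='imagenet', pooling='avg')
    image_features = backbone(image_in)

    # Metadata branch: a small MLP over sex, age, and anatomical site.
    meta_in = layers.Input(shape=(n_meta_features,), name='metadata')
    meta_features = layers.Dense(32, activation='relu')(meta_in)
    meta_features = layers.Dense(16, activation='relu')(meta_features)

    # Concatenate both branches and predict malignancy with a single dense layer.
    combined = layers.Concatenate()([image_features, meta_features])
    output = layers.Dense(1, activation='sigmoid', name='target')(combined)

    model = tf.keras.Model(inputs=[image_in, meta_in], outputs=output)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.AUC(name='auc')])
    return model
```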

Model Architecture

Model Evaluation

To evaluate model training, we examined the following:

1. Model Debugging Tests:

These tests involved extensive use of the wandb dashboard to check that the model was converging correctly. It also helped us verify that our selected batch size was using the GPU appropriately; the images below show that training utilized the GPU at more than 90%.
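
A minimal sketch of the kind of wandb instrumentation we relied on (project name and config values are illustrative):

```python
import wandb
from wandb.keras import WandbCallback

# Hypothetical project and config values for illustration.
wandb.init(project='melanoma-classification',
           config={'batch_size': 64, 'epochs': 20, 'backbone': 'EfficientNetB0'})

model.fit(train_ds,
          validation_data=val_ds,
          epochs=wandb.config.epochs,
          callbacks=[WandbCallback()])  # logs loss/metrics each epoch; the run
                                        # also tracks GPU utilization automatically
```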

2. Performance Metrics:

A classification report of the model results on a subset of validation data shows that the model suffers from the class imbalance in this dataset, particularly on malignant cases.

Class 1 (malignant) is where our focus lies: we want a higher F1 score so that more malignant studies are captured without overwhelming hospital resources. This classification report was created with a decision threshold of 0.028. The low threshold does add many false positives to the malignant predictions, but it lets us capture 50% of the true malignant cases without adding an unmanageable number of false positives.

Classification report for our model given the imbalanced dataset

We did attempt to apply label smoothing with various alphas; however, we did not observe any better classification performance (not shown here).
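
In Keras, label smoothing is a one-line change to the loss; a sketch of what we tried, with an illustrative alpha:

```python
import tensorflow as tf

# alpha = 0.05 is illustrative; we swept several values without improvement.
loss = tf.keras.losses.BinaryCrossentropy(label_smoothing=0.05)
model.compile(optimizer='adam', loss=loss,
              metrics=[tf.keras.metrics.AUC(name='auc')])
```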

The Area Under the Curve (AUC) on this sample is 0.79, which does show some meaningful learning.

The confusion matrix shows that this model captured 9 out of 17 malignant cases but added 93 false positives to the cases to review; we would still miss 8 malignant studies. Choosing the right threshold cutoff therefore requires a trade-off analysis.
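
A minimal sketch of how these numbers can be produced from validation predictions (variable names such as y_true and val_ds are illustrative):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# y_true holds the ground-truth labels for the validation subset (illustrative).
y_prob = model.predict(val_ds).ravel()       # predicted malignancy probabilities
y_pred = (y_prob > 0.028).astype(int)        # the low threshold discussed above

print(classification_report(y_true, y_pred, target_names=['benign', 'malignant']))
print(confusion_matrix(y_true, y_pred))
print('AUC:', roc_auc_score(y_true, y_prob))
```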

Model Learning

Since performance was very poor, we needed to understand why. Because such a model could potentially expedite diagnosis and time to treatment, it must classify melanoma accurately.

We can look at individual examples to understand, visually, what might be wrong with our model.

The top-left image is a true melanoma and the other images are all suspected benign spots. We can see that the model has overfit to the majority class: the predicted probability of melanoma is about 0.02, which mirrors the ~98% benign class imbalance.

The image above shows the melanoma classification probabilities of 9 heterogeneous images. The top-left one is actual melanoma and the rest are considered not melanoma. The images vary widely in characteristics, including skin tone, hair, other blemishes, ruler markings, and one image that appears to have been taken through a microscope. Likely because of all this variety, our model was unable to distinguish the dark red features of the melanoma in our example.

To investigate further, we can first look at activation heatmaps. Activation heatmaps show which parts of the input activate the convolutional layers during classification, which helps us debug what is wrong with our model and where it is focusing. Images of the activation heatmaps of the convolutional layer can be seen below.

An example benign image; the activation heatmaps are below.
The activation heatmaps for the image above. Mostly the surrounding skin is activated; the hair and the actual suspected melanoma are only partially activated in some layers and not in others.

The activation heatmap of the convolutional layer for the benign image shows that most of the activation is on the skin around the mole and not on the mole itself. It seems that our model might be checking for the presence of skin rather than melanoma.
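
A minimal sketch of extracting such activation heatmaps from the convolutional backbone; the layer names assume the hypothetical model sketch above and are illustrative:

```python
import numpy as np
import tensorflow as tf

# Grab the nested pretrained backbone and one of its convolutional layers.
# 'efficientnetb0' and 'top_conv' match the illustrative model above.
backbone = model.get_layer('efficientnetb0')
conv_layer = backbone.get_layer('top_conv')
activation_model = tf.keras.Model(backbone.input, conv_layer.output)

# `image` is a single preprocessed image array (illustrative).
activations = activation_model.predict(image[np.newaxis, ...])  # (1, H, W, C)
heatmap = activations[0].mean(axis=-1)                          # average over channels

# Normalize to [0, 1] before overlaying on the original image.
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
```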

We can also investigate SHAP values as an alternative method of explaining our model. SHAP values explain the output of a machine learning model using local, additive feature attributions. For neural networks, SHAP values are approximated using a background data sample and the inputs we want to explain. The SHAP values then tell us which locations in the image increase the probability of a given class and which locations decrease it.

SHAP explanation of a malignant image. SHAP shows that our gradients are focused on some noise to the right of the actual melanoma mark.

The image above shows the SHAP values for this malignant melanoma image. In the middle and right panels, the attributions are noisy and highlight seemingly unimportant areas rather than the lesion itself.
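
A minimal sketch of computing such SHAP values with shap's GradientExplainer, shown for an image-only model for simplicity (the two-input multimodal model would need both inputs passed as a list; train_images and test_images are illustrative arrays):

```python
import numpy as np
import shap

# A small background sample of images approximates the expectation in SHAP.
background = train_images[np.random.choice(len(train_images), 50, replace=False)]

# GradientExplainer approximates SHAP values for differentiable models.
explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(test_images[:3])

# Overlay the attributions: red regions push the prediction toward malignant,
# blue regions push it toward benign.
shap.image_plot(shap_values, test_images[:3])
```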

Together, these approaches explain why our model performed sub-optimally: it did not learn the difference between melanoma and benign spots.

Kaggle Performance

As expected, when we submitted our predictions to the Kaggle competition, performance was poor: our scores were 45.81% and 49.44% on the public and private leaderboards, respectively.

Conclusion and Future Work

Using an EfficientNet backbone combined with a multilayer perceptron, we built a pretrained multimodal image classification model and trained it with both our own TFRecords and a Kaggle Grandmaster's TFRecords. The model's performance, regardless of which TFRecords were used to train and validate it, was not satisfactory. To debug it, we used wandb to make sure the model was converging properly, and we tried to interpret the model with different visualization techniques (activation maps and SHAP); it appears the model is not picking up on the suspect part of the image (the melanoma).

Moving forward, we aim to improve model performance with image augmentation, potentially removing hair from the images. Additionally, we want to include images of darker skin tones, since this dataset contained images of paler skin tones; this way our model will be accessible to all demographics and will not exacerbate health disparities. Once our model performs well consistently, we would like to create a web app for image classification prediction, ideally for medical professionals to use. In the meantime, we will continue working to improve our model and thereby contribute to better prognoses for skin cancer patients through early detection.

Acknowledgements

We would like to thank our professor Pavlos Protopapas and the Harvard Applied Computation 295 course teaching staff for their guidance and support.
