Melanoma Classification: Getting a medal on a Kaggle competition

Dimitre Oliveira · Published in Analytics Vidhya · Sep 14, 2020

Using deep learning to identify melanomas from skin images and patient meta-data

Kaggle, SIIM, and ISIC hosted the SIIM-ISIC Melanoma Classification competition, which started on May 27, 2020. The goal was to use image data from skin lesions and the patients' meta-data to predict whether a skin image contained a melanoma. Here is a short introduction to the task from the hosts:

Skin cancer is the most prevalent type of cancer. Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020. It’s also expected that almost 7,000 people will die from the disease. As with other cancers, early and accurate detection — potentially aided by data science — can make treatment more effective.

Currently, dermatologists evaluate every one of a patient’s moles to identify outlier lesions or “ugly ducklings” that are most likely to be melanoma. Existing AI approaches have not adequately considered this clinical frame of reference. Dermatologists could enhance their diagnostic accuracy if detection algorithms take into account “contextual” images within the same patient to determine which images represent a melanoma. If successful, classifiers would be more accurate and could better support dermatological clinic work.

I took part in the competition and, after about 2 months and roughly 200 experiments, got a bronze medal, finishing 241st among 3,314 teams (top 8%). During the competition I also published two kernels: one about visualizing data augmentations and another about using SHAP to explain model predictions.

About the data

Between images, TFRecords, and CSV files, the complete dataset was about 108 GB (33,126 samples in the training set and 10,982 in the test set). Most of the images had high resolution, so handling all of this alone was already a challenge.
On the image side, we had 584 images that were melanomas and 32,542 images that were not. Here is an example:

Left: images without melanoma; right: images with melanoma.

As you can see, it might be pretty tricky to classify those images correctly.

We also had the patients' meta-data, basically a few characteristics related to each patient:

  • sex - the sex of the patient (when unknown, will be blank).
  • age_approx - approximate patient age at the time of imaging.
  • anatom_site_general_challenge - location of the imaged site.
  • diagnosis - detailed diagnosis information.
  • benign_malignant - indicator of malignancy of the imaged lesion.

All of this seemed very interesting, which is basically why I joined the competition, along with the opportunity to run more experiments with TensorFlow, TPUs, and computer vision.

How I approached the challenge

My approach can be summarized by these topics:

  • Pre-process
  • Modeling
  • Ensembling

Pre-process

The pre-processing step was very straightforward. The image data already had very good resolution (1024x1024), so in order to use TPUs with a reasonable number of images per batch (64 to 512) and big models like EfficientNets (B0 to B7), all I had to do was create auxiliary datasets with the same images at different resolutions (ranging from 128x128 to 768x768). Fortunately, those datasets were kindly provided by one of the participants.
For the tabular data, no pre-processing was done; the data was already very simple. I did some experiments using features extracted from the images, but they did not work very well.
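
To give an idea of how such TFRecord datasets can be consumed on a TPU, here is a minimal tf.data sketch. The feature names ('image', 'target') and the exact decoding steps are assumptions for illustration, not the competition's actual schema:

import tensorflow as tf

IMAGE_SIZE = 512  # resolution of the auxiliary dataset being read

def parse_example(serialized):
    # Assumed feature schema: encoded JPEG bytes plus a binary target
    features = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'target': tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(serialized, features)
    image = tf.image.decode_jpeg(example['image'], channels=3)
    image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
    image = tf.cast(image, tf.float32) / 255.0  # scale pixels to [0, 1]
    return image, tf.cast(example['target'], tf.float32)

def get_dataset(filenames, batch_size=128):
    return (tf.data.TFRecordDataset(filenames, num_parallel_reads=tf.data.AUTOTUNE)
            .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(2048)
            .batch(batch_size, drop_remainder=True)  # TPUs need fixed batch shapes
            .prefetch(tf.data.AUTOTUNE))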

Modeling

Let's move to the most interesting part. I will describe my best single model and then talk about the decisions behind some of its components.

The model architecture was an EfficientNetB5 using only image data at 512x512 resolution. I used a cosine annealing learning rate schedule with hard restarts and warmup, together with early stopping, and trained for 100 epochs with a total of 9 cycles, each cycle going from 1e-3 down to 1e-6, with a batch size of 128. With this model, I achieved 0.9470 AUC on the public leaderboard and 0.9396 AUC on the private leaderboard.

For data augmentation I used basic functions; my complete stack was a mix of shear, rotation, crop, flips, saturation, contrast, brightness, and cutout (you can check the code here). For inference, I used a lighter version of the same stack, removing shear and cutout.
Here are a few samples of augmented images:

Augmented training images.
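
As a rough illustration of what part of such a stack can look like with plain tf.image operations (an illustrative subset, not my exact augmentation code; the jitter ranges are assumptions):

import tensorflow as tf

def augment(image, label):
    # Geometric augmentations: random flips and 90-degree rotations
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    # Color augmentations: saturation, contrast, and brightness jitter
    image = tf.image.random_saturation(image, 0.8, 1.2)
    image = tf.image.random_contrast(image, 0.8, 1.2)
    image = tf.image.random_brightness(image, 0.1)
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label

# Applied only to the training pipeline:
# train_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)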

This is what the model looked like (in TensorFlow):

import efficientnet.tfkeras as efn
from tensorflow.keras import layers as L
from tensorflow.keras.models import Model

def model_fn(input_shape=(256, 256, 3)):
    input_image = L.Input(shape=input_shape, name='input_image')
    # EfficientNetB5 backbone with NoisyStudent weights and no classification head
    base_model = efn.EfficientNetB5(input_shape=input_shape,
                                    weights='noisy-student',
                                    include_top=False)

    x = base_model(input_image)
    x = L.GlobalAveragePooling2D()(x)  # single pooling layer on top of the backbone
    output = L.Dense(1, activation='sigmoid', name='output')(x)

    model = Model(inputs=input_image, outputs=output)

    return model
Learning rate schedule (Y-axis is the LR and X-axis is the number of epochs)
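
Below is a minimal sketch of the kind of schedule shown above: a linear warmup followed by cosine annealing with hard restarts, decaying from 1e-3 down to 1e-6 in each cycle. The warmup length and the exact shape are assumptions; only the bounds, the 100 epochs, and the 9 cycles come from the setup described above:

import math

def lr_schedule(epoch, lr_max=1e-3, lr_min=1e-6, warmup_epochs=5,
                total_epochs=100, n_cycles=9):
    # Linear warmup from lr_min up to lr_max
    if epoch < warmup_epochs:
        return lr_min + (lr_max - lr_min) * epoch / warmup_epochs
    # Cosine annealing with hard restarts over the remaining epochs
    cycle_len = (total_epochs - warmup_epochs) / n_cycles
    progress = ((epoch - warmup_epochs) % cycle_len) / cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Plugged in through a Keras callback:
# tf.keras.callbacks.LearningRateScheduler(lr_schedule)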

Ok, now let's break down each component.

Why EfficientNet?

As you can see from my model backlog, I experimented with a lot of different models, but after a while I kept only the EfficientNet experiments. To be honest, I was a little surprised by how much better the EfficientNets performed here; usually, other architectures like InceptionResNetV2, SEResNeXt, or some variations of ResNets and DenseNets would have similar results. Before the competition, I had very high hopes for the recent BiT models from Google, but after many experiments with poor results I gave up on BiT.

For this specific experiment I got better results with the B5 version of EfficientNet, but almost all versions (B3 to B6) gave very similar results. The bigger B7 is more difficult to train: it may require higher-resolution images and is easier to overfit with so many parameters. The smaller versions (B0 to B2) usually perform better with smaller resolutions, which seemed to yield slightly worse results for this task.
Between the classic ImageNet weights and the improved NoisyStudent weights, the latter gave better results.

As you can see, a very basic model with just an average pooling layer on top of the CNN backbone was my best model. Finally, I used binary cross-entropy with label smoothing of 0.05 as the optimization loss.

Single fold training metrics from this model.
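
For reference, here is a minimal sketch of how such a loss can be wired in when compiling the model above. The Adam optimizer and the AUC metric are assumptions; only the binary cross-entropy loss and its label smoothing of 0.05 come from the text:

import tensorflow as tf

model = model_fn(input_shape=(512, 512, 3))
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.BinaryCrossentropy(label_smoothing=0.05),
    metrics=[tf.keras.metrics.AUC(name='auc')])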

You may think that 100 epochs is a lot, and it normally would be, but I was sampling each batch from two different datasets: a regular one and another with only malignant images. This made the model converge much faster, so I made each epoch use only a fraction of the total data (about 10%); roughly, every 10 epochs here were equivalent to 1 regular epoch.
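
A minimal sketch of this kind of two-stream batch sampling with tf.data, assuming regular_ds and malignant_ds are already-built, unbatched datasets of (image, label) pairs; the 50/50 sampling weights are an assumption for illustration:

import tensorflow as tf

# Draw each example from one of the two streams, then re-batch
balanced_ds = tf.data.experimental.sample_from_datasets(
    [regular_ds.repeat(), malignant_ds.repeat()], weights=[0.5, 0.5])
balanced_ds = (balanced_ds
               .batch(128, drop_remainder=True)
               .prefetch(tf.data.AUTOTUNE))

# Each "epoch" then covers only a fraction of the data, controlled via steps_per_epoch:
# model.fit(balanced_ds, epochs=100, steps_per_epoch=steps, ...)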

Evaluating and comparing models

An important part of being effective in Kaggle competitions, or any other machine learning project, is being able to quickly iterate over experiments and compare which one is best; this saves a lot of time and helps you focus on the most fruitful ideas. Since the early stages of the competition I developed a way to evaluate and compare my experiments; this is what it looked like for a random experiment:

Fig 1: metrics across folds.
Fig 2: metrics of different data slices across folds.

As you can see, with information like this it becomes very simple to compare models across folds and experiments. With the "Fig 2" image I can also evaluate the model's performance on different slices of the data; this is very important for identifying possible biases in the model and addressing them early on, for keeping possible improvements in mind, and for knowing which model is better on each portion of the data (this may help with ensembling later).
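
A small sketch of this kind of report, assuming a DataFrame of out-of-fold predictions with 'fold', 'target', 'pred', and the meta-data columns (the column names are assumptions):

from sklearn.metrics import roc_auc_score

def report_auc(oof_df, slice_cols=('sex', 'anatom_site_general_challenge')):
    # Overall AUC per fold (each group needs both classes present)
    print(oof_df.groupby('fold').apply(
        lambda g: roc_auc_score(g['target'], g['pred'])))
    # AUC per data slice, e.g. by sex or by anatomical site
    for col in slice_cols:
        print(oof_df.groupby(col).apply(
            lambda g: roc_auc_score(g['target'], g['pred'])))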

Ensembling

For ensembling, I developed a script to brute-force many ensembling techniques; among these were regular, weighted, power, ranked, and exponential log averages. In the end, the combination the script pointed to as having the best CV was also my best chosen submission.
I used 1x EfficientNetB4 (384x384), 3x EfficientNetB4 (512x512), 1x EfficientNetB5 (512x512), and 2x XGBM models trained using only meta-data.
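
A rough sketch of a few of these blending options over a matrix of model predictions. The exact formulas my script used may differ, in particular the "exponential log" variant, which is shown here as a geometric mean:

import numpy as np
from scipy.stats import rankdata

def blend(preds, method='mean', weights=None, power=2.0):
    # preds: array of shape (n_models, n_samples) with each model's probabilities
    if method == 'mean':          # regular average
        return preds.mean(axis=0)
    if method == 'weighted':      # weighted average
        return np.average(preds, axis=0, weights=weights)
    if method == 'power':         # power average
        return (preds ** power).mean(axis=0)
    if method == 'rank':          # ranked average
        ranks = np.vstack([rankdata(p) for p in preds])
        return ranks.mean(axis=0) / preds.shape[1]
    if method == 'geometric':     # exponential of the mean log (geometric mean)
        return np.exp(np.log(np.clip(preds, 1e-8, 1.0)).mean(axis=0))
    raise ValueError(f'unknown method: {method}')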

Summary of what worked

  • EfficientNet architectures (B3 to B6) with just an average pooling layer.
  • Medium image resolutions (256x256 to 768x768).
  • Learning rate schedules with a warmup (regular cosine annealing and also cyclical with warm restarts).
  • Ensembling image models (CNNs) with meta-data only models (XGBM).
  • Augmentation helped a lot here, although it was a little tricky to find the best combination.
  • Cutout helped fight overfitting; I was close to getting MixUp to work, but there was not enough time.
  • Batch sampling played a very important role with the heavily imbalanced data.
  • Using TPUs was crucial; previous experience with the TensorFlow API and modules helped me a lot.
  • TTA (test time augmentation) gave a good score boost.
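
For the TTA point above, here is a minimal sketch of averaging predictions over several randomly augmented passes of the test set, reusing the augment function sketched earlier; the number of rounds is an assumption:

import numpy as np
import tensorflow as tf

def predict_with_tta(model, test_images_ds, tta_rounds=4, batch_size=128):
    # test_images_ds is assumed to be an unbatched dataset of images in a fixed order
    preds = []
    for _ in range(tta_rounds):
        augmented = (test_images_ds
                     .map(lambda img: augment(img, 0)[0],
                          num_parallel_calls=tf.data.AUTOTUNE)
                     .batch(batch_size))
        preds.append(model.predict(augmented).ravel())
    # Average the probabilities from every augmented pass
    return np.mean(preds, axis=0)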

What could have been improved?

  • Comparing my models' performance to the top teams', I could see that I had strong models; maybe going for diversity instead of only CV score in my ensembles could have boosted the final scores.
  • Maybe training for a few more epochs with pseudo-labels could have improved results a little.

Conclusion

You can view all my experiments in the GitHub repository I created for this competition; there you will find all my experiments and also a nice compilation of research materials I collected during the competition.
I also wrote a small overview at Kaggle.
There is so much more to be said about the competition, and you might have a few questions as well; in any case, feel free to reach out on my LinkedIn.
