Melanoma Skin Binary Image Classification using XGBoost Algorithm

Apr 10, 2019

Tyler Yarnell, Ernest Bonat, Ph.D.

General Introduction and Motivation

Cancer is a vicious disease. Not only is it destructive to the body, psyche, and community of those affected; its treatment is also painful and tedious. We have seen encouraging trends in cancer survivability rates, though there is still a long way to go. One insightful trend is the distribution of survival rates across stages I-IV: not surprisingly, cancer caught early is the easiest to treat and has the highest survival rates. Early detection is therefore a huge component of reducing mortality.

Melanoma survival rate (ACS Annual Report).

Melanoma is a form of skin cancer that is usually visible, so image classification can be applied to quickly and easily predict what kind of spot has appeared on the surface of the skin. Further, the tooling medical professionals need to apply Machine Learning (ML) technologies can be built on relatively cheap platforms, compared with traditional modes of detection (e.g., biopsy) that are costly, slower, and carry higher overhead. That information would certainly be useful to medical professionals in their diagnoses.

Healthcare Industry Collaborations

There are a host of industry partnerships working to improve the technology applied to cancer research. These organizations provide accessible data and a platform for community members to contribute to research, along with standards and documentation guidelines. If you're interested, you can read more about one such organization, ISIC.

World Competitions

ISIC competitions started in 2016, and each year the competition structure has evolved and improved. By 2018, the competition fielded 112 teams, featured a robust website, and published white papers describing the teams' approaches. The 2018 competition was structured into three main tasks:

  1. Lesion Boundary Segmentation
  2. Lesion Attribute Detection
  3. Lesion Diagnosis (Classification)

Many teams have found highly accurate solutions, and we want to test our implementation against theirs. A few notable approaches are:

a) Basic Random Forest w/Dimensionality Reduction via PCA (primary metric: AUC at 80%)

b) Convolutional Neural Network (CNN) vs. Dermatologists (primary metric: ROC AUC at 87%)

Extreme Gradient Boosting (XGBoost) Object-Oriented Wrapper

XGBoost is one of the most common ML algorithms in use today. It is ubiquitous in Kaggle competitions, both for its effectiveness and its ease of use. In fact, it's so successful that many data scientists use it off-the-shelf just to explore the feature space, establish initial evaluation benchmarks, and assist with feature engineering work.

Project tree structure for the training application.

We created an object-oriented wrapper around the XGBoost algorithm and integrated it with our ML pipeline. This is a more powerful and flexible approach than a traditional script-style implementation of the analysis (see "Refactoring Python Code for Machine Learning Projects. Python 'Spaghetti Code' Everywhere!" for more information). It lets us refactor code without breaking existing training procedures, makes it easy to add new methods, and generally helps with code maintenance.

Now that we have a Python boosting library, we have a way of easily implementing our approach on new datasets, using scalable Virtual Machines on platforms such as Google Cloud Platform (GCP) ML Engine.
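
The wrapper code itself is not reproduced in this post, but a minimal sketch of the idea might look like the following (the class and method names are illustrative, not the actual implementation):

```python
import xgboost as xgb


class XGBoostClassifierWrapper:
    """Thin object-oriented wrapper around xgboost.XGBClassifier.

    Centralizing construction, training, and prediction in one class
    lets the surrounding ML pipeline swap hyperparameters or add new
    methods (persistence, evaluation) without touching calling code.
    """

    def __init__(self, **params):
        # Defaults can be overridden per experiment via keyword arguments.
        self.params = {"n_estimators": 100, "max_depth": 6, **params}
        self.model = xgb.XGBClassifier(**self.params)

    def fit(self, X_train, y_train):
        self.model.fit(X_train, y_train)
        return self

    def predict(self, X):
        return self.model.predict(X)
```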

Data Preprocessing

The dataset is a series of 150×150-pixel benign and malignant melanoma images taken from the ISIC Archive. To prepare the image data to be fed into any algorithm, we have to turn it into a NumPy array that an ML algorithm can read. For this we went through the following steps (sketched in code after the list):

  1. Reshaping
  2. Flattening
  3. Casting
  4. Normalization/Scaling
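
Here is a minimal sketch of these four steps with NumPy (the use of Pillow for image loading is an assumption; the post does not name the imaging library):

```python
import numpy as np
from PIL import Image


def preprocess(image_paths):
    rows = []
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        img = img.resize((150, 150))       # 1. reshaping to 150x150
        arr = np.asarray(img)              # shape (150, 150, 3)
        rows.append(arr.reshape(-1))       # 2. flattening to 67,500 values
    X = np.stack(rows).astype(np.float32)  # 3. casting to float
    X /= 255.0                             # 4. scaling pixel values to [0, 1]
    return X
```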

The output is a flattened file with 67,500 columns (150 × 150 × 3), where each column holds one pixel channel value between 0 and 255 before scaling. Each row is one image, and there are 12,000 images in total. We store the flattened images as an H5 data file: HDF5 storage is generally faster and more manageable than CSV, loads more quickly into a pandas DataFrame, and is easy to work with via the pandas read_hdf() method.
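
For example (file and key names here are hypothetical):

```python
import pandas as pd

# Persist the flattened images once; X comes from the preprocessing
# step above and y is the 0/1 label vector (encoding assumed).
df = pd.DataFrame(X)
df["label"] = y
df.to_hdf("melanoma_150x150.h5", key="images", mode="w")

# ...then reload quickly in later runs.
df = pd.read_hdf("melanoma_150x150.h5", "images")
```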

Selecting Training, Validation and Testing Data

We split our approximately 12K records by holding out 10% of them as a test set. The remaining data was used for model training and for the subsequent cross-validation/hyperparameter tuning exercise.
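
A sketch of that split with scikit-learn might look like this (the stratify argument and random seed are our assumptions, not stated in the post):

```python
from sklearn.model_selection import train_test_split

# df is the DataFrame loaded from the H5 file above.
X = df.drop(columns=["label"]).values
y = df["label"].values

# 10% hold-out test set; everything else goes to training/cross-validation.
# Stratifying preserves the benign/malignant ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)
```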

Using two different implementations of a fit-predict method, we packaged this as a training application and executed it using Google Cloud's ML Engine. Here is a sample of our training function:

Fit-predict function for model training.
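
In sketch form, such a function might look like the following (the exact signature and parameter handling are assumptions, not the original code):

```python
import xgboost as xgb
from sklearn.metrics import accuracy_score


def fit_predict(X_train, y_train, X_test, y_test, **hyperparams):
    """Train an XGBoost classifier and report hold-out accuracy."""
    model = xgb.XGBClassifier(**hyperparams)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"accuracy: {accuracy:.4f}")
    return model, accuracy
```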

To train the model properly, we used Google's ML Engine, which is easy enough to work with if you follow the step-by-step documentation. One advantage of ML Engine is that a data scientist can focus on building and training models instead of wrestling with dev-ops and managing compute machines; ML Engine performs all the necessary start-up and tear-down for you. It also offers other nifty features, such as easy-to-use Bayesian optimization for hyperparameter tuning: you simply pass a machine config file to the gcloud command and ML Engine does the heavy lifting.
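
For illustration, an ML Engine hyperparameter tuning config of that era might look like the following YAML (the parameter names, ranges, and scale tier are assumptions, not our actual config):

```yaml
trainingInput:
  scaleTier: STANDARD_1
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxTrials: 40
    maxParallelTrials: 3
    params:
      - parameterName: max_depth
        type: INTEGER
        minValue: 3
        maxValue: 10
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.01
        maxValue: 0.3
        scaleType: UNIT_LOG_SCALE
```

A file like this would be passed to the training job via the --config flag on the gcloud job submission command.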

Model accuracy across multiple training jobs and trials.

To get a relatively condensed hyper-cube, we used a parameter search space following this guide. The result was three parallel training runs using 10-, 15-, and 20-fold cross-validation. As the chart shows, within 40 rounds of training the application finds a highly accurate set of hyperparameters within the search space, cutting down on training time.
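
One such run might be sketched as follows, assuming scikit-learn's cross_val_score as the cross-validation mechanism (the post does not state the exact CV implementation, and the hyperparameters shown are placeholders):

```python
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# One of the three parallel runs; the others used cv=15 and cv=20.
model = xgb.XGBClassifier(max_depth=6, learning_rate=0.1)
scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```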

Finally, we retrained a new model on all the training data using the most accurate hyperparameters and evaluated it against our hold-out test set. Because the dataset is imbalanced, we scored the model on multiple evaluation metrics.

Final Results

CONFUSION MATRIX:

[[1932   68]
 [  63  385]]

CLASSIFICATION REPORT:

ACCURACY SCORE: 94.65%
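
As a sanity check, per-class metrics follow directly from the confusion matrix above, assuming scikit-learn's convention (rows are true labels, with benign as class 0):

```python
import numpy as np

cm = np.array([[1932, 68],
               [63, 385]])

accuracy = cm.trace() / cm.sum()                 # (1932 + 385) / 2448 ~ 0.9465
malignant_recall = cm[1, 1] / cm[1].sum()        # 385 / 448 ~ 0.859
malignant_precision = cm[1, 1] / cm[:, 1].sum()  # 385 / 453 ~ 0.850
print(accuracy, malignant_recall, malignant_precision)
```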

Implications

  1. Performing a binary classification of the ISIC dataset with default hyperparameters yields an accuracy score of 92.73%. That is quite a powerful first step.
  2. With some simple hyperparameter optimization, we can easily improve that score to roughly 94%. That's not bad! If you're competing in a competition, you would obviously need to do better to win. In industry, however, where time is a very real constraint, XGBoost is a practical, impactful modeling approach that can yield impressive results.
  3. The ISIC organization hosts quite a few competitions. We would like to extend the ideas from this post to lesion boundary segmentation, multi-class classification, and lesion attribute detection. Our hope is that segmentation and attribute detection would improve accuracy by controlling for factors like lesion pigment variation, ultimately improving the modeling effort.
