Data Collection for HAM10000

Ahmed
3 min read · Dec 11, 2022

This article will discuss the data collection process for the HAM10000 dataset, which is a collection of classified skin lesion images for skin cancer detection.

Photo by National Cancer Institute on Unsplash

This article is part of a larger series on the HAM10000 dataset — please refer to the introduction article.

HAM10000 Context

HAM10000 stands for “Human Against Machine with 10000 training images.” It’s a dataset of roughly 10,000 high-quality, close-up dermatoscopic images of skin lesions intended for training skin cancer detection models (to be exact, there are 10,015 images in the HAM10000 dataset). There are seven diagnoses / labels in the dataset, listed below with their abbreviations:

  • Actinic Keratoses / Bowen’s Disease (akiec)
  • Basal Cell Carcinoma (bcc)
  • Benign Keratoses (bkl)
  • Dermatofibroma (df)
  • Melanoma (mel)
  • Melanocytic Nevi (nv)
  • Vascular Lesions (vasc)

Skin lesions diagnosed as Actinic Keratoses / Bowen’s Disease, Basal Cell Carcinoma, or Melanoma are typically active cancer (malignant). Lesions with any of the other four diagnoses are typically benign / noncancerous.
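
As a quick illustration, the seven labels can be grouped in code. This is a minimal sketch based on the typical-case grouping described above, not a per-lesion clinical judgment:

```python
# Map each HAM10000 diagnosis abbreviation to its full name and its
# typical nature. Note: this reflects the typical case described above;
# individual lesions should always be judged by their actual diagnosis.
DIAGNOSES = {
    "akiec": ("Actinic Keratoses / Bowen's Disease", "malignant"),
    "bcc":   ("Basal Cell Carcinoma",                "malignant"),
    "bkl":   ("Benign Keratoses",                    "benign"),
    "df":    ("Dermatofibroma",                      "benign"),
    "mel":   ("Melanoma",                            "malignant"),
    "nv":    ("Melanocytic Nevi",                    "benign"),
    "vasc":  ("Vascular Lesions",                    "benign"),
}

def is_typically_malignant(dx: str) -> bool:
    """Return True if the abbreviation maps to a typically malignant diagnosis."""
    return DIAGNOSES[dx][1] == "malignant"
```

A grouping like this is handy when collapsing the seven-way classification problem into a simpler binary (malignant vs. benign) task.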

Here are some sample images from the HAM10000 dataset:

Created by Nabib Ahmed

In addition, there’s a separate dataset of 1,511 images for testing / validation. After training a model on the HAM10000 dataset, its performance can be evaluated on these test images, which the model has never seen before. The test labels are hidden from the public, but a model’s performance can be determined by uploading its predictions to the HAM10000 challenge portal, where submissions are evaluated against this test set.

HAM10000 Collection

The images were collected from two different sites, the Department of Dermatology at the Medical University of Vienna, Austria, and the skin cancer practice of Cliff Rosendahl in Queensland, Australia. Over 20 years of skin lesion images from the two sites were extracted, cleaned, and reformatted to create HAM10000.

The original data came in various forms, including PowerPoint files, Excel spreadsheets, and different image formats. The researchers used a combination of automation, machine learning, and manual review to extract all the images into a standard JPEG format and the associated metadata into a tab-separated file. They also filtered out bad data, such as low-quality images, images with poor or missing annotations, and non-dermatoscopic images. Furthermore, they addressed differences between the images, ensuring each had similar luminosity / hue and that the lesion was centered. Finally, they worked to secure proper copyrights and licensing so the dataset could be made public.
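
To make the centering and luminosity ideas concrete, here is a toy sketch of that kind of standardization. This is not the researchers’ actual pipeline (their paper describes the real process); it just illustrates center-cropping and brightness normalization with NumPy:

```python
import numpy as np

def standardize(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Toy standardization: center-crop to a square and rescale pixel
    values so the mean brightness is ~0.5. Illustration only; not the
    HAM10000 authors' actual cleaning pipeline."""
    h, w = img.shape[:2]
    side = min(h, w, size)
    top, left = (h - side) // 2, (w - side) // 2
    crop = img[top:top + side, left:left + side].astype(float)
    mean = crop.mean()
    if mean > 0:
        crop = np.clip(crop * (0.5 / mean), 0.0, 1.0)
    return crop

# Example on a dummy "image" with pixel values in [0, 1):
dummy = np.random.rand(450, 600, 3)
out = standardize(dummy)
```

Real pipelines would also handle color calibration and lesion detection, but the shape of the problem (crop, then normalize) is the same.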

More specific details on their data collection process can be found in their paper. All of these steps allowed the HAM10000 dataset to be consistent, high quality, and free to use.

HAM10000 Verification

The researchers also took extra steps to validate and verify the diagnoses of the collected data. There were four verification categories:

  • Histopathology: the original diagnosis was performed by specialized dermatopathologists and was manually reviewed for plausibility.
  • Confocal: reflectance confocal microscopy, an in-vivo imaging technique with near-cellular resolution, was used to verify the diagnosis.
  • Follow-up: the researchers tracked the patient associated with the skin lesion through follow-up visits over a period of time to ensure a benign diagnosis was truly benign (i.e., didn’t develop into an active cancer).
  • Consensus: a panel of experts provided the diagnosis where the other verification techniques couldn’t be used.

More details on the verification methods can be found in their paper. The researchers also record the verification method used for each image in the metadata CSV.
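
With pandas, that per-image metadata is easy to explore. The snippet below uses a tiny stand-in DataFrame rather than the real file; to the best of my knowledge the published metadata names the diagnosis column `dx` and the verification column `dx_type` (with values like "histo", "follow_up", "consensus", "confocal"), but verify the column names against the copy you download:

```python
import pandas as pd

# A tiny stand-in for the HAM10000 metadata CSV (dummy rows, not real
# dataset entries). The real file has one row per image with columns
# such as `image_id`, `dx` (diagnosis), and `dx_type` (verification).
meta = pd.DataFrame({
    "image_id": ["img_001", "img_002", "img_003"],
    "dx":       ["mel", "nv", "bkl"],
    "dx_type":  ["histo", "follow_up", "consensus"],
})

# How many images fall under each verification category?
counts = meta["dx_type"].value_counts()
print(counts)
```

Running the same `value_counts` on the full metadata shows how the 10,015 images are distributed across the four verification categories.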

Conclusion

Now that we understand how the data was collected, we can focus on investigating it. Since the researchers went to great lengths to ensure high data quality, we don’t have to worry as much about common data issues (such as missing data, incorrect labels, or low-quality images).
