Exploring the HAM10000 Dataset

Dec 11, 2022

The HAM10000 dataset is a collection of roughly 10K images of skin lesions for skin cancer detection. In previous articles, we discussed the game plan and data collection for HAM10000. This article explores the data through exploratory data analysis (EDA).

Available Variables / Features

In our dataset, we have 10015 close-up, high-quality, centered dermatoscopic images of skin lesions. All the images are in color, 450 pixels high and 600 pixels wide. This is very convenient since we don’t have to worry about standardizing our images to the same size. Below are some sample images from the HAM10000 dataset:

We also have metadata associated with each image (found in a metadata.tab file). The metadata variables available to us are:

lesion_id: this identifies the skin lesion and will be in the form of HAM_X where X is a 7-digit unique identifier number (multiple images can share the same lesion_id when they show the same lesion)
image_id: this refers to the image name in the HAM10000 data folders and will be in the form ISIC_X where X is a 7-digit unique identifier number (each image has its own ISIC number); this will be useful for finding the image in the data directory
dx: this refers to the diagnosis; the column uses the following abbreviation scheme:
akiec: actinic keratoses and intraepithelial carcinoma / Bowen’s disease
bcc: basal cell carcinoma
bkl: benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses)
df: dermatofibroma
mel: melanoma
nv: melanocytic nevi
vasc: vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage)
dx_type: this refers to how the skin lesion is verified; it will be either histopathology (abbreviated as histo), confocal, consensus, or follow-up and more information on verification is available here
age: this refers to the age of the patient associated with the skin lesion image
sex: this refers to the sex of the patient associated with the skin lesion image (will be male, female, or unknown)
localization: this refers to the location of the skin lesion on the body (e.g., back, hand, scalp, etc.)
dataset: this refers to the origin dataset for the skin lesion (which site / group it was collected from)

Since our goal is to determine the skin cancer diagnosis from the skin lesion, the key features to analyze are the images, the labels (dx), and the demographic / patient information (age, sex, and localization). The other variables, while insightful, mostly describe how the data was collected, verified, and organized, so we won’t focus on them in this article.
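
To make the variable list concrete, here is a minimal sketch of reading the metadata and keeping only the columns of interest. It uses only the Python standard library and assumes the metadata file is tab-delimited with the column names listed above. The sample rows, the site names, and the helper names (load_metadata, select_features, malformed_ids) are made up for illustration and are not part of the dataset or any library.

```python
import csv
import io
import re

# Columns the article focuses on; the remaining ones describe data collection.
FEATURE_COLUMNS = ["image_id", "dx", "age", "sex", "localization"]

# ID formats as described above: a prefix followed by a 7-digit number.
LESION_ID_PATTERN = re.compile(r"HAM_\d{7}")
IMAGE_ID_PATTERN = re.compile(r"ISIC_\d{7}")


def load_metadata(stream, delimiter="\t"):
    """Read metadata rows from an open text stream into a list of dicts.

    Assumption: the file is tab-delimited with a header row.
    """
    return list(csv.DictReader(stream, delimiter=delimiter))


def select_features(rows):
    """Keep only the columns of interest for each row."""
    return [{col: row[col] for col in FEATURE_COLUMNS} for row in rows]


def malformed_ids(rows):
    """Return image_ids of rows whose lesion_id or image_id breaks the expected format."""
    return [
        row["image_id"]
        for row in rows
        if not LESION_ID_PATTERN.fullmatch(row["lesion_id"])
        or not IMAGE_ID_PATTERN.fullmatch(row["image_id"])
    ]


# Made-up sample rows standing in for the real metadata file.
SAMPLE = (
    "lesion_id\timage_id\tdx\tdx_type\tage\tsex\tlocalization\tdataset\n"
    "HAM_0000001\tISIC_0000001\tnv\thisto\t45\tmale\tback\tsite_a\n"
    "HAM_0000001\tISIC_0000002\tnv\thisto\t45\tmale\tback\tsite_a\n"
    "HAM_0000002\tISIC_0000003\tmel\thisto\t70\tfemale\tscalp\tsite_b\n"
)

rows = load_metadata(io.StringIO(SAMPLE))
features = select_features(rows)
print(features[0])
print(malformed_ids(rows))
```

With the real file, the io.StringIO object would be replaced by an open file handle pointing at wherever the metadata was downloaded.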

Diagnosis (Label) Distribution

Let’s see the distribution of diagnoses / labels in our dataset (this is the dx column in the metadata file).

The above bar chart shows that our dataset is extremely unbalanced. The nv category has 6705 of the 10015 images (about 67%), which means it has more representation than all the other categories combined. Such a skewed dataset can present problems when creating ML-based algorithms (e.g., the model may naively always pick the nv category because it dominates the rest). In a future article, we’ll explore how to handle the dataset imbalance.
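
To put a number on the "always pick nv" problem, the short sketch below computes the accuracy of a model that always predicts the most frequent class. The nv count and the dataset size come from this article; the other six classes are lumped into one placeholder entry, and the function name is made up for illustration.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a model that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# nv count and dataset size are taken from the article; the remaining six
# classes are combined because only their total matters for this baseline.
counts = {"nv": 6705, "all_other_classes": 10015 - 6705}

baseline = majority_baseline_accuracy(counts)
print(f"Always predicting nv would score {baseline:.1%} accuracy")
```

Any model we train later needs to beat this figure before its accuracy means much, which is one reason accuracy alone can be misleading on imbalanced data.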

Demographic / Patient Distributions

In this section, we’ll explore the age, sex, and localization variables. For age, we’ll use a histogram since it’s a numeric variable. For sex and localization, we’ll use a bar chart since they are categorical variables.

The ages in the HAM10000 dataset range from 0 to 85 years. The distribution appears to be bimodal, with a large number of patients in the 35–50 and 60–75 age ranges. Furthermore, we see that patients in the dataset are generally older. This matches the general trend that skin cancer typically affects older individuals. From this distribution, there don’t seem to be any anomalies in the ages.
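
A histogram like the one described here can be sketched by bucketing ages into fixed-width bins. The example below is a rough illustration with made-up ages, a 5-year bin width chosen arbitrarily, and the assumption that a blank entry means the age was not recorded; the function name is hypothetical.

```python
from collections import Counter

BIN_WIDTH = 5  # arbitrary choice for illustration


def age_histogram(ages, bin_width=BIN_WIDTH):
    """Count ages per bin and return {bin_start: count}, sorted by bin_start."""
    counts = Counter()
    for raw in ages:
        if raw is None or str(raw).strip() == "":
            continue  # assumption: blank entries mean the age was not recorded
        age = float(raw)
        bin_start = int(age // bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Made-up ages, given as strings the way a CSV reader would return them.
sample_ages = ["0", "40", "45", "45.0", "", "70", "85"]

hist = age_histogram(sample_ages)
for start, n in hist.items():
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * n}")
```

Skipping blanks explicitly keeps missing ages from being silently counted as 0, which would otherwise look like an anomaly at the low end of the range.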

We see that the male category has the most representation, with roughly 1000 more images than the female category (about 10% of the dataset). Ideally, we would want equal representation, but given that the imbalance is relatively small, we can de-prioritize investigating or addressing it. Furthermore, we see 57 images with an unknown sex; these are images where the annotation on sex was not available. Since such a small percentage of the data has unknown sex, we could potentially discard those images. However, since our first series of analyses will only use the images, we can keep them around until we start exploring multimodal solutions (where we will use this information alongside the image).
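
As a rough sketch of the two options mentioned here, the code below computes the share of each sex value and separates rows with an unknown sex so they can be kept or dropped later. The rows are made up, and the function names are hypothetical.

```python
from collections import Counter

KNOWN_SEXES = ("male", "female")


def sex_shares(rows):
    """Fraction of rows per value of the sex column."""
    counts = Counter(row["sex"] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def split_unknown_sex(rows):
    """Separate rows with a recorded sex from rows labeled unknown."""
    known = [row for row in rows if row["sex"] in KNOWN_SEXES]
    unknown = [row for row in rows if row["sex"] not in KNOWN_SEXES]
    return known, unknown


# Made-up rows for illustration only.
sample_rows = [
    {"image_id": "ISIC_0000001", "sex": "male"},
    {"image_id": "ISIC_0000002", "sex": "male"},
    {"image_id": "ISIC_0000003", "sex": "female"},
    {"image_id": "ISIC_0000004", "sex": "unknown"},
]

shares = sex_shares(sample_rows)
known_rows, unknown_rows = split_unknown_sex(sample_rows)
print(shares)
print(len(known_rows), len(unknown_rows))
```

Keeping the unknown rows in a separate list, rather than deleting them, fits the plan of using every image for image-only models and revisiting the question for multimodal ones.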

The bar chart clearly shows that the localization of the skin lesions is very imbalanced. A large number of skin lesions are located on the back, lower extremity, and trunk, while the acral, genital, and ear locations combined barely cross 100 examples (~1% of the dataset). Given that skin lesions can look very different depending on where they are on the body, this may ultimately affect the model’s performance and its ability to predict correctly on skin lesions found in less represented locations. As we push to make the best possible program, this may be a potential avenue to explore and evaluate against (i.e., see how well the model performs segmented by the localization of the skin lesion).
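
Evaluating "segmented by localization" could look something like the sketch below, which computes accuracy separately for each body location. The labels and predictions are made up, no real model is involved, and the function name is hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(groups, y_true, y_pred):
    """Accuracy computed separately for each group value (e.g., localization)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, prediction in zip(groups, y_true, y_pred):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}


# Made-up example: three lesions on the back, two on the ear.
localizations = ["back", "back", "back", "ear", "ear"]
true_labels = ["nv", "mel", "nv", "nv", "mel"]
predicted = ["nv", "mel", "nv", "nv", "nv"]

per_location = accuracy_by_group(localizations, true_labels, predicted)
for location, accuracy in per_location.items():
    print(f"{location}: {accuracy:.0%}")
```

For locations with only a handful of examples, a per-group accuracy rests on very few images, so it is worth reporting the group size next to each score.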

Conclusion

In this article, we performed exploratory data analysis on the HAM10000 dataset. From our exploration, our key takeaways are:

All the images are the same size and format (centered, close-up images of the skin lesion); this reduces the amount of preprocessing or data wrangling we need to do
The dataset is highly imbalanced on the diagnoses / labels which might affect our program’s performance; in later articles, we’ll discuss methods to address imbalanced datasets
The distribution of skin lesion localization isn’t balanced either; in later articles, we’ll dive deeper into localization patterns
The distribution for sex is relatively balanced (i.e., roughly equal numbers of both sexes) and the age of the patients skews higher (which makes intuitive sense since people are more likely to develop cancer as they get older)

In later articles, we’ll dive deeper into the HAM10000 dataset, informed by our analysis in this article.

Written by Ahmed