Skin Tone Analysis with HAM10000

Nabib Ahmed
8 min read · Dec 11, 2022


HAM10000 is a dataset of ~10K labeled images of skin lesions. The dataset also includes patient demographic information for each image, such as sex, age, and the lesion's location on the body. However, HAM10000 doesn't include the race of the patient associated with each image.

Photo by National Cancer Institute on Unsplash

Medical studies provide evidence of a relationship between race and skin cancer (sources: Skin Cancer in Skin of Color and Skin Cancer Concerns in People of Color: Risk Factors and Prevention). Thus, race might be a worthwhile variable to have and investigate. In this article, we'll explore methods to approximate this variable and investigate the resulting patterns in the dataset.

Initial Approach: Manual Annotation

A very naive way to gather race for the HAM10000 dataset is to go through each image and manually annotate the race variable. While viable, this approach has several issues:

  • Time: annotation requires a non-trivial amount of time. For simplicity, let's assume it takes 30 seconds to annotate each image (i.e., looking at the image and then assigning a race); with 10K images, the entire process would take 5,000 minutes (roughly 3.5 days of non-stop work)
  • Scale: the process is very manual, making it difficult to execute at a very large size. Imagine receiving 100K more images to annotate; based on our time calculations, it would take roughly a month
  • Bias: annotators might bring their own biases when labeling the images (and might not even be aware of them). To mitigate this, we could employ multiple annotators and use a voting mechanism; however, the trade-off is that we need more annotators, which adds time and cost.

Beyond these issues, a more jarring problem is the challenge of determining race from the images themselves. Below are some sample images from the dataset:

Created by Nabib Ahmed

The images are close-up shots of the skin lesion. We have a very limited view of the patient; the primary piece of information we can extract is skin tone. This makes determining race extremely difficult, as skin tone doesn't map cleanly to race (i.e., the range of skin tones within a single race is very wide, and multiple races can share the same skin tone).

Skin Tone as a Proxy Variable

With the information available, race is hard to measure. Instead of trying to estimate race, which might result in errors, we can use skin tone, which is much easier to measure, in lieu of race. It isn't a perfect substitution, but it is a more feasible, precise variable. In this context, skin tone serves as a proxy variable.

Furthermore, we can automate our skin tone annotation by leveraging computer vision to analyze the pixels. A simple approach is to take an average of all the pixel values in the image, computed separately for each of the three color channels. This is straightforward with the NumPy library's mean function. The result reflects the average color in the image, representing the overall skin tone. Below are the results of this approach on the sample images (top is the image and the bottom is the simple average color).

Created by Nabib Ahmed
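As a sketch of the averaging step (the helper name is ours, and we assume images have already been loaded as NumPy arrays), the simple average is a single call to NumPy's mean over the height and width axes:

```python
import numpy as np

def simple_average_color(img):
    """img: an H x W x 3 RGB array.
    Averaging over the height and width axes leaves one
    mean value per color channel: the image's average color."""
    return img.astype(np.float64).mean(axis=(0, 1))
```

Rounding the result back to integers gives an RGB triple that can be rendered as a solid color swatch, as in the figure above.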

A big issue with this approach however is that the skin lesion might bias the skin tone. Some skin lesions can be a very different color from the person’s skin tone. For example, in the second image from above, notice the skin lesion is much darker than the adjacent skin around it. Since the skin lesion isn’t representative of the skin tone, we would want to isolate it from our average.

Refined Skin Tone Measurement

For a more accurate skin tone, we want to exclude the skin lesion. According to the dataset specification, the researchers who created the dataset attempted to center each skin lesion in its image. This means we should not average the center-most pixels, as they are part of the skin lesion. But as the sample images above show, the size of the skin lesion isn't consistent. How should we address this problem?

A naive approach is to manually define the area to include for each image, but we run into the same issues described earlier with manually annotating race. A more algorithmic method is edge analysis, where we look for sharp changes in the image that represent the boundary between the skin lesion and normal skin. However, this approach introduces new problems of its own: how strong must a change be to count as an edge? What happens when the skin lesion is very similar in color and pattern to the surrounding skin? This approach may end up opening more problems than it closes.

Ultimately, we need to make a trade-off between precision and feasibility. The two methods proposed can be very precise but require a lot of implementation work and complexity. For this article, we will trade some precision for feasibility.

Our method will be to sample some images at random, manually determine how big the skin lesion is in each (i.e., eye-balling it), and then use the average value to decide what to exclude. Diving deeper into determining the skin lesion size: we know the lesion is centered in the image, and we'll assume it can be captured in a circular region. Thus, our exercise becomes finding the radius of a circle big enough to capture the skin lesion. We'll also add some slack (also eye-balled) and make the region a little bigger than the skin lesion. See the graphic below:

Created by Nabib Ahmed

The region outside the circle represents normal skin and is what we average to find skin tone. The radius to use is the average of the radii from our randomly selected samples. For this article, we'll use a sample of the above four images (the average radius comes to 197.5 pixels; since we're working with pixels, which are integers, we'll round down to 197). Employing this strategy, we get these results on the sample images (top is the image and the bottom is the refined average color):

Created by Nabib Ahmed
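A minimal sketch of this refined average, assuming the lesion sits at the image center and using the 197-pixel radius derived above (the function name and mask construction are our own):

```python
import numpy as np

def refined_average_color(img, radius=197):
    """img: an H x W x 3 RGB array with the lesion centered.
    Build a boolean mask selecting pixels OUTSIDE a centered
    circle of the given radius, then average only those pixels."""
    h, w, _ = img.shape
    yy, xx = np.ogrid[:h, :w]  # broadcastable row/column index grids
    outside = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 > radius ** 2
    # img[outside] flattens the kept pixels to an N x 3 array
    return img[outside].astype(np.float64).mean(axis=0)
```

The slack discussed above is absorbed into the radius itself; a larger `radius` excludes more of the center at the cost of averaging fewer pixels.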

Since it may be hard to visually distinguish the simple and refined average colors, here are the exact RGB values from each process for the sample images (from left to right):

  • Leftmost: simple average RGB values were (207, 166, 142) and refined average RGB values were (207, 173, 150)
  • Second leftmost: simple average RGB values were (186, 130, 137) and refined average RGB values were (215, 159, 172)
  • Second rightmost: simple average RGB values were (208, 111, 105) and refined average RGB values were (212, 120, 115)
  • Rightmost: simple average RGB values were (246, 178, 191) and refined average RGB values were (247, 180, 193)

Notice that the second-leftmost image, the one we pointed out earlier as having a lesion much darker than the adjacent skin, had the biggest change in RGB values.

Let’s discuss the strengths and limitations of this approach. Its strengths are that it’s quick and easy to implement, involves much less manual work, and is tunable, meaning we can experiment with parameters (e.g., the number of images to sample and the amount of slack to add) to potentially improve results. Its limitations lie in the correctness of the algorithm: the circular region we define might not adequately capture every skin lesion (consider oval-shaped lesions, very large lesions, or images with flaring).

Other sources of bias in our approach are illumination and other skin features. Images might have different levels of lighting, which affects the skin tone average. Additionally, some patients might have additional skin spots or hair. For example:

Created by Nabib Ahmed

From left to right,

  • The first image has a lesion much bigger than the average radius we calculated; in fact, it is so big that our method of bounding it in a circle wouldn’t work
  • The second image has a shadow on the edges which may skew to a darker skin tone compared to the true skin tone
  • The third image has flaring / dark rings at the edges and corners as well as hair that may skew to a darker skin tone compared to the true skin tone
  • The fourth image contains some foreign debris (what appears to be clear wax)

To address these issues, the trade-off is a more manual cleaning / inspection process. For this article, we’ll stick with the results we’ve achieved so far with this refined average.

Skin Tone Analysis

With the skin tone methodology determined, we can now look at its distribution. For each image in HAM10000, we’ll find the refined average color. Then, we’ll plot a histogram of each color channel (red, green, and blue); the value ranges where the channels are most concentrated will indicate the most common skin tones.

Created by Nabib Ahmed
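The histogram computation itself is simple once the per-image refined averages are collected into one array (a sketch; `channel_histograms` is a hypothetical helper, and the bin count is a free parameter):

```python
import numpy as np

def channel_histograms(avg_colors, bins=32):
    """avg_colors: an N x 3 array, one refined average color per image.
    Returns {channel name: (counts, bin_edges)} over the 0-255 range."""
    return {name: np.histogram(avg_colors[:, i], bins=bins, range=(0, 255))
            for i, name in enumerate(["red", "green", "blue"])}
```

Each (counts, bin_edges) pair can be drawn with matplotlib (e.g., its bar or stairs functions) to produce overlaid histograms like the one above.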

From the above histogram, we see that all three color channels have unimodal distributions. This implies that one skin tone dominates the dataset. If all skin tones were equally represented, we’d expect the histograms to be multi-modal or uniform across the pixel values. Instead, the red channel’s frequency peaks roughly between 180 and 225, while the green and blue channels both peak roughly between 135 and 165.

To get intuition on what skin tones are most represented in HAM10000, we can generate sample skin tones based on the distribution we see above. Below are the results:

Created by Nabib Ahmed
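One way to generate such samples (a sketch; resampling each channel independently from the empirical averages is our assumption about the procedure, not the only option):

```python
import numpy as np

def sample_skin_tones(avg_colors, n=5, seed=None):
    """avg_colors: an N x 3 array of per-image refined average colors.
    Draw n synthetic skin tones by resampling each channel
    independently from its empirical distribution."""
    rng = np.random.default_rng(seed)
    channels = [rng.choice(avg_colors[:, i], size=n) for i in range(3)]
    return np.stack(channels, axis=1).round().astype(np.uint8)
```

Each returned row is an RGB triple that can be rendered as a swatch, as in the figure above.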

The above samples indicate that the most common skin tone in the dataset is a light, reddish tone. Combining this with our general knowledge of the data and our exploration of the images, this makes sense, as the skin adjacent to a lesion is often inflamed (and thus reddish). Furthermore, from our research into the data collection process, we know the data comes from patients in Vienna, Austria and in Australia, two populations that are relatively homogeneous in skin tone.

Conclusion

In this article, we set out to determine patient race, since race can be an important attribute for skin cancer. Determining race directly would be very difficult, so we instead focused on skin tone, a proxy variable for race. We measured skin tone by averaging the area around the lesion, using random sampling to determine the size of the region to exclude. Our results show that one skin tone dominates the dataset, which matches our intuition from the data exploration and data collection phases.
