Powering inclusive search & recommendations with our new visual skin tone model
Nadia Fawaz | Research Scientist & Tech Lead, Applied Science, Bhawna Juneja | Software Engineer, Search Quality, David Xue | Software Engineer, Visual Search
To truly bring everyone the inspiration to create a life they love, Pinterest is committed to content diversity and to developing inclusive search and recommendation engines. A top request we hear from Pinners is that they want to feel represented in the product, which is why we built our first version of skin tone ranges, an inclusive search feature, in 2018. We’re proud to introduce the latest version of skin tone ranges, a newly built in-house technology. These new skin tone ranges are paving the way for more inclusive inspirations to be recommended in search, as well as in our augmented reality technology, Try on, and are driving initiatives for more diverse recommendations across the platform.
Developing more inclusive skin tone ranges
Estimating the skin tone range in an image is a complex challenge for computer vision systems, given the impact of shadows, varied lighting, and a variety of other confounding factors. Developing inclusive skin tone ranges required an end-to-end iterative process to build, evaluate, and improve performance over several versions. While qualitative evaluation could help reveal issues, making progress required measuring performance gaps across skin tone ranges and understanding the error patterns for each range.
Starting with diverse data
We labeled a diverse set of beauty images covering a wide range of skin tones to evaluate the system performance during development. Measuring performance is important to assess progress; however, coarse aggregate metrics over the entire dataset, such as accuracy, are not sufficient, because aggregation may hide performance discrepancies between skin tone ranges. To quantify performance biases, we went beyond overall aggregates and computed granular metrics per skin tone range, including precision, recall, and F1-score. Per-range metrics would show whether errors disproportionately affected some ranges. We also used confusion matrices to analyze error patterns for each range. The matrices would reveal if a model failed to predict a skin tone for images in a range, leading to a very low recall and F1-score for that range, or if it failed to distinguish images from different ranges and misclassified them, impacting recall and precision for several ranges, as in the examples below.
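As a minimal sketch of this kind of per-range evaluation, the snippet below computes precision, recall, F1, and a confusion matrix per class with scikit-learn. The range names and the labels are synthetic placeholders, not Pinterest data; the point is that per-class metrics surface gaps that a single accuracy number would hide.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

RANGES = ["range_1", "range_2", "range_3", "range_4"]  # hypothetical range names

# Synthetic ground-truth and predicted labels for a labeled evaluation set.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 3, 2])
y_pred = np.array([0, 1, 1, 1, 2, 3, 3, 2, 3, 2])

# Per-range precision, recall, and F1 reveal gaps that overall accuracy hides.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=range(len(RANGES)), zero_division=0
)
for name, p, r, f, s in zip(RANGES, precision, recall, f1, support):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f} n={s}")

# The confusion matrix shows which ranges are confused with which:
# rows are true ranges, columns are predicted ranges.
cm = confusion_matrix(y_true, y_pred, labels=range(len(RANGES)))
print(cm)
```

A row of the confusion matrix concentrated off the diagonal flags a range whose images are systematically misclassified into neighboring ranges.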
To understand the root causes of issues, we performed an error analysis of the components of the skin tone system based on their outputs. At a high level, a skin tone system may include:
- a detection model that attempts to determine the presence and location of a face in a beauty image, but does not attempt to recognize an individual person’s face
- a color extraction module
- a scorer and thresholder to estimate the skin tone range
Analyzing the score distributions per skin tone range over the diverse dataset can show whether the distributions are separable or overlap, and whether the thresholds are misaligned with the diverse data, as in the example above. Both issues can be amplified by color extraction failures in challenging lighting conditions. Studying face detection errors can reveal whether the model fails to detect faces in beauty images with darker skin tones at significantly higher rates than in images with lighter skin tones, which would prevent the system from generating a skin tone range for those images. This type of bias in face detection models can carry over to the skin tone system, and no amount of downstream post-processing for fairness on the system's output can correct such upstream bias. Biases in face detection have been analyzed previously in the Gender Shades study by Joy Buolamwini and Timnit Gebru. Requiring face detection to predict skin tone also limits the scope of the system, as it cannot handle images of other body parts, such as manicured hands, and it adds to overall system latency and scalability challenges.
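A simple way to quantify the overlap-versus-threshold issue described above is to measure what fraction of each range's scores falls on the wrong side of a decision boundary. The range names, score distributions, and threshold below are synthetic illustrations, not the production system's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-range score samples standing in for the diverse evaluation set.
scores_by_range = {
    "lighter": rng.normal(45.0, 8.0, 500),
    "medium":  rng.normal(30.0, 8.0, 500),
}
threshold = 38.0  # hypothetical decision boundary between the two ranges

# Fraction of each range's scores landing on the wrong side of the boundary:
# large values mean the distributions overlap or the threshold is misaligned.
lighter_miss = float(np.mean(scores_by_range["lighter"] < threshold))
medium_miss = float(np.mean(scores_by_range["medium"] >= threshold))
print(f"misclassified: lighter={lighter_miss:.2f}, medium={medium_miss:.2f}")
```

Sweeping the threshold and plotting these miss rates per range is one way to see whether a boundary can be placed fairly, or whether the scores themselves are not separable.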
Through analysis, we reached the conclusion that to improve fairness in performance across all skin tone ranges, we needed to build an end-to-end system with bias mitigation.
Developing new skin tone ranges by mitigating biases
Visual skin tone ranges V1: Mitigating bias
We developed the new visual skin tone v1 ranges based on visual input and focused on:
- mitigating biases so that skin tone prediction performs well across all ranges
- creating a signal that doesn't require a full front-facing face and also works for partial faces or other body parts
- extending to applications beyond beauty, such as fashion
- leveraging this more reliable signal as a building block to improve fairness and reduce potential bias in other ML models
The visual skin tone v1 leverages several computer vision techniques to estimate the skin tone range in a beauty image. After exposure correction, a face detection model identifies the face area and landmarks corresponding to facial features such as the eyes, eyebrows, nose, mouth, and face edge. This face detection model has better coverage on images with darker skin tones. Some facial features, such as the eyes and lips, are then cropped out, and binary erosion is applied to remove hair and edge noise and finally produce a face skin mask. If face detection fails to identify a face in the image, for example in images of other body parts, Hue Saturation Value (HSV) processing attempts to locate skin pixels and produces a skin mask. The color extraction module then estimates a dominant color based on the RGB distribution of the skin mask pixels. The dominant color is converted to the LAB space, and the individual typology angle (ITA) is computed as a nonlinear function of the L and B coordinates. The resulting ITA scores are more separable across ranges. Using a diverse dataset of images, fairness-aware tuning is performed on the ITA scores to produce a skin tone prediction while mitigating performance biases between ranges.
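The ITA step can be sketched as follows. The arctangent formula is the standard ITA definition over the LAB lightness and yellow-blue coordinates; the four-range thresholds here are illustrative placeholders, not Pinterest's fairness-tuned boundaries.

```python
import math

def ita_degrees(L: float, b: float) -> float:
    """Individual typology angle (degrees) from CIELAB L (lightness) and b.

    Standard definition: ITA = arctan((L - 50) / b) * 180 / pi.
    atan2 is used so b == 0 does not divide by zero.
    """
    return math.degrees(math.atan2(L - 50.0, b))

# Illustrative thresholds mapping ITA to four ranges (0 = lightest, 3 = darkest).
# The post describes fairness-aware tuning of such boundaries on diverse data;
# these specific cut points are assumptions for the sketch.
def ita_to_range(ita: float) -> int:
    if ita > 41.0:
        return 0
    if ita > 28.0:
        return 1
    if ita > 10.0:
        return 2
    return 3

print(ita_to_range(ita_degrees(70.0, 10.0)))  # high L, high ITA -> lightest range
print(ita_to_range(ita_degrees(30.0, 15.0)))  # low L, negative ITA -> darkest range
```

Higher lightness L pushes ITA up toward the lighter ranges, while lower L (and larger b) pushes it down, which is why well-placed thresholds on ITA can separate the ranges.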
Evaluation of the visual skin tone v1 on the diverse set of beauty Pins showed ~3x higher accuracy on the predicted skin tone. Moreover, per range precision, recall and F1-score metrics increased for all ranges. We observed ~10x higher recall and ~6x higher F1-score on darker skin tones. The new model reduced biases in performance across skin tone ranges, and led to a major increase in coverage of skin tone ranges for billions of images in our beauty, women’s and men’s fashion corpora.
Beyond offline evaluation, having humans in the loop can significantly improve performance by integrating feedback from human evaluation, users and communities. For instance, we conducted several rounds of qualitative review and annotation of the skin tone inference results on diverse images to identify new error patterns and inform training data collection and modeling choices, as we iterated on the model. We also leveraged side-by-side comparisons of results in inclusive bug bashes with a diverse group of participants. Regular quantitative and qualitative evaluations help improve quality over time. In production, we ran experiments to evaluate the new skin tone v1, and built dashboards to monitor the diversity of content served.
Visual skin tone ranges V2: Keep learning
While iterating on skin tone v1, we first focused on getting the simpler cases right, such as front-facing faces in beauty portrait images. As we later expanded to broader cases, including rotated faces, different lighting conditions, occlusions such as facial hair, sunglasses, and face masks, and other body parts, and integrated more images from diverse communities, we learned from the errors of skin tone v1 to develop a more robust skin tone v2. We worked closely with designers to iterate and develop clear labeling guidelines for tens of thousands of images. Iterating on the model and the collection of its training and evaluation data, while actively integrating learnings from earlier versions, allowed the model to improve over time. This helped expand its application beyond beauty images to the broader context of fashion.
The need to handle more complex images led us to move away from face detection, and to take a new approach for skin tone v2 based on an end-to-end CNN model over the raw images. We first trained a ResNet model to learn skin tone from a more diverse set of images from beauty and fashion, including v1 error cases. This model outperformed v1 when evaluated on larger, more challenging data. We then considered adding skin tone prediction as a new jointly trained head in the multi-task Unified Embedding model. This approach led to further performance improvements, but at the cost of increased complexity and coupling with the multi-head development and release schedule. Eventually, we used the 2048-dimensional binarized Unified Embedding as input to a multilayer perceptron (MLP), trained using dropout and a softmax with cross-entropy loss to predict skin tone ranges. This led to significant performance enhancements for all ranges, benefiting from the information captured in our existing embedding while requiring far less computation.
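A minimal PyTorch sketch of such an MLP head is shown below. The input and output dimensions follow the post (2048-dimensional binarized embedding, four ranges), but the hidden size, dropout rate, and number of layers are assumptions, not Pinterest's actual architecture.

```python
import torch
import torch.nn as nn

EMB_DIM = 2048   # binarized Unified Embedding dimension (from the post)
NUM_RANGES = 4   # four skin tone range palettes (from the post)

class SkinToneHead(nn.Module):
    """MLP over a precomputed embedding; hidden=256 and p=0.5 are assumptions."""

    def __init__(self, emb_dim=EMB_DIM, hidden=256, num_ranges=NUM_RANGES, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p),                 # dropout regularization, as in the post
            nn.Linear(hidden, num_ranges),
        )

    def forward(self, x):
        # Returns logits; nn.CrossEntropyLoss applies the softmax internally.
        return self.net(x)

model = SkinToneHead()
emb = torch.randint(0, 2, (8, EMB_DIM)).float()   # a batch of binarized embeddings
labels = torch.randint(0, NUM_RANGES, (8,))       # synthetic range labels
loss = nn.CrossEntropyLoss()(model(emb), labels)  # softmax + cross-entropy
```

Because the heavy lifting already happened when the Unified Embedding was computed, this head is cheap to train and to run in batch, which is the trade-off the post describes.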
Productionizing visual skin tone at scale
To productionize skin tone v1 for billions of beauty and fashion images, we first identified which Pin images were relevant for skin tone prediction. We leveraged several Pinterest signals, such as Pin2Interest to gather beauty and fashion content and our embedding-based visual Image Style and Shopping Style signals, to filter out irrelevant Pins, like product images, which helped with scale and precision by narrowing the image corpus.
To generate skin tone ranges for existing and new images for skin tone v1, we used our GPU-enabled C++ service for image-based models, which supports both real-time online extraction and offline extraction in two stages: an ad hoc backfill and a scheduled incremental workflow.
For visual skin tone v2, our embedding-based feature extractor uses the pre-computed Unified Embedding as input features to the MLP. This approach runs on Spark and CPU Hadoop clusters to significantly speed up skin tone classification in a cost-effective manner. Because it never has to process the image pixels, the embedding-based approach reduces the time needed to compute the backfill for billions of Pin images from nearly a week to under an hour.
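One reason a backfill over precomputed binarized embeddings is so cheap is storage: each dimension is a single bit, so a 2048-dimensional embedding packs into 256 bytes per Pin and can be streamed and scored in bulk without fetching or decoding any image. The sketch below illustrates this with NumPy; the embeddings and weights are synthetic stand-ins.

```python
import numpy as np

EMB_DIM = 2048  # binarized Unified Embedding dimension (from the post)
rng = np.random.default_rng(0)

# Synthetic binarized embeddings for a batch of Pins: 2048 bits -> 256 bytes each.
bits = (rng.random((1000, EMB_DIM)) > 0.5).astype(np.uint8)
packed = np.packbits(bits, axis=1)  # compact on-disk/in-flight representation

# Offline batch scoring on CPU: unpack and apply the first MLP layer.
# W1 is a placeholder weight matrix, not trained parameters.
W1 = (rng.standard_normal((EMB_DIM, 256)) * 0.01).astype(np.float32)
unpacked = np.unpackbits(packed, axis=1).astype(np.float32)
hidden = np.maximum(unpacked @ W1, 0.0)  # ReLU activation, no GPU or pixels needed
```

Distributing exactly this kind of unpack-and-matmul step across a Spark job is what turns a multi-day pixel-based backfill into a sub-hour embedding-based one.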
Improving skin tone ranges in search for global audiences
Skin tone ranges provide Pinners the option to filter beauty results by a skin tone range of their choice, represented by four palettes. The improved skin tone models gave us the confidence to make skin tone ranges more prominent in the product and launch internationally in search.
Deploying the new skin tone v1 for beauty search queries first required indexing the skin tone signal as a discrete feature among four ranges, along with the prediction method (face detection or HSV processing). To evaluate skin tone v1 in search, we first gathered qualitative feedback from a diverse set of internal participants and then launched an experiment to assess online performance at scale. The internal evaluation and the experiment analysis showed a clear improvement in precision and recall for the new model. The model was more accurate at classifying Pin images into their respective skin tone ranges, especially the darker ranges, leading to large gains in precision and coverage in search results. We also noticed that skin tone range adoption rates in English-speaking countries were comparable to the U.S., and both increased with the combined launches of the redesigned skin tone range UI and the new skin tone range model.
Skin tone ranges in similar looks for AR Try on
Try on was developed with inclusion in mind at the outset of Pinterest AR, supported by visual skin tone v1. The Similar Looks module in the AR Try on for lipstick experience allows users to discover makeup looks with similar lip styles. By integrating skin tone ranges in Similar Looks, users can filter inspiration looks by a skin tone range of their choice.
To build Similar Looks, the makeup parameters of a beauty Pin are estimated by DNN models trained on a high-quality, human-curated, diverse set of tens of thousands of beauty images spanning a wide range of skin tones. First, an embedding-based DNN classifier for the Try-On Taxonomy of Image Style is trained with PyTorch using the Unified Embedding as input. Lipstick parameter extraction is performed using a cascade consisting of a face detector, landmark detector, and DNN-based parameter regressor. The visual skin tone v1 is indexed and combined with a lightweight approach to retrieve Makeup Look Pins in the selected skin tone range with lipstick parameters most similar to the color of the query makeup product in perceptual color space. Together these components form a new kind of visual discovery experience for makeup try-on, connecting individual products to an inspirational and diverse set of beauty Pins.
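One simple form of retrieval by perceptual color distance is CIE76 Delta E, which is Euclidean distance in LAB space. The sketch below ranks hypothetical candidate looks by color similarity to a query lipstick; the Pin ids and LAB colors are made up for illustration, and the production system may use a different color-difference formula.

```python
import numpy as np

def delta_e76(lab1, lab2) -> float:
    """CIE76 color difference: Euclidean distance between two LAB colors."""
    return float(np.linalg.norm(np.asarray(lab1, float) - np.asarray(lab2, float)))

# Hypothetical candidates: Pin id -> extracted lipstick color in LAB,
# pre-filtered to the Pinner-selected skin tone range.
candidates = {
    "pin_a": (45.0, 60.0, 30.0),
    "pin_b": (52.0, 70.0, 45.0),
    "pin_c": (80.0, 5.0, 10.0),
}
query = (48.0, 62.0, 33.0)  # LAB color of the query makeup product

# Rank the candidate looks by perceptual color distance to the query.
ranked = sorted(candidates, key=lambda pid: delta_e76(candidates[pid], query))
print(ranked)
```

Because LAB is designed so that equal geometric distances roughly match equal perceived color differences, a nearest-neighbor lookup in this space returns looks whose lipstick shade actually looks similar to the query product.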
Content diversity understanding and diversification
Leveraging diversity signals such as skin tone helps us analyze and understand the diversity of our content, as well as how it is surfaced and engaged with. With skin tone v1, we quadrupled our skin tone range coverage of beauty and fashion content. [Source: Pinterest Internal Data, April 2020] Our skin tone signal is now 3x as likely to detect multiple skin tone ranges in the top search results [Pinterest Internal data, July 2020], allowing more accurate measurements of the diversity of content served. Such analysis can help inform work around diversification of content inventory and its distribution on Pinterest.
The road ahead
Through our experience developing skin tone ranges and integrating them in our search and AR Try on products, we learned the importance of building ML systems with inclusion by design and respect for user privacy at the heart of technical choices. In a multi-disciplinary collaboration between engineering and teams spanning many organizations, we are building on this foundation to further improve skin tone ranges, develop diversity signals, diversify search results and recommendations in various surfaces, and expand the inclusive product experience to more content and domains globally.
This work is the result of a cross-functional collaboration between many teams. Many thanks to Josh Beal, Laksh Bhasin, Lulu Cheng, Nadia Fawaz, Angela Guo, Edmarc Hedrick, Emma Herold, Ryan James, Nancy Jeng, Bhawna Juneja, Dmitry Kislyuk, Molly Marriner, Candice Morgan, Monica Pangilinan, Seth Dong Huk Park, Zhdan Philippov, Rajat Raina, Chuck Rosenberg, Marta Scotto, Annie Ta, Michael Tran, Eric Tzeng, David Xue.