BeautyExtract: a data product to empower customized skincare recommendations

Background: During my fellowship at Insight Data Science, I took on a consulting project with a skincare startup called Proven. Proven’s cofounders discovered that 55% of people regret their off-the-shelf skincare purchases, so they came up with the idea of providing customized skincare regimens to fill the gap.

The Project: Proven’s goal is to create customized formulas targeting different skin concerns. They needed help with product and market research so that they could find niches to better serve the dissatisfied customers. That’s where I came in — helping them extract insights from user reviews and ingredients on skincare products currently available in the market.

Data: The data I am working with consists of 420,000 reviews of 2,500 unique skincare products from Sephora. It is a mixture of structured and unstructured datasets.

Sephora dataset

Amongst all the data, product ingredients and review texts strike me as the two gold mines. There is potential to unlock a ton of value from the seemingly messy and text-heavy data. Different brands list their product ingredients on Sephora in different formats. Sometimes the actual ingredient list is buried underneath heavy marketing language (see example on the left), which makes cleaning the data a fun challenge.

The Goals: Having no background in formulation chemistry, my immediate question about the ingredients data was: why is water the first thing on almost every list? What else is common amongst all the products? Thanks to the chemists in my Insight cohort, I’ve learned that there are quite a few common base chemicals in skincare products, namely binders, emulsifiers and preservatives. What Proven is interested in are the active ingredients that address someone’s skin concerns. Therefore, my goal #1 is to extract only the active ingredients from all the skincare products available in the dataset.

Proven also expressed their desire to extract sentiments from the user reviews. My goal #2 is then to help them uncover why users rave about or hate a product.

Approach to Goal #1: My solution to this problem is to apply a topic model to the ingredients data. Here, I’m following a fairly standard topic modelling pipeline:

Topic Model Pipeline
  1. Structure the corpus to be a list of documents, where each document is a string of ingredients for one product.
  2. Apply NLP techniques (tokenizing, stemming, removing stop words) to preprocess the data.
  3. Convert the processed data to a matrix of tf-idf features — tf-idf helps with stripping out the common ingredients in the corpus by assigning them lower weights.
  4. A Non-Negative Matrix Factorization (NMF) model is then applied to the tf-idf matrix. NMF is a linear-algebra based algorithm that extracts topics by producing two non-negative matrices (a feature matrix and a weight matrix) that, when multiplied together, approximately reproduce the original tf-idf matrix with the lowest error.
  5. Model validation is a known challenge in topic modelling in both industry and academia. It requires a lot of hand holding in tuning the parameters as well as evaluating the model output. Ultimately, the success of a topic model is measured by human interpretability. As with other unsupervised learning methods, a good topic model needs to provide good insights.
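The steps above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the code from the project: the ingredient lists are made up, the helper names (`tfidf`, `nmf`, `frob_error`) are hypothetical, the tf-idf uses a smoothed idf, and the factorization uses the classic multiplicative-update rule. In practice a library implementation would likely be a better choice.

```python
import math
import random
from collections import Counter

# Toy ingredient lists (made up for illustration, not from the Sephora data).
docs = [
    "water glycerin salicylic acid glycolic acid",
    "water glycerin glycolic acid salicylic acid phenoxyethanol",
    "water glycerin kaolin charcoal clay",
    "water phenoxyethanol kaolin clay charcoal",
]

def tfidf(docs):
    """Return (vocab, matrix) with one tf-idf row per document."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed idf: terms found in every document (e.g. water) get the lowest weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[t] * idf[t] for t in vocab])
    return vocab, rows

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frob_error(v, w, h):
    """Frobenius norm of V - WH, i.e. the reconstruction error."""
    wh = matmul(w, h)
    return math.sqrt(sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)))

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into non-negative W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(k)] for _ in v]
    h = [[rng.random() for _ in v[0]] for _ in range(k)]
    for _ in range(iters):
        # Multiplicative update: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
             for i in range(k)]
        # Multiplicative update: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(len(w))]
    return w, h

vocab, v = tfidf(docs)
w, h = nmf(v, k=2)

# Each row of H is a topic; print its three highest-weighted ingredients.
for topic in h:
    top = sorted(zip(topic, vocab), reverse=True)[:3]
    print([term for _, term in top])
```

Each row of `h` plays the role of one topic, and each row of `w` says how strongly a product loads on each topic. The number of topics `k` is the main knob to tune, and the printed top terms are what a human would inspect to judge interpretability.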

Results to Goal #1: When I set the number of topics to 15, the model produced the most interpretable results. The majority of the 15 topics make a lot of sense in that the words in each topic are highly related to each other. Here are 4 examples of the topics. Can you spot the similarities across the words?

Thanks to the brilliant chemists in my cohort, I was able to hand-label all of them with meaningful names. Here in Topic 13, I have glycolic, salicylic and benzoate, which are acids that treat acne. In Topic 15, I have charcoal, clay and kaolin, which are hygroscopic (meaning “absorbing water from air”) and therefore good candidates for moisturizers and hydrating serums.

Approach to Goal #2: I then apply similar NLP techniques to the user review texts. I first use n-grams to break the sentences down into n-word phrases. I then apply tf-idf to surface frequently mentioned phrases in the corpus. Again, the goal here is to provide Proven with market insights and find out why people love a product and what the common complaints are.
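As a rough illustration of the n-gram step, here is a minimal sketch that splits reviews into n-word phrases and tallies them per star rating. It rests on several assumptions: the reviews are made up, the function names `ngrams` and `top_phrases` are hypothetical, and raw counts stand in for the tf-idf weighting described above to keep the example short.

```python
from collections import Counter

# Made-up (star rating, text) pairs for illustration; not real Sephora reviews.
reviews = [
    (5, "Love love love this cream, a little goes a long way"),
    (5, "Leaves my skin soft and a little goes a long way"),
    (1, "I really wanted to love this but it broke me out"),
    (1, "Had high hopes and really wanted to love it"),
]

def ngrams(text, n):
    """Lowercase the text, strip basic punctuation, and return its n-word phrases."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_phrases(reviews, stars, n=3, k=3):
    """Count n-grams across all reviews with the given rating; return the k most common."""
    counts = Counter()
    for rating, text in reviews:
        if rating == stars:
            counts.update(ngrams(text, n))
    return counts.most_common(k)

print(top_phrases(reviews, stars=5, n=5))
print(top_phrases(reviews, stars=1, n=4))
```

Running the counts separately for 5-star and 1-star reviews is what makes the two groups comparable side by side.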

Results to Goal #2: I’m using D3 to plot the sentiment frequency for 5-star and 1-star reviews separately. Amongst the 5-star reviews, consumers “love love love” a product because “little go a long way” and it “leaves their skin soft”.

Amongst the 1-star reviews, lots of people mentioned that they “really wanted to love” the product and had “high hopes”, but the product failed to live up to their expectations.

The key takeaways for Proven are:

  • Design products with good value to price ratio in mind
  • Do not over promise consumers what their products can do

After all, one-star reviews could cost them significant revenue:

Conclusion: Overall, I delivered a data product to Proven that includes:

  • A topic model that extracts and summarizes active ingredients to help with their product research and engineering
  • A sentiment analysis tool that allows Proven to find niches to serve unsatisfied customers and generate marketing insights

The founder of Proven is pleased with my work and plans to feed the textual features I extracted into her predictive model. With Proven being accepted into Y Combinator, I’m excited to see my code going into production and moving the company forward.

About Me: Prior to Insight, I spent 4 years in the finance industry building statistical and machine learning models on stock market time series. I chose this project to expand my knowledge of NLP and unsupervised learning. It turned out to be a fun and rewarding challenge. I’m currently learning to use textual data as features in predictive modeling. More to come!