Automating Product Categorization

Shubham Kumar Jain
CODE + CONTOUR by IPSY
9 min read · Jul 15, 2021

Contribution from Matías Marenchino and James Faghmous

At BFA, we source thousands of beauty items across all our brands, and cataloging each item accurately is critical to empowering our members to express their unique beauty. Having accurate product metadata impacts several areas of our businesses, from sourcing to customer care to machine learning and artificial intelligence. While our members wait for their custom beauty products to arrive, our teams work to create a delightful experience with personalization at every touchpoint.

In our databases, we have thousands of products. Many of them have metadata like “product_category,” “ingredients,” “product_weight,” “product_dimensions,” and more. However, most of this metadata is entered manually, a time-consuming and error-prone approach. So we asked ourselves: Could we leverage artificial intelligence (AI) and machine learning (ML) to streamline this process?

One team tackled this question at our first-ever BFA Hackathon, and their winning entry went from idea to minimum viable product (MVP) in just 72 hours!

The Challenge: Data Consistency & Accuracy

To build the best personalization experience, we need to understand our products at the correct level of granularity to infer relationships between products and identify products that subscribers are likely to love. We face significant challenges in representing our data accurately and consistently across brands. Here are some challenges that are relevant to personalization and recommendation systems:

1. Inconsistent Historical Data

While we have product data for the 10 years we have been in operation, we face data consistency challenges such as data censoring, missing data, and human error. Since manually mending the data is virtually impossible, an automated approach would be extremely helpful in enhancing the predictive capabilities of our ML models. This follows a similar industry trend of automatically labeling data with tools such as Snorkel AI.

2. Human Inference of Product Metadata

To give some context, our products are first set up by our merchandising team. Each merchandising expert examines the product and “infers” the various product metadata to enter into the system. This process — reading about a product, identifying its category and any other metadata, and then entering it in a spreadsheet — takes roughly three minutes per product. Extrapolate that to hundreds or even thousands of products each month, and you have an extremely time-consuming, labor-intensive task — and that’s just the process for a single product. We also have combinations of products, bundles, and other offers where multiple products need to be reviewed and multiple decisions made about what metadata to enter. Needless to say, this approach can be tedious, inconsistent, and error-prone — imagine searching for lipstick and getting a face moisturizer in the results!

3. Heterogeneous Data Sources and Label Inconsistency

Our product data comes from different sources. Across our brands, we have multiple product databases that each contain different hierarchies of product category trees (plus other subtle differences), and this can lead to label inconsistencies within products — not to mention the discrepancies that can result from different people inputting labels. For example, one person may label “Self Tanner” as “Skin > Treatment” (this represents “root category > category”), while another person may label it as “Body > Sun” (view the product here). This increases the chances of label inconsistencies occurring, eventually affecting downstream operations.

The Solution: Automating Product Categorization

When prioritizing how to tackle these issues, we decided to focus on automating product categorization. The idea was not to entirely replace the people who manually perform the task, but to build a partial automation system where, 90–95% of the time, the machine learning model extracts the product categories automatically while requesting human input when uncertainty is high (e.g., when the probability of a product having a certain category is below a given threshold). In business terms, and looking at future scope, this approach could help with the following:

  1. Enhancing years of historical data, which could lead to improved personalization and higher overall subscriber satisfaction.
  2. Assisting the merchandising team in their daily tasks, allowing them to focus more on other projects.
  3. Creating a consistent set of product categories and metadata across all our brands, thus resulting in higher-quality data for all teams.

Data Gathering & Processing

We gathered all the product information from our data lake on the Databricks platform using Spark, then made a consolidated dataset with information like product brand, description, ingredients, and more.

To keep the initial iteration simple, we dropped all the non-string fields like product weight and dimensions (among others) and focused only on the StringType fields, building a proof of concept without getting overwhelmed by the underlying complexity of the task. Since many products had no description, we concatenated “product_description,” “brand_name,” and “product_ingredients” into a new field called “product_info,” then removed all other fields except the category label (which we wanted to predict), along with any product rows that had no “product_info.” We ended up with thousands of products with two fields: “product_info” and “category.”
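As a rough sketch, the consolidation step might look like this in PySpark (the table and column names are illustrative, not our actual schema):

```python
from pyspark.sql import functions as F

# On Databricks, a SparkSession is already available as `spark`.
products = spark.table("product_catalog")  # hypothetical source table

consolidated = (
    products
    # Keep only the StringType fields relevant to the first iteration.
    .select("product_description", "brand_name", "product_ingredients", "category")
    # concat_ws skips nulls, so one missing field doesn't null out the whole row.
    .withColumn(
        "product_info",
        F.concat_ws(" ", "product_description", "brand_name", "product_ingredients"),
    )
    # Drop rows where no text survived, then keep just the two fields we need.
    .filter(F.trim(F.col("product_info")) != "")
    .select("product_info", "category")
)
```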

Processed Dataset

Feature Engineering

Feature Engineering Pipeline

For feature engineering, we applied the usual natural language processing (NLP) suspects. We split the “product_info” into tokens, then normalized and lemmatized them (i.e., reduced them to their root words). We further removed any stop words and generated bi-grams. (Note: after looking into both bi-grams and tri-grams, we concluded that bi-grams would be enough to enhance the model’s predictive capability.) Finally, we computed the TF-IDF of the generated bi-grams. We saved this processing pipeline so it could be applied in exactly the same way during testing and actual inference, avoiding any inconsistency.
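A minimal sketch of such a pipeline in Spark ML follows. Lemmatization is not built into Spark ML, so it is omitted here; in practice it could come from a library like NLTK or Spark NLP via a UDF:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    RegexTokenizer, StopWordsRemover, NGram, CountVectorizer, IDF,
)

tokenizer = RegexTokenizer(inputCol="product_info", outputCol="tokens",
                           pattern="\\W+", toLowercase=True)  # split + normalize
stop_remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
bigrams = NGram(n=2, inputCol="filtered", outputCol="bigrams")
tf = CountVectorizer(inputCol="bigrams", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="features")

feature_pipeline = Pipeline(stages=[tokenizer, stop_remover, bigrams, tf, idf])

# Fit and apply; in practice you would fit on the training split only.
fitted = feature_pipeline.fit(consolidated)
featured_df = fitted.transform(consolidated)

# Persist the fitted pipeline (vocabulary, IDF weights, etc.) so testing
# and inference reuse exactly the same transformations.
fitted.write().overwrite().save("/models/feature_pipeline")
```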

Engineered features

For the “category,” we transformed the labels into numeric form using a StringIndexer (i.e., we assigned each label a number). Again, we saved the fitted indexer and its vocabulary to be reused at testing and inference time to avoid any inconsistency.
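A short sketch of the label encoding, continuing from the pipeline above (the save path is illustrative):

```python
from pyspark.ml.feature import StringIndexer

# Fit on the full dataset so every category gets a stable index.
indexer = StringIndexer(inputCol="category", outputCol="label")
indexer_model = indexer.fit(featured_df)
labeled_df = indexer_model.transform(featured_df)

# Persist the fitted indexer so the same label <-> index mapping is reused
# later; `indexer_model.labels` recovers the index-to-label mapping.
indexer_model.write().overwrite().save("/models/category_indexer")
```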

Model Training & Testing

Our goal was to have our model learn the categorization of each product without being biased towards any single category. This meant we needed equal representation of each category in the training, validation, and testing datasets, which we achieved with stratified random sampling instead of regular random sampling. The exception would be situations where some categories are more important than others (and thereby call for higher accuracy); in those cases, we would want the training dataset to mirror what our real-world data looks like. For the split, we proceeded with the usual 80–10–10% split for the training, validation, and testing datasets, respectively.
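A sketch of a stratified 80–10–10 split in PySpark, using sampleBy (which samples each stratum at approximately the given fraction) and exceptAll (Spark 2.4+) to take the exact complement:

```python
labels = [row["label"] for row in labeled_df.select("label").distinct().collect()]

# ~80% of each class for training.
train = labeled_df.stat.sampleBy("label", fractions={l: 0.8 for l in labels}, seed=42)
holdout = labeled_df.exceptAll(train)

# Split the remaining ~20% evenly into validation and testing.
valid = holdout.stat.sampleBy("label", fractions={l: 0.5 for l in labels}, seed=42)
test = holdout.exceptAll(valid)
```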

For the category inference, we had three different levels of category: a root category, a category, and a sub-category. The “category” is the middle, or second-level, classification of a product. For example, a product like “BeautyBlender” might have “Color” as its root category, “Complexion” as its category, and “Foundation” as its sub-category (view the product here). There were two ways of approaching this: The first was to merge all levels of the hierarchy and then predict the merged category (however, since this would exponentially increase the number of categories and make it harder for the model to learn, we discarded this approach). The second was to predict a lower-level category and then map it to the higher-level categories based on specific rules. Given our constraints, we went with the latter. So in our first iteration, we focused on predicting the category (among the 30 categories present) and then did the mapping afterwards.
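Conceptually, that rule-based roll-up is just a lookup table from the predicted category to its root. A sketch, with illustrative entries drawn from the examples in this post rather than our actual taxonomy:

```python
# Illustrative entries only: "Color > Complexion" and "Skin > Treatment"
# come from the examples mentioned above.
CATEGORY_TO_ROOT = {
    "Complexion": "Color",
    "Treatment": "Skin",
}

def roll_up(predicted_category: str) -> str:
    """Map a predicted mid-level category to its root category."""
    return CATEGORY_TO_ROOT.get(predicted_category, "Unknown")
```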

Since this is a multi-class classification problem, we chose logistic regression for our model: it estimates a probability for each class and picks the one with the highest probability, and it is simple, adapts well to multi-class classification, and is known to work well in this context. For the first iteration, we limited the model to predicting a single class rather than a set of classes, as could happen in the case of a product bundle. Though the same concept could be extended to a multi-class, multi-output classification model (where each product can be categorized into multiple categories instead of just one), the goal was to complete one iteration of a simple end-to-end model first and then improve upon it. We used grid search to tune the hyperparameters on the 10% validation dataset, with log loss as our final loss function.
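A simplified sketch of that loop in Spark ML: grid search over the regularization strength, scored by log loss on the validation split (the “logLoss” metric requires Spark 3.0+, and the grid values are illustrative):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="logLoss")

best_model, best_loss = None, float("inf")
for reg in [0.001, 0.01, 0.1, 1.0]:  # illustrative grid
    model = LogisticRegression(featuresCol="features", labelCol="label",
                               family="multinomial", regParam=reg).fit(train)
    loss = evaluator.evaluate(model.transform(valid))
    if loss < best_loss:
        best_model, best_loss = model, loss
```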

We then performed our tests on the 10% hold-out testing dataset and calculated the overall accuracy, precision, recall, and F1 score. These overall metrics alone might not give a realistic estimate of the model’s strength (it may predict well within certain categories but not others), so we also evaluated the class-level metrics.
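Both levels of evaluation can be computed in Spark; a sketch, assuming the model and splits from the snippets above:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

predictions = best_model.transform(test)

# Overall metrics.
for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    value = MulticlassClassificationEvaluator(
        labelCol="label", metricName=metric).evaluate(predictions)
    print(f"{metric}: {value:.3f}")

# Class-level metrics, which catch categories the overall numbers hide.
pred_and_label = predictions.select("prediction", "label").rdd.map(tuple)
mm = MulticlassMetrics(pred_and_label)
for label in sorted(set(pred_and_label.map(lambda x: x[1]).collect())):
    print(int(label), mm.precision(label), mm.recall(label), mm.fMeasure(label))
```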

Overall & class-level metrics

Looking at the above statistics, the model appeared to perform fairly well on most of the classes, save for a few — namely, precision and recall for classes 22 and 23. Why might this have happened? Although we conducted stratified sampling to preserve each class’s ratio across the training, validation, and testing datasets, the number of examples for those particular classes was quite small compared to other classes. Classes 22 and 23 (which correspond to “Body Sets” and “Nail Art”) had roughly 16 examples each, whereas classes like 0 (which corresponds to “Eyes”) had almost 1,000 examples. This imbalance left the model unable to learn those categories, resulting in poor performance. The solution would be to add more real-world data for those classes or to apply data augmentation techniques for NLP.
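For illustration, a simple augmentation in the spirit of the “Easy Data Augmentation” techniques (random swap and random deletion of tokens) might look like this; it is a sketch, not something we shipped:

```python
import random

def augment(tokens, n_swaps=2, p_delete=0.1):
    """Return a slightly perturbed copy of a token list (random swap + delete)."""
    tokens = tokens[:]
    if len(tokens) > 1:
        for _ in range(n_swaps):  # randomly swap two tokens
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
    kept = [t for t in tokens if random.random() > p_delete]  # randomly drop tokens
    return kept or tokens  # never return an empty example
```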

Another complexity we considered was the potential for typos in product descriptions, which could affect predictions at inference time. To address this, we used TextBlob (a Python library for processing textual data) to correct misspellings at inference time, applying it after stop-word removal but before lemmatization. The same correction could have been applied at training time, but doing so would have been more time- and resource-consuming.
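In code, the correction step is essentially a one-liner with TextBlob (corrections are best-effort):

```python
from textblob import TextBlob  # pip install textblob

def correct_spelling(text: str) -> str:
    """Return the text with TextBlob's best-guess spelling corrections."""
    return str(TextBlob(text).correct())
```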

Bringing It All Together

Once the model was trained and tested, we built a simple UI for others to use. We focused on serving the model in batch (e.g., inferring metadata for batches of products in our internal datasets). We created a basic HTML and CSS bootstrapped web application with Python’s FastAPI as the back-end to serve our internal customers. The API accepts batch CSV uploads and then triggers the corresponding Databricks notebooks, run either interactively or as jobs, via the Databricks APIs.
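A stripped-down sketch of such a back-end, assuming a pre-configured Databricks job wraps the inference notebook (the endpoint name, job ID, and staging path are illustrative; in production the CSV would be staged to DBFS or S3 rather than local disk):

```python
import os
import requests
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token
INFERENCE_JOB_ID = 123  # hypothetical job wrapping the inference notebook

@app.post("/categorize-batch")
async def categorize_batch(file: UploadFile = File(...)):
    # Stage the uploaded CSV where the notebook can read it (simplified here).
    csv_path = f"/tmp/{file.filename}"
    with open(csv_path, "wb") as f:
        f.write(await file.read())

    # Trigger the job via the Databricks Jobs API (POST /api/2.1/jobs/run-now).
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={"job_id": INFERENCE_JOB_ID,
              "notebook_params": {"input_csv": csv_path}},
    )
    resp.raise_for_status()
    return {"run_id": resp.json()["run_id"]}
```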

The Model Serving Web UI

Since the model was initially developed in the time-restricted environment of the Hackathon, parts of the system ran locally on our computers for the proof of concept (POC). The initial deployment of the POC was done on our internal server with some basic EC2 instances on AWS, with the ability to scale up as needed. We concluded that additional, periodic human monitoring would help ensure smooth operations. For the next steps, we will revisit aspects like throughput, server load, and continuous automated monitoring — and we look forward to serving the model in real-time, making it possible to infer metadata about any product on the fly and improve user engagement for a truly delightful member experience.
