Image-Based ML Techniques To Classify Billions Of E-Commerce Products Into Thousands Of Categories
In Criteo’s Universal Catalog team, we interact daily with billions of products to create one of the largest e-commerce catalogs worldwide: 25+ billion products. These products, provided by our e-commerce partners, have different data fields that we use to create enrichments: new product fields that standardize the given data and are reused by Criteo teams worldwide. An important enrichment is the classification of each product into a category.
Here, I will describe how I resolved the challenge of classifying e-commerce products into thousands of categories by using their main image. I chose a deep learning approach with TensorFlow on GPU, trained on a large labeled dataset of several million images.
At Criteo, we have several tens of thousands of e-commerce partners that provide us with their catalogs, for a total of 25+ billion products. These products are recommended to internet users through our online ads, according to their relevance for the user and for the ad campaigns of our e-commerce partners. To guarantee the quality of this recommendation, we need to standardize this set of heterogeneous catalogs. In particular, each product should be classified in its e-commerce category, and the set of e-commerce categories should be the same for all products, regardless of the original catalog. However, although each e-commerce partner provides categories for its products, these categories do not follow a common reference taxonomy. At Criteo, we use a classification provided by Google that is widely used in the e-commerce ecosystem: the Google Product Taxonomy. We use it only for retail products, as other techniques handle other types of products. We then re-classify the retail products in this taxonomy, whose tree structure can accommodate any retail product.
Above is a sample of the aforementioned Google Product Taxonomy. It is a tree structure that we have truncated so that leaf categories sit on the 4th level or higher. We have built a machine learning model that predicts, for each product, every category on the path down to its leaf category. We only use the predicted leaf category; the predicted parent categories serve for debugging. Knowing the predicted leaf category is enough to retrieve the path to the root category.
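To make the last point concrete, here is a minimal sketch of how a leaf category can be walked back to the root. The `PARENT` map and the `path_to_root` helper are hypothetical names, and the four categories are only an illustrative slice of a taxonomy, not Criteo's actual data structure.

```python
# Hypothetical child -> parent map for a tiny, illustrative slice of a
# category tree. The root category has no parent (None).
PARENT = {
    "Apparel & Accessories": None,
    "Clothing": "Apparel & Accessories",
    "Baby & Toddler Clothing": "Clothing",
    "Baby & Toddler Tops": "Baby & Toddler Clothing",
}


def path_to_root(leaf, parent=PARENT):
    """Return the categories from the root down to `leaf`."""
    path = []
    node = leaf
    while node is not None:
        if node in path:
            raise ValueError(f"cycle detected at category {node!r}")
        path.append(node)
        node = parent[node]  # raises KeyError for an unknown category
    return path[::-1]  # reverse so the path reads root -> leaf


print(" > ".join(path_to_root("Baby & Toddler Tops")))
```

Because every category has exactly one parent, storing only the leaf is enough: the parent levels can always be rebuilt on demand.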
The recomputed leaf category is called the “universal category”. If a catalog already uses this taxonomy, we re-compute it anyway. Once recomputed, all these catalogs make up a unique big e-commerce catalog (more than 25 billion products), named the “Universal Catalog”, that is used throughout the entire Criteo ecosystem.
This product classification problem arises in the Criteo context, but it is widespread in the e-commerce ecosystem: it appears whenever you want to merge catalogs from different sources that do not share a common format for the values of the category field.
Currently, in production, we recompute the universal category of each product with a machine learning model based on the textual features of these products. But we do not exploit the product image. Could we predict the “universal category” with good performance by using only the product image? For this ML problem, the feature is the single main image and the label to predict is the category to which the product belongs.
Here, I will describe how we resolved this challenge of classifying e-commerce products in these categories using their main image. It is a supervised ML problem targeting K classes, where K, the number of leaf categories in the truncated Google Product Taxonomy, is more than 3,000. To build this classifier, I chose to use deep learning, an approach that works well if you can build big training datasets, as was possible here.
In this supervised deep learning situation, I followed these steps to train and run my chosen deep learning models:
- Create large labeled datasets for training/validation and testing (feature = product image, label = product category) from millions of categorized e-commerce products, using distributed computing with Spark
- Choose deep models based on the state of the art in image-based deep learning
- Create a generic deep learning architecture to resolve this ML problem of classification
- Train each model on the training/validation dataset with TensorFlow 1.15, then TensorFlow 2+ on several GPUs
- Evaluate each model on the testing dataset with the accuracy and the Criteo business value, then analyze their scores with a comparison grid and multi-dimensional confusion matrices
Let’s jump into it!
Creating a dataset using distributed computing
At Criteo, we have already accumulated millions of products annotated with their target category. This is great, but these existing labeled datasets only contain links to images, not the images themselves! This means that, prior to the work described here, there was no image dataset available for partner products.
To check how many images were unavailable or too slow to download, I first downloaded a small sample of 10,000 products from the annotated data already available, which took 2 hours. I observed that 1% of the image links were invalid or missing, and that about 12% of the images could not be downloaded (the image no longer existed, the link redirected to the homepage, etc.).
The naive, sequential approach used to download 10,000 products was fine at that scale, but far too slow for more than 36 million annotated products (an estimated 300 days at the sampled rate of 10,000 products per 2 hours). To fix this, I used a fully distributed approach, running the code with Spark over a cluster of hundreds of 64-core containers. Some issues appeared as a result: I needed to create a custom Python executable so that the Spark session had access to all the necessary packages, and to add safeguards to our containers in case partners blacklisted them for sending too many download requests, among other problems. Once these were solved, the download ran in the same time for a dataset more than 3,000 times larger.
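The per-image logic is independent of how the work is distributed. The sketch below is not the Spark job described above: it is a single-machine stand-in that uses a thread pool from the Python standard library, and the names `fetch_image`, `download_one`, and `download_all` as well as the three status labels are assumptions made for illustration. In a Spark job, a function like `download_one` would be applied to each partition of URLs instead.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_image(url, timeout=5.0):
    """Download one image and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_one(url, fetch=fetch_image):
    """Return (url, status, payload). Never raises, so one bad link
    cannot stop a whole batch."""
    parsed = urlparse(url or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return url, "invalid_link", None
    try:
        data = fetch(url)
    except Exception:  # timeouts, HTTP errors, DNS failures, ...
        return url, "download_failed", None
    if not data:
        return url, "download_failed", None
    return url, "ok", data


def download_all(urls, fetch=fetch_image, max_workers=16):
    """Download many URLs concurrently, preserving the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: download_one(u, fetch), urls))
```

Keeping the status next to each URL makes it straightforward to measure the share of invalid links and failed downloads, as was done on the 10,000-product sample.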
After downloading the images, we preprocess them (resizing to reduce memory usage, normalizing pixel values, and other model-specific changes) and one-hot encode the labels, then store both images and labels as TFRecord files on HDFS, ready to be read as TensorFlow TFRecordDatasets.
Once everything is downloaded and stored, we can split the data using a ratio of 70–15–15 for training (~25 million products), validation (~5 million), and testing (~5 million) respectively. Now that our dataset is constructed, it is time to dive into the deep learning problem itself by choosing which models to use.
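One way to obtain such a split is sketched below. It is an assumption, not the method actually used in this project: each product is assigned to a split from a hash of its identifier, which is deterministic and guarantees that a given product lands in exactly one split. The `assign_split` name and the `product-<n>` identifiers are hypothetical.

```python
import hashlib
from collections import Counter


def assign_split(product_id, train=0.70, validation=0.15):
    """Deterministically map a product identifier to a dataset split."""
    digest = hashlib.sha256(product_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


# Rough check of the proportions on hypothetical identifiers.
counts = Counter(assign_split(f"product-{i}") for i in range(20000))
print({name: round(n / 20000, 3) for name, n in counts.items()})
```

Because the assignment depends only on the identifier, re-running the pipeline after new products are added never moves an existing product from one split to another.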
Studying state-of-the-art methods
At Criteo, we work with TensorFlow 1.15. Although TensorFlow 1.15 was very convenient at first, this version soon proved to be an issue: poor compatibility with custom TFRecords (no batch training), latest models unavailable, lackluster documentation…
In the figure above, circled in red, we see that Inception V3 and Xception, although good, are outperformed by more recent models unavailable in TensorFlow 1.15.
I therefore pushed for a move from TensorFlow 1.15 to TensorFlow 2+ in order to test more recent, state-of-the-art models. Although this meant a complete port of the project, it was ultimately worth it.
In this project, we first used:
- InceptionV3 for its impressive results in ILSVRC’15
- Xception for its design (made for classification on a large number of classes)
These models were especially useful while we were bound to TensorFlow 1.15 (as seen before, they were state of the art then), but when we moved to TensorFlow 2, we also added EfficientNet to our comparison. These are of course not all the models available, and widening the comparison remains a point of improvement for the work presented here.
Even though the chosen models provide a solid baseline, we need to adapt them to our problem before we can use them with our dataset.
Creating a generic architecture
To be able to quickly change the neural network we want to experiment with, it was important to design a generic architecture.
We create a template that adds finishing layers to a chosen deep learning model in order to predict all 4 levels of the Google Product Taxonomy. These layers are weighted according to depth, because our clients want the level 4 category above everything else.
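One common way to weight several outputs by depth is to combine their losses with per-level weights. The sketch below shows that idea in plain Python for a single example. It assumes a categorical cross-entropy per level, and the weights in `LEVEL_WEIGHTS` are hypothetical values chosen only so that level 4 dominates; the actual weights and loss used in this project are not stated here.

```python
import math

# Hypothetical per-level weights: deeper levels count more, level 4 the most.
LEVEL_WEIGHTS = [0.1, 0.2, 0.3, 0.4]


def cross_entropy(probs, true_index, eps=1e-12):
    """Categorical cross-entropy of one example, given predicted probabilities."""
    return -math.log(max(probs[true_index], eps))


def depth_weighted_loss(level_probs, level_targets, weights=LEVEL_WEIGHTS):
    """Weighted sum of the per-level cross-entropies of one example."""
    if not (len(level_probs) == len(level_targets) == len(weights)):
        raise ValueError("expected one prediction, target and weight per level")
    return sum(
        w * cross_entropy(p, t)
        for w, p, t in zip(weights, level_probs, level_targets)
    )


# Levels 1-3 are predicted with certainty, level 4 is a coin flip.
probs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]
print(depth_weighted_loss(probs, [0, 0, 0, 0]))
```

In Keras, a model with one output per level can express the same idea through the `loss_weights` argument of `Model.compile`.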
Training the models
Once a model is chosen, we train it according to the following pipeline:
- Connect to Criteo ML containers.
- Train the model:
a) Fetch the training and validation datasets stored on HDFS as TFRecords.
b) Log the ML metrics (accuracy, loss, top_k_accuracy, …) to MLflow.
- Save the trained models on HDFS.
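Among the metrics logged above, top-k accuracy is useful with thousands of classes: a prediction counts as correct when the true class is among the k highest-scoring classes. The following is a minimal pure-Python sketch of that definition; the `top_k_accuracy` function is written for illustration and is not the TensorFlow metric used during training.

```python
def top_k_accuracy(scores, true_indices, k=5):
    """Share of examples whose true class is among the k best-scored classes.

    `scores` holds one list of per-class scores per example, and
    `true_indices` holds the index of the true class of each example.
    """
    if not scores or len(scores) != len(true_indices):
        raise ValueError("scores and true_indices must be non-empty and aligned")
    hits = 0
    for row, truth in zip(scores, true_indices):
        # Indices of the classes sorted by decreasing score, truncated to k.
        top_k = sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(scores)


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
print(top_k_accuracy(scores, [2, 1], k=1), top_k_accuracy(scores, [2, 1], k=2))
```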
Having these models available on HDFS, we can compare the results on a different dataset: the test set.
Analyzing the results
Each model is trained on the same training and validation datasets, and we then evaluate all of them on a common test dataset, which we call the golden test set. It is disjoint from the training and validation datasets: no labeled product appears both in the training/validation data and in the test data.
While image classification tasks generally use accuracy as a metric (the percentage of correctly predicted classes), it is not the most appropriate metric in our case (although it still matters!). As an online advertising company, it is far more important for Criteo to properly predict shoes or clothing (popular advertisement items) than farming truck tires: this is intrinsically linked to what we call the business value of a product. This business value is a metric computed from the popularity of our advertised products: the more a product is bought, the higher its business value.
By using a class’ business value when running on our testing set, we are not only able to compute a weighted score for each model (by taking into account the business value of a predicted class to weight the prediction score accordingly) but also to evaluate a model’s performance in an easy-to-understand way, for instance with confusion matrices.
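A weighted score of this kind can be sketched as follows. The exact formula used at Criteo is not given here, so this is an assumption: each test example is weighted by the business value of its true class, and the score is the weighted share of correct predictions. The class names and values in the example are made up.

```python
def weighted_accuracy(true_labels, predicted_labels, business_value):
    """Accuracy where each example counts as much as its class is worth.

    `business_value` maps a class name to a non-negative weight.
    Classes missing from the map are given a weight of 0.
    """
    total = 0.0
    correct = 0.0
    for truth, predicted in zip(true_labels, predicted_labels):
        weight = business_value.get(truth, 0.0)
        total += weight
        if truth == predicted:
            correct += weight
    if total == 0:
        raise ValueError("no example with a positive business value")
    return correct / total


truth = ["Shoes", "Shoes", "Tires"]
predicted = ["Shoes", "Shoes", "Shoes"]
value = {"Shoes": 10.0, "Tires": 1.0}  # hypothetical business values
print(weighted_accuracy(truth, predicted, value))
```

In this toy example the plain accuracy is 2/3, while the weighted score is 20/21: the error falls on a class with little business value, so it costs much less.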
In the table above, we see that the metrics are not all correlated. Training time is generally a tradeoff (longer training often correlates with better accuracy, but also with larger models, which is not always desirable), and here it does not always pay off. Comparing the weighted scores of our different models, we notice that EfficientNetB3, although better on ImageNet than all the other models shown here, underperforms on our testing sets: its accuracy is high but its weighted score is low. Xception, designed for classification on a large number of classes, has the best results, although still in the same range as the other models.
In confusion matrices, we log the predicted classes vertically and the true classes horizontally. When a model predicts the correct class, the prediction is logged on the diagonal. In the figure above, we see that our model predicted “Shampoo & Conditioner” about 15% of the time when the correct class was “Skin Care” (and “Baby & Toddler Tops” about 10% of the time for the “Baby & Toddler Outfits” target), products which do resemble each other visually.
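Percentages like these come from normalizing the confusion counts per true class. Here is a minimal sketch with the standard library; the `normalized_confusion` helper and the toy labels are illustrative, not the tooling used to produce the figure.

```python
from collections import Counter, defaultdict


def normalized_confusion(true_labels, predicted_labels):
    """For each true class, return the share of each predicted class."""
    counts = defaultdict(Counter)
    for truth, predicted in zip(true_labels, predicted_labels):
        counts[truth][predicted] += 1
    return {
        truth: {
            predicted: n / sum(predictions.values())
            for predicted, n in predictions.items()
        }
        for truth, predictions in counts.items()
    }


truth = ["Skin Care"] * 4
predicted = ["Skin Care"] * 3 + ["Shampoo & Conditioner"]
matrix = normalized_confusion(truth, predicted)
print(matrix["Skin Care"])
```

Each inner dictionary sums to 1, so an off-diagonal entry reads directly as "how often this true class was mistaken for that one".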
Conclusion & Going further
With over 90% accuracy on a testing set of more than 5 million images, one clear next step is to integrate this project with the NLP methods already used in production to build the Universal Catalog. This could be done, for instance, with multimodal learning, or by combining both methods through ensemble learning.
To integrate a new model in production, we need to take into account another very important consideration: keeping the inference time low. We mentioned in the introduction how Universal Catalog interacts on a daily basis with billions of products. If a model’s prediction step is too slow, it could dramatically slow down the entire Universal Catalog pipeline that predicts on billions of products each day to provide their universal category.
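A back-of-the-envelope calculation shows why inference time matters at this scale. The figures below are hypothetical (1 billion products per day and 1,000 parallel workers are assumptions, not Criteo's actual numbers), and the two helper functions are written only for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_throughput(products_per_day):
    """Average number of products that must be classified per second."""
    return products_per_day / SECONDS_PER_DAY


def max_latency_ms(products_per_day, parallel_workers):
    """Largest per-product inference time, in milliseconds, that still keeps
    up when the load is spread evenly across `parallel_workers`."""
    return 1000.0 * parallel_workers / required_throughput(products_per_day)


# Hypothetical load: 1 billion products per day on 1,000 workers.
print(round(required_throughput(1_000_000_000)))  # products per second
print(max_latency_ms(1_000_000_000, 1000))  # milliseconds per product
```

Under these assumptions, every worker has well under a tenth of a second per product, image download and preprocessing included, which leaves little room for a slow model.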
With models reaching levels of accuracy similar to or above the state-of-the-art accuracy quoted previously for ImageNet, it is expected that each additional percentage of accuracy will be hard to obtain. Although using the latest 2021 models (NFNets or EfficientNetV2) is definitely an option, creating our own models is also starting to be an attractive route!
Thanks for reading!
I would like to thank Romain Beaumont for his expertise in machine learning, Gilles Legoux for his help with distributed computing, and for his supervision throughout this project. Finally, I would also like to thank the rest of the Universal Catalog team (Alejandra Paredes, Nasreddine Fergani, Hadrien Hamel, and Agnes Masson-Sibut) for the great work environment they let me experience.
“Xception: Deep Learning with Depthwise Separable Convolutions”, François Chollet