Product categorization: classical Machine Learning problem for a difficult e-commerce task

ManoMano Tech team
6 min read · Apr 6, 2020


As a DIY marketplace, ManoMano receives thousands of new products and their descriptions from its sellers every day. Like in any shop, these products need to be stored correctly, otherwise they are difficult to find. In the e-commerce industry, stored means categorized, i.e. assigned to one of the 4,000 categories of the website. In the early days of ManoMano, this task was handled manually. However, it quickly became clear this method would not scale with the high growth of ManoMano: integrating thousands of products per day by hand would cost too much while being too slow.

Human categorizers facing thousands of products a day

As it is a classical machine learning task, the data science team took the opportunity to try and automate the categorization process.

1/ How data science tackled this subject… temporarily

As described earlier, our task is to categorize automatically as many incoming products as possible into 4000 categories.

A sample of our different categories

Our dataset consists of product information (title, description, price, …) with their manually associated categories. This is the typical scenario of a multiclass classification problem on which we can train a machine learning model.
For a given product, the output of such a model is a vector with a probability for each category. This probability of the selected category can be interpreted as the confidence the algorithm has in its prediction.
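A minimal sketch of such a model (not ManoMano's production code; the product titles and category names below are invented for illustration): a multiclass text classifier whose `predict_proba` output gives one probability per category, the highest of which is read as the model's confidence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set: product titles with their manually assigned categories.
titles = ["cordless drill 18v", "inflatable pool 3m", "hammer drill 850w",
          "round pool for kids", "screwdriver set 12 pieces", "pool ladder steel"]
categories = ["drills", "pools", "drills", "pools", "screwdrivers", "pools"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(titles, categories)

# For a new product, the model returns one probability per category;
# the highest one is interpreted as the model's confidence.
proba = model.predict_proba(["drill bits set"])[0]
confidence = proba.max()
predicted = model.classes_[proba.argmax()]
```

In practice the real model handles 4,000 categories instead of three, but the interface is the same: a probability vector summing to one, and a confidence score to compare against a threshold.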

As the classification should not be sloppy, we can’t label a product if our model is not confident enough. However, each product left unlabelled must then be handled by a human, which, as we said earlier, is costly and slow. So one of the most important tasks is to decide the confidence threshold required for a product to be categorized automatically.

The idea is that, for a given model, we’d like to identify the probability threshold (level of confidence) that fits business needs best.

To choose this level, we look at the accuracy-coverage dilemma (aka precision-recall in the binary classification case), that we can estimate through a train / validation split.
For a given probability threshold:

  • coverage is the proportion of categorized products, i.e. products whose predicted probability is higher than the threshold
  • accuracy is the proportion of correctly categorized products among the categorized ones (i.e. among those with probability higher than the threshold)
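These two definitions translate directly into code. A small illustrative sketch (the confidence and correctness arrays below are made up): given each product's predicted probability and whether the prediction was correct on a validation set, compute coverage and accuracy for a given threshold.

```python
import numpy as np

def coverage_accuracy(confidences, is_correct, threshold):
    """Coverage and accuracy of automatic categorization at a given threshold."""
    confidences = np.asarray(confidences)
    is_correct = np.asarray(is_correct, dtype=bool)
    kept = confidences >= threshold            # products categorized automatically
    coverage = kept.mean()                     # share of products above the threshold
    accuracy = is_correct[kept].mean() if kept.any() else float("nan")
    return coverage, accuracy

confidences = [0.95, 0.40, 0.80, 0.99, 0.55, 0.70]
is_correct  = [True, False, True, True, False, True]
cov, acc = coverage_accuracy(confidences, is_correct, threshold=0.75)
# 3 of 6 products clear the 0.75 threshold (coverage 0.5), all 3 correct (accuracy 1.0)
```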

Of course these two objectives are conflicting: the more products you categorize, the lower the accuracy.

The perfect algorithm would provide coverage = accuracy = 1, i.e. 100% of products are categorized, with 100% accuracy.

Obviously, this algorithm does not exist. We can, however, reason about real algorithms. Let’s plot a realistic accuracy-coverage curve:

An example of an accuracy-coverage curve

As we can see here, if we choose to categorize 24% of our products automatically, 92% will be appropriately labelled, however 8% will be miscategorized.
Now, we have to pick the threshold level that fits best with our business strategy.
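One simple way to pick that level (a hypothetical helper, not the team's actual code; the validation data below is invented): scan candidate thresholds in increasing order and keep the lowest one, hence the highest coverage, whose validation accuracy meets the business target.

```python
import numpy as np

def pick_threshold(confidences, is_correct, target_accuracy):
    """Lowest threshold whose accuracy on the validation set meets the target."""
    confidences = np.asarray(confidences)
    is_correct = np.asarray(is_correct, dtype=bool)
    for t in sorted(set(confidences)):         # increasing order: first hit maximizes coverage
        kept = confidences >= t
        if kept.any() and is_correct[kept].mean() >= target_accuracy:
            return t
    return None                                # no threshold reaches the target

best = pick_threshold([0.3, 0.6, 0.8, 0.9],
                      [False, False, True, True],
                      target_accuracy=0.9)
# 0.8 is the first threshold at which all remaining predictions are correct
```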

2/ Why is this approach not sufficient ?

This approach proved efficient and was industrialized: implemented in Python (with the unavoidable sklearn library) and scheduled on Airflow.

A high proportion of products were automatically categorized, reducing costs and integration time.
Several months later, business owners realized that the algorithm performance had significantly deteriorated. They reported the problem to the data science team.

We launched investigations and realized we were facing a feedback loop problem:

  1. During prediction, the model miscategorizes a drill D1 as “Inflatable Pools”
  2. During learning, the model takes drill D1 as input and considers it an inflatable pool
    => The model learns that a description containing “drill” probably belongs in “Inflatable Pools”
  3. During prediction, the model categorizes a new drill D2 as “Inflatable Pools”
  4. Go to 2)

The model deteriorates because it feeds on its past mistakes.
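A toy illustration of that loop (deliberately naive keyword model, invented data; nothing here is ManoMano's actual pipeline): once a mislabelled drill enters the training set as “pools”, the model starts routing every new drill to “pools” and feeds the error back into its own training data.

```python
from collections import Counter

# D1 was mislabelled: a drill stored under "pools".
training = [("inflatable pool", "pools"), ("drill 18v", "pools")]

def predict(title, training):
    """Naive keyword vote: majority label among examples sharing a word with the title."""
    votes = Counter(label for text, label in training
                    if set(text.split()) & set(title.split()))
    return votes.most_common(1)[0][0] if votes else None

d2 = predict("drill 850w", training)   # D2 inherits D1's wrong label via "drill"
training.append(("drill 850w", d2))    # ...and the error re-enters the training set
```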

This problem seems easy to resolve: the main flaw of this approach is to use data that are not 100% reliable (products that are categorized by the algorithm).

As we do not categorize the entire flow automatically (only products for which the model is confident enough), we could train our algorithm on manually categorized products only, which are more likely to be reliable (assuming humans categorize better than the algorithm).
But this solution would not be effective either, because the model would only be trained on a particular subset of the data, one that is not representative of the whole catalog: the manually categorized products are precisely those on which the algorithm was not confident enough.
The risk arising from this non-representativity is that the model becomes very good at categorizing products on which it used to underperform, while neglecting the products it already handled well, which represent the majority of the flow.

To sum up, the key is, as often, the data: it should be reliable and representative of the whole flow. So far, none of our approaches satisfies both requirements.

3/ Human-in-the-loop to the rescue

As a quick fix, we chose to train on an over-representation of reliable examples, with a minority of random examples for representativity. This approach gave better results by correcting the algorithm’s mistakes (and reassured the business stakeholders in charge of catalog quality). It was not sustainable though, as it did not guarantee representativity.
As stated before, the main prerequisites for our algorithm to be efficient and sustainable are representativity and reliability of the data. To guarantee both of them, we chose to sample randomly and every day some incoming products (categorized by the algorithm or not) that should be checked by hand by business owners.
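The sampling step itself is simple. A sketch of the idea (function and field names are illustrative, not ManoMano's code): every product in the incoming flow, whether the algorithm categorized it or not, has the same chance of being drawn for manual review, which keeps the reviewed set representative.

```python
import random

def sample_for_review(incoming_products, n_reviews, seed=None):
    """Draw a uniform random sample of incoming products for manual checking."""
    rng = random.Random(seed)
    n = min(n_reviews, len(incoming_products))
    return rng.sample(incoming_products, n)

# One day's incoming flow: some products auto-categorized, some not.
incoming = [{"sku": i, "auto_categorized": i % 3 != 0} for i in range(1000)]
to_review = sample_for_review(incoming, n_reviews=50, seed=42)
```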

We used Django to build a labelling interface:

Interface used by business owners to help improve the categorization algorithm

Business owners use it to give feedback, indicating whether a product is in the right category and, if not, in which category it should be stored.

This process enables:

  • data teams to have a clean (reliable and representative) dataset to train their algorithms
  • business teams to improve categorization quality by themselves and have a frequent feedback on their impact on accuracy

This additional proportion of manually categorized products has a cost, but that cost is more than offset by the gain in algorithm efficiency: the model categorizes better and needs fewer manual corrections.

Now that business teams are more involved in the process, they have access to the quality metric of our catalog, a better understanding of what data science needs to ensure this quality, and are more likely to give interesting insights to data scientists.

We could also use smarter strategies:

  • Sending items we are unsure about to manual labelling more often than random ones, using active learning
  • Training our model with different sample weights depending on how confident we were when we categorized them
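Sketches of both ideas, assuming an sklearn-style model (remember these are not implemented at ManoMano; the arrays below are invented): uncertainty sampling picks the items whose top probability is lowest, and sample weights make auto-labelled examples count less than manually checked ones.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confident_indices(proba, n):
    """Indices of the n products the model is least sure about (active learning)."""
    confidence = proba.max(axis=1)          # top probability per product
    return np.argsort(confidence)[:n]       # lowest confidence first

proba = np.array([[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]])
idx = least_confident_indices(proba, n=1)   # product 1: top probability only 0.5

# Confidence-weighted training: manually checked labels (weight 1.0)
# count more than auto-generated ones (weight 0.3).
X = np.array([[0.0], [0.2], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])
weights = np.array([1.0, 1.0, 0.3, 0.3])
clf = LogisticRegression().fit(X, y, sample_weight=weights)
```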

Both of these methods are promising but come with much higher engineering costs, while the current strategy performs well enough. Therefore, we chose not to implement them yet.

4/ Conclusion

Having a clean catalog is essential for an e-commerce platform. Categorization seems to be an easy multiclass problem at first sight, but it took effort for ManoMano’s data scientists to build a clean and unbiased prediction flow. The main pain point is the algorithm’s feedback loop, which leads to unreliable data.
The solution we have presented here, sampling random products every day and having them labelled by humans, is the only way we found to ensure two mandatory hypotheses for training a classification algorithm:

  • data must be reliable, so that the algorithm can learn the true relationship between variables and target
  • data must also be representative of reality, so that the dataset is identically distributed and the predicted category proportions match those of the real flow

To satisfy both of these constraints, we have created a human-in-the-loop workflow involving business stakeholders, hence reinforcing the collaboration between teams.

It has also greatly contributed to sharpening ManoMano’s data scientist mindset: close-knit collaboration with business teams is key to ensuring the success of a Data Science project.

Special thanks to Alexandre Cazé, Yohan Grember, Romain Ayres, Jacques Peeters, Baptiste Rocca, Pierre Fournier and Etienne Desbrières for proofreading this article.