Automate Product Creation by Clustering Millions of Shop Offers With Machine Learning
From Many Offers to One Product
Every day, idealo imports billions of offers from shops of all sizes. To show different prices for a specific product, those offers need to be assigned directly to that product. When searching in our app or on our website, users find products with a title, images, a few attributes, and a description. These products need to be created first. The big question, however, is how.
In this blog post, I will explain how we, the inventory team at idealo, are trying to automate product creation by clustering offers in the shoe category with machine learning.
The creation of the idealo product portfolio still involves a lot of manual work. But when it comes to scaling, this approach shows its weaknesses: Hire more people? Which products should get priority and be created first? How many offers will never be shown to users because there are no products they can be matched to? It is therefore natural to think about automating this process.
But How Can We Automate It?
The main idea was to cluster offers with similar attributes in the shoe category and create a new product out of each cluster. We picked the shoe category because it has fast-moving products with a high turnover rate, which makes keeping up with the manual creation process rather difficult.
The first idea for creating a product is to group offers by an identifier, e.g., the EAN (European Article Number). But clustering by EAN has some limitations: Not every offer has an EAN. Some EANs are simply wrong. And in the shoe category, every shoe size has a different EAN, so you would get one product per size, which is not very user-friendly. Low data quality and missing values therefore ruled out this simple solution.
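To make the limitation concrete, here is a toy sketch of the naive EAN grouping. The data and column names are purely illustrative, not our actual feed format:

```python
import pandas as pd

# Hypothetical offer feed; values and column names are illustrative only.
offers = pd.DataFrame({
    "offer_id": [1, 2, 3, 4],
    "ean":      ["4060509999991", "4060509999991", "4060510000012", None],
    "title":    ["Sneaker X 42", "Sneaker X 42", "Sneaker X 43", "Sneaker X 44"],
})

# Naive approach: one "product" per EAN. The offer without an EAN is lost,
# and sizes 42 and 43 of the same shoe end up as two separate products.
products = offers.dropna(subset=["ean"]).groupby("ean")["offer_id"].apply(list)
print(products)
```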
Despite the uneven data quality, two attributes were almost always present: the title and the image of the offer. This was our starting point: if we could cluster offers with a similar title and a similar image, we would have our product.
Introducing BERT! From a Title to a Vector Representation
The idea was to create embeddings of the title and the image, combine them, and then run a KNN search to find those offers whose vector representations lie very close to each other. Embeddings are dense representations of words or images that capture their relevant information as vectors of real numbers. The assumption is that embeddings of semantically similar objects are close under some distance metric. So, if you have good embeddings, you can compare them or search for the nearest ones.
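As a minimal illustration of what such a comparison looks like, here is a small sketch using cosine similarity, one common choice of metric. The toy vectors are made up; real embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means very similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings for three offers.
emb_offer_a = np.array([0.9, 0.1, 0.3, 0.0])
emb_offer_b = np.array([0.8, 0.2, 0.4, 0.1])  # semantically similar offer
emb_offer_c = np.array([0.0, 0.9, 0.0, 0.7])  # unrelated offer

print(cosine_similarity(emb_offer_a, emb_offer_b))  # high: likely same product
print(cosine_similarity(emb_offer_a, emb_offer_c))  # low: different product
```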
With the development of self-supervised, pre-trained neural networks that can create such vector representations, namely transformers in Natural Language Processing (NLP) and Convolutional Neural Networks (CNNs) in Computer Vision (CV), we could give it a try.
Data Need to Be Prepared — The Preprocessing Step
We had shoe offer data from 38 shoe categories and many different shops, around 15 million offers in total. We did not only use the title of the offer but combined it with other valuable attributes into one long string: the title plus attributes like gender, color, product type, EAN, and HAN (Manufacturer Article Number). If an attribute was missing, the slot after the attribute name simply stayed empty. We also cleaned the title (lowercasing, removing stop words and punctuation) and kicked out shoe sizes. Offers whose title had fewer than two tokens (two words) were not processed further.
There was no need to preprocess the images very much — we only resized them to 300x300 pixels.
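A simplified sketch of this preprocessing could look as follows. The stop word list, the size pattern, and the attribute names are illustrative assumptions, not our production rules:

```python
import re
from PIL import Image

# Illustrative stop words and size pattern; the real rules are more elaborate.
STOP_WORDS = {"the", "and", "for", "with"}
SIZE_PATTERN = re.compile(r"\b(eu|uk|us)?\s?\d{2}(\.5|,5)?\b")

def clean_title(title: str) -> str:
    title = title.lower()
    title = SIZE_PATTERN.sub(" ", title)            # kick out shoe sizes
    title = re.sub(r"[^\w\s]", " ", title)          # strip punctuation
    tokens = [t for t in title.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def build_composite_title(offer: dict) -> str | None:
    """Concatenate the cleaned title with selected attributes.

    A missing attribute simply leaves the slot after its name empty.
    """
    title = clean_title(offer.get("title", ""))
    if len(title.split()) < 2:                      # skip offers with < 2 tokens
        return None
    parts = [title] + [
        f"{key}: {offer.get(key) or ''}"
        for key in ("gender", "color", "product_type", "ean", "han")
    ]
    return " ".join(parts)

def preprocess_image(path: str) -> Image.Image:
    return Image.open(path).resize((300, 300))      # only resizing needed
```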
Give Me the Vector — The Encoding Step
Before we could get embeddings for our titles, we needed to train a model on our specific shoe data. We picked a Sentence Transformer model (SBERT) that is pre-trained for both sentence and contextual word representations, with 340 million parameters and 24 layers. We trained the model with more than 800,000 positive, negative, and hard-negative pairs.
For the image embedding, we used a Convolutional Neural Network (CNN) pre-trained on ImageNet and added an ArcFace layer to it. ArcFace is a loss function originally developed for face recognition; it adds an angular margin during training so that embeddings of the same class end up close together and embeddings of different classes far apart. Our model was trained on more than 80,000 shoe images.
We will skip the part on how we fine-tuned those models, since it would be a blog post in itself. In short, we ran several experiments with different models, combinations of attributes, and hyperparameters.
We then had our inference models to create embeddings for every offer: for every composite title and for every image, we got a 768-dimensional vector. By adding the text and image embeddings element-wise, we got our final embedding, which we used for the next step: the KNN search.
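The encoding step might look roughly like this sketch. The SBERT checkpoint name is a public placeholder (we used our own fine-tuned model), and `encode_image` is a dummy stand-in for our CNN + ArcFace encoder so the snippet runs end to end:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder checkpoint with 768-dim output; not our fine-tuned model.
text_model = SentenceTransformer("all-mpnet-base-v2")

def encode_image(path: str) -> np.ndarray:
    # Stand-in for the fine-tuned CNN + ArcFace encoder; returns a dummy
    # 768-dim vector instead of a real image embedding.
    rng = np.random.default_rng(hash(path) % 2**32)
    return rng.normal(size=768).astype(np.float32)

composite_titles = ["sneaker x gender: men color: white product_type: sneaker"]
text_embs = text_model.encode(composite_titles, normalize_embeddings=True)

image_embs = np.stack([encode_image(p) for p in ["offer_1.jpg"]])

# Combine the two modalities by element-wise addition.
final_embs = text_embs + image_embs   # shape: (n_offers, 768)
```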
Finding the Right Neighbours — The K-Nearest-Neighbour (KNN) Search Step
To find vectors that lie very close to each other, we used FAISS, a library for efficient similarity search and clustering of dense vectors.
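A minimal FAISS sketch, assuming an exact flat L2 index and random stand-in vectors (at larger scale, an approximate index type could be used instead):

```python
import faiss
import numpy as np

d = 768                                                    # embedding dimensionality
embeddings = np.random.rand(10_000, d).astype("float32")   # stand-in vectors

index = faiss.IndexFlatL2(d)      # exact L2 search over all vectors
index.add(embeddings)

k = 50                            # 50 closest vectors per query
distances, indices = index.search(embeddings, k)
# distances[i] and indices[i] hold the k nearest neighbours of embedding i
```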
For every vector, we searched for the 50 closest vectors within a certain threshold and returned their distances and indices. Since the same vectors could appear in different searches, we used the offer pairs from the KNN search to build a graph from which we could finally derive our clusters. This was done with a label propagation algorithm (LPA, a standard community detection algorithm for graphs) from the GraphFrames library.
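Building the graph and running LPA could look like this sketch with GraphFrames on PySpark. The example pairs, the 0.3 distance threshold, and the iteration count are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

# KNN results flattened into offer pairs; a distance threshold decides
# which pairs become edges of the graph.
pairs = spark.createDataFrame(
    [("offer_1", "offer_2", 0.12),
     ("offer_2", "offer_3", 0.25),
     ("offer_4", "offer_5", 0.80)],
    ["src", "dst", "distance"],
)
edges = pairs.filter(pairs.distance < 0.3)

vertices = (edges.selectExpr("src as id")
                 .union(edges.selectExpr("dst as id"))
                 .distinct())

g = GraphFrame(vertices, edges)
clusters = g.labelPropagation(maxIter=5)  # offers sharing a label form one cluster
clusters.show()
```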
We had now built more than 700,000 clusters; each cluster contained a different number of offers that were similar with regard to their title and their image. From this starting point, we could create products.
From Cluster to Product — The Product Creation Step
To finally create a basic product, we took all offer attributes, like color, from all offers in a cluster and removed duplicates. For the images, we picked the offer that had around 5–10 pictures, with a fallback option if we could only find one. To avoid a mismatch between image and title, we also used the title from the offer with the chosen images, slightly adapted with some regex. Et voilà! Our basic product was created, and we already had all the offers belonging to it. This basic product could now be enriched further.
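As a rough sketch of the merge (field names and selection logic are illustrative, not our exact production code):

```python
def create_product(cluster_offers: list[dict]) -> dict:
    """Merge one cluster of offers into a basic product (simplified)."""
    # Union all attribute values across offers, dropping duplicates.
    colors = sorted({o["color"] for o in cluster_offers if o.get("color")})

    # Prefer the offer with the most images (capped at 10); this naturally
    # falls back to an offer with a single image if that is all we have.
    with_images = [o for o in cluster_offers if o.get("images")]
    best = max(with_images, key=lambda o: min(len(o["images"]), 10))

    # Reuse the title of the image-donating offer to avoid a mismatch.
    return {
        "title": best["title"],
        "images": best["images"][:10],
        "colors": colors,
        "offer_ids": [o["id"] for o in cluster_offers],
    }
```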
What the Future Brings
Of course, this is not the end of our journey. We just finished a test phase in which we rolled our created shoe products out to the front end and to our customers, and it was quite successful.
Our next step will be an adapted architecture for production, and we are also thinking about multimodal models that can be trained on image and text together, or even sound. Also, with developments like ChatGPT and GPT-4, we need to constantly think about what future models can achieve, with all their disruptive consequences. Do we need several models for different languages, or will future models be sufficient? How many models do we need to train for other categories? How do we evaluate properly? How do we integrate feedback? What are the legal issues? How adaptable must our architecture on AWS be? Where are our boundaries?
Before closing this article, we want to mention that this achievement was the result of great collaboration between different teams at idealo, especially with our machine learning team, which put a lot of effort into training and fine-tuning the models. We worked closely together, and we were only successful because of that.
And by the way — this blog post was not created by a machine. But how do you know?
Do you love agile product development? Have a look at our vacancies.