This work was done as the final project of a Data Science course. We chose a Kaggle competition (https://www.kaggle.com/c/shopee-product-matching/) as our challenge. This article describes the dataset, our workflow, the challenges we encountered, and our best solution and how we reached it.
E-commerce is a growing industry that has become a notable part of many people's consumption habits. Its advantages are the speed and ease with which one can make a purchase, as well as exposure to a wide variety of sellers and products. But this abundance can also be a disadvantage, because it makes it hard to navigate the numerous options and reach a decision. Many sites offer filtering and comparison tools to improve the experience, yet many people still feel lost and may end up not purchasing at all because of the data overload.
Product matching can be used to compare products and to find specific products. Consumers might want to compare products to check whether they have found a good deal, and a future application could search online for a product that was photographed in a store or on the street.
Our dataset has the following features:
- A unique ID ('posting_id').
- An image of the product ('image').
- An image-based Phash* ('image_phash').
- A title of the product ('title').
- Classes attributed to certain products ('label_group').
* Phash (perceptual hashing) is an algorithm that produces a short sequence of characters representing an image. It can be thought of as a sort of "fingerprint" for the image: visually similar images should produce similar hashes. More on image hashing can be found here: https://people.cs.umass.edu/~liberato/courses/2020-spring-compsci590k/lectures/09-perceptual-hashing/
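Perceptual hashes are usually compared by counting the bits in which they differ (the Hamming distance). The sketch below is a minimal illustration, assuming the hashes are hexadecimal strings of equal length; the function name and the sample hashes are hypothetical and are not taken from the dataset.

```python
def phash_hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the bits that differ between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    differing_bits = int(hash_a, 16) ^ int(hash_b, 16)
    return bin(differing_bits).count("1")


# Hypothetical 64-bit hashes written as 16 hex characters (assumed format).
original = "94974f937d4c2433"
near_copy = "94974f937d4c2432"  # differs from `original` in a single bit
inverted = "6b68b06c82b3dbcc"   # every bit flipped relative to `original`

print(phash_hamming_distance(original, near_copy))
print(phash_hamming_distance(original, inverted))
```

A small distance suggests near-duplicate images, while a large one suggests unrelated images. Any cut-off between the two would have to be chosen empirically.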
The dataset contains 34,250 samples, belonging to 11,014 unique products. Class sizes range from 2 to 51, with an average class size of about 3.
Our goal was to detect products similar to a given product, based on their images and titles.
*After not experiencing great success with the phash feature, we decided to focus on the images and titles as our products' features.
The dataset contains 32,412 unique images, 28,735 unique phashes, and 33,117 unique titles. This means we have some duplicates in each feature (although no two samples are duplicates across all features).
Let's see what a random product (a group of samples with the same label) looks like:
As we can see above, one of the challenges we faced was that products from the same class had very different images.
Next, we wanted to look at the shapes of the images. We found 824 different shapes. All are RGB (3-channel) images, yet there are many different heights and widths. The most common shape is 640 x 640 (12,259 samples), and 83.4% of the images have one of the five most common shapes.
Let's look at the titles. We will show the different titles of the class we used for the image demonstration.
As we can see above, another challenge we faced was that the titles come in different languages, not only English. We therefore used a Google translator API to detect the language of each title and convert it to English if necessary.
Next, we looked at the distribution of title lengths (No. of words):
As shown, most titles are 2–20 words long, with a peak around 7–8 words.
We have two different sources of data, images and text, so we used both computer vision and NLP (natural language processing) models to solve the problem. Each model requires different preprocessing steps, but in general we reshaped the images so that they all have the same shape and tokenized the words in the titles. We then used cosine similarity on the model outputs to match products by similarity.
We chose the F1-Score as our metric (as was also chosen in the Kaggle competition). Our baseline score is 0.00149 (calculated by simply always returning a single class, the largest one in the dataset).
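To make the metric concrete, here is a minimal sketch of an F1-Score for product matching. It assumes the score is computed per posting, between the set of true matches and the set of predicted matches, and then averaged over all postings. That averaging scheme, the function names, and the toy data are our assumptions for illustration.

```python
def f1_for_posting(true_matches: set, predicted_matches: set) -> float:
    """F1 between the true and predicted match sets of a single posting."""
    overlap = len(true_matches & predicted_matches)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_matches)
    recall = overlap / len(true_matches)
    return 2 * precision * recall / (precision + recall)


def mean_f1(truth: dict, predictions: dict) -> float:
    """Average the per-posting F1 over every posting in `truth` (assumed scheme)."""
    scores = [
        f1_for_posting(truth[posting_id], predictions.get(posting_id, set()))
        for posting_id in truth
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: postings a and b are one product, c and d another.
truth = {"a": {"a", "b"}, "b": {"a", "b"}, "c": {"c", "d"}, "d": {"c", "d"}}
predictions = {"a": {"a", "b"}, "b": {"b"}, "c": {"a", "c", "d"}, "d": {"a"}}

print(mean_f1(truth, predictions))
```

Posting a is matched perfectly, b misses one match, c includes one wrong match, and d is entirely wrong, so the mean falls between 0 and 1.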
First step — Creating a baseline model
Our baseline model was an ensemble of two models, one for images (a computer vision model) and one for titles (an NLP model). For computer vision we used a CNN (Convolutional Neural Network) model from the imagededup library, which uses the pre-trained MobileNet CNN from Keras. It requires reshaping the images to 224 x 224. For NLP, we built a tf-idf matrix after decoding the titles to Unicode and lowercasing the words. We then computed a cosine similarity matrix from the output of each model, applied a different similarity threshold to each (0.85 for the CNN and 0.5 for the tf-idf) to determine the matches, and combined the results by simply adding together the matches they found. This ensemble resulted in an F1-Score of 0.48.
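The title half of this baseline can be sketched with the standard library alone. This is an illustration of the tf-idf plus cosine-similarity idea, not our actual pipeline: the whitespace tokenizer, the smoothed idf formula, the use of "greater than or equal" for the threshold, and the sample titles are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(titles):
    """Return one sparse {word: weight} tf-idf vector per title."""
    tokenized = [title.lower().split() for title in titles]
    n_docs = len(tokenized)
    # Document frequency: in how many titles does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (assumed variant) so common words keep a non-zero weight.
        vectors.append({
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def title_matches(titles, threshold=0.5):
    """For each title, list the indices of titles at or above the threshold."""
    vectors = tfidf_vectors(titles)
    return [
        [j for j, other in enumerate(vectors)
         if cosine_similarity(vec, other) >= threshold]
        for vec in vectors
    ]


# Hypothetical titles, not taken from the dataset.
titles = [
    "Blue cotton t-shirt for men",
    "Men blue cotton t-shirt",
    "Stainless steel water bottle 500ml",
]
print(title_matches(titles))
```

Because cosine similarity ignores word order, the first two titles match each other even though their words are arranged differently, while the third matches only itself.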
Second step — Using more complex models to improve performance
Once again we chose to ensemble two models. For computer vision we used a Siamese Neural Network, in which identical sub-networks sharing the same weights work in tandem on three different inputs to compute comparable output vectors, and the weights are updated using a triplet-loss function. In each training step it compares an image to an image from the same class and to an image from a different class, and it outputs embeddings that can be used to calculate distances with a simple K-nearest neighbors algorithm.
The Siamese network was built from three VGG16 pre-trained networks with a retrained output layer. It required a train-test split and image reshaping to 200 x 200. The output F1 score of the Siamese NN was 0.50.
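The triplet loss that drives this training can be illustrated on plain vectors. The sketch below shows one common form of the loss, using Euclidean distance and a margin of 1.0. Both choices, as well as the toy 2-D embeddings, are assumptions for illustration and are not necessarily the settings of our network.

```python
import math


def euclidean_distance(vec_a, vec_b):
    """Straight-line distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    d_pos = euclidean_distance(anchor, positive)  # same class: should be small
    d_neg = euclidean_distance(anchor, negative)  # other class: should be large
    return max(d_pos - d_neg + margin, 0.0)


# Hypothetical 2-D embeddings.
anchor = [0.0, 0.0]
positive = [0.0, 0.5]
far_negative = [3.0, 0.0]
near_negative = [0.0, 1.0]

print(triplet_loss(anchor, positive, far_negative))   # well separated already
print(triplet_loss(anchor, positive, near_negative))  # negative is still too close
```

Minimizing this loss pulls images of the same product together and pushes images of different products apart, which is what makes the resulting embeddings usable for a nearest-neighbor search.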
Our NLP model in this case was the SequenceMatcher from the difflib library, which compares sequences in a very interesting way (for more information, see the documentation page: https://docs.python.org/3/library/difflib.html). The SequenceMatcher alone gave an F1-Score of 0.55.
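As a small illustration of how SequenceMatcher scores a pair of titles, the sketch below uses its ratio() method, which returns a similarity between 0 and 1. Lowercasing the titles first and the sample titles themselves are assumptions made for the example.

```python
from difflib import SequenceMatcher


def title_similarity(title_a: str, title_b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the lowercased titles are identical."""
    return SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()


# Hypothetical titles, not taken from the dataset.
same_product = title_similarity(
    "Blue Cotton T-Shirt for Men", "blue cotton t-shirt men"
)
different_product = title_similarity(
    "Blue Cotton T-Shirt for Men", "Stainless steel water bottle"
)
print(same_product, different_product)
```

Unlike tf-idf, this comparison works on the character sequence, so it is sensitive to word order but tolerant of small spelling differences.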
We then combined the Siamese NN and the SequenceMatcher in the following way: a candidate is chosen as a match if at least one of these three conditions is true:
1. The Siamese network returns a distance score lower than 0.95.
2. The SequenceMatcher returns a similarity score higher than 0.55.
3. Their combined score (distance + (1 - similarity)) is lower than 1.4.
This ensemble resulted in an F1-Score of 0.71.
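The three conditions above can be written as a single decision function. This is a minimal sketch: the function and parameter names are hypothetical, and the default thresholds are the ones quoted in the list.

```python
def is_match(distance, similarity,
             max_distance=0.95, min_similarity=0.55, max_combined=1.4):
    """Return True if any one of the three conditions holds."""
    image_hit = distance < max_distance          # condition 1: images are close
    title_hit = similarity > min_similarity      # condition 2: titles are similar
    combined_hit = distance + (1 - similarity) < max_combined  # condition 3
    return image_hit or title_hit or combined_hit


print(is_match(distance=0.90, similarity=0.10))  # accepted through the image
print(is_match(distance=1.20, similarity=0.60))  # accepted through the title
print(is_match(distance=1.20, similarity=0.30))  # rejected by all three
```

Taken at face value, these thresholds make the third condition redundant: a pair that fails the first two has a distance of at least 0.95 and a (1 - similarity) of at least 0.45, so its combined score is at least 1.4. The combined rule only adds matches when the thresholds are tuned differently, which is why they are exposed as parameters here.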
The final model was served on an AWS server using Flask. Ideally it will receive an image and a title, create embeddings for them, and compare them to the existing database in order to output the images and titles of similar products. Right now it works only with products that already exist in the database.
This was a very challenging task: matching samples from over 11,000 different classes, each of which may contain highly diverse titles and images. Given how low the baseline was, our final score of 0.71 can be considered good. That said, there is room for further improvement. We believe the main improvement can come from the NLP part, by trying deep learning models such as RNN/LSTM and transformer models such as BERT. In addition, enlarging the dataset and tuning model parameters may also improve the results of the Siamese network.
Above all, we believe that this project shows the feasibility of overcoming the within-product variance issue in order to detect matching products.
To conclude, as the e-commerce industry grows, consumers can enjoy a larger variety of products from various sellers and have the luxury of comparing products and getting great deals at the click of a button. But this also brings the problem of information overload, which can cause the exact opposite result. This is a classic case where machine learning can offer solutions, by providing tools that make the huge amount of data accessible to consumers. This project may be used to create such tools, which will improve the user experience and further develop the e-commerce industry.