Learning from different types of data, without supervision
Clive Humby, a British mathematician, said back in 2006 that “Data is the new oil.” We all know now how accurate that statement is. So isn't limiting ourselves to a single type of data restricting us from doing more complex tasks? We have created domains in deep learning, like Computer Vision and Natural Language Processing, where we primarily focus on one type of data. But there are some kinds of tasks that question the whole concept of making a machine intelligent: are these models truly learning something, or just mimicking the data they are fed? Some examples are image captioning (given an image, the computer has to produce a caption related to it), Visual Question Answering (the computer has to answer a question about a given image), and many others. Dealing with this type of complex task has made us think about how to fuse the knowledge of all these domains and use the combined vast data. The field of integrating various data modalities and using AI to solve such problems is called Multimodal Machine Learning, or Multimodal Learning.
To deal with this, I would like to tell you how I tried to tackle it using an example: product matching on e-commerce websites. I will be using the recent Kaggle competition titled “Shopee — Price Match Guarantee” as a reference for the data. First, let's discuss the problem we have at hand. The official problem statement is:
“Two different images of similar wares may represent the same product or two completely different items. Retailers want to avoid misrepresentations and other issues that could come from conflating two dissimilar products. Currently, a combination of deep learning and traditional machine learning analyzes image and text information to compare similarities. But major differences in images, titles, and product descriptions prevent these methods from being entirely effective. In this competition, you’ll apply your machine learning skills to build a model that predicts which items are the same products.”
The data given to us had three features and one label: the image of the product, its corresponding title or description, a perceptual image hash, and the label group, respectively. We will be discussing all the features except the image hash, which is beyond the scope of this article (let's say I am saving something for my next article ;) ).
At first glance, this might look like a classification task. Since images and corresponding descriptions are given along with the label groups, we just have to classify which group each product belongs to. But there is a little twist here. First of all, there are almost 11 thousand classes for around 30 thousand products in the train data, which makes the number of products per class very small and difficult for any model to classify. But the bigger problem is that the label groups in the test data are different from those in the train data! So we can't simply build a classification model: even if it gives us a good training score, at inference time it would predict meaningless labels, since the test labels don't match the training ones.
Hence the second part of the article’s title, “without supervision”. Here we will try to build a model that performs well even for labels it hasn't seen before!
To do this, we will look inside a classification model and understand what the model is actually trying to learn. We will look inside the hidden layers of our neural network. During training, the weights of the model learn patterns in the data. If we feed data to our trained model and take the outputs of a hidden layer (these outputs are also called embeddings), we will observe that the embeddings of inputs from the same class are quite similar: the model uses the patterns it has learned to differentiate between objects of different labels. Because of this, if we visualize these embeddings in a two-dimensional space (we can do this by making the hidden layer consist of 2 neurons), we will observe that inputs of the same class cluster together.
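To make this concrete, here is a tiny, purely illustrative PyTorch sketch (not the competition model) of a classifier whose forward pass also returns the hidden-layer output, i.e. the embedding. The two-neuron hidden layer is only there so the embeddings can be plotted in 2-D:

```python
import torch
import torch.nn as nn

# Illustrative classifier: the forward pass returns both the class logits
# and the hidden-layer output (the "embedding").
class ToyClassifier(nn.Module):
    def __init__(self, in_features=128, embed_dim=2, num_classes=10):
        super().__init__()
        self.hidden = nn.Linear(in_features, embed_dim)      # 2 neurons -> embeddings are easy to plot
        self.classifier = nn.Linear(embed_dim, num_classes)  # used only for the training loss

    def forward(self, x):
        embedding = torch.relu(self.hidden(x))   # hidden-layer output = the embedding
        logits = self.classifier(embedding)      # classification output
        return logits, embedding

model = ToyClassifier()
logits, embeddings = model(torch.randn(4, 128))  # dummy batch of 4 samples
print(embeddings.shape)                          # torch.Size([4, 2]) -> can be scatter-plotted per class
```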
In our case, we will make the model output two things. During training, it will use the output of the last layer for classification, but to compare embeddings we will use the last hidden layer’s output (the last hidden layer is the layer just before the final classification layer). We could also compare the embeddings of other hidden layers; it's a choice. In my case, I selected the last one, which usually gives a better representation. Hence, even if there is no associated label in the test set, we can still figure out which products may be similar based on their embeddings. But what will we use to measure similarity? There are many options, like the Euclidean distance or the cosine similarity between embeddings. Here I have used cosine similarity: the closer the value is to 1, the more similar the two embeddings are, and the closer it is to -1, the more different they are.
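As a quick illustration of the comparison step (with made-up numbers, not real embeddings):

```python
import torch
import torch.nn.functional as F

# Dummy embedding vectors, only to illustrate the comparison.
emb_a = torch.tensor([[0.9, 0.1, 0.3]])
emb_b = torch.tensor([[0.8, 0.2, 0.4]])

similarity = F.cosine_similarity(emb_a, emb_b).item()
print(similarity)   # close to 1 -> the two products are probably the same
```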
Now let's focus on merging the different pieces we have discussed into a full pipeline.
To get the embeddings from the images and texts, I used pre-trained models. For images, I used models trained on ImageNet from the PyTorch Image Models (timm) package, and for the text, I used a multilingual Transformer model from Hugging Face. To combine the two vectors (the output from each model is a one-dimensional vector per sample), I concatenated them and then normalized the result. This is just one way to fuse the vectors. Several research papers suggest that fusing the modalities earlier in the model is likely to improve predictions, but then we would have to train everything from scratch; since I didn't have many resources, I made use of the pre-trained models. The full model architecture is in my notebooks (linked at the end); a rough sketch of the fusion step is shown below.
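The sketch below only illustrates the fusion idea; the backbone names ("resnet50", "bert-base-multilingual-uncased") are placeholders and not necessarily the ones I used:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm
from transformers import AutoModel

class FusionModel(nn.Module):
    """Sketch of late fusion: concatenate image and text embeddings, then L2-normalize."""
    def __init__(self,
                 image_backbone="resnet50",                       # placeholder backbone names
                 text_backbone="bert-base-multilingual-uncased"):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of classification logits
        self.image_model = timm.create_model(image_backbone, pretrained=True, num_classes=0)
        self.text_model = AutoModel.from_pretrained(text_backbone)

    def forward(self, images, input_ids, attention_mask):
        img_emb = self.image_model(images)                        # (batch, d_img)
        txt_emb = self.text_model(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state[:, 0]  # CLS token, (batch, d_txt)
        fused = torch.cat([img_emb, txt_emb], dim=1)              # concatenation = late fusion
        return F.normalize(fused, p=2, dim=1)                     # unit-length fused embedding
```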
The last hidden layer is an “ArcFace layer”. As we discussed before, the hidden layers try to separate distinct things from one another using the learned features; the ArcFace layer does the same thing, but better. It is a method that was first introduced for face verification (you can find the paper link below). It pulls the embeddings of similar objects closer together and pushes dissimilar objects apart (by closer, I mean having almost the same values). Hence, even though we train with labels only from the train set, we can use this property of the embeddings to handle unseen labels. The whole model was trained for 15 epochs, using cross-entropy as the loss function and the AdamW optimizer.
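For reference, the core of an ArcFace head can be sketched as below. The scale s=30 and margin m=0.5 are the commonly used defaults from the paper, not necessarily my exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Sketch of an ArcFace layer: cosine logits with an additive angular margin."""
    def __init__(self, embed_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m   # scale and angular margin (common defaults from the paper)

    def forward(self, embeddings, labels):
        # cosine of the angle between each embedding and each class centre
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the margin only to the target class, which pulls same-class
        # embeddings closer together and pushes different classes further apart
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        logits = torch.cos(theta + self.m * one_hot) * self.s
        return logits   # pass these to nn.CrossEntropyLoss during training
```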
Now coming to the inference part. After the model is trained, we compute a similarity score (using cosine similarity) between every pair of products in our test set. So if we have n products, we end up with an (n x n) matrix of scores. Again, these scores give us a sense of how similar a product is to the others, ranging from 0 to 1 after rescaling. Then we can decide on a threshold value: scores above this threshold tell us which products are matched. For example, take a product_x. We calculate the similarity scores of all products with respect to it, which gives a vector of shape (1 x n), including product_x itself. If the threshold value is 0.7, all products whose corresponding score is above that are considered a match (yes, it will select itself too, since its similarity score with itself is 1). The same is done for every other product.
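A minimal sketch of this matching step, with dummy embeddings and the 0.7 threshold from the example above:

```python
import torch
import torch.nn.functional as F

def find_matches(embeddings, threshold=0.7):
    """Illustrative matching: embeddings is an (n x d) tensor of test-set embeddings."""
    emb = F.normalize(embeddings, p=2, dim=1)   # unit length, so dot product = cosine similarity
    sim_matrix = emb @ emb.T                    # (n x n) similarity scores
    # for each product, keep the indices whose score is above the threshold
    return [torch.nonzero(row > threshold).flatten().tolist() for row in sim_matrix]

matches = find_matches(torch.randn(5, 16))      # dummy embeddings
print(matches)                                  # every product matches at least itself
```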
Hence our pipeline is finished, and we can now submit our matched products and get our score on the test set. We have made a full workflow without any knowledge of the test labels. Isn't that great? (When I submitted for the first time, I was so damn excited.)
My final model was an ensemble of 3 models, each using a different pre-trained backbone. My final score on the private test set was 0.73, and my best score was 0.732 (the competition was evaluated with the mean F1 score). The winners and the grandmasters used various tricks to get higher results, like Graph Neural Networks, post-processing tricks, boosting trees, etc., but everyone used the above-discussed idea as their base model. I strongly recommend reading the different strategies the winners used. This competition was special for me, as it gave me my best solo leaderboard position so far. Apart from this, I also learned how to use GPUs for data science tasks (like computing the scores and finding the matches), thanks to RAPIDS by NVIDIA. With the power of the GPU, my total workflow time decreased by almost 30%.
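For the curious, here is a rough sketch of what a GPU-accelerated neighbour search can look like with RAPIDS cuML. This is illustrative, not my exact notebook code; it uses Euclidean distance on normalized embeddings, which is equivalent to thresholding the cosine similarity:

```python
import cupy as cp
from cuml.neighbors import NearestNeighbors

# Illustrative only. On L2-normalized embeddings, euclidean distance d and cosine
# similarity c satisfy d^2 = 2 - 2c, so a cosine threshold of 0.7 corresponds to
# keeping neighbours with d < sqrt(0.6).
embeddings = cp.random.rand(1000, 512).astype(cp.float32)        # dummy embeddings
embeddings /= cp.linalg.norm(embeddings, axis=1, keepdims=True)  # unit length

knn = NearestNeighbors(n_neighbors=50)
knn.fit(embeddings)
distances, indices = knn.kneighbors(embeddings)

matches = [idx[dist < 0.6 ** 0.5].tolist() for dist, idx in zip(distances, indices)]
```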
I hope I could give you some intuition about how this model works. This is a field of active research, and I hope that after reading this you are curious to explore further.
Thanks for reading!
Links:
- The competition https://www.kaggle.com/c/shopee-product-matching/overview (see the discussion panel for winning solutions)
- My code for the model https://github.com/mrinath123/Shopee_notebooks (still need to upload some more scripts)
- Go through this to understand embeddings, cosine distance, and ArcFace intuitively https://www.kaggle.com/c/shopee-product-matching/discussion/226279
- ArcFace paper https://arxiv.org/abs/1801.07698
- Multimodal Learning resources https://github.com/pliang279/awesome-multimodal-ml