460J Final Project

May 9, 2022

Intro

Shopee is an online shopping platform headquartered in Southeast Asia that caters to independent sellers, much like eBay. Shopee aggregates listings from hundreds of thousands of sellers into specific categories and products for its customers to browse. The company seeks a machine learning algorithm that can predict which listings, all of which may have different images and descriptions, are the same product. We approach this problem using a Siamese-style network that learns highly discriminative features from the images, complemented by a TF-IDF text feature extractor.

Data

Our training data consisted of 34,250 product posts covering 11,014 unique products. Each product had between 2 and 51 unique posts. Each post has the following fields:

posting_id — unique id assigned to every post
image — .jpg file of product with a hashed filename.
image_phash — hash value calculated by the open source perceptual hash library. Phashes are “close” to one another if the features of the image file are similar, unlike conventional hashes.
title — title, produced by the lister of the post.
label_group — integer value that identifies products. Posts with the same label group are classified as the same product.
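To make the idea of "close" pHashes concrete, here is a minimal sketch that counts the differing bits (the Hamming distance) between two hashes. It assumes image_phash is stored as a hexadecimal string of fixed length; the function name and the example hash values are made up for illustration.

```python
def phash_hamming(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two hex-encoded hashes of equal length."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Hypothetical hash values, not taken from the dataset.
print(phash_hamming("94974f937d4c2433", "94974f937d4c2433"))  # 0: identical hashes
print(phash_hamming("94974f937d4c2433", "94974f937d4c2432"))  # 1: only the last bit differs
```

A small distance suggests visually similar images, while a conventional hash would change completely for even a one-pixel difference.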

*Figures: Top 25 Most Common Words; Top 10 Most Common Words in Titles*

The presence of duplicate pHashes shows us that we don’t have to run a computationally expensive model on all listings. We could first cluster listings with duplicate pHashes together, shrinking the set of images to process and greatly reducing computational requirements.
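A rough sketch of that pre-clustering step, assuming each post is a dictionary with the posting_id and image_phash fields described above (the function name and the sample rows are hypothetical):

```python
from collections import defaultdict


def group_by_phash(posts):
    """Map each image_phash to the list of posting_ids that share it."""
    groups = defaultdict(list)
    for post in posts:
        groups[post["image_phash"]].append(post["posting_id"])
    return dict(groups)


# Hypothetical rows for illustration only.
posts = [
    {"posting_id": "train_1", "image_phash": "aaaa"},
    {"posting_id": "train_2", "image_phash": "bbbb"},
    {"posting_id": "train_3", "image_phash": "aaaa"},
]
groups = group_by_phash(posts)
# Only one representative per group would need to go through the image model.
representatives = [ids[0] for ids in groups.values()]
print(groups)
print(representatives)
```

In this toy example, three posts shrink to two images to embed.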

*Example of Postings with Similar Images/pHashes but different titles*

This shows us that neither the images nor the titles alone can be trusted for the task.

Background/Prior Art

The top-performing models on the leaderboards for this competition followed a trend of combining image recognition with NLP models. The most common approach among the top 10 teams was to combine a BERT-based NLP model with an NFNet image classifier to generate multimodal embeddings. After that, teams found unique ways to generate clusters within their datasets that predicted whether postings showed the same product.

Our Model

One simple technique would be to treat this as a multi-class classification problem, with each unique product having a unique label. However, this setting poses multiple challenges:

The number of unique products is huge. This means the number of labels could blow up quite easily, rendering the problem intractable.
The testing set could contain products not present in the training set. In this case, the model would have a hard time assigning an appropriate label to those listings.

Keeping these challenges in mind, we try to solve this problem using a Siamese-style network trained on different losses, particularly the triplet loss and the contrastive loss, enabling it to learn highly discriminative features from the images corresponding to various product listings.

This dataset poses another interesting problem. Listings associated with the same product could have similar images but completely different titles, or similar titles but completely different images. In such cases, neither the image nor the title can be relied upon alone for classification. Hence, we train a Term Frequency-Inverse Document Frequency (TF-IDF) model to complement our Siamese-style network by extracting features from the title associated with each product listing. In this section, we elaborate on the various components of our final model as well as their parameters.

The Siamese-style network

A Siamese neural network is a class of neural network architectures that contain two or more identical subnetworks. "Identical" here means that they have the same configuration with the same parameters and weights. With the aid of one-shot learning, a few images are sufficient for Siamese networks to recognize similar images in the future. Siamese networks focus on learning embeddings (in the deeper layers) that place similar classes close together. Hence, they can learn semantic similarity.

We choose EfficientNet-B0 as the backbone of our Siamese-style feature extractor. We picked this network because of its performance on the ImageNet dataset relative to its number of trainable parameters. We use ImageNet for benchmarking the models since it consists of millions of images of random things, something we would expect to see in this competition. The plot below shows performance on ImageNet vs. number of parameters for various state-of-the-art deep convolutional neural networks.

We can see that the EfficientNet series uses far fewer parameters than other models but offers state-of-the-art accuracy on the ImageNet dataset. Ideally, we would’ve liked to go with EfficientNet-B4, since it offers close to top accuracy for a small increase in the number of trainable parameters. However, Kaggle doesn’t provide us with sufficient memory to train anything larger than EfficientNet-B0.

The next step is training the model with appropriate loss functions. We train our backbone using two distinct loss functions and compare their performance on Kaggle’s hidden test set.

Triplet Loss

Triplet loss is a loss function where a baseline (anchor) input is compared to a positive (a different listing image of the same product) input and a negative (a listing image of a different product) input. The distance from the anchor input to the positive input is minimized, and the distance from the anchor input to the negative input is maximized.

During the training process, an image triplet consisting of an anchor image, a negative image, and a positive image, is fed into the model as a single sample. The idea behind this is that distance between the anchor and positive images should be smaller than that between the anchor and negative images.
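As a minimal sketch of this idea, the snippet below computes a margin-based triplet loss with NumPy. It assumes Euclidean distances between embeddings and a margin of 1.0; both are illustrative choices, not necessarily the settings used in our training.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean margin-based triplet loss over a batch of embeddings.

    Each argument has shape (batch, dim). `margin` is an illustrative
    value, not one taken from our experiments.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    # The loss is zero once the negative is at least `margin` farther than the positive.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])       # distance 1 from the anchor
far_negative = np.array([[3.0, 4.0]])   # distance 5 from the anchor
near_negative = np.array([[0.0, 0.5]])  # closer to the anchor than the positive

print(triplet_loss(anchor, positive, far_negative))   # 0.0
print(triplet_loss(anchor, positive, near_negative))  # 1.5
```

The second triplet is penalized because the negative sits closer to the anchor than the positive does.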

For training, we create a dataset of triplets by randomly picking anchor, positive and negative images from the dataset using the “label_group” field of the data.
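The random sampling could look roughly like the sketch below. It assumes parallel lists of posting ids and product labels (the label_group values from the data description); the function name, the seed and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def sample_triplets(posting_ids, label_groups, n_triplets, seed=0):
    """Randomly draw (anchor, positive, negative) posting-id triplets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pid, label in zip(posting_ids, label_groups):
        by_label[label].append(pid)
    # An anchor needs at least one other post of the same product.
    anchor_labels = [lab for lab, ids in by_label.items() if len(ids) >= 2]
    all_labels = list(by_label)
    if not anchor_labels or len(all_labels) < 2:
        raise ValueError("need a product with 2+ posts and at least 2 products")
    triplets = []
    for _ in range(n_triplets):
        pos_label = rng.choice(anchor_labels)
        anchor, positive = rng.sample(by_label[pos_label], 2)
        neg_label = rng.choice([lab for lab in all_labels if lab != pos_label])
        negative = rng.choice(by_label[neg_label])
        triplets.append((anchor, positive, negative))
    return triplets


# Toy data for illustration only.
posting_ids = ["a", "b", "c", "d", "e"]
label_groups = [1, 1, 2, 2, 3]
triplets = sample_triplets(posting_ids, label_groups, n_triplets=4)
print(triplets)
```

Every anchor and positive share a label, and every negative comes from a different product.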

Contrastive Loss

This loss is used to learn embeddings in which two similar points have a low Euclidean distance and two dissimilar points have a large Euclidean distance. In its standard margin-based form, written with the label convention used below, it reads

$$L(W, Y, X_1, X_2) = Y \, \tfrac{1}{2} D_W^2 + (1 - Y) \, \tfrac{1}{2} \max(0, m - D_W)^2$$

where $D_W$ corresponds to the Euclidean distance between points on the output manifold,

$$D_W(X_1, X_2) = \lVert G_W(X_1) - G_W(X_2) \rVert_2$$

$m > 0$ is a margin, and $G_W$ corresponds to a parameterized function whose parameters are learned by our model. Unlike the triplet loss, this is a pairwise loss. For training, we create a dataset of pairs, with the label Y=1 for matching pairs and Y=0 for pairs of distinct products. Again, we accomplish this using the “label_group” field of the data.
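For illustration, here is a small NumPy sketch of the standard margin-based contrastive loss, using the convention above that Y=1 marks a matching pair and Y=0 a distinct pair. The margin value and the toy embeddings are assumptions made for the example, not our training settings.

```python
import numpy as np


def contrastive_loss(emb_a, emb_b, y, margin=1.0):
    """Mean contrastive loss over a batch of embedding pairs.

    emb_a, emb_b: arrays of shape (batch, dim), the two embeddings of each pair.
    y: array of shape (batch,), 1 for matching pairs and 0 for distinct pairs.
    margin: illustrative value; distinct pairs farther apart than this cost nothing.
    """
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    match_term = y * 0.5 * d**2                                       # pull matches together
    distinct_term = (1 - y) * 0.5 * np.maximum(margin - d, 0.0) ** 2  # push others apart
    return float(np.mean(match_term + distinct_term))


emb_a = np.array([[0.0, 0.0], [0.0, 0.0]])
emb_b = np.array([[0.0, 0.5], [0.0, 0.5]])
y = np.array([1, 0])  # first pair matches, second pair does not

print(contrastive_loss(emb_a, emb_b, y))  # 0.125
```

Both pairs are 0.5 apart: the matching pair is penalized for not being closer, and the distinct pair for being inside the margin.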

Inference

For inference, a single branch of the Siamese network is used to extract features from the images.

Using the extracted image features

In the sections above, we discussed how the model learns highly discriminative features from the listing images. Now, we discuss how we process the extracted features to group listings into the same or different products. For this, we fit a k-nearest-neighbors (KNN) model on the extracted features and retrieve the 50 nearest neighbors of each listing, since each unique product can have a maximum of 51 listings (a listing plus up to 50 others). Before being processed by the KNN, each embedding is fused with the corresponding image_phash to make use of all relevant information in the given data.
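The neighbor lookup can be sketched with a brute-force NumPy search, as below. The distance threshold used to decide which neighbors count as matches is an assumption made for this example, as are the function name and the toy embeddings. The all-pairs distance matrix only works for small demos; on the full dataset a dedicated nearest-neighbor implementation would be the more practical choice.

```python
import numpy as np


def knn_matches(embeddings, k=50, threshold=0.5):
    """For each row, return the indices of its k nearest rows within `threshold`.

    Each row counts as one of its own neighbors (distance 0). `threshold`
    is a hypothetical cut-off for this sketch.
    """
    n = embeddings.shape[0]
    k = min(k, n)
    # Pairwise Euclidean distances, shape (n, n).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Indices of the k closest rows for every row.
    nearest = np.argsort(dist, axis=1)[:, :k]
    return [
        [int(j) for j in nearest[i] if dist[i, j] <= threshold]
        for i in range(n)
    ]


# Toy embeddings: the first two are close, the third is far away.
embeddings = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]])
matches = knn_matches(embeddings, k=2)
print(matches)
```

Here the first two listings end up matched with each other, while the third matches only itself.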

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to each listing title in a collection of titles. It is computed by multiplying two metrics: the term frequency of the word in a title, and the inverse document frequency of the word across the set of titles.

The term frequency of a word in a title can be the raw count of times the word appears in it, which can then be normalized, for example by the length of the title or by the raw frequency of the most frequent word in the title. The inverse document frequency of a word measures how common or rare the word is across the entire set of titles: the closer it is to 0, the more common the word. It is calculated by dividing the total number of titles by the number of titles that contain the word and taking the logarithm of the result. Multiplying these two numbers gives the TF-IDF score of a word in a title.

This lets us transform words into numbers, and each title into a vector of numbers that roughly represents its content. Titles with similar, relevant words will then have similar vectors, which is what we are looking for.
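The computation described above can be sketched in plain Python as follows. It assumes whitespace tokenization, term frequency normalized by title length, and the unsmoothed logarithm of (number of titles / number of titles containing the word); library implementations often use smoothed variants, so their numbers may differ. The function name and the titles are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(titles):
    """Return one {word: tf-idf score} dictionary per title."""
    tokenized = [title.lower().split() for title in titles]
    n_titles = len(tokenized)
    # Document frequency: in how many titles does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_titles / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


# Hypothetical titles for illustration only.
titles = ["red cotton shirt", "blue cotton shirt", "phone case"]
vectors = tfidf_vectors(titles)
print(vectors[0])
```

In the first title, "red" appears in only one of the three titles and therefore scores higher than "cotton" or "shirt", which appear in two.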

Using the extracted text features

We compute the cosine similarity between the TF-IDF title vectors to group listings into the same or different products.
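A minimal sketch of that matching step is shown below. It assumes each title vector is a dictionary mapping words to weights and that two titles match when their cosine similarity reaches a threshold; the threshold of 0.75, the function names and the toy vectors are assumptions made for this example.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def text_matches(vectors, threshold=0.75):
    """For each title, list the indices of titles at or above the threshold."""
    return [
        [j for j, other in enumerate(vectors)
         if i == j or cosine_similarity(vec, other) >= threshold]
        for i, vec in enumerate(vectors)
    ]


# Toy vectors for illustration only.
vectors = [
    {"red": 1.0, "shirt": 1.0},
    {"red": 1.0, "shirt": 0.9},
    {"phone": 1.0},
]
print(text_matches(vectors))  # [[0, 1], [0, 1], [2]]
```

The all-pairs loop is quadratic in the number of titles, so it is only meant to show the logic.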

Combining the features

The final submission is made by combining the matches found by the two techniques. This summarizes our model. Next, we talk about the performance of our pipeline on Kaggle’s hidden dataset.
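One simple way to combine the two sets of matches is to take their union per listing, as in the sketch below. Using a plain union is an assumption made for this example, and the function name and posting ids are hypothetical.

```python
def combine_matches(image_matches, text_matches):
    """Union the image-based and text-based matches of every posting."""
    combined = {}
    for posting_id in image_matches.keys() | text_matches.keys():
        merged = set(image_matches.get(posting_id, []))
        merged |= set(text_matches.get(posting_id, []))
        merged.add(posting_id)  # a listing always matches itself
        combined[posting_id] = sorted(merged)
    return combined


# Toy matches for illustration only.
image_matches = {"p1": ["p1", "p2"], "p2": ["p2", "p1"], "p3": ["p3"]}
text_matches = {"p1": ["p1", "p3"], "p2": ["p2"], "p3": ["p3", "p1"]}
combined = combine_matches(image_matches, text_matches)
print(combined["p1"])  # ['p1', 'p2', 'p3']
```

A union favors recall: a listing found by either technique is kept.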

Performance on Kaggle’s hidden test set

First, we establish a baseline score on the test data. For this, we develop a model that classifies listings as belonging to the same product if either their titles or their images match. This gives us a score of 0.573, which becomes the baseline to beat.
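Such a baseline could be sketched as follows, assuming that "match" means an identical title string or an identical image_phash. That reading is an assumption made for the example, and the function name and sample posts are hypothetical.

```python
from collections import defaultdict


def baseline_matches(posts):
    """Match postings that share an exact title or an exact image_phash."""
    by_title = defaultdict(set)
    by_phash = defaultdict(set)
    for post in posts:
        by_title[post["title"]].add(post["posting_id"])
        by_phash[post["image_phash"]].add(post["posting_id"])
    return {
        post["posting_id"]: sorted(
            by_title[post["title"]] | by_phash[post["image_phash"]]
        )
        for post in posts
    }


# Toy posts for illustration only.
posts = [
    {"posting_id": "p1", "title": "red shirt", "image_phash": "aaaa"},
    {"posting_id": "p2", "title": "red shirt", "image_phash": "bbbb"},
    {"posting_id": "p3", "title": "blue shirt", "image_phash": "bbbb"},
    {"posting_id": "p4", "title": "phone case", "image_phash": "cccc"},
]
result = baseline_matches(posts)
print(result["p2"])  # ['p1', 'p2', 'p3']
```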

The Siamese-style network trained on the triplet loss scores 0.636, while TF-IDF text feature extraction alone scores 0.589. When fused together, the two boost the score to 0.689. But the Siamese-style network trained on the contrastive loss, when paired with TF-IDF, wins the battle, scoring 0.721 on the Kaggle hidden test set.

Future Work

While our models perform well on the Kaggle dataset, there are a few ways we could further improve performance. One is to strategically design triplets and pairs for the Siamese-style network so that dissimilar images belonging to the same product act as positives while similar images belonging to different products act as negatives. This would make the models learn even more discriminative features. In addition, we would like to examine the performance of the ArcFace loss on the given data. On the text side, we would like to examine the performance of a more sophisticated text feature extraction model like BERT.