Differentiate more than 100 million Tokopedia Products with Vector Representation

Abe Vallerian
Published in Tokopedia Data
Jan 31, 2019

Hi guys! I just want to share one of our Data Science research projects, titled Simple Product Representations of E-Commerce Products, which was presented at IALP 2018. Basically, it is about representing each product as a sequence of numbers, so we can differentiate products or measure how similar one product is to another. Don’t worry, I will keep the explanation simple.

At Tokopedia, we have a vast number of products. In 2018, there were around 100 million active products across about 1,500 categories. That’s quite a lot, right? That’s why we need to understand each product well, to ensure that we provide the right product when customers search for it. In this case, product representation is critical. Let’s start with the product title.

Source: Unsplash

1. Product Title

A product has many components: a title, images, a description, and more. Surprisingly, the product title alone contains a lot of information about the product. For example, let’s take a look at the product picture below.

Product in Tokopedia

As you can see, the product title contains the product brand (Apple), product type (Macbook Pro), dimension (15.4in), and even the color (abu-abu, or gray in English). This can be considered a good product title because it carries a lot of useful information. But in reality, not all products have this much information.

Apart from incomplete information, product titles are also very messy. They don’t follow any grammar rules. They may contain unnecessary symbols or mix multiple languages, e.g. English and Indonesian. Fortunately, we can still gather a lot of information from the product title, but we need to preprocess it first, just like any other text (a short code sketch follows the list):

  • set all characters into lowercase,
  • remove non-alphanumeric characters, except meaningful symbols (e.g. “2.0” in “Parrot AR Drone 2.0 Motor”), and
  • remove words that occur less than 5 times.
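As a rough illustration, here is a minimal Python sketch of these steps; the exact regex and the set of symbols kept are my own assumptions, not our production rules:

```python
import re
from collections import Counter

def preprocess_title(title):
    """Lowercase a title and strip symbols, keeping letters, digits, spaces, and dots."""
    title = title.lower()
    # keep dots so meaningful tokens like "2.0" survive
    return re.sub(r"[^a-z0-9. ]+", " ", title).split()

# toy corpus; in practice this runs over millions of product titles
titles = ["Parrot AR Drone 2.0 Motor", "Apple Macbook Pro 15.4in Abu-Abu"]
tokenized = [preprocess_title(t) for t in titles]

# drop words that occur less than 5 times across the whole corpus
# (on this toy corpus nothing passes the threshold; it only matters on real data)
counts = Counter(w for tokens in tokenized for w in tokens)
tokenized = [[w for w in tokens if counts[w] >= 5] for tokens in tokenized]
```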

Next, we build our product representation based on the preprocessed product title.

2. Product Representation

We build our product representation by first doing word embedding using Word2Vec. Word embedding is very popular in NLP and I believe most of you have encountered it. If you haven’t, here is a brief explanation.

Word Embedding

It might sound complicated, but the goal of word embedding is simply to represent each word as a vector, which is a sequence of numbers. Why numbers? Because numbers are easier for a computer to process than strings (text). For example, we might have vectors like this for each word:

  • iphone : [0.2, 0.25]
  • samsung : [0.25, 0.2]
  • xiaomi : [0.2, 0.15]
  • tablet : [0.5, 0.6]
  • smartphone : [0.6, 0.5]

If we visualize those vectors, we can see that words with similar meanings are close to each other.

Word Embedding Visualization
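To make “close to each other” concrete, here is a tiny check on the toy vectors above; in the real embedding space the vectors have far more dimensions, but the idea is the same:

```python
import numpy as np

vectors = {
    "iphone":     np.array([0.2, 0.25]),
    "samsung":    np.array([0.25, 0.2]),
    "xiaomi":     np.array([0.2, 0.15]),
    "tablet":     np.array([0.5, 0.6]),
    "smartphone": np.array([0.6, 0.5]),
}

# the phone brands sit close together, while tablet/smartphone form their own cluster
print(np.linalg.norm(vectors["iphone"] - vectors["samsung"]))  # ~0.07
print(np.linalg.norm(vectors["iphone"] - vectors["tablet"]))   # ~0.46
```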

To obtain the final word vectors, we compare the CBOW and Skip-Gram methods. CBOW tries to predict the center word from its neighboring words, while Skip-Gram tries to predict the neighboring words from the center word. With both methods, words with similar meanings end up near each other. In our experiment, we trained the word embeddings on more than 25 million product titles.
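As an illustration, this is roughly how both variants could be trained with gensim (4.x argument names); the hyperparameters here are placeholders, not the settings used in our experiment:

```python
from gensim.models import Word2Vec

# `tokenized` is the list of preprocessed titles from the earlier sketch
# sg=0 trains CBOW, sg=1 trains Skip-Gram
cbow = Word2Vec(sentences=tokenized, vector_size=100, sg=0, min_count=5, workers=4)
skipgram = Word2Vec(sentences=tokenized, vector_size=100, sg=1, min_count=5, workers=4)

vector = cbow.wv["iphone"]              # 100-dimensional word vector
print(cbow.wv.most_similar("iphone"))   # nearest words in the embedding space
```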

Product Representation

After doing the word embedding and obtaining the word vectors, we can get the product representation by simply averaging the word vectors of every word in a sentence (i.e. a product title, since we can treat a product title as a sentence). We call this method the unweighted average.
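In code, the unweighted average could look like this (a sketch reusing the gensim vectors from the snippet above):

```python
import numpy as np

def product_vector_avg(tokens, wv):
    """Unweighted average of the word vectors in a product title."""
    vectors = [wv[w] for w in tokens if w in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

# e.g. using the CBOW vectors trained earlier
title_vector = product_vector_avg(["apple", "macbook", "pro"], cbow.wv)
```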

Another method is the weighted average with noise correction, proposed by Arora et al. The concept is similar to the unweighted average, but we multiply each word vector by its corresponding weight before taking the average. The most frequently occurring words get the lowest weights, since they tend to be the least informative. Then, we remove the common “noise” component from every sentence vector.
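Here is a minimal sketch of the weighted average with noise correction, following Arora et al.’s SIF formulation; the weighting constant `a` is a commonly used default rather than the value from our experiment, and every title is assumed to contain at least one in-vocabulary word:

```python
import numpy as np

def product_vectors_wc(titles, wv, word_freq, a=1e-3):
    """Weighted average with noise correction (SIF, Arora et al.)."""
    total = sum(word_freq.values())
    sentence_vectors = []
    for tokens in titles:
        # weight a / (a + p(w)): frequent words contribute less
        weighted = [(a / (a + word_freq[w] / total)) * wv[w]
                    for w in tokens if w in wv]
        sentence_vectors.append(np.mean(weighted, axis=0))
    X = np.vstack(sentence_vectors)

    # remove the common component ("noise") shared by all sentence vectors
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    return X - np.outer(X @ u, u)

# e.g. with the tokenized titles, CBOW vectors, and word counts from the sketches above
product_vectors = product_vectors_wc(tokenized, cbow.wv, counts)
```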

3. Product Similarity Benchmark

To evaluate our product representation, we prepare a product similarity benchmark. Basically, we want to compare the performance of our models against labelled data.

First, we gather our labelled data as a ground truth:

  • We sampled 4000 triplets of product titles from various categories.
  • Each triplet has three product titles: Anchor, Positive, and Negative. We arrange the product titles so that the Anchor and Positive pair is semantically more similar than the Anchor and Negative pair. Let’s take a look at the table below for an example.
(Positive, Anchor, and Negative product titles)
  • In the first example, “kaos bola murah” (cheap football shirt, the Anchor) is more similar to “t-shirt jersey juventus” (Juventus jersey t-shirt, the Positive) than to “celana panjang anak” (children’s long pants, the Negative), because both the Anchor and Positive products are tops.
  • When building the labelled data, we ensure that there are no common words within a triplet, to prevent bias when calculating similarity.

After gathering the labelled data, we calculate the accuracy of each model, using cosine similarity as the similarity metric.

The definition of accuracy is as follows (a code sketch follows this list):

  • If the cosine similarity between the Anchor and Positive products is higher than the cosine similarity between the Anchor and Negative products, then the model is correct.
  • Otherwise, it is incorrect.
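In code, the benchmark boils down to something like the sketch below, where `embed` stands for whichever product representation (Avg or WC) is being evaluated:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def triplet_accuracy(triplets, embed):
    """triplets: list of (anchor, positive, negative) title strings.
    embed: any function mapping a title to its product vector."""
    correct = 0
    for anchor, positive, negative in triplets:
        a, p, n = embed(anchor), embed(positive), embed(negative)
        # the model is correct when the Anchor is closer to the Positive
        if cosine(a, p) > cosine(a, n):
            correct += 1
    return correct / len(triplets)

# one labelled triplet (Anchor, Positive, Negative) from the benchmark
triplets = [("kaos bola murah", "t-shirt jersey juventus", "celana panjang anak")]
```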

We conduct our experiments using various models and word vector dimensions. The results are shown in the table below:

Experiment Result
  • Bold numbers indicate the best performance for the respective dimension.
  • Dimension refers to the length of the output vector of each word in the word embedding.
  • CBOW and SG refer to CBOW and Skip-Gram for word embedding, respectively.
  • Avg and WC refer to unweighted average and weighted average with noise correction for product representation, respectively.

Based on the results, CBOW-WC outperforms all other models in most cases. We can also see that WC outperforms Avg. Interestingly, a higher dimension doesn’t lead to better performance. A higher dimension means more information can be stored, but since we don’t train the models specifically to increase or decrease product similarity, this result is plausible.

4. Product Representation Visualization

We also visualize our product representations as another form of evaluation. We sample products from 5 different categories in Tokopedia and compare the CBOW-Avg and CBOW-WC methods with 1000 dimensions. We use PCA to reduce the vectors to 50 dimensions, then t-SNE to project them into 2 dimensions. Based on the visualization, the products in the same category are clustered more nicely with CBOW-WC than with CBOW-Avg; a sketch of the projection pipeline follows the figures below.

CBOW-Avg Visualization
CBOW-WC Visualization
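A sketch of that projection pipeline with scikit-learn; the data below is random, purely so the snippet runs on its own:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# stand-in data: in the experiment these are 1000-dimensional product vectors
# sampled from 5 categories (random here just to make the sketch runnable)
rng = np.random.default_rng(0)
product_vecs = rng.normal(size=(500, 1000))
labels = rng.integers(0, 5, size=500)

# PCA down to 50 dimensions first, then t-SNE down to 2 for plotting
reduced = PCA(n_components=50).fit_transform(product_vecs)
points = TSNE(n_components=2).fit_transform(reduced)

plt.scatter(points[:, 0], points[:, 1], c=labels, s=5)
plt.show()
```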

5. Conclusion & Future Work

We have shown that, using only the product title, we can create a product vector that represents the product well. We have evaluated the product representation vectors with a product similarity benchmark and a product representation visualization. The results show that WC performs better than Avg. We are implementing this model in our search mechanism. Since there are so many products in Tokopedia, we choose 100 as the word vector dimension; we think it is the right value considering both accuracy and efficiency.

If you want to know more details about this research, please see the IALP 2018 proceedings for the full paper. Thank you for reading. Stay tuned for our next research!
