Product Categorization: a simple thing that impacts big things

Johan Sentosa
Blibli.com Tech Blog
5 min read · Jul 20, 2020

Creating interesting products, taking great pictures of them, and providing detailed information are how Blibli maintains its product quality. But there is one little detail that can make all the difference: product categorization, the placement and organization of products into their respective categories. In that sense, it sounds simple: choose the correct category for each product. In practice, however, thousands of new products arrive every day, and categorizing them manually is prone to human error.

Improves Everything

Here is why categories are important and how we use them:

Product categories in Blibli.com
  1. Better user experience. Users who are browsing can intuitively make their way through Blibli to find what they want rather than scrolling through endless pages. Users searching for Android phones can simply find them under Handphone, Tablet & Wearable Gadget > Handphone > Android. Making Blibli easy to navigate is one of the most important elements of UX and leads to higher conversion rates.
  2. Improved search relevance. Correct categorization allows our search engine to fetch products faster and deliver more accurate, relevant results.
  3. Confidence for sellers. When products sit in the right category, sellers can be confident that their products are visible to our customers, and customers do not miss products just because they are misplaced.

What do we do?

We now know how important product categories are. The problem is that numerous products have incorrect classifications, because the product upload process includes manual steps, one of which is selecting the category. As you might expect, humans are likely to make mistakes when inputting data manually. To solve this problem, we turn to machine learning!

This is a typical multiclass classification problem on which we can train a machine learning model. The model then uses text classification methods to predict which of the roughly 1,000 categories at Blibli a given product name belongs to.

The data

There are over 8 million products registered in our data warehouse, already categorized manually by humans. But as mentioned earlier, those labels are not 100% trustworthy. What can we do about it? In machine learning we trust! We have no external data that perfectly matches our categories, so we train on the data we have. Here is how the data looks in our case:

As the examples above show, the categories we want to predict are hierarchical, with a varying number of levels. One example has four levels: Bliblimart at level 1, Minuman at level 2, Susu at level 3, and so on. Another has only three levels. So we choose the deepest level that every product has, which is level 3. All the levels up to that depth are combined and treated as one category.
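The level-combining step can be sketched as follows. The product names and category paths here are illustrative examples, not real catalog rows; the path-joining helper and the `" > "` separator are assumptions.

```python
# Hypothetical rows: each product carries a hierarchical category path
# of varying depth. We truncate every path to its first three levels and
# join them into a single flat label, so a 4-level path and a 3-level
# path are treated uniformly.
products = [
    ("Susu UHT Full Cream 1L", ["Bliblimart", "Minuman", "Susu", "Susu UHT"]),
    ("Kaos Polos Hitam", ["Fashion Pria", "Atasan", "Kaos"]),
]

def to_label(category_path, depth=3):
    """Flatten a hierarchical category path into one label string."""
    return " > ".join(category_path[:depth])

labels = [to_label(path) for _, path in products]
print(labels)
# ['Bliblimart > Minuman > Susu', 'Fashion Pria > Atasan > Kaos']
```

Flattening to one label turns the hierarchy into an ordinary multiclass target, at the cost of ignoring the relationship between sibling categories.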

Noisy and imbalanced data make this problem challenging. The number of unique products per category varies starkly: one category has more than 70k products, while several categories have fewer than 100.

Distribution of our data

Therefore, we use stratified, capped random sampling: every category is represented, and no single category can dominate, which deals with that skewness.
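A minimal sketch of that sampling step, assuming rows of `(product_name, category)` pairs; the function name, cap value, and seed are illustrative, not the values used in production.

```python
import random
from collections import defaultdict

def stratified_capped_sample(rows, cap, seed=42):
    """Group rows by category, then randomly keep at most `cap` rows per
    category, so a 70k-product category cannot swamp a 100-product one."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for name, category in rows:
        by_category[category].append((name, category))
    sample = []
    for category, items in by_category.items():
        rng.shuffle(items)          # random pick within the stratum
        sample.extend(items[:cap])  # cap the stratum size
    return sample

rows = [(f"product-{i}", "Susu") for i in range(70000)] + \
       [(f"product-{i}", "Kaos") for i in range(80)]
sample = stratified_capped_sample(rows, cap=500)
# Every category survives, but none exceeds the cap.
```

Capping the majority classes trades some training data for a far less skewed class distribution.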

Play Time!

Prepare the dataset — The text data first needs to be converted to a numeric representation before ML algorithms can be applied to it. We use the tokenizer from Keras. This step also covers preprocessing: removing special characters, converting everything to lower case, and keeping only words with more than 3 letters. Before that, we divide the data into 80% for training and 20% for validation.
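The cleanup and encoding described above can be sketched in plain Python (the real pipeline uses the Keras `Tokenizer`; the function names, the 30,000-word vocabulary, and the sequence length of 10 are taken from the architecture described below, while the exact regex and helpers here are assumptions).

```python
import re
from collections import Counter

def preprocess(text, min_len=4):
    """Lower-case, strip special characters, and keep only words with
    more than 3 letters, as described above."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return [w for w in text.split() if len(w) >= min_len]

def build_vocab(corpus, num_words=30_000):
    """Map the most frequent words to integer ids, starting at 1
    (index 0 is reserved for padding, as in the Keras Tokenizer)."""
    counts = Counter(w for doc in corpus for w in preprocess(doc))
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(num_words))}

def encode(text, vocab, maxlen=10):
    """Turn one product name into a fixed-length sequence of word ids."""
    ids = [vocab[w] for w in preprocess(text) if w in vocab]
    return (ids + [0] * maxlen)[:maxlen]  # pad or truncate to maxlen
```

Each product name thus becomes a fixed-length integer vector ready for the embedding layer.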

Instantiate and train the model — Once the text data has been converted to a numeric representation, we can apply classification models to it. After experimenting with various classification algorithms, we settled on a recurrent neural network (RNN) using Long Short-Term Memory (LSTM) layers.

Model architecture

The first layer in the network, as per the architecture diagram above, is a word embedding layer. It converts the words (referenced by integers in the data) into meaningful embedding vectors. We define an Embedding layer with a vocabulary of 30,000 words, a 100-dimensional vector space in which words are embedded, and input documents of 10 words each.

The next two layers are LSTM layers, each with 128 hidden units inside the LSTM cell. The first LSTM layer returns the full sequence, meaning its output has the same time dimension as its input, so it can be passed to the second LSTM layer.

The next layer in our Keras LSTM network is a dropout layer to prevent overfitting. Finally, the output layer has a softmax activation applied to it. Below is the complete summary of the model:
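The layers described above can be assembled in Keras roughly as follows. This is a sketch, not the exact production model: the dropout rate, optimizer, and loss are assumptions not stated in the post, while the vocabulary size, embedding dimension, sequence length, LSTM width, and output size come from the architecture described above.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

def build_model(vocab_size=30_000, embed_dim=100, seq_len=10,
                lstm_units=128, num_classes=1000, dropout_rate=0.5):
    model = Sequential([
        Input(shape=(seq_len,)),                   # 10 word ids per product name
        Embedding(vocab_size, embed_dim),          # integers -> 100-dim vectors
        LSTM(lstm_units, return_sequences=True),   # full sequence for the next LSTM
        LSTM(lstm_units),                          # only the final hidden state
        Dropout(dropout_rate),                     # rate is an assumed value
        Dense(num_classes, activation="softmax"),  # one probability per category
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

`return_sequences=True` on the first LSTM is what keeps the output dimension the same as the input, so the second LSTM receives a full sequence rather than a single vector.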

Summary of the model

The beginning of the end

We trained and tested hundreds of models to arrive at the best combination of learning rate, number of epochs, and architecture. The best model achieves 87.89% accuracy, 87.64% precision, and 87.84% recall, with an average F1 score of 0.875.

We realize this result can still be improved. Because the products were originally categorized manually, human error is still present in the data. To refine it, we run the model over our entire catalog and flag products whose predicted category does not match the stored one, then suggest these mismatches to "product owners" for manual review. Product owners give feedback by confirming whether a product is in the right category and, if not, updating it to the category it should be stored under. This improves the quality of the data and should improve accuracy in the future.

Continuous Improvement Is Better Than Delayed Perfection— Mark Twain
