When using RateS to browse for your favorite products, you may have noticed that products are grouped under categories. That is thanks to the hardworking people at Rate, who come up with the categories and manually place each product under them. However, as the number of products in the RateS database grows, the need to automate this labeling process becomes pressing. This blog post demonstrates how to build a hierarchical text classifier that fulfills this need.
Look Into The Problem
There are 3 major aspects of the problem: our resource, our goal, and how to achieve the goal given the resource.
We have data. Specifically, we have product records, consisting of their names, descriptions, images, creation dates, etc. Among all these fields, the names and descriptions are textual, and from them we can usually deduce the category of a product.
Our goal is to assign labels to products, for example, “Women’s shoes” to “Heeled sandals brand X model Y”. Furthermore, there is a clear hierarchical relation between labels, for example, “Women’s shoes” under “Women’s wear”, which we call categories and parent categories respectively. It is also notable that some products may belong to more than one category and parent category.
A classifier assigns labels to products. A text-based classifier is relatively small and quick to build and fit, so the names and descriptions of products are chosen as input. Moreover, the hierarchical structure of the labels can be exploited to produce better predictions, which is why we make the classifier hierarchical as well.
Build The Classifier
Step 1: Load data
The records of the products are in the RateS database. I have a copy of it on my machine. We can have a glance at it in DBeaver. Next, I will set up a connection to the database, and retrieve the names, descriptions, and categories of products. By the way, I am using Python 3 in an IPython Notebook.
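Here is a minimal sketch of the loading step. The database engine and schema are not stated in this post, so the sketch uses Python's built-in sqlite3 with an in-memory database as a stand-in, and the `products` table and its column names are hypothetical. With a real database server you would swap in the matching driver and connection settings.

```python
import sqlite3


def load_products(conn):
    """Fetch (name, description, category_id) rows that have both text fields.

    The table and column names are hypothetical, not the real RateS schema.
    """
    cursor = conn.execute(
        "SELECT name, description, category_id FROM products "
        "WHERE name IS NOT NULL AND description IS NOT NULL"
    )
    return cursor.fetchall()


# Stand-in database with two made-up records, one of them incomplete.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (name TEXT, description TEXT, category_id INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Heeled sandals brand X model Y", "Comfortable heeled sandals", 12),
        ("Travel backpack", None, 30),
    ],
)
records = load_products(conn)
conn.close()
```

Only the first record survives, because the second one has no description to classify.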
Step 2: Filter Data And Stem Words
Now that we have loaded data, it is time to be a little picky. To narrow the scope of the problem, we decided not to consider products whose names or descriptions are not in English (yes, we do have a few Chinese and Japanese records). The langdetect package does this work for us. Furthermore, it is reasonable to expect the names and descriptions not to be too short: more than 5 words in our case.
After that, it makes sense to stem the text. That means we will treat the different forms of a word as one, for instance, “shoe” and “shoes”. We can utilize the nltk (natural language toolkit) package to accomplish this.
Step 3: Transform Data
Having applied the methods in step 2, we now have a list of stemmed texts. We also have the corresponding parent categories (their indexes), since we would like to begin with the classifier on the higher level, i.e. classifying products into parent categories. The list is shuffled and then split into training and test sets. Next, we transform the text into vectors using the TF-IDF algorithm before feeding it into the model. This is accomplished with a pipeline, which handles a series of transformations. To speed up the training process later, we can cache the transformations. All of the algorithms are implemented in the sklearn package, and we can simply import and call the APIs.
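A minimal sketch of the split and the cached pipeline, under a few assumptions: the eight toy texts and the two parent indexes are invented, the 75/25 split ratio is my choice, and the cache goes to a temporary directory. The classifier at the end of the pipeline uses default settings here, since tuning comes in the next step.

```python
from tempfile import mkdtemp

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-ins for the stemmed texts and parent-category indexes.
texts = [
    "heel sandal women shoe", "leather boot women shoe",
    "run sneaker women shoe", "flat loafer women shoe",
    "travel backpack luggag bag", "cabin suitcas luggag bag",
    "duffel travel gear bag", "hike backpack travel bag",
]
parent_ids = [0, 0, 0, 0, 1, 1, 1, 1]

# Shuffle, then hold out a quarter of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts, parent_ids, test_size=0.25, shuffle=True,
    stratify=parent_ids, random_state=0,
)

# memory= caches the fitted transformers, so refitting the pipeline with the
# same data and TF-IDF settings can reuse the stored transformation.
pipeline = Pipeline(
    steps=[("tfidf", TfidfVectorizer()), ("clf", SVC())],
    memory=mkdtemp(),
)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```

The cache pays off mostly during hyperparameter search, where the same TF-IDF step would otherwise be refitted for every candidate setting of the classifier.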
Step 4: Train Models And Tune Hyperparameters
SVM (support vector machine) is a supervised learning model that is widely used for text classification. It is selected to be our model, with its hyperparameters (C, gamma) tuned in a grid search. Subsequently, it is saved on the disk and will be reused soon. Other models can be trained in a similar manner. Again, we can find all these algorithms in sklearn. One thing that I would like to highlight is the probability argument of the SVM classifier. By default it is False, but we set it to True. This setting lets the model output class probabilities, which enables us to combine it with models in the second level by multiplying the probabilities, as will be explained soon.
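The tuning and saving can be sketched as below. The grid values, the number of folds, the file name, and the toy data are all assumptions for illustration, not the settings used in production.

```python
import os
from tempfile import mkdtemp

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stemmed texts: 20 per parent category, built from made-up word lists.
shoe_words = ["sandal", "heel", "sneaker", "boot", "loafer"]
bag_words = ["backpack", "suitcas", "luggag", "duffel", "tote"]
texts = [f"women {a} {b} shoe" for a in shoe_words for b in shoe_words if a != b]
texts += [f"travel {a} {b} bag" for a in bag_words for b in bag_words if a != b]
parent_ids = [0] * 20 + [1] * 20

# probability=True makes predict_proba available on the fitted SVM.
pipeline = Pipeline(
    [("tfidf", TfidfVectorizer()), ("clf", SVC(probability=True))]
)
# The "clf__" prefix routes each parameter to the pipeline step named "clf".
param_grid = {"clf__C": [1, 10, 100], "clf__gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=4)
search.fit(texts, parent_ids)

# Save the best pipeline to disk and load it back, as a later run would.
model_path = os.path.join(mkdtemp(), "parent_svm.joblib")
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
probabilities = reloaded.predict_proba(["women heel sandal shoe"])
```

Saving the whole pipeline, rather than the SVM alone, keeps the fitted TF-IDF vocabulary together with the model that depends on it.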
Step 5: Models In The Second Layer
If our task were only to categorize products into parent categories, the job would be done at the end of Step 4. However, we have more categories in the second layer, grouped under parent categories in the first layer. Therefore, we need to split the data by parent category and train a new classifier within each of them. The code is quite similar to the previous steps so I will skip it. The only notable thing is that my experiments indicate that Naive Bayes models sometimes perform significantly better than SVM. As a result, they may replace SVM as the model in the second-layer classifiers. Now we can claim that the training process is complete! 🎉
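For readers who would like to see the shape of it anyway, here is a rough sketch rather than the actual training code. It groups records by parent category, compares an SVM with a Naive Bayes model by cross-validated accuracy, and keeps the better one. The function name, the record layout, the category indexes, and the selection rule are all assumptions, and it expects every parent category to contain at least two categories.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def train_second_layer(records, cv=3):
    """Map each parent index to a classifier fitted on that parent's products.

    records is an iterable of (stemmed_text, parent_id, category_id) tuples.
    """
    by_parent = defaultdict(list)
    for text, parent_id, category_id in records:
        by_parent[parent_id].append((text, category_id))

    models = {}
    for parent_id, rows in by_parent.items():
        texts = [text for text, _ in rows]
        labels = [category_id for _, category_id in rows]
        candidates = [
            make_pipeline(TfidfVectorizer(), SVC(probability=True)),
            make_pipeline(TfidfVectorizer(), MultinomialNB()),
        ]
        # Keep whichever candidate has the higher mean cross-validated accuracy.
        best = max(
            candidates,
            key=lambda model: cross_val_score(model, texts, labels, cv=cv).mean(),
        )
        models[parent_id] = best.fit(texts, labels)
    return models


# Toy data: two parents, two categories each, 12 made-up texts per category.
vocab = {
    (0, 10): ["sandal", "heel", "strap", "wedg"],
    (0, 11): ["sneaker", "run", "lace", "sport"],
    (1, 20): ["backpack", "hike", "zip", "pocket"],
    (1, 21): ["suitcas", "wheel", "cabin", "trolley"],
}
records = [
    (f"{a} {b}", parent_id, category_id)
    for (parent_id, category_id), words in vocab.items()
    for a in words
    for b in words
    if a != b
]
models = train_second_layer(records)
```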
Use The Classifier
Step 1: Load The Classifier(s)
In the context of hierarchical classification, several classifiers at different levels work together to produce the classification result. Now that we have trained them, it is time to load them into the proper structure before performing prediction.
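One possible structure, sketched under assumptions: a single parent model plus a dictionary that maps each parent index to its second-layer model. The file naming scheme (`parent.joblib`, `child_<index>.joblib`) is hypothetical, and the demo saves placeholder strings instead of real fitted models to stay short.

```python
from pathlib import Path
from tempfile import mkdtemp

import joblib


def load_hierarchy(model_dir):
    """Load the parent model and a {parent_id: child model} dictionary.

    The file names follow a hypothetical naming scheme.
    """
    model_dir = Path(model_dir)
    parent_model = joblib.load(model_dir / "parent.joblib")
    child_models = {
        # "child_3.joblib" -> key 3
        int(path.stem.split("_")[1]): joblib.load(path)
        for path in sorted(model_dir.glob("child_*.joblib"))
    }
    return parent_model, child_models


# Demo with placeholder objects standing in for fitted classifiers.
demo_dir = Path(mkdtemp())
joblib.dump("parent placeholder", demo_dir / "parent.joblib")
joblib.dump("child placeholder 0", demo_dir / "child_0.joblib")
joblib.dump("child placeholder 1", demo_dir / "child_1.joblib")
parent_model, child_models = load_hierarchy(demo_dir)
```

Keying the second-layer models by parent index makes the lookup during prediction a plain dictionary access.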
Step 2: Get Classification And Probabilities
The data we receive is going to be text, which will go through the same preprocessing and transformation as in the training phase. Subsequently, we can obtain the probability of a product being in a category based on the following rule:
P(x) = P(x|y) * P(y)
x: product m in category n
y: product m in the parent category containing category n
x|y: product m in category n, given that the product is in the parent category containing category n
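This is the chain rule of probability: a product in category n is necessarily in the parent category containing n, so being in the category is the same event as being in both. I am pretty sure you understand it as you have managed to read this far. 😉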
The following code takes multiple entries of products and returns categorical prediction for all of them. It is perfectly fine if you would like it to predict one product at a time.
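A minimal sketch of such a function, not the production code. It assumes the models expose `predict_proba` and `classes_` as sklearn classifiers do, and that `child_models` is keyed by parent index. The tiny Naive Bayes models and category indexes below are invented only so the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def predict_categories(texts, parent_model, child_models):
    """Return one {category_id: probability} dictionary per input text."""
    parent_probs = parent_model.predict_proba(texts)  # P(y), one column per parent
    results = [{} for _ in texts]
    for col, parent_id in enumerate(parent_model.classes_):
        child = child_models[int(parent_id)]
        child_probs = child.predict_proba(texts)  # P(x|y) within this parent
        for row in range(len(texts)):
            for k, category_id in enumerate(child.classes_):
                # P(x) = P(x|y) * P(y)
                results[row][int(category_id)] = float(
                    child_probs[row, k] * parent_probs[row, col]
                )
    return results


def fit(texts, labels):
    """Fit a small TF-IDF + Naive Bayes pipeline (demo helper)."""
    return make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)


parent_model = fit(
    ["heel sandal", "run sneaker", "hike backpack", "cabin suitcas"],
    [0, 0, 1, 1],
)
child_models = {
    0: fit(["heel sandal", "run sneaker"], [10, 11]),
    1: fit(["hike backpack", "cabin suitcas"], [20, 21]),
}
results = predict_categories(
    ["heel sandal strap", "cabin suitcas"], parent_model, child_models
)
```

Because each second-layer model's probabilities sum to 1 within its parent, the combined probabilities over all categories also sum to 1 for every product.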
Step 3: Visualize Results
We can interpret the prediction results by reading the numerical output, but isn’t it a little too dry? Let’s map the category indexes to their names, and use bar graphs to visualize the highest 3 probabilities in each prediction. This is achieved with the matplotlib package. Take a look at this product and read its text description. Well, it seems quite reasonable to say it belongs to “Sports & Outdoors” and “Luggage Travel Gear”, which both have around 40% probability. The third highest prediction, “Women’s fashion”, does not seem appropriate, though.
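The mapping and plotting can be sketched as follows. The index-to-name mapping and the probability values are made up for illustration (they only mimic the example above, and "Electronics" is an invented fourth category), and the Agg backend is selected so the sketch also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt

# Hypothetical mapping from category index to name.
CATEGORY_NAMES = {
    0: "Sports & Outdoors",
    1: "Luggage Travel Gear",
    2: "Women's fashion",
    3: "Electronics",
}


def top_k(probabilities, names, k=3):
    """Return the k most probable (category name, probability) pairs."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [(names[index], prob) for index, prob in ranked[:k]]


def plot_top_k(probabilities, names, k=3):
    """Draw a bar chart of the k most probable categories."""
    labels, values = zip(*top_k(probabilities, names, k))
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_ylabel("Probability")
    ax.set_ylim(0, 1)
    return fig


# Made-up prediction for one product.
prediction = {0: 0.41, 1: 0.39, 2: 0.12, 3: 0.08}
top = top_k(prediction, CATEGORY_NAMES)
fig = plot_top_k(prediction, CATEGORY_NAMES)
```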
Step 4: Integrate Through HTTP API
We are going to deploy the classifier on the server so that when new products arrive, they can be automatically classified and saved into the staging/production database. We do that by adding an HTTP API to the classifier, using Flask, a micro web framework. The response is in JSON format, containing indexes of categories and their probabilities.
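A minimal sketch of such an endpoint. The route name, the request and response shapes, and the `classify` placeholder are assumptions. In a real deployment `classify` would run the preprocessing and the hierarchical prediction instead of returning a fixed answer.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def classify(texts):
    """Placeholder for the real classifier: one fixed prediction per text."""
    return [{0: 1.0} for _ in texts]


@app.route("/classify", methods=["POST"])
def classify_endpoint():
    # silent=True returns None instead of raising on a missing or bad JSON body.
    payload = request.get_json(silent=True) or {}
    texts = payload.get("texts")
    if not isinstance(texts, list) or not texts:
        return jsonify(error="expected a non-empty 'texts' list"), 400
    predictions = classify(texts)
    return jsonify(
        predictions=[
            [
                {"category": int(category), "probability": float(prob)}
                for category, prob in prediction.items()
            ]
            for prediction in predictions
        ]
    )
```

A client would POST a body such as `{"texts": ["heeled sandals brand X model Y"]}` and receive one list of category and probability pairs per text. Flask's built-in test client (`app.test_client()`) is a convenient way to try this without starting a server.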
This blog post illustrates the process of creating and using a text classifier, featuring its hierarchical structure, probability visualization, and server integration. The performance of the model is not covered here but is nevertheless crucial. If you are interested in it, the resources in the references can help. Moreover, there are more advanced ways to construct and evaluate a hierarchical classifier (my naive approach is to multiply the probabilities and measure the accuracy), listed below for your reference as well. It is also worth noting that deep learning has taken text classification to a new level. Finally, please feel free to leave me a message if you want to know more about product classification at Rate. 😊