Application of KNN and Outlier Detection to Product Type Classification

Published in

Walmart Global Tech Blog

11 min readFeb 25, 2021

Introduction

Image classification — the task of assigning an input image one label from a set of categories — is one of the classical tasks in supervised machine learning with a large variety of practical applications. In particular, image classification has been widely applied in eCommerce to extract product type. Whether the customer might be looking for a women’s shirt or a TV, product type serves a significant role throughout the entire online shopping experience. If we don’t get the product type right, it’s very difficult for the customer to navigate the catalog and quickly find the right product.

However, classifying products into categories precisely and efficiently has been and still is a major challenge in modern eCommerce. There are two perspectives through which we could look at the challenges. First, for a real-world large-scale classification task, it is challenging to collect training data and to make sure the dataset is well balanced. Second, we need to understand the nature of the taxonomy. The number of product types in the eCommerce catalog is typically very large. In particular, we have a lot of so-called long tail product types, which have relatively small number of products. On the other extreme, we have popular product types that are not granular enough. The product type taxonomy is dynamic and evolving over time, and because of all these factors, it is almost impossible to retrain a new model whenever a change happens.

What are the alternatives? Other modalities like image could help. Here are a few examples where a picture is worth a thousand words. On the left are products with “Skirt” in their titles, and on the right are products with “Tank” in their titles. While the difference in the text input may be subtle, it is relatively easier for a classification model to learn the difference from the image input.

In this blog, we describe a solution that, given the different levels of precision of classification in any eCommerce catalog, should hopefully work regardless of their current level of accuracy; and that given the high traffic of new products uploaded daily and the dynamic nature of the categories, should raise the level of accuracy while at the same time reducing the cost and time of model training. We will cover the following topics: First, we will review the pros and cons of image-based classification. Then we propose KNN-based classification as our key improvement. As part of our overall proposal, we will show the impact of combining KNN and outlier detection. Next, we will deep dive into validation. We will finish with a discussion of the large-scale detection pipeline.

Review of Image Classification

Since 2012, there has been rapid innovation in the space of image classification. The timeline below shows the progress of the classification accuracy measured by the top-5 error in the ImageNet dataset. EfficientNet is an important milestone, in that it aims to scale the network efficiently. The key idea is that it is more efficient in terms of parameters used, and it achieves the same or better results using fewer training labels. Some of these model architectures are used by Walmart to train its domain-specific neural networks.

Reference: The evolution of image classification explained

Even though image model has come a long way and we definitely leverage image model for classification, for real-world, large-scale classification tasks, the precision would not be satisfactory if we were to rely on image model alone. Plus, if we were to retrain image model due to taxonomy drift, we would be constrained by not only the cost of modeling, but the even greater cost and time of human editors. This leads us to the key idea of k-nearest neighbor (KNN) based classification.

KNN-Based Classification

The KNN-based approach relies on content-based similarity. The illustration below shows how we extract image signature by using a deep learning neural network. Each product is represented by an embedding vector which can have a dimension anywhere from 1,024 to 4,096. We define visual similarity based on the Euclidean distance in the hyper-dimensional embedding space. Similar products should have embeddings close in normal Euclidean distance.

The way KNN-based classification works is pretty straightforward. A query sample is classified by a majority vote of its K nearest neighbors based on Euclidean embedding distance. Research has shown that as the number of data points in the KNN index and the value of K increase, the error rate of the KNN method approaches the optimal Bayes error rate.

Now let’s walk through the KNN approach in more detail.

Embedding Generation: Given the golden dataset for the image classification task, we first generate its semantic embedding for each image via a neural network model pre-trained on large computer vision datasets (e.g., ImageNet).
Indexing: The image embedding vectors generated from the previous step will be indexed into a database that supports efficient KNN search.
Searching: At prediction time when a new image arrives, we will first generate its embedding via the same pre-trained model used to construct the KNN search index, and then find out its k nearest neighbors.
Weighted Majority Voting: Inspired by the Softmax function widely used in the deep learning community to map a real vector into a simplex polytope (a.k.a., a probability distribution), we choose the weighting function to be exp(-d) with d as the Euclidean distance, so that neighbors closer to the query vector will have a larger influence on the decision. For each label , we sum weights from neighbors sharing the same label as the vote result. The label collecting the most votes would stand out to be the predicted class, with its votes cast as the confidence after normalization.

Example: *class* ***stool*** *predicted with score 0.7304 (= 0.1890 + 0.1710 + 0.1400 + 0.1267 + 0.1037)*

In practice, we could also impose a threshold on the confidence score to only output the prediction when we have enough confidence. By doing that, we will have the flexibility to achieve the best trade-off between recall and precision.

In order to optimize the value of K, we picked a model trained with furniture images. Furniture is one of the first few segments that are well-researched for its potential impact on online customer engagement. Also, for this segment, we invested in high quality curated product types that can be used as KNN golden data. Our target to beat is a DenseNet model trained with the same high-quality labels. Ultimately, we used the recall-precision of the KNN approach to determine the optimal value of K. With the optimal value of K as 7, we determined the threshold value of the Softmax score as 0.8. This gives us slightly better performance than the DenseNet model. To be clear, the image model used by KNN for embedding has a lower performance than the DenseNet model. The improvement comes from the KNN classification algorithm, as well as the high-quality golden dataset.

Outlier Detection

As part of our overall proposal, we now consider combining KNN and outlier detection. It is a common technique to use outlier detection during the data collection phase. The benefit is to clean up training data in an unsupervised fashion, and the findings should be easy to interpret. The idea of isolation forest is that the outliers are easier to isolate compared to inclass data. We picked isolation forest because

It is not based on spatial proximity, so no point-based distance calculation
It can be both supervised or unsupervised
It is scalable to high dimensions
It is easy to interpret

Reference: Detecting and preventing abuse on LinkedIn using isolation forests

The table below shows the potential impact of outlier detection. We constructed one dataset for Bar Stools, and another for Armchairs & Accent Chairs. Both contain a small number of wrong product type labels in them (less than 10%). With everything else being the same, we did two experiments using isolation forest by varying the estimated outlier percentage. For the first experiment, we set the estimated outlier percentage to 25%; and for the second experiment, we set the estimated outlier percentage to 15%. The table shows the recall percentage of the wrong labels.

It is intuitive that the recall increases as the estimated outlier percentage increases. Also, for a more homogeneous product type like Bar Stools, it is informative to get a relatively better recall than Armchairs and Accent Chairs. More experiments will need to be done to determine the optimal threshold for estimated outliers.

Validation

In this section on validation, we will show the value of our misclassification detection approach. It is important to realize that at the end of the day, we need to corroborate across different modalities. Our mentality is not to pick a single winner, but to see how the strength of each approach can be used together to detect misclassification. For best possible results, we compare three approaches, using text model, image model and KNN classification. The text-based model is a preproduction model trained over catalog data. The image model is the EfficientNet model trained over catalog images. For KNN embedding, we pose the following question: how does a pretrained state-of-the-art model trained over a generic dataset like ImageNet compare to a model trained over domain-specific images? For a fair comparison, with everything else being equal, we picked the pretrained EfficientNet model and an internally trained EfficientNet model.

In the illustration below, the overlapping area indicates misclassification detected by more than one model, with the same predicted product type. For example, in the Venn diagram on the left, using the pretrained EfficientNet model as the backbone for the KNN embedding, we found that 3.2% of items are classified as the same product type by all three models. To be clear, this 3.2% is misclassified, with support from all three models. Similarly, in the Venn diagram on the right, we found that 3.8% of items are misclassified, supported by all three models, if we switch the backbone for the KNN embedding to use the EfficientNet model trained over catalog images.

**Left**: Pretrained EfficientNet Model; **Right**: EfficientNet Trained Over Catalog

For more robust measures than simple percent agreement calculation, we turn to Cohen’s kappa. This technique was first published by Jacob Cohen in 1960. Without going into too much detail, Cohen’s kappa takes into account the possibility of the agreement occurring by chance. In the context of misclassification detection, Cohen’s kappa is basically a score of how much consensus there is in the misclassification ratings given by a pair of models.

For each product type, we calculate the pair-wise kappa based on the agreement of misclassification tags. The numeric score is then mapped to a bucket for human interpretation. For example, let’s say that Text-and-KNN might have a kappa score of 0.29, or Fair agreement level, for Tank Tops, but 0.95, or Almost Perfect level, for Blue Ray & DVD Players.

For easy understanding of the pair-wise model agreement, let us use the example of TV shows like America’s Got Talent. You can think of the different classification approaches of text, image, and KNN like the judges of the talent show. The space of the product types is similar to the space of talent categories. For each category of talent shows, we calculate the pair-wise kappa based on the agreement of the judges. In the end, we plot the levels of pair-wise agreements on the X-axis, and the fraction of categories at each level on the Y-axis.

Based on the pair-wise model agreement, we can see a higher level of agreement between image and KNN when using the EfficientNet model trained over catalog images for the KNN embeddings (above plot on the right). To the extent that KNN is inherently based on similarity between embeddings, this higher level of consensus is expected. However, we don’t necessarily want judges that are almost always in agreement, right? The motivation of a different approach altogether is to detect wrong product types that would otherwise have been missed. Although the KNN approach relies on image model performance, but the strength of the KNN approach does not seem to rely on the domain-specific factor of the image model. This is manifested by the fact that the level of agreement between text and KNN holds regardless of the image model used for the KNN embeddings. This is good news in the sense that we don’t need to rely on the model trained over catalog images; and for KNN, using pretrained models seems to be good enough.

Large-Scale Detection Pipeline

Last but not least, let us briefly touch on our large-scale detection pipeline. This is a generic architecture to support distributed inferencing. There are three components in this pipeline. First on the left side, we support data sources from cloud, on-premise or streaming data sources. In the middle, we use a streaming platform like Kafka to orchestrate the data flow and support multi-tenant use cases. We have the service stack on the right side that consists of the streaming consumer, a model inferencing server, and KNN search service collocated on a single node. The service stack is the basic unit of deployment, and it is tuned to maximize the throughput on a single machine. The pipeline is also designed to scale horizontally by dynamically adding more service pods.

Summary

To summarize, the key idea is to first generate semantic embeddings for labeled images, then at inference time when a new image arrives, we find its nearest neighbors in the embedding space and conduct a weighted majority vote to predict its label. We have successfully applied this method in production to extract Walmart store and online grocery items’ product types based on their images, and found our approach is highly capable of flagging and fixing items with problematic product type tags in our current production catalog.

Acknowledgement

This work is an ongoing joint project between the Catalog and Search teams from Walmart Global Tech. Special thanks go to Matthew Mu for his initiative and Brian Seaman for his guidance.