Semantic Label Representation with an Application on Multimodal Product Categorization

Binwei Yang
Walmart Global Tech Blog
Mar 1, 2022


Semantic Label Representation


At the core of any ecommerce product catalog is product categorization. Accurate product categorization not only has an impact on revenue growth but is also the key to a positive customer experience. The collection of product categories is organized in a hierarchy for easy navigation, and a formally defined taxonomy emerges. Over time, new categories are added, and existing categories are split into more fine-grained ones. This evolution happens naturally as customers put more value on the differentiation of closely related categories.

We are now at a point where the number of categories runs into the thousands, if not tens of thousands. To put things into perspective, ImageNet-21K covers roughly 21,000 classes, and by psychologists’ estimates, there are over 30,000 visual concepts. State-of-the-art machine learning models have already shown that they can reach a level of performance that exceeds that of humans on some classification benchmarks.

However, research [1] has shown that even though the number of mistakes made by machine learning classification models has been dramatically reduced thanks to both clever modeling and data improvements, the severity of those mistakes has not changed much. When mistakes do happen, the top model prediction is still embarrassingly far off from the true category. To address this issue, we will first take a step back and come up with a plausible explanation from the perspective of semantic label representation. Then, for a semantically meaningful organization of product categories, we will formulate two innovative label representations. Next, we will propose how to construct an auxiliary objective with the help of the semantic label representations in the context of optimizing multimodal product categorization. Finally, we will discuss the experimental results, which show significant improvement to product categorization in terms of both accuracy and, more importantly, the semantic similarity of the top predictions.

The Need for Semantic Label Representation

A convenient way to represent labels in supervised multi-class training is through one-hot encoding (OHE). Using OHE, a multi-class classifier is encouraged to maximize the probability of the true class with the help of cross-entropy loss. However, intrinsic similarity between the classes is ignored, and essentially each class is treated as an anonymous class that has no relationship to any other class. As a result, a failure to distinguish between semantically related classes (e.g., cowboy boots and rain boots) is penalized to exactly the same degree as a failure to distinguish between unrelated classes (e.g., boots and t-shirts). This well-established approach has been perfected to produce high-performance classifiers as measured by top-1 accuracy, often at the price of being overconfident and/or less generalizable. The hierarchy-agnostic or semantic-agnostic approach suffers from this fundamental flaw, so when its output misses the target, it often misses badly.

In contrast to OHE, label smoothing constructs the target as a probability distribution instead. The true class is given a weight of 1-α, where, for example, α=0.1, and the remaining weight α is distributed equally among the other classes. This purely mathematical maneuver has been used in many state-of-the-art models [2] and has shown empirical improvement in both model generalization and calibration [3]. A comprehensive survey of this approach and its practical benefits can be found in [2]. However, uniform label smoothing discards information about resemblances between instances of related classes.

Formulation of Semantic Label Representation

Labels are human-generated signals that are highly semantic and information-dense. In addition, the similarity ranking among the labels is human interpretable with room for inherent ambiguity. Let’s look at a simple example with three classes, where two of the classes, A and B, are closely related. It might help to think about the classes as A) Cowboy Boots, B) Rain Boots, and C) T-Shirts. For comparison purposes, we show the encoding of example classes in terms of OHE (upper right) and uniform label smoothing (lower right) where α=0.1 in the illustration.
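As a minimal sketch of the two encodings just described, the snippet below builds the OHE target and the uniformly smoothed target for class A in the three-class example. The function names are illustrative, and the smoothing follows the definition given earlier in this post (1-α on the true class, α split equally among the other classes).

```python
def one_hot(true_class, num_classes):
    """Put all probability mass on the true class."""
    return [1.0 if k == true_class else 0.0 for k in range(num_classes)]


def uniform_label_smoothing(true_class, num_classes, alpha=0.1):
    """Give the true class 1 - alpha and split alpha equally among the others."""
    off_target = alpha / (num_classes - 1)
    return [1.0 - alpha if k == true_class else off_target
            for k in range(num_classes)]


# Classes: 0 = A (Cowboy Boots), 1 = B (Rain Boots), 2 = C (T-Shirts)
ohe_target = one_hot(0, 3)
smoothed_target = uniform_label_smoothing(0, 3, alpha=0.1)

print(ohe_target)       # all weight on class A
print(smoothed_target)  # roughly 0.9 on A, 0.05 on each of B and C
```

Note that Rain Boots and T-Shirts receive the same weight in the smoothed target, even though only one of them is closely related to Cowboy Boots.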

Our first proposed approach is built on the label smoothing concept. We leverage a-priori semantic information to model the probability distribution over classes, such that the target label is represented using probability weights reflecting the similarity between the classes. Specifically, we start with either pair-wise similarity measures or equivalent pair-wise distance metrics. Then, the semantic relationship of the classes can be expressed as a similarity matrix or an adjacency matrix. Both are symmetric; the diagonal elements of the former are all 1, and those of the latter are all 0. Similar to the semantic affinity matrix proposed in [4], we can construct the label representation from the similarity matrix as shown in the illustration. After normalization, so that the probabilities for each class sum to 1, the resulting label representation looks like that of non-uniform label smoothing.
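The normalization step can be sketched as follows. This is one plausible reading of the construction, assuming each row of the similarity matrix is simply divided by its sum; the similarity values are made up for the three-class example and are not taken from the original illustration.

```python
def labels_from_similarity(similarity):
    """Row-normalize a similarity matrix so each row is a probability distribution."""
    representation = []
    for row in similarity:
        total = sum(row)
        representation.append([value / total for value in row])
    return representation


# Hypothetical pair-wise similarities: A and B are close, C is far from both.
similarity = [
    [1.0, 0.5, 0.1],  # A: Cowboy Boots
    [0.5, 1.0, 0.1],  # B: Rain Boots
    [0.1, 0.1, 1.0],  # C: T-Shirts
]

label_representation = labels_from_similarity(similarity)
for row in label_representation:
    print([round(p, 4) for p in row])
```

In the row for class A, the related class B now receives more weight than the unrelated class C, which is the non-uniform smoothing effect described above.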

Another illustration shows how we could derive the label representation from the adjacency matrix. The transformation resembles how soft targets are constructed in the knowledge distillation formulation, where teacher logits are softened. Kullback-Leibler (KL) divergence loss is used to make sure the distribution of student logits is similar to that of the teacher logits. We will discuss how the label representations are used as soft targets as part of optimizing multimodal product categorization in the next section.
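A rough sketch of the adjacency-matrix variant is shown below. It assumes the soft target is a temperature-scaled softmax over negated distances, by analogy with softened teacher logits in knowledge distillation; the distance values, the temperature, and the function names are all illustrative assumptions rather than the exact transformation from the illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def labels_from_distances(distances, temperature=1.0):
    """Smaller distance -> larger weight; each row becomes a soft target."""
    return [softmax([-d for d in row], temperature) for row in distances]


def kl_divergence(target, predicted):
    """KL(target || predicted); zero-weight target entries contribute nothing."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0.0)


# Hypothetical pair-wise distances (zero diagonal, symmetric).
distances = [
    [0.0, 1.0, 4.0],  # A: Cowboy Boots
    [1.0, 0.0, 4.0],  # B: Rain Boots
    [4.0, 4.0, 0.0],  # C: T-Shirts
]

soft_targets = labels_from_distances(distances, temperature=1.0)
perfect_match = kl_divergence(soft_targets[0], soft_targets[0])
mismatch = kl_divergence(soft_targets[0], soft_targets[2])
```

A higher temperature flattens the soft targets, and a lower one pushes them toward one-hot vectors.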

Semantic Label Representation

Whereas our first proposed approach aims to encode semantic similarities in the label representation, our second proposed approach designs a label representation that captures feature similarities. To motivate this approach, we will describe an experiment using the synthetic datasets from [5]. The key idea is to use multi-label classification to preserve the semantic relationship between closely related labels.

The first synthetic dataset, with four classes, is generated in such a way that fine-grained features are shared between classes. For example, Classes 1 and 2 both have rounded rectangles, and Classes 1 and 3 both have equilateral triangles. In fact, only two pairs of classes (Class 1 and 4, and Class 2 and 3) do not share any features, whereas every other pair shares some. A multi-class classifier can be trained without the benefit of prior information about the similarities between the classes. However, given the known shared features, we could use binary codes to represent each class, such that classes sharing a feature have the same value in the corresponding position of the code.

  • Class 1: [1, 0, 1, 0]
  • Class 2: [1, 0, 0, 1]
  • Class 3: [0, 1, 1, 0]
  • Class 4: [0, 1, 0, 1]
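The binary codes above can be checked against the stated feature sharing with a few lines of code. The sketch below also shows one hypothetical way to turn the per-bit outputs of a multi-label classifier back into a class, by picking the class whose code is closest; the decoding rule and function names are assumptions for illustration.

```python
CLASS_CODES = {
    1: [1, 0, 1, 0],
    2: [1, 0, 0, 1],
    3: [0, 1, 1, 0],
    4: [0, 1, 0, 1],
}


def shared_positions(class_a, class_b):
    """Count code positions where two classes have the same value."""
    return sum(a == b for a, b in zip(CLASS_CODES[class_a], CLASS_CODES[class_b]))


def decode(bit_probabilities):
    """Return the class whose binary code is closest (L1) to the predicted bits."""
    def distance(code):
        return sum(abs(p - c) for p, c in zip(bit_probabilities, code))
    return min(CLASS_CODES, key=lambda label: distance(CLASS_CODES[label]))


# Pairs stated to share no features have no matching positions.
print(shared_positions(1, 4), shared_positions(2, 3))
# Pairs stated to share features match in some positions.
print(shared_positions(1, 2), shared_positions(1, 3))

# Hypothetical sigmoid outputs of a multi-label classifier for one sample.
predicted_class = decode([0.9, 0.2, 0.3, 0.8])
```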

The labels for the four classes are effectively binarized, and a multi-class task is replaced by a multi-label task. The trained multi-label classifier is encouraged to learn the underlying features, potentially pooling training samples that share common features. We compared the performance of the similarity-informed multi-label classifier with that of the multi-class classifier. The multi-label classifier (the upper curve in the plot below) has a slight advantage in terms of validation accuracy and learning efficiency. The intuition is that if we can encode the common features shared by classes in the label representation, the multi-label classifier converges faster than the multi-class classifier and reaches a somewhat better validation accuracy.

Multi-label vs. Multi-class (solid lines are TensorBoard smoothed plots of the actual data in gray lines)

We conducted a similar experiment with the second synthetic dataset and found a consistent result. Furthermore, we purposely assigned the “wrong” binary encoding, such that the label representation contradicts the shared features. As might have been expected, we observed that the performance of the multi-label classifier is worse than that of the baseline multi-class classifier. This shows that our second proposed approach is contingent on binarized label representation properly capturing the prior semantic relationship between related classes. For application to product categorization with a large number of classes, it turns out to be impractical to construct the binary codes that precisely capture the feature similarities. We have decided to leave this for future research.

Application on Multimodal Product Categorization

Let’s switch our focus to the application on product categorization. Our semantic-agnostic baseline model uses OpenAI’s CLIP encoders [6] to get embeddings for both product image and product title, fuses the two embeddings with an attention layer, and trains a finetuning network.

To compare our proposed approach of using semantic label representation with the baseline model, we will freeze the CLIP encoders and train only the finetuning layer, which consists of a multilayer perceptron (MLP). We will also use focal loss [7] for the multi-class loss to address imbalanced training data.

Baseline model for multimodal product categorization
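For readers unfamiliar with focal loss, here is a minimal single-sample sketch of its multi-class form. It assumes the common focusing parameter γ=2 and omits the optional class-balancing factor from [7]; the hyperparameters actually used in our model are not specified in this post.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, true_class, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, which down-weights easy samples."""
    p_t = softmax(logits)[true_class]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss([4.0, 0.0, 0.0], true_class=0)   # confidently correct
hard = focal_loss([0.0, 4.0, 0.0], true_class=0)   # confidently wrong
cross_entropy_easy = focal_loss([4.0, 0.0, 0.0], true_class=0, gamma=0.0)
```

With γ=0 the expression reduces to plain cross-entropy. With γ>0, well-classified samples contribute much less to the loss, which helps when a few large categories dominate the training data.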

For our proposed approach of semantic label smoothing, we use CLIP embeddings of the product category names to construct soft targets. In the special case where the labels are natural-language words, each label is softened using its top 20 nearest neighbors, based on cosine similarity between the CLIP embeddings of the labels. We combine the focal loss (the multi-class loss) with the KL loss between the logits and the soft targets, and our training objective is to minimize this combined loss.
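The soft-target construction can be sketched as follows. Small made-up 2-D vectors stand in for the CLIP embeddings of the category names, k=2 stands in for the top 20 neighbors, and the weighting scheme (1-α on the true label, α shared among neighbors in proportion to cosine similarity) is an assumption for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(embeddings, label, k):
    """Return (index, similarity) of the k labels most similar to `label`."""
    scored = [(cosine(embeddings[label], e), j)
              for j, e in enumerate(embeddings) if j != label]
    scored.sort(reverse=True)
    return [(j, sim) for sim, j in scored[:k]]


def semantic_soft_target(embeddings, label, k=20, alpha=0.1):
    """1 - alpha on the true label; alpha shared among top-k neighbors by similarity."""
    target = [0.0] * len(embeddings)
    target[label] = 1.0 - alpha
    neighbors = nearest_neighbors(embeddings, label, k)
    weights = [max(sim, 0.0) for _, sim in neighbors]  # keep weights non-negative
    total = sum(weights)
    for (j, _), w in zip(neighbors, weights):
        # Fall back to an equal split if every similarity was clipped to zero.
        target[j] = alpha * (w / total if total > 0.0 else 1.0 / len(neighbors))
    return target


# Hypothetical label embeddings (stand-ins for CLIP text embeddings).
label_embeddings = [
    [1.0, 0.1],  # 0: Cowboy Boots
    [0.9, 0.3],  # 1: Rain Boots
    [0.6, 0.6],  # 2: Sandals
    [0.1, 1.0],  # 3: T-Shirts
]

soft_target = semantic_soft_target(label_embeddings, label=0, k=2, alpha=0.1)
```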

Multimodal product categorization with semantic label smoothing
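A single-sample sketch of the combined objective is given below. It assumes the KL term is computed between the soft target and the softmax of the logits, and that the two terms are added with a weight `kl_weight`; that weight and the example numbers are hypothetical, since the exact weighting is not stated here.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(probs, true_class, gamma=2.0):
    p_t = probs[true_class]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def kl_divergence(target, probs):
    """KL(target || probs); zero-weight target entries contribute nothing."""
    return sum(t * math.log(t / p) for t, p in zip(target, probs) if t > 0.0)


def combined_loss(logits, true_class, soft_target, gamma=2.0, kl_weight=1.0):
    probs = softmax(logits)
    return (focal_loss(probs, true_class, gamma)
            + kl_weight * kl_divergence(soft_target, probs))


# Soft target for class 0 whose closest neighbor is class 1; class 3 is unrelated.
soft_target = [0.9, 0.08, 0.02, 0.0]

near_miss_logits = [1.0, 2.0, -2.0, -2.0]  # wrong, but favors the close neighbor
far_miss_logits = [1.0, -2.0, -2.0, 2.0]   # wrong, and favors the unrelated class

near_miss = combined_loss(near_miss_logits, 0, soft_target)
far_miss = combined_loss(far_miss_logits, 0, soft_target)
```

Both predictions assign the same probability to the true class, so their focal losses are equal. The KL term is what makes the far miss more costly than the near miss.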

Experimental Results

Our training dataset consists of close to 12 million items across 4,350 product categories. The dataset goes through a data cleaning process to remove duplicate items and items whose titles and images are inconsistent. The product category labels are manually curated, some through crowdsourcing and others by our in-store associates. The dataset is known to be fairly imbalanced but nonetheless representative of the product category distribution in the catalog.

Next, we train a product category classifier using one of the following approaches to experiment with semantic label smoothing:

  1. OHE (baseline model)
  2. Uniform label smoothing with α=0.1 (semantic agnostic)
  3. Uniform label smoothing with α=0.1 for the top 20 nearest neighbors (semantic-aware)
  4. Label smoothing based on semantic similarities for the top 20 nearest neighbors (semantic-aware)
  5. Combine approach #4 with curriculum learning (semantic-aware)
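To make the difference between approaches #2 and #3 concrete, the sketch below builds both targets for a toy problem with six classes and two neighbors instead of 4,350 classes and 20 neighbors. The function names and the neighbor list are hypothetical.

```python
def smooth_over_all_classes(true_class, num_classes, alpha=0.1):
    """Approach #2: alpha is split equally among all other classes."""
    share = alpha / (num_classes - 1)
    return [1.0 - alpha if k == true_class else share for k in range(num_classes)]


def smooth_over_neighbors(true_class, neighbor_ids, num_classes, alpha=0.1):
    """Approach #3: alpha is split equally among the nearest neighbors only."""
    target = [0.0] * num_classes
    target[true_class] = 1.0 - alpha
    for j in neighbor_ids:
        target[j] = alpha / len(neighbor_ids)
    return target


semantic_agnostic = smooth_over_all_classes(0, 6, alpha=0.1)
semantic_aware = smooth_over_neighbors(0, [1, 2], 6, alpha=0.1)

print([round(p, 3) for p in semantic_agnostic])
print([round(p, 3) for p in semantic_aware])
```

Approach #4 differs from approach #3 only in that the neighbors' shares are weighted by similarity rather than split equally.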

To make the comparison between the semantic-agnostic models and the semantic label smoothing models fair, we will control the weights in the soft label representations as follows:

To be clear, the semantic-agnostic models include the first two approaches, with the first approach being our baseline model. The reason for including the second approach, with uniform label smoothing and also the KL loss, is to establish how much of the improvement might be due to uniform label smoothing. It is necessary to experiment with the incremental changes to the weights, keeping the training mechanics the same in order to thoroughly understand the improvement attributed to the semantic-aware label representation. Moreover, the last approach uses curriculum learning [8] to harden the targets over the course of training. This technique is orthogonal to the choice of the initial label representations, and we combine it only with the fourth approach, label smoothing based on semantic similarities, to explore the potential benefit of curriculum learning.
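One simple way to picture "hardening the targets" is a blend between the soft target and the one-hot target that shifts toward one-hot as training progresses. The linear schedule below is only an illustrative assumption; it is not necessarily the schedule used in [8] or in our experiments.

```python
def hardened_target(soft_target, true_class, progress):
    """Blend a soft target toward one-hot; progress runs from 0.0 to 1.0."""
    return [
        (1.0 - progress) * p + progress * (1.0 if k == true_class else 0.0)
        for k, p in enumerate(soft_target)
    ]


soft_target = [0.9, 0.08, 0.02, 0.0]
num_epochs = 5

for epoch in range(num_epochs):
    progress = epoch / (num_epochs - 1)  # hypothetical linear schedule
    target = hardened_target(soft_target, true_class=0, progress=progress)
    print(epoch, [round(p, 3) for p in target])
```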

Each product category classifier will be tested against a hold-out validation dataset that consists of around 10,000 samples. In addition to top-1 and top-5 accuracy, we will measure how many of the top predictions fall within the top 20 nearest neighbors of the ground truth label. This is a simple metric for the severity of the classification mistakes when the predictions are off target. A detailed analysis of the need for hierarchical loss metrics, which we won’t be discussing here, can be found in [9] and [10].
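The metrics can be sketched per sample as follows. The neighbor set is supplied by the caller; whether the ground truth label counts as its own neighbor is not specified above, so this sketch leaves that choice to the caller. The scores and neighbor set are toy values.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring classes, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def top_k_hit(scores, true_class, k):
    """True if the ground truth label is among the top-k predictions."""
    return true_class in top_k(scores, k)


def neighbor_overlap(scores, true_neighbors, k):
    """How many of the top-k predictions are nearest neighbors of the true label."""
    return sum(1 for j in top_k(scores, k) if j in true_neighbors)


# Toy example: six classes, ground truth is class 2, whose neighbors are {1, 3}.
scores = [0.05, 0.40, 0.30, 0.15, 0.10, 0.00]
true_class = 2
true_neighbors = {1, 3}

top1_correct = top_k_hit(scores, true_class, k=1)
top3_correct = top_k_hit(scores, true_class, k=3)
overlap_at_3 = neighbor_overlap(scores, true_neighbors, k=3)
```

Averaging `neighbor_overlap` over the validation set gives the kind of number reported for the top-5 predictions.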

Table 1 summarizes the improvement of the top-1 and top-5 accuracies compared to the baseline model. The semantic-aware approach #3 has the best top-1 performance. Surprisingly, the semantic-agnostic approach #2 has the second-best top-1 performance. The semantic-aware approach #5 with curriculum learning has the best top-5 performance.

Table 1: Improvement of Top-1 and Top-5 Accuracies

Table 2 shows the number of top predictions falling within the top 20 nearest neighbors of the ground truth label. For example, the baseline model has only 1.55% of the items with all top-5 predictions overlapping with the ground truth label’s top neighbors, and on average, only 2.25 of the top-5 predictions fall within the range. This metric shows that the semantic-aware approach #4, the closest implementation for our proposed semantic label smoothing approach, is the best model with 4.05 of the top-5 predictions falling within close range of the ground truth label; and the semantic-aware approach #3, an approximation of approach #4 with uniform label smoothing, is the second-best model.

Table 2: Number of Top Predictions as Close Neighbors of the True Label

Based on these results, the semantic-aware approach #4 (label smoothing based on semantic similarities for the top 20 nearest neighbors) is recommended from the perspective of semantic label representation. When used together with another model with the best top-1 accuracy, it helps to strike a balance between the accuracy and the semantic similarity of top predictions. Another takeaway is that the semantic-aware approach #3 (uniform label smoothing with α=0.1 for the top 20 nearest neighbors) seems to be a good approximation of approach #4. When faced with challenges in quantifying the semantic relationship due to ambiguity in taxonomy, we could fall back on qualitative knowledge of the closely related product categories and identify the nearest neighbors without having to rank them in absolute order of closeness. Under this circumstance, the semantic-aware approach #3 allows us to make the most of semantic label representation.


This work would not have been possible without the support of Alessandro Magnani, Ming Sun, and Zepu Zhang. Special thanks go to Brian Seaman for his guidance in the application of emerging technologies in Walmart Global Tech.


  1. Bertinetto, Luca, et al. “Making better mistakes: Leveraging class hierarchies with deep networks.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
  2. Müller, Rafael, Simon Kornblith, and Geoffrey Hinton. “When does label smoothing help?” arXiv preprint arXiv:1906.02629 (2019).
  3. Liu, Chihuang, and Joseph JaJa. “Class-Similarity Based Label Smoothing for Confidence Calibration.” International Conference on Artificial Neural Networks. Springer, Cham, 2021.
  4. Fergus, Rob, et al. “Semantic label sharing for learning with many categories.” European Conference on Computer Vision. Springer, Berlin, Heidelberg, 2010.
  5. Dippel, Jonas, Steffen Vogler, and Johannes Höhne. “Towards Fine-grained Visual Representations by Combining Contrastive Learning with Image Reconstruction and Attention-weighted Pooling.” arXiv preprint arXiv:2104.04323 (2021).
  6. Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” arXiv preprint arXiv:2103.00020 (2021).
  7. Lin, Tsung-Yi, et al. “Focal loss for dense object detection.” Proceedings of the IEEE international conference on computer vision. 2017.
  8. Dogan, Ürün, et al. “Label-similarity curriculum learning.” European Conference on Computer Vision. Springer, Cham, 2020.
  9. Wu, Cinna, Mark Tygert, and Yann LeCun. “A hierarchical loss and its problems when classifying non-hierarchically.” arXiv preprint arXiv:1709.01062 (2017).
  10. Narayana, Pradyumna, et al. “HUSE: Hierarchical universal semantic embeddings.” arXiv preprint arXiv:1911.05978 (2019).



Binwei is a Distinguished Data Scientist at Walmart Global Tech. His current interests span computer vision and tooling for better ROI on data science.