Product Matching in eCommerce using deep learning

Walmart.com is a marketplace with hundreds of millions of products and thousands of sellers. Given a seller who wishes to set up a new product offer on Walmart.com, we need to determine whether we already carry this product in our catalog. If we fail to make that determination, then when a user searches for that product on our website, they may see duplicates in their search results, which is not a desirable user experience. So we need to form groups of identical items sold by different sellers, as shown below.

Seller choice page for a sample item

On the other hand, we need to be precise in the creation of groups and ensure they contain identical products sold by different sellers. Once a group is formed, it is represented by exactly one item page with a single title, single description and a single price on the item page. Although customers can select other sellers and see their prices on the seller choice page as shown above, the individual seller’s product content is not shown on this page. Thus, if the groups are not homogeneous, an individual seller’s product might be different from what is portrayed on the item page which is unacceptable.

Grouping products using universal identifiers

There are universal identifiers such as UPC, GTIN, and ISBN that can be leveraged for the purpose of identifying identical products. However, while these provide a good grouping in most cases, in a small percentage of cases we find that the universal identifier information in the incoming data is incorrect. In such situations, relying only on identifiers can lead to incorrect grouping.

For example, forming a group consisting of the new and refurbished versions of the same product, or merging a 16GB iPhone 6s offer with a 32GB one. The item page can only show one title and one price. So in the latter example, we might end up with the title showing a 16GB iPhone at the price of a 32GB one, in which case demand for that product will suffer on account of the high price. Conversely, if the title shows a 32GB iPhone at the price of a 16GB one, the deal might quickly go viral and we might end up with thousands of orders for the product (worth millions of dollars in revenue) at the reduced price, resulting in a substantial loss. Both of these scenarios are undesirable. So we need to make the final matching decision leveraging universal identifiers as well as other product content. In our solution, we use the product information we deem most reliable (title, description, image and price) to make a matching decision.

Leveraging multiple data sources

Identifying identical products is also important to construct the final item page. Products can be described in terms of their features such as brand, color, size, etc. In order to make it easier for sellers to onboard their items, most product features are not mandatory for sellers to provide. As a result, we find that different sellers may provide different features in their product feed. By utilizing different sources of information for the same product, we can increase the coverage of product specifications on the item page.

Sample product specifications on an item page

Challenges

  • As mentioned above, universal identifier based grouping works for the most part, but in the cases where it doesn't, there can be a significant impact on customer experience and revenue. The low incidence of mismatches makes it hard to obtain a large amount of labeled data of matched and mismatched pairs of items, since random sampling will yield mostly correctly matched groups. Randomly picking two different items to form a mismatched pair does not fully work either, since in many scenarios mismatched items are almost identical except for one key attribute (e.g. condition, storage size, color, etc).
  • Titles of matching products may not be identical but contain semantically alike tokens. On the other hand, mismatching products may differ on a single attribute and consequently their corresponding titles may differ by as little as one character (e.g. 6 pack Coke vs 8 pack Coke).
  • Incoming attribute data may be missing or noisy.
  • Identical products may or may not have same/similar primary images.
  • High price differential may be a strong indicator of a mismatch but by itself is rarely conclusive since identical products may have highly varying prices across different sellers while certain kinds of mismatched products (e.g. different color or sports team branding) may still have very close or equal prices.

Approach

We propose an approach that leverages product information that we deem most reliable — title, description, images and price in order to arrive at a matching decision. The system consists of several components and depending on the specific use cases, some or all of these components may be leveraged.

  • Title Similarity: Given a pair of product titles, quantifying their degree of similarity.
  • Image Similarity: Given a pair of product images, quantifying their degree of similarity.
  • Attribute extraction/detection: Identifying key attributes such as brand, condition, color, model number, etc from available data for each product and measuring the discrepancies in the attribute values.
  • Price outlier identification: Given a group of offers for a product and their corresponding prices, identifying whether an incoming offer price is an outlier in this price distribution.

We describe each of these pieces below.

Title similarity

Let’s start with an example product with offers from 5 different sellers. We list their titles below:

  • Garmin nuvi 2699LMTHD GPS Device
  • nuvi 2699LMTHD Automobile Portable GPS Navigator
  • Garmin nuvi 2699LMTHD — GPS navigator — automotive 6.1 in
  • Garmin Nuvi 2699lmthd Gps Device
  • Garmin nuvi 2699LMT HD 6" GPS with Lifetime Maps and HD Traffic (010–01188–00)

As can be seen from the examples above, the same product sold by different sellers can have significantly different-looking titles.

We build a neural network model for estimating title similarity. The system architecture is described below.

Architecture diagram for title similarity

The first layer uses word level embeddings which are pretrained on the entire catalog of all titles (to more effectively handle words not seen in the training data for title similarity). Prior to this training, there is a preprocessing step to identify phrases using pointwise mutual information. This results in treating two or more tokens (e.g. Hewlett Packard) as a single entity for the purposes of computing word level embedding. We trained 100 dimensional embeddings using the skip gram model.
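The phrase-identification step can be sketched with a pointwise mutual information (PMI) computation over adjacent token pairs. This is a minimal illustration; the function names, counts, and thresholds below are illustrative, not the production values:

```python
import math
from collections import Counter

def find_phrases(tokenized_titles, min_count=2, pmi_threshold=3.0):
    """Identify high-PMI bigrams to treat as single tokens."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in tokenized_titles:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue
        # PMI = log P(a, b) / (P(a) * P(b))
        pmi = math.log((n_ab / total) / ((unigrams[a] / total) * (unigrams[b] / total)))
        if pmi >= pmi_threshold:
            phrases.add((a, b))
    return phrases

def merge_phrases(tokens, phrases):
    """Rewrite e.g. ['hewlett', 'packard'] as ['hewlett_packard']."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

After this preprocessing, the merged tokens are embedded as single entities by the skip-gram model.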

We train a convolutional neural network on the concatenated padded titles (to ensure equal length) with a cross entropy loss function.

As mentioned earlier, it’s hard to obtain a large amount of labeled data for this problem. To circumvent this, we trained the title similarity model entirely on synthetic labels. The training data was created as follows:

  • A small subset of our catalog consists of UPC validated products. For each such product, a pair of titles from different sources was randomly sampled and added as a matched pair. This can potentially include identical title pairs.
  • We also want the similarity measure to capture the scenario where one title for a product may contain more or less information than another. Towards this end, a random title from the catalog was paired with a copy of itself with some tokens randomly dropped, and also added as a matched pair.
  • Two different titles from the product catalog were randomly sampled and added as a mismatched pair.
  • To model the case where a pair of products differing in one key attribute may get matched together due to identical identifiers, we add the following pairs to the training set:
  1. A title containing a particular attribute value is sampled from the catalog. The substring corresponding to the attribute value is identified and replaced with a different value of the same attribute. The original title and the modified title are added as a mismatched pair.
  2. Steps are taken to ensure that the substituted value of the attribute is not (almost) synonymous with the original (e.g. “pre-owned” and “used” values for condition attribute or “vermillion” and “red” for color).
  3. Pairs such as above are currently added for attributes like condition, package quantity, color, sports team, etc based on the distribution of observed mismatches.
  • For symmetry of the similarity measure, for each pair (t1, t2), the pair (t2, t1) is also added with the same matching decision.
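The recipe above can be sketched in a few lines. The attribute values, synonym list, drop probability, and helper names here are illustrative stand-ins for the real generation pipeline:

```python
import random

# Illustrative attribute values; the real pipeline draws these from the
# distribution of observed mismatches (condition, package quantity, etc.).
CONDITION_VALUES = ["new", "refurbished", "pre-owned"]
# Near-synonymous values that must not be used as substitutes.
SYNONYMS = {"pre-owned": {"used"}, "used": {"pre-owned"}}

def drop_tokens(title, p_drop=0.2, rng=random):
    """Matched pair: pair a title with a copy that randomly drops tokens."""
    kept = [t for t in title.split() if rng.random() > p_drop]
    return " ".join(kept) if kept else title

def substitute_attribute(title, attribute_values, rng=random):
    """Mismatched pair: replace an attribute value in the title with a
    different, non-synonymous value of the same attribute."""
    tokens = title.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in attribute_values:
            alternatives = [v for v in attribute_values
                            if v != tok.lower()
                            and v not in SYNONYMS.get(tok.lower(), set())]
            if alternatives:
                tokens[i] = rng.choice(alternatives)
                return " ".join(tokens)
    return None  # the title does not mention this attribute

def add_symmetric(pairs, t1, t2, label):
    """For symmetry, add both (t1, t2) and (t2, t1) with the same label."""
    pairs.append((t1, t2, label))
    pairs.append((t2, t1, label))
```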

We experimented with a few different neural network architectures. The model comparisons are tabulated below.

  • Training set size: 14,797,276
  • Validation set size: 3,699,319
Model comparison for title similarity

A few example results are shown below.

Example results for title similarity

Image similarity

For image similarity, we again faced the problem of having an insufficient amount of labeled data. The problem is compounded by the fact that images of the same product sold by different sellers may not be the same (e.g. different perspective, different color temperature, different scale/aspect ratio, etc).

In order to avoid collecting manual judgements for a large number of image pairs, we decided to use an indirect approach for this component. We trained several image based models on an auxiliary taxonomy such as the one that can be navigated on the Walmart website. For example, the following product belongs to the Electronics > TV & Video > Smart TVs node in the taxonomy.

Front end taxonomy

We used several architectures based on their performance on ImageNet data. For the similarity metric, we use the first dense layer of the network as a feature extractor and compute cosine similarity in that feature space. The reason for using multiple models is that we find different models are sensitive to different idiosyncrasies of the product image data. We show the overall system architecture and some examples below.

Block diagram for image similarity
Example results for image similarity
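Given feature extractors for the pretrained networks, the ensemble similarity reduces to a simple average of per-model cosine similarities. The extractor interface below is a hypothetical stand-in for each network's first dense layer:

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def ensemble_image_similarity(image_a, image_b, extractors):
    """Average cosine similarity over several feature extractors, each
    standing in for the first dense layer of one pretrained CNN."""
    scores = [cosine_similarity(f(image_a), f(image_b)) for f in extractors]
    return sum(scores) / len(scores)
```

Averaging over several models exploits the observation that different architectures are sensitive to different idiosyncrasies of the product image data.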

Attribute extraction/detection

Attributes are specific features of products such as brand, model number, condition, color, etc. We broadly distinguish between two categories of attributes:

  • Closed value list: Attributes with a fixed set of values (either by nature or by design). For example, condition, color, book format can be considered in this category.
  • Open value list: Attributes without a fixed set of values, e.g. brand, model number, etc.

For the first set of attributes, we employ a text classification model using a convolutional neural network. The architecture is similar to the title similarity model with the input being only one product title and the last layer replaced by a softmax layer over the class labels.
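As a rough illustration of the closed-value-list idea (a keyword baseline, not the CNN classifier described above), detecting the condition attribute might look like:

```python
# Illustrative closed value list for the "condition" attribute.
CONDITION_VALUES = {"new", "refurbished", "pre-owned", "used"}

def detect_condition(title):
    """Keyword baseline: scan title tokens against the closed value list.
    The production system uses a CNN classifier over the title instead."""
    for token in title.lower().split():
        if token in CONDITION_VALUES:
            return token
    return None
```

The CNN classifier improves on such a baseline by handling titles where the value is implied by context rather than stated verbatim.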

For the second set of attributes, we set up attribute extraction as a sequence labeling problem. We label titles using the BIO encoding scheme as follows:

Every product title is tokenized and each token is assigned one of the following three labels:

  • B-brand: first token in a brand name
  • I-brand: intermediate token of a brand name
  • O: not part of a brand name

Example:

Manual (B-brand) Woodworkers (I-brand) and (I-brand) Weavers (I-brand) SDPMSAR (O) Mrs. (O) Always (O) Right (O) Printed (O) Pillow (O) Vivid (O) Colors (O) 12 X 12 (O) inch (O)
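When the brand span is known (e.g. from seller-provided structured data), BIO labels for training can be generated programmatically. A minimal sketch, with an illustrative helper name:

```python
def bio_encode(tokens, brand_tokens):
    """Assign B-brand / I-brand / O labels to a tokenized title,
    given the known brand span (case-insensitive match)."""
    labels = ["O"] * len(tokens)
    n = len(brand_tokens)
    target = [b.lower() for b in brand_tokens]
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == target:
            labels[i] = "B-brand"
            for j in range(i + 1, i + n):
                labels[j] = "I-brand"
            break
    return list(zip(tokens, labels))
```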

We use the following neural network model for extracting these attributes. There are several more subtle issues associated with extracting these attribute values. A comprehensive discussion is presented in the paper Attribute Extraction from Product Titles in eCommerce.

Architecture diagram for attribute extraction from titles

A comparison of the above model to more classical sequence modeling techniques and some baseline approaches is presented below for the case of brand extraction. The precision and recall of the bidirectional LSTM network is superior to all other models, although Conditional Random Fields (CRFs) and Structured Perceptron achieve results which are nearly on par.

Model comparison for brand extraction from titles

Price outlier detection

For this component, the goal is to identify whether an incoming offer price is an outlier in the distribution of current product prices of the target group in which the offer is intended to be merged. This test is only applied if the incoming price is higher or lower than all other prices in the group. We set this up as a hypothesis testing problem.

H0: There are no outliers in the combined set of prices

Ha: The incoming price is an outlier

We experimented with the following outlier detection techniques:

  1. Dixon Q test
  2. Grubbs’ test
  3. Chi-squared test
  4. Bartlett test

The chi-squared test is a simple test for outlier detection in one-dimensional data. We assume a normal distribution for the current product prices in the group and determine whether the incoming price is an outlier for this distribution. The main disadvantage of the chi-squared test is that it doesn't take sample size into account.

Dixon Q test and Grubbs' test factor in the sample size, which makes them preferable. Even so, if the original set of prices is tightly clustered, even a small dollar deviation in the incoming price gets flagged. To counter this, we also employ a variance-sensitive test, the Bartlett test. We use this test to check whether there is a significant difference between the variance of the original set of prices and the variance of the prices with the incoming offer price added.

We currently flag the incoming price as an outlier if both Grubbs’ test and Bartlett test deem it as an outlier at the 0.05 significance level.
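A sketch of this combined rule, with Grubbs' test implemented from the t-distribution critical value and Bartlett's test taken from SciPy. This illustrates the tests described above and is not the production code:

```python
import math

import numpy as np
from scipy import stats

def grubbs_outlier(prices, candidate, alpha=0.05):
    """Two-sided Grubbs' test on the combined set of prices: is the
    candidate price the outlying value? Needs at least 3 prices total."""
    x = np.asarray(list(prices) + [candidate], dtype=float)
    n = len(x)
    if n < 3:
        return False
    s = x.std(ddof=1)
    if s == 0:
        return False  # all prices identical
    g = abs(candidate - x.mean()) / s
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / math.sqrt(n)) * math.sqrt(t ** 2 / (n - 2 + t ** 2))
    return g > g_crit

def bartlett_outlier(prices, candidate, alpha=0.05):
    """Bartlett's test: does adding the candidate significantly change
    the variance of the price set?"""
    _, p_value = stats.bartlett(list(prices), list(prices) + [candidate])
    return p_value < alpha

def price_is_outlier(prices, candidate, alpha=0.05):
    """Flag only when the candidate extends the current price range and
    both tests agree at the chosen significance level."""
    if min(prices) <= candidate <= max(prices):
        return False
    return (grubbs_outlier(prices, candidate, alpha)
            and bartlett_outlier(prices, candidate, alpha))
```

Requiring both tests to agree trades some recall for precision, which matters here because a false flag blocks a legitimate offer from joining its group.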

The above tests assume that we have at least two products in the current group. In the case where we have only one current product for comparison, we flag the incoming price based on whether the ratio of the larger to the smaller price exceeds a certain threshold that is exponentially damped as a function of the smaller price.
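The single-price fallback might look like the following. The text does not specify the damping function or its parameters, so the functional form and constants here are purely hypothetical:

```python
import math

def single_price_flag(current_price, incoming_price,
                      floor=1.5, amplitude=3.0, scale=50.0):
    """With only one existing price, flag the incoming offer when the
    larger-to-smaller price ratio exceeds a threshold that decays
    exponentially with the smaller price. The constants are hypothetical,
    not the production values."""
    lo, hi = sorted([current_price, incoming_price])
    threshold = floor + amplitude * math.exp(-lo / scale)
    return hi / lo > threshold
```

With these illustrative constants, a $2 vs. $6 pair is tolerated while a $200 vs. $600 pair is flagged, capturing the intuition that large relative price gaps are common among cheap items but suspicious among expensive ones.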

Final Decision

Based on the results of the individual components, we render a match/mismatch decision. Currently we use a set of rules abstracted from historical data for which we had such labels. A sample of mismatches flagged by the algorithm is evaluated daily, and we observe precision (the fraction of flagged mismatches that are true mismatches) between 85% and 90%.

Future Work

The current system was designed with the consideration that labeled data is expensive to acquire in the context of this problem. By exploiting a large amount of unlabeled data, we have been able to build a system that significantly reduces the proportion of mismatched products in our catalog. As we gather more feedback from this exercise, we are naturally increasing the amount of labeled data we have. Some of the directions we want to follow include:

  • Building a model for title similarity leveraging pairs of titles with manual judgement of match/mismatch instead of solely relying on synthetic data.
  • Building a supervised model for image similarity directly optimized for the metric of interest rather than training a model on auxiliary labels and getting similarity as a side effect.
  • Jointly training a model using titles, images and price rather than training individual components to render the final match/mismatch decision.