Matching duplicate items to improve catalog quality

How Coupang uses image and text deep embeddings with FAISS to index and search millions of items for item matching

Coupang Engineering
Coupang Engineering Blog
10 min read · Jul 1, 2022


By Eway Zhan and Amir Aghamousa

This post is also available in Korean.

Have you ever come across multiple versions of the same product in your search results while shopping online? This is a common problem in online marketplaces that not only damages the customer experience, but also negatively impacts the retailer by occupying online real estate, obscuring seller competition, and damaging catalog quality.

At Coupang, we call this phenomenon duplicate items. In this post, we will share how we minimized duplicate items at Coupang by leveraging image and text embeddings in our item matching system.

Table of contents

· Detecting similarity
· Matching as search
· Experiment setup
· Experiment results
· Service architecture
· What’s next?

Detecting similarity

Humans can quickly detect duplicate items by comparing them for similarity and recognizing product identity. For computers to detect similarity and identity, item information (item images and titles) must be transformed into numerical representations that can be comprehended by machines. In addition, item information must be transformed in a way that reflects human perceived similarity and identity.

In essence, we are attempting to quantify human perceived similarity of images and text into numerical representations, which usually take the form of vectors. For item comparison, we want the distances between these vectors to convey a sense of similarity that humans detect.

The traditional approach to generating vector representations uses word count histograms as vectors for text inputs. Although more than 20 years of tweaks and tricks have refined this approach, its core idea does not deviate much from the word count histogram, and it leaves little room to optimize the vectors for a specific task or dataset.
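To make the traditional approach concrete, here is a minimal sketch of the word-count-histogram idea: two titles are vectorized by their term counts and compared with cosine similarity. The example titles are invented for illustration.

```python
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count histograms of two titles."""
    ha, hb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ha[w] * hb[w] for w in ha.keys() & hb.keys())
    norm = math.sqrt(sum(v * v for v in ha.values())) \
         * math.sqrt(sum(v * v for v in hb.values()))
    return dot / norm if norm else 0.0

# Two invented titles for the same product; only shared terms contribute.
print(bow_cosine("apple iphone 13 pro 128gb silver",
                 "iphone 13 pro max 128gb blue"))
```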

The contemporary approach uses embedding vectors, usually deep embedding vectors generated by deep neural networks (DNNs). The largest difference between the two approaches is that embedding vectors can be optimized by DNNs and customized to improve performance on specific tasks and datasets. In addition, embedding vectors can be calculated for both text and image inputs in the same pipeline.

Image and text embedding vectors by DNN
Figure 1. A chronological overview of how embedding vectors are calculated.

An embedding vector is obtained via a mapping from a specific domain into a Euclidean metric space. A deep embedding vector is derived through the nonlinear mapping of DNNs. By mapping images and texts into a metric space with well-defined distances, the similarity of two vectors can be efficiently calculated.
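Once items live in such a metric space, comparing them reduces to cheap vector arithmetic. A minimal sketch with two stand-in embedding vectors (the 128-dimensional random vectors are illustrative, not our production embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.random(128), rng.random(128)                 # stand-in embedding vectors

l2  = np.linalg.norm(u - v)                             # Euclidean distance
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))   # cosine similarity
```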

Another advantage of embedding vectors is that they can be computed efficiently by leveraging the massively parallel architecture of Graphics Processing Units (GPUs). This parallelism can be realized not only at the device level, but also across machines and clusters of machines. These efficient calculations are key to building a platform on deep embeddings, as they facilitate effective storage, retrieval, and organization of embedding vectors.
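As a rough illustration of this batch parallelism, the sketch below embeds a whole batch of images in a single GPU forward pass. It uses an off-the-shelf ResNet-18 backbone purely as a stand-in encoder; it is not our production model.

```python
import torch
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in image encoder: ResNet-18 with its classification head removed,
# so it outputs 512-d embedding vectors instead of class logits.
encoder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()
encoder = encoder.to(device).eval()

images = torch.randn(256, 3, 224, 224)        # a stand-in batch of item images
with torch.no_grad():
    embeddings = encoder(images.to(device))   # the whole batch embeds in one pass
print(embeddings.shape)                       # torch.Size([256, 512])
```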

A parallel structure for training image and text embedding machine learning models
Figure 2. The parallel structure of GPUs allows efficient computation for millions of items whether it be images or text.

In this project, we compare the recall of deep embedding approaches against the traditional vector space approach, which is readily available in Elasticsearch (ES). We assume that a deep embedding model has been trained to our satisfaction; its training and optimization will not be discussed further here.

Matching as search

Matching is difficult to scale because it has quadratic time complexity: every item must be compared against every other item in the set. This was a large hurdle for us, since each category at Coupang has millions, or sometimes even hundreds of millions, of items.

Due to such complexity, we cast matching as a search problem. Given an input item, we retrieve a subset of matching candidates that are most likely to be true matches to the request item. Subsetting is particularly effective when there is only a small number of items that match the input item. By transforming the matching problem into a search problem, we can then optimize for recall to enhance matching accuracy.
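The reformulation can be sketched in a few lines. Naive matching compares every pair of items; matching as search retrieves only k candidates per item from an index and verifies just those. The `index.search` call below is a stand-in for any retrieval backend, such as ES or FAISS:

```python
# Naive matching: O(n^2) pairwise comparisons.
def match_all_pairs(items, is_match):
    return [(a, b) for i, a in enumerate(items)
                   for b in items[i + 1:] if is_match(a, b)]

# Matching as search: O(n * k) verifications, with k much smaller than n.
def match_via_search(items, index, is_match, k=50):
    return [(item, cand) for item in items
                         for cand in index.search(item, k)
                         if is_match(item, cand)]
```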

Experiment setup

In our experiments, we compare our baseline approach, a traditional vector space model using ES, to our alternative approach, a deep embedding vector similarity model using FAISS (Facebook AI Similarity Search).

Baseline model

First, let’s address why we need a baseline model. Remember, our goal is to improve recall for item matching. To measure recall over millions of items, we must obtain the true positives for all items in the category, or at least for the subset of request items we are examining. Although possible on a small scale, finding true positives for a large category is unrealistic, as they must be manually audited.

Instead of manually auditing all matches, we can compare the results of our baseline and alternative models on a controlled number of matching cases. If our alternative model using the embedding vectors can provide more true matches than the baseline model, we know that we have enhanced recall. The baseline model is essential for measuring the performance of our alternative model, because we use its results instead of auditing millions of matches.

ES uses an inverted index as its main data structure. An inverted index allows for efficient computation of word count histograms for both request and database texts, hence also their similarity. We will not go into the structure of our baseline model, but more details on the experimental pipeline will be discussed in the following section.
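For a sense of what the baseline looks like in practice, here is a minimal sketch of a candidate query using the Elasticsearch Python client. The `items` index, its `title` field, and the local cluster address are assumptions for illustration, not our production setup:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed local cluster

def baseline_candidates(title: str, n: int = 50):
    # A plain full-text match query; ES scores hits against its inverted
    # index using term statistics (BM25 by default).
    resp = es.search(index="items", query={"match": {"title": title}}, size=n)
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```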

Alternative model

Now let’s go over the details of the alternative model that uses text and image embedding vectors with FAISS.

We first obtain the image and text embedding vectors from our optimized neural network model. The image vectors are obtained from item images and text vectors from item titles. Once our item images and texts have been vectorized, we must compute their similarity to find duplicate items.

FAISS facilitates efficient vector similarity indexing and retrieval for hundreds of millions of items. We create a FAISS index for both our image and text vectors, giving us two indexes. When we feed an embedding vector to FAISS, it traverses the index and returns the identifiers of the stored vectors most similar to the input.
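A minimal FAISS sketch of this index-and-search flow is shown below, using random vectors in place of our real embeddings and an exact inner-product index (at production scale, approximate indexes are the typical choice):

```python
import faiss
import numpy as np

d = 128                                          # illustrative embedding dimensionality
item_vecs = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(item_vecs)                    # now inner product == cosine similarity

index = faiss.IndexFlatIP(d)                     # exact inner-product index
index.add(item_vecs)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)            # top-10 most similar stored vectors
```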

Clustering embedding vectors using FAISS
Figure 3. By clustering true matches using FAISS, we can better organize match candidates and exhaust potential candidates at a given threshold.

Another advantage of FAISS is that it can output clusters of true match candidates that surpass a given similarity threshold. Unlike a conventional retrieval method that organizes candidates by rank, FAISS groups match candidates by their similarity level. Clustering provides significant savings in index size and retrieval time, because we obtain better candidate organization and can exhaust all candidates above a given threshold.

For example, there are over 50 million items in our consumer electronics category. Without clustering, we would have to index all of these items, and every retrieval would have to compare against all of them to return the top candidates. With a clustered data structure, only about 17 million clustered items need to be indexed and searched each time, a 3.29× saving in storage and computation.
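Continuing the FAISS sketch above, threshold-based retrieval is available through `range_search`, which returns every stored vector whose similarity to the query clears a cutoff, rather than a fixed top-k list (the 0.9 threshold is illustrative):

```python
threshold = 0.9                                  # illustrative cosine-similarity cutoff
lims, scores, ids = index.range_search(query, threshold)

# For query i, its candidates are ids[lims[i]:lims[i + 1]]; with a single
# query, everything above the threshold is:
candidates = ids[lims[0]:lims[1]]
```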

Pipeline

Experiment comparing Elasticsearch and FAISS models to find matched items
Figure 4. The experiment pipeline compares the ES baseline model and the FAISS alternative model results.

Now let’s go through the experimental procedures and explore how we compare the results of our ES baseline and FAISS alternative model. Figure 4 shows the overall experimental framework. Below is a step-by-step overview of the experiment process.

  1. Select a random set of input items.
  2. Obtain n matching candidates from the ES baseline model.
  3. Obtain n matching candidates from the FAISS alternative model. The FAISS model outputs n candidates for the image input and n candidates for the text input.
  4. Audit the true match counts for the candidates from both models using the matcher, a binary classification model that judges whether each candidate is a true match.
  5. Compare the true positive counts of the ES, FAISS image embedding, and FAISS text embedding approaches to calculate the recall enhancement (see the sketch after this list).
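Steps 4 and 5 can be summarized in a short sketch. The `matcher.is_match` call and the candidate lists are hypothetical placeholders for the audited pipeline outputs:

```python
def true_positives(request, candidates, matcher):
    """IDs of the candidates the binary matcher judges to be true matches."""
    return {c["item_id"] for c in candidates if matcher.is_match(request, c)}

tp_es    = true_positives(request, es_candidates, matcher)
tp_image = true_positives(request, faiss_image_candidates, matcher)
tp_text  = true_positives(request, faiss_text_candidates, matcher)

extra = (tp_image | tp_text) - tp_es            # true matches the baseline missed
recall_gain = len(extra) / max(len(tp_es), 1)   # relative recall enhancement
```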

Experiment results

We ran this experiment comparing the ES baseline and text and image embedding models for millions of items in each of our categories.

Beauty and food categories

Recall enhancement in beauty and food categories
Figure 5. The results of our experiments on the beauty and food categories. The percentages in the Venn diagram represent overlapping true positive percentages. We saw remarkable recall enhancement in both categories.

The left Venn diagram in Figure 5 summarizes the experiment results of the beauty category, which contains 20 million items. Both the image and text embedding methods detected additional matches beyond the ES baseline. The true positive matches found with image embeddings overlapped with the baseline by 57%, while those found with text embeddings overlapped by 35%.

Additionally, an interesting phenomenon was that there was only a 12% overlap between the true positives of the image and text embedding vectors. We called this the complementary property, borrowing a term used in ensemble learning where a set of weak learners with complementary functions are combined to form a strong model.

We also conducted the same experiment for the food category, which has approximately 17 million items. In addition to the ES true positives, the image and text embeddings produced 73% and 36% complementary true positives respectively, with an overlap of only 16% between the two. Most importantly, combining the true matches of all pipelines improved recall by 106%.

Result analysis

The recall enhancement varied among other categories, but overall, we found that the alternative model provided more true matches than ES by a meaningful margin. These results gave us confidence that our alternative model could provide a reliable foundation to build an improved matching system.

Furthermore, we analyzed the complementary property observed between the text and image embedding true positive match pairs. We wanted to know why the true positives between the text and image embeddings very rarely overlapped.

Similarity histograms for analyzing true match pairs between image and text embeddings
Figure 6. Similarity score histograms for image embedding true match pairs absent from the text embedding true matches, and for text embedding true match pairs absent from the image embedding true matches.

To understand the complementary property shown in our experiments, we examined items that were true positives in either the image or the text embedding, but not in the other. We found that an item’s identity information can often be captured by its text or its image, but not necessarily by both. Figure 7 shows examples of such items.

The first example on the left illustrates a true match pair with high image similarity but low text similarity. As you can see, the two listings are the same product, but the product’s identity information is conveyed more clearly by the image than by the text. There is little overlap in the text of the two items, yet the images are almost identical.

The second example shows a true match pair where the image and text convey an equal amount of identity information. The last example shows a true match pair where the item identity information is expressed more clearly through text than image.

Figure 7. Three examples of where text and image similarities are vastly different and vaguely similar. Due to these characteristics, there is little overlap between image and text embedding true positives. The brand names of the products have been redacted.

Service architecture

To leverage the recall enhancement and complementary property of the embedding models in our service architecture, we designed the text and image embedding pipelines to run in parallel to our ES baseline.

Service architecture and platform for item matching at work at Coupang
Figure 8. The service architecture and platform at work at Coupang. We managed to leverage the complementary property of the different models to our advantage in this matching system.

Currently, the union of the candidate sets from each pipeline is the resulting true match set of the system. In the future, we want to incorporate a combination module that can use additional item attributes, such as color or size, to improve accuracy.
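Today that combination step is literally a set union; the hypothetical sketch below (with placeholder names) shows where a future combination module would slot in:

```python
def combine_matches(es_matches: set, image_matches: set, text_matches: set) -> set:
    # Current behavior: the union of the per-pipeline candidate sets.
    # A future combination module could additionally filter this union
    # using item attributes such as color or size.
    return es_matches | image_matches | text_matches
```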

Our item matching system is made up of three major components: indexing, search, and monitoring. We built these components for both the image and text pipelines to production-grade standards of efficiency, scalability, and reliability. Both the indexing and search pipelines can handle 3,500 requests per second. Query latency ranges from 100 ms to 1 s, with a TP90 of 900 ms.

The indexing and search pipelines of the item matching system at Coupang
Figure 9. The indexing and search pipelines of our matching system. These pipelines have been designed to be efficient, reliable, and scalable.

This system has been applied to catalog item matching to reduce item duplicates and to search query results to analyze query similarities. We have plans to adapt and extend it to other areas, such as analyzing retail fashion visual similarity for product discovery and matching item images for price comparisons.

What’s next?

We asked ourselves whether it was possible to emulate human recognition capabilities using machines to the scale of millions of items. Looking back, we feel that we are lucky to have made significant progress towards that goal. We want to express our gratitude to the teams and leaders who cooperated with us on this project.

However, we’re not satisfied yet. We are considering the development of a master API for embedding similarities and discrete categorical attributes, as well as an SDK to enable other teams to adopt and extend the system. Furthermore, as applications and use cases increase, we aim to build a model management module to support a variety of DNN models for storage optimization and cost efficiency.

If you think you have what it takes to improve the results and methods presented here, join our team!
