Removing duplicates from a product catalog: product matching
You have probably never heard of product matching, but you have certainly seen it in action every time you shopped online. It's the engine powering price comparison websites and marketplace platforms.
In a few words: given two products from the same or different sources, with all their data points attached, you have to determine whether the two products are one and the same. For the human mind this is a very easy task, but for a computer it raises lots of challenges.
For instance, when you browse eMAG.ro and search for a book, you find something like this:
Does it make any sense? Well, obviously the first product matches the last two:
- Fluturi volumul 3 — Irina Binder
- Fluturi Vol III
- Fluturi Vol III
Also:
- Fluturi volumul 3 — Irina Binder (editie cu autograf, a signed edition) is different from the three above
Another two products that are the same are:
- Fluturi, vol I + II, editia a 2-a — Irina Binder
- Fluturi vol. 1+2 ed.2 — Irina Binder
At eMAG we use product matching for lots of purposes:
- assisting marketplace vendors with determining if their product is already listed;
- matching supplier offers to existing offers or to those of other suppliers;
- deduplicating already listed products and much more.
When we started this project we took an agile approach, focusing on delivering something of value after each iteration so that we could improve continuously. This turned out to be the most important decision we made, and the feedback we received was our most valuable resource.
Technologies used
In developing product matching we used lots of technologies, but I will mention just a few.
SOLR 5.0
First of all, we used SOLR 5.0 as the core of the matching interface, relying on its multitude of searching and indexing features. Since eMAG has lots of products, each with different characteristics, it is very hard to design an easy-to-grasp database structure. Consider the two products below:
- Name: Laptop Dell D4800, processor I5 2.0Ghz
- Brand: Dell
- Memory Ram: 4GB
- RAM Frequency: 677 Mhz
- Name: Iphone 7–64 GB Space Gray
- Brand: Iphone
- Colour: Space Gray
- Internal Memory: 64 GB
Storing and indexing in SOLR is easy using the schema-less feature provided by SOLR: basically, we transformed each characteristic name to a normalized value and used the generic fields already defined in SOLR. For example:
- Name: name_t
- Ram memory: ram_memory_t
- Internal memory: internal_memory_t
Where *_t is defined in SOLR's schema.xml as a stored text field.
This schema-less approach allowed us to store all products and their associated characteristics in a single collection. We also doubled SOLR as a storage engine, so no other database was required.
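Here is a minimal sketch of how that normalization and indexing might look, assuming a SOLR core named products and the pysolr client; the URL, field names, and sample data are illustrative, not our production setup.

import re

import pysolr  # third-party client; a plain HTTP POST to /update works as well

def normalize_field(name: str) -> str:
    # Map a free-form characteristic name to a generic dynamic field:
    # "Internal Memory" -> "internal_memory_t", matching the *_t field
    # declared in schema.xml.
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{slug}_t"

def to_solr_doc(product_id: str, characteristics: dict) -> dict:
    # Flatten a product's characteristics into one SOLR document.
    doc = {"id": product_id}
    for name, value in characteristics.items():
        doc[normalize_field(name)] = value
    return doc

solr = pysolr.Solr("http://localhost:8983/solr/products", timeout=10)
solr.add([
    to_solr_doc("p1", {
        "Name": "Laptop Dell D4800, processor I5 2.0Ghz",
        "Brand": "Dell",
        "Memory Ram": "4GB",
        "RAM Frequency": "677 Mhz",
    }),
], commit=True)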
Term Frequency — Inverse Document Frequency
For searching a product in SOLR we used the eDisMax query parser, which has an advanced TF-IDF algorithm behind it. TF-IDF stands for Term Frequency — Inverse Document Frequency. Below you can find an easy-to-grasp explanation:
Word | Occurrences in "Romeo and Juliet" | Occurrences in other works | Calculated TF-IDF
"Romeo" | 7,000 | 7 | 7,000 x (1/7) = 1,000
"and" | 10,000 | 100,000 | 10,000 x (1/100,000) = 0.1
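SOLR's actual scoring adds logarithms, smoothing, and length normalization, but the intuition in the table can be reproduced in a couple of lines (a simplified sketch, not the Lucene formula):

def simple_tf_idf(count_in_doc: int, count_in_other_docs: int) -> float:
    # Term frequency scaled down by how common the term is elsewhere.
    return count_in_doc * (1 / count_in_other_docs)

print(simple_tf_idf(7_000, 7))         # "Romeo": 1000.0 -> highly distinctive
print(simple_tf_idf(10_000, 100_000))  # "and":   0.1    -> near-noise

A rare word like "Romeo" therefore dominates the relevance score, while glue words like "and" barely register.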
Custom meta-language
During the development of product matching we concluded that we needed an easy-to-use, easy-to-grasp meta-language for creating rules that take the characteristics of products into account when identifying duplicates. We chose the old S-expression syntax, also used in LISP, and came up with a language for matching products. It works by letting us write rules for comparing two products. Let's say we are comparing books and we implement the rule below, which tells us whether two books are the same or not.
(and
  (and
    (exists base.isbn)
    (exists candidate.isbn)
  )
  (or
    (contains
      (filter "/[^0-9]?/" (getValue base "isbn"))
      (filter "/[^0-9]?/" (getValue candidate "isbn"))
    )
    (contains
      (filter "/[^0-9]?/" (getValue candidate "isbn"))
      (filter "/[^0-9]?/" (getValue base "isbn"))
    )
  )
)
It translates to something like this:
- if both the base product and the candidate product have the isbn characteristic defined,
- and
  - the isbn characteristic of the base product contains the isbn characteristic of the candidate product,
  - or the isbn characteristic of the candidate product contains the isbn characteristic of the base product,

then match the two products.
We also apply a filter to the isbn so that all unwanted characters are removed. The end result is a rule that matches products whose ISBNs come in forms like:
978–333–1111, 333–1111, 978/333/1111, etc.
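To make the semantics concrete, here is a hand-translated Python equivalent of that rule; our actual engine interprets the S-expressions directly, so this sketch only mirrors the logic:

import re

def digits_only(value: str) -> str:
    # Equivalent of the (filter "/[^0-9]?/" ...) step: keep digits only.
    return re.sub(r"[^0-9]", "", value or "")

def isbn_rule(base: dict, candidate: dict) -> bool:
    # (and (exists base.isbn) (exists candidate.isbn) ...)
    if "isbn" not in base or "isbn" not in candidate:
        return False
    b = digits_only(base["isbn"])
    c = digits_only(candidate["isbn"])
    # (or (contains b c) (contains c b))
    return c in b or b in c

assert isbn_rule({"isbn": "978-333-1111"}, {"isbn": "333-1111"})
assert isbn_rule({"isbn": "978/333/1111"}, {"isbn": "978-333-1111"})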
Word2vec
Even though SOLR and the meta-language allowed us to identify lots of duplicated products, the most powerful tool we discovered is based on Word2vec, a library developed by Google that maps words to vectors in a multidimensional space. Word2vec is a framework written in C that implements a simple two-layer neural network. When fed various phrases and texts, it gives you the ability to map relationships between words.
Suppose you have the relation between Italy and Rome and you want to apply the same relation to France. The result provided by word2vec is Paris. It also works for other types of associations, such as king -> queen, male -> female, or swim -> swimming, run -> running.
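With the gensim library and a public pretrained model, the analogy looks like this (illustrative only; we trained word2vec on our own product names rather than on news text):

import gensim.downloader as api

# Large (~1.6 GB) download on first use.
model = api.load("word2vec-google-news-300")

# Italy -> Rome, therefore France -> ?
print(model.most_similar(positive=["France", "Rome"], negative=["Italy"], topn=1))
# Expected top hit: 'Paris'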
We fed the network all the product names we had, and then used it to compare the vectors of two product names eligible for matching. Consider the following two products:
- Iphone 7S Plus — Space Gray — 16 GB
- Mobile Phone Iphone 7S Plus 16GB — Gri
The two products have almost identical words, but there are some quirks:
- one has two additional words, "Mobile Phone";
- one has "16GB" while the other has "16 GB" (one word vs. two words);
- one has "Space Gray" while the other has "Gri" (Romanian for "gray").
Since all words have vector representations in the multidimensional space, we can sum the word vectors of each product name and then calculate the distance between the two resulting vectors. If the distance is very small, we can say that the two products match.
The word2vec approach also allows us to identify similar products whose names use related words: mobile phone <-> smartphone, construction set <-> construction toy, lip gloss <-> lipstick with gloss, etc.
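A minimal sketch of that comparison, assuming model is a gensim KeyedVectors instance trained on our product names (the threshold below is hypothetical and would be tuned on feedback data):

import numpy as np

def name_vector(model, name: str) -> np.ndarray:
    # Sum the vectors of every in-vocabulary word in the product name.
    vectors = [model[w] for w in name.lower().split() if w in model]
    return np.sum(vectors, axis=0)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 - cosine similarity; close to 0 means the names point the same way.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# dist = cosine_distance(
#     name_vector(model, "Iphone 7S Plus Space Gray 16 GB"),
#     name_vector(model, "Mobile Phone Iphone 7S Plus 16GB Gri"),
# )
# match = dist < 0.05  # hypothetical threshold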
The approach
As mentioned in the beginning, we used an agile approach to develop the project, and we emphasized feedback and business value first. In our first iteration, instead of providing automatic associations, we provided suggestions: for each product being listed on eMAG through the various platforms (supplies, marketplace, etc.) we showed operators 5 suggestions they could use to identify a matching product. When an operator chose one of the suggestions, feedback was registered automatically, which we later analyzed and used to improve the suggestions. It wasn't the fully automatic process we dreamed of, but it was a basic feature that brought business value and got highly valuable feedback into our hands. The goal was to make the suggestions as relevant as possible, so we defined a KPI called relevancy: if, out of 100 suggestions provided, the operator chose the first product in the list 80 times, then we had an 80% relevancy on the first position. We tweaked the relevancy by using SOLR features such as adding boost levels for several characteristics, adding synonyms, and so on.
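For illustration, a boosted suggestion query through pysolr might look like the sketch below; the field names and boost values are hypothetical, not our production configuration:

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/products")

results = solr.search(
    "Fluturi vol 3 Irina Binder",
    **{
        "defType": "edismax",        # the eDisMax query parser
        "qf": "name_t^5 brand_t^2",  # a name hit outweighs a brand hit
        "rows": 5,                   # the five suggestions shown to operators
    },
)
for doc in results:
    print(doc["id"], doc.get("name_t"))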
We then used the suggestions as the base of our automation process. Instead of showing the suggestions to operators, we passed them through a rules system based on the LISP-like meta-language and the word2vec library. We then measured the accuracy of our results, initially aiming for 100% accuracy. We later found out that 100% accuracy is impossible, but an accuracy over 98% is good enough.
Lessons Learned
During development we had lots of failures with matching, and we learned a lot from them.
- We learned that SOLR is really hard to scale when you use schema-less indexes and double it as a storage engine. While this approach gave us an edge in delivering results, since we didn't have to worry about a second storage layer or a fixed schema, searching through such an index is slow. We later had to add lots of SOLR servers in order to scale the product deduplication processes. We now use 4 servers, each with 8 cores and 32 GB of RAM, for indexing and searching 30 million products.
- When trying to deduplicate products, only some characteristics are relevant while others are totally useless. Among the most important are ISBN, part number, and quantity.
- Feedback is the most important part of the process, and deduplicating products is very, very tricky. At one point we tried to deduplicate tires, and when we started we thought there were lots and lots of duplicated products. So we continued the deduplication process until we decided we were done, or so we thought. When we finally asked others about it, we found that none of the products flagged as duplicates was actually a duplicate. We had skipped one important characteristic, 'DOT' (the date of manufacture), which was very important from the customer's perspective.
All in all, this has been one of the toughest challenges for our team, and it really put our skills to the test. We learned about indexing, machine learning, and products while delivering powerful tools for matching products.