Content deduplication

Jgleyzes
EcoVadis Engineering
7 min read · Aug 31, 2022
Image by Markus Spiske on Pexels.

With increasing access to data from an ever-growing number of sources, receiving the same content multiple times becomes very likely. While it may not appear harmful at first sight, it can have negative impacts:

  • On efficiency: there is no point in storing and/or processing the same data multiple times.
  • On performance: in the case of metadata, like a supplier name, having multiple variations of the same supplier can prevent us from identifying two documents as being from the same source.

These problems can be mitigated by deduplication. Deduplication can range from simple, such as detecting that two strings are an exact match, to more subtle, such as understanding that two articles are talking about the same subject.

The latter is a problem we face at EcoVadis: for example, we receive tens of thousands of articles per day, from hundreds of sources. Some of them are exact duplicates, others are different articles talking about the same event.

In order to deduplicate, one needs a way of measuring the similarity between two items. In most cases, the similarity measure is applied to text, and there are multiple ways of doing this. The simplest extension to pure string comparison is the Levenshtein distance (essentially, the number of single-character edits needed to turn one string into the other). This method doesn’t scale well beyond a few words. Typically, scalable methods involve mapping text to a numerical space, where we can use standard distance measures to assess similarity. The data scientist’s job is then to find the best mapping for the problem: for example, does the mapping need to preserve the order of the words? Does it need to incorporate context (e.g. understand that “company” and “firm” are similar words even though their characters are very different)?
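As a small illustration of the string-level approach, here is a minimal Levenshtein distance sketch (in practice one would use an optimized library rather than this toy implementation):

```
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

print(levenshtein("company", "campany"))  # 1
```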

Once we have created “fingerprints” for each item in that numerical space, items will be considered duplicates if their similarity is above a given threshold. The brute force approach of computing the similarity between each pair of items has O(N²) complexity: if we have N items, there are N(N-1)/2 pairs so we need to compute (and store) N(N-1)/2 similarities. It may be fine for low values of N, but as N increases, it becomes virtually impossible to use. Therefore, a smarter way of grouping items together is necessary.

In the rest of this post, I will present a few methods we tried at EcoVadis, both for the fingerprint part and for the grouping part.

Cosine Similarity

In this first approach, the text is mapped to vectors and cosine similarity is used as the measure. The latter, which is very easy to compute, varies between -1 and 1. A value of 1 means the items are very similar, -1 means they are opposite, and 0 means they have little in common.
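For reference, cosine similarity is simply the dot product of the two vectors divided by the product of their norms; a minimal sketch in NumPy:

```
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two fingerprint vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([0.9, 0.1, 0.3])
v = np.array([0.8, 0.2, 0.4])
print(cosine_similarity(u, v))  # close to 1: very similar fingerprints
```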

Here are two ways of getting these vectors:

Latent Semantic Analysis

Latent Semantic Analysis is the combination of two methods:

  1. A term frequency-inverse document frequency (TF-IDF) step transforms the text into a vector. To do so, the TF-IDF vectorizer is trained on a corpus of texts, from which it extracts the vocabulary (the set of words used in the corpus). Then, when applied to an item, it creates a vector of the size of the vocabulary, where each entry is the product of the term frequency (how often the word appears in the item) and the inverse document frequency (how rare the word is in the corpus). For exact details about the implementation, please refer to this page.
  2. A truncated singular value decomposition (SVD) step. This is a dimensionality reduction technique, as TF-IDF tends to output very long vectors which are not well suited for cosine similarity. Roughly speaking, truncated SVD retains the most informative linear combinations of the original vector dimensions (see details).

Using this, items can be mapped to fingerprints, i.e. vectors of dimension D ~ 200 (the exact number can be adjusted). This is a fast, robust way of doing the mapping. Because of the dimensionality reduction, it can capture some meaning (e.g. place the fingerprint of “company” closer to that of “firm” than to that of “apple”).

The LSA model needs to be trained on a corpus before it can produce fingerprints. In principle, this training could be done only once, but as time goes on, the data might drift, forcing us to re-train.
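As a minimal sketch, such an LSA pipeline can be assembled with scikit-learn (the toy corpus and the number of components are illustrative; on a real corpus one would use something like the D ~ 200 mentioned above):

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Illustrative corpus; in practice this would be a large set of articles.
corpus = [
    "The firm announced record profits this quarter.",
    "The company reported strong quarterly earnings.",
    "Apple juice sales increased over the summer.",
]

# TF-IDF followed by truncated SVD: the classic LSA recipe.
# Normalizing the output makes cosine similarity a simple dot product.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),  # ~200 on a real corpus
    Normalizer(copy=False),
)

fingerprints = lsa.fit_transform(corpus)  # shape: (n_items, n_components)
print(fingerprints.shape)
```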

Word Embeddings

Word embeddings are learnt representations of words which try to capture meaning. They are typically trained by masking words in sentences and having the model try to predict them. For example, one can easily imagine that a large corpus will contain sentences like “I would like a glass of orange juice”, “Can I get some apple juice?”, etc. Since “apple”, “orange” and other fruits often appear in the same context, the model will learn that fruits should have similar representations.

Illustration of a 3-dimensional word embedding space with the cosine similarities between the vector for Company and other words. Note that Company has a very high cosine similarity with Business (0.9) and Firm (0.85), and a lower similarity with Supplier (not all companies are suppliers, but suppliers are companies, so the terms are related). Apples do not share a lot with companies, so their cosine similarity is closer to 0.

A popular word embedding framework is Fasttext, which does not work directly with words but rather with parts of words, which means the model can understand that “read” and “reading” share the same root. Pretrained models are available (they can be found here), or one can train a model on their own corpus.
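As a rough sketch, computing sentence-level fingerprints with the fasttext Python package could look like this (it assumes the pretrained English model file cc.en.300.bin has already been downloaded):

```
import fasttext
import numpy as np

# Assumes the pretrained English vectors (cc.en.300.bin) are available locally.
model = fasttext.load_model("cc.en.300.bin")

def fingerprint(text: str) -> np.ndarray:
    # fastText averages (sub)word vectors to produce a sentence-level vector;
    # newlines are stripped because get_sentence_vector expects a single line.
    return model.get_sentence_vector(text.replace("\n", " "))

a = fingerprint("The firm announced record profits this quarter.")
b = fingerprint("The company reported strong quarterly earnings.")
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)  # expected to be relatively high for related sentences
```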

The advantage of this method is that it should capture context better than LSA. However, it is more prone to overfitting and might not generalize well on new data.

Finding duplicates

Once the fingerprints are obtained, it’s time to group similar items. For that, we can use Pynndescent, which builds a graph of nearest neighbors and allows us to very quickly find the entries that are most similar to a given item (there is a wide range of libraries dealing with this problem; a comparison can be found here).

The final obstacle to overcome is that such algorithms work by querying a number of neighbors, not directly by similarity threshold. One solution is to take a small sample of data, compute how many neighbors on average are above the threshold, and use this as the base query size. When, for a given query, all returned items are above the threshold, the query size is temporarily increased by a multiplicative factor (recursively, until all items above the threshold have been found).
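As an illustration, here is one way this adaptive querying could look with Pynndescent (the helper name, the base query size of 10, the growth factor of 2 and the cap are illustrative assumptions, not the exact implementation):

```
import numpy as np
from pynndescent import NNDescent

def neighbors_above_threshold(index, item, threshold, k=10, growth=2, max_k=1000):
    """Return indices of neighbors whose cosine similarity with `item`
    exceeds `threshold`, growing the query size until some fall below it."""
    while True:
        indices, distances = index.query(item.reshape(1, -1), k=k)
        similarities = 1.0 - distances[0]  # cosine distance -> cosine similarity
        keep = indices[0][similarities >= threshold]
        # If even the farthest returned neighbor is above the threshold,
        # we may be missing some: query again with a larger k.
        if similarities[-1] < threshold or k >= max_k:
            return keep
        k = min(k * growth, max_k)

# Build the nearest-neighbor index on the fingerprints (e.g. LSA vectors).
fingerprints = np.random.rand(1000, 200).astype(np.float32)
index = NNDescent(fingerprints, metric="cosine", n_neighbors=30)
index.prepare()
duplicates_of_0 = neighbors_above_threshold(index, fingerprints[0], threshold=0.9)
```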

The final algorithm is as follows (a code sketch of the loop follows the list):

  1. Create the index from the fingerprints (either LSA, word embeddings, or other)
  2. Iterate through the list of items:
    a. Find all neighbors above threshold with Pynndescent
    b. Mark those neighbors as duplicates of the original item
    c. Remove item and neighbors from list of items to explore

For datasets below 1 million articles, the whole process (training the vectorizer, computing the index and finding duplicates) is reasonably fast (less than a couple of hours). But beyond that, a faster method is needed.

Note that one could use embeddings coming from large Natural Language Processing models, such as BERT. In principle, these models can better capture context, and could be able to understand that two articles are dealing with the same subject. However, computing embeddings for a large number of items would be extremely resource intensive. That is not even considering the potential need to fine tune a model on one’s dataset. At EcoVadis, we try to limit the impact our AI has on the environment, which is why we chose not to go further with that option.

Locality Sensitive Hashing

Another common way to deduplicate text is to use Locality Sensitive Hashing (LSH; for details, check this post or this lib and the references within). The idea is to create hashes from the text using the MinHash function, which ensures that the hashes for similar documents collide more often. The probability that two minhashes collide is given by the Jaccard similarity of the texts (the intersection over union of their words). The LSH algorithm then splits the minhash into bands and does some more hashing. Candidate pairs are identified if at least one band matches (I refer you to this post and the associated code for details).

The advantage is that it does not require training and is very fast. The drawback is that it does not understand the meaning of words (“company” and “firm” won’t be similar).
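As a minimal sketch, building and querying such an index could look like this with the datasketch library (the whitespace tokenization, num_perm=128 and the 0.8 threshold are illustrative choices):

```
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from the set of whitespace tokens."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

articles = {
    "a1": "the company announced record profits this quarter",
    "a2": "this quarter the company announced record profits",
    "a3": "apple juice sales increased over the summer",
}

# Candidate pairs collide when their estimated Jaccard similarity
# is above the chosen threshold.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {key: minhash(text) for key, text in articles.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

print(lsh.query(signatures["a1"]))  # expected to include "a2" but not "a3"
```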

There are two ways to use LSH:

  1. First build an index and then for each item, use that index to find all the items that are similar enough. This is quite similar to the method based on cosine similarity, the only difference is the way to find the duplicates for a given item.
  2. Once the index is built, one can directly look at the hash tables that were created and consider as duplicates the items with the same hash key. There are typically a few tens of hash tables (corresponding to different splits of the minhash). We can sacrifice accuracy for speed by looking at a single hash table. At high thresholds, this becomes similar to method 1 above. With this method, though, there is no need to iterate through articles once the index is created, which makes it very fast. For datasets larger than 1 million items, it was faster by a factor of 2.

Comparison and thresholds

In order to find the right similarity threshold and the best method, one needs a labeled dataset. In practice, that means having a dataset where the duplicates have already been identified (either by humans, or programmatically for an artificial dataset). With this, one can run the different models for a range of similarity thresholds, and compare the evolution of metrics related to the use case. We have found that more business-oriented versions of precision and recall, such as the percentage of duplicates found and the number of items wrongly flagged as duplicates, are particularly insightful.
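As an illustration, with a labeled set of duplicate pairs those two quantities can be computed roughly as follows (the pair-level representation as sets of frozensets is an assumption made for this sketch):

```
def duplicate_metrics(predicted_pairs: set, labeled_pairs: set) -> dict:
    """Business-oriented view of precision/recall on duplicate pairs.

    Both arguments are sets of frozenset({id_a, id_b}) pairs.
    """
    true_positives = predicted_pairs & labeled_pairs
    return {
        # share of known duplicates that the method recovered (recall-like)
        "duplicates_found_pct": 100 * len(true_positives) / max(len(labeled_pairs), 1),
        # pairs wrongly grouped together (precision-like, expressed as a count)
        "wrongly_deduplicated": len(predicted_pairs - labeled_pairs),
    }

labeled = {frozenset({"a1", "a2"}), frozenset({"b1", "b2"})}
predicted = {frozenset({"a1", "a2"}), frozenset({"a1", "c7"})}
print(duplicate_metrics(predicted, labeled))
# {'duplicates_found_pct': 50.0, 'wrongly_deduplicated': 1}
```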

Conclusion

Deduplicating content is an important part of a data scientist’s job. It improves the efficiency of data processing (from data cleaning to the actual machine learning models). For text-based content, the key is to have a way of measuring the similarity of content which preserves the attributes most relevant to how the content is used (order of words, context, specific keywords, etc.). This typically involves mapping the text to a numerical space where similarities can be measured. We have given a few examples of how to do that here; hopefully one of them will be appropriate for your own deduplication needs.
