How our Data Products team built an article deduplication tool based on NLP

Jose Del Rio
Reach Product Development
9 min read · Feb 14, 2020

Content in this article: Text scoring, NLP, word2vec, text similarity

Have you ever come across a published article and had that ‘déjà vu’ feeling of having already read it, or something very similar? Reach plc has more than 60 national, regional and local news titles across the United Kingdom, so a tool that automatically scores similarity across these published articles is genuinely useful. This is a perfect task for Natural Language Processing. NLP in action!

Photo by Mr Cup / Fabien Barral on Unsplash

With more than 70k articles published each month, this is a task no human can do (without serious consequences for their mental health), so let’s give this task to the machines.

Sadly, machines don’t understand text — or do they?

NLP is an interesting branch of machine learning where the machine performs operations with text that try to replicate the way a person reads and understands text. The tasks we can tackle with NLP include…

  • Extracting entities (names of people or organisations)
  • Answering questions as a person would normally expect
  • Classifying articles into categories (sports, politics, economy…)
  • Classifying a text based on sentiment or intent

The challenge

The main task consists of avoiding duplication across articles. This has the direct benefit of improving user experience by increasing the variety of articles provided, either in curated publications or in recommendations generated by the system.

The complexity of this problem can be reduced to the following steps, which I will refer to throughout this article:

  1. Clean the article data
  2. Filter the articles to a subset including only relevant articles
  3. Train a model to score similarity across articles and paragraphs
  4. Store our scores in a database
  5. Provide a UI for quick access and evaluation of the scores

And the overall architecture to tackle this challenge includes the following modules:

  • Lambda service, to ingest the articles as they are generated
  • ElasticSearch, to store and pre-filter most similar articles
  • NLP model trained on the subset of articles selected
  • PostgreSQL to store the document and paragraph scores
  • Web app to provide an interface

Architecture diagram

Now let’s understand what these steps involve and why we are using these modules to solve our challenge.

Back to the basics

The first step is the boring but unavoidable step of cleaning the data. Happily, in this case we already have all the article text normalised and waiting for us in a database!

Cleaning operations (Photo by Tim Mossholder on Unsplash)

If not, this would have been the moment to dust off those skills we haven’t used for a while and apply some of the classic text cleaning and preprocessing operations (a minimal sketch follows the list):

  • Remove special characters
  • Remove punctuation
  • Remove stop words
  • Apply lower case
  • Lemmatisation or stemming
  • And so on…
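As a minimal sketch of these steps (assuming NLTK, which is not necessarily what our pipeline uses; the clean_text helper is purely illustrative):

import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


def clean_text(text):
    """Apply the classic cleaning steps listed above to a raw article body."""
    text = text.lower()                                         # lower case
    text = re.sub(r"[^a-z0-9\s]", " ", text)                    # special characters and punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]   # stop words
    return [LEMMATIZER.lemmatize(t) for t in tokens]            # lemmatisation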

What to compare?

We are interested in analysing new articles and comparing them with possible ‘originals’, but we don’t want to waste resources comparing articles that are completely different. That’s why we build a filter (step 2) to extract only articles that show a certain degree of similarity.

To perform this task, we require a scoring metric that is going to evaluate how similar two or more articles may be. With hundreds of thousands of articles to compare, one of the main requirements for this metric is going to be performance.

Time to compare (Photo by Glenn Carstens-Peters on Unsplash)

Luckily, we can again borrow from existing algorithms already developed and used in search engines. These metrics use TF-IDF (Term Frequency–Inverse Document Frequency), which reflects how important an individual word is to a document within a collection of documents. In TF-IDF, the value given to a word increases proportionally to the number of times the word appears in the document, and is offset by the number of documents in the collection that contain the word. In other words, TF-IDF allows us to value the relevance of a text match.
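As a quick illustration of the idea (not the production filter, which runs inside ElasticSearch as described below), this is how pairwise article similarity could be scored with TF-IDF vectors and cosine similarity using scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "City council approves new cycling lanes in the town centre",
    "New cycling lanes approved by the city council for the town centre",
    "Local football team wins the cup after a dramatic penalty shootout",
]

# One TF-IDF row per article, one column per term in the collection
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)

# Pairwise cosine similarity between the TF-IDF vectors;
# the first two articles score highest against each other
print(cosine_similarity(vectors).round(2))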

An improvement over TF-IDF is Okapi BM25, where BM stands for Best Match. This algorithm is the result of the application of probabilistic information retrieval. This could be a post on its own, so to keep it short, Okapi BM25 provides a more robust scoring metric of the similarity across documents. And that is our ultimate goal, to understand how similar two or more documents are in our comparison tool.
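For reference, the usual presentation of the BM25 score for a query Q = q_1, …, q_n against a document D is:

score(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of term q_i in D, |D| is the document length, avgdl is the average document length in the collection, and k_1 and b are tuning parameters (commonly around 1.2 and 0.75).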

It is worth mentioning that these filters are already built into many databases. We have used ElasticSearch, where we can choose the similarity module per field (BM25, which is the default, classic, boolean, etc.). For this pre-filtering step we rely on the default BM25; cosine similarity comes into play later, when we compare embeddings.

In the following ElasticSearch index definition, we set the similarity at the field level when the fields are first created. article_body uses the default BM25 and is_published uses boolean similarity:

PUT selected_index
{
  "mappings": {
    "properties": {
      "article_body": {
        "type": "text"
      },
      "is_published": {
        "type": "text",
        "similarity": "boolean"
      }
    }
  }
}

The goal of this step is to filter and reduce the number of documents to analyse. Choosing an adequate threshold for this filter will require feedback from users so that we fulfil their requirements. Once we have the best candidates for our comparison, we can perform a deeper comparison using more advanced and fancy methods that require heavy computation.
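The exact query we run is not shown here, but a minimal sketch of this pre-filtering step could look like the following, using the elasticsearch Python client (8.x-style keyword arguments), a simple match query against article_body, and an assumed score threshold:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster, for illustration only

SCORE_THRESHOLD = 10.0  # hypothetical cut-off, to be tuned with user feedback


def find_candidates(article_body, max_candidates=20):
    """Return the ids and BM25 scores of the articles most similar to the given body."""
    response = es.search(
        index="selected_index",
        query={"match": {"article_body": article_body}},
        size=max_candidates,
    )
    return [
        (hit["_id"], hit["_score"])
        for hit in response["hits"]["hits"]
        if hit["_score"] >= SCORE_THRESHOLD
    ]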

Enter embeddings

To obtain a more granular comparison, we apply word embeddings. In the last few years, word embeddings have become one of the strongest trends in NLP. Using word embeddings, we create a numeric representation of each paragraph and compare similarities across all paragraphs. This works because the vectors obtained using word embeddings correlate with the semantic similarity of the words.

There are multiple libraries available to create word embeddings, and it is possible to use pre-trained models as well. Some of the best-known libraries in Python include:

  • Word2Vec
  • GloVe
  • Gensim
  • PyDSM
  • DISSECT

If you haven’t heard of embeddings before, we can backtrack to the 1990s, when vector space models were already in use and different methods were being developed to estimate continuous representations of words; two examples are Latent Semantic Analysis and Latent Dirichlet Allocation. Later on, papers started to use the phrase ‘distributed representations of words’ for representations trained in a neural network alongside the model’s other parameters.

Finally, word embeddings became hugely popular thanks to the libraries mentioned before, such as word2vec (2013) and GloVe (2014), which make the creation and use of pre-trained models relatively simple.

Now that we understand embeddings, we are going to train a model using word2vec (more precisely, gensim’s Doc2Vec, which extends word2vec with paragraph-level vectors). To perform the training, we must create a corpus (a group of documents), which in this case includes the article for which we are running the comparison and its candidates.

Finally, we run the training with the selected parameters (learning rate, number of cycles/epochs, etc.):

import logging

from gensim.models.doc2vec import Doc2Vec

LOG = logging.getLogger(__name__)

# VEC_SIZE, ALPHA, MIN_ALPHA, MAX_EPOCHS and DECREASE_RATE are constants
# chosen for our data (vector size, learning rate and epoch schedule).


def train_doc_2_vec_model(corpus):
    model = Doc2Vec(vector_size=VEC_SIZE,
                    alpha=ALPHA,
                    min_alpha=MIN_ALPHA,
                    min_count=1,
                    dm=1)

    LOG.debug("Building vocabulary + training doc2vec...")
    model.build_vocab(corpus)

    for epoch in range(MAX_EPOCHS):
        model.train(
            corpus,
            total_examples=model.corpus_count,
            epochs=model.epochs)
        model.alpha -= DECREASE_RATE   # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate

    return model

Once the model has been trained, we call model.docvecs.most_similar(ref), specifying the reference (tag) of the paragraph for which we want the similarity scores, and we store this information in a database.
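Putting it together, here is a minimal sketch of how the corpus could be built and queried. The paragraphs and their tags are purely illustrative; gensim’s Doc2Vec expects the corpus to be a sequence of TaggedDocument objects, and model.docvecs follows the pre-4.0 gensim API used above:

from gensim.models.doc2vec import TaggedDocument

# One TaggedDocument per paragraph; the tag lets us look its vector up later
paragraphs = {
    "article_1_par_0": ["council", "approves", "new", "cycling", "lanes"],
    "article_2_par_0": ["cycling", "lanes", "approved", "by", "council"],
    "article_3_par_0": ["football", "team", "wins", "cup"],
}
corpus = [TaggedDocument(words=tokens, tags=[tag])
          for tag, tokens in paragraphs.items()]

model = train_doc_2_vec_model(corpus)

# Paragraphs most similar to the reference paragraph, as (tag, score) pairs
print(model.docvecs.most_similar("article_1_par_0"))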

At this point, it’s a question of creating a frontend to retrieve the information from the database and give it a nice-looking front page.

Visualisation

We must not forget the main goal, which is to provide a seamless tool for non-technical users in which the possible duplication can be identified visually and navigated easily.

Example of duplication comparison (REACH PLC)

This tool provides the comparison and the scores across articles as well as the interface to compare the articles sentence by sentence.

Article Comparator with scores by relevant paragraphs.

Future Work

Other ways to extend the functionality include the use of pre-trained models, which are typically trained on Wikipedia or on sources like Twitter and Common Crawl datasets. This gives these models a built-in knowledge that is impossible to achieve otherwise.
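For instance, gensim ships a downloader for several pre-trained embedding models; a minimal sketch (the model name below is just one of the available options, not necessarily the one we would choose):

import gensim.downloader as api

# Download (on first use) and load GloVe vectors pre-trained on Wikipedia + Gigaword
word_vectors = api.load("glove-wiki-gigaword-100")

# The built-in knowledge shows up as sensible similarities out of the box
print(word_vectors.similarity("football", "soccer"))
print(word_vectors.most_similar("newspaper", topn=3))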

Going even further with embeddings, it is possible to use the more evolved successors of word2vec. If you have followed the news in NLP over the last couple of years, you have probably heard about BERT (Bidirectional Encoder Representations from Transformers), developed at Google in 2018 and a close relative of OpenAI’s GPT (Generative Pre-trained Transformer). BERT was a significant change from previous models, as it was the first deeply bidirectional, unsupervised language representation.

And one of the latest developments in NLP is ALBERT (A Lite BERT for Self-Supervised Learning of Language Representations) which improves parameter efficiency by sharing all parameters across all layers.

Note that when using these models, it is preferable not to remove the stopwords and numbers that we were removing in the case of the word2vec model.

How do we use the word embeddings? Thanks to them, we can translate the sentences we want to compare into a numerical representation, which is then used for the comparison. Once we have a vector for each sentence, we apply a similarity metric to measure how similar these vectors are:

  • Cosine similarity, which has an interpretation as the cosine of the angle between the two vectors and is invariant to the scale of the vectors
  • Pearson similarity, which is similar to the cosine but is invariant both to shifts and to the scale of the vectors

We’ll be using cosine similarity for our scoring system for its simplicity and clear correlation with the goal we are trying to achieve.

To be able to compare sentences of different lengths, one classic technique is to average the embeddings of the tokens (words) in each sentence. This way the final vector always has the same length, and we can compare sentences using cosine similarity. It is worth mentioning that, since we are only measuring the angle between the vectors, their magnitude doesn’t matter: dividing the summed embeddings by the number of tokens changes only the magnitude, not the direction, so the averaging step is an extra computation we can skip by simply summing instead.
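A minimal sketch of this averaging approach with numpy, assuming word_vectors is a gensim KeyedVectors model such as the pre-trained GloVe vectors loaded in the earlier sketch:

import numpy as np


def sentence_vector(tokens, word_vectors):
    """Average the embeddings of the tokens that are in the vocabulary."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0)


def cosine_score(a, b):
    """Cosine of the angle between two vectors (invariant to their scale)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sent_a = sentence_vector(["council", "approves", "cycling", "lanes"], word_vectors)
sent_b = sentence_vector(["cycling", "lanes", "approved", "by", "council"], word_vectors)
print(cosine_score(sent_a, sent_b))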

Again, some testing will provide invaluable feedback about which metric is best to apply in your particular application.

Lessons Learned

Using embeddings provides a very powerful tool to compare text semantically, as the model has learned on its own about synonyms and words that are very close in meaning. You don’t need to be a linguistics expert.

Training a model specifically for a particular group of documents creates a customised model that achieves very good results when the group of documents, and the lengths of those documents, are big enough. In the scenario of a very small group, however, the model won’t be able to learn much and the scores become less distinctive. This can be tackled by implementing a dynamic threshold for what is considered similar, depending on the size of the corpus used in the training phase. Another issue is reproducibility, as the trained models depend on random initialisation (easily solved by setting the random seed).

Using pre-trained models gives more stable results, as the embeddings generated by such a model have been learned from a huge pool of articles, ensuring that the same weight is given to the same word in the same context. There are no worries here about reproducibility, as the model is always the same.

Last, a reminder about keeping a sense of proportion. Applying the latest cutting-edge technologies is cool, but it is not necessarily the best option. Depending on the objective, there may be solutions that are simpler than the latest models. With that in mind, there are useful metrics, either on their own or combined, that could provide an accurate comparison:

Jaccard Index: the number of distinct words two texts share (the intersection) divided by the number of distinct words across both texts (the union).

Overlap Coefficient: also known as the Szymkiewicz–Simpson coefficient, defined as the size of the intersection of sets A and B divided by the size of the smaller of the two sets.
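Both of these are a couple of lines of Python over token sets, for example:

def jaccard_index(tokens_a, tokens_b):
    """Shared distinct words over all distinct words."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)


def overlap_coefficient(tokens_a, tokens_b):
    """Shared distinct words over the size of the smaller set."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / min(len(a), len(b))


print(jaccard_index("the council approves new lanes".split(),
                    "new lanes approved by the council".split()))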

Other metrics to compare similarities include Sørensen–Dice coefficient, Tversky index, Tanimoto distance and bag distance. All these metrics have been developed to solve specific problems. Going through all of them is not the goal of this post, but it is good to remember that there are a wide range of tools already available.

Thanks to Elektra Papazoglou and Jamie Hooper for their work and contributions, both to the Deduplication project and in reviewing this article.

