Detecting Near-Duplicates in the News

Martin Boyanov · Commetric
Mar 21, 2019 · 2 min read

Overview and benchmark of data science methods for duplicate detection

Near-duplicates are a real issue when indexing the news. The fast-paced news cycle has led to fast-paced news propagation: when a story breaks, it is replicated almost identically across a number of other content providers. However, the copies are not exactly the same: a word or two have been changed, a sentence has been rewritten, custom headers or footers are sprinkled in, and sometimes advertisements creep into the middle of the text.

At Commetric, we do custom media analysis on behalf of our clients. We ensure the highest quality of reporting by assigning the actual analysis to humans, while empowering them with AI tools to aid them in their work.

Our estimates indicate that 30–40% of news stories are reprints. Grouping and managing them effectively can decrease the time analysts spend reading the news and increase the quality of the final report.

Organizing and managing similar documents allows the analysts to focus on the content.

Thus, we set out to evaluate and benchmark different strategies for near-duplicate detection. Most of the techniques are covered in Chapter 3 of the wonderful book “Mining of Massive Datasets”. The following sections will provide a broad overview of the techniques. Please refer to the book and to the notebook version of the article for details.

Methods

The following methods will be evaluated:

  1. Brute-Force
  2. Length-based filtering
  3. Prefix-based filtering
  4. MinHash-LSH via 9-char shingles
  5. MinHash-LSH via unigram shingles
  6. MinHash-LSH via bigram shingles
  7. MinHash-LSH via trigram shingles
  8. MinHash-LSH via filtered trigram shingles
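
To make the first three methods concrete, here is a minimal sketch of the brute-force baseline and the length-based filter (the function names and the 0.8 default are ours, for illustration):

```python
from itertools import combinations

def trigram_shingles(text):
    """Represent a document as its set of word trigrams."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not (a or b):
        return 1.0  # two empty documents count as identical
    return len(a & b) / len(a | b)

def brute_force(docs, threshold=0.8):
    """Method 1: test every pair -- O(n^2) comparisons."""
    shingles = [trigram_shingles(d) for d in docs]
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if jaccard(shingles[i], shingles[j]) >= threshold]

def length_filtered(docs, threshold=0.8):
    """Method 2: if |A| <= |B|, then J(A, B) >= t requires |A| >= t * |B|,
    so sorting by shingle count lets us stop comparing early."""
    shingles = [trigram_shingles(d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: len(shingles[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            # Later documents only get longer; once the length ratio
            # rules out a match, it rules out all that follow.
            if len(shingles[i]) < threshold * len(shingles[j]):
                break
            if jaccard(shingles[i], shingles[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs
```

Prefix-based filtering pushes the same idea further: after fixing a global ordering of shingles, any two sets that meet the threshold must share at least one shingle within a short prefix of each, so pairs with disjoint prefixes can be skipped without computing the full similarity.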

The MinHash-LSH functionality comes from the excellent datasketch package.
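
As a rough sketch of how methods 4–8 look with datasketch (the two-document corpus, the key names, and num_perm=128 are illustrative, not the exact settings from our benchmark):

```python
from datasketch import MinHash, MinHashLSH

def char_shingles(text, k=9):
    """9-character shingles (method 4); swap in word n-grams
    for the unigram/bigram/trigram variants (methods 5-7)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

docs = {
    "a": "when a story breaks, it is replicated almost identically",
    "b": "when a story breaks it is replicated nearly identically",
}

# Build one MinHash signature per document.
signatures = {}
for key, text in docs.items():
    m = MinHash(num_perm=128)
    for sh in char_shingles(text):
        m.update(sh.encode("utf8"))
    signatures[key] = m

# The LSH index bands the signatures so that lookups touch only
# likely matches instead of the whole corpus.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
for key, m in signatures.items():
    lsh.insert(key, m)

# Returns candidate keys; candidates should still be verified with
# the exact Jaccard similarity to weed out false positives.
print(lsh.query(signatures["a"]))
```

Method 8 runs the same pipeline over a filtered set of trigram shingles; the notebook has the exact filtering.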

Similarity

For the purposes of this article, we will define two documents A and B as near-duplicates if the Jaccard similarity of their word-trigram sets, J(A, B) = |A ∩ B| / |A ∪ B|, is at least 0.8.
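
A tiny worked example (with made-up sentences) shows what this measure does:

```python
def trigram_set(text):
    """Word trigrams of a document, as a set."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

a = trigram_set("the quick brown fox jumps over the lazy dog")
b = trigram_set("the quick brown fox leaps over the lazy dog")

# |A ∩ B| / |A ∪ B| = 4 / 10 = 0.4
print(len(a & b) / len(a | b))
```

A single changed word knocks three trigrams out of the intersection, so these short sentences score only 0.4; full-length articles contain enough trigrams that small edits leave near-duplicates comfortably above 0.8.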

Experiments and Results

The experiments were carried out on a proprietary dataset of ~3,000 documents. Overall, we achieved a 40x speedup over the brute-force approach and decreased the number of comparisons by 162x!

Please check out the notebook and the book for details on each of the methods.
