Detecting Near-Duplicates in the News

Martin Boyanov · Commetric
Mar 21, 2019 · 2 min read

Overview and benchmark of data science methods for duplicate detection

Near-duplicates are a real issue when indexing the news. The fast-paced news cycle has led to fast-paced news propagation: when a story breaks, it is replicated almost identically across a number of other content providers. However, the copies are not exactly the same: a word or two have been changed, a sentence has been rewritten, custom headers or footers are sprinkled in, and sometimes advertisements creep into the middle of the text.

At Commetric, we do custom media analysis on behalf of our clients. We ensure the highest quality of reporting by assigning the actual analysis to humans, while empowering them with AI tools to aid them in their work.

Our estimates indicate that 30–40% of news stories are reprints. Grouping and managing them effectively can decrease the time analysts spend reading the news and increase the quality of the final report.

Organizing and managing similar documents allows the analysts to focus on the content.

Thus, we set out to evaluate and benchmark different strategies for near-duplicate detection. Most of the techniques are covered in Chapter 3 of the wonderful book “Mining of Massive Datasets”. The following sections will provide a broad overview of the techniques. Please refer to the book and to the notebook version of the article for details.

Methods

The following methods will be evaluated:

  1. Brute-Force
  2. Length-based filtering
  3. Prefix-based filtering
  4. MinHash-LSH via 9-char shingles
  5. MinHash-LSH via unigram shingles
  6. MinHash-LSH via bigram shingles
  7. MinHash-LSH via trigram shingles
  8. MinHash-LSH via filtered trigram shingles
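
To make the first three methods concrete, here is a minimal sketch of the brute-force baseline and the length-based filter (the function names and the 0.8 default are ours, for illustration):

```python
from itertools import combinations

def trigram_shingles(text):
    """Represent a document as its set of word trigrams."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not (a or b):
        return 1.0  # two empty documents count as identical
    return len(a & b) / len(a | b)

def brute_force(docs, threshold=0.8):
    """Method 1: test every pair -- O(n^2) comparisons."""
    shingles = [trigram_shingles(d) for d in docs]
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if jaccard(shingles[i], shingles[j]) >= threshold]

def length_filtered(docs, threshold=0.8):
    """Method 2: if |A| <= |B|, then J(A, B) >= t requires |A| >= t * |B|,
    so sorting by shingle count lets us stop comparing early."""
    shingles = [trigram_shingles(d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: len(shingles[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            # Later documents only get longer; once the length ratio
            # rules out a match, it rules out all that follow.
            if len(shingles[i]) < threshold * len(shingles[j]):
                break
            if jaccard(shingles[i], shingles[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs
```

Prefix-based filtering pushes the same idea further: after fixing a global ordering of shingles, any two sets that meet the threshold must share at least one shingle within a short prefix of each, so pairs with disjoint prefixes can be skipped without computing the full similarity.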

The MinHash-LSH functionality comes from the excellent datasketch package.
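
As a rough sketch of how methods 4–8 look with datasketch (the two-document corpus, the key names, and num_perm=128 are illustrative, not the exact settings from our benchmark):

```python
from datasketch import MinHash, MinHashLSH

def char_shingles(text, k=9):
    """9-character shingles (method 4); swap in word n-grams
    for the unigram/bigram/trigram variants (methods 5-7)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

docs = {
    "a": "when a story breaks, it is replicated almost identically",
    "b": "when a story breaks it is replicated nearly identically",
}

# Build one MinHash signature per document.
signatures = {}
for key, text in docs.items():
    m = MinHash(num_perm=128)
    for sh in char_shingles(text):
        m.update(sh.encode("utf8"))
    signatures[key] = m

# The LSH index bands the signatures so that lookups touch only
# likely matches instead of the whole corpus.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
for key, m in signatures.items():
    lsh.insert(key, m)

# Returns candidate keys; candidates should still be verified with
# the exact Jaccard similarity to weed out false positives.
print(lsh.query(signatures["a"]))
```

Method 8 runs the same pipeline over a filtered set of trigram shingles; the notebook has the exact filtering.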

Similarity

For the purposes of this article, we will define two documents A and B as near-duplicates if the Jaccard similarity of their word-trigram sets, J(A, B) = |A ∩ B| / |A ∪ B|, is at least 0.8.
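
A tiny worked example (with made-up sentences) shows what this measure does:

```python
def trigram_set(text):
    """Word trigrams of a document, as a set."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

a = trigram_set("the quick brown fox jumps over the lazy dog")
b = trigram_set("the quick brown fox leaps over the lazy dog")

# |A ∩ B| / |A ∪ B| = 4 / 10 = 0.4
print(len(a & b) / len(a | b))
```

A single changed word knocks three trigrams out of the intersection, so these short sentences score only 0.4; full-length articles contain enough trigrams that small edits leave near-duplicates comfortably above 0.8.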

Experiments and Results

The experiments were carried out on a proprietary dataset of ~3,000 documents. Overall, we achieved a 40x speedup over the brute-force approach and decreased the number of comparisons by 162x!

Please check out the notebook and the book for details on each of the methods.
