How to detect text duplicates with Django
Lessons learned from cleaning a pile of documents with PostgreSQL
This article is part of a series on the making of “Islam, media subject”, a study on the perception of Islam by French national daily newspapers. You can read our first published article here.
While carrying out our “Islam, media subject” study, we noticed that newspapers tended to publish articles with identical (or nearly identical) content. There are several possible explanations: an article published in a newspaper’s different editions (like “Africa”, “Asia”, “Europe”, et cetera), a human mistake, or even a bug in the website/archive the article was retrieved from.
But, as we explained in our first article, this small study relies heavily on counting word/term occurrences. So we figured that those duplicate articles would taint our results by making some terms artificially more important than others and, therefore, that they had to be removed.
In this article, I first want to share the major mistakes we made while trying to detect and remove those duplicates, so you can avoid them. After that, I will explain the solution we finally chose and how we tweaked it to make it run faster. But first, let me describe our initial setup.
To clarify the next parts of this article, let’s see how we defined articles: in our application, an Article is an object with a date (the day of its publication), a headline, a URL, a textual content and, finally, a newspaper of origin (represented by a different class).
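The original model code isn’t reproduced here, but a plain-Python sketch of that structure could look like this (the real thing is a Django model, and the names below are illustrative, not our actual code):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Newspaper:
    """The newspaper of origin, represented by its own class."""
    name: str


@dataclass
class Article:
    """An article: publication day, headline, URL, content, newspaper."""
    date: date
    title: str
    url: str
    content: str
    newspaper: Newspaper
```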
Now that we defined those articles, let’s see how not to detect duplicates among them.
How not to detect duplicates
1. The Django “grouping”
Since we were using Django, we first tried to rely on a “Django-only” solution (like the one sketched below), using Django’s grouping mechanism to group articles based on their title and date.
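Our exact queryset isn’t shown here, but in Django this kind of grouping typically relies on the `values().annotate(Count(...))` pattern. Conceptually, it boils down to something like this pure-Python sketch (attribute names are illustrative):

```python
from collections import defaultdict


def find_grouping_duplicates(articles):
    """Group articles by (title, date, newspaper); any group with more
    than one member is treated as a set of duplicates.

    `articles` is any iterable of objects exposing .title, .date and
    .newspaper attributes (hypothetical names, matching the Article
    description above).
    """
    groups = defaultdict(list)
    for article in articles:
        groups[(article.title, article.date, article.newspaper)].append(article)
    # Keep only the groups that actually contain duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

In the ORM, the equivalent idea would be roughly `Article.objects.values("title", "date", "newspaper").annotate(n=Count("id")).filter(n__gt=1)` — same grouping, same flaws, as explained next.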
Indeed, an article with the same title, published in the same newspaper on the same day as another article, should be a duplicate, right? Well, not exactly. For instance, we saw that newspapers tended to create specific headlines that were reused across articles with different content (such as situation reports, LIVE articles, et cetera).
Moreover, with this approach we may miss potential duplicates: if two headlines differ in punctuation or case, they won’t be detected as duplicates.
Finally, if a genuine duplicate is published on a different day, it won’t be detected either.
2. The “PARTITION BY”
At this point, we were pretty sure that we needed to do some grouping on the content field itself in order to detect duplicates. But we first needed to get rid of the small variations occurring in this field, so we wrote an SQL query for that.
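The query itself isn’t reproduced here, but the idea can be sketched as follows — demonstrated on SQLite only because it also supports window functions; our real query ran on PostgreSQL, with more thorough “homogenization” than the two `replace()` calls below:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO articles (content) VALUES (?)",
    [("Hello, world.",), ("hello world",), ("Something different",)],
)

# "Homogenize" the content (lower-case it, drop some punctuation), then
# number the rows sharing the same homogenized content; any row numbered
# above 1 within its partition is a duplicate of an earlier row.
duplicate_ids = [row[0] for row in conn.execute("""
    SELECT id FROM (
        SELECT id,
               row_number() OVER (
                   PARTITION BY replace(replace(lower(content), ',', ''), '.', '')
                   ORDER BY id
               ) AS rn
        FROM articles
    ) WHERE rn > 1
""")]
print(duplicate_ids)  # the id of the second "hello world" row
```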
Here, the “row_number() OVER (PARTITION BY …)” clause is what does the job. It numbers the rows having the same “homogenized” content, which lets us return the IDs of the duplicates. And it worked for numerous articles.
But then the obvious truth appeared before us: duplicated articles can differ by more than punctuation or case changes. Words or sentences can be added or deleted, spelling can change, et cetera. And this technique couldn’t handle that: even a single-letter change (like a typo fix) would prevent detection.
The right way to detect duplicates
Grouping will never be the right technique to detect duplicates, and it took us an eternity to realize it. A thousand face-palms later, we started looking for the right way to detect duplicates: detecting similarities between texts.
Use similarity detection
We were relieved to find out that this problem had already been solved in PostgreSQL by the (rightly named) similarity function (located in the pg_trgm module). This function works by splitting the text it receives as an argument into trigrams (groups of three characters). It then compares the trigram sets of two texts and returns a similarity score (between zero and one).
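To get an intuition for what `similarity` computes, here is a simplified pure-Python reimplementation. It is illustrative only: the real pg_trgm also pads each word with spaces and strips non-alphanumeric characters, so its scores will differ slightly.

```python
def trigrams(text):
    """Split a text into its set of character trigrams.

    Simplified: a sliding 3-character window over the lower-cased text
    (pg_trgm additionally pads and normalizes words).
    """
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def similarity(a, b):
    """Shared trigrams divided by total distinct trigrams, between 0 and 1,
    which is the ratio pg_trgm's similarity() is based on."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Two identical texts score 1.0, unrelated texts score near 0.0, and a text with a small typo fix still scores high — exactly what grouping could not give us.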
As you can imagine, this is not an easy task for your database system to compute. So, if you plan to use it, be careful to prepare your database a little first. If you just try to use it on your whole database without an index, you’ll end up with queries that take a really long time to run.
You should read the documentation before using this function and take care to create indexes on the table fields you want to query.
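Concretely, that preparation looks something like the following (the table and column names are ours; `gin_trgm_ops` is the operator class pg_trgm provides for trigram indexes):

```sql
-- Enable the pg_trgm module (once per database)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index on the column we run similarity queries against
CREATE INDEX articles_content_trgm_idx
    ON articles USING gin (content gin_trgm_ops);
```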
Also consider the size of your database an important factor. In our case, with more than 40,000 articles, our first attempt at a similarity query performed a JOIN that produced 40,000² rows (which makes 1.6 billion rows) to analyse, because it compared every row with every other one. Don’t do that.
Instead, you should try an approach similar to the one described below, to bring the computation down to a reasonable time.
- Reduce the number of rows resulting from the join operation.
In our case, we chose to compare only articles published less than three days apart. This offset can be changed, either to increase the number of potential duplicates found, or to speed up the query execution by reducing the number of joined rows.
- Reduce the size of the texts to compare.
The query execution can also be accelerated by reducing the comparison to only a part of each document (for us, the first 1,000 characters). Also, as with grouping, we don’t need punctuation or case to detect similarities. On the contrary, ignoring them will improve similarity detection, since the similarity function only cares about characters, not words.
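Putting both reductions together, here is a pure-Python sketch of the approach. The real version is a single PostgreSQL query; the `trigram_similarity` helper below is a hypothetical stand-in for pg_trgm’s `similarity`, and all names and thresholds are illustrative.

```python
from datetime import date, timedelta


def trigram_similarity(a, b):
    """Stand-in for pg_trgm's similarity(): shared trigrams over total."""
    ta = {a[i:i + 3] for i in range(len(a) - 2)}
    tb = {b[i:i + 3] for i in range(len(b) - 2)}
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0


def normalize(text, limit=1000):
    """Keep only the first `limit` characters, lower-cased, and drop
    everything but letters, digits and spaces (punctuation/case removal)."""
    text = text[:limit].lower()
    return "".join(c for c in text if c.isalnum() or c.isspace())


def find_similar_pairs(articles, max_days=3, threshold=0.8):
    """Compare only articles published less than `max_days` apart.

    `articles` is a list of (id, date, content) tuples; returns the pairs
    of ids whose truncated, normalized contents score above `threshold`.
    """
    window = timedelta(days=max_days)
    pairs = []
    for i, (id_a, date_a, content_a) in enumerate(articles):
        for id_b, date_b, content_b in articles[i + 1:]:
            if abs(date_a - date_b) >= window:
                continue  # outside the publication window: skip the comparison
            score = trigram_similarity(normalize(content_a), normalize(content_b))
            if score >= threshold:
                pairs.append((id_a, id_b))
    return pairs
```

The date window prunes most of the quadratic comparisons up front, and the normalization/truncation makes each remaining comparison cheaper — the same two levers the SQL version pulls.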
The code of this solution can be found in this gist. It’s far from perfect but since it was a one-time solution, it was good enough for us.
Is there any other way of detecting duplicates/similarities?
We know that we didn’t investigate all the existing techniques or tools for this issue. For instance, it may be possible to use scikit-learn, as described in this Stack Overflow answer, to detect similarities between documents and then deal with them. That said, I’m not convinced this approach is less complex or more efficient than our solution, which is why we did not study it further.
However, we remain curious about other possible ways of dealing with this issue that we didn’t see. So please, if you have already been confronted with this issue and you know “cleaner”, quicker or more generic ways (as far as I know, ours only works on PostgreSQL), let us know!