Ensuring Quality — Copyright
Continuing our discussion of data science work at Pratilipi from the previous article, today I will cover one of the topics that deals with the quality aspect of our platform.
Now, Pratilipi is a two-sided user-generated content (UGC) platform. And since the content is user generated, we face a two-sided challenge -
- How to show the best quality content to the relevant readers, and
- How to remove unwanted content from our platform.
Of the several processes we run to remove unwanted content from our platform, one is the Content Copyright Check.
Authors put a lot of effort into creating high-quality content. Authors hold the intellectual property rights to the content they publish on Pratilipi, unless they have agreed to share those rights with another party. We take great care to secure our platform so that no content can be copied from it. Despite that, whenever content is copied from anywhere and published on Pratilipi or elsewhere, it hurts the authors, and the act itself is illegal.
We understand the pain authors feel when they find their content circulating without acknowledgement. So from very early on, we have had a strict policy of removing content identified as a copyright violation.
Whenever content is published on our platform, we compare it to all the content published in the past and compute a matching score. If the score crosses a particular threshold, we flag the content and raise it to our Language team, which then takes appropriate action.
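In outline, the flagging step can be sketched as below. This is a minimal illustration, not Pratilipi's implementation: the threshold value and function names are hypothetical, and Python's `difflib` similarity ratio stands in for the real matching score.

```python
from difflib import SequenceMatcher

# Hypothetical threshold; the actual value used in production is not disclosed.
MATCH_THRESHOLD = 0.8

def matching_score(new_text: str, old_text: str) -> float:
    """Similarity in [0, 1]; difflib is a stand-in for the real matcher."""
    return SequenceMatcher(None, new_text, old_text).ratio()

def flag_if_copied(new_text: str, corpus: list[str]) -> bool:
    """Flag content whose best match against the corpus crosses the threshold."""
    return any(matching_score(new_text, old) >= MATCH_THRESHOLD
               for old in corpus)
```

Flagged items would then be queued for human review rather than removed automatically, matching the Language-team workflow described above.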
Matching new content against previously published content is no easy feat. At the time of writing, around 8,000 pieces of content are published on our platform daily, and 3M+ pieces have already been published.
Content on Pratilipi ranges from a short poem to a 1000+ page novel, so comparing at this scale is significantly complex for us. We were looking for algorithms that are fast as well as robust.
After exploring some algorithms and tools, we decided to go ahead with Myers' diff algorithm. It is a divide-and-conquer algorithm that finds the middle "snake" of an optimal edit path, divides the problem there, and recursively solves the halves. One of the other algorithms considered was the Histogram diff algorithm.
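To make the idea concrete, here is a minimal sketch of the greedy form of Myers' O(ND) algorithm, which computes the length of the shortest edit script (insertions plus deletions) between two sequences. This is a textbook illustration, not Pratilipi's production code, which would also need the linear-space divide-and-conquer refinement to handle novel-length texts.

```python
def myers_edit_distance(a, b):
    """Shortest edit script length (insertions + deletions) between
    sequences a and b, via Myers' greedy O(ND) algorithm."""
    n, m = len(a), len(b)
    max_d = n + m
    if max_d == 0:
        return 0
    offset = max_d + 1
    # v[offset + k] holds the furthest x reached on diagonal k = x - y.
    v = [0] * (2 * max_d + 3)
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[offset + k - 1] < v[offset + k + 1]):
                x = v[offset + k + 1]       # move down: insertion
            else:
                x = v[offset + k - 1] + 1   # move right: deletion
            y = x - k
            # Follow the "snake": free diagonal moves while elements match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[offset + k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

An edit distance normalised by combined length gives one simple way to turn the diff into a matching score; small distances relative to text length suggest near-duplicates.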
On top of that, we apply some NLP-based transformations, such as converting text to standard Unicode characters, handling language-specific conjunct characters, removing extraneous characters, and more, to increase the likelihood of our system detecting such plagiarised content.
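A normalisation pass of this kind might look like the sketch below. The exact steps in Pratilipi's pipeline are not described here; these are common, assumed examples using Python's standard `unicodedata` module, which matter particularly for Indic scripts where the same conjunct can be encoded in multiple ways.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Illustrative pre-diff normalisation (not the exact production steps)."""
    # Canonical composition: merges base characters and combining marks so
    # visually identical text (e.g. Devanagari conjuncts) compares equal.
    text = unicodedata.normalize("NFC", text)
    # Drop zero-width joiners/non-joiners, which are often used
    # inconsistently in Indic-script text.
    text = text.replace("\u200c", "").replace("\u200d", "")
    # Collapse runs of whitespace introduced by copy-paste.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()
```

Running both texts through the same normaliser before diffing means trivial encoding differences no longer mask an otherwise exact copy.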
Overall, this process runs for around 5 hours a day with 36 threads running simultaneously, churning out plagiarised content for review.
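Structurally, such a batch job can be fanned out across a fixed worker pool. The sketch below is a hypothetical skeleton using Python's `concurrent.futures`, not the actual job code; only the thread count comes from the text above, and `check_one` is a placeholder for the real normalise-and-diff comparison.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 36  # matches the thread count mentioned in the article

def check_one(pair):
    """Placeholder for comparing one (new, old) content pair.
    The real matcher (normalisation + Myers diff + threshold) goes here."""
    new_text, old_text = pair
    return new_text == old_text  # stand-in for a thresholded matching score

def run_batch(pairs):
    """Fan the day's comparisons across the worker pool."""
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        return list(pool.map(check_one, pairs))
```

With roughly 8,000 new pieces a day checked against a 3M+ corpus, candidate pruning (e.g. by language or length) before the pairwise diff would be essential; the skeleton only shows the fan-out.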
This is not to say that our platform is 100% plagiarism free; despite our best efforts, a few pieces of content might slip through. We continuously look for better algorithms and approaches to help us improve our performance.
Also, we try to educate our users about the severity of this problem and always encourage them to do the right thing. We have also seen users who plagiarised unknowingly and then transformed themselves into some of the most popular authors on the platform. Such journeys inspire us.
If you are a data scientist or an aspiring data scientist who finds these problems interesting, or even if you just want to have a chat about any of these use cases, please feel free to reach out to me at sunny@pratilipi.com