Part 1: Sinhala Language-Based Plagiarism Detection in Natural Language Processing

Erandi Ganepola
Apr 13, 2017 · 4 min read

What is plagiarism detection?

The Industrial Revolution gave way, through industrialization, to the information-based economy of the Information Age. In this digital age, plagiarism has turned into a serious problem. Lancaster and Culwin describe “plagiarism as theft of intellectual property which has been around as long as human has produced work of art and research”. Put simply, plagiarism is presenting someone else’s work as your own without referencing the original source.

Common forms of plagiarism include copy-pasting text, using program code without permission or reference, using similar ideas that are not common knowledge, and translating content and using it without reference to the original work. Plagiarism can be reduced through either “plagiarism prevention or plagiarism detection”. Common means of prevention are honesty policies and/or punishment systems, while detection relies on software tools that reveal plagiarism automatically. This article gives an overview of automatic plagiarism detection.

Related technologies to address plagiarism detection

Plagiarism checking software is the most popular means of detection. These tools detect sections of identical text, and they work by looking for structural patterns or unique identifiers.
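As a rough illustration of how identical sections of text can be found, the sketch below compares two texts by their overlapping word n-grams (often called shingles) and reports a Jaccard similarity. This is only one possible technique, not necessarily what any particular tool uses; the function names `shingles` and `overlap` and the sample sentences are made up for this example.

```python
import re

def shingles(text, n=3):
    """Return the set of word n-grams (shingles) found in the text."""
    words = re.findall(r"\S+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(suspect, source, n=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    a, b = shingles(suspect, n), shingles(source, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Illustrative sample texts (not taken from any real corpus).
source = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy dog"
partial = "a quick brown fox jumps over a sleeping dog"
unrelated = "plagiarism detection is a pattern recognition problem"

print(overlap(copied, source))     # identical text shares every shingle
print(overlap(partial, source))    # partly copied text shares some shingles
print(overlap(unrelated, source))  # unrelated text shares none
```

A score near 1.0 only says that the texts share many word sequences. Whether that is plagiarism is still a human decision.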

The automated process is very similar to what happens in Natural Language Processing, visual identification and biometric matching, all of which have a foundation in pattern recognition. Detection software identifies matches against its database, but a match does not by itself mean plagiarism. A person needs to look at the match, see whether the text is a quote, an excerpt or another properly referenced source, and then decide whether it is plagiarism or not.

This section describes some of the technologies currently in use. Further information can be found here. “” and “” were created to address the growing problem of web-based plagiarism. “Glatt Plagiarism Services” also offers a user-end, software-based approach to preventing and detecting plagiarism. One software program, called the “Glatt Plagiarism Teaching Program (GPTeach)”, is a tutorial designed to give students an understanding of exactly what constitutes plagiarism and instructions on how to avoid it. “Integrigard” offers a subscription service and a free service through “” and “”. Many more tools exist, and more will be released in the future.

In most of these tools, the underlying algorithms are based on Natural Language Processing.

Sinhala Language based Plagiarism Detection

Sinhala is recognized as the main official language of Sri Lanka and is used by over 19 million people. It has developed into its current form over a long period of time, with influences from a wide variety of languages including Tamil, Portuguese and English.

At the moment there is no Sinhala-specific plagiarism detection tool available. There are language-independent tools that support many languages, such as English, Hindi and Sinhala, but they do not provide satisfactory results. The reason is that language-independent tools rely on common patterns rather than language-specific mechanisms. Hence the need for a Sinhala-specific plagiarism detection tool is emerging.
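One small illustration of what a language-specific mechanism might look like is tokenization. Sinhala words contain vowel signs and other combining characters, which generic word patterns do not always treat as part of a word. The sketch below assumes that matching runs of characters from the Sinhala Unicode block (U+0D80 to U+0DFF) is a reasonable first approximation of a word. The pattern and the function name `sinhala_tokens` are assumptions made for this example, not part of any existing tool.

```python
import re

# Sinhala Unicode block (U+0D80 to U+0DFF). The zero-width joiner (U+200D)
# is included because it can appear inside some conjunct letters.
SINHALA_WORD = re.compile(r"[\u0D80-\u0DFF\u200D]+")

def sinhala_tokens(text):
    """Return runs of Sinhala characters, ignoring everything else."""
    return SINHALA_WORD.findall(text)

# Two Sinhala words written with escapes: "Sinhala" and "language".
sample = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd \u0db7\u0dcf\u0dc2\u0dcf\u0dc0"

tokens = sinhala_tokens(sample)
print(len(tokens))  # each word stays in one piece, vowel signs included
```

A real Sinhala tokenizer would also need to handle punctuation, numerals and mixed-script text, which this sketch simply drops.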

Relationship between NLP and ML

Natural Language Processing (NLP) refers to the Artificial Intelligence (AI) methods for communicating with intelligent systems using a natural language such as English. “Speech and written text” are the inputs and outputs of NLP systems. Machine Learning (ML) is a part of AI in which algorithms learn from (usually large amounts of) data. It subdivides into classification, regression, clustering and other disciplines. NLP can use ML, but an NLP system can also be engineered by hand. Using ML correctly, however, can boost the performance of NLP systems, and NLP is one of the fields in which ML is applied. Many ML algorithms and techniques, such as Hidden Markov Models (HMM), gradient descent, artificial neural networks and supervised/unsupervised learning, are used in NLP very frequently.

NLP & ML based plagiarism detection

Both basic and advanced NLP techniques can be applied during the pre-processing stages, such as sentence segmentation, stop-word removal, punctuation removal, synonym replacement and stemming, to help identify plagiarized texts. Further information can be found here.
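The pre-processing stages listed above can be sketched as a small pipeline. This is a minimal illustration in English using only the standard library. The stop-word list, the synonym table and the suffix-stripping stemmer are tiny hypothetical stand-ins, and a Sinhala pipeline would need its own language-specific resources.

```python
import re
import string

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}
SYNONYMS = {"car": "automobile", "quick": "fast"}  # word -> canonical form
SUFFIXES = ("ing", "ed", "s")                      # naive stemming rules

def split_sentences(text):
    """Sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(sentence):
    """Punctuation removal, stop-word removal, synonym replacement, stemming."""
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in no_punct.lower().split() if w not in STOP_WORDS]
    words = [SYNONYMS.get(w, w) for w in words]
    return [stem(w) for w in words]

text = "The quick car is parked. Students are copying the answers!"
processed = [preprocess(s) for s in split_sentences(text)]
print(processed)
```

After these steps, two sentences that differ only in punctuation, stop words or listed synonyms reduce to the same word sequence, which makes later comparison easier.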

Most Natural Language Processing methodologies use Machine Learning internally, especially to train a model that classifies documents according to their level of plagiarism. These areas are still under research, but the work is progressing at a highly satisfactory level.
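To make the idea of training a model on plagiarism levels concrete, here is a minimal sketch of a nearest-centroid classifier over a single feature: a similarity score between 0 and 1. The training pairs and the level names `none`, `partial` and `heavy` are invented for illustration. A real system would use many features and real labeled documents.

```python
from collections import defaultdict

# Hypothetical training data: (similarity score, level assigned by a reviewer).
TRAINING = [
    (0.05, "none"), (0.10, "none"), (0.15, "none"),
    (0.40, "partial"), (0.50, "partial"), (0.60, "partial"),
    (0.85, "heavy"), (0.90, "heavy"), (0.95, "heavy"),
]

def train(samples):
    """Learn one centroid (mean score) per plagiarism level."""
    scores = defaultdict(list)
    for score, label in samples:
        scores[label].append(score)
    return {label: sum(values) / len(values) for label, values in scores.items()}

def classify(centroids, score):
    """Return the level whose centroid is closest to the given score."""
    return min(centroids, key=lambda label: abs(centroids[label] - score))

centroids = train(TRAINING)
for score in (0.2, 0.55, 0.8):
    print(score, classify(centroids, score))
```

The model here is nothing more than three averages, but the shape is the same as in larger systems: learn from labeled examples, then assign a level to an unseen document.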

Limitations of Plagiarism Detection

Current plagiarism detection tools are mostly limited to string-level comparisons between suspected plagiarized texts and potential original texts. Most automatic plagiarism detection tools are weak in two areas.

1) Non-Verbatim Plagiarism: Plagiarism that involves rewriting, translating or otherwise redrafting the text can’t be detected properly.

2) Common Phrasing/Attributed Use: Though many plagiarism checkers attempt to separate out attributed use, the variety of attribution styles means this isn’t always possible. In addition, because some phrases are simply common in the English language, many plagiarism checkers report matches that are actually just coincidence.
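Both weaknesses can be reproduced with a toy string-level matcher. The sketch below compares texts by shared word trigrams. A paraphrase produces no match at all, while two unrelated sentences match on a stock phrase until that phrase is filtered out. The sample sentences and the `COMMON` phrase list are hypothetical.

```python
def trigrams(text):
    """Return the set of word trigrams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def shared(a, b, ignore=frozenset()):
    """Trigrams present in both texts, minus any that should be ignored."""
    return (trigrams(a) & trigrams(b)) - ignore

# Weakness 1: a paraphrase keeps the meaning but shares no trigram.
original = "the results show that the method improves accuracy"
paraphrase = "our findings indicate this approach raises precision"
print(shared(original, paraphrase))

# Weakness 2: unrelated sentences match on a common phrase.
first = "as a result of the storm the road was closed"
second = "the match was cancelled as a result of the rain"
print(shared(first, second))

# Hypothetical list of stock phrases to ignore when matching.
COMMON = trigrams("as a result of the")
print(shared(first, second, COMMON))
```

Filtering a phrase list helps with coincidental matches, but it does nothing for the paraphrase case, which needs comparison at the level of meaning rather than strings.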

These areas are still being researched for improvements. There is hope that these weaknesses can be addressed with ML and NLP concepts in the near future.

Written by Erandi Ganepola

Software Engineer@WSO2 | Open-source Contributor | Basketball Enthusiast