What is Plagiarism detection?

Published in

Analytics Vidhya

5 min read · Oct 4, 2018

Due to the ever-increasing volume of electronic content and easy access to the World Wide Web, plagiarism in academia, research, journalism, and literature has become a major issue. But do you know what plagiarism is and how to prevent or detect it? If you are a university student or a content writer, this article will be useful to you.

What is Plagiarism?

Mind map for Plagiarism — Graph by author

It is hard to give an exact definition of the word plagiarism, but according to the Merriam-Webster dictionary, the simple meaning of plagiarism is “To use the words or ideas of another person as if they were your own words or ideas”. Plagiarism also includes:

Turning in someone else’s work as your own.
Copying words or ideas from someone else without giving credit.
Failing to put quotations in quotation marks.
Giving incorrect information about the source of the quotation.
Changing words but copying the sentence structure of a source without giving credit.
Copying so many words or ideas from a source that it makes up the majority of your work, whether you give credit or not.

There are two main classes of methods used to reduce plagiarism.

Plagiarism prevention:
Punishment routines and procedures that explain the drawbacks of plagiarism. These take a long time to implement but have a long-term positive effect.
Plagiarism detection:
Manual methods and software tools. These are easy to implement but have only a short-term positive effect.

Plagiarism Detection

Plagiarism detection can be done manually or with an automated process. The automated process has much in common with natural language processing, visual identification, and biometric processes, all of which are founded on pattern recognition. Automated detection is not 100% accurate, so manual checking is still needed.

Internal Plagiarism Detection

Internal plagiarism detection means finding plagiarized passages within a document without access to the potential original text. It is also called intrinsic plagiarism detection.

External Plagiarism Detection

External plagiarism detection consists of comparing suspicious documents against potential original documents.

Plagiarism Detection in source code

Detecting plagiarism in source code is relatively easier than detecting it in natural language, because programming languages have neither ambiguity nor interference between words. In natural language, by contrast, every word may have many synonyms and different meanings. Some plagiarism detection methods are language-independent and some are language-dependent.

Plagiarism Detection in natural language

This means detecting plagiarism in written documents. The methods can be divided into two categories: language-independent plagiarism detection and language-dependent plagiarism detection.

Language-Independent Plagiarism Detection

Language-independent methods are based on evaluating text characteristics that are common to all languages, such as the number of special characters and the average length of a sentence. Paraphrasing techniques can be used to mislead language-independent systems.
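As a rough illustration of the two characteristics mentioned above, the following sketch counts special characters and computes the average sentence length in words. The function name and the sentence-splitting heuristic are assumptions made for this example, not part of any particular detection tool.

```python
import re

def language_independent_features(text):
    """Compute two surface features that need no language-specific resources."""
    # Rough heuristic: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = text.split()
    # A "special character" here is anything that is neither alphanumeric nor whitespace.
    special_chars = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    avg_sentence_length = len(words) / len(sentences) if sentences else 0.0
    return {"special_chars": special_chars, "avg_sentence_length": avg_sentence_length}

features = language_independent_features("Hello, world! This is a short test. Is it?")
print(features)
```

Because only surface counts are used, rewording a passage while keeping its meaning can shift these numbers, which is why paraphrasing can mislead such systems.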

Language-Dependent Plagiarism Detection

These methods are based on evaluating text characteristics that are specific to one language, such as the frequency of a special word in that language. Language-dependent plagiarism detection is generally more effective than language-independent plagiarism detection.

Stylometry-based methods

Stylometry is a statistical approach used for authorship attribution. Stylometry-based methods are inspired by authorship attribution and essentially classify the writing styles of authors to identify similarities. They rest on the assumption that every author has a unique style. The writing style can be analyzed using factors within a single document, or by comparing two documents by the same author. The documents are divided into parts such as paragraphs and sentences, and the style features are then extracted and analyzed. The main linguistic stylometric features are:

Text statistics that operate at the character level (number of commas, question marks, word lengths, etc.).
Syntactic features that measure writing style at the sentence level (sentence lengths, use of function words, etc.).
Closed-class word sets used to count special words (number of stop words, foreign words, “difficult” words, etc.).
Structural features that reflect text organization (paragraph lengths, chapter lengths, etc.).

Using these features, formulas can be derived to identify the writing style of an author. Stylometry-based methods can be used in both internal and external plagiarism detection.
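The idea of splitting a document into parts and comparing their style features can be sketched as follows. This is a minimal illustration only: the feature set, the tiny function-word list, and the "farthest from the document mean" rule are assumptions chosen for brevity, not a published stylometric formula.

```python
import re
import statistics

# Tiny illustrative set of English function words (an assumption, not a standard list).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "that", "is", "it", "but", "or"}

def style_features(passage):
    """Return a small stylometric feature vector for one passage."""
    words = re.findall(r"[A-Za-z']+", passage.lower())
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]
    n_words = max(len(words), 1)  # avoid division by zero on empty passages
    return {
        # Character-level text statistics
        "commas_per_word": passage.count(",") / n_words,
        "avg_word_length": sum(map(len, words)) / n_words,
        # Sentence-level feature
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        # Closed-class word feature
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / n_words,
    }

def most_deviating_paragraph(document, feature="avg_word_length"):
    """Return the index of the paragraph whose feature is farthest from the document mean."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    values = [style_features(p)[feature] for p in paragraphs]
    mean = statistics.mean(values)
    return max(range(len(values)), key=lambda i: abs(values[i] - mean))
```

A paragraph that deviates strongly from the rest of the document is only a candidate for closer manual inspection; a style shift alone does not prove plagiarism.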

Content-Based methods

Content-based methods analyze the characteristics of texts in terms of their logical structure in order to discover similarities. They can be used only in external plagiarism detection.

Fingerprinting technique

A fingerprint is a set of integers created by hashing subsets of a document to represent its key content. The method measures the similarity of two documents by comparing their fingerprints. Most fingerprinting techniques are based on k-grams, where a k-gram is a contiguous substring of length k.
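A minimal sketch of k-gram fingerprinting is shown below. The normalization step, the use of SHA-256 as the hash, the optional `modulus` subsampling, and the Jaccard overlap used as the similarity score are all assumptions made for this illustration; real systems differ in how they select and compare hashes.

```python
import hashlib
import re

def kgrams(text, k=5):
    """Return all contiguous character substrings of length k after normalization."""
    normalized = re.sub(r"[^a-z0-9]", "", text.lower())  # drop case, spaces, punctuation
    return [normalized[i:i + k] for i in range(len(normalized) - k + 1)]

def fingerprint(text, k=5, modulus=1):
    """Hash every k-gram; keep hashes divisible by `modulus` (1 keeps all of them)."""
    hashes = (int(hashlib.sha256(g.encode()).hexdigest()[:16], 16) for g in kgrams(text, k))
    return {h for h in hashes if h % modulus == 0}

def similarity(a, b, k=5, modulus=1):
    """Jaccard overlap of the two fingerprints, between 0.0 and 1.0."""
    fa, fb = fingerprint(a, k, modulus), fingerprint(b, k, modulus)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)

print(similarity("the quick brown fox jumps", "the quick brown cat jumps"))
```

Setting `modulus` above 1 keeps only a fraction of the hashes, which shrinks the fingerprint at the cost of possibly missing short matches.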

Latent Semantic Analysis (LSA)

In this technique, words that are close in meaning are assumed to occur in similar pieces of text. A matrix is constructed in which rows represent words and columns represent documents. Every document contains only a subset of all words, so the matrix is sparse. Singular Value Decomposition (SVD), a factorization method for real or complex matrices, is used to reduce the number of columns while preserving the similarity structure among rows. This decomposition can be time-consuming because the matrix is typically large and sparse. Words are compared by taking the cosine of the angle between the two vectors formed by any two rows. Values close to 1 represent very similar words, while values close to 0 represent very dissimilar words.
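In symbols, the steps above can be written roughly as follows. The notation is an assumption introduced here for illustration: X is the word-by-document matrix, k is the number of dimensions kept, and the reduced vector for word i is taken as row i of U_k Σ_k.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{w}_i = \left( U_k \Sigma_k \right)_{i,:},
\qquad
\operatorname{sim}(w_i, w_j) = \cos\theta_{ij}
  = \frac{\hat{w}_i \cdot \hat{w}_j}{\lVert \hat{w}_i \rVert \, \lVert \hat{w}_j \rVert}
```

Each word is thus described by k numbers instead of one count per document, and the cosine is computed on these shorter rows.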

Stanford Copy Analysis Mechanism (SCAM)

SCAM is based on a registration-based copy detection scheme. Documents are registered in a repository, and new documents are then compared with the pre-registered ones. The architecture of the copy detection server consists of a repository and a chunker. Chunking breaks a document up into sentences, words, or overlapping sentences. Documents are chunked before being registered, and a new document must be chunked into the same unit before it is compared with pre-registered documents. An inverted index is used for storing the chunks of registered documents. Each entry in the index points to the documents in which that chunk occurs (its postings). Each posting has two parts: the document name and the number of occurrences of the chunk in that document. A small chunk unit increases the probability of finding similarities between documents. The chunk unit in SCAM is a word. Documents are compared using the Relative Frequency Model (RFM), which consists mainly of computing the set of words that occur with similar frequency in the two documents.
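The registration scheme described above can be sketched as a toy server with a word-level chunker and an inverted index. This is a simplified illustration and not the actual SCAM implementation: the class name, the `epsilon` tolerance, and the one-directional overlap score are assumptions that only loosely follow the relative frequency idea.

```python
import re
from collections import Counter, defaultdict

class CopyDetectionServer:
    """Toy registration server: word-level chunker plus inverted index."""

    def __init__(self, epsilon=2.5):
        self.epsilon = epsilon           # assumed tolerance for "similar" frequencies
        self.index = defaultdict(list)   # word -> postings [(doc_name, occurrences)]

    @staticmethod
    def chunk(text):
        # The chunk unit is a single lowercased word; count its occurrences.
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

    def register(self, name, text):
        # Add one posting per distinct word of the registered document.
        for word, count in self.chunk(text).items():
            self.index[word].append((name, count))

    def compare(self, text):
        """Score a new document against each registered document that shares a word with it."""
        query = self.chunk(text)
        query_norm = sum(c * c for c in query.values())
        overlap = defaultdict(int)
        for word, q_count in query.items():
            for name, r_count in self.index.get(word, []):
                # Keep the word only if its counts in both documents are close.
                if self.epsilon - (q_count / r_count + r_count / q_count) > 0:
                    overlap[name] += q_count * r_count
        return {name: min(1.0, total / query_norm) for name, total in overlap.items()}

server = CopyDetectionServer()
server.register("orig", "the cat sat on the mat")
print(server.compare("the cat sat on the mat"))
```

The score here measures how much of the new document is covered by a registered one, so a short excerpt of a registered document can score as high as a full copy.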

Natural Language Processing and Machine Learning for plagiarism detection

NLP is used in pre-processing stages such as sentence segmentation, tokenisation, stop-word removal, punctuation removal, synonym replacement, stemming, and number replacement to help identify plagiarized texts. These pre-processing techniques improve the accuracy and efficiency of the plagiarism detection algorithm. Plagiarism detection can also be addressed effectively through a machine learning approach, and there is ongoing research on this task using ML, neural networks, and deep learning.
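Several of these pre-processing stages can be sketched with the standard library alone. The stop-word list and the suffix-stripping "stemmer" below are deliberately tiny stand-ins chosen for this example; a real pipeline would use a proper stemmer and stop-word list, and synonym replacement is left out because it needs a lexical resource such as a thesaurus.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "and", "to", "in", "on"}
# Naive suffix stripping, used here in place of a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def split_sentences(text):
    """Sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Apply number replacement, punctuation removal, tokenisation, stop-word removal, stemming."""
    sentence = sentence.lower()
    sentence = re.sub(r"\d+(\.\d+)?", "NUM", sentence)  # number replacement
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = sentence.split()  # tokenisation on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [t if t == "NUM" else naive_stem(t) for t in tokens]  # stemming

print(preprocess("The students copied 3 essays in 2018."))
```

After this normalization, two sentences that differ only in inflection, numbers, or punctuation map to the same token sequence, which makes the later comparison step less sensitive to superficial edits.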

Popular software tools for plagiarism detection

The detection of plagiarism is not a new research area. Various approaches have been developed to deal with source code and natural language plagiarism detection.

plagiarism.org and turnitin.com are popular tools for addressing web-based plagiarism. Glatt Plagiarism Services, Inc. offers a user-end, software-based approach to preventing and detecting plagiarism. More details about these technologies can be found in the references listed below.

There are a number of software tools available for plagiarism detection, but most of them are not popular because of their low accuracy. The methods used for plagiarism detection so far are limited to a very superficial level, so plagiarism detection technologies still need to grow.

References:

https://www.researchgate.net/publication/272853366_Detection_of_Plagiarism_in_Arabic_Documents

https://www.researchgate.net/publication/242783426_Using_Natural_Language_Processing_for_Automatic_Detection_of_Plagiarism

https://cs.stanford.edu/people/eroberts/cs201/projects/honor-code/tech.htm