Week 2 — Success of AI Writers

Ahmet Emre Usta
AIN311 Fall 2022 Projects
Dec 2, 2022

We briefly discussed our project and objectives last week. Visit the link below to read it.

This week we will talk about plagiarism and the dataset we will use.

Definition of Plagiarism

As previously stated, plagiarism is the act of using someone else’s ideas and work without giving due credit. To solve a problem, one must first understand it fully, so in this blog post we look at plagiarism in greater detail. There are two main kinds of plagiarism: textual plagiarism and source code plagiarism.


- Textual Plagiarism

Textual plagiarism is a frequent concern in education. For instance, scholarly research and student assignments are both checked for it.

Textual plagiarism itself comes in several types:

  • Copy-paste plagiarism
  • Paraphrasing
  • Self-plagiarism
  • Idea plagiarism
  • Translated plagiarism

- Source Code Plagiarism

Using another person’s code without attribution is regarded as plagiarism, just like textual plagiarism. Plagiarism in programming involves more than copying existing code. It also includes reusing the algorithm behind the code, or its comments, without citing the source.

Plagiarism Detection

The main goal of plagiarism detection is to identify the passages in which two documents contain the same information. Plagiarism can be detected manually or automatically. Manual detection is difficult because every reader interprets a piece of writing differently. Additionally, as the amount of information increases, it becomes harder to spot related parts of the text. To detect plagiarism manually, teachers and instructors must read every assignment and be familiar with all possible sources of plagiarism.


Computerized plagiarism detection systems have advanced, making it easier and quicker to identify instances of plagiarism. By comparing many documents at once, an automatic plagiarism detection system aims to reduce the time needed for text comparisons. This makes it practical to search a large number of electronic sources for writings that might be related. An automated system should also help reduce both the number of cases incorrectly classified as plagiarism (false positives) and the number incorrectly classified as non-plagiarized (false negatives).
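The two kinds of misclassification described above are commonly called false positives and false negatives. Here is a minimal sketch of how they could be counted. The function name and the (system verdict, ground truth) format are our own assumptions, and the sample outcomes are made up for illustration.

```python
def count_errors(results):
    """Count false positives and false negatives.

    `results` is a list of (predicted, actual) booleans, where True
    means "plagiarized". This format is an assumption for the sketch.
    """
    false_positives = sum(1 for predicted, actual in results if predicted and not actual)
    false_negatives = sum(1 for predicted, actual in results if not predicted and actual)
    return false_positives, false_negatives


# Made-up outcomes for five documents: (system verdict, ground truth)
sample = [(True, True), (True, False), (False, True), (False, False), (True, True)]
fp, fn = count_errors(sample)
print(f"false positives: {fp}, false negatives: {fn}")
```

A good detection system keeps both counts low. Pushing only one of them down is easy, for example by never flagging anything.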

Similarity Metrics

Analyzing similarity is the fundamental step in the detection of plagiarism. The similarity between two documents or text fragments must be identified and calculated in this context.
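To make "calculating similarity" concrete, here is a minimal sketch of one very simple measure, the Jaccard similarity over word sets. It is only an illustration and not the method we will use in this project. The example sentences are invented.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity(text_a, text_b):
    """Share of distinct words that both texts have in common (0.0 to 1.0)."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


original = "The cat sat on the mat"
copied = "The cat sat on the mat"
paraphrased = "A feline rested on the rug"

print(jaccard_similarity(original, copied))
print(jaccard_similarity(original, paraphrased))
```

The copied sentence scores 1.0, while the paraphrase shares only 2 of 9 distinct words (about 0.22) even though it means nearly the same thing. Word overlap alone is therefore a weak signal for paraphrased text.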

Plagiarism detection applications are used to compute this similarity metric. Their primary job is to report similarity ratios in cases where plagiarism is disputed, not to decide whether plagiarism has occurred. Because plagiarism is a serious accusation that can carry legal consequences, the final decision is left to people.
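The division of labor described above can be sketched as a tool that only reports a ratio and flags a pair for human review. This sketch uses `difflib.SequenceMatcher` from the Python standard library. The threshold value and the report format are hypothetical and chosen only for this example.

```python
from difflib import SequenceMatcher

REVIEW_THRESHOLD = 0.8  # hypothetical cut-off, chosen only for this example


def similarity_report(submitted, source, threshold=REVIEW_THRESHOLD):
    """Return a similarity ratio and whether a human should review the pair.

    The function never decides that something is plagiarism. It only
    reports how similar the two texts are at the character level.
    """
    ratio = SequenceMatcher(None, submitted, source).ratio()
    return {"similarity": round(ratio, 2), "needs_human_review": ratio >= threshold}


report = similarity_report(
    "Plagiarism is using someone else's work without credit.",
    "Plagiarism is using someone else's work without giving credit.",
)
print(report)
```

The output is a number plus a flag. The verdict itself stays with the instructor or reviewer.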

There are numerous methods for calculating the similarity ratio.

A few examples:

  • Character-Based Methods
  • Syntactic-Based Methods
  • Semantic-Based Methods
  • Grammar-Based Methods
  • Citation-Based Methods
  • Vector-Based Methods
  • Fuzzy-Based Methods
  • Cross-Lingual Methods
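To give a feel for how two of these families differ, here is a minimal sketch that compares the same pair of sentences with a character-based representation (character trigrams) and a vector-based one (word counts), both scored with cosine similarity. The function names are our own and the sentences are invented. Real systems are considerably more elaborate.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Character-based view: count overlapping n-character slices."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_counts(text):
    """Vector-based view: a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * vec_b[key] for key, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a = "the quick brown fox"
b = "the quick brown dog"
print(cosine_similarity(char_ngrams(a), char_ngrams(b)))
print(cosine_similarity(word_counts(a), word_counts(b)))
```

Both scores depend only on the surface form of the text. Neither knows that two different words can mean the same thing.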

Most plagiarism detection applications in use today rely on one of the methods listed above. Because we will be studying sentences that have been paraphrased by AI writers, we decided that the semantic-based method would be the best option.

Semantic-based methods focus on a document’s semantic representation and identify paraphrasing that has the same meaning as the original text.

We will cover the method we plan to use in more detail in the next article. For now, let’s talk about the dataset we’ll be using. After all, data is the backbone of any artificial intelligence project.

The Stanford Natural Language Inference (SNLI) Corpus

According to the SNLI website,

“Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is the task of determining the inference relation between two (short, ordered) texts: entailment, contradiction, or neutral”

The Stanford Natural Language Processing Group
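One way to picture the task is as a small record holding two ordered texts and one of three labels. The sketch below is only an illustration. The class name `NLIPair` is our own, the field names follow the common premise/hypothesis terminology, and the sentences are invented rather than taken from the dataset.

```python
from dataclasses import dataclass
from typing import Literal

Label = Literal["entailment", "contradiction", "neutral"]


@dataclass(frozen=True)
class NLIPair:
    premise: str      # the first text
    hypothesis: str   # the second text; the order of the two matters
    label: Label      # the inference relation between them


# Invented examples, one per label
examples = [
    NLIPair("A man is playing a guitar on stage.", "A person is making music.", "entailment"),
    NLIPair("A man is playing a guitar on stage.", "The man is asleep at home.", "contradiction"),
    NLIPair("A man is playing a guitar on stage.", "The man is a famous musician.", "neutral"),
]

for pair in examples:
    print(f"{pair.label:>13}: {pair.hypothesis}")
```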

In 2015, a group of researchers from The Stanford NLP Group published the first version of the dataset they created called SNLI.

The human-written dataset, consisting of 570 thousand English sentence pairs, was manually labeled and made publicly available for the development of NLP (Natural Language Processing) models that solve the RTE problem.

For more detailed information about the dataset creation process, you can check out the paper “A large annotated corpus for learning natural language inference” [1].

As for the numbers, the dataset is published by the group in three parts: training, validation, and test sets.
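Each part of the downloadable corpus is distributed as a file with one JSON record per line. Below is a minimal sketch of reading such records, assuming the `sentence1`, `sentence2`, and `gold_label` fields of the SNLI JSONL files. The function name is our own, and the sample lines are invented stand-ins, not real dataset rows.

```python
import json


def read_snli_lines(lines):
    """Parse SNLI-style JSON lines into (premise, hypothesis, label) tuples."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record["gold_label"] == "-":
            continue  # skip pairs where annotators reached no consensus label
        pairs.append((record["sentence1"], record["sentence2"], record["gold_label"]))
    return pairs


# Invented stand-in for a few lines of one split file
sample_lines = [
    '{"gold_label": "entailment", "sentence1": "A dog runs in a park.", "sentence2": "An animal is outside."}',
    '{"gold_label": "-", "sentence1": "A dog runs in a park.", "sentence2": "The dog is chasing a ball."}',
    '{"gold_label": "contradiction", "sentence1": "A dog runs in a park.", "sentence2": "A cat sleeps indoors."}',
]
pairs = read_snli_lines(sample_lines)
print(len(pairs))
```

Because the function accepts any iterable of lines, the same call should also work on an open file handle for each of the three split files.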

Table of Dataset Numbers

Each premise sentence is paired with three hypothesis sentences, one for each label: entailment, contradiction, and neutral.

Pie chart representation of the distribution of the SNLI dataset
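The shares shown in a pie chart like this one can be computed by counting labels. Here is a minimal sketch with a hypothetical helper. The label list is invented and does not reflect the real SNLI counts.

```python
from collections import Counter


def label_distribution(labels):
    """Return the percentage share of each label, rounded to one decimal."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * count / total, 1) for label, count in counts.items()}


# Invented labels, not real SNLI counts
labels = ["entailment"] * 4 + ["contradiction"] * 3 + ["neutral"] * 3
print(label_distribution(labels))
```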

For a better understanding, you can examine the table below.

In this blog post, we tried to explain plagiarism and how it is detected. We also gave away that the similarity method we will use is semantic-based. We’ll go over the method in more detail in the following blog post, but our curious readers may figure it out when they look at the dataset.

Until the next blog post

Take care

Ahmet Emre Usta

Hüseyin Yiğit Ülker

References

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Kadir Yalcin, Ilyas Cicekli, and Gonenc Ercan. 2022. An external plagiarism detection system based on part-of-speech (POS) tag n-grams and word embedding. Expert Systems with Applications, Volume 197, 116677, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2022.116677.

The Detection is in the Details | Turnitin
