Human Learning Journal: AI-generated posts and plagiarism

Published in impactIA

8 min read, Dec 3, 2020

This weekly journal reports my ups and downs as a machine learning and robotics intern at the impactIA Foundation. Try to have fun reading it, but know that I have little fun writing it. You have been warned.

Today I want to talk about writing. As I have said in the preface of my posts for more than two months, I don’t really enjoy writing. However, last week, while mechanically reading my Google Feed, I came across an intriguing AI called AI Writer. Its purpose is simple: write content for you. Blog posts, sports articles, product reviews, it can write anything. Could this AI be the solution I have always been looking for?

Very excited by this discovery, I immediately tried it. I created a free account and followed a few easy steps: I provided an article title, waited a few minutes and… it worked! Eight hundred words on “The rise of artificial intelligence in Geneva”, written while waiting for my coffee to be ready. Amazing, right? The best part is that this content is not a compilation of texts scraped here and there from other websites, it is 100% original content!

…Or at least 93% original content. In addition to generating the article, AI Writer compares the resulting text with all the sources it used to generate it, and gives a similarity percentage. I also have access to all these sources, to verify where the information comes from. So far, this AI seems unbelievable and you must think that there is no reason not to use it every week to write my posts, right? In fact, who wrote this post? AI Writer or myself?

Rest assured, you are not reading an AI-generated text, and there is an easy way to tell. Take a look at the text I got using AI Writer:

The Rise Of Artificial Intelligence In Geneva

Disclaimer: This article is an example of AI based text generation by AI Writer. According to AI Writer, it is 93%…

v-kindschi.medium.com

Independently, each paragraph is okay-ish, but as there are almost no semantic links between the paragraphs, the article as a whole is not very enjoyable to read. It is more like a compilation of independent paragraphs talking about AI. More precisely, I have the feeling that AI Writer is not writing a whole new article using solely its knowledge of the world and a headline. Here is how I think it works:

First, AI Writer looks online for around twenty sources based on the headline I gave, in my case: “The rise of artificial intelligence in Geneva”.
Then, it randomly selects some of the sources, and uses them to slightly fine-tune a transformer model.
The fine-tuned model is used to write a single paragraph, of random length.
Finally, the previous two steps are repeated until the text is long enough.
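
The steps above can be sketched in a few lines of Python. This is only an illustration of my guess, not AI Writer’s actual code: the functions `search_sources`, `fine_tune` and `generate_paragraph` are hypothetical placeholders that return dummy data, and the numbers (twenty sources, five selected, 800 words) are assumptions taken from the description above.

```python
import random


def search_sources(headline, n=20):
    # Hypothetical stand-in for a web search: returns n placeholder source texts.
    return [f"Source {i} about {headline}." for i in range(n)]


def fine_tune(base_model, sources):
    # Hypothetical stand-in for slightly fine-tuning a transformer on a few sources.
    return {"base": base_model, "sources": sources}


def generate_paragraph(model, rng):
    # Hypothetical stand-in for text generation: a paragraph of random length
    # built only from the sources this particular model was tuned on.
    n_sentences = rng.randint(2, 5)
    return " ".join(rng.choice(model["sources"]) for _ in range(n_sentences))


def write_article(headline, min_words=800, seed=0):
    rng = random.Random(seed)
    sources = search_sources(headline)               # step 1: find sources
    paragraphs = []
    while sum(len(p.split()) for p in paragraphs) < min_words:
        subset = rng.sample(sources, k=5)            # step 2: pick sources at random
        model = fine_tune("base-transformer", subset)
        paragraphs.append(generate_paragraph(model, rng))  # step 3: one paragraph
    return "\n\n".join(paragraphs)                   # step 4: stop when long enough


article = write_article("The rise of artificial intelligence in Geneva")
print(len(article.split()), "words in", article.count("\n\n") + 1, "paragraphs")
```

Notice that nothing is passed from one loop iteration to the next: each paragraph is produced without any knowledge of the previous ones.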

If my guess about this procedure is correct, the fact that the text is written paragraph by paragraph, and not as a whole, might explain the lack of overall structure and of links between paragraphs. In the end, this tool is nice, but incomplete. It lacks customization options and the resulting text is, in my opinion, below average: I would not read such an article to the end.

However, what is available out of the box on the web does not reflect the actual capabilities of AI. Most AI models need to be tuned to fit a specific use case. One of the best text generation AIs (if not the best) is called GPT and was created by OpenAI. Their latest release, GPT-3, is so much better than what I have shown you above that they decided not to make it freely available, so that they can review how their technology is used. Similarly, the previous model, GPT-2, was at first released only in a small version, to avoid putting too powerful a tool openly on the internet.

We have a mandatory production review process before proposed applications can go live. In production reviews, we evaluate applications across a few axes, asking questions like: Is this a currently supported use case?, How open-ended is the application?, How risky is the application?, How do you plan to address potential misuse?, and Who are the end users of your application?.
— OpenAI GPT-3 API terms

There are plenty of examples of GPT-3 online (just Google it). One of my favorites is called AI Dungeon: it lets you create a character and live a randomized fiction with different themes: explore worlds, fantasy, mystery, zombies, … Try it, it’s a lot of fun, and sometimes surprising! I also read about a novelist using AI to help her write, which I find very cool. She understood that AIs must be used as tools to work with, and not enslaved to do all the work. I really hope that more and more people will start to realize that AIs are not meant to replace us, but rather to empower us to do more.

Nevertheless, not everything is good with AI. I want to come back to the danger of AIs, specifically in automated text generation. We can see that OpenAI is well aware of the potential misuse of its technology. Unfortunately, not all companies have the same concerns. I am particularly worried about tools used to plagiarize or flood the internet with fake news or useless content.

The example of AI Writer that I used before seemed honest, providing a uniqueness measurement when generating texts. However, AI Writer comes with another tool that worries me more. It is called “Text rewording” and does exactly what you think: it rewrites any text you provide, potentially letting you steal content while evading plagiarism checks. I tried it with my last blog post about graphics cards and AI and fortunately it does not work so well… yet. The first paragraph was reworded from this:

Thanks to my work at impactIA, I was recently able to afford a new graphic card for my gaming computer. I can now play any recent game with the best graphics settings, taunt my friends, but more importantly train new AIs very quickly ! But how does it work ? How did the gaming industry participate in the rise of machine learning ?

To this:

Machine learning and artificial intelligence in general have grown exponentially in recent years. How does this work and how does the gaming industry participate in the rise of machine learning? You can now mock your friends and play current games with the best graphics and settings, but more importantly, you can practice a new AI very quickly.

Because some words are changed, the result suffers from inconsistency between paragraphs, as well as strange wording and changes in the meaning of some sentences. Sadly, I am certain that it will be possible to correctly reword any text online in the near future. I just hope that plagiarism verification tools will have time to get up to speed and become able to detect AI rewording.
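
To give an idea of why rewording is a problem for plagiarism checks, here is a minimal sketch that compares the two paragraphs above by counting the three-word sequences they share. The measure (Jaccard overlap of word trigrams) is my own assumption for illustration; I do not know which measure AI Writer or real plagiarism checkers actually use, and `word_ngrams` and `jaccard` are names I made up.

```python
import re


def word_ngrams(text, n=3):
    # Lowercase the text, keep only words, and collect every run of n consecutive words.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=3):
    # Share of word n-grams that the two texts have in common (0 = none, 1 = identical).
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


original = (
    "Thanks to my work at impactIA, I was recently able to afford a new graphic "
    "card for my gaming computer. I can now play any recent game with the best "
    "graphics settings, taunt my friends, but more importantly train new AIs very "
    "quickly ! But how does it work ? How did the gaming industry participate in "
    "the rise of machine learning ?"
)
reworded = (
    "Machine learning and artificial intelligence in general have grown "
    "exponentially in recent years. How does this work and how does the gaming "
    "industry participate in the rise of machine learning? You can now mock your "
    "friends and play current games with the best graphics and settings, but more "
    "importantly, you can practice a new AI very quickly."
)

print(f"Trigram overlap: {jaccard(original, reworded):.0%}")
```

Even with this clumsy rewording, only a small share of the three-word sequences survive, so a checker that relies on exact word matches would have little to flag.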

In addition to plagiarism on the internet, my second concern is the increasing number of unnecessary or duplicate pages. This is in part due to the way Google works. To rank better, i.e. to appear among the first results of many Google searches, websites are pushed to upload new content regularly, because creating new pages increases the number of keywords that users might search for.

Why Fresh Content is Critical For SEO

This phrase was actually coined by Bill Gates in 1996, in an essay with the same title. Gates predicted that content would be…

seodigitalgroup.com

Therefore, more and more businesses want their website to be updated often, and to do so they have two solutions: hire a writer to create content or use software that can write content for them. However, as we saw, the AI text generation tools available out of the box do not work well yet. So how is it possible to create fresh content automatically without AI and without hiring someone?

Well, it isn’t. The current solution is to steal content from other websites. The problem with doing so is that Google detects plagiarism and will flag these new pages as “duplicate content”. However, clever web developers found a way around this check. The trick is to translate the content twice: once from the original language to a grammatically close language, and then back to the original language. Depending on the translator used, the resulting text will be slightly different and is usually not detected as duplicate by Google. Today, using AI-based translators like DeepL, it is very easy to get a premium quality translation, and it costs close to nothing.

The most frightening thing is that tools to automate this copy-translate-publish process have existed for almost ten years! For instance WordPress Automatic Plugin, which was created in 2012, allows you to scrape WordPress websites, copy some paragraphs, translate them twice, (cite your sources), and publish seamlessly. You can even set the frequency of new posts from every other week to every 60 minutes. And yes, with WordPress Automatic Plugin, citing the sources is optional. You can also remove links from the original posts, replace them with yours and even reword the article (or “spin the post”, as they call it) before publishing it. Everything is designed to be easy, and everything is automatic.

This plugin is rated 4.8/5 stars, has been sold 22,500 times, is legal and approved by WordPress. Apparently no one is bothered by the fact that it makes undetected plagiarism easily available to anyone. Welcome to the internet.

To conclude, I’ve shown you what can be done today with AI-based text generators. Keep in mind that there is still a gap between what can be done out of the box with pretrained models and what can be done with fine-tuned, well-polished models. Nevertheless, this technology can be very toxic for the future of the internet, and we will need more tools to navigate between the AI bots, as well as regulations to protect intellectual property online. Web content plagiarism has existed since the beginning of the internet and is becoming more and more difficult to track. We already know that we cannot trust everything we read online, but now more than ever we must question the authenticity of the sources too:

Note: The audio quality demonstrated here was additionally degraded since we want to avoid improper use of this technology. The purpose of this video is to excite the class about the potential of deep learning, not to deceive anyone. Thus, we purposely lowered the audio quality before publishing to make the synthetic aspect of this video clearer. — Alexander Amini


Written by Valentin Kindschi