Stopping fake research papers from getting published using machine learning

Aaron Edell
3 min read · Mar 2, 2018

Figure 1 from a published research paper

“In 45 minutes, I was able to detect 10 times more fake articles than I was previously able to detect in a year.” This is an actual quote from Jeffrey Gordon, a computer scientist and Ph.D. candidate who used Classificationbox to train a model to detect fake papers in research journals.

Fake Research

There is a massive fake research problem in academia right now.

There are cases where human-made articles recapping movies or a particularly horrific episode of Star Trek: Voyager are being published in multiple journals. There’s even an example of a published article that consists entirely of the same 7 words over and over again.

An actual published paper

There are predatory journals that “take advantage of inexperienced researchers under pressure to publish their work in any outlet that seems superficially legitimate.” A Finnish study found that, between 2010 and 2014, the number of articles published by predatory journals grew from fifty-three thousand to almost half a million.

And worse yet, seemingly valid articles can now be generated by a machine (one such tool is called SCIgen). They are utter nonsense, and yet they still get published. In 2013, the IEEE pulled 120 papers from its publications because they were found to be computer-generated.

A SCIgen paper that was published in a ‘peer-reviewed’ journal

With so many predatory journals and fake research papers, academic institutions, journalists, and researchers themselves are having a hard time sorting through the noise. The problem erodes trust in the scientific process, can lead to fake or misleading news, makes legitimate scientific research more difficult, and “allows unqualified researchers to build their resumes.”

Solve it

A Machine Box customer recently decided to train a model to detect these papers in a corpus of research journals to see if he might be able to combat the problem.

“My colleagues and I have been trying to solve this for years,” Jeffrey told me. “With Classificationbox I was able to solve it in three hours.”

He sourced 1,000 example articles for this experiment: 500 machine-generated, fake research articles and 500 genuine ones, all of which he showed to Classificationbox. After creating the model, he wrote a script to run 1.5 million unknown articles through it to see if it could accurately detect the fake ones.
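Jeffrey's own script isn't shown here, but the workflow he describes (teach the model labeled examples, then score unknown articles) can be sketched roughly as follows. Treat this as an illustration under assumptions, not his actual code: the box address, the model ID, the class names, the `body` input key, and the 0.9 threshold are all made up for the example, and the endpoint paths and JSON shapes reflect my recollection of the Classificationbox HTTP API, so check them against the documentation served by your own box before relying on them.

```python
import json
import urllib.request

# Assumptions for illustration only: a Classificationbox container reachable
# at this address, plus a hypothetical model ID and class names.
BASE_URL = "http://localhost:8080"
MODEL_ID = "fake-paper-detector"
CLASSES = ["fake", "genuine"]


def text_input(text):
    """Wrap an article body as a single text feature (the key name is arbitrary)."""
    return {"key": "body", "type": "text", "value": text}


def create_model_payload():
    """Request body for creating a two-class model."""
    return {"id": MODEL_ID, "name": MODEL_ID, "classes": CLASSES}


def teach_payload(label, text):
    """Request body for teaching the model one labeled example."""
    if label not in CLASSES:
        raise ValueError(f"unknown class: {label}")
    return {"class": label, "inputs": [text_input(text)]}


def predict_payload(text):
    """Request body for asking the model to classify one article."""
    return {"inputs": [text_input(text)]}


def top_class(response):
    """Return (class_id, score) for the highest-scoring class in a prediction."""
    best = max(response["classes"], key=lambda c: c["score"])
    return best["id"], best["score"]


def post_json(path, payload):
    """POST a JSON payload to the box and decode the JSON reply."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))


def train(examples):
    """Create the model, then teach it every (label, text) pair."""
    post_json("/classificationbox/models", create_model_payload())
    for label, text in examples:
        path = f"/classificationbox/models/{MODEL_ID}/teach"
        post_json(path, teach_payload(label, text))


def flag_fakes(articles, threshold=0.9):
    """Yield (article_id, score) for articles predicted fake above the threshold."""
    for article_id, text in articles:
        path = f"/classificationbox/models/{MODEL_ID}/predict"
        label, score = top_class(post_json(path, predict_payload(text)))
        if label == "fake" and score >= threshold:
            yield article_id, score
```

With a box running, `train` would be called once with the 1,000 labeled examples, and `flag_fakes` would then be run over the unknown corpus. The threshold trades missed fakes against false alarms, so it is worth tuning on a held-out set of labeled articles.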

“It correctly identified all 600+ SCIgen articles in about 45 minutes!”

Performing a laborious task that would take a human a prohibitively long time to complete is a perfect use of machine learning.

What is Machine Box?

Machine Box puts state-of-the-art machine learning capabilities into Docker containers so developers like you can quickly incorporate natural language processing, facial detection, object recognition, and more into your own apps.

The boxes are built for scale, so when your app really takes off, just add more boxes horizontally, to infinity and beyond. Oh, and it’s way cheaper than any of the cloud services (and the boxes might even be better)… and your data doesn’t leave your infrastructure.

Have a play and let us know what you think.

Aaron Edell

Co-founder Machine Box (exited)| Entrepreneur | Business Development at Amazon | Agile Product Owner | Author | Father | Amateur Programmer | opinions are mine