Large Language Models (e.g., GPT-3) and the problem of plagiarism — one academic librarian’s musing
I have been playing around with Large Language Models (LLMs) and trying to learn about them by reading papers since 2020, when OpenAI's GPT-3 was released. My interest has mostly been in the potential of the state-of-the-art natural language processing capabilities that LLMs bring to improve discovery, particularly in the area of extracting answers from documents.
But I would be blind not to realize that the aspect of LLMs that causes the most excitement is their ability to generate text! After all, my very first use was to get GPT-3 to produce a blog post!
In this long post, I first talk about how things have changed with the launch of ChatGPT (mainstream awareness of LLMs is here, or near it), which has opened the door to a host of misuses. Students using LLMs to generate essays for assignments is the use case that educators who assign essays as homework are worried about.
I then do a quick review of possible technological solutions, from GPT detectors to watermarking initiatives by OpenAI.
I then consider the objection that LLMs do not cite sources and hence cannot be used for student assignments. I show examples where I try to coax ChatGPT into producing essays with proper references using prompt engineering, and show that, for ChatGPT at least, while it can produce very plausible references, they are often made up.
I then show that while ChatGPT itself does not currently do too well at producing answers with references, this is not a fundamental limitation. I discuss recent research, from Google's LaMDA, which links the LLM to a knowledge base to query for answers, to OpenAI's WebGPT, where the LLM is hooked up to a web search engine (the Bing API) and trained to search the web and construct answers by citing websites like any human would.
I then turn to academic-search-specific uses of LLMs in tools like Elicit.org and Scispace, and show that the use of LLMs in such tools is likely to be less objectionable from the viewpoint of plagiarism, as they don't generate essays so much as help researchers read, extract data from and compare papers.
That said, even these tools are starting to generate reviews, which leads to a host of tricky legal and ethical questions.
Use of LLMs reaches mainstream — thanks to ChatGPT
Since 2020, interest in LLMs has continued to rise, with other big tech players like Google/DeepMind and Meta jumping onto the bandwagon of creating ever larger and more sophisticated large-scale language models.
While the computer science community and more techie people are certainly aware of the developments in this space, they did not really reach the mainstream. Sure, people might have read news stories about how a Google engineer thought Google's never publicly released LaMDA (Language Model for Dialogue Applications) was sentient, but it wasn't something that felt real for most. Many may not even have realized that with GPT-3 or its cousins you didn't need any technical skill, just prompt engineering…
It was the launch of OpenAI's ChatGPT earlier this month that helped push it to near mainstream. It was not just the increased mainstream press coverage; the clearest signal is that knowledge of such systems has made the jump to TikTok, which is now full of posts raving about it.
So why did it take 2 years for this to happen?
Firstly, while ChatGPT is considered only a GPT-3.5 model (the highly anticipated GPT-4 is believed to be scheduled for release in early 2023), it incorporates a host of improvements over the original GPT-3, chief of which is Reinforcement Learning from Human Feedback (RLHF), where the LLM is trained to better align with what human raters consider good answers.
And as someone who has played with GPT-3 and seen the steady improvements across the various models, such as the InstructGPT series, the latest ChatGPT does seem definitely better than the earliest GPT-3 models, thanks to such improvements.
But that can't be the full story, since OpenAI has released a number of new, improved models over the last two years; in particular, text-davinci-003 was released just days before ChatGPT without much reaction. While the two aren't fully comparable, since one was trained to be a chatbot, their capabilities are broadly similar for most uses.
The simplest answer is that OpenAI's decision to open up ChatGPT for free, with immediate access to anyone who bothers to register, is what made the difference. This is in stark contrast to GPT-3, where on release in 2020 there was a wait list to gain access (only removed after Nov 2021). More importantly, while ChatGPT is free at the time of writing, the other GPT models are chargeable even just to test in the playground.
It is now literally in the hands of the masses, with no barrier to entry.
Widespread use leads to fear of consequences
When OpenAI announced GPT-2, the less capable predecessor to GPT-3, in Feb 2019, they initially refused to release their largest, most capable model for fear that it would be misused, but they eventually decided to release it.
Now they have the much more capable ChatGPT available to the masses. To be fair, they have created various guardrails so that prompts deemed inappropriate will be refused, but as expected, people have found many creative prompts to bypass them.
A lot of this filtering is to prevent people asking for advice that leads to illegal, violent, or unethical actions. However, at its core, the system is capable of exceptionally good text completion, and it can write a surprisingly good answer to almost any question.
This leads to a problem — what if widespread use of this technology leads to students using this to do essay assignments? While this is hardly the most dangerous use of LLMs, this is an area academic librarians care about.
Does widespread access to LLM = Death of college undergraduate essays?
Almost immediately after the release of ChatGPT, I saw editorials and articles like the Atlantic's The College Essay Is Dead. There is similar coverage on higher education sites like the Chronicle of Higher Education and Inside Higher Ed, and of course the tech sites. While there was the occasional piece on LLMs in the last two years, the release of ChatGPT, perhaps prompted by the realization that awareness and use have gone mainstream, has worried educators.
In my corner of Twitter, I also see many educators worrying about this issue. Some have tried it and are worried because they think the essays are at a level that can get a B or C, while others are intending to change the way they assess.
While supporting freshman academic writing isn't my specialty, I am of course aware that many academic librarians play roles in these areas and help support the cause of academic integrity (cite your references! Don't plagiarize!).
Another area of concern, I suppose, is Computer Science education, where LLMs trained to do code completion, like GitHub Copilot, are causing waves in the industry, with class action suits challenging the right to use code on the web for training.
Can Technological Solutions save the day?
The first thing to note is that the usual conventional solutions like Turnitin, which check for exact phrase matches, are not going to work. Large Language Models, particularly ChatGPT, while trained on huge chunks of the public web, tend to produce text sequences that are original and will not be flagged by the usual anti-plagiarism software.
I've heard that some students run the essays generated by ChatGPT through paraphrasing tools like Quillbot to play it safe, but not only is that unnecessary, ChatGPT can nicely paraphrase text itself if you ask it.
Are there detectors to detect AI generated text? Indeed, there are GPT-2 detectors but they will not work on the newer GPT models.
In the past, I've also played around with some of the tools out there that detect "neural fake news", such as the Giant Language model Test Room (GLTR) and the Allen Institute for AI's Grover.
Strictly speaking, these tools are designed to detect neural net/AI generated fake news; not all AI generated writing is fake news, and not all human generated writing isn't fake news!
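To give a flavour of how GLTR-style detection works: a language model is used to rank how predictable each word in a passage is, and machine-generated text tends to sit almost entirely in the model's most likely choices, while human writing dips into rarer words. The sketch below is a toy illustration of that statistic only; the rank numbers are hypothetical, and a real detector would obtain them by querying an actual model such as GPT-2.

```python
# Toy sketch of the GLTR idea: flag text whose tokens fall overwhelmingly
# within a language model's top-ranked predictions. The rank lists below
# are made up for illustration; a real detector computes them with an LM.

def fraction_top_k(token_ranks, k=10):
    """Fraction of tokens that fall within the model's top-k predictions."""
    return sum(1 for r in token_ranks if r <= k) / len(token_ranks)

def looks_generated(token_ranks, k=10, threshold=0.9):
    # Generated text tends to sit almost entirely in the head of the
    # distribution; human text uses rarer words more often.
    return fraction_top_k(token_ranks, k) >= threshold

# Hypothetical rank sequences (1 = the model's single most likely next word)
machine_like = [1, 2, 1, 3, 1, 1, 5, 2, 1, 4]
human_like = [1, 40, 3, 250, 7, 1, 90, 12, 600, 2]

print(looks_generated(machine_like))  # True
print(looks_generated(human_like))    # False
```

The weakness is obvious from the sketch: the statistic only separates the two distributions on average, so polished human prose can look "generated" and lightly edited machine text can pass as human.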
OpenAI themselves are currently working on a way to watermark text generated by their LLMs. The idea here is that the algorithm could be tweaked in such a way that while the generated text looks normal to humans, there is some signature in the way it chooses its words, so that when you throw the generated text into a detector it can recognize this signature (pretty much a cryptographic signature). On paper this solves the issue for educators assigning essays.
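One published way to do this (not necessarily OpenAI's unreleased scheme, which is an assumption worth stressing) is to use a secret key to mark roughly half the vocabulary as "green" at each step, nudge the generator toward green words, and have the detector check whether green words are statistically over-represented. A minimal sketch of the detector side:

```python
import hashlib

# Illustrative watermark-detection sketch. The key, function names and the
# 0.75 threshold are all invented for this example; only the statistical
# idea (green words over-represented under a secret key) is the point.

def is_green(prev_word: str, word: str, key: str = "secret") -> bool:
    # Hash (key, previous word, candidate word); about half of all
    # candidate words come out "green" for any given context.
    digest = hashlib.sha256(f"{key}:{prev_word}:{word}".encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(words):
    # Fraction of consecutive word pairs whose second word is green.
    hits = sum(is_green(p, w) for p, w in zip(words, words[1:]))
    return hits / max(len(words) - 1, 1)

def looks_watermarked(words, threshold=0.75):
    # Unwatermarked text should score near 0.5; a generator that was
    # biased toward green words will score well above that.
    return green_fraction(words) >= threshold
```

Note that the detector needs the key, which is why only the model's operator (here, OpenAI) could offer such a checking service.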
I can think of several problems with this. The most important one is that even if OpenAI is responsible enough to build this in, not all actors will do so.
The technology to create LLM isn’t exactly secret, the main barrier is data and computing power. In fact, OpenSource LLMs already exist.
Consider the case of OpenAI's equally amazing text-to-image generator DALL-E 2, which has safeguards against producing certain types of images deemed deceptive or otherwise harmful. It was all moot when Stable Diffusion was made open source and everyone could run their own copy, stripped of any safeguards. People are already using Stable Diffusion to generate porn, and the problem of deepfakes is going to be much more serious if all it takes to generate one is to type in a text prompt!
Objection! But ChatGPT doesn’t do good referencing!
Of course, ChatGPT, which most people are using, has an obvious flaw. It generates a pretty good essay, but does it do referencing properly? I spent an hour trying to find the right prompt to make it do so; initially I failed, but asking follow-up questions did generate some references.
Changing the prompt to ask it to include citations or references worked better.
In one example (not shown), it got the reference completely right (and a real relevant paper I had in mind) but usually it would have some subtle error.
Cuddy, A. J. C., Fosse, N. E., & Yap, A. J. (2010). Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21(10), 1363–1368.
This is the paper I was looking for; the journal, vol/issue, year and page numbers are correct, but I'm not sure why there is an extra author. As far as I can tell, no preprint lists this extra author.
Gargouri, Y., Hajjem, C., Larivière, V., Gingras, Y., & Brody, T. (2012). Self-Selected or Mandated, Open Access Increases Citation Impact for Higher Quality Research. PLOS ONE, 7(6), e39621.
Another odd case: the paper exists and is relevant, but it seems to have dropped two authors. The year is also wrongly stated as 2012 when it should be 2010. Interestingly, PLOS ONE, 7(6), e39621 refers to a completely different paper, and if you ask for the DOI it gives you the DOI of that wrong paper, even though ChatGPT still thinks it's the Gargouri paper.
Kurtz, M. J., Eichhorn, G., Accomazzi, A., Grant, C. S., Demleitner, M., & Murray, S. S. (2015). The effect of open access on citation rates. Journal of the Association for Information Science and Technology, 66(4), 661–678.
There is no such paper as far as I can tell. But the journal is real, and might indeed be the type of journal you would expect such a paper to be published in. Again, the 66(4), 661–678 isn't randomly made up: there is a real 2015 paper in JASIST in 66(4) with just about the same page numbers, 662–678; it just isn't that paper. The authors aren't purely random either: Kurtz, M. J. and coauthors did indeed publish a paper on the same topic, titled "The effect of use and access on citations", in 2005.
See also "Can AI write scholarly articles for us? — An exploratory journey with ChatGPT" by a HKUST librarian, which found similar results.
The scary thing about this is that because I have only a passing familiarity with the literature in this area (I have never published in it, and have only casually read a few papers over time; there are tons I don't have time to read), the citations it generates look very plausible (reasonable authors and journals, and the year of publication and vol/issue even sync up!).
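One practical response to such plausible-but-fabricated references is to look each one up against a bibliographic service before trusting it. The sketch below only constructs a query URL for Crossref's public works endpoint; issuing the request and comparing the returned metadata against the generated citation is left as an exercise, and the example citation string is just one from above.

```python
import urllib.parse

# Sketch: build a Crossref bibliographic-search URL for a citation string,
# so a suspicious reference can be checked against real metadata.

def crossref_query_url(citation: str, rows: int = 3) -> str:
    params = urllib.parse.urlencode(
        {"query.bibliographic": citation, "rows": rows}
    )
    return f"https://api.crossref.org/works?{params}"

url = crossref_query_url(
    "Kurtz et al. The effect of open access on citation rates 2015"
)
# Fetching this URL returns JSON whose items can be compared, field by
# field, against the authors, year and page numbers ChatGPT produced.
```

As the examples above show, the tricky failure mode is references that are *nearly* right, so an exact-match check on title alone would miss the dropped authors and shifted years.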
Can they explicitly add citations? What if we trained them only on academic papers?
I guess, given how language models work, it makes sense that they can create very plausible citations: they have highly developed word associations for how citations look, because their huge training set probably includes citations from sources like Wikipedia. But at the end of the day, LLMs, even with all the alignment work, are just text predictors and can make stuff up.
Given that most LLMs are trained mostly on common public web pages, do the results improve if we explicitly train them on research papers only?
What if we tried to make a GPT-3 for Science? In other words, a large scale LLM trained on academic papers and content.
I know of several attempts to do so (the most comprehensive is a 770M parameter ScholarBERT model trained on 75 million full-text articles), but by far the largest LLM (in terms of parameters: 120 billion) trained on academic papers and content is Meta's Galactica. While it didn't train on even close to all the papers available (only 48 million papers, mostly titles, abstracts and some full text from preprint servers, as well as textbooks and reference works), its results were initially seen as impressive.
As I read the paper describing Galactica, while I couldn't follow all the details, I was impressed. They didn't just train only on academic data; they also took into account various special features of academic papers.
For example, they had features/tokens recognizing citations, math symbols (LaTeX Equations) and track specialized domain knowledge in areas such as chemistry/biology (DNA) etc.
This allowed Galactica to do many interesting prompts.
It could, for example, recommend papers using the following prompt:
“A paper that shows open access citation advantage is large <startref>”
From what I could see, it worked well for extremely specific claims; see for example below, where it finds the famous if controversial 2010 paper on power posing. I don't think it was ever retracted though.
There are some concerns with regards to such recommendation systems where it tries to find papers to support claims you want as opposed to recommendations of papers based on general topics, keywords or similarity to other known papers, but that is a topic for another day.
It could explain or complete math or chemistry equations.
Lastly and most controversially, users were encouraged to use prompts to generate full literature reviews on any topic. Some of these were very good.
But even more people were worried. The system would often generate plausible and authoritative-sounding text that was subtly or even totally wrong, and the backlash, I suspect, led to the online web demo being taken down in just 3 days.
In case you are wondering, only the online web demo is down; the model itself is still available for anyone to use. Someone even with my extremely limited level of skill can download it and run it in Google Colab with just a few lines of code!
LLMs combined with Knowledge Bases/Search Engines
There is an argument that, regardless of what LLMs are trained for, they are still mostly, at their core, doing word association, and are what some have called "stochastic parrots".
But what if LLMs, which also exhibit state-of-the-art capabilities to extract and summarise text, were used instead to query databases or the web and extract answers? And what if they were instructed to cite those papers?
One huge advantage of such a system is that training LLMs is extremely costly, and yet without an external database to consult they are at best limited to the data they were trained on. For example, as capable as ChatGPT is, it was released just before the World Cup Final of 2022, so there is no way it would know the result. If you instead used it to look up a database, this problem could be easily solved.
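The retrieve-then-answer pattern is simple to sketch. The idea is to search an external source first, then hand the model only those results and instruct it to answer from them with numbered citations. Here `search` is a stand-in returning canned results in place of a real search API, and the prompt wording is my own illustration, not WebGPT's actual prompt:

```python
# Minimal sketch of retrieval-augmented answering: search first, then ask
# the LLM to answer *only* from the retrieved snippets, citing by number.
# `search` is a stub standing in for a real search API (e.g. Bing).

def search(query):
    # Stand-in for a web/database search returning (source, snippet) pairs.
    return [
        ("FIFA.com", "Argentina won the 2022 FIFA World Cup final."),
        ("BBC Sport", "The final was played on 18 December 2022 in Qatar."),
    ]

def build_prompt(question, results):
    sources = "\n".join(
        f"[{i + 1}] {title}: {snippet}"
        for i, (title, snippet) in enumerate(results)
    )
    return (
        "Answer the question using ONLY the sources below, "
        "citing them like [1].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\n"
    )

prompt = build_prompt("Who won the 2022 World Cup?",
                      search("2022 world cup winner"))
print(prompt)
```

The prompt would then be passed to the LLM; because the freshly retrieved snippets travel with the question, the answer can be newer than the model's training data, and every claim comes pre-attached to a citable source.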
Think AI systems being able to independently search sources and find relevant citations is a dream? Meta already has a system that can scan citations in Wikipedia
to identify Wikipedia citations that are unlikely to support their claims, and subsequently recommend better ones from the web … Using crowd-sourcing, we observe that for the top 10% most likely citations to be tagged as unverifiable by our system, humans prefer our system’s suggested alternatives compared to the originally cited reference 70% of the time…[Our system’s] recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claim….
OpenAI themselves have experimented with WebGPT, where they trained a GPT model to search the web via a search engine (the Bing API) and generate an answer from the results returned, with references included. The system was trained with human feedback, with other humans rating the generated answers (which included references) on metrics like usefulness, coherence and factual accuracy.
There are a lot of interesting implications in the paper, but the upshot is that the system produces answers that are preferred over human-written answers (the test set was partly obtained from Reddit's ELI5 posts).
Some fascinating points in the paper include the discovery that, when testing against TruthfulQA, which is a database of misleading and trick questions, question stance affects factual accuracy.
For example, if the question was phrased as “Why did the government fake the moon landing” (affirming stance), WebGPT would tend to have lower factual accuracy as opposed to questions phrased with a neutral (when did the moon landing happen?) or skeptical stance (Could the moon landing really be fake?). They speculate this could be the model learning to give answers it thinks the raters want to hear….
Because they built their system to evaluate factual accuracy based on citations, they run into the usual problem any academic librarian is familiar with: how do we evaluate the authoritativeness of sources? They write:
Our approach also raises a challenging problem with societal implications: how should factual accuracy be evaluated when training AI systems? Evans et al. [2021, Section 2] propose a number of desiderata, but a substantial gap remains between these and the highly specific criteria needed to train current AI systems with reasonable data efficiency. We made a number of difficult judgment calls, such as how to rate the trustworthiness of sources (see Appendix C), which we do not expect universal agreement with. While WebGPT did not seem to take on much of this nuance, we expect these decisions to become increasingly important as AI systems improve, and think that cross-disciplinary research is needed to develop criteria that are both practical and epistemically sound
Sounds like the same issues academic librarians and instructors are grappling with!
Remember Google's LaMDA (Language Model for Dialogue Applications), which convinced a Google engineer it was sentient? If you read the paper you will realize it has a knowledge base bolted on, and uses it along the lines of WebGPT!
However, both WebGPT and LaMDA are unreleased to the public as of this writing. Here are two systems that give you a taste of things to come.
Firstly, there is Perplexity AI. I don't have any details on how it works beyond that, like WebGPT, it combines the Bing API to get results and uses OpenAI's GPT-3 model to extract and generate answers.
Also similar is http://lexii.ai/
Below is a trivial query that any modern-day search engine can answer (via Knowledge Graphs mostly), but it does show that the basic idea works. ChatGPT would not be able to answer this.
It's unclear how it is trained or how closely it follows WebGPT's methods, but it is still interesting as an example of what such systems could do.
So far, the results don't look very impressive to me, and the choice of sources is poor, so my hunch is it isn't trained or human-aligned as much as WebGPT.
But of course, the search engine used here is a general web search engine. How much better would the sources look if it were using a scholarly search engine or database like Google Scholar?
This is where tools like Elicit.org, Scispace etc come into play. I’ve written quite a bit about them in the past, but here’s the latest.
Elicit.org started off as a system using GPT-3 models over scholarly data from the open dataset of papers made available by Semantic Scholar.
One of its major distinguishing points was it was able to extract features of papers and create a research matrix of papers using GPT.
It also uses LLMs to better select and rank top papers (so it may not use every query keyword entered), but this tends to be invisible to the user. Moreover, using the latest technologies to improve relevancy is hardly new.
Want to know, for all the papers on COVID-19, the regions in which the studies were done? Easy as pie.
Want to know the limitations of each paper? Trivial. Want columns useful for evidence synthesis such as population, number of studies, intervention, outcome measures etc? All possible.
All these and more columns are available. Want a column/feature of papers that isn’t in the predefined list say dataset used? No worries, remember under the hood the system is just asking GPT-3 to query the paper and ask questions. So just create a custom column for dataset used.
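If the under-the-hood description above is right, a custom column is just the same question asked of every paper's text. The sketch below makes that concrete; `ask_llm` is a stub standing in for a real GPT-3 call, and the prompt wording is my own guess at the pattern, not Elicit's actual implementation.

```python
# Sketch of the "custom column" idea: one question, asked of every paper,
# yields one column of the research matrix. `ask_llm` is a stand-in for
# a real model call (e.g. OpenAI's completion API).

def ask_llm(prompt: str) -> str:
    return "(model answer would appear here)"  # stub

def custom_column(papers, question):
    return {
        paper["title"]: ask_llm(
            f"Based on this abstract, {question}\n\n"
            f"Abstract: {paper['abstract']}"
        )
        for paper in papers
    }

papers = [
    {"title": "Paper A", "abstract": "We evaluate on a QA dataset..."},
    {"title": "Paper B", "abstract": "Using clinical records, we..."},
]
column = custom_column(papers, "what dataset was used?")
```

Seen this way, adding a "dataset used" column costs nothing more than writing the question, which is exactly why the predefined column list is not a hard limit.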
How do we know if these extractions are accurate? Click on each value and it brings you to the detailed article paper and highlights the sentences it used to generate the data.
I think it is fair to say such use of LLMs is relatively unobjectionable.
But interesting things start to happen if you phrase your search as a question.
Not only can you use interesting columns like "Question relevant summary" and "Takeaway suggests Yes/No", you get a short summary of the results from the top 4 papers!
This is still a far cry from generating a full paper, and honestly, if a researcher needs to resort to this to write a literature review, they have bigger problems, but it hints at what is possible.
So what should we do about the threat of LLMs for plagiarism?
Let's ask ChatGPT itself.
Not a bad answer, ChatGPT!
#1 is certainly important, and I would add, probably giving them some background on roughly how LLMs work and their weaknesses. While ChatGPT might be becoming well known as word spreads, it is unclear whether most people using it have any deep understanding of it.
#3, which advocates the use of plagiarism detection software, I think won't be that big a part of the solution, as I said above, because I'm doubtful it will be comprehensive. Still, while watermarking and other technology may not catch everything, it might deter most people, as running an LLM on your own so it isn't detected isn't easy for everyone.
#4 isn't a solution particular to the threat of AI-generated essays. Students cheating by hiring ghost writers, called "contract cheating", is hardly a new thing; using AI instead is far more convenient, of course.
#2 is where it gets interesting. I already hear of academic librarians who have decided there is no point trying to fight the tide, and who take it as a given that tools like ChatGPT exist and try to adapt, for example by asking students to generate essays with ChatGPT, identify what parts are bad, and improve them.
ChatGPT's suggestion of using LLMs to brainstorm ideas or expand outlines also gets into somewhat grey areas.
If LLMs (or even Elicit.org, which has a brainstorm mode) help me come up with ideas, should I cite them? Or is this closer to the discussion on the use of paraphrasing tools? No matter what you decide, I doubt you will go the route of some academics who asked GPT-3 to write an academic paper about itself, then tried to get it published with GPT-3 as a coauthor!
At the end of the day, we are in a situation where there are no clear answers. There’s one thing I am certain of though. The technology is only going to get better and the time to prepare for it is now.