Are we undervaluing Open Access by not correctly factoring in the potentially huge impacts of Machine learning? — An academic librarian’s view (I)
Synopsis: I have recently come around to the position that the benefits of machine learning techniques applied to the research literature are likely to be both real and large. This is based on the recent, incredible results of LLMs (Large Language Models) and about a year of experimenting with some of the newly emerging tools built on such technologies.
If I am right about this, are we academic librarians systematically undervaluing Open Access by not taking this into account sufficiently when negotiating with publishers? Given that we control the purse strings, we are among the most influential parties (next to publishers and researchers) in deciding how fast, if at all, the transition to an Open Access world occurs. For example, should we be willing to pay a bit more to speed up the transition from a subscription world to an Open Access world, even at the cost of higher outflows from us? This could come in the form of the prices we pay in so-called Transformative deals like Read and Publish agreements or Subscribe to Open (S2O).
In this two-part series, I will first describe some of the recent developments that have led me to this thinking. Part 2 will then discuss the choices facing academic librarians when it comes to providing access to journal articles and, even if this argument is true, how much weight (if any) we should give it when negotiating access.
Open Access has a long, storied history going back to the end of the 20th century. Yet I think that until recently, the main benefit of Open Access was seen by many as removing barriers to access for people who could not afford to pay, including researchers from the Global South and ordinary citizens not affiliated with a university.
For example, the answer given by OpenAI’s Large Language Model (LLM) — ChatGPT below is typical.
Of course, one other benefit that tends to be overlooked, or at least seldom mentioned in my experience, particularly by librarians, is how in an Open Access world we can use machines to plough through the world’s research literature to look for patterns and even possibly synthesize knowledge, leading to vastly greater effectiveness and efficiency in the way we do research.
The dream — Open Access is not just for humans, but also for machines!
One of the reasons why discussions of the benefits of Open Access tended to focus on access for humans rather than machines is that until recently, the level of Open Access was so low that doing text mining, NLP, etc. in bulk was pointless. But as the level of Open Access slowly rose, things began to change.
This is not to say researchers were slow to apply machine learning techniques in specific areas of research. In domains like materials science, finance, and linguistics, they applied them to abstracts if not full text, and made quite big splashes (see coverage in Nature on text mining issues). But not much was done in a large, cross-disciplinary way due to the difficulty of getting access. It is also fair to say most academic librarians have remained on the sidelines in terms of supporting, or even understanding, the value of text mining.
Back in 2018, OurResearch, the startup behind Unpaywall (which had become the industry standard for locating and pointing to Open Access copies, amassing over 20 million articles), decided to launch a search engine, and not just any search engine.
We’re building the “AI-powered support tools” now. What kind of tools? Well, let’s go back to the Hamlet example…today, publishers solve the context problem for readers of Shakespeare by adding notes to the text that define and explain difficult words and phrases. We’re gonna do the same thing for 20 million scholarly articles. And that’s just the start…we’re also working on concept maps, automated plain-language translations (think automatic Simple Wikipedia), structured abstracts, topic guides, and more. Thanks to recent progress in AI, all this can be automated, so we can do it at scale. That’s new. And it’s big.
Were 20 million open access papers enough? This was a moon-shot of sorts, and they weren’t alone: the even better funded Allen Institute for AI, with its flagship academic search engine Semantic Scholar (also see the S2ORC corpus and the Semantic Scholar Academic Graph API), was working on the same problem, trying to apply state-of-the-art NLP and deep learning techniques to the research corpus. Examples include single-sentence summarization of papers (the TLDR feature) and classification of citation intent, both implemented in Semantic Scholar.
It was a noble dream: an attempt to really explore the possibilities Open Access affords us and to “finally cash the cheques written by the Open Access movement,” as OurResearch put it.
Ultimately, OurResearch’s attempt, Gettheresearch, did not gain traction, as its offering did not seem better than standard academic search engines like Google Scholar.
So, for a time, it seemed to some, like me, that trying to apply deep learning to research papers was more hype than real benefit.
The unreasonable effectiveness of Transformer based Deep Learning and Large Language Models (LLMs) changes things.
Since 2018, Open Access has continued to march on. In 2020, according to Dimensions, we hit a tipping point where, for the first time, the majority of the published research literature was Open Access.
Other developments, such as the push towards Open Science and the rise of open scholarly metadata (including Open Citations, which reached its own tipping point in less than 5 years, and open abstracts), meant that there was increasingly more data available for deep learning to work on as research trended towards becoming more open and transparent (see trends like open peer review, Open Data, and preprints).
But all this would be moot if there were no advancements in deep learning and machine learning.
As you know, this is precisely an area that has seen huge advancements in the last 10 years (embeddings for NLP, image recognition) and the last 5 years (large language models).
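To make the embeddings idea concrete, here is a minimal sketch of how comparing embedding vectors with cosine similarity lets a machine judge whether two terms are semantically related. The vectors below are made-up toy numbers for illustration, not the output of any real model:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy 4-dimensional "embeddings" (real NLP models learn vectors with
# hundreds of dimensions from text; these numbers are purely illustrative).
library = [0.9, 0.1, 0.3, 0.0]
archive = [0.8, 0.2, 0.4, 0.1]
banana  = [0.0, 0.9, 0.1, 0.8]

print(cosine_similarity(library, archive))  # high: related concepts
print(cosine_similarity(library, banana))   # low: unrelated concepts
```

The same arithmetic, applied to vectors produced by a trained model, is what lets search engines rank papers by semantic relevance rather than exact keyword matches.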
The world has been shocked by the rise of what are now known as large language models. Starting with OpenAI’s GPT-2 in 2019 and GPT-3 in 2020, the unreasonable effectiveness of such HUGE language models on all sorts of NLP tasks started an arms race, with big tech companies like DeepMind and Meta rushing to build and train ever larger models.
GPT-2 had 1.5B parameters, which was considered massive at the time, but GPT-3 had 175B parameters! Others like DeepMind’s Gopher and Google’s PaLM are even larger, with PaLM having 540 billion parameters (though there is some suggestion that you can get better performance with smaller models trained on more data, like Chinchilla’s 70B model).
Part of this gold rush is due to the realization that there is still little evidence of diminishing returns, at least as measured by the various NLP benchmarks, and that one can get increasing state-of-the-art performance by scaling up these LLMs (though we may be running out of quality training data).
It’s no exaggeration to say that we are not even close to exhausting the potential of LLMs. Even without scaling up the models, there are several fairly obvious ways to improve them: from better techniques to align LLMs via human feedback, so they produce less “harmful” output (e.g. are less likely to make up things or “hallucinate”, or to produce racist or violent output), to bolting on a search engine or knowledge base such as Google to ground their output in facts.
As I write in early December 2022, the internet is abuzz about OpenAI’s ChatGPT, which seems to be a step above GPT-3 due to, basically, better alignment.
Like its predecessors, it can write code for you, tell jokes, do math problems and much more. If you have not tried a state-of-the-art LLM since 2018, please go to OpenAI and give it a try. I guarantee you will be amazed!
More impressively, it hallucinates a lot less (I see people saying around 5%), seems to have “better memory”, and unlike GPT-3 stays on topic for longer periods and can even recall what was said in earlier prompts. All this despite ChatGPT being considered just a GPT-3.5 model and not the highly anticipated GPT-4, which is supposed to be an even bigger leap in capabilities.
Similar to OpenAI’s ChatGPT, which was designed as a chatbot, Google’s LaMDA was trained specifically as a “conversational agent”. This LLM included a knowledge base, but it did not seem to have as many safeguards as ChatGPT, and it was famously so good at chatting that it convinced a Google engineer that it was sentient!
LLM use in academia
If you try a state-of-the-art LLM for the first time, you will be amazed at how versatile it is. Unlike past machine learning models, you do not need to specially train a model for the task at hand; LLMs are automatically pretty good at many tasks.
For example, I found that even the early, unrefined GPT-3 in 2020 could convincingly play an academic librarian answering reference questions at a desk, and in my experiments it could generate fake news on domains as obscure as Singapore politics!
Of course, further fine-tuning such models with data specific to the task at hand could potentially create even more amazing models.
If you look at LLMs and the data they are trained on, you will realize they are mostly trained not on scientific papers but on Wikipedia, common webpages (Common Crawl), social media pages (or links from them), etc.
What if we trained LLMs on scientific research papers? BERT language models such as BioBERT, SciBERT and PubMedBERT have been built this way, but they are relatively small models trained on domain-specific papers.
An attempt to create a BERT model trained on almost all scientific papers regardless of domain, over 75 million papers, produced ScholarBERT, but even this, at 770M parameters, is still considered small. More importantly, BERT language models do not allow users the same kind of generic prompting that GPT and its cousins do.
As Josh Nicholson’s coverage of “How to build a GPT for Science” notes, it is unclear how they got access to this amount of full text, though I note one of the authors of the paper is Carl Malamud, who is known for launching the General Index, an index of words (a listing of unigrams to five-grams) in over 100 million papers.
Recently, Meta launched Galactica, a 120B-parameter model trained on 48 million papers, plus code, knowledge bases, reference material and Common Crawl data. Aside from ScholarBERT, this was the first LLM to be trained mostly on academic data. It was also custom trained with features specific to academia, such as:
- Tokens for references
- Recognizing LaTeX
- Recognizing code, DNA sequences, chemicals, etc.
However, it is important to note that this LLM was trained mostly on preprints and on Semantic Scholar and PubMed abstracts (not full text), so even it does not show the full potential of what could be done.
And of course, Meta ultimately took down its public demo after complaints that the LLM was too dangerous: among other things, it would produce plausible, authoritative-sounding papers or literature reviews (often in a Wikipedia layout) that were wrong in ways difficult for a layperson to notice.
It is important to note that while the online web demo is down, the model itself is still freely available. Even someone like me managed to load the model and prompt it with just three lines of code using Google Colab.
The other approach to LLM use cases in academia is startups and companies fine-tuning existing large LLMs like GPT-3 on academic corpora and using the resulting models for information retrieval, summarization and data extraction from papers.
Elicit.org is probably one of the first products to combine GPT-3 with an academic corpus (from Semantic Scholar) to try to improve the discovery process.
Among other features, it allows you to extract predefined or even custom properties for each paper and display them in a research matrix layout, where each paper is shown as a row, with columns covering things like:
- Paper metadata (e.g. title, abstract, DOI, journal, funder)
- Population and sample characteristics (e.g. age, number of participants)
- Intervention (e.g. intervention, dose, duration)
- Methodology-based columns (e.g. preregistered?, detailed study design)
- “Question relevant summary” and “Takeaway suggests yes/no”, used if your query is a question
- Miscellaneous: “limitations of paper”, or even your own custom columns such as “dataset used”
It is important to note that such data extractions use GPT-3 to “read” the paper (usually the abstract, but increasingly the full text) and generate what it thinks is the answer, and yes, the answer can sometimes be wrong. The nice thing is that you can open each article and verify the sentences it used to generate the answer.
It is interesting to note that while many metadata columns (e.g. DOI, abstract, citation counts) are obtained directly from the source (Semantic Scholar), other fields like funder may be extracted directly from the full text by the language model, or by even simpler techniques.
For example, in one paper, Elicit.org states that the region of the study is Australia. Why does it think so? You can see the highlighted text for that evidence below.
Elicit has many other features, but I will end by showing how it can summarise the answer from the top four papers in the results.
Because this is such an obvious idea, I am seeing other academic discovery products emerge that leverage LLMs in similar ways. Examples include Scispace and Consensus.app.
Scispace also builds on GPT-3, and you can use it at the individual article level to ask questions. A unique feature is that you can highlight parts of the text and ask the “AI copilot” to explain the sentences to you.
You can even clip math equations or tables, and it will try to explain them to you!
Many more systems today try to function as Q&A systems that directly extract answers or claims from full text instead of ranking documents. Consensus.app is an example.
All in all, many of these systems are starting to show what you can do when you combine machine/deep learning with the full text of papers, and this is just the beginning.
For more, see my coverage of Q&A academic systems: Elicit.org, Scispace, Consensus.app, Scite.ai and Galactica.
Are LLMs just hype?
Just like with any new technology, there are detractors.
A very influential paper, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”, critiques the trend towards LLMs as risky and harmful.
Besides the environmental harm from the carbon costs of training such huge models, the authors worry about such models encoding biases and stereotypes. This is a particular concern given that most LLMs are trained on huge, uncurated datasets, raising the chance that they might “reproduce and even amplify the biases in their input”.
From a philosophical point of view, the authors also doubt that traditional LLMs truly “understand”. An earlier paper by some of the same authors suggested that LLMs learn only the “form” of sentences and not “meaning”, where “meaning” is defined as “the relation between the form and something external to language”.
If there is any “meaning”, it comes from the way humans read into the generated sentences. Perhaps another way to put it is this: as LLMs learn by association, they cannot learn real cause and effect, which is needed for “true” understanding.
Of course, others disagree, arguing that such capabilities can emerge: LLMs may learn cause-and-effect logic even though the underlying mechanism (at a basic level, just neural networks with backpropagation) is purely associative.
After all, it is likely our brains are purely associative too. They argue there is no reason why LLMs, given enough examples, cannot learn Boolean logic at the sub-network level.
Moreover, even if you do not believe that is the case, even the harshest critics of LLMs sometimes concede that
if form is augmented with grounding data of some kind, then meaning can conceivably be learned to the extent that the communicative intent is represented in that data
and they give examples of datasets
which pair input/output tuples of linguistic forms with an explicit semantic relation (e.g. text + hypothesis + “entailed”). Similarly, control codes, or tokens like tl;dr, have been used to prompt large LMs to perform summarization and other tasks
In fact, the last example, as you saw earlier, is very common in academic tasks, such as Semantic Scholar’s learning of TLDRs.
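To illustrate what a “control code” like tl;dr looks like in practice, here is a minimal sketch of TL;DR-style prompting. The prompt construction is ordinary string handling; the model call is a hypothetical placeholder (no specific API is described in this post), and the abstract text is invented for illustration:

```python
def make_tldr_prompt(abstract: str) -> str:
    """Append a TL;DR control token so a generative LM continues with a summary."""
    return abstract.strip() + "\n\nTL;DR:"

prompt = make_tldr_prompt(
    "We examine whether open access to full-text papers improves the "
    "quality of machine-generated literature summaries. ..."
)

# A generative LLM would then complete the prompt, e.g. (hypothetical client):
# summary = llm_client.complete(prompt, max_tokens=40)
print(prompt.endswith("TL;DR:"))  # True
```

Because the model has seen many web pages where text followed by “TL;DR:” is continued with a summary, this simple cue is often enough to elicit summarization without any task-specific training.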
All in all, I am just a humble librarian, so I am not qualified to weigh in on such matters. The only thing I can say is that, leaving aside the philosophical debate, if LLMs work in practice, whether they really “understand” may be a moot point.
Lastly, there is no reason why LLMs combined with other AI/ML approaches (e.g. symbolic AI) cannot lead to even greater capabilities.
While it is likely we are currently at “the peak of inflated expectations” on the Gartner hype cycle when it comes to LLMs, it seems unlikely to me that there is nothing there at all. In fact, I expect the eventual “plateau of productivity” for LLMs to be quite high.
If this is indeed true, it makes the benefits of Open Access far greater than we normally assume. Free access to all scientific knowledge for all humans is a great boon, but this benefit may be dwarfed by the possibilities of using deep learning and AI on Open Access papers to improve research!
This is where academic librarians come in. While most of us do not have much expertise in machine learning or AI, as the ones who control the purse strings, we have a great impact (second only to publishers) on how and when this transition occurs.
As things stand, we are slowly but inevitably moving from a subscription-based world to an Open Access world.
But are we undervaluing Open Access? If this argument holds true, how should we adjust our positions? Or, stepping back, what are some obstacles to a transition to an Open Access world? More on this in part 2.