Scientific Reasoning Told Me To Write This Article (featuring Meta AI)

Shirley Wang · Published in deMISTify
Dec 30, 2022 · 6 min read

Will machine learning models progress to the point where they eventually replace even scientific researchers? Meta AI created a model to explore this possibility, and the results were not promising. Let's talk about Galactica.

The Galactica demo website

Part 1: What is Galactica?

In November 2022, Meta AI published their "revolutionary" new large language model (LLM) for science, named Galactica. The goal of Galactica was to help with what the authors call "information overload": there is so much scientific knowledge being published nowadays that it's nigh impossible for any one person to keep track of it all. While computer databases and search engines are helpful for keeping track of all the papers published, they can't reason about the discoveries or generate new ideas from them. That's where Galactica was supposed to come in.

The idea is that a large language model trained specifically on scientific literature would be able to find connections between different lines of research and bring new insights to the surface. It could also potentially generate literature reviews, notes, and articles, and organize different modalities, such as linking papers with code.

The dataset is very important for a case like this, since the language model is explicitly built for scientific reasoning. The authors created a dataset of 48 million papers, textbooks, lecture notes, proteins and compounds, scientific websites, encyclopedias, and more, all processed into a common markdown format with LaTeX for consistency. The model itself is consistent with other recent LLMs: a decoder-only Transformer architecture with a few minor changes, such as removing bias terms.
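For readers who want that architecture note made concrete, here is a minimal sketch of a decoder-only Transformer block with the bias terms removed, written in PyTorch. This is my own illustrative reconstruction, not the paper's actual code, and the dimensions are placeholders:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """A minimal decoder-only Transformer block in the spirit of
    Galactica's architecture notes: self-attention + MLP with bias
    terms removed from the projections. Sizes and details here are
    illustrative, not the paper's actual configuration."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, bias=False, batch_first=True
        )
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model, bias=False),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may only attend to earlier
        # positions, which is what "decoder-only" means in practice.
        n = x.size(1)
        causal = torch.triu(
            torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out              # residual connection
        x = x + self.mlp(self.ln2(x))
        return x
```

Stacking a few dozen of these blocks over a token embedding, plus an output projection back to the vocabulary, gives you the general shape of model Galactica belongs to.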

Example output from the Galactica Paper

In terms of metrics, Galactica performed very well, achieving competitive results and beating previous benchmarks for scientific NLP, reasoning, LaTeX equations, chemical reactions, and citation prediction. The authors also note that Galactica's predictions have much lower stereotypical bias than other similar models.

In terms of limitations, the paper notes that the model was trained only on openly accessible information, while much of science is hidden behind paywalls. The authors also mention that citation prediction is still somewhat biased towards popular papers, and that geometry is very important to certain domains like chemistry and geography, yet a text model has no way of representing it. Another interesting note is that the model has picked up some general societal knowledge from sources like Wikipedia, but the authors don't recommend using it for that.

Part 2: What Went Wrong

If we stopped here, there wouldn't be anything separating this from the story behind any other LLM paper. But the story doesn't stop here for Galactica. In fact, three days after the demo was made public, it was bullied so much online that it was taken down.

Anyone who is familiar with LLMs, or has even played around with one a little, knows that they will sometimes generate incoherent nonsense, or make things up when prompted. Now, the people at Meta AI were not fools; they were aware of this. They even put a disclaimer about Galactica's limitations on the website where they shared it.

However, perhaps you have already noticed a little problem with this. If Galactica is supposed to aggregate scientific knowledge and come up with potential new insights, how do I know whether what it generates is a real insight, and not just the language model generating something random?

Source: Twitter, https://twitter.com/meaningness/status/1592634519269822464

The answer is, there really is no way to know. For the sample output Galactica generated above, it's pretty easy to apply common sense and go "oh, this is a fake article," since most of us know bears have never been to space. But what if you asked it to generate an article about an incredibly niche topic in chemistry? Suddenly the number of people able to verify the validity of the output dramatically decreases.

It also just produces trash sometimes

There are a few more issues that immediately come to mind when we consider Galactica as the "future of science." New papers are always coming out, so Galactica would need to be constantly retrained to absorb that new knowledge. Given that, it would probably be better to invest in some kind of database or search mechanism that automatically injects the relevant knowledge into the model when it's asked for (see the sketch below), but that's not what Galactica is. Galactica is, very simply, another large Transformer that happens to be trained on scientific papers. LLMs are very good at mimicking human language, and Galactica is another stochastic parrot that also happens to perform well on existing scientific reasoning benchmarks.
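For contrast, here is roughly what that database-plus-search alternative (usually called retrieval-augmented generation) looks like in code. This is a toy sketch of the idea, not anything Galactica actually does: the keyword-overlap retriever stands in for a real index like BM25 or a vector database, and `generate` is a hypothetical stand-in for any LLM call.

```python
def search_index(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query.
    A real system would query an index (BM25, embeddings) that can be
    updated as new papers come out, with no retraining needed."""
    q = set(query.lower().split())
    return sorted(
        corpus,
        key=lambda p: len(q & set(p.lower().split())),
        reverse=True,
    )[:k]

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to some language model."""
    raise NotImplementedError

def answer_with_sources(question: str, corpus: list[str]) -> str:
    # Fresh knowledge is injected at query time as context,
    # instead of being frozen into the model's weights.
    passages = search_index(question, corpus)
    prompt = (
        "Answer using ONLY the passages below, citing which passage "
        "supports each claim.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)
```

The key property is that keeping such a system current means updating the index, not retraining the model, and every claim can be traced back to a source passage.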

I will admit, the original concept of what Galactica could theoretically solve isn't bad. It would be nice to ask a model for a nicely curated article on exactly what quantum computing is, instead of having to sift through hundreds of papers to both learn what it is and get caught up on all the most recent work. However, that could be accomplished by a well-built aggregator of online sources rather than a large language model. While Galactica's results are neat, it does not come close to solving the problem it proposes to solve.

Part 3: A New Challenger Approaches

We could stop here, but even more recently, OpenAI released ChatGPT. ChatGPT is another LLM, but it has been made freely available to the public, and its performance is quite good, to the point of becoming the next DALL-E 2, with everyone showing off their generated results online. ChatGPT is also fine-tuned with reinforcement learning from human feedback (RLHF) to improve its results, as opposed to Galactica's usual supervised training (a sketch of the reward-model objective behind RLHF follows the examples below). I would like to show off some results as well, as parallels to what Galactica has been able to produce.

A similarly coherent yet also nonsense article about bears in space.
A coherent answer to the question instead of a trash response.

Both ChatGPT and Galactica can generate fake articles about bears in space, and both read pretty convincingly. However, ChatGPT does seem to have more knowledge than Galactica; it at least produces a coherent (and correct!) response to whether vaccines cause autism.
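On the RLHF point above: the first stage of that pipeline trains a reward model on pairs of responses that humans have ranked, using a pairwise ranking loss. This is the standard formulation from the InstructGPT line of work, sketched here with my own variable names:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor,
                        r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for a reward model.

    r_chosen / r_rejected are the scalar scores the reward model gives
    to the human-preferred and human-rejected responses. Minimizing
    this loss pushes preferred responses to score higher; the trained
    reward model then steers the RL fine-tuning stage.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of four preference comparisons.
chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
rejected = torch.tensor([0.4, 0.9, 1.5, -1.0])
print(reward_ranking_loss(chosen, rejected))  # scalar loss
```

Galactica, by contrast, was trained with plain next-token prediction on its corpus, with no human feedback loop to penalize confident nonsense.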

An important tension with large language models, one many people are aware of, is preventing misuse while still making the models available so that people can actually use them to their advantage. After all, what is the point of improving parts of our life if we don't share the improvements with others? This should apply to computer algorithms as well. ChatGPT does have some limits in place to try to prevent people from generating deepfake articles and conspiracies, but they aren't perfect (hence bears in space). The whole point of Galactica, however, hinged on it being an accurate source of scientific information, which it never could have been to begin with, since it's just a Transformer trained on some papers.

When it comes to generating realistic articles, Galactica and ChatGPT seem fairly similar. ChatGPT also seems to generate more accurate answers than Galactica, but OpenAI isn't going around parading ChatGPT as the future of science. Bad language models are a dime a dozen nowadays; it's quite common for papers about LLMs to report good results while the models generate nonsense. Perhaps the problem with Galactica isn't that it's incorrect about science, but that it had a major PR problem.

Yann LeCun is still kinda salty Twitter bullied Galactica that much

References

  1. Taylor et al., “Galactica: A Large Language Model for Science,” arXiv.org, 16-Nov-2022. [Online]. Available: https://arxiv.org/abs/2211.09085. [Accessed: 24-Dec-2022].


Shirley Wang
MSc student at UofT interested in Computer Vision