Analysis of GALACTICA

Vishal Padma
Published in Version 1
Mar 17, 2023

Introduction

Galactica (GAL) is a large language model designed for organizing science, trained on a highly curated corpus of over 48 million papers, textbooks, and other scientific knowledge sources. Galactica outperforms existing language models on various scientific tasks, including reasoning and knowledge-intensive tasks. The model can predict citations, perform step-by-step reasoning, and interact with modalities such as SMILES chemical notation and protein sequences using natural language. With its context-associative power, Galactica also has the potential to replace traditional document storage and retrieval systems. Additionally, Galactica can perform multi-modal tasks involving drug discovery and protein sequence annotation.

Different features of Galactica

The table below summarizes the accuracy of Galactica across its different features (all of these results were published in November 2022).

Table: Accuracy of different models, showing the performance of each feature on different datasets.

Galactica offers three primary features: Citation Prediction, LaTeX Prediction, and Reasoning. Five Galactica models are available, with parameter counts ranging from 125 million to 120 billion. Testing of the Citation Prediction feature revealed a clear correlation between the number of parameters and the accuracy of the predictions. The GAL 120B model proved to be the most accurate, with an average accuracy of at least 50%. However, the Citation Prediction feature was not tested against models available outside of Meta.
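For readers who want to see what citation prediction looks like in practice, here is a minimal sketch. It assumes the publicly released Hugging Face checkpoints (facebook/galactica-125m is used purely for illustration; the paper's best results came from the 120B model) and the [START_REF] prompt token documented in the model card; the generation settings are illustrative assumptions, not the paper's benchmark setup.

    # A minimal sketch of citation prediction, assuming the public
    # Hugging Face checkpoints and the [START_REF] prompt token.
    from transformers import AutoTokenizer, OPTForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
    model = OPTForCausalLM.from_pretrained("facebook/galactica-125m")

    # Ending the prompt with [START_REF] asks the model to complete a citation.
    prompt = "The Transformer architecture [START_REF]"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    outputs = model.generate(input_ids, max_new_tokens=30)
    print(tokenizer.decode(outputs[0]))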

Moving on to the LaTeX prediction feature, it was tested against all Galactica models as well as the OPT, BLOOM, and GPT-3 models. The results demonstrated that Galactica outperforms all of the other models in terms of accuracy.
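LaTeX prediction can be tried in the same way. The sketch below is again assumption-laden: ending the prompt with an opening display-math delimiter (\[) follows the convention shown in the public model documentation, while the specific prompt and settings are made up for this example.

    # A minimal sketch of LaTeX prediction; the prompt and settings are
    # illustrative assumptions, not the paper's benchmark setup.
    from transformers import AutoTokenizer, OPTForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
    model = OPTForCausalLM.from_pretrained("facebook/galactica-125m")

    # Open a display-math block and let the model complete the equation.
    prompt = "The formula for scaled dot-product attention is: \\["
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    outputs = model.generate(input_ids, max_new_tokens=50)
    print(tokenizer.decode(outputs[0]))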

Finally, the Reasoning feature was tested against GPT-3 and the other Galactica models, with the higher-parameter models demonstrating the best accuracy. These findings were published in November 2022 with accuracy rates below 70%, which would now be considered subpar; at the time of publication, however, they were deemed exceptional compared to other models. Anyone who wishes to reproduce the findings from the research paper can use the models available through the galai Python package on PyPI or the checkpoints on Hugging Face, as in the sketch below.
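As a final hedged sketch, reasoning prompts use the <work> token that the paper introduces to ask the model for step-by-step working. The example question is adapted from the public model documentation; the small checkpoint and generation settings are, once more, assumptions for illustration.

    # A minimal sketch of step-by-step reasoning with the <work> token.
    from transformers import AutoTokenizer, OPTForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
    model = OPTForCausalLM.from_pretrained("facebook/galactica-125m")

    # The <work> token asks the model to show its working before answering.
    prompt = ("A force of 0.6 N is applied to an object, which accelerates "
              "at 3 m/s^2. What is its mass? <work>")
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    outputs = model.generate(input_ids, max_new_tokens=100)
    print(tokenizer.decode(outputs[0]))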

Limitations of Galactica

The performance of the model is significantly affected by limited access to scientific knowledge: many papers and textbooks are not openly available. While prompts can improve performance by allowing users to instruct the model on what to do, they cannot eliminate the effects of the model’s pre-training. Moreover, the 120B-scale model shows some citation bias towards popular papers, which can be addressed through data augmentation techniques that generate new data points from existing ones.

One of the most frequently observed issues with large language models is hallucination: the model returns content that is nonsensical or not faithful to the provided source. For well-cited topics, the output generated by Galactica is better than for topics that are less known or less cited. For example, two types of citation identifiers were tested in the paper: (a) paper titles and (b) alphanumeric IDs. Title-based identifiers were shown to perform better than IDs overall, but the authors found that paper titles are more susceptible to hallucination errors at smaller model scales because of their reliance on textual identification.

A significant challenge with language models is that there is no assurance that the output they generate is accurate or dependable. Some outputs generated by Galactica may sound correct and authoritative while being wrong in important ways. This makes the output less reliable and means each result needs to be verified, which is particularly a problem for highly technical content.

Feedback on the demo of Galactica

[Screenshots of user feedback on the Galactica demo appeared here; the source was linked in the original post.]

Conclusion

The traditional way of accessing scientific knowledge through the store-and-retrieve paradigm has limitations, leading to a bottleneck in knowledge throughput. Language models can disrupt this paradigm by absorbing technical knowledge and scaling smoothly with model size, providing advantages over search engines in the long run. Language models can also act as a curated knowledge base for knowledge-intensive question-answering tasks, and as a bridge between scientific modalities and natural language. The potential of language models is vast, and open-sourcing them will allow the machine learning community to extend their capabilities.

For Galactica, despite how powerful the tool is with all the features listed above, the fundamental problem was that it could not distinguish truth from fiction. As stated above, the model can hallucinate and provide a wrong answer to a question with high confidence. People who tried the demo and asked questions to which they already knew the answers found that incorrect responses were often delivered with a high degree of confidence, which can lead to the assumption that the answers are accurate and trustworthy. This creates a need for additional verification, negating the original purpose of using a language model to avoid manual effort. The demo was taken down on 17th November 2022, after its first three days, because reviewers said it was not useful and could be dangerous. Currently, there is no information regarding Meta’s plans to make the demo available to the public again. Another problem is the bias the model carries with respect to its knowledge base: most of the datasets used for training are open source, but many more are restricted, and this can introduce bias that in turn affects performance.

While language models are designed to generate the most likely sentence or output, this does not always equate to accuracy: they can generate sentences that sound plausible but are factually incorrect. As a result, when utilizing language models in fields such as medicine or other high-criticality use cases, it is crucial to exercise caution and carefully consider the limitations and potential risks involved. Some individuals may also have had unrealistic expectations of Galactica’s capabilities; language models are designed to provide the most likely answer, but they are not infallible and can produce incorrect responses.

At present, the cost of using the Galactica tool is unknown. The demo was paused due to the feedback it received, but the models are still available for use, and interested parties may attempt to replicate the results outlined in the paper using them.

About the author

Vishal Padma is an Associate Consultant at Version 1.
