Exploring the Capabilities of LLaMA

Vishal Padma
Published in Version 1
Mar 24, 2023

Introduction

It was long assumed that increasing the number of parameters in a model would lead to improved performance. However, recent research has challenged this assumption and shown that smaller models trained on more data can actually outperform their larger counterparts. This blog introduces a series of models called LLaMA, designed to achieve the best possible performance at various inference budgets by training on more tokens than is typical. Because LLaMA is trained on far more text than usual, it performs well on tasks such as question answering, reading comprehension, and natural language understanding.

The LLaMA architecture incorporates improvements that have already proven effective in other models, notably pre-normalization of each transformer sub-layer with RMSNorm, the SwiGLU activation function, and rotary positional embeddings, and these changes contribute to the model's stronger results. The model is also used to gain insight into the strengths and weaknesses of current language models, develop strategies to improve their performance, and identify and mitigate biases as well as the potential harms of producing toxic or untruthful answers. The training data covers 20 different languages, but most of the dataset is English text, so the model is expected to perform better in English than in other languages.
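
Of those borrowed components, RMSNorm pre-normalization is the simplest to illustrate. Below is a minimal PyTorch-style sketch; the class, the default epsilon, and the tensor shapes are illustrative assumptions rather than Meta's released implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (a minimal sketch).

    Unlike standard LayerNorm, RMSNorm rescales activations by their root
    mean square without subtracting the mean, which is cheaper and works
    well as pre-normalization at the input of each transformer sub-layer.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension, then apply the gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


# Example: normalize a batch of hidden states before an attention layer.
hidden = torch.randn(2, 16, 512)       # (batch, sequence length, model dimension)
print(RMSNorm(dim=512)(hidden).shape)  # torch.Size([2, 16, 512])
```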

The work done by Meta researchers aims to develop a range of language models (LLaMA) that achieve high performance at different inference budgets by training on more tokens than usual. They also discuss the modifications they made to the transformer architecture and their training method. Finally, they examine the biases and toxicity encoded in their models using recent benchmarks from the responsible AI community.

What are the different datasets used for the LLaMA models?

The LLaMA models were trained on a variety of publicly available data sources, listed in Table 1, which were combined in different proportions to create a diverse training dataset. The largest models were pre-trained on roughly 1.4 trillion tokens, with most of the data seen only once during training (a single epoch) and a few sources seen for about two epochs. The same sampling proportions were used for the smaller models, which were pre-trained on 1 trillion tokens.

Table 1: Training data with sampling proportions for LLaMA models
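
To make the sampling-proportion idea concrete, here is a minimal Python sketch of drawing training documents from several sources according to fixed mixture weights. The weights below only roughly follow the proportions reported for LLaMA's pre-training data and should be treated as illustrative; the helper functions themselves are hypothetical, not part of Meta's pipeline.

```python
import random

# Illustrative mixture weights, roughly following the reported proportions
# of LLaMA's pre-training sources (treat these numbers as approximate).
SAMPLING_PROPORTIONS = {
    "CommonCrawl": 0.67,
    "C4": 0.15,
    "GitHub": 0.045,
    "Wikipedia": 0.045,
    "Books": 0.045,
    "ArXiv": 0.025,
    "StackExchange": 0.02,
}

def sample_source(proportions: dict) -> str:
    """Pick a data source with probability proportional to its mixture weight."""
    sources = list(proportions)
    weights = [proportions[s] for s in sources]
    return random.choices(sources, weights=weights, k=1)[0]

def build_batch(datasets: dict, batch_size: int) -> list:
    """Assemble a training batch by sampling a source, then a document from it."""
    return [random.choice(datasets[sample_source(SAMPLING_PROPORTIONS)])
            for _ in range(batch_size)]
```

Sampling a source per document, rather than simply concatenating the corpora, is what lets a small but high-quality source appear more often in training than its raw size alone would suggest.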

Results

Before discussing the results of the LLaMA model in different areas, it is important to understand the two evaluation methods used to test it: zero-shot and few-shot tasks.

  • Zero-shot: The model is given a textual description of the task and a test example. From these two inputs, it either produces an answer through open-ended generation or ranks a set of proposed answers.
  • Few-shot: The model is given a set of worked examples (between 1 and 64) together with a test example. It takes these as text input and, as output, either generates an answer or ranks the different candidate options (see the sketch after this list).
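
The sketch below illustrates the difference between the two settings: the same test question is wrapped either with no worked examples (zero-shot) or with a few of them (few-shot). The task description, example pairs, and helper function are hypothetical and are not taken from the paper's evaluation code.

```python
def build_prompt(task_description: str,
                 examples: list,
                 test_question: str) -> str:
    """Concatenate a task description, optional solved examples, and a test question.

    With an empty `examples` list this is a zero-shot prompt; with 1-64
    example pairs it becomes a few-shot prompt.
    """
    parts = [task_description.strip(), ""]
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}\n")
    parts.append(f"Q: {test_question}\nA:")
    return "\n".join(parts)


description = "Answer the question with a single word."
shots = [("What is the capital of France?", "Paris"),
         ("What is the capital of Japan?", "Tokyo")]

zero_shot_prompt = build_prompt(description, [], "What is the capital of Italy?")
few_shot_prompt = build_prompt(description, shots, "What is the capital of Italy?")
```

In the ranking variant, each candidate answer is appended to the prompt in turn, the model's likelihood for it is computed, and the highest-scoring candidate is taken as the prediction.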

Common sense reasoning and closed-book question answering

Common sense knowledge has been recognized as a crucial missing component of AI systems since their inception, and acquiring it has been a focus of research for many years. However, it has become increasingly clear that building common sense reasoning systems requires significant effort and, at times, substantial cost.

The LLaMA model was compared with other models of different sizes, and the results are shown in Table 2. The smaller LLaMA-13B model outperformed GPT-3 on most benchmarks despite being roughly 10 times smaller. The LLaMA-65B model performed strongly in both zero-shot and few-shot settings and beat most of the other models.

Table 2: Results for common sense reasoning and closed-book question answering.

Reading comprehension, mathematical reasoning, and code generation

The LLaMA-65B model performs well on reading comprehension benchmarks compared with GPT-3. For mathematical reasoning, however, it falls behind models that were fine-tuned on mathematical data. For code generation, the LLaMA model reaches an accuracy of over 70% and outperforms other general-purpose models that were not fine-tuned on code.

Table 3: Results for reading comprehension, mathematical reasoning, and code generation.

Toxicity assessment for LLaMA model

Language models can generate toxic and harmful language, including insults, hate speech, and threats, and the sheer range of possible toxic content makes the scope of the problem difficult to evaluate. On the toxicity benchmark used in the paper, scores increase with model size, particularly for respectful prompts, which is in line with previous studies, with one exception. That exception may be explained by the fact that a larger model is not necessarily the better-performing one, suggesting the relationship between size and toxicity may only hold within a model family.

Bias testing for LLaMA model

An evaluation of the LLaMA model's biases showed that, on average, it compares slightly favorably to GPT-3, but it exhibits stronger biases in the categories related to religion, age, and gender. A separate assessment focused specifically on gender also indicates potential gender bias in the LLaMA model.

Measure of truthfulness of the LLaMA model

The TruthfulQA benchmark is designed to assess a model's ability to identify true statements about the real world, which makes it useful for gauging the risk of a model producing misinformation or false claims. Although the LLaMA model performs better than GPT-3 in both of the benchmark's categories, the rate of correct answers is still low, indicating that the model remains likely to hallucinate incorrect answers.

What was the carbon emission for developing all these models?

The carbon footprint of developing these models was tracked: the work is estimated to have consumed approximately 2,638 megawatt-hours of energy and to have emitted a total of 1,015 metric tons of carbon dioxide equivalent. By releasing the models, the aim is to help reduce future emissions, since the training has already been completed; in addition, some of the models are small enough to run on a single graphics processing unit.

Table 5: Carbon Emissions for training the models in the same data centre.
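
As a quick sanity check on those figures, the conversion from energy to emissions can be reproduced with the grid carbon-intensity factor the paper uses (the US national average of 0.385 kgCO2eq per KWh); the snippet below is only that back-of-the-envelope arithmetic, not the full accounting of GPU hours and power draw.

```python
# Back-of-the-envelope emissions check: energy use times grid carbon intensity.
energy_mwh = 2_638          # estimated total energy consumption (MWh)
carbon_intensity = 0.385    # tCO2eq per MWh (equivalently, kgCO2eq per KWh)

emissions_tco2eq = energy_mwh * carbon_intensity
print(f"{emissions_tco2eq:.0f} tCO2eq")  # ~1016 tCO2eq, in line with the ~1,015 tonnes quoted above
```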

Conclusion

In conclusion, the LLaMA-13B model outperforms GPT-3 and other available models despite being more than ten times smaller. In contrast to previous studies, the LLaMA model demonstrates that achieving state-of-the-art performance is possible by exclusively training on publicly available data, without relying on proprietary datasets.

The LLaMA 65B model has demonstrated exceptional performance and surpassed the other available models in six of the datasets for common sense reasoning.

While Meta drew inspiration from other existing models for LLaMA, they were careful not to repeat the mistakes seen with Galactica. Although they have worked to address the known limitations of large language models, they have not completely overcome them. By sharing these models with the research community, the hope is to accelerate the development of large language models and promote efforts to enhance their robustness while mitigating issues such as toxicity and bias. The use of large language models carries the risk of producing biased, offensive, or harmful content, as well as generating inaccurate information, also known as hallucinations, and LLaMA models are not immune to these issues. Moreover, promising results have been observed when fine-tuning these models on instructions, following the study by Chung et al. (2022).

The LLaMA model is not openly available to the public; access can be requested for research purposes, and the model weights were nevertheless leaked online despite Meta's restrictions.

About the author

Vishal Padma is an Associate Consultant at Version 1.
