The Practical Guide to LLMs: Falcon

Georgian
Georgian Impact Blog
9 min read · Aug 31, 2023

By: Rohit Saha, Angeline Yasodhara, Mariia Ponomarenko & Kyryl Truskovskyi

Continuing our assessment of Large Language Models (LLMs) through the lens of Georgian’s Evaluation Framework, we turn our attention to Falcon, which is available through Hugging Face.

Our first blog lays out the Evaluation Framework we are using in our assessments in detail. At a high level, we are testing against four pillars that are common considerations when building verticalized LLM solutions: Performance, Time to Train, Costs and Inference.

With around 4.4 million downloads last month,¹ Falcon has received a lot of attention from the research community because the model emphasizes the importance of high-quality training data. Before Falcon, foundational models were primarily trained on a mix of smaller, high-quality curated datasets and internet data that had been filtered for quality (either by a link’s popularity or with a classifier trained on curated datasets). Even so, the shortage of high-quality training data caused these models to hallucinate, i.e., generate inaccurate responses.

In an effort to curb hallucinations, researchers at the Technology Innovation Institute in Abu Dhabi created Falcon, which is not only open-source but comes with the Apache 2.0 License, meaning it can be used by businesses for commercial purposes.

In this post, we aim to provide:

  • A brief explanation of Falcon.
  • Different ways in which Falcon may be fine-tuned on custom datasets.
  • A comparison of Falcon-7B against BERT, DistilBERT and Flan-T5-Large on the tasks of classification and summarization, based on our tests.
  • Analysis of how much data and time may be required to fine-tune a Falcon model, and the potential cost.
  • A look at some possible ways to productize Falcon, and the potential associated costs.

We are open-sourcing all of the scripts and our insights in our LLM Fine-Tuning GitHub repository.

What is Falcon?

Falcon is a causal decoder-only model, i.e., given a sequence of words, it predicts the most likely next word.

Falcon comes in two sizes: 7 billion parameters (Falcon-7B) and 40 billion parameters (Falcon-40B). Each size has two versions: (i) base, which has been pre-trained on large corpora of text and can be fine-tuned on downstream tasks, and (ii) instruct, which has already been fine-tuned on instructions, making it, in our view, favorable for out-of-the-box chatbot and Q&A applications.
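
As a rough illustration (the prompt and generation settings here are arbitrary, not the configuration used in our experiments), here is a minimal sketch of loading either variant from the Hugging Face Hub with the transformers library:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Base checkpoint, meant to be fine-tuned on downstream tasks.
    # Swap in "tiiuae/falcon-7b-instruct" for the instruction-tuned variant.
    model_id = "tiiuae/falcon-7b"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half precision so the weights fit on a 24GB GPU
        device_map="auto",
        trust_remote_code=True,      # Falcon shipped custom modeling code at release time
    )

    prompt = "Falcon is a causal decoder-only model that"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))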

As of the time of writing, Falcon’s research paper has not been released. However, the model uses two ideas that optimize it for inference:

(i) FlashAttention (Dao et al., 2022): an exact-attention algorithm that is IO-aware and uses fewer memory accesses. The exact-attention algorithm was shown to give a speedup of up to 3x compared to standard attention.

(ii) Multiquery (Shazeer et al., 2019): an improvement on multi-head attention layers, where the keys and values are shared across all of the different attention “heads”. This process reduces memory and was shown to increase decoding speed.
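
To make the idea concrete, here is a minimal PyTorch sketch of multi-query attention (the class and layer names are ours, and the causal mask is omitted for brevity); the key point is that a single key/value head is shared across all query heads:

    import torch
    import torch.nn as nn

    class MultiQueryAttention(nn.Module):
        """Sketch only: one shared K/V head for all query heads; no causal mask."""

        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            self.n_heads = n_heads
            self.head_dim = d_model // n_heads
            self.q_proj = nn.Linear(d_model, d_model)             # separate queries per head
            self.kv_proj = nn.Linear(d_model, 2 * self.head_dim)  # a single shared key/value head
            self.out_proj = nn.Linear(d_model, d_model)

        def forward(self, x):
            b, t, _ = x.shape
            q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)  # (b, h, t, d)
            k, v = self.kv_proj(x).split(self.head_dim, dim=-1)                         # (b, t, d) each
            k, v = k.unsqueeze(1), v.unsqueeze(1)                  # broadcast over the head dimension
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
            return self.out_proj(out)

Because only one key/value head is stored, the key/value cache kept around during generation shrinks by a factor of the number of heads, which is where the decoding speedup comes from.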

What distinguishes Falcon from other LLMs is the dataset it was trained on, which draws on several sources.

At a high level, the dataset contains 800B tokens from RefinedWeb (spanning May 2013 to September 2022), with duplicated and toxic content filtered out. The dataset is also available for commercial use and is hosted on Hugging Face.

For the purposes of this blog post and our experiments, we chose Falcon-7B.

Falcon-7B Performance

We evaluated the performance of Falcon-7B on the tasks of classification and summarization. All experiments were conducted on an AWS EC2 g5.2xlarge instance, which comes with one NVIDIA A10G GPU with 24GB of memory and costs US$1.212/hour.

We use two distinct methods with Falcon for the tasks of classification and summarization: (i) In-Context Learning, and (ii) Fine-tuning via QLoRA.

Method I: In-Context Learning

In this method, we take Falcon as is and use it to solve our task. A popular name for this method is prompting, where the model generates text based on instructions provided by the user.

It is important to note that this method does not involve any tuning of model weights/parameters. For the task of classifying categories of news documents, here is an example of what the prompt looks like:
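
(The category names and exact wording below are illustrative placeholders, not the precise prompt from our experiments.)

    Classify the following news document into one of these categories:
    [business, entertainment, politics, sports, technology]

    Document: <text of the news article goes here>

    Category: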

In the prompt above, we provide Falcon with instructions about the task, the list of possible classes and the example sentence, and then ask it to predict the class. In the data science community, this approach is called zero-shot prompting: the LLM has to predict the class without having seen any labeled examples from the particular domain.

However, zero-shot prompting is challenging, so researchers employ few-shot prompting to make things a bit easier for the model. Using this method, we provide a few examples of news documents and their corresponding categories to guide the model. Here is what the modified prompt looks like:
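
(As above, the wording and the worked examples below are illustrative placeholders rather than the exact few-shot prompt from our experiments.)

    Classify the following news document into one of these categories:
    [business, entertainment, politics, sports, technology]

    Document: The central bank raised interest rates by half a percentage point on Tuesday...
    Category: business

    Document: The midfielder scored twice in the final minutes to seal the championship...
    Category: sports

    Document: <text of the news article to classify goes here>
    Category: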

Now that we are familiar with In-Context Learning, let’s move on to fine-tuning with QLoRA.

Method II: Fine-tuning with QLoRA

QLoRA is the quantized version of LoRA. Given the size of Falcon-7B, the full-precision model does not fit on a single 24GB GPU. To make fine-tuning feasible, the model is loaded in a 4-bit or 8-bit representation instead of the usual 32 bits. This approach shrinks the model’s memory footprint and allows fine-tuning on a commercial-grade GPU. Our first blog in this series, evaluating Flan-T5-Large, contains more detail on LoRA.
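
For reference, a minimal sketch of this setup using the transformers, bitsandbytes and peft libraries looks like the following (the hyperparameter values are illustrative; the exact configurations we used live in our GitHub repository):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_id = "tiiuae/falcon-7b"

    # Load the frozen base model in 4-bit (NF4) so it fits on a single 24GB A10G.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Attach small trainable LoRA adapters; only these are updated during fine-tuning.
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=16,                                # illustrative rank
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["query_key_value"],  # Falcon's fused attention projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()       # typically well under 1% of all parameters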

We fine-tune Falcon-7B through the mechanism of causal language modeling. In simple terms, given some text, the model learns how words are associated with each other and which words tend to follow others. To do this, we take the dataset containing news documents and their corresponding categories and create instructions out of them.

Here is an example of what an instruction looks like:
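
(The template below is an illustrative reconstruction; the exact wording we used is in our GitHub repository.)

    ### Instruction: Classify the following news document into one of these categories:
    [business, entertainment, politics, sports, technology]

    ### Sentence: <text of the news article goes here>

    ### Label: business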

Once the model has learned the associations between news documents (sentence) and their corresponding categories (label), we ask it to predict the category for a new news document via:
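
(Again, the template is illustrative; at inference time the label is simply left blank for the model to complete.)

    ### Instruction: Classify the following news document into one of these categories:
    [business, entertainment, politics, sports, technology]

    ### Sentence: <text of the new news article goes here>

    ### Label: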

Now let’s look at some numbers. Since we are solving a classification task, we track accuracy as the main metric.

Our results show that the model struggles to make accurate predictions in the zero-shot setting. In the few-shot setting, the model errors out because the prompt becomes too long once several examples are included. Fine-tuning, on the other hand, improves accuracy over the zero-shot setting by more than 75%, demonstrating, in our view, the merits of tuning LLMs on proprietary datasets.

Comparing Falcon-7B with Flan-T5-Large and other popular language models, we observe the following accuracy results:

In our tests, Falcon-7B outperforms the other models, achieving 76.37% accuracy. Furthermore, when trained on fewer samples, Falcon-7B maintains its lead in performance compared to other models. At roughly 5,332 samples, other models catch up to Falcon-7B’s performance, indicating Falcon-7B’s effectiveness in low-data regimes.

Now let’s turn to the summarization task. For this task, we use the Samsum dataset where the goal is to summarize chat conversations. Similar to classification, we create three prompts corresponding to the three settings: (i) Zero-shot, (ii) Few-shot, and (iii) Fine-tuning with QLoRA. Below is a snapshot of what the prompts look like:
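
(The snippet below is an illustrative reconstruction of the zero-shot prompt; the few-shot variant prepends a handful of worked dialogue/summary pairs, and the fine-tuning variant appends the reference summary after the "Summary:" field during training.)

    Summarize the following conversation.

    ### Dialogue:
    Amanda: I baked cookies. Do you want some?
    Jerry: Sure!
    Amanda: I'll bring you some tomorrow :-)

    ### Summary: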

Based on our tests, Falcon-7B does a better job at summarizing dialogues than at classifying news documents in the zero-shot and few-shot settings. But fine-tuning is still more effective than zero- and few-shot prompting, as it helps Falcon-7B learn the summarization style specific to the dataset rather than producing a generic summary. We track the ROUGE-1 and ROUGE-2 scores, which tell us how close the generated text is to the reference summary: ROUGE-1 measures the overlap of individual words (unigrams) between the generated text and the reference summary, while ROUGE-2 measures the overlap of word pairs (bigrams).
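
As a quick illustration of how these scores are computed (the strings here are made up, not drawn from our evaluation set), one way to compute them is with Hugging Face’s evaluate library:

    import evaluate

    rouge = evaluate.load("rouge")

    predictions = ["Amanda will bring Jerry some cookies tomorrow."]
    references = ["Amanda baked cookies and will bring some to Jerry tomorrow."]

    scores = rouge.compute(predictions=predictions, references=references)
    print(scores["rouge1"], scores["rouge2"])  # unigram and bigram overlap, respectively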

On both metrics, Falcon-7B outperforms Flan-T5-Large and Flan-T5-Base.

Across both classification and summarization tasks, we see the merits of fine-tuning Falcon-7B on target datasets as opposed to using it out-of-the-box.

Falcon-7B Time and Cost to Train

Next we move to the second and third pillars of our evaluation framework: Time and Cost to Train.

As mentioned earlier, all experiments were conducted on an AWS EC2 instance: g5.2xlarge that costs US$1.212/hour, which is low in our experience.

Falcon-7B Inference

With performance, and time & cost to train out of the way, let’s move on to the 4th pillar of our evaluation framework: Inference.

For inference, we used the same approach to deployment and cost estimation that we used for the Flan model.

Following the same process we used to test Flan-T5-Large, we used the load-testing tool Vegeta on Falcon. We created a script that sent varying numbers of requests (ranging from 5 to 185) in three sets, with a three-second interval between sets to give the server time to recover. Afterward, we examined the results, excluding instances where a “too many requests” error occurred. We calculated the average throughput and the 90th-percentile latency for the maximum sustainable requests per second (RPS) and used this data to estimate cost. Again, following the same process we used for Flan-T5-Large, all of the load-testing experiments were executed on a g5.4xlarge instance.
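
Under the hood, each load-test request is an HTTP call to the Text Generation Inference (TGI) server that serves the model. A minimal sketch of a single request (the host, port and generation parameters are placeholders) looks like this:

    import requests

    # Assumes a Text Generation Inference (TGI) server hosting the fine-tuned
    # Falcon-7B is reachable at this address.
    url = "http://localhost:8080/generate"

    payload = {
        "inputs": "Summarize the following conversation.\n\n### Dialogue:\n...\n\n### Summary:",
        "parameters": {"max_new_tokens": 100},
    }
    response = requests.post(url, json=payload, timeout=30)
    print(response.json()["generated_text"])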

For the summarization task, we varied the RPS from 5 to 180. Ninety percent of all requests had a response time of 1.82 seconds or less at 145 RPS, the maximum number of requests per second the server was able to handle.

Taking into account the throughput value of 53.8 responses per second, getting 145 responses (completed in roughly 1.82 seconds) costs about US$0.0008.
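
As a rough sanity check on that figure (assuming an on-demand price of roughly US$1.624/hour for the g5.4xlarge instance, which is our assumption here rather than a number from the measurements), the cost works out to the observed latency multiplied by the per-second instance price:

    HOURLY_RATE_USD = 1.624   # assumed g5.4xlarge on-demand price; check current AWS pricing
    latency_seconds = 1.82    # 90th-percentile latency at 145 RPS for summarization

    cost = latency_seconds * HOURLY_RATE_USD / 3600
    print(f"${cost:.4f}")     # ~$0.0008 to serve those 145 requests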

Falcon-7B + LoRA for summarization:

The inference performance of the classification model is similar to that of the summarization model. The maximum RPS that the Text Generation Inference (TGI) server was able to handle was 125.

Taking the latency value into account, it costs about US$0.001 to get responses for 125 requests in 2.7 seconds.

Falcon-7B + LoRA for classification:

Conclusion

Thanks to its strong performance in low-data situations and its low cost to train, Falcon could be important for companies leveraging LLMs.

We evaluated different methods of leveraging Falcon-7B, i.e., in-context learning and fine-tuning with QLoRA, and found fine-tuning to outperform in-context learning on both the classification and summarization tasks.

We discussed how Falcon-7B compares with similar language models such as Flan-T5-Large and the BERT family, and found Falcon-7B to perform better than those models in our tests. Furthermore, Falcon-7B performs particularly well when the amount of available labeled data is limited, in our experience.

Fine-tuning Falcon-7B takes a reasonable amount of time, and the associated training costs are low. Finally, we benchmarked Falcon-7B’s throughput via Hugging Face’s Text Generation Inference. We found that Falcon-7B performs better in terms of speed for summarization than for classification, although the difference in speed between the two tasks isn’t large. As a result, Text Generation Inference can be a fitting solution for deploying Falcon-7B for both classification and summarization tasks.

If you are looking for more detail on our work, our GitHub repository contains insights about hyperparameter optimization while fine-tuning Falcon-7B.

This blog is the second in a series of posts exploring LLMs that we intend to make available alongside the GitHub repository.

[1]: As of August 31, 2023
