The Practical Guide to LLMs: Flan-T5

Georgian
Georgian Impact Blog
9 min read · Aug 4, 2023


By: Rohit Saha, Mariia Ponomarenko & Kyryl Truskovskyi

If the last 6 months of AI research felt like a decade to you, you are not alone! With a new Large Language Model (LLM) released every other week, it has been challenging for the research community to keep up with the current pace of innovation in AI. While there has been a flurry of blog posts, tweets and code snippets on social media showcasing the power of LLMs and how to set up chat applications using them, we have seen few efforts that stress-test them for real-life business use-cases.

Through a series of blog posts, we aim to bring you insights that we have learned via experimentation with a suite of LLMs. Today, we will be focusing on one of the most popular open-source LLMs: Flan-T5.

In this post, we aim to give you:

  • A brief explanation of Flan-T5.
  • A comparison with the popular BERT and DistilBERT models across classification and summarization tasks.
  • Analysis of how much data and time you need to train your own Flan-T5, and the cost.
  • A look at some of the leading ways in our experience to productize Flan-T5, and the associated costs.

What’s more, we are open-sourcing all of the scripts in the form of an LLM GitHub repository. This blog is the first in a series of posts exploring LLMs that we intend to make available alongside the GitHub repository.

What is Flan-T5?

Flan-T5 is an open-source LLM that’s available for commercial usage. Published by Google researchers, Flan-T5 is an encoder-decoder model pre-trained on a variety of language tasks. The model has been trained on supervised and unsupervised datasets with the goal of learning mappings between sequences of text, i.e., text-to-text.

In our view, what sets Flan-T5 apart from many other models is that its training is based on instruction prompting. In other words, the model has already learned to perform specific tasks such as summarization, classification and translation, to name a few. For instance, if you were to feed this blog post into Flan-T5 and tell it to “summarize this article”, it would know that it needs to generate a shorter version of this article. If you download the model from Hugging Face, you can start using it right away for general-purpose applications.
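To make this concrete, here is a minimal sketch of prompting Flan-T5 out of the box with the Hugging Face transformers library; the model ID, prompt and generation settings are illustrative, not prescriptive:

```python
# Minimal sketch: prompting Flan-T5 directly, with no fine-tuning.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

prompt = "Summarize this article: Large language models are being adapted to business use cases..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```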

Evaluating Flan-T5

However, to build verticalized solutions on your proprietary data with this model, you would first have to consider: Performance, Time to train, Costs and Inference.

If you have already trained and deployed language models such as BERT, you may already be familiar with this evaluation framework. From that standpoint, Flan-T5 is very similar to BERT, just larger and more powerful.

Flan-T5-Large Performance

Let’s move on to the first pillar of our evaluation framework: Performance.

We compare the performance of fine-tuning Flan-T5-Large on two tasks: classification and summarization. Flan-T5 comes in various sizes; for our experiments, we chose Flan-T5-Large, which has 780M parameters. All experiments were conducted on an AWS EC2 g5.2xlarge instance, which comes with one NVIDIA A10G GPU (24 GB of memory) and costs $1.212/hour.

A quick aside: fine-tuning the full Flan-T5 model comes with significant hardware and overfitting challenges. Imagine fitting such a huge model on one GPU! A nightmare, right? To circumvent this, we will be performing Parameter-Efficient Fine-Tuning (PEFT) via LoRA. In simple terms, LoRA is a technique to fine-tune massive models (LLMs) without actually updating the base model’s weights. It works by adding small adapter layers throughout the base model and tuning only those layers. In the case of Flan-T5-Large, LoRA adds layers that contain only 4.7M parameters. You can think of these layers as experts focused on learning the patterns present in your dataset, while leveraging the generalist knowledge already available in Flan-T5-Large’s 780M parameters.
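For reference, here is a minimal sketch of how LoRA adapters can be attached to Flan-T5-Large using the Hugging Face peft library; the specific hyperparameters (r, lora_alpha, target_modules) are illustrative assumptions rather than the exact settings in our repository:

```python
# Minimal sketch: wrapping Flan-T5-Large with LoRA adapters via peft.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # Flan-T5 is an encoder-decoder (text-to-text) model
    r=16,                             # rank of the low-rank update matrices (assumed)
    lora_alpha=32,                    # scaling applied to the LoRA updates (assumed)
    lora_dropout=0.05,
    target_modules=["q", "v"],        # attention projections to adapt in T5
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()    # only the small adapter weights are trainable
```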

Back to the task. For classification, we used the open-source News Group dataset, which contains news documents labeled with their corresponding categories, such as sports, politics and finance. We compare Flan-T5-Large’s performance against BERT (110M) and DistilBERT (66M), and observe the following accuracy scores:

At first glance, it appears that Flan-T5-Large achieves parity with DistilBERT and BERT when the entire dataset is used for fine-tuning. However, businesses often don’t have as much labeled data to train these models as the 10,664 examples in this dataset. So we performed an ablation study to see what happens to performance when we decrease the amount of training data.

We can see that Flan-T5-Large does a significantly better job than DistilBERT and BERT with a sample size as low as ~250! As we steadily increase the number of samples, DistilBERT and BERT finally catch up to Flan-T5-Large, making Flan-T5-Large, in our opinion, a good candidate to consider in low-data situations.

In case you are wondering whether differences in training conditions caused one model to outperform another: we fine-tuned each model for exactly five epochs, i.e., five full passes over the training dataset, and evaluated each of them on the same test set.
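For transparency, the sketch below shows how that shared recipe might look with Hugging Face TrainingArguments; only num_train_epochs reflects our actual setup, and the other values are placeholders:

```python
# Minimal sketch: identical training budget for every model we compared.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="newsgroup-finetune",
    num_train_epochs=5,              # five full passes over the training set
    per_device_train_batch_size=8,   # placeholder batch size
    learning_rate=1e-4,              # placeholder learning rate
)
```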

Flan-T5 Time and Cost to Train

Next we move to the 2nd and 3rd pillars of our evaluation framework: Time and Cost to Train.

As mentioned earlier, all experiments were conducted on an AWS EC2 g5.2xlarge instance that costs $1.212/hour. Training such a huge model on roughly 10,000 samples cost less than a cup of coffee, which we found surprisingly low.

Next, we evaluate Flan-T5-Large on a summarization task, for which we used the Samsum dataset. We compare the performance of Flan-T5-Large + LoRA (4.7M trainable parameters) against full fine-tuning of Flan-T5-Base, i.e., tuning all 250M of its parameters. Fully fine-tuning Flan-T5-Large without LoRA is challenging, so we chose Flan-T5-Base for this comparison. The metric we compute is ROUGE-1, which is commonly used to evaluate summarization models.

The above training took 2 hours and 47 minutes over five epochs, costing approximately $3.60.
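If you want to reproduce the metric, here is a minimal sketch of computing ROUGE-1 with the Hugging Face evaluate library; the example strings are made up:

```python
# Minimal sketch: ROUGE-1 measures unigram overlap between a generated
# summary and a reference summary.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["The team met to plan the product launch."]
references = ["The team held a meeting to plan next month's product launch."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge1"])  # F1-style unigram overlap, between 0 and 1
```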

Flan-T5 Inference

With performance, and time & cost to train out of the way, let’s move on to the 4th pillar of our evaluation framework: Inference.

We have observed that companies can be hesitant to run their own LLMs because of the cost of maintaining them in production. So far, we have not seen any discussions on estimating the cost of productizing LLMs. As a result, we will calculate it based on assumptions that are commonplace in ML production systems. Of course, inference costs depend on your infrastructure, usage and existing engineering systems.

To keep things simple, we have created a framework to estimate prices. We have added some fundamental assumptions for a specific example. Using this framework, you can:

  1. Evaluate the costs of different models and find which one is suitable for your use case.
  2. Estimate the cost of your specific task by changing the input and load format in the provided boilerplate code.

There are many web servers for deploying LLMs, such as text-generation-inference, OpenLLM and mosec. Eventually, we hope to compare them, but for today we will use a basic FastAPI setup and Hugging Face’s Text Generation Inference server, without any optimization under the hood.
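As an illustration of what the basic FastAPI setup can look like, here is a minimal sketch; the endpoint name, prompt template and generation settings are our own assumptions, and in practice you would load your fine-tuned LoRA checkpoint rather than the base model:

```python
# Minimal sketch: serving Flan-T5 behind a FastAPI endpoint.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Loads the base model; swap in your fine-tuned checkpoint for real workloads.
generator = pipeline("text2text-generation", model="google/flan-t5-large")

class SummarizeRequest(BaseModel):
    text: str

@app.post("/summarize")
def summarize(req: SummarizeRequest):
    prompt = f"Summarize the following dialogue:\n\n{req.text}"
    output = generator(prompt, max_new_tokens=128)
    return {"summary": output[0]["generated_text"]}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```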

All benchmarks were conducted on a g5.4xlarge AWS instance costing $1.624/hour (on-demand price as of June 2023). For stress-testing purposes, we used a load-testing tool called Vegeta and loaded the web servers with an increasing number of requests per second (RPS) until latency started to degrade significantly or we started getting timeout errors. We ran the experiment for each RPS value multiple times (3–6 times) and calculated the average latency and throughput.

It is worth mentioning that when we perform a load test with a tool like Vegeta and set the request rate to n requests per second, it means that the tool attempts to simulate an average of n requests per second over the duration of the test. It doesn’t guarantee that exactly n requests will be served and completed within each second.

Inference costs are derived from the following (a worked example follows the list):

  • Total tokens the server can process in one hour = RPS × average number of tokens per request (input + output) × 60 seconds × 60 minutes
  • Price per hour = the AWS on-demand price for the instance
  • Inference cost (per 1,000 tokens) = Price per hour / (Total tokens in one hour / 1000)
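Here is a worked example of that formula in Python; the request rate and token count are assumed values, plugged in purely to show the arithmetic:

```python
# Worked example of the inference-cost formula above (assumed load profile).
price_per_hour = 1.624          # USD, g5.4xlarge on-demand price (June 2023)
rps = 5                         # sustained requests per second (assumed)
avg_tokens_per_request = 500    # input + output tokens per request (assumed)

tokens_per_hour = rps * avg_tokens_per_request * 60 * 60
cost_per_1k_tokens = price_per_hour / (tokens_per_hour / 1000)

print(f"{tokens_per_hour:,} tokens per hour")         # 9,000,000
print(f"${cost_per_1k_tokens:.5f} per 1,000 tokens")  # ~$0.00018
```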

This is a specific calculation, but using it makes it easy to compare other LLMs, including closed LLM APIs (such as GPT-4 and Writer’s Palmyra). In future posts, we intend to add more optimizations, with the aim of reducing these numbers. Remember, these numbers are only reference points.

Flan-T5-Large + LoRA for summarization:

Flan-T5-Large + LoRA for classification:

FastAPI

For the summarization task, we varied the RPS from 5 to 30 and examined the system’s responsiveness across different load levels. We found that 90% of all requests had a response time of 18.27 seconds or less (at 30 RPS). The plot also shows that as RPS increases, the 90th-percentile latency rises gradually, signaling potential performance limitations. We found that 35 requests per second is the critical threshold at which the system starts to fail.

The throughput was reported as 1.5, which represents the average number of requests successfully completed per second during the load test.

Next, we performed the same load-testing experiments for the classification task. Here the maximum load the system can cope with is much higher, at 185 requests per second, and 90% of all requests had a response time of 28.01 seconds or less. However, as with the summarization task, the throughput is much lower than the offered load, at 5.84, meaning that approximately 5.84 requests were processed and answered per second.

Hugging Face Text Generation Inference

The Text Generation Inference (TGI) server, developed by Hugging Face, enables faster text generation by using advanced techniques like tensor parallelism and dynamic batching, and supports popular open-source LLMs such as StarCoder, BLOOM, GPT-NeoX, Llama and T5.
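Once a TGI server is running with a Flan-T5 model, querying it is a single HTTP call to its /generate endpoint; the host, port and prompt below are assumptions for illustration:

```python
# Minimal sketch: calling a running TGI server's /generate endpoint.
import requests

response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "Summarize: The meeting covered the Q3 roadmap and hiring plans.",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=30,
)
print(response.json()["generated_text"])
```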

This time, for the summarization task, we varied the RPS from 5 to 120; 90% of all requests had a response time of 2.03 seconds or less (at 120 RPS).

The throughput value was reported as 45.5, which is much higher than the value we were able to get using FastAPI.

For the classification task, the maximum RPS that TGI was able to handle is 145. The throughput is 78.5, which is much higher than the value we got when load testing FastAPI. Moreover, the 90th-percentile latency is also lower, at 1.5 seconds per request.

Conclusion

This brings us to the end of our Flan-T5-Large assessment. To summarize:

  • We showcased how an LLM, in this case Flan-T5-Large, can compete with the BERT family in a low-data regime.
  • We shared the time and cost it takes to train a Flan-T5-Large.
  • Finally, we showed the different ways you can productize the LLM and the associated costs.

If you are looking for more detail, our GitHub repository contains insights about hyperparameter optimization while fine-tuning Flan-T5, and which search space yields better results.

This is the first post in our LLM series; we will release future posts alongside the GitHub repository.
