The Practical Guide to LLMs: RedPajama

Georgian Impact Blog
Sep 8, 2023

By: Rohit Saha, Akash Saravanan, Mariia Ponomarenko & Kyryl Truskovskyi

Continuing our assessment of Large Language Models (LLMs) through the lens of our Evaluation Framework, we turn our attention to RedPajama.

In this post, we aim to give you:

  • A brief overview of RedPajama.
  • Different ways to fine-tune RedPajama on custom datasets.
  • A comparison with BERT, DistilBERT and other LLMs on classification and summarization tasks.
  • An analysis of how much data and time you need to fine-tune your own RedPajama, and the cost.
  • A look at how to productize RedPajama, and the associated costs.

We are open-sourcing all of the scripts and our insights in our LLM GitHub repository.

What is RedPajama?

RedPajama-INCITE combines Together.ai’s RedPajama dataset and EleutherAI’s Pythia model architecture to form an open source LLM. In our view, one interesting aspect is that there is a 3B parameter size model, which is unusual in our experience. Together.ai reasons that this model size will allow for wider adoption due to smaller hardware requirements and easier experimentation.

Together.ai introduced and uses the RedPajama dataset, a 1.2T-token open-source replication of the LLaMA training data. That is, they follow the same pre-processing and filtering steps, use the same data sources and extract roughly the same number of tokens from each, although there is no guarantee that this is the exact data LLaMA used. All of the sources are fairly standard, public datasets: CommonCrawl, C4, GitHub, arXiv, Books, Wikipedia and StackExchange. RedPajama is primarily an English-language dataset, but the authors note that the Wikipedia portion contains text in 20 different languages.

RedPajama comes in two sizes: 3B and 7B. Each model has three variations: base, instruction fine-tuned, and chat. Both the 3B and 7B versions follow the Pythia model architecture. The Pythia model is based on GPT-3 (hence it’s decoder-only) with the following changes:

  1. It uses fully dense attention layers instead of alternating sparse and dense attention layers. Research subsequent to GPT-3 found that this works better.
  2. FlashAttention (Dao et al., 2022): An exact-attention algorithm that is IO-aware and uses fewer memory accesses. FlashAttention was shown to increase speed by up to 3x compared to standard attention.
  3. Rotary Positional Embeddings (RoPE) (Su et al., 2021): RoPE attempts to unify absolute and relative positional embeddings.
  4. Parallelized Attention (Wang & Komatsuzaki 2021): The attention and feedforward layers are organized in parallel instead of sequentially. This arrangement is an optimization done to speed up the training process. Specifically, some matrix multiplications can be combined into a single operation.
  5. It uses untied embedding matrices. That is, the embedding and reverse-embedding matrices are not the same. This matrix setup is intended to increase interpretability.

Evaluating RedPajama

For a holistic evaluation, we assess both 3B and 7B versions of RedPajama across the four pillars of our Evaluation Framework: Performance, Time to train, Costs and Inference.

RedPajama Performance

Let’s move on to the first pillar of our evaluation framework: Performance.

We evaluated the performance of RedPajama across the tasks of classification and summarization. All experiments were conducted on an AWS EC2 g5.2xlarge instance, which comes with one NVIDIA A10G GPU with 24 GB of memory and costs US$1.212/hour.

We use two distinct methods with RedPajama for the tasks of classification and summarization: (i) In-Context Learning, and (ii) Fine-Tuning via QLoRA. If you are unfamiliar with these two methods, our previous blog explains them with example prompts.
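To make the fine-tuning setup concrete, here is a minimal QLoRA sketch using HuggingFace transformers and peft. The checkpoint name, the dataset (AG News as a stand-in for news classification) and the hyperparameters are illustrative assumptions; the exact scripts and settings we used live in our GitHub repository.

```python
# A minimal QLoRA fine-tuning sketch, assuming the 3B base checkpoint on HuggingFace
# and AG News as a stand-in "news classification" dataset.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "togethercomputer/RedPajama-INCITE-Base-3B-v1"  # assumed checkpoint name

# Load the frozen base model in 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; for the Pythia/GPT-NeoX architecture the
# fused attention projection is named "query_key_value".
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["query_key_value"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights are trained

# Turn each example into a prompt + label string and tokenize it.
def to_features(example):
    text = f"Classify the news article.\n\n{example['text']}\n\nLabel: {example['label']}"
    return tokenizer(text, truncation=True, max_length=512)

train_data = load_dataset("ag_news", split="train[:1000]").map(to_features)

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    args=TrainingArguments(output_dir="redpajama-3b-qlora", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=50),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("redpajama-3b-qlora")  # saves only the LoRA adapter weights
```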

Here is an example of the prompts for zero-shot prompting, few-shot prompting and fine-tuning for the task of classifying categories of news documents:
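The snippets below are illustrative reconstructions of what the three settings look like; the article text and category names (which follow the AG News convention) are placeholders, not the exact prompts from our experiments.

```python
# Reconstructed, illustrative prompts for news-topic classification.
zero_shot_prompt = """Classify the following news article into one of these categories:
World, Sports, Business, Sci/Tech.

Article: Oil prices rose on Monday as supply concerns outweighed demand worries...
Category:"""

few_shot_prompt = """Classify the following news articles into one of these categories:
World, Sports, Business, Sci/Tech.

Article: The home team clinched the championship with an overtime win...
Category: Sports

Article: Shares of the chipmaker jumped after a strong earnings report...
Category: Business

Article: Oil prices rose on Monday as supply concerns outweighed demand worries...
Category:"""

# For fine-tuning, each training example pairs the same instruction-style prompt
# with the ground-truth label appended as the completion.
fine_tuning_example = zero_shot_prompt + " Business"
```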

Now let’s look at some numbers. Because we are solving a classification task, we track accuracy as the main metric.

With accuracy scores of 0%, the model fails to predict correctly when used in a zero-shot setting. In the few-shot setting, similar to our findings for Falcon-7B, the model returns an error because the prompt becomes too long after including several examples. Fine-tuning, on the other hand, improves accuracy over the zero-shot setting, with 72.34% accuracy for 3B and 75.52% for 7B, showing the merits of tuning RedPajama on proprietary datasets.

Comparing RedPajama alongside Flan-T5-Large, Falcon-7B and other popular language models, we notice the following accuracies:

Both versions of RedPajama are competitive when compared to other language models. The 7B version comes close to Falcon-7B’s performance. Furthermore, when trained on fewer samples, RedPajama performs at a similar level to other LLMs. The ablation study on sample complexity (see table below) suggests that RedPajama is also a strong candidate to consider in low data situations.

Now let’s turn to the summarization task. For this task, we use the Samsum dataset where the goal is to summarize chat conversations. Similar to classification, we create three prompts corresponding to the three settings: (i) Zero-shot, (ii) Few-shot, and (iii) Fine-Tuning with QLoRA. You can see example prompts below:
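For reference, an illustrative zero-shot and fine-tuning prompt for Samsum-style dialogues might look like the following; the dialogue and summary are placeholders, not the exact prompts from our experiments.

```python
# Reconstructed, illustrative prompts for dialogue summarization.
dialogue = """Amanda: I baked cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you some tomorrow :-)"""

zero_shot_prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"

# In the fine-tuning setting, the reference summary is appended as the completion.
fine_tuning_example = zero_shot_prompt + " Amanda baked cookies and will bring Jerry some tomorrow."
```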

RedPajama does a better job at summarizing dialogues than classifying news documents in zero-shot and few-shot settings. However, our results indicate that fine-tuning is still more effective as it helps RedPajama learn the summarization style specific to the dataset as opposed to creating a generic summary.

When compared with other LLMs we have tested, we observe that RedPajama achieves competitive metrics, although Falcon-7B leads the table (see below).

RedPajama Time and Cost to Train

Next, we move to the second and third pillars of our evaluation framework: Time and Cost to Train.

As mentioned earlier, all experiments were conducted on an AWS EC2 g5.2xlarge instance that costs US$1.212/hour. We can see that the training costs are just a few dollars.
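The arithmetic behind those numbers is simple: multiply the measured training time by the instance’s hourly rate. The sketch below shows the calculation; the three-hour run is a hypothetical example, not a measured value.

```python
# Back-of-the-envelope training-cost calculation for a g5.2xlarge instance.
HOURLY_RATE_USD = 1.212  # g5.2xlarge on-demand price

def training_cost(hours: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of a training run that takes `hours` wall-clock hours."""
    return hours * hourly_rate

# e.g. a hypothetical 3-hour fine-tuning run:
print(f"${training_cost(3.0):.2f}")  # -> $3.64
```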

RedPajama Inference

Let’s move on to the fourth pillar of our evaluation framework: Inference.

As with previous assessments in this series, we used a load testing tool called Vegeta to test how effectively the system handles a large number of requests. We selected the HuggingFace Text Generation Inference Server and FastAPI as our deployment options. We aimed to determine the maximum requests per second (RPS) each model can handle, as well as the throughput, latency and cost per 1,000 tokens. We built a collection of example sentences each consisting of ~100 tokens in order to produce the requests. Then we randomly selected one of these sentences for each request during the load testing experiment. This approach attempts to keep our testing results consistent. Through experimentation, we determined the typical ranges of RPS that each model and service could handle for each task.
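Vegeta itself is a Go-based CLI tool; as a rough Python approximation of the same idea, the sketch below fires requests at a fixed rate against an assumed local inference endpoint and reports throughput and mean latency. The URL, payload shape and sentence pool are illustrative assumptions, not our actual test harness.

```python
# A minimal constant-RPS load-test sketch (a simplified stand-in for Vegeta).
import asyncio
import random
import time

import httpx

SENTENCES = ["A sample input sentence of roughly one hundred tokens ..."] * 10  # placeholder pool
URL = "http://localhost:8080/generate"  # assumed local endpoint

async def one_request(client: httpx.AsyncClient, latencies: list) -> None:
    payload = {"inputs": random.choice(SENTENCES), "parameters": {"max_new_tokens": 20}}
    start = time.perf_counter()
    await client.post(URL, json=payload, timeout=60)
    latencies.append(time.perf_counter() - start)

async def load_test(rps: int, duration_s: int = 10) -> None:
    latencies: list = []
    async with httpx.AsyncClient() as client:
        tasks = []
        start = time.perf_counter()
        for _ in range(rps * duration_s):
            tasks.append(asyncio.create_task(one_request(client, latencies)))
            await asyncio.sleep(1 / rps)  # keep a constant request rate
        await asyncio.gather(*tasks)
        elapsed = time.perf_counter() - start
    print(f"throughput: {len(latencies) / elapsed:.1f} req/s, "
          f"mean latency: {sum(latencies) / len(latencies):.3f} s")

asyncio.run(load_test(rps=10))
```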

To evaluate RedPajama models through FastAPI, we decided to run the service with a single worker, so that GPU memory is allocated to just one model instance at any given time. By doing so, we averted “Out of Memory” errors, as the memory demands of RedPajama models surpassed the available GPU memory when multiple instances were executed concurrently. Since TGI requires a standalone model, we have to merge the base model with the LoRA layers, as sketched below.
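Merging can be done with peft’s merge_and_unload, which folds the LoRA weights back into the base layers so that a single standalone checkpoint can be served. The model and adapter paths below are illustrative assumptions.

```python
# Merge a LoRA adapter into the base model to produce a standalone checkpoint for TGI.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "togethercomputer/RedPajama-INCITE-Base-3B-v1"  # assumed base checkpoint
adapter_path = "redpajama-3b-qlora"                          # assumed LoRA adapter directory

model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, adapter_path)
merged = model.merge_and_unload()  # folds the LoRA weights into the base layers

merged.save_pretrained("redpajama-3b-merged")
AutoTokenizer.from_pretrained(base_model).save_pretrained("redpajama-3b-merged")
# The merged directory can then be served by TGI, e.g. via --model-id.
```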

All load testing experiments have been performed on an AWS g5.4xlarge instance that costs US$1.624 per hour.

Classification

For the classification task we tested the FastAPI service for RPS ranging from 1 to 4 with a step size of 1. Then, we evaluated Text Generation Inference (TGI) for RPS ranging from 10 to 150, using a step size of 15. We then calculated the average throughput and latency for the maximum possible RPS. The tables and plots demonstrate significant differences in the response speed and load capacity when deploying the RedPajama model through FastAPI versus TGI.

RedPajama-3B + LoRA:

The inference cost for FastAPI (US$0.001 / 1K tokens) is higher than for TGI (US$0.00003 / 1K tokens). Moreover, looking at the latency values, it would cost US$0.0006 to get responses to 135 requests in 1.44 seconds using TGI, which is not possible to achieve with FastAPI.
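These figures follow directly from the instance’s hourly price: a batch of requests that returns in a given latency costs (latency / 3600) × hourly rate. A quick sketch:

```python
# How the per-batch cost is derived from the instance's hourly price.
HOURLY_RATE_USD = 1.624  # g5.4xlarge on-demand price

def batch_cost(latency_s: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of occupying the instance for `latency_s` seconds."""
    return latency_s / 3600 * hourly_rate

print(f"${batch_cost(1.44):.4f}")  # 135 requests answered in 1.44 s on TGI -> ~$0.0006
```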

RedPajama-7B + LoRA:

RedPajama-7B performs similarly to RedPajama-3B, but with a slightly lower RPS for text-generation; this result is expected due to the larger model size.

Taking the latency value into account, it costs US$0.001 to get responses to 125 requests with the RedPajama-7B model deployed on TGI.

Summarization

For the summarization task, we tested the models at RPS values ranging from 10 to 200 with a step size of 15. You can see our results in the tables below.

RedPajama-3B + LoRA:

Even though the maximum RPS is quite similar for FastAPI and TGI with RedPajama-3B, the throughput and latency differ, which affects the price per number of responses. According to our results, it costs US$0.0003 to get responses to 195 requests using TGI and US$0.01 for 160 requests using FastAPI.

RedPajama-7B + LoRA:

As we saw with the classification task, the load-testing performance of RedPajama-7B for summarization is similar to RedPajama-3B. The maximum RPS for FastAPI stays the same, while for TGI it is lower.

Our load testing experiments highlight a key finding: the difference in model size does not have a big impact on inference. Instead, the factor influencing performance is the choice of deployment platform. In our experience, using an option like TGI has been consistently more effective than FastAPI. For instance, in the scenario of RedPajama, we observed a latency improvement of up to 12 times for both classification and summarization tasks.

Conclusion

To summarize our RedPajama assessment:

  • We discussed and evaluated different methods to leverage RedPajama; both 3B and 7B versions.
  • We assessed how RedPajama can compete with similar language models, such as Falcon-7B, Flan-T5-Large and the BERT family.
  • We shared the time and cost it took to train RedPajama 3B and 7B in our tests.
  • Finally, we showed the different ways you could productize the LLM and the associated costs.

If you are looking for more detail, our GitHub repository contains insights about hyperparameter optimization while fine-tuning RedPajama, and which search space yields better results.

This blog post is the third in a series exploring LLMs, which we intend to make available alongside the GitHub repository.
