The Practical Guide to LLMs: Llama 2

Georgian Impact Blog
10 min read · Oct 13, 2023

By: Rohit Saha, Royal Sequeira, Mariia Ponomarenko & Kyryl Truskovskyi

Continuing our assessment of Large Language Models (LLMs) through the lens of our Evaluation Framework, we turn our attention to Llama 2.

In this post, we aim to give you:

  • A brief overview of Llama 2.
  • Different ways to fine-tune Llama 2 on custom datasets.
  • A comparison with BERT, DistilBERT and other LLMs on classification and summarization tasks.
  • An analysis of how much data and time you need to fine-tune your own Llama 2, and what it costs.
  • A look at how to productize Llama 2, and the associated costs.

We are open-sourcing all of our scripts and insights in our LLM GitHub repository.

What is Llama 2?

Llama 2 is Meta’s latest open-source large language model, released in July 2023, and, according to Meta, it can be used for commercial purposes. Llama 2 comes in three different sizes: 7B, 13B, and 70B parameters.

According to Llama 2: Open Foundation and Fine-Tuned Chat Models, Llama 2 was trained on a mix of publicly available datasets. The paper states that any source containing personal information was removed from the datasets. These datasets amount to two trillion tokens, 40% more data than its predecessor Llama 1 was trained on. Llama 2 supports a context length of 4,096 tokens, twice that of its predecessor.

The training process described is very similar to Llama 1’s, with Llama 2 also using a standard Transformer architecture. According to the paper, the changes from Llama 1 include an increased context length and a new attention mechanism (grouped-query attention). Llama 2-Chat, the model’s instruction-tuned counterpart, was trained on publicly available instruction datasets and over 1M human annotations.

The same paper states that Llama 2 was evaluated on popular benchmarks for both pre-training and fine-tuning abilities. The tasks used to benchmark against pre-trained models included, among others, common-sense reasoning, mathematical abilities and general knowledge. Meta found that Llama 2 outperforms the other open-source models they tested (Falcon, MPT, and Llama 1) across all benchmarks. When pitted against closed-source models, Llama 2 performs similarly to GPT-3.5 on certain tasks such as language understanding and math word problems, but lags behind on coding tasks. Meta also found that Llama 2 matches or surpasses PaLM-540B on nearly all benchmark tasks, yet there remain substantial performance gaps between Llama 2 and both PaLM-2-L and GPT-4.

Evaluating Llama 2

For a holistic evaluation, we assess the 7B and 13B versions of Llama 2 across the four pillars of our Evaluation Framework: Performance, Time to Train, Costs and Inference.

Llama 2 Performance

Let’s move on to the first pillar of our evaluation framework: Performance.

We evaluated the performance of Llama 2 on classification and summarization tasks. All experiments were conducted on an AWS EC2 g5.2xlarge instance, which comes with one NVIDIA A10G GPU with 24GB of memory and costs US$1.212 per hour.

We perform supervised fine-tuning via QLoRA on Llama 2 for both classification and summarization tasks. Here is an example of the prompt for fine-tuning for the task of classifying categories of news documents, which is similar to the prompts used in our previous blog posts on benchmarking LLMs:
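The prompt below illustrates the format; the category names and exact wording are placeholders, and the precise prompts we used are in our GitHub repository.

```
Classify the following news article into one of these categories: World, Sports, Business, Sci/Tech.

Article: {article_text}

Category: {label}
```

For reference, a QLoRA fine-tuning setup on this hardware can be configured roughly as in the sketch below. It is a minimal sketch using the Hugging Face transformers, peft and bitsandbytes libraries; the model ID, LoRA hyperparameters and quantization settings shown here are illustrative and may differ from the exact configuration in our repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; the 13B variant follows the same recipe

# 4-bit quantization (the "Q" in QLoRA) keeps the base model within the A10G's 24GB of memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Only the low-rank LoRA adapters are trained; the quantized base weights stay frozen
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The resulting model can then be passed to a standard supervised fine-tuning loop (for example, Hugging Face's Trainer or trl's SFTTrainer) over the prompt-formatted training data.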

Because the task is classification-based, we track accuracy as the main metric. We compare Llama 2 against other popular open-source models, such as Flan-T5-Large, Falcon and RedPajama.

The table below shows the performance of different LLMs at varying fractions of the training data, i.e., their sample efficiency. The last row of the table contains the performance when the entire dataset is used. In these results, Llama 2–13B’s accuracy appears to be higher than all other models we have tested to date. We can also see that the first row of the table, corresponding to the smallest fraction of training samples, shows similar results: Llama 2–13B appears to achieve the best performance in low-data situations across the models we have tested so far.

Llama 2–7B, the smallest version of Llama 2, appears to achieve lower accuracy than the other models across the different sample fractions. These results indicate the impact of the extra parameters in Llama 2–13B.

Now let’s turn to the summarization task. For this task, we use the Samsum dataset where the goal is to summarize chat conversations. We create three prompts corresponding to the three settings: (i) Zero-shot, (ii) Few-shot, and (iii) Fine-Tuning with QLoRA. You can see example prompts below:
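The prompts below illustrate the three settings; the dialogue and summary placeholders stand in for actual Samsum examples, and the exact wording we used is in our GitHub repository.

```
# (i) Zero-shot
Summarize the following conversation.

{dialogue}

Summary:

# (ii) Few-shot
Summarize the following conversation.

{example_dialogue_1}
Summary: {example_summary_1}

{example_dialogue_2}
Summary: {example_summary_2}

{dialogue}
Summary:

# (iii) Fine-tuning with QLoRA (the reference summary is appended as the target during training)
Summarize the following conversation.

{dialogue}

Summary: {reference_summary}
```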

We observe a few trends in the tables below. First, Llama 2–7B’s performance appears to be higher than Llama 2–13B’s in the zero-shot and few-shot settings: Llama 2–7B’s ROUGE-1 and ROUGE-2 scores are both higher than Llama 2–13B’s in both settings. However, after fine-tuning with QLoRA, Llama 2–13B’s scores appear higher by a small margin. In our opinion, these results indicate that Llama 2–7B could be a strong candidate to consider for summarization and Q&A tasks, as it delivers these results despite being smaller than Llama 2–13B.
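For readers who want to reproduce ROUGE numbers for their own model outputs, the scores in the tables below can be computed with the Hugging Face evaluate library. This is a minimal sketch with made-up strings, not our exact evaluation code.

```python
import evaluate

rouge = evaluate.load("rouge")

# Illustrative model output vs. reference summary for a Samsum-style dialogue
predictions = ["Amanda baked cookies and will bring some to Jerry tomorrow."]
references = ["Amanda baked cookies and will bring Jerry some tomorrow."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum
```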

Llama-2 7B Summarization Performance:

Llama-2 13B Summarization Performance:

Next, we compare Llama 2–7B and Llama 2–13B with other models. Both versions of Llama 2 achieve competitive results, with Llama 2–13B appearing to achieve the highest results of the models we tested. Looking at these results, Llama 2 and Falcon appear to be good candidates to consider for summarization tasks. In our view, the 7B versions of both Llama 2 and Falcon can deliver good performance at potentially lower latencies.

Llama 2 Time and Cost to Train

Next, we move to the second and third pillars of our evaluation framework: Time and Cost to Train.

Llama-2 7B Time and Cost to Train Stats:

Llama-2 13B Time and Cost to Train Stats:

As mentioned earlier, all experiments were conducted on an AWS EC2 instance: g5.2xlarge that costs US$1.212 / hour. We can see that the training costs are just a few dollars.
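For a sense of scale, a hypothetical three-hour fine-tuning run on this instance would cost roughly 3 × US$1.212 ≈ US$3.64; the actual durations and costs for each sample fraction are shown in the tables above.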

Llama 2–13B’s fine-tuning takes longer than Llama 2–7B owing to its relatively larger model size. As a result, gradient updates take more time, which leads to higher training costs.

Llama 2 Inference

The fourth pillar of our evaluation framework is Inference. With Llama 2, in addition to our previous setup, we also deployed using a new method: Ray, an open-source unified compute framework. We also tested the deployment methods on an instance equipped with an NVIDIA A100 40GB GPU, in addition to the NVIDIA A10G 24GB GPU, because we wanted to see how inference performance changes with more compute power.

As in our previous assessments, we used the load-testing tool Vegeta with the aim of finding the maximum number of requests per second (RPS) the server is able to handle. We also measured throughput and latency. We used a set of sample sentences, each comprising approximately 100 tokens, to generate the requests, and for each request during load testing we randomly picked a sentence from this sample set. This method aims to keep our test outcomes consistent. Through these tests, we identified the RPS capacity of each model and serving method for classification and summarization.
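For reference, a load test of this kind can be launched with a single Vegeta command; the endpoint URL, request rate and payload file below are illustrative rather than our exact test configuration.

```
# payload.json holds a JSON request body built from one of the ~100-token sample sentences
echo "POST http://localhost:8000/predict" | \
  vegeta attack -rate=30 -duration=60s -body payload.json -header "Content-Type: application/json" | \
  vegeta report
```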

The NVIDIA A100 40GB GPU on GCP costs US$3.81 per hour. The NVIDIA A10G 24GB GPU on an AWS g5.4xlarge costs US$1.624 per hour.

We previously mentioned our exploration of new frameworks for model serving. One of these is Ray, a tool that offers a scalable library called Ray Serve, specifically designed for constructing online inference APIs. For our testing, we kept it simple, avoiding advanced optimizations like dynamic batching or multi-node/multi-GPU serving. Another tool we experimented with is vLLM, a fast and user-friendly library tailored for LLM inference and serving. Deploying vLLM was straightforward; it required just a single command and a model stored in a Hugging Face repository. For FastAPI, we used two workers to serve the model; we found that using two workers prevents “Out of Memory” errors and introduces parallelism to request handling. As with our previous assessments, we merged the LoRA layers into the base Llama models for TGI, as required for serving.
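For example, the vLLM demo server can be started with a single command pointed at a Hugging Face model ID (the model ID below is illustrative):

```
python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-hf
```

A minimal Ray Serve deployment for a model like this looks roughly like the sketch below; the class name, generation parameters and model-loading details are illustrative, and our actual serving scripts are in the GitHub repository.

```python
from ray import serve
from starlette.requests import Request
from transformers import AutoModelForCausalLM, AutoTokenizer

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class LlamaService:
    def __init__(self, model_id: str = "meta-llama/Llama-2-7b-hf"):  # illustrative model ID
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    async def __call__(self, request: Request) -> dict:
        prompt = (await request.json())["prompt"]
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=256)
        return {"text": self.tokenizer.decode(output_ids[0], skip_special_tokens=True)}

app = LlamaService.bind()
# serve.run(app)  # exposes the deployment at http://127.0.0.1:8000/ by default
```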

Llama-2–7B Classification

For the classification task, TGI and vLLM outperformed the other deployment methods that we tested, while Ray handled more requests than FastAPI at a fraction of the cost. vLLM handled even more requests than TGI and provided higher throughput (87.06 compared to TGI’s 54.41 on the NVIDIA A100).

To illustrate the difference in performance between instances, the graph below shows the results for TGI at different RPS, where you can see the difference in latency when using the A100 GPU. We also saw higher throughput on the A100 (54.41 compared to 19.81 on the A10), while the cost per 1K tokens was not much higher on the A100 (US$0.00007) than on the A10 (US$0.00003).

Llama-2–13B Classification

For the classification task, the results with Llama-2–13B follow a similar pattern to those with Llama-2–7B. Looking at our results in the table above, TGI and vLLM again appear to outperform the other deployment methods, with vLLM handling more requests.

Llama-2–7B Summarization

For the summarization task, RPS was broadly similar across the deployment options, although we observed that TGI on an A100 appeared to deliver higher RPS (205), with a latency of 0.75 seconds.

Llama-2–13B Summarization

Looking at the summarization results for Llama-2–13B, we observe that vLLM appears to achieve the highest RPS on both the NVIDIA A10 and A100. Of the other approaches, Ray appears to outperform FastAPI (55 RPS compared to FastAPI’s 10 RPS on the NVIDIA A10) at lower cost (US$0.00006 on both instances).

Conclusion

Based on our analysis, Llama 2 could be important for companies leveraging LLMs owing to its strong performance in low-data situations and its low training costs.

We evaluated different methods to leverage Llama 2–7B and Llama 2–13B, namely in-context learning and fine-tuning with QLoRA, and found fine-tuning to outperform in-context learning on both the classification and summarization tasks.

We discussed how Llama 2–7B and Llama 2–13B appear to outperform language models such as the BERT family, and other LLMs. We observed that Llama 2–13B fine-tuned with QLoRA appears to achieve stronger results than the other models tested across both tasks, even at different sample efficiencies, based on our assessments to date. For summarization tasks, Llama 2–7B performs better than Llama 2–13B in zero-shot and few-shot settings, making Llama 2–7B an option to consider for building out-of-the-box Q&A applications.

Fine-tuning both versions of Llama 2 takes a reasonable amount of time, and the associated costs to train are low. Llama 2–13B takes longer to fine-tune when compared to Llama 2–7B, owing to the differences in their model sizes.

For inference, we tested four deployment methods on two instances. Although a more powerful instance often leads to greater throughput and reduced latency, the key appears to be selecting an efficient deployment platform; based on our results, we believe the strongest options are vLLM and Text Generation Inference. We believe Ray may be a good fit as an inference server for LLMs if we leverage additional optimizations. In our examination of various compute capabilities, we noted that using a more powerful instance, specifically the NVIDIA A100, results in enhanced performance, higher throughput and higher peak RPS. While this setup comes at a slightly higher cost, it may be a justifiable investment for scenarios where achieving optimal throughput/RPS is crucial.

If you are looking for more detail on our work, our GitHub repository contains insights about hyperparameter optimization while fine-tuning Llama 2–7B and Llama 2–13B.

This blog is the fourth in a series of posts exploring LLMs that we intend to make available alongside the GitHub repository.
