2023: The Year of LLMs
By: Benjamin Ye, Mariia Ponomarenko, Kyryl Truskovskyi & Rohit Saha
Introduction
In 2023, the AI industry, and the field of language models in particular, progressed significantly. Following the release of ChatGPT in late 2022, we saw the release of larger, more capable models like GPT-4, Bard and Claude. Tucked away from the spotlight, there exists a thriving ecosystem of smaller, open-source models.
We believe that these smaller models are not necessarily in competition with commercial models; rather, the two complement each other. Whereas commercial models do well on generalized tasks, smaller models excel at specialized tasks. Moreover, given their open-source nature, smaller models can be fine-tuned to suit any downstream task.
The potential of using smaller open-source models to perform domain-specific tasks led us to investigate methods for fine-tuning and serving them in production. In this blog post, we will go over our high-level learnings and also revisit our earlier works related to open-source models.
Our Research in 2023
In 2023, we delved into the capabilities of popular open-source LLMs. In our experiments, we tested model performance uplift after fine-tuning, as well as latency and cost when these models are put into an inference server. Experiment results are documented for Flan-T5, Falcon, RedPajama, and Llama 2.
We also ran a comparison study of the three 7B-parameter models that were considered industry-leading at the time of publication: Llama 2, Mistral 7B, and Zephyr.
Lastly, we investigated the impact of the choice of GPU and inference optimization on the performance (namely throughput and latency) of the inference endpoint.
In this blog post, we have condensed the main lessons learned from our experiments into two sections: fine-tuning and inference.
Fine-tuning
Looking forward to 2024, some speculate that the performance of open-source LLMs (OS-LLMs) will reach parity with closed-source, commercial alternatives. Indeed, there is some indication that OS-LLMs are catching up in general capabilities, as shown by benchmarks such as MT-Bench. When trained in specialized domains (such as coding), open-source models appear capable of surpassing their closed-source counterparts, as measured by benchmarks such as HumanEval. These developments are expected to assist in developing enterprise applications where users seek models that are:
- Specialized to complete downstream tasks as opposed to generalized chat;
- Hosted on custom infrastructure where enterprises have control of their own data; and
- Cheaper and faster to train and host compared to commercial APIs.
Observations
We fine-tuned leading OS-LLMs in the 7B- to 30B-parameter range for summarization and classification using the QLoRA technique outlined in this paper. A minimal setup sketch and our results follow.
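The sketch below shows a minimal QLoRA setup using Hugging Face’s transformers, peft and bitsandbytes libraries. The model name, LoRA rank and other hyperparameters are illustrative assumptions rather than our exact experiment configuration; the actual configurations live in the LLM Finetuning Hub repo referenced at the end of this post.

```python
# Minimal QLoRA sketch: 4-bit NF4 quantization plus low-rank adapters.
# Model id, rank and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # assumed example model

# 4-bit NF4 quantization with double quantization, as described in the QLoRA paper
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections (PEFT's default behaviour)
lora_config = LoraConfig(
    r=16,                      # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```

From here, the quantized, adapter-wrapped model can be passed to a standard Hugging Face Trainer or to TRL’s SFTTrainer for the actual fine-tuning run.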
Summarization Fine-Tuning Results
We used the Samsum dataset to evaluate dialogue summarization capabilities and observed that open-source models perform fairly well out-of-the-box but tend to underperform closed-source models such as GPT-3.5 and Jurassic-2. After fine-tuning, the OS-LLMs in the 7B to 13B parameter range performed better than the untuned baselines of the tested open- and closed-source models.
Text Classification Fine-Tuning Results
For text classification, we used the 20 Newsgroups dataset. We fine-tuned OS-LLMs using different training-set sizes (see the results table below). We observed that LLMs perform better than BERT in a low-data regime, potentially demonstrating that LLMs can generalize their pre-trained knowledge to new domains. Among the OS-LLMs we tested, Llama-2-13B performed better than its peers and also appears to be more sample-efficient. When we used the entire 20 Newsgroups training set, Llama-2-13B performed similarly to a fine-tuned GPT-3.5. By comparison, the other OS-LLMs achieved accuracies in the 72–75% range, in line with GPT-3.5 after it was fine-tuned on 2.5% of the training set.
Performance
OS-LLMs demonstrated a significant uplift in performance after fine-tuning for a particular downstream task. In most cases, fine-tuned OS-LLMs performed better than untuned commercial LLMs, which we view as remarkable given that OS-LLMs are much smaller.
We also tested different methodologies and hyperparameter settings in our experiments and found the following:
Target Modules
By default, Hugging Face’s PEFT library only generates low-rank adapters for the attention layers (the query and value projections, in particular). This default setting can be overridden by supplying the desired target_modules in PEFT’s LoraConfig object.
When testing the Zephyr model, we ran one set of experiments with the default target_modules=["q_proj", "v_proj"].
For another set of experiments with Zephyr, we used target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "lm_head"].
Across the settings we tested, tuning all transformer modules performed better than tuning the attention layers alone, particularly in low-data regimes. (See the table below.)
It appears that with full-module tuning, Zephyr is able to converge faster: with only 2.5% of training samples, it can compete with attention-tuned Zephyr models that have seen 25% of the samples. This performance uplift, however, comes with a caveat: tuning all modules in the Zephyr model resulted in a ~5x increase in trainable parameters. The trade-off still seems worthwhile if compute and training time are not a concern, especially when labeled data is scarce.
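To make the comparison concrete, here is a rough sketch of counting trainable parameters under the two target_modules settings. The Zephyr checkpoint name and LoRA rank are assumptions for illustration, and loading the full base model twice is wasteful outside of a quick check.

```python
# Sketch: compare trainable-parameter counts for attention-only vs. full-module LoRA.
# Checkpoint name and rank are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

ATTENTION_ONLY = ["q_proj", "v_proj"]
ALL_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj",
               "gate_proj", "up_proj", "down_proj", "lm_head"]

def trainable_params(target_modules):
    # Load a fresh base model so adapters from one run don't carry over to the next
    base = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
    peft_model = get_peft_model(
        base,
        LoraConfig(r=16, lora_alpha=32, target_modules=target_modules,
                   task_type="CAUSAL_LM"),
    )
    return sum(p.numel() for p in peft_model.parameters() if p.requires_grad)

print("attention-only:", trainable_params(ATTENTION_ONLY))
print("all modules:   ", trainable_params(ALL_MODULES))
```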
NEFTune
As part of our Zephyr experiments, we also tested whether adding NEFTune to the fine-tuning recipe could further improve performance. NEFTune appears to improve performance for models fine-tuned on smaller amounts of data, as seen in the model’s rapid convergence at the 2.5% and 5% sample fractions. As training data grows, however, models tuned with NEFTune alone seem to exhibit slower convergence.
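For reference, recent versions of TRL expose NEFTune through a neftune_noise_alpha argument on SFTTrainer. The sketch below assumes that interface, along with an illustrative dataset, text column, LoRA config and alpha value; it is not our exact training script.

```python
# Sketch of enabling NEFTune (noisy embedding fine-tuning) via TRL's SFTTrainer.
# Dataset, text column, model id and alpha value are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

dataset = load_dataset("samsum", split="train")  # assumed dataset

trainer = SFTTrainer(
    model="HuggingFaceH4/zephyr-7b-beta",   # SFTTrainer can load from a model id
    train_dataset=dataset,
    dataset_text_field="dialogue",          # assumed text column
    max_seq_length=1024,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    neftune_noise_alpha=5,                  # add uniform noise to embeddings during training
)
trainer.train()
```

In practice, this would be combined with the quantization setup shown earlier so the base model fits on a single GPU.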
Prefix Tuning
We also tested prefix tuning, and the results did not appear to be promising. For defined downstream tasks, QLoRA was able to achieve better and more predictable improvements in performance.
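For completeness, a minimal prefix-tuning setup with PEFT looks roughly like the sketch below; the model id and number of virtual tokens are illustrative assumptions, not necessarily the configuration we benchmarked.

```python
# Sketch of prefix tuning with PEFT: trainable "virtual token" prefixes are
# prepended to each layer's keys/values instead of adapting the weights.
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed model
prefix_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model = get_peft_model(model, prefix_config)
model.print_trainable_parameters()
```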
Takeaways
Based on the results of our experiments, fine-tuning OS-LLMs with LoRA and QLoRA may be a useful method for adapting models to specialized downstream tasks. We ran experiments on both summarization and classification tasks and saw OS-LLM performance improve in both areas following fine-tuning. In addition, OS-LLMs can be trained and served in custom environments without passing data on to a third party.
Finally, some of the most popular OS-LLMs such as Llama 2, Mistral, and Zephyr may be smaller in terms of parameter counts, but as a result, they may be faster and cheaper to train and serve. In sum, fine-tuning OS-LLMs may be a viable alternative for researchers who are currently relying on commercial APIs to complete specialized tasks.
Fine-Tuning Results Table
We have distilled our results into the following table as a starting point for researchers seeking to configure custom fine-tuning jobs.
Inference Speeds
A key challenge in deploying LLMs is achieving fast inference speeds. In our view, addressing inference involves answering several critical questions:
- What factors contribute to the slow performance of LLMs in production environments?
- How can LLMs be effectively integrated into complex ML applications with multiple components?
- What strategies can be employed to accelerate LLM inference?
In this section, we will summarize our findings and insights on inference speeds.
Observations
Over the course of our experiments on inference speed, we observed that model size, deployment platform and hardware all appear to significantly influence performance. Typically, benchmark metrics improve with smaller model sizes, more powerful GPUs (such as an Nvidia A100 instead of an Nvidia A10) and specialized inference servers rather than custom-built web servers.
In our Inference blog post, we discussed the potential capabilities of the new inference engine, vLLM. While servers like Ray and Triton offer better structure, organization and more efficient hardware utilization, some bottlenecks in LLM inference may remain, such as memory management during attention computation. The memory-management bottleneck can be effectively addressed by the PagedAttention mechanism.
One of our past experiments involved deploying a custom Llama 2 classification model using FastAPI, Ray, Text Generation Inference (TGI) and vLLM to investigate the server’s peak load capacity with varying model sizes and hardware configurations.
The server’s peak load capacity can be defined as the maximum number of requests that can be processed in one second. Although there’s no guarantee that the server will consistently process that many requests every second without the appropriate infrastructure, our benchmark results provided interesting insights into the relationship between hardware, serving platform and model size.
See the charts below for the performance of two models: the Llama-2-7B classification model and the Llama-2-13B classification model.
What conclusions have we drawn from our experiments?
- FastAPI and Ray (represented by the yellow and red plots) demonstrated significantly lower performance compared to TGI and vLLM (depicted by the white and green plots).
- The peak number of requests for vLLM can be up to five times higher than for Ray.
- Latency decreases drastically on an Nvidia A100 compared to an Nvidia A10: for Llama-2-7B, latency drops from 15 seconds to less than five seconds per request.
The most noticeable difference in the plots lies between the performance of FastAPI/Ray and vLLM/TGI. This variance is attributed to the methods used to generate output tokens during inference. For FastAPI and Ray, we used the built-in generation method from Hugging Face’s transformers library; for vLLM, we used its custom engine built on the PagedAttention technique. We think that PagedAttention can more efficiently manage the key and value memory used by the attention mechanism, thereby significantly boosting performance.
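For readers who want to try vLLM directly, the sketch below shows minimal offline generation with its engine, which pages the attention KV cache automatically; the model name and sampling settings are illustrative assumptions.

```python
# Minimal offline generation with vLLM; the engine manages the KV cache via PagedAttention.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")           # assumed model id
sampling_params = SamplingParams(temperature=0.0, max_tokens=16)

prompts = ["Classify the topic of the following post: ..."]
outputs = llm.generate(prompts, sampling_params)      # batches and schedules requests internally
print(outputs[0].outputs[0].text)
```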
Performance
The performance of LLM inference can be characterized by latency and throughput. Latency here refers to the maximum time taken to process 90% of requests (i.e., the p90 latency). Although we can send a certain number of requests per second, not all of them will be processed simultaneously. This limitation leads to the consideration of throughput, or the number of requests the server can process in one second.
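As a small illustration of how these two metrics fall out of a load test, consider the sketch below; the timing numbers are made up for illustration and are not benchmark results.

```python
# Sketch: p90 latency and requests-per-second from per-request timings (made-up data).
import numpy as np

latencies_s = np.array([0.21, 0.35, 0.33, 0.47, 0.90, 0.41])  # per-request latencies
test_duration_s = 2.0                                          # wall-clock length of the test

p90_latency = np.percentile(latencies_s, 90)       # "latency" as reported in our tables
throughput = len(latencies_s) / test_duration_s    # completed requests per second

print(f"p90 latency: {p90_latency:.2f}s, throughput: {throughput:.1f} req/s")
```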
To obtain these metrics, we used the benchmarking tool Vegeta. (For more information on Vegeta, please see our previous blog posts on the tool.) As previewed above, given that vLLM is the best performer on inference latency and throughput, we used existing integrations of different serving platforms with vLLM. Some platforms (e.g. Ray) use vLLM directly as a library, while others (e.g. TGI) implement specific components, such as the custom CUDA kernels developed by vLLM, in their own version of PagedAttention.
Ultimately, we found that token generation behaves similarly regardless of how vLLM is employed, provided PagedAttention is used to address memory bottlenecks. The table below presents our benchmark results for serving the Llama-2-7B classification model on various servers. Over 10 minutes, we sent a fixed number of requests (as indicated in the RPS column) to the server every second. Latency remains consistently under one second across all cases, and it is noteworthy that the throughput value does not differ significantly from the RPS value, indicating that the number of processed requests per second closely matches the number sent.
Moreover, we observed that the size of the model may influence the number of requests the server can process per second, but does not appear to influence the latency itself.
Throughput can be improved by strengthening the infrastructure around the LLM. For example, deploying the model on Amazon SageMaker, which comes with further optimizations and managed-service capabilities like autoscaling, improves the model’s throughput. In a previous inference-dedicated blog post, we saw that it is possible to send requests to the server over an extended period (one hour in our case) without causing the server to crash.
Key Takeaways
What features does each LLM server provider cover?
The following table outlines the most commonly observed features of different inference servers.
vLLM is primarily used as a library offering an asynchronous LLM engine enhanced by PagedAttention for greater throughput. The vLLM team has also developed a FastAPI server utilizing this engine. Beyond these features, however, we think that vLLM offers limited functionality for more complex scenarios. vLLM shares this limitation with TGI.
Triton and Ray, on the other hand, may be more suitable for building complex ML applications. They share several core features in model management and deployment. Both platforms support concurrent model execution, allowing multiple models to run simultaneously for improved efficiency. They also utilize dynamic batching techniques to group inference requests and thereby enhance throughput. Triton and Ray are both framework-agnostic and offer gRPC support for model inference.
One of Ray’s distinguishing features is its effective resource allocation, which includes fractional resource allocation. This feature may be particularly useful when working with a limited number of GPUs/CPUs, as it allows precise allocation based on the actual needs of the models for inference. Furthermore, Ray performs well when serving multiple applications, making it a good option for scenarios where models or business logic are logically segmented. With Ray, business logic can be divided into distinct applications for organization and management.
When to use what?
For quick deployment of LLMs, one of the best methods, in our opinion, is a TGI server, which can be easily integrated into applications using a designated inference client from the Hugging Face Hub library.
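Assuming a TGI container is already running, querying it from Python can be as simple as the sketch below; the endpoint URL and generation parameters are placeholder assumptions.

```python
# Sketch: query a running TGI server with the Hugging Face Hub inference client.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # assumed address of the TGI container
response = client.text_generation(
    "Summarize the following dialogue: ...",
    max_new_tokens=128,
)
print(response)
```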
If a company’s application consists of multiple services — for example, a speech recognition and synthesis system with LLMs being one of many components — then Ray paired with the vLLM engine may be a viable option. This combination can offer the advantages of model composition support provided by Ray and the strong serving throughput afforded by vLLM.
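A rough sketch of this pattern, wrapping vLLM’s async engine inside a Ray Serve deployment, is shown below. The class name, model id and sampling settings are illustrative assumptions, and the vLLM engine API may differ between versions; treat it as a starting point rather than a production recipe.

```python
# Sketch: a Ray Serve deployment that wraps vLLM's asynchronous engine so the LLM
# becomes one component of a larger application. Names and settings are assumptions.
from ray import serve
from starlette.requests import Request
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid


@serve.deployment(ray_actor_options={"num_gpus": 1})
class LlamaService:
    def __init__(self):
        args = AsyncEngineArgs(model="meta-llama/Llama-2-7b-hf")  # assumed model
        self.engine = AsyncLLMEngine.from_engine_args(args)

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        stream = self.engine.generate(prompt, SamplingParams(max_tokens=64), random_uuid())
        final = None
        async for output in stream:   # vLLM streams partial results; keep the last one
            final = output
        return final.outputs[0].text


app = LlamaService.bind()
# serve.run(app)  # deploy alongside the application's other Ray Serve services
```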
The Hugging Face LLM Inference container with Text Generation support, deployed on Amazon SageMaker, may assist servers in handling a larger number of requests, thereby preventing servers from crashing under consistent load. Companies may also consider deploying applications via Ray Clusters onto cloud platforms such as AWS and GCP, and managing the cluster via Kubernetes.
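The sketch below outlines one way to deploy the Hugging Face LLM Inference container on SageMaker using the sagemaker Python SDK; the container version, instance type and model id are assumptions to adjust for your environment.

```python
# Sketch: deploy the Hugging Face LLM (TGI) container on Amazon SageMaker.
# Container version, instance type and model id are illustrative assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes running inside SageMaker; otherwise pass an IAM role ARN
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-2-7b-hf",  # model to serve
        "SM_NUM_GPUS": "1",                          # GPUs per replica
    },
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)
print(predictor.predict({"inputs": "Hello", "parameters": {"max_new_tokens": 32}}))
```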
Cost
Most of our cost calculations were conducted for classification models, where we generated a small number of tokens. In general, the cost varies from $0.0002 to $0.0006 to process 1K tokens. The table below outlines our estimates for several LLMs deployed on an Amazon SageMaker endpoint, which can be rented for an additional $2.03 per hour.
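For intuition, the back-of-the-envelope sketch below shows how a per-1K-token cost follows from the hourly endpoint price and a sustained throughput figure; the throughput number is a made-up assumption, not one of our benchmark results.

```python
# Sketch: per-1K-token cost from an hourly endpoint price and assumed throughput.
hourly_rate_usd = 2.03          # SageMaker endpoint cost per hour (from the table above)
tokens_per_second = 1_500       # assumed sustained throughput of the endpoint

tokens_per_hour = tokens_per_second * 3600
cost_per_1k_tokens = hourly_rate_usd / tokens_per_hour * 1000
print(f"${cost_per_1k_tokens:.5f} per 1K tokens")  # ~$0.00038 under these assumptions
```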
Put differently, according to our experiments it is roughly 63 times more expensive to classify 1M sentences using GPT-4 (as an example of a commercial LLM) than with a custom LLM run on any of the aforementioned servers. See the chart below.
Conclusion
This post concludes our series of investigations into open-source LLMs. We hope that our findings can help accelerate the timeline and effectiveness of LLM experiments. Code used for the fine-tuning and inference benchmarks referenced herein can be found in Georgian’s LLM Finetuning Hub repo.
Key takeaways related to fine-tuning include:
- Fine-tuning with LoRA produces strong results. Smaller open-source models might not be as eloquent and knowledgeable out-of-the-box as their commercial counterparts, but after fine-tuning for specific tasks, open-source models can adapt well with very limited training samples. With high-quality training data, the results may match GPT-3.5 and even GPT-4.
- Fine-tuned LLMs are more sample-efficient than BERT. Across our classification experiments, we observed that newer LLMs may adapt much faster than classic BERT models. While it took BERT 5,332 samples to reach 72% classification accuracy, it only took 1,066 samples for Llama-2-13B to reach the same level of performance.
- Hyperparameters matter. We experimented with fine-tuning hyperparameters and found that tuning all modules makes convergence faster. In addition, NEFTune can make LoRA tuning even more sample-efficient.
Key takeaways related to servers include:
- Consider vLLM. We learned from our experiments that vLLM seems to have a throughput and overall inference-speed advantage over other frameworks such as Hugging Face’s transformers generation and FasterTransformer.
- Consider Ray (or Triton) for building complex ML applications. Ray may be a good option for building complex ML applications for users who have a system consisting of many services and components of which an LLM is only one part. With its clean and elegant method of declaring different services and small apps within a system and its smart allocation of resources, Ray may save users a significant amount of time in developing and deploying the final product.
- Experiment with LLM deployment. Modern tools and platforms reduce the complexity involved in leveraging LLMs in applications and require relatively little investment.