Exploring delivery methods for LLMs to customers

Georgian
Georgian Impact Blog
11 min read · Feb 13, 2024

By: Mariia Ponomarenko & Kyryl Truskovskyi

[Image: an abstract landscape of neural networks with glowing connected nodes, visualizing the concept of "inference." Generated using DALL-E 3.]

Since the release of ChatGPT, companies have started to explore the potential of GenAI, especially Large Language Models (LLMs), to enhance their business processes, improve product quality, attract more clients and gain a competitive edge in the market.

We believe that most trained models are ultimately meant to be made accessible to other people. Typically, if you are using closed-source LLMs through existing APIs, deployment concerns are minimal. With fine-tuned open-source LLMs, however, the challenge becomes selecting the appropriate way of delivering the LLM to customers.

If you have been following our series of blog posts about LLM fine-tuning, you might have noticed our efforts to deploy these models and assess the cost of inference.

In this blog post, we continue to delve into the world of open source tools that facilitate the deployment of machine learning models, including LLMs. Our aim is to provide a potential estimate of the costs involved in deploying your own LLMs using specific tools by comparing different options and explaining the benchmarks that guided our analysis.

When deploying machine learning models, we believe it is important to consider the capabilities of the developed application. In the case of LLMs, we think these questions are important to ask:

  • How much will it cost to process 1,000 input tokens over a given period of time?
  • Will the users experience a long waiting time? How much time will it take to process one query and generate an answer?
  • How many requests per second can the server handle?
Source: LLM Finetuning Hub

In our opinion, factors such as (i) the choice of inference server, (ii) the hardware being used for inference (instance type), and (iii) model size matter. We’ll explore how in this blog.

Tools and resources

Inference services

If you have successfully fine-tuned your open-source model — what’s next? How do you make it accessible to the public?

The first option is to develop a server where you load the model, define the endpoint and process incoming requests. This process can be achieved using web frameworks such as FastAPI or Flask. However, in our view, there are significant limitations to this approach; these general-purpose web servers are not inherently designed for AI inference tasks. Features that boost inference such as GPU acceleration, dynamic batching or multi-model inference are not readily available by default.
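For illustration, a bare-bones version of this approach might look like the sketch below, which loads a Hugging Face model inside a FastAPI app (the model name and endpoint path are placeholders). Note that none of the inference-oriented features mentioned above are present:

    # Minimal, illustrative FastAPI server for a Hugging Face causal LM.
    # There is no dynamic batching, GPU scheduling or multi-model support:
    # each request is processed one at a time, which is exactly the
    # limitation discussed above.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder model name

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    app = FastAPI()

    class Prompt(BaseModel):
        text: str
        max_new_tokens: int = 16

    @app.post("/generate")
    def generate(prompt: Prompt):
        inputs = tokenizer(prompt.text, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=prompt.max_new_tokens)
        return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}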

Therefore, creating such a server on your own with all these features would require considerable human resources and time, particularly if you are at the beginning of your journey into the world of open-source LLMs and their deployment. For this reason, many companies have developed their own inference servers that support the features needed for hosting LLMs and are relatively easy to use. We will compare efficiency, ease of use, how each server handles large workloads and how broad the support is for various types of LLMs. Additionally, we will try to estimate the cost of deploying a text-generation model that was fine-tuned for a classification task.

During our experiments, we looked through various services that provide LLM inference capabilities. We looked at the following ones:

  • Text Generation Inference (TGI): A toolkit developed by Hugging Face for deploying and serving LLMs. TGI enables high-performance text generation for the most popular open-source LLMs, including Llama-2, Falcon, StarCoder, BLOOM, GPT-NeoX and others.
  • vLLM: A fast and user-friendly library for LLM inference and serving. It can be easily used with Hugging Face models and provides continuous batching of incoming requests, optimized CUDA kernels and high throughput.
  • Triton server with vLLM backend: An open source inference serving software that streamlines AI inferencing. With Triton, any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL and more, can be deployed.

The Triton backend for vLLM is designed to run supported models on a vLLM engine. When using this backend, all requests are placed on the vLLM AsyncEngine as soon as they are received. Inflight batching and PagedAttention are handled by the vLLM engine.

  • Ray with vLLM support: An open source framework that is helpful for scaling and serving Python and machine learning applications. Among its numerous libraries dedicated to handling different parts of the ML lifecycle, Ray also has a library solely dedicated to serving, called Ray Serve. The advantage of Ray Serve is that it is optimized for serving LLMs, and what may be especially beneficial is that it is well suited for model composition and serving many models.

In terms of vLLM support, Ray Serve also uses the vLLM AsyncEngine for processing incoming requests.

In our view, what sets TGI and vLLM apart is that, in addition to the features supported by most inference servers, they employ a unique technique called PagedAttention, which allows efficient management of the attention key and value memory. Specifically, TGI’s implementation of PagedAttention utilizes the custom CUDA kernels developed by vLLM.

Therefore, PagedAttention, in our view, is one of the strongest options to consider for fast LLM inference. Besides using pure vLLM for inference, additional infrastructure may be needed around the models in order to create a more robust system. Many inference servers, such as Triton and Ray, have added integrations with vLLM.
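As a point of reference, the sketch below shows what querying a model through vLLM's offline Python API looks like. The model name and sampling parameters are illustrative; in our benchmarks we interacted with the HTTP servers rather than this offline interface:

    # Illustrative use of vLLM's offline generation API.
    from vllm import LLM, SamplingParams

    # Placeholder model; a local path to a fine-tuned checkpoint also works.
    llm = LLM(model="meta-llama/Llama-2-7b-hf")

    # Short outputs, in the spirit of our classification task (~6 generated tokens).
    sampling_params = SamplingParams(temperature=0.0, max_tokens=6)

    prompts = ["Classify the sentiment of the following sentence: ..."]
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(output.outputs[0].text)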

Models

In this section, we will look at the following models: LLaMA-2 and RedPajama.

LLaMA-2

LLaMA-2 is a family of LLMs released by Meta AI. It is available in three sizes: 7, 13 and 70 billion parameters. While its architecture mirrors that of LLaMA-1, the foundational training for LLaMA-2 utilized 40% more data. Specifically, these LLaMA-2 base models were trained on a 2 trillion-token dataset and refined to exclude websites known for disclosing personal information. Additionally, Meta AI aimed to prioritize more reliable and trustworthy sources in the dataset.

RedPajama

RedPajama-INCITE combines Together.ai's RedPajama dataset and EleutherAI's Pythia model architecture to form an open source LLM. In our view, one interesting aspect of the RedPajama model family is that it includes a 3B-parameter model, which is unusual in our experience. Together.ai reasons that this model size may allow for wider adoption due to smaller hardware requirements and easier experimentation. Together.ai introduces and uses the RedPajama dataset, which is a 1.2T-token open-source replication of the LLaMA training dataset. That is, they follow the same steps in terms of pre-processing and filtering, use the same data sources and extract roughly the same number of tokens from each.

During this benchmark, we wanted to see how model size influences inference performance. We examined various versions of these models, including:

  • LLaMA-2–7B
  • LLaMA-2–13B
  • RedPajama-3B
  • RedPajama-7B

We selected models fine-tuned using the LoRA method, as previously described in our series of blog posts about LLM fine-tuning. Specifically, we benchmarked models fine-tuned for the classification task on this dataset.
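Because LoRA produces adapter weights on top of a base model, a common preparatory step before serving is to merge the adapter back into the base checkpoint so that an inference server can load it as a regular Hugging Face model. A minimal sketch using the peft library (model name and paths are placeholders) might look like this:

    # Sketch: merge a LoRA adapter into its base model so the result can be
    # served as a standard Hugging Face checkpoint. Paths are placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    BASE_MODEL = "meta-llama/Llama-2-7b-hf"
    ADAPTER_PATH = "path/to/lora-adapter"
    OUTPUT_DIR = "llama-2-7b-classification-merged"

    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    merged = PeftModel.from_pretrained(base, ADAPTER_PATH).merge_and_unload()

    merged.save_pretrained(OUTPUT_DIR)
    AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained(OUTPUT_DIR)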

Hardware

We selected AWS as our primary platform for running instances, opting for the g5.4xlarge instance, which comes with an Nvidia A10 GPU with 24GB of memory. However, due to a shortage of GPUs and the particular challenge of getting A100 GPUs on AWS, we also used Google Cloud. There, we successfully obtained instances equipped with the Nvidia A100 GPU (40GB of memory) for a more robust comparison.

Benchmark

Latency and Throughput

When deploying any machine learning model, we believe it is important to consider the number of users a server can accommodate. If the server isn’t sufficiently optimized, it may struggle to handle consistent batches of incoming requests over extended periods. Thus, in our tests, we sought to simulate “attacks” on the server, mimicking real-world production scenarios. By simulating “attacks,” we could evaluate how effectively each server handles substantial workloads.

For this purpose, we conducted our benchmark tests using a tool named Vegeta. Vegeta is a versatile HTTP load-testing tool designed to target HTTP services with a steady stream of requests. With Vegeta, we were able to set the test’s duration (how many seconds a test runs), rate (number of requests sent each second) and dynamically provide different inputs using a JSON file. During a test, Vegeta would attack a specific server with a predetermined number of requests every second for the set duration, subsequently providing comprehensive metrics. Specifically, we focused on:

  • Latency (90%): The time it takes to get a response for most (~90%) of the requests. It means that nine out of 10 responses will be faster than this time.
  • Throughput: The number of requests a server can process in one second.

One of the challenges was finding the optimal duration and rate at which we sent requests. In the optimal case, we tried to send a given number of requests per second for at least 10 minutes. When the server crashed under such a load, we reduced the duration and, if necessary, the rate at which we sent requests. We ran each test three times and averaged the latency and throughput values to make the benchmarking process fairer.
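To make that setup concrete, here is a rough sketch of how such a test can be scripted. It assumes the vegeta binary is installed and that a targets file pointing at the inference endpoint (with the JSON payloads) has already been prepared:

    # Sketch: drive a Vegeta load test from Python, varying rate and duration.
    # Assumes the vegeta binary is installed and a targets file describing the
    # endpoint and request bodies has already been prepared.
    import subprocess

    def run_attack(rate: int, duration: str, targets: str = "targets.txt") -> str:
        attack = subprocess.run(
            ["vegeta", "attack", f"-targets={targets}", f"-rate={rate}", f"-duration={duration}"],
            capture_output=True, check=True,
        )
        # Pipe the binary attack results into the report command.
        report = subprocess.run(
            ["vegeta", "report"], input=attack.stdout, capture_output=True, check=True
        )
        return report.stdout.decode()

    # Start with a 10-minute run and back off if the server cannot keep up.
    print(run_attack(rate=10, duration="10m"))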

Benchmark results

For each request, we picked a random sentence from the Hugging Face dataset, trimmed it to roughly 100 tokens and sent it to the server.

Text Generation Inference and vLLM

In general, TGI and vLLM implement a similar range of features, including the PagedAttention mechanism, so their performance is quite similar. It's noteworthy, though, that in our benchmark tests the TGI service handled twice as many requests as vLLM. For instance, on the same Nvidia A100 GPU, TGI enabled us to process 40 requests per second for the LLaMA-2-7B model, whereas with vLLM we could only manage 20 requests per second.

In the tables shown below, we tried different combinations of GPUs and servers for deployment. The throughput and RPS (requests per second) values are similar, meaning that the overall time it takes for a request to be processed appears to be relatively small.

Benchmark results for Text Generation Inference

Source: LLM Finetuning Hub

Benchmark results for vLLM

Source: LLM Finetuning Hub

Benchmark results for Ray with vLLM support

Source: LLM Finetuning Hub

Benchmark results for Triton Inference Server with vLLM backend

Source: LLM Finetuning Hub

The performance of the Triton Inference Server and Ray is comparable to plain TGI and vLLM, where the server was able to handle 10 requests per second for 10 minutes.

This performance leads us to the opinion that using an inference platform on its own is not enough to provide an ideal user experience. As noted during our previous benchmarks, we were not able to keep the server alive for more than 10 minutes while sending 10 to 60 requests per second, depending on the model size and hardware. The reason could be the lack of sufficient resource allocation, effective load balancing and an optimized runtime environment tailored for serving LLMs.

There can be different ways of building a more robust system. With Ray, for example, Ray Clusters may be used to run Ray applications across multiple nodes. Ray also provides native cluster deployment support on AWS, GCP and Kubernetes.
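As a rough illustration (not the exact setup from our benchmarks), a Ray Serve deployment can declare how many replicas to run and how many GPUs each replica needs, and Ray schedules those replicas across the nodes of a cluster:

    # Sketch: a Ray Serve deployment that scales an LLM across replicas.
    # Resource settings and the generation call are illustrative placeholders.
    from ray import serve
    from starlette.requests import Request

    @serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
    class LLMDeployment:
        def __init__(self):
            # Load the model / vLLM engine here (omitted for brevity).
            pass

        async def __call__(self, request: Request) -> dict:
            payload = await request.json()
            # generated = ... run inference on payload["prompt"] ...
            return {"echo": payload.get("prompt", "")}

    # Starts an HTTP endpoint served by Ray Serve; keep the process alive
    # (or use `serve run`) in a real deployment.
    serve.run(LLMDeployment.bind())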

In our experiments, we deployed some of the models to an Amazon SageMaker endpoint using the Hugging Face Deep Learning Container (DLC) for inference, which is powered by TGI. This way, we could use the optimizations that come with TGI while also improving performance with SageMaker's managed-service capabilities, such as autoscaling.

Hugging Face LLM Inference Container for Amazon SageMaker
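For reference, a condensed sketch of this deployment path with the sagemaker Python SDK is shown below. The model location, environment settings and instance type are illustrative and should be adapted to your own setup:

    # Sketch: deploy a model to a SageMaker endpoint using the Hugging Face
    # LLM inference container (powered by TGI). Paths and settings are placeholders.
    import sagemaker
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    role = sagemaker.get_execution_role()  # assumes this runs inside SageMaker; otherwise pass an IAM role ARN
    llm_image = get_huggingface_llm_image_uri("huggingface")  # TGI-powered DLC

    model = HuggingFaceModel(
        role=role,
        image_uri=llm_image,
        model_data="s3://your-bucket/llama-2-7b-classification/model.tar.gz",  # placeholder artifact
        env={
            "HF_MODEL_ID": "/opt/ml/model",  # or a Hugging Face Hub model id
            "SM_NUM_GPUS": "1",
        },
    )

    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.4xlarge",  # Nvidia A10, as in our benchmarks
    )

    print(predictor.predict({"inputs": "Classify: ...", "parameters": {"max_new_tokens": 6}}))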

We performed benchmarking using the popular load-testing tool Locust. We configured it to spawn 10 users at a time, who would all send requests continuously without any breaks. If the server and network were fast enough, we could get more than 10 requests per second, since each user might make more than one request in that time.
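The Locust user class for such a setup is only a few lines; a sketch (endpoint path and payload are illustrative) looks like this:

    # Sketch of a Locust user for this kind of load test: each user sends
    # requests back-to-back with no wait time between them.
    # Endpoint path and payload are illustrative.
    from locust import HttpUser, task, constant

    class EndpointUser(HttpUser):
        wait_time = constant(0)  # no pause between requests

        @task
        def classify(self):
            self.client.post(
                "/",  # path of the inference endpoint being tested
                json={"inputs": "Classify: ...", "parameters": {"max_new_tokens": 6}},
            )

    # Run headless for an hour with 10 users, e.g.:
    #   locust -f locustfile.py --headless -u 10 -r 10 --run-time 1h --host http://<endpoint>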

As a result, we were able to successfully send more than 10 requests per second to the server for one hour for LLaMA-2-7B on the AWS instance powered by the Nvidia A10 GPU.

Source: LLM Finetuning Hub

Cost

Cost calculation

In our previous blog posts, we tried to calculate the cost of inference based on the peak RPS, but this time we decided to adjust our formula and link it to a throughput value (in this context, the number of processed tokens per second).

If you look at the explanation of the formulas written below, you can see that the throughput value depends on the total number of requests the users sent (for the classification task, the model generated approximately 6 tokens for an input of roughly 100 tokens):

Number of tokens processed = input tokens + output tokens = 100 + 6 = 106

Throughput (tokens / s) = Total number of requests * Number of tokens processed / Duration

Time to process 1K tokens (min) = 1K / Throughput (tokens / s) / 60

Cost to process 1K tokens = Time to process 1K tokens (min) / 60 * Instance cost per hour
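To make the arithmetic concrete, here is a small worked example of the formulas above with illustrative numbers (the request count and instance price are placeholders rather than our measured values):

    # Worked example of the cost formulas above, with placeholder numbers.
    total_requests = 36_000          # e.g. 10 requests/s sustained for one hour
    tokens_per_request = 100 + 6     # ~100 input tokens + ~6 generated tokens
    duration_s = 3_600               # one hour
    instance_cost_per_hour = 1.624   # illustrative on-demand price, USD/hour

    throughput = total_requests * tokens_per_request / duration_s      # tokens / s
    minutes_per_1k_tokens = 1_000 / throughput / 60                    # minutes
    cost_per_1k_tokens = minutes_per_1k_tokens / 60 * instance_cost_per_hour

    print(f"throughput: {throughput:.1f} tokens/s")
    print(f"cost per 1K tokens: ${cost_per_1k_tokens:.6f}")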

The process of identifying the specific number of requests mirrored the approach we used during the benchmarking. In this setup, we had 10 users actively sending requests for an hour, with Locust adjusting the frequency of these requests based on the server’s response capabilities.

By using such calculations, we aimed to imitate the way the inference cost was calculated in this blog post written by the Hugging Face team.

Per the table below, the total number of requests for RedPajama 3B is larger than for LLaMA-2-7B and RedPajama 7B, which means that deploying a model with fewer parameters may be cheaper ($0.0003 for RedPajama 3B compared to $0.0006 for RedPajama 7B).

Source: LLM Finetuning Hub

We also compared the cost of classifying one million sentences (assuming each contains approximately 100 tokens) using our fine-tuned models with those of GPT-3.5 Turbo and GPT-4.

As shown in the bar chart below, the price is more than three times lower than GPT-3.5 Turbo and 63 times lower than GPT-4.

Source: LLM Finetuning Hub

Conclusion

In conclusion, here are our takeaways, based on our experiments.

  • Both TGI and vLLM demonstrate similar performance levels, with 90th-percentile latency typically below one second in most scenarios.
  • We believe that the Nvidia A100 is better than the Nvidia A10, as it can serve twice as many concurrent requests.
  • It appears that the smaller the LLM, the more requests it can serve. RedPajama 3B achieves 30 requests/second on TGI with an Nvidia A10 GPU, compared to RedPajama 7B's 20 requests/second with the same configuration.
  • A standalone inference platform, such as TGI or vLLM, appears to fall short in managing heavy user traffic, as seen in our earlier load tests. We were unable to keep processing incoming requests beyond 10 minutes under moderate loads (likely due to insufficient resource allocation and load management).

In the end, we deployed some of our models to an Amazon SageMaker endpoint with the Hugging Face DLC and observed stable performance and the server's ability to handle a large number of requests.

Regarding cost, our calculations demonstrate that using fine-tuned models to classify one million sentences, each containing roughly 100 tokens, is more cost-effective. Specifically, it costs over three times less than using GPT-3.5 Turbo and is 63 times more affordable than GPT-4 for the same task.

The codebase for the serving options is available in our LLM Finetuning Hub.
