Cost Analysis of Deploying LLMs: A Comparative Study of Cloud-Managed, Self-Hosted, and 3rd-Party LLMs

Hugo Debes
Artefact Engineering and Data Science
7 min read · Oct 30, 2023

TL;DR

In this guide, we estimate the cost of deploying an LLM. We compare 3 solutions (3rd party, cloud-managed, and custom) for a conversational agent. We show that for a moderate number of conversations (fewer than ≈8k conversations per day), the solution managed by cloud providers is the cheapest while requiring less infrastructure monitoring.

Introduction

For the past year, Large Language Models (LLMs) have been at the center of attention due to their promise to transform our way of working and their business opportunities. At Artefact, generative AI has been set as the top priority of the company to help our clients leverage this technology. We have worked on prototyping and putting in production several use cases such as conversational agents, summarization, or extraction tasks. The results have been promising, bringing value to each company. However, the cost of using such an asset remained unclear for several actors, which led us to write the following cost analysis. In this article, we compare the cost of 3 LLM deployment solutions (3rd party, cloud-managed, and custom) for a conversational agent.

Note: prices are set by each provider and are subject to change. We retrieved them in October 2023. All prices are in US dollars.

Project Scoping

As with any Data Science project, we need to clarify and understand the end-user’s needs. This also applies to generative AI projects and is sometimes overlooked. Based on this phase, changes will be made to the form and content of the developed use case. Regarding the cost, there are 2 things to take into account:

  • What is the frequency of usage of the tool by the end-users?
  • What is the unit of measure to run your cost analysis?

In this article, the unit of our analysis will be a conversation. One could apply the same logic to an extraction agent where the unit of measure is an unstructured document.

A conversation is here defined by:

  • 5 inputs by the end-user per Conversation
  • An average of 150 words typed by the user in Input
  • An average of 300 words generated by the model in Output

Deployment Solutions

3rd Party Solution

This method relies on an API call to access, for example, OpenAI's models (GPT-3.5 & GPT-4). When developing your application, you transmit the prompt to the OpenAI API and await the model's response. You can use OpenAI's Python library directly or, if you need a higher level of abstraction, libraries like LangChain.
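As an illustration, here is a minimal sketch of such a call with the openai Python library (pre-1.0 interface, as available in October 2023); the model name, prompt, and API key handling are placeholders:

```python
# Minimal sketch: send a prompt to the OpenAI API (openai library < 1.0).
# The API key, model name, and messages are placeholders.
import openai

openai.api_key = "YOUR_API_KEY"  # or set the OPENAI_API_KEY environment variable

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful conversational agent."},
        {"role": "user", "content": "What are your store's opening hours?"},
    ],
)

answer = response["choices"][0]["message"]["content"]
usage = response["usage"]  # prompt_tokens, completion_tokens, total_tokens
print(answer, usage)
```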

You are billed based on the number of tokens. A common rule of thumb is that 1 token roughly equals 3/4 of a word. You can obtain the exact number of tokens using the Tiktoken Python library. The pricing has 2 components:

  • The number of tokens sent as Input.
  • The number of tokens generated in Output by the model and sent back to our application.
OpenAI Pricing

Note that GPT-4 is 20 times more expensive than GPT-3.5. Also, the output price is twice the input price for GPT-4, while it is only 33% higher for GPT-3.5.
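To estimate what this means per conversation, one can count tokens exactly with Tiktoken or apply the word-based rule of thumb. Below is a small sketch of both; the per-1,000-token prices are left as placeholders to fill in from OpenAI's pricing page:

```python
# Sketch: estimate the cost of one conversation as defined in the scoping section.
# The per-1,000-token prices are placeholders (see OpenAI's pricing page).
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Exact token count for a given model, using Tiktoken."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def conversation_cost(
    n_exchanges: int = 5,              # user inputs per conversation
    input_words: int = 150,            # average words per user input
    output_words: int = 300,           # average words generated per answer
    input_price_per_1k: float = 0.0,   # $ per 1,000 input tokens (placeholder)
    output_price_per_1k: float = 0.0,  # $ per 1,000 output tokens (placeholder)
) -> float:
    """Approximate cost of one conversation using the 1 token ~= 3/4 word rule."""
    input_tokens = n_exchanges * input_words / 0.75
    output_tokens = n_exchanges * output_words / 0.75
    return (input_tokens / 1_000) * input_price_per_1k + (output_tokens / 1_000) * output_price_per_1k
```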

Managed LLM by a Cloud Provider

The second option is to use the LLMs available on the cloud provider that hosts your infrastructure. For this example, we will focus on Google Cloud Platform (GCP). OpenAI's models are not accessible on GCP, so we will compare them with Palm-2. Their performance is not equivalent in all use cases, but both options are well suited for a conversational agent. If you wish to use GPT models, you will need to create an Azure project and deploy an instance; the pricing is similar to direct API use. The main difference with the first solution is that you get the same guarantees and benefits as any other ML model previously used on your cloud project. Also, you do not introduce a 3rd-party actor into your infrastructure. To use Palm-2, you can use GCP's Python SDK or, once again, LangChain modules.
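A minimal sketch of a Palm-2 (chat-bison) call through the Vertex AI Python SDK might look like the following; the project ID, region, and prompts are placeholders:

```python
# Sketch: call Palm-2 (chat-bison) through the Vertex AI SDK.
# Project ID, region, and prompts are placeholders.
import vertexai
from vertexai.language_models import ChatModel

vertexai.init(project="your-gcp-project", location="europe-west4")

chat_model = ChatModel.from_pretrained("chat-bison@001")
chat = chat_model.start_chat(context="You are a helpful conversational agent.")
response = chat.send_message("What are your store's opening hours?")
print(response.text)
```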

Once more, the pricing of this method is done by splitting Input and Output.

GCP Pricing

Notice that the unit is now 1,000 characters rather than 1,000 tokens. Using 1 token ≈ 3/4 word ≈ 4.5 characters, we can convert the prices to a common basis:

Recap Table

Palm-2 is within the same order of magnitude as GPT-3.5 and far cheaper than GPT-4.
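For reference, the conversion from GCP's per-character pricing to a per-token equivalent is a simple multiplication; a small helper, with the price left as a placeholder:

```python
# Sketch: convert a per-1,000-character price (GCP) into an approximate
# per-1,000-token price, using 1 token ~= 3/4 word ~= 4.5 characters.
CHARS_PER_TOKEN = 4.5

def per_1k_tokens(price_per_1k_chars: float) -> float:
    return price_per_1k_chars * CHARS_PER_TOKEN
```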

Custom LLM hosted on a Cloud Provider

Lastly, you can use open-source model weights such as Llama-2 or Mistral-7B and run the inference yourself. You get total control over the model, but you have to deploy it manually and maintain it. This method is far more complex than the 2 presented previously. The idea is to expose an endpoint that your application calls when needed. We will focus on a Vertex AI endpoint on GCP. You can have a look at the notebook published by Google to deploy Llama-2.

Here, billing is not based on the number of tokens or characters. It rests on the time you reserve specific hardware to run inference on the model. The more weights the model has, the more powerful the hardware required and thus the higher the cost. Since an LLM contains billions of parameters, you will need at least a CPU and a GPU. For Llama-2-7b, we used an n1-standard-16 machine with a V100 accelerator deployed 11 hours per day.
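The deployment itself can be scripted with the Vertex AI SDK. The sketch below outlines the idea; the serving container image and model artefact location are placeholders for which Google's notebook provides ready-to-use values:

```python
# Sketch: upload and deploy Llama-2-7b on a Vertex AI endpoint.
# The container image and artifact URI are placeholders (see Google's notebook).
from google.cloud import aiplatform

aiplatform.init(project="your-gcp-project", location="europe-west4")

model = aiplatform.Model.upload(
    display_name="llama-2-7b",
    serving_container_image_uri="<serving-container-image>",  # placeholder
    artifact_uri="gs://your-bucket/llama-2-7b/",               # placeholder
)

endpoint = model.deploy(
    machine_type="n1-standard-16",
    accelerator_type="NVIDIA_TESLA_V100",
    accelerator_count=1,
)
```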

Running models with billions of parameters efficiently is an active field of research, and a lot of progress is being made on how to use hardware efficiently for inference. One promising approach, vLLM, makes better use of GPU memory by efficiently managing attention keys and values. Reducing the memory dedicated to them lowers the hardware requirements and thus the cost, while preserving the same quality level.
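As an illustration of how this looks in practice, here is a minimal vLLM sketch (assuming access to the Llama-2 weights on Hugging Face); the sampling parameters are illustrative:

```python
# Sketch: offline inference with vLLM, whose PagedAttention manages the
# attention key-value cache to make better use of GPU memory.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # requires access to the weights
params = SamplingParams(temperature=0.7, max_tokens=400)

outputs = llm.generate(["What are your store's opening hours?"], params)
print(outputs[0].outputs[0].text)
```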

Note: on a Vertex AI endpoint, since you are billed for deployment time rather than usage, undeploy the model as soon as you no longer need it.
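Continuing the deployment sketch above, this takes one call:

```python
# Sketch: stop the hourly billing when the endpoint is not in use.
endpoint.undeploy_all()  # remove the deployed model(s) from the endpoint
# endpoint.delete()      # optionally delete the endpoint itself
```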

For GCP, the cost is computed as follows and depends on the deployment region:

  • Retrieve the Machine specs (number of vCPUs/Memory(GB)) here
  • Retrieve the SKU pricing of vCPU/hour and GB/hour (vCPU cost, RAM cost) here
  • Retrieve the cost per hour of your accelerator (GPU cost) here
  • Apply the following formula: hourly cost = (number of vCPUs × vCPU cost per hour) + (memory in GB × RAM cost per GB-hour) + GPU cost per hour

For example, n1-standard-16 has 16 vCPUs and 60 GB of memory. In europe-west4, the cost is $0.0421268 per vCPU-hour and $0.0056373 per GB-hour. In addition, the V100 costs $2.9325 per hour. This leads to 16 × $0.0421268 + 60 × $0.0056373 + $2.9325 ≈ $3.94 per hour.
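Put together, and assuming 11 hours of deployment per day as described above, the calculation looks like this:

```python
# Sketch: hourly and daily cost of an n1-standard-16 + V100 endpoint in
# europe-west4, deployed 11 hours per day (prices as retrieved in October 2023).
N_VCPUS, MEMORY_GB = 16, 60
VCPU_COST, RAM_COST, GPU_COST = 0.0421268, 0.0056373, 2.9325  # $ per hour

hourly_cost = N_VCPUS * VCPU_COST + MEMORY_GB * RAM_COST + GPU_COST
daily_cost = 11 * hourly_cost

print(f"{hourly_cost:.4f} $/hour -> {daily_cost:.2f} $/day")  # ~3.94 $/hour, ~43.4 $/day
```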

Cost Analysis and Comparison

Now that we know how to compute the price of each solution, we can compare them.

Comparison as a function of time

Over time, the daily cost of the custom endpoint is constant, as the hardware price is the same every day. For the solutions billed per call, the cost evolves with the number of conversations per day; here, the number of conversations is drawn randomly between 500 and 2,000 each day. GPT-4 is by far the most expensive solution, followed by Llama-2-7b deployed on a Vertex AI endpoint. Lastly, the managed Palm-2 and GPT-3.5 are much more affordable at less than $10/day.

Cost per day as a function of time

Comparison as a function of the number of conversations

Next, we want to determine how many daily conversations are required for the in-house custom model (Llama-2) to become cheaper than the managed models (GPT-4, GPT-3.5, and Palm-2).

In other words, what is the threshold of conversations per day beyond which it becomes beneficial to self-host the model? Our results showed:

Threshold of conversations per day above which the cost of infrastructure (self-hosting) is lower than the cost of API calls (cloud-managed/3rd party)

You need more than 8,000 conversations per day before the cost of hosting a relatively small model on your own infrastructure drops below that of the cloud providers' managed solutions. Furthermore, this is a lower bound, as you would probably need more hardware to run parallel inference when demand surges.
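The threshold itself comes from a simple break-even computation: divide the fixed daily infrastructure cost by the API cost of one conversation. A sketch, with the API cost per conversation left as a placeholder to derive from the pricing above:

```python
# Sketch: break-even number of daily conversations between self-hosting and a
# pay-per-call API. `daily_infra_cost` is the endpoint cost computed earlier;
# `api_cost_per_conversation` is a placeholder derived from the API pricing.
def breakeven_conversations(daily_infra_cost: float, api_cost_per_conversation: float) -> float:
    """Daily conversations above which self-hosting becomes cheaper than the API."""
    return daily_infra_cost / api_cost_per_conversation

# Illustrative only: with ~$43/day of infrastructure and roughly half a cent per
# conversation on the API side, the break-even lands in the thousands of
# conversations per day.
# breakeven_conversations(43.4, 0.005)  # ~8680
```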

Conclusion

To conclude, we have presented 3 different methods of deploying LLMs in an application, alongside their costs, and compared them as a function of time and of the number of conversations. We found that cloud providers like GCP offer a suitable solution for mid-sized applications and strike a good balance between cost and ease of deployment.

Lastly, cost is only one of many dimensions to take into account when deploying an LLM. Further work could focus on how to choose when it comes to quality, scalability, or confidentiality.

Source:

3rd party/Managed solution:

Custom Section:

  • Speed up inference by efficiently allocating memory: vLLM
