The Price of Intelligence — Understanding the Cost of Using LLMs
From the desk of Apoorva Joshi, Senior AI Developer Advocate @MongoDB
As an AI Developer Advocate at MongoDB, I spend a lot of my time experimenting with state-of-the-art (SOTA) models and reading about the latest developments in AI. So this month, I’m piloting a monthly series where I share learnings from my experiments and/or my take on things I find particularly exciting in AI.
My theme for this month has to be the cost of using large language models (LLMs). I spent a good portion of my time researching embedding models, and given that models as large as 14 GB and even 93 GB currently sit at the top of the Massive Text Embedding Benchmark (MTEB) leaderboard, I ended up trying some of these myself. Soon after, Cohere came out with its 35 billion parameter Command-R model, optimized for retrieval-augmented generation (RAG) and tool use, and xAI released Grok-1, its 314 billion parameter Mixture-of-Experts model. So it only makes sense to talk about what it takes to run these models if you are not Mark Zuckerberg and cannot afford to buy 350k H100 GPUs!
In this article, we will cover the following:
- The cost of running open-source LLMs
- The cost of using proprietary LLMs
- Reducing costs associated with LLMs
The cost of running open-source LLMs
You hear a lot of reasons why people want to use open-source LLMs over closed-source. Some of the common ones are data privacy and customization, but there’s also a prevalent misconception that open-source models are “free.” While you are not paying a third party for token usage, you are responsible for the infrastructure costs when it comes to open-source models, and these, depending on the model you choose, are not trivial.
A common use case for LLMs in AI applications nowadays is to generate embeddings for RAG. So let’s say you wanted to use an open-source LLM instead of a proprietary model to generate embeddings for your RAG application.
Here are the top five models currently on the MTEB leaderboard:
All except one (voyage-lite-02-instruct) of the top five models on the leaderboard are open-source models, but look at the model sizes — one of the GritLM models is a whopping 93 GB! The model size tells you how much VRAM (memory available on a GPU) you would need to load the model weights. For shorter text inputs (less than 1024 tokens), the memory requirement for inference is dominated by the memory requirement to load the weights.
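As a quick rule of thumb, the memory needed to hold the weights is simply the parameter count multiplied by the bytes per parameter at the model's precision. Here's a minimal sketch of that arithmetic, assuming float16 weights and approximate parameter counts, which lines up with the 14 GB and 93 GB figures above:

```python
# Back-of-the-envelope VRAM estimate: parameter count x bytes per parameter.
# Parameter counts are approximate; float16 (2 bytes per parameter) is assumed.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def vram_gb(num_params: float, precision: str = "float16") -> float:
    """Approximate GB of GPU memory needed just to hold the model weights."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

print(f"7B-parameter model:      ~{vram_gb(7.2e9):.0f} GB")   # ~14 GB
print(f"8x7B (~47B params) MoE:  ~{vram_gb(46.7e9):.0f} GB")  # ~93 GB
```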
If you are generating embeddings for RAG, you are likely working with short, pre-chunked text, which means the model size roughly represents how much VRAM is required to use the model for inference. Running LLMs on a CPU, or split across CPU and GPU, is typically too slow to be practical, so let's also assume that you want to run inference entirely on GPUs.
To give you an idea of GPU costs, here’s what it costs to run NVIDIA Tesla V100 GPUs on Google Cloud Platform (GCP):
Given the above numbers, it would cost $59.52/day to run the 14 GB models and $476.16/day to run the 93 GB GritLM-8x7B model at their default precisions.
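Here's a quick sketch of how those daily figures fall out. It assumes the on-demand rate implied by the numbers above (roughly $2.48 per V100 GPU-hour, with 16 GB of VRAM per V100) and that GCP attaches V100s in groups of 1, 2, 4, or 8, so you pay for the smallest group that fits the model weights:

```python
import math

# Assumptions: ~$2.48/hour per NVIDIA Tesla V100 (16 GB VRAM) on GCP on-demand,
# and V100s attached in groups of 1, 2, 4, or 8.
V100_VRAM_GB = 16
V100_HOURLY_USD = 2.48
GPU_GROUPS = [1, 2, 4, 8]

def daily_gpu_cost(model_size_gb: float) -> float:
    """Daily on-demand cost of the smallest V100 group that fits the model weights."""
    gpus_needed = math.ceil(model_size_gb / V100_VRAM_GB)
    group = next(g for g in GPU_GROUPS if g >= gpus_needed)
    return group * V100_HOURLY_USD * 24

print(f"14 GB model: ${daily_gpu_cost(14):.2f}/day")   # $59.52/day
print(f"93 GB model: ${daily_gpu_cost(93):.2f}/day")   # $476.16/day
```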
Just because a model is at the top of the MTEB leaderboard doesn’t mean it is the best for your use case and data, especially if it comes at a high cost for incremental gains. I talk more about this in my tutorial on choosing embedding models — always evaluate a handful of models on your data with an eye toward costs.
The cost of using proprietary LLMs
While there is a mix of open-source and proprietary LLMs in the “best text embedding model” category, proprietary LLMs dominate in the chat completion arena. So it is no surprise that most of us are willing to pay the API costs to use these models in our chat applications. The numbers might not sound like much while prototyping, but they can add up when using these models on a larger scale.
Again, let’s consider RAG since it is a popular use case at the moment. In RAG, documents are typically chunked before adding them to a knowledge base. Given a user query, relevant chunks are retrieved from the knowledge base and passed along with the query and prompt as context for the LLM to generate an answer to the question. There is usually an embedding step that enables semantic retrieval from the knowledge base. Let’s assume you used an open-source model with manageable costs for this, and let’s calculate the cost of using a proprietary chat completion model for the rest.
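To see where the tokens (and therefore the cost) come from, here's a bare-bones sketch of that retrieval-and-generation step. The `embed`, `vector_search`, and `chat_completion` functions are placeholders for whichever embedding model, vector store, and chat API you use:

```python
def answer_with_rag(query: str, k: int = 5) -> str:
    """Minimal RAG flow: embed the query, retrieve chunks, ask the chat model."""
    query_embedding = embed(query)                 # placeholder: open-source embedding model
    chunks = vector_search(query_embedding, k=k)   # placeholder: top-k chunks from the knowledge base

    # Everything in this prompt (instructions + retrieved chunks + query) counts as
    # input tokens for the proprietary chat model; its answer counts as output tokens.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return chat_completion(prompt)                 # placeholder: proprietary chat completion call
```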
Our in-house AI and Search Specialist, Pat Wendorf, conducted a thorough analysis of how much it costs to use some of the most popular proprietary models for chat completion in RAG use cases. The study makes the following assumptions:
Most proprietary LLMs are priced based on the number of input and output tokens used for completion. Assuming the above numbers, the cost (in US dollars) of using some of the latest SOTA proprietary models is as follows:
As you can see above, using closed-source models can be expensive, depending on the throughput of your system. Also, note that the above numbers are for a single LLM chat completion component. However, a system could use LLM components at multiple stages, such as prompt re-writing and compression, evaluation, function calling, etc. In that case, similar calculations would need to be done for each component to arrive at the total cost of the system.
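To sketch how that per-component math works: cost per call is input tokens times the input rate plus output tokens times the output rate, scaled by your call volume. The per-token prices below are purely illustrative placeholders, not figures from the analysis above; check your provider's pricing page for current rates:

```python
# Illustrative only: the per-token prices below are hypothetical placeholders,
# not quotes for any specific provider or model.
PRICE_PER_1K_INPUT_TOKENS = 0.01    # USD, hypothetical
PRICE_PER_1K_OUTPUT_TOKENS = 0.03   # USD, hypothetical

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single chat completion call, in USD."""
    return (
        input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
        + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    )

# Example: 2,000 input tokens (prompt + retrieved chunks + query), 500 output tokens,
# at 10,000 calls per day.
per_call = cost_per_call(2_000, 500)
print(f"${per_call:.3f} per call, ${per_call * 10_000:,.2f} per day")
```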
Reducing costs associated with LLMs
Strategies to reduce costs look different for open-source vs proprietary LLMs. With open-source models, if the performance of the models being compared is not significantly different, go for the smallest one. If, for some reason, you need to use a large model, you might try reducing its size to save costs. Knowledge distillation is a technique for transferring knowledge from a larger (teacher) model to a smaller (student) model that is trained to mimic it while retaining most of its performance. You can also apply post-training quantization techniques to create lower-precision (float8, int8, etc.) representations of the model weights and/or activations, thus reducing the model size. Both of these techniques optimize for inference while retaining as much of the base model's accuracy as possible.
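As an example, here's a minimal sketch of post-training quantization using the Hugging Face transformers and bitsandbytes libraries to load an open-source model's weights in 8-bit, roughly halving the VRAM footprint compared to float16. GritLM-7B is used purely as an example; swap in whichever model you're evaluating:

```python
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

model_id = "GritLM/GritLM-7B"  # example model; substitute the one you're evaluating

# Load the weights in 8-bit instead of the default float16,
# roughly halving the VRAM needed to hold the model.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",  # place layers on the available GPU(s)
)
```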
With proprietary models, costs can be reduced by either making fewer calls to the LLM or reducing the number of tokens passed as input to the LLM. One way to reduce the number of LLM calls is to use semantic caching on user queries so that a cached response is returned for semantically similar queries. Prompt compression is gaining popularity as a technique to reduce the number of LLM input tokens. It uses a small, well-trained language model to identify and remove non-essential tokens from prompts, achieving, in some cases, up to 20x compression with minimal performance loss.
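Here's a minimal sketch of semantic caching, assuming a hypothetical `embed` function for query embeddings and `call_llm` for the actual chat completion: incoming queries are compared against cached ones by cosine similarity, and the LLM is only called on a cache miss:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # tune for your data; too low risks returning wrong answers
cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached response)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_completion(query: str) -> str:
    """Return a cached response for semantically similar queries; otherwise call the LLM."""
    query_embedding = embed(query)  # placeholder: your embedding model
    for cached_embedding, cached_response in cache:
        if cosine_similarity(query_embedding, cached_embedding) >= SIMILARITY_THRESHOLD:
            return cached_response  # cache hit: no LLM call, no token cost
    response = call_llm(query)      # placeholder: your chat completion call
    cache.append((query_embedding, response))
    return response
```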
Conclusion
In this article, we looked into costs associated with using open-source as well as proprietary LLMs. Contrary to common misconception, there can be significant infrastructure costs associated with running open-source LLMs. In the case of proprietary LLMs, costs are based on input and output token usage. Knowledge distillation and post-training quantization are techniques to optimize open-source models, while semantic caching and prompt compression can help reduce proprietary model costs.
To read more about my experiments this month, see my tutorial on how to choose the right embedding model. I would also love to hear from you about what you are building, so come join us in our Generative AI community forums!
Finally, if you found value in this article and would like to hear more from me, stay tuned for the next edition of this series.
See you next month!