Self-hosted LLMs: Are they worth it?

Rucy
Pipedrive R&D Blog
Mar 12, 2024

As you ride the AI wave by building functionality on top of Large Language Models (LLMs) using the likes of ChatGPT and Claude, it’s paramount to have a thorough understanding of how much they cost and how fast they are, as their performance characteristics are drastically different from usual enterprise software.

Applying common assumptions and patterns could lead to catastrophic costs alongside a severe degradation of user experience.

LLMs are machine learning models that are embarrassingly slow to run without specialized hardware (GPUs): a simple completion request can take minutes to answer.

While developers are used to HTTP requests taking between 100 and 200 milliseconds, OpenAI's Chat Completion API (serviced by thousands of GPUs that each cost almost ten thousand dollars) takes whole seconds to answer anything nontrivial. It is so slow that geographical latency doesn't even matter.

This article provides an introduction to the intuition needed to build AI features, through a practical example of how much it costs to offer an LLM-based text summarization service to a non-trivial number of users.

We estimate the pricing and performance of using third-party vendors such as OpenAI versus self-hosting*, including the break-even point at which it makes sense to ditch the former.

*N.B. See https://arxiv.org/pdf/2311.16989.pdf for a thorough scientific comparison of how OpenAI compares to what's out there.

The best open-source LLMs already perform better than GPT-3.5-turbo on some standard benchmarks.

Text summarization in LLM terms

Summarization is a good first AI feature. Not only is the task straightforward enough to be described in a prompt, it is also easy to quantify how much time could be saved: if a summary is 25% the length of the original text, the user might spend up to four times less time reading it.

Let’s start with some numbers. Assume that your users need 1 million summarization requests per day and that the average word count of each request is 300.

LLM providers (OpenAI et al.) charge per input and output, i.e., the text that is "fed" in and the summary that comes out. The cost of self-hosting is different: it is nothing but the hourly cost of the instance serving the model, not counting personnel.

The standard unit of cost for providers is a token, roughly 1/2 to 3/4 of a word. Translated into tokens, the average 300-word request comes out to something close to 500.

We observed that asking LLMs to summarize a text produces output around 15–35% of the original size, so we can extrapolate that the mean summary is around 100 tokens long.

Another key driver of cost is the prompt. While the text to be summarized can vary, the prompt is fixed. As the prompt usually carries context, it can grow quite large. We arbitrarily assume it to be 400 tokens.

Our token count per request is then 1,000 tokens (900 input and 100 output).
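To make the arithmetic explicit, here is the same back-of-the-envelope math in Python; the tokens-per-word ratio is a rough assumption of ours, not an actual tokenizer:

```
# Back-of-the-envelope token math for one summarization request.
# Assumption: one word is roughly 5/3 tokens (a token being ~0.6 words).
WORDS_PER_REQUEST = 300
TOKENS_PER_WORD = 5 / 3  # rough heuristic, not an actual tokenizer

text_tokens = round(WORDS_PER_REQUEST * TOKENS_PER_WORD)  # ~500
prompt_tokens = 400  # fixed prompt, assumed size
output_tokens = 100  # observed ~20% of the input text

input_tokens = text_tokens + prompt_tokens  # ~900
total_tokens = input_tokens + output_tokens  # ~1,000
print(input_tokens, output_tokens, total_tokens)
```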

How much would it cost to use OpenAI?

OpenAI has an extensive model catalog. From our experiments, GPT-3.5-turbo was usually enough for any task.

At the time of writing, GPT-3.5-turbo costs 0.0005 dollars per 1K input tokens and 0.0015 dollars per 1K output tokens. Given the established token count per request, the cost per summarization is (0.0005 * 900/1000) + (0.0015 * 100/1000) = 0.0006 dollars.

Assuming 1 million requests per day at 1,000 tokens each, it would cost you 600 dollars a day, or 25 dollars per hour. This is 219,000 dollars per year for a single feature. Given that this cost is demand-driven, for small-scale experiments it is OK to go this route.
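For reference, the same cost calculation in code, using the list prices quoted above:

```
# GPT-3.5-turbo list prices at the time of writing, in dollars per token.
PRICE_INPUT = 0.0005 / 1000
PRICE_OUTPUT = 0.0015 / 1000

cost_per_request = 900 * PRICE_INPUT + 100 * PRICE_OUTPUT  # $0.0006
cost_per_day = 1_000_000 * cost_per_request                # $600
cost_per_year = 365 * cost_per_day                         # $219,000

print(f"${cost_per_request:.4f} per request")
print(f"${cost_per_day:,.0f} per day, ${cost_per_year:,.0f} per year")
```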

Even for large-scale deployments, going with OpenAI might be worth it, despite the scary price tag. Assuming that the 1 million text summarizations are spread over 50K users, the cost per customer per day would be 0.012 dollars. Not bad.

The biggest benefit of using OpenAI is that it provides near-constant request latency (how long each customer has to wait) no matter how many other customers are requesting a summary at the same time.

This is not always true if you self-host, as too many requests could overload the machine you use to serve the model.

Self-host pricing

How much money would you save, if any, were you to go without a third party? In this case, cost is not dictated by tokens but by expected service reliability. An accurate estimate requires careful consideration of latency: how long are we willing to let our customers wait?

Let’s arbitrarily set a performance SLA dictating that the p95 waiting time (the time under which 95% of requests complete) for a customer to get their summary shouldn’t be higher than 10 seconds. As a point of reference, summarization requests sent to GPT-3.5 had a p95 waiting time of 2 seconds.

We ran a quick test to see how feasible it was to uphold that SLA. The setup for our self-hosting experiment was a modest g5.xlarge instance on AWS, serving the most successful open-source model we had tested at the time, Zephyr. The cost of that instance is 1 dollar per hour.

We measured that the maximum pressure at which keeping p95 below 10 seconds was feasible was in the ballpark of 15–18 requests per second, or around 1.4 million requests per day. The average waiting time for a summary was 5 seconds, 2.5 times slower than OpenAI.
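To give a flavor of how such measurements can be taken, here is a minimal load-testing sketch; the endpoint URL and payload shape are hypothetical stand-ins for whatever server fronts the model:

```
# Minimal constant-pressure load test. URL and payload are hypothetical
# stand-ins for an OpenAI-compatible server in front of the model.
import asyncio
import time

import aiohttp
import numpy as np

URL = "http://localhost:8000/v1/completions"  # hypothetical endpoint
PAYLOAD = {"prompt": "Summarize the following text: ...", "max_tokens": 100}

async def timed_request(session: aiohttp.ClientSession) -> float:
    start = time.perf_counter()
    async with session.post(URL, json=PAYLOAD) as response:
        await response.read()
    return time.perf_counter() - start

async def main(rps: int = 25, duration_s: int = 60) -> None:
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(duration_s):
            # Fire a fixed batch every second to hold the pressure constant.
            tasks += [asyncio.create_task(timed_request(session))
                      for _ in range(rps)]
            await asyncio.sleep(1)
        latencies = await asyncio.gather(*tasks)
    for p in (50, 90, 95):
        print(f"p{p}: {np.percentile(latencies, p):.1f}s")

asyncio.run(main())
```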

Do you need more than a dollar an hour to handle every summarization request?

Given that we know the exact number of summarization requests that need to be answered each day, we can estimate the number of concurrent requests during the p95 busiest times.

If we assume that:

  1. The number of requests outside working hours is negligible
  2. Working hours (18 in a day) are defined as the union of all working hours from Los Angeles to Helsinki
  3. Requests are independent of each other
  4. The average number of requests per second is roughly 15 (1,000,000 / 18 / 60 / 60)

Then we can use the Poisson distribution (depicted in the graph below) to estimate that in the top 5% busiest seconds you could expect to see 22 or more requests.

Poisson distribution with mean 15. The green dashed lines represent the vertical and horizontal components of the p95 number of summarizations per second
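If you want to verify the quantile rather than read it off the graph, a quick sketch with scipy reproduces it:

```
# p95 of a Poisson arrival process with the mean rate derived above.
from scipy.stats import poisson

mean_rps = 1_000_000 / (18 * 60 * 60)  # ~15.4 requests per second
p95_rps = poisson.ppf(0.95, mean_rps)  # smallest k with CDF >= 0.95
print(p95_rps)  # -> 22.0
```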

Let’s go with a pessimistic scenario: if we face constant pressure of 25 requests per second for just one minute, how would our dollar-store instance fare?

Our “waiting distribution” for that test was:

  1. min: 7.3s
  2. p50: 11s
  3. p90: 19s
  4. p95: 36s

Bad! It seems that one instance is not enough to remain under the SLA.

What if we use two instances with traffic evenly split between them?

  1. min: 1.3s
  2. p50: 4s
  3. p90: 7.5s
  4. p95: 8.9s

Then it looks good! All while being significantly cheaper (17,520 dollars per year for two instances running around the clock) than OpenAI’s 219,000.

What is the break-even (B/E) point?

It’s unlikely that this feature will be released to all of your users at once, so at what point should we go self-hosted?

When using an LLM provider, pricing is based on demand. Having your own infrastructure has a limitation, however: there always has to be at least one machine running. The dollar-an-hour instance costs 24 dollars a day, and since we saw that we need two, the B/E point is 48 dollars’ worth of summarizations per day.

As previously established, the price of each summarization is 0.0006 dollars. Self-hosting beats OpenAI after 80,000 summarizations per day (48 / 0.0006).

The relationship of cost per day versus requests per day for both OpenAI and self-hosting

And how many customers’ worth of summarizations is that?

Every workday, 50K unique users send out all 1,000,000 requests, i.e., 20 requests per user, which translates to a B/E point of around 4,000 customers.
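Putting the break-even arithmetic in one place (all numbers carried over from the estimates above):

```
# Break-even point between OpenAI and two self-hosted instances.
INSTANCE_COST_PER_DAY = 24  # one g5.xlarge at ~$1/hour, running 24/7
SELF_HOST_PER_DAY = 2 * INSTANCE_COST_PER_DAY  # $48 for two instances

OPENAI_PER_REQUEST = 0.0006  # derived earlier

break_even_requests = SELF_HOST_PER_DAY / OPENAI_PER_REQUEST  # 80,000/day
requests_per_user = 1_000_000 / 50_000                        # 20 per day
break_even_users = break_even_requests / requests_per_user    # 4,000

print(f"{break_even_requests:,.0f} requests/day ~ {break_even_users:,.0f} users")
```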

Conclusion

Choosing whether to self-host LLMs or go with AIaaS providers such as OpenAI involves a thorough evaluation of costs, performance and reliability.

At the very least, you need to take into account the expected token count per request, request throughput, per-token cost and your accepted request SLAs.

While self-hosting offers large potential cost savings and competitive performance for modest-volume applications, it does demand an already-existing infrastructure team and a proper understanding of how much reasoning your task requires.

In both scenarios, with or without an infrastructure team and AI experts, we recommend starting with an AIaaS provider. If you see that your task does not need the complex reasoning of the likes of GPT-4, analyze the break-even point to ensure that the transition from AIaaS to self-hosting happens in a timely manner.
