Which startup will take the cash-burning championship?

The Misleading Costs of Private vs Public Inference

John Boero
TeraSky
Jul 30, 2024


As an advocate of open and private LLMs, I crunched some numbers on private vs public LLM costs. The results didn't make sense at first, and they were an eye-opener as to how much cash the AI startups are burning through. How long before subscription costs explode in the name of profitability? A colleague challenged me to compare the costs of private and public inference in the UK. I made a lot of loose assumptions that wouldn't normally be fair to use in an estimate, but in the end I was blown away by how little they mattered. The difference is astounding even with generous error margins. Feel free to check my math(s) in case I missed something.

2xGV100 32GB is still my favorite value inference rig at ~$3,000 in the UK.

Private Costs

I started by approximating the capital cost of hardware purchased to run inference on today's top open source models. A decent GPU rig that can handle a quantized large model can easily cost $20,000+. I run a pair of older V100s with NVLink, which cost about $3,000 pre-owned. I will calculate the costs for both extremes: the total cost of a rig like this could be $4,000–$30,000.

Second, I calculated energy costs for nonstop inference at around 30–40 tokens per second. This isn't something I recommend doing in the UK, where electricity rates are about 3x those in the USA and among the highest in the world. There may be something to be said for renewable sources and carbon-neutral targets, which I didn't take into consideration. My rig draws about 2x 230W in the GPUs and roughly 200–300W across the rest of the system. A metering device shows an average of about 700W under inference load.

$0.40/kWh * 0.7kW = $0.28/h

This is my estimate per hour. Running inference constantly works out to as much as $2,453 per year, depending on peak/off-peak rates. Add the capital investment to a year of energy, and the minimum I can expect to spend in one year, excluding depreciation and failures, comes to:

$6,453/year private
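As a sanity check, here's the same arithmetic as a small Python sketch. The $0.40/kWh rate, 0.7kW draw, and $4,000 low-end rig price are the article's own figures above; nothing else is assumed.

# Private inference: annual cost from the figures above
rate_per_kwh = 0.40                   # UK electricity, $/kWh
power_kw = 0.7                        # metered rig draw under load
hours_per_year = 24 * 365             # 8,760 hours of nonstop inference

energy_cost = rate_per_kwh * power_kw * hours_per_year   # ~$2,452.80
capital_cost = 4_000                  # low end of the $4,000-$30,000 range
total_private = capital_cost + energy_cost               # ~$6,452.80
print(f"Energy ${energy_cost:,.2f}/yr, total ${total_private:,.2f}/yr")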

During this year my private rig can expect to infer a maximum of around 1,261,440,000 tokens (input/output) at a cost of around $0.000005116/token. This cost doesn't include any of the advanced batch features or agents of a multimodal LLM such as GPT-4o, but we can estimate it to be approximately as useful in our business case as GPT-3.5 Turbo.
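That token figure corresponds to the top of the 30–40 tokens/second range sustained for a full year. A quick sketch, assuming the 40 tokens/second upper bound:

# Annual token throughput and per-token cost at 40 tokens/second
tokens_per_second = 40                # upper bound of the 30-40 tok/s range
seconds_per_year = 60 * 60 * 24 * 365          # 31,536,000
tokens_per_year = tokens_per_second * seconds_per_year   # 1,261,440,000

cost_per_token = 6_453 / tokens_per_year       # ~$0.000005116/token
print(f"{tokens_per_year:,} tokens/yr at ${cost_per_token:.9f}/token")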

To reproduce all of this in the cloud, the cheapest option I found was $14,460/year using Dutch spot instances on GCP. Reserved instances may reduce this, given we're planning to calculate costs for a year, but it looks like an on-prem solution will always come in much more affordable. Still, cloud autoscaling provides a big bonus if you scale this out to a larger group.

Public Costs

GPT-3.5 Turbo is less popular now that GPT-4+ models are out, and its pricing works a bit differently from tokens inferred privately. Currently GPT-3.5 Turbo costs $3 per million input tokens and $6 per million output tokens. This means your prompt input, including any RAG sources, is ingested at half the cost of the prediction tokens coming out of the LLM. Private LLMs won't differentiate input/output costs here, as processing a token takes much the same energy either way. Private llama.cpp caching prevents a re-parse of previously ingested prompts during chat. I'm still unclear whether this is the case in the closed GPT-3.5 model, but a thread seems to indicate not exactly.

Using chat, I would estimate, albeit arbitrarily, that 3/4 of tokens will be input, as threads grow with each completion. This means an average cost of $3.75/million tokens (0.75 × $3 + 0.25 × $6). OpenAI has the added benefit of parallel load balancing, whereas a single private rig will only handle one completion at a time. Take that against our original token count for the year privately:

1,261,440,000 / 1,000,000 × $3.75 = 1,261.44 × $3.75 =

$4,730.40/year public
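The blended rate and annual total, as a sketch under the same assumptions (the 75/25 input/output split is the arbitrary estimate above; the $3/$6 rates are the article's GPT-3.5 Turbo figures):

# Public inference: blended $/million tokens and annual cost
input_rate, output_rate = 3.00, 6.00           # $/M input, $/M output
input_share = 0.75                             # estimated share of input tokens

blended = input_share * input_rate + (1 - input_share) * output_rate  # $3.75/M
tokens_per_year = 1_261_440_000
public_cost = tokens_per_year / 1_000_000 * blended      # ~$4,730.40
print(f"Blended ${blended:.2f}/M tokens, annual ${public_cost:,.2f}")

Swapping in GPT-4o's list prices at the time of writing (roughly $5/$15 per million, from memory rather than from the article) gives a blended ~$7.50/M and about $9,460/year, consistent with the "about double" figure below.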

This means getting anything close to my 1 year of private LLM inference out of a much more flexible and powerful GPT-3.5 Turbo subscription will cost just $4,730.40 public vs $6,453 private. Private costs will drop in subsequent years without the capital investment in hardware, but it still seems clear that a public LLM at the GPT-3.5 Turbo level is a better deal. Remember, this is calculated against a bargain-basement inference rig rather than a top-of-the-line H100/GB100 server. Change the model to something as capable as GPT-4o and costs currently work out to about double, around $9,000/year. Privately, it would take a far larger investment and lots of private development to get equivalent functionality, RAG, and agent support like GPT-4o's.

What Gives?

I would love to say I could design a private infrastructure to help a business save on AI, but the fact is public AI services are still cheaper and better than most local models when it comes to common knowledge and code completion. An exception may be a fine-tuned specialist model tailored to your business need.

The fact is public AI services can provide such great value because they are burning cash like there's no tomorrow. Maybe AI armageddon is coming, right? OpenAI still provides top AI services at massive scale, but they are rumoured to be running at a $5B annual loss trying to build the market.

There are a growing number of stories about cash burn rates at startups. Cloud costs are astronomical, though the flexibility of cloud GPUs is critical for training. Training requires scale-up while inference requires scale-out. Companies competing for access to hardware will often be beaten out of the market by the hyperscalers, which order mass quantities before the silicon is even etched.

What is the answer for startups heading for a capital wall? Will subscription costs rise to hit profitability once we're hooked, or will costs come down? I suspect the latter. It's insane how cheap public AI subscriptions are given the value. I expect multiple AI startups will fail in the next year, as several already have. Those with physical kit in their datacenter will have assets to liquidate. Those that overbudgeted on cloud will have nothing but possible IP and trained models.

Conclusion

As a habitual DIY advocate, I never like to use a SaaS or product that I could manage on my own. I also deal with a lot of customers that can't let data leak into a cloud account or an external AI service. That said, there are plenty of cases where a public LLM does much more than a private LLM can, and it seems that until the profit police come calling, public AI startups aren't playing by the same cashflow rules as the rest of us. I love to see the innovation, but I worry about what happens when it's time to pay the piper. I'd love to hear your thoughts.


John Boero
TeraSky

I'm not here for popular opinion. I'm here for hard facts and future inevitability. Field CTO for TeraSky. American expat in London with 20 years of experience.