A Broke B**ch’s Guide to Tech Start-up: Choosing LLM — API Prices

Soumit Salman Rahman
Before You Launch
9 min read · Apr 20, 2024


So, what model should I use?

Well, that depends …

“That depends” — yes, the mother of all answers. It depends on a bunch of things. But when you are a broke start-up founder it depends a whole lot on the pricing.

For smaller open-source models (e.g. summarization, embedding models etc.) I would recommend hosting them as part of your app, since a general-purpose container + CPU can do the job just fine without breaking the bank. For example, I run such a setup as an Azure Container App for less than $3 a month.

For larger models that’s not always feasible. Maybe you are still trying to figure things out, maybe you are not ready to invest in a dedicated GPU deployment yet, maybe you don’t know your production workload yet. Whatever the reason is, when you are starting out to build an initial alpha/pre-alpha of a GenAI app, I am a fan of using already-hosted serverless endpoints. To me they are low initial cost, low commitment and easily switchable.

For this article, I will focus on text-based LLMs and the one thing that start-up founders sweat beyond all means: operating cost. I am assuming that you already have a general idea of the capabilities of the most commonly known LLMs and that capability is not a decision factor right now.

For serverless endpoints, LLM service providers (like OpenAI, Anyscale etc.) charge you by the total number of tokens (1 token is roughly 0.75 words), whereas dedicated endpoints tend to charge based on compute time. We will base all our costing per 1 million tokens. To give you an idea:

  • 1,000,000 tokens is roughly equivalent to 1000 pages full of text.
  • That is Dune part 1 and part 2 books together. Or 2 Lord Of The Rings books together.
  • That is 2.5 days of non-stop reading: no sleeping, no eating, no pee-pee break (You can have a bathroom break but you just have to keep reading while you are doing your bidnez).
  • That is like reading 5000–6000 news articles with ~200 articles every day for a whole month straight.

That’s a lot of words cuh!!
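To make the per-1M-token pricing concrete, here is a rough back-of-envelope calculator. The traffic numbers are made up for illustration; the rates are the GPT-3.5 Turbo ones from the table below, so plug in whatever your provider currently charges.

```python
def monthly_cost(requests_per_day: int,
                 input_tokens_per_request: int,
                 output_tokens_per_request: int,
                 input_rate_per_1m: float,
                 output_rate_per_1m: float) -> float:
    """Estimated USD cost for 30 days of traffic at $/1M-token rates."""
    monthly_input = requests_per_day * input_tokens_per_request * 30
    monthly_output = requests_per_day * output_tokens_per_request * 30
    return (monthly_input / 1_000_000) * input_rate_per_1m + \
           (monthly_output / 1_000_000) * output_rate_per_1m

# Hypothetical workload: 1,000 requests/day, ~500 tokens in, ~200 tokens out,
# at GPT-3.5 Turbo rates ($0.50 in / $1.50 out per 1M tokens):
print(monthly_cost(1000, 500, 200, 0.50, 1.50))  # → 16.5
```

Sixteen and a half dollars a month for a thousand requests a day is the kind of number that makes serverless endpoints attractive at the alpha stage.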

Chat & Instructs

For base models, some platforms charge you separate rates for input and output, where output tokens tend to cost more, while others average it out and charge you a flat rate.

Proprietary-ish Models:

Technically, Cohere just open-sourced their model weights, so these are not strictly proprietary anymore, but anyhow. From the evaluation results, GPT-4, Command-R and Claude 3 Opus are in the same weight class, and their prices vary significantly from their lighter-weight siblings.

+--------------+-----------------+--------------------+---------------------+------------------+
| Platform | Model | Input: $/1M Tokens | Output: $/1M Tokens | Avg: $/1M Tokens |
+--------------+-----------------+--------------------+---------------------+------------------+
| OpenAI | GPT 4 Turbo | 10.00 | 30.00 | 20.00 |
| OpenAI | GPT 3.5 Turbo | 0.50 | 1.50 | 1.00 |
| Cohere | Command-R | 0.50 | 1.50 | 1.00 |
| Cohere | Command-Light | 0.30 | 0.60 | 0.45 |
| Anthropic | Claude 3 Opus | 15.00 | 75.00 | 45.00 |
| Anthropic | Claude 3 Sonnet | 3.00 | 15.00 | 9.00 |
| Anthropic | Claude 3 Haiku | 0.25 | 1.25 | 0.75 |
| Anthropic | Claude Instant | 0.80 | 2.40 | 1.60 |
+--------------+-----------------+--------------------+---------------------+------------------+
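The flat 50/50 average in the last column can be misleading if your workload is lopsided. A quick sketch of the effective (blended) rate, using the Claude 3 Opus numbers from the table above and a hypothetical input-heavy RAG workload:

```python
def blended_rate(input_rate: float, output_rate: float,
                 input_share: float) -> float:
    """Effective $/1M tokens when `input_share` of your tokens are input."""
    return input_rate * input_share + output_rate * (1 - input_share)

# Claude 3 Opus ($15 in / $75 out): a RAG app that is ~90% input tokens
# pays far less per token than the simple 50/50 average suggests.
print(round(blended_rate(15.0, 75.0, 0.9), 2))  # → 21.0
print(round(blended_rate(15.0, 75.0, 0.5), 2))  # → 45.0
```

So before you compare the “Avg” column across providers, check what your own input-to-output ratio actually looks like.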

Open-source Models:

There is a whole bag full of open-source models on HuggingFace, but the Mistral models seem to be the only ones going toe-to-toe with the proprietary ones. A number of platforms offer serverless inference endpoints for Mistral-7B, Mixtral-8x7B (Mistral’s mixture-of-experts model) and the newest Mixtral-8x22B.

+---------------+--------------+--------------------+---------------------+------------------+
| Model | Platform | Input: $/1M Tokens | Output: $/1M Tokens | Avg: $/1M Tokens |
+---------------+--------------+--------------------+---------------------+------------------+
| Mistral 7B | DeepInfra | 0.10 | 0.10 | 0.10 |
| Mistral 7B | Anyscale | 0.15 | 0.15 | 0.15 |
| Mistral 7B | OctoAI | 0.10 | 0.25 | 0.18 |
| Mistral 7B | Fireworks.ai | 0.20 | 0.20 | 0.20 |
| Mistral 7B | Together.ai | 0.20 | 0.20 | 0.20 |
| Mistral 7B | Mistral | 0.25 | 0.25 | 0.25 |
| Mixtral 8x7B | DeepInfra | 0.27 | 0.27 | 0.27 |
| Mixtral 8x7B | OctoAI | 0.30 | 0.50 | 0.40 |
| Mixtral 8x7B | Anyscale | 0.50 | 0.50 | 0.50 |
| Mixtral 8x7B | Fireworks.ai | 0.50 | 0.50 | 0.50 |
| Mixtral 8x7B | Together.ai | 0.60 | 0.60 | 0.60 |
| Mixtral 8x7B | Mistral | 0.70 | 0.70 | 0.70 |
| Mixtral 8x22B | DeepInfra | 0.65 | 0.65 | 0.65 |
| Mixtral 8x22B | Together.ai | 1.20 | 1.20 | 1.20 |
| Mixtral 8x22B | Mistral | 2.00 | 6.00 | 4.00 |
+---------------+--------------+--------------------+---------------------+------------------+

Note that these are NOT the only models these platforms offer. I just chose my favorite ones.

Leaderboard:

  • 🏆 Overall pricing-wise, DeepInfra wins by a loooong shot!
  • 👏 For heavyweight models, if you want something versatile, go with OpenAI. GPTs are still my favorite models when it comes to reliability, consistency and instruction following.
  • 👌 If you don’t need code-generation support, Cohere Command-R is solid for the pricing.
  • 😞 All Anthropic models come with significantly large context windows (100K–200K tokens), which is great for legal bots, but for my product scenarios, given the price, GPT 3.5 and Command-R can wipe the floor with any Anthropic model.

Text Embeddings

Whether you are doing RAG, semantic search or content classification, text embeddings are pretty much an integral part of your GenAI app. Depending on what you are trying to do and your input, you can either choose a model with a large context window (and generally lower resolution) or a small context window (and generally higher resolution). One thing to keep in mind is that a smaller context window does not necessarily mean the model itself is smaller, e.g. baai/bge-large-en-v1.5 is a pretty large model that takes half a page of context and generates high-dimension vectors for higher-precision semantic search. On that note: OpenAI is no longer the only platform supporting a large context window for embeddings. We got more players in the field 🏈.

Large Context Window (4K–32K tokens / 5–40 pages):

+--------------+----------------------------+----------------+-------------+
| Platform | Model | Context Window | $/1M Tokens |
+--------------+----------------------------+----------------+-------------+
| Together.ai | m2-bert-80M-32k-retrieval | 32768 | 0.01 |
| Together.ai | m2-bert-80M-8k-retrieval | 8192 | 0.01 |
| Fireworks.ai | nomic-embed-text-v1.5 | 8192 | 0.01 |
| OpenAI | text-embedding-3-small | 8192 | 0.02 |
| Jinaai | jina-embeddings-v2-base-en | 8192 | 0.02 |
| Mistral | mistral-embed | 8192 | 0.10 |
| Nomic | nomic-embed-text-v1.5 | 8192 | 0.10 |
| Voyage AI | voyage-2 | 4000 | 0.10 |
| Voyage AI | voyage-large-2 | 16000 | 0.12 |
| OpenAI | text-embedding-3-large | 8192 | 0.13 |
+--------------+----------------------------+----------------+-------------+

Small Context Window (0.5K tokens / half a page):

+--------------+--------------------------+----------------+-------------+
| Platform | Model | Context Window | $/1M Tokens |
+--------------+--------------------------+----------------+-------------+
| Together.ai | google/bert-base-uncased | 512 | 0.01 |
| DeepInfra | baai/bge-large-en-v1.5 | 512 | 0.01 |
| DeepInfra | thenlper/gte-large | 512 | 0.01 |
| Together.ai | baai/bge-large-en-v1.5 | 512 | 0.02 |
| Fireworks.ai | WhereIsAI/uae-large-v1 | 512 | 0.02 |
| Fireworks.ai | thenlper/gte-large | 512 | 0.02 |
| Anyscale | baai/bge-large-en-v1.5 | 512 | 0.05 |
| Anyscale | thenlper/gte-large | 512 | 0.05 |
| OctoAI | thenlper-gte-large | 512 | 0.05 |
+--------------+--------------------------+----------------+-------------+
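Since these small-context models take only 512 tokens (about half a page), longer documents have to be chunked before embedding. A minimal sketch using this article’s ~0.75-words-per-token rule of thumb; real tokenizers (e.g. tiktoken) will give you exact counts, so treat this as an approximation:

```python
def chunk_words(text: str, max_tokens: int = 512,
                words_per_token: float = 0.75) -> list[str]:
    """Split text into word-count-bounded chunks for a small context window.

    Rough rule of thumb: 1 token ~ 0.75 words, so a 512-token window
    holds about 384 words.
    """
    max_words = int(max_tokens * words_per_token)
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

article = " ".join(["word"] * 1000)   # a ~1000-word document
chunks = chunk_words(article)
print(len(chunks))  # → 3
```

Each chunk then gets its own embedding vector, which is also how you should think about the per-token costs above: a 1000-word document costs you ~1,300 embedding tokens regardless of how you slice it.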

Leaderboard:

  • 🏆 No matter what you are trying to do and whichever scenario you are trying to address, Together.ai smokes everyone else to ashes.

I personally recommend running the embedding model locally alongside your storage service to reduce latency and cost. This is especially helpful if you are using a local/in-memory vector database like ChromaDB. But given Together.ai’s pricing, their endpoint can end up cheaper than the compute cost of running the embedding model locally.
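To make the local idea concrete, here is a minimal sketch of in-memory semantic search with plain cosine similarity. The 3-dimensional vectors are toy stand-ins for real embeddings (bge-large-en-v1.5, for instance, gives you 1024 dimensions), and in practice a library like ChromaDB would handle the storage and nearest-neighbor search for you:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" keyed by document topic.
docs = {
    "pricing": [0.9, 0.1, 0.0],
    "fine-tuning": [0.1, 0.9, 0.2],
    "free credits": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.1]  # pretend this is the embedded user query

# Retrieve the closest document by cosine similarity.
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # → pricing
```

The whole thing fits in your app process, which is exactly why the latency (and often the cost) beats a round-trip to a hosted endpoint.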

Fine Tuning

To me, OpenAI’s GPT 3.5 was the first base model that just worked, and it works great right off the bat. Not all models are like that, and as you start picking up users and improving your product you will need to fine-tune whatever base model you are using. However, when it comes to costing, fine-tuning has 4 different cost factors:

  1. Cost per run: Some (not all) platforms charge a fixed cost per training run, irrespective of the “volume” of training.
  2. Cost per token during training: This is similar to the cost per token for using the base model, but usually at a higher rate.
  3. Cost per token during usage (input): Using a fine-tuned model hosted on the same platform may cost you more than using the base model (even though the model size doesn’t change).
  4. Cost per token during usage (output): Same as above, but for output tokens.

+---------------+--------------+-----------------+----------+-------+--------+
| Model | Platform | Per Run (Fixed $) | Training: $/1M | Input: $/1M | Output: $/1M |
+---------------+--------------+-----------------+----------+-------+--------+
| Mistral 7B | Together.ai | 0 | 0.1 | 0.2 | 0.2 |
| Mistral 7B | Fireworks.ai | 0 | 0.5 | 0.2 | 0.2 |
| Mistral 7B | Anyscale | 5 | 1 | 0.25 | 0.25 |
| Command-Light | Cohere | 0 | 1 | 0.3 | 0.6 |
| Mixtral 8x7B | Fireworks.ai | 0 | 2 | 0.5 | 0.5 |
| Mixtral 8x7B | Anyscale | 5 | 4 | 1 | 1 |
| Mixtral 8x7B | Together.ai | 0 | 5 | 0.6 | 0.6 |
| gpt-3.5-turbo | OpenAI | 0 | 8 | 3 | 6 |
+---------------+--------------+-----------------+----------+-------+--------+
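The four cost factors above combine into one number like so. This is a rough sketch; the example rates are the Anyscale Mistral 7B row from the table above, and the token volumes are made up for illustration:

```python
def fine_tune_cost(fixed_per_run: float, training_rate: float,
                   input_rate: float, output_rate: float,
                   training_tokens: float,
                   monthly_input_tokens: float,
                   monthly_output_tokens: float,
                   months: int = 1) -> float:
    """Total USD: one training run plus `months` of inference traffic.

    All rates are $/1M tokens, matching the table above.
    """
    training = fixed_per_run + (training_tokens / 1_000_000) * training_rate
    usage = months * ((monthly_input_tokens / 1_000_000) * input_rate
                      + (monthly_output_tokens / 1_000_000) * output_rate)
    return training + usage

# Mistral 7B on Anyscale ($5/run, $1 training, $0.25 in/out per 1M),
# with 2M training tokens and 10M input + 4M output tokens per month:
print(fine_tune_cost(5, 1, 0.25, 0.25, 2e6, 10e6, 4e6))  # → 10.5
```

Notice how the $5 fixed fee dominates when your training set is tiny, which is exactly the Anyscale caveat in the observations below.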

Observations:

  • None of the platforms seems to provide fine-tuning-as-a-service for the heavyweight models, e.g. OpenAI’s GPT-4 or Mixtral 8x22B.
  • Anthropic does not allow fine-tuning for any of its models.
  • OctoAI and DeepInfra do not offer fine-tuning-as-a-service for text-generation models; however, they do for media-generation models.
  • Anyscale charges a static fee of $5 per run no matter how minimal your training data is.

Leaderboard:

  • 🏆 Fireworks.ai and Together.ai are nearly tied here. But given the speed of Fireworks.ai’s endpoints, they get the trophy.

The Freebies

I like my free $hit. This is like going to a black-tie event and getting lit up on free tasting-wine. It’s great!

DALL-E is undefeated 😆

When you sign up, pretty much all platforms give you some initial free credit to get your playground started. $10 doesn’t seem like much but I can tell you it’s a LOT! The credit seems to apply across the board no matter which model you use through the APIs. One thing to keep in mind is that billing for API access is separate from the web chat apps that platforms like OpenAI and Anthropic offer.

+--------------+------------------+
| Platform | Free Credit $ |
+--------------+------------------+
| Cohere | 75 |
| Nomic | 50 |
| Together.ai | 25 |
| Anyscale | 10 |
| OctoAI | 10 |
| OpenAI | 10 |
| VoyageAI | 5 |
| Anthropic | 5 |
| DeepInfra | 1.8 |
| Fireworks.ai | Free for 2 weeks |
+--------------+------------------+
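To put those credits in perspective as tokens rather than dollars, here is a quick conversion at a flat average rate (the example uses the Cohere credit and Command-R’s $1.00/1M average from the earlier table):

```python
def tokens_for_credit(credit_usd: float, avg_rate_per_1m: float) -> int:
    """How many tokens a free credit buys at a flat $/1M-token rate."""
    return int(credit_usd / avg_rate_per_1m * 1_000_000)

# $75 of Cohere credit at Command-R's $1.00/1M average rate:
print(tokens_for_credit(75, 1.00))  # → 75000000
```

That is 75 million tokens, roughly 75,000 pages of text by this article’s estimate, before you spend a dime.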

Leaderboard:

  • 🏆 Cohere — I 😍 you! Given that their models are significantly cheaper compared to other models of the same weight class, $75 is a whopping number.
  • 👏Nomic — given the space they operate in (text-embedding and re-ranking) this is a solid deal.
  • 👌 DeepInfra gets an honorable mention. They are 2–6 times cheaper than the other platforms for the same models. Even though their free credit is low in $ value, it goes further than it appears.

Keep in mind that

  • These are not the ONLY platforms out there that provide serverless endpoints for LLMs.
  • The platforms mentioned here also offer other models that I did not cover.
  • Prices change all the time, and so do the models on offer.
  • Price is an important factor, but the initial free tiers come with rate limits, so do not use them for production.
