Embeddings in production: or how nothing scales like you’d expect it to. Part 1 — costs to embed.

John-David Wuarin
Published in Barnacle Labs
12 min read · Sep 25, 2023

TL;DR: If you already know about embeddings and just want to get to the bit where we look at the different options and their real costs, skip to the bottom where you’ll find a table with some interesting numbers. For our use case those costs varied from $17,600 to $45 — so it’s definitely worth doing some homework if you’re working with embeddings at scale!

Introduction

You might have heard a lot about embeddings, as they are one of the cornerstones of the latest AI cycle. They are being used everywhere, from search to recommendations and more. In particular, lots of enterprises are exploring the use of Large Language Models to query their private data — the so-called “Retrieval Augmented Generation” pattern, which relies heavily on embeddings.

Some background

Embeddings are simply vector representations of something (anything really, but we will mostly focus on text here). What we want is to represent words numerically, so that they can be compared to one another. In the past, models such as Word2vec and GloVe were developed and performed quite well at this task. However, they generate “static” word embeddings. This means that a word like Apple will always be represented by the same vector. This is somewhat of a limitation, as Apple might mean something completely different depending on the context it is in (e.g. the apple fruit, or Apple the computer company). Wouldn’t it be nice to be able to embed that contextual information as well? This is where transformers, most notably BERT, came in; they were developed precisely with this in mind. Today, virtually all embedding models in use are some flavour of BERT. I won’t dig into any of the aforementioned models, as this isn’t the point of this post; all that’s important to understand is that using such models, we are able to map each word to a vector that will look something like this:

The vector is just a big array of numbers that represent the meaning of the words being embedded in a mathematical manner.

For transformer-based models, the word embedding will look different depending on its context. So the word Apple will have a different embedding depending on whether it’s used in the context of fruit or of computers.
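
To make that concrete, here is a minimal sketch using the Hugging Face transformers library and bert-base-uncased. The sentences and the word_vector helper are purely illustrative, not part of any of the tools discussed later; the point is simply that the same surface word gets different vectors in different contexts.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` inside `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, 768)
    target = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(target) + 1):           # locate the word's sub-tokens
        if ids[i:i + len(target)] == target:
            return hidden[i:i + len(target)].mean(dim=0)  # average its sub-token vectors
    raise ValueError(f"{word!r} not found in sentence")

fruit = word_vector("I ate a crisp apple with my lunch.", "apple")
company = word_vector("Apple announced a new laptop today.", "apple")
# Same surface word, different vectors: the similarity is well below 1.0.
print(torch.cosine_similarity(fruit, company, dim=0).item())
```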

The nice thing about these embeddings is that the models used to create them have been designed to preserve semantic and syntactic relationships. This means that words with similar meanings, used in similar contexts, should have embeddings that are close together. In the same manner that words can be “embedded”, models can be used to embed full sentences. Sentence-BERT would be the seminal model/paper to look at for a more thorough understanding (paper, library). Again, the point here is that full sentences can be embedded and their semantic meaning encoded into vectors that are relatively small — typically a few hundred dimensions.
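
As a quick illustration, here is roughly what that looks like with the sentence-transformers library; the model name and example sentences are my own choices, not anything prescribed by the paper.

```python
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 is a small SBERT-style model that outputs 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The weather is lovely and warm today.",
    "It is sunny and pleasant outside.",
    "Quarterly revenue fell short of expectations.",
]
embeddings = model.encode(sentences)         # array of shape (3, 384)
print(util.cos_sim(embeddings, embeddings))  # the first two sentences score closest together
```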

Why do we care about all this?

As mentioned earlier, embeddings can power semantic search, advanced recommendations, clustering and much more. Once we have words transformed into vectors, we can do mathematical operations on them which opens up a whole world of possibilities. Embeddings have been quite the craze these past few years and the ecosystem around embeddings has grown at breakneck speed. The purpose of this post is to examine the production implications, so let’s take a look at what deploying a production solution will entail.

For this purpose we will take a small example problem: a dataset of 1M documents that are each, on average, 44 chunks long, with each chunk being about 1,000 tokens (we won’t focus too much on which tokenizer was used, as the same one was used throughout). We will be embedding all 1M documents. Things can (and do) get much larger, but this is a reasonable size to start with.
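
For reference, a minimal chunking sketch along those lines, using tiktoken; the choice of the cl100k_base encoding and the helper function are my own assumptions rather than something prescribed by any provider below.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the tokenizer used by ada-002, for example

def chunk_by_tokens(text: str, chunk_size: int = 1000) -> list[str]:
    """Split a document into chunks of roughly `chunk_size` tokens."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

# A ~44,000-token document yields ~44 chunks, matching the averages used throughout this post.
```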

There are two main steps and thus categories of tools that need to be used in order to use embeddings. Those are the “embedders” and the “storage/retrievers” — in this post, we will focus on embedders only.

Embedders

These are models that can be used to embed data (we’ve been focusing on text so far, so we’ll stay on that). The options you have at the time of writing are essentially twofold:

  1. Either use OpenAI’s API to get embeddings (which most “getting started” posts tend to use for simplicity’s sake), or
  2. Use some other model or platform, of which there are many.

When it comes to how good the various models are, I’ll refer you to: https://huggingface.co/spaces/mteb/leaderboard. Do note, however, that the OpenAI ada embedder ranks only 13th on this list at the time of writing (September 2023).

Let’s first look at how OpenAI performs in terms of cost and speed.

OpenAI

Firstly, you can read more about OpenAI and how to get access to their embeddings API here.

For now, let’s focus on their latest model’s performance, price etc.

  • name: text-embedding-ada-002
  • price: $0.0001 per 1000 tokens
  • output dimensions: 1536 — this will be important for one of my subsequent posts, when we look at price to store.
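
For context, here is a minimal call using the openai Python SDK as it looked at the time of writing; the API key and placeholder chunks are obviously illustrative.

```python
import openai

openai.api_key = "sk-..."  # your API key

# Batching several chunks per request helps stay under the requests-per-minute cap.
chunks = ["first ~1,000-token chunk ...", "second chunk ..."]
response = openai.Embedding.create(model="text-embedding-ada-002", input=chunks)
vectors = [item["embedding"] for item in response["data"]]  # each vector has 1536 floats
```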

Great! What does this actually mean for our example?

  • Response time: Because OpenAI is running multiple servers in parallel, you get the advantage of being able to simulate having deployed a swarm of servers that all compute embeddings for you in real time, at a relatively low cost. However, because everyone else using OpenAI is also hitting those same servers, the response-time variability is very high: I’ve seen anywhere from 4.5 to 40.0 seconds of response time. That high variability can be an issue for products that require real-time embedding of documents.
  • Price: as seen above, 1000 tokens cost us $0.0001. For one document of 44 chunks of 1000 tokens each, it’ll be 44 x $0.0001 = $0.0044.
    👉 For 1M documents with an average of 44 chunks of 1000 tokens each: 1M x 44 x $0.0001 = $4,400
  • Time to embed all: We’d also like to know how long it will take to embed all of these documents. Let’s first take a look at rate limits: 48 hours after becoming a paid user, the limits are set at 3,500 requests per minute and 350,000 tokens per minute.
    👉 If we had all documents at hand and could iterate through them at exactly 350k tokens per minute, embedding all our tokens would take: 1,000,000 x 44 x 1,000 / 350,000 = 125,714 minutes, or about 87 days
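
Those two bullet points boil down to a back-of-envelope calculation you can reuse for any provider; here it is with OpenAI’s numbers plugged in.

```python
DOCS = 1_000_000
CHUNKS_PER_DOC = 44
TOKENS_PER_CHUNK = 1_000
total_tokens = DOCS * CHUNKS_PER_DOC * TOKENS_PER_CHUNK   # 44 billion tokens

price_per_1k_tokens = 0.0001   # text-embedding-ada-002, USD
tokens_per_minute = 350_000    # stated rate limit

cost = total_tokens / 1_000 * price_per_1k_tokens
days = total_tokens / tokens_per_minute / 60 / 24
print(f"${cost:,.0f} over about {days:.0f} days")  # $4,400 over about 87 days
```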

Other solutions

First we’ll look at other products that provide off-the-shelf (or almost) APIs that can be leveraged.

Cohere

Cohere provides, very much like OpenAI, off-the-shelf models that can be used to create embeddings.

  • Models can be fine-tuned as per: https://docs.cohere.com/docs/finetuning. Their default generalist models are:
    - embed-english-v2.0 (default): 4096 dimensions
    - embed-english-light-v2.0: 1024 dimensions
    - embed-multilingual-v2.0: 768 dimensions
  • Pricing: Whether you’ve fine-tuned a model or not, the pricing remains the same with Cohere: $0.0000004 per token, or $0.0004 per thousand tokens.
    👉 For all our documents: $17,600 — 4x the price of OpenAI!
  • Time to embed all: Rate limits are set at 10,000 calls per minute, and each call will carry one 1,000-token chunk.
    👉 That’s quite a bit faster at 10,000,000 tokens per minute. So that’s 4,400 minutes or about 3 days, which is quite a lot quicker than OpenAI’s 87 days.
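
A minimal call with Cohere’s Python SDK would look roughly like this; the API key and chunk strings are placeholders.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

# embed-english-v2.0 (the default) returns 4096-dimensional vectors.
response = co.embed(
    texts=["first ~1,000-token chunk ...", "second chunk ..."],
    model="embed-english-v2.0",
)
vectors = response.embeddings  # one list of floats per input text
```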

Google Vertex

Google offers an API via Vertex for access to their Gecko embedding models. The output is a 768-dimensional vector.

  • Pricing: https://cloud.google.com/vertex-ai/docs/generative-ai/pricing
    For some reason, Google decides to price per character rather than per token. They do so at $0.0001 per 1k characters of input. Using the accepted estimate that 1 token is about 4 English characters, we multiply that price by 4.
    👉 We get $0.0004 / 1k tokens. This is the same price as Cohere, $17,600 for our 1m documents.
  • Time to embed: According to https://cloud.google.com/vertex-ai/docs/quotas, there is a 600-requests-per-minute rate limit. That means (assuming we’re still chunking by 1,000 tokens, or about 4,000 characters, and sending one chunk per request) that we can embed 600,000 tokens per minute.
    👉 That’s 73,333 minutes or 50.9 days.
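
For reference, a minimal sketch with the Vertex AI Python SDK; the project and location are placeholders, and the exact import path may differ between SDK versions.

```python
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-gcp-project", location="us-central1")

# textembedding-gecko returns 768-dimensional vectors.
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
embeddings = model.get_embeddings(["a chunk of roughly 4,000 characters ..."])
vector = embeddings[0].values  # list of 768 floats
```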

Replicate

  • Website: https://replicate.com/
  • Pricing: https://replicate.com/pricing
    With Replicate, you have to pick a machine or set of machines you want to run things on. In full disclosure, I haven’t tried all the machines to test performance, as I didn’t want to take the risk of spinning up 8x Nvidia A40 (Large) GPUs and forgetting they were running, which might create quite a large bill! Anyway, for rough estimates, let’s look at the lower end of the scale first. As any model can be run on these machines, I’ve taken a BERT-based model I’ve been playing around with: https://huggingface.co/allenai/specter2_base — the embeddings are 768 dimensions long.

Checking out the smallest options:

  • 1 CPU instance (with 4x vCPUs and 8GB RAM): It takes, on the lower end of things, about 44s to embed the 44 chunks (or 1 per second). This would thus take us 44,000,000s, or a little over 509 days, to embed all the documents at a cost of $0.0001 per second.
    👉 So that would be $4,400. A bit long, but no more expensive than OpenAI.
  • 1 Nvidia T4 GPU: It takes on average about 6.5s to embed the 44 chunks of 1 document. This would thus take us 6,500,000s, or a little over 75 days, to embed all the documents at a cost of $0.000225 per second.
    👉 So that would be $1,462.50.
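
Calling a model you’ve deployed on Replicate is a one-liner with their Python client. Note that the model slug below is a placeholder: Replicate runs whatever model you point it at rather than shipping a standard embedder, so you would substitute your own deployment.

```python
import replicate  # reads the REPLICATE_API_TOKEN environment variable

# Placeholder slug: substitute the owner/name:version of the embedding model you deployed.
output = replicate.run(
    "your-username/specter2-embedder:VERSION_HASH",
    input={"text": "a ~1,000-token chunk ..."},
)
print(len(output))  # e.g. 768 floats for a specter2_base-style model
```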

It’s interesting to note here that the overall price to embed goes down as the machine gets better — one of those rare cases in life where a faster engine works out cheaper.

Please note, I haven’t actually tried these machines on Replicate, so your experience might differ. I am porting the performance from what I’ve experienced by running the models on Hugging Face. See below:

Hugging Face

Hugging Face is not just the largest repository of AI models out there; one of their solutions, Inference Endpoints, allows us to deploy AI models on managed infrastructure, providing an API with almost no effort (the numbers below are for AWS machines provisioned via Hugging Face).

Here’s how things stack up:

  • 1 machine with 8 vCPUs and 16GB of RAM: We’re embedding 44 chunks in about 22s, so that’s 2 chunks per second.
    👉 This would take 22,000,000s, or a little over 254 days, to embed all the documents at a cost of $0.48 per hour. Or, $2,926 for all the docs.
  • 1 Nvidia T4 GPU: Just like above, it takes on average 6.5s to embed our 44 chunks.
    👉 We’re still looking at 75 days, but on top of the shorter time, the price drops: at $0.60 per hour, that’s $1,080 for all of the docs.
  • 1 Nvidia A10G: I had to throw a few more embeddings at it in parallel to get something close to its maximum throughput. I ended up sending it 14 batches of 32 chunks (for a total of 448 chunks), which took around 5.5s on average. That means that for our 44M chunks, this would take (44M / 448) x 5.5s (a sketch of such a batched client follows the notes below).
    👉 A little over 6.25 days. At $1.30 per hour, this comes out to $195.

NOTE 1: I tried their 4x Nvidia Tesla T4 as well, but it seems like they aren’t properly load balancing requests that are being sent in, so it was as slow as 1 Nvidia T4 and much more expensive. Actually, it costs over 7x the price of 1 Nvidia T4. In theory, for creating embeddings, you’d be better off deploying 7 independent Nvidia T4 servers and sending requests in a load balanced way. Either way, this would still be slower and more expensive than the A10G…

NOTE 2: I wanted to be able to test their A100, but I wasn’t able to, as I don’t have a quota. I’ll try and get some eventually. But in the meantime, I’ll leave it to the reader to try them out.
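
For completeness, here is roughly what the client side of those tests looks like. An Inference Endpoint is just an authenticated HTTP endpoint, though the exact request and response shape depends on the model’s handler; the URL, token and payload below are placeholders.

```python
import requests

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # from the Endpoints console
HEADERS = {"Authorization": "Bearer hf_..."}                        # your HF access token

def embed_batch(chunks: list[str]) -> list[list[float]]:
    """Send a batch of chunks in one request; batching is what kept the A10G busy above."""
    response = requests.post(ENDPOINT_URL, headers=HEADERS, json={"inputs": chunks})
    response.raise_for_status()
    return response.json()

vectors = embed_batch(["chunk one ...", "chunk two ..."] * 16)  # 32 chunks per call
```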

API solutions takeaway

Before we take a quick look at what all this would cost if we were to deploy things on our own cloud instances, rather than using an AI-specialist provider, let’s take a moment to appreciate the massive delta between some of these solutions.

It seems clear that, firstly, unless you have a very good reason to use them, OpenAI embeddings are essentially too expensive and too slow to compute at scale. Cohere is reasonably fast, but costs about 90x the price of our cheapest option here.

A rule of thumb seems to be that a better machine with a GPU yields not only faster embeddings (duh) but also a lower cost per embedding.

What about large cloud providers?

You may be wondering what the cost would be of using large cloud providers to achieve the same.

Here, what we’ll do is simply look at the prices of the machines we’ve mentioned in the Hugging Face solutions for Google Cloud, Azure and AWS. All of these platforms provide many more machines than Hugging Face, and I am working under the assumption that two machines with the same specs perform the same on any of these platforms (I’ve only run the tests on Hugging Face so far).

As I wasn’t able to get an A100 working on Hugging Face, I’m not reporting on prices here. Whenever I get the quota allocated to me on any of those platforms, I’ll update this article with estimates and prices for those machines too.

Either way, what we see is that managed Hugging Face essentially takes a margin over the on-demand price from AWS, but it’s not massive. Google Cloud is clearly cheaper on-demand ($0.35 vs $0.526, or 33.5% cheaper). When it comes to spot prices, you’ll find that they might be less interesting than they used to be, as per https://pauley.me/post/2023/spot-price-trends/, and that you would need to spend some money on engineering to guarantee some redundancy for when your instances get preempted.

Overview table

Here’s a table taking in all the above and making it easier to compare (API and managed options only, with the figures as computed above).

Provider / setup                          | Cost for 1M docs | Time to embed
OpenAI (text-embedding-ada-002)           | $4,400           | ~87 days
Cohere (embed-english-v2.0)               | $17,600          | ~3 days
Google Vertex (Gecko)                     | $17,600          | ~51 days
Replicate (1 CPU, 4x vCPUs / 8GB)         | $4,400           | ~509 days
Replicate (1 Nvidia T4)                   | $1,462.50        | ~75 days
Hugging Face Endpoints (8 vCPUs / 16GB)   | $2,926           | ~254 days
Hugging Face Endpoints (1 Nvidia T4)      | $1,080           | ~75 days
Hugging Face Endpoints (1 Nvidia A10G)    | $195             | ~6.25 days

Overall takeaways

If you are embedding small datasets and only doing a few ad-hoc embeddings per month: using a solution like OpenAI or Replicate might make sense. Cohere could also be an option if speed of embedding is important to you, but they really are much more expensive.

For anything of any reasonable scale: you’ll be looking at either using Hugging Face’s managed solutions, or going direct to a cloud provider. Hugging Face allows you to deploy models with less effort, while large cloud providers offer more flexibility. It’s also possible to programmatically switch the servers on and off based on demand. At the end of the day, when planning your strategy, you need to figure out how much of an engineering effort will be required for any solution and the value of convenience.

If you are only embedding in batches: figure out how to switch machines on and off programmatically, or you’ll be paying for idle machines, churning through electrons and generating CO2 for no good reason.
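
As an AWS-flavoured example (the instance IDs are placeholders, and GCloud and Azure have equivalent SDK calls), switching a fleet of embedding workers on and off can be as simple as:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
WORKER_IDS = ["i-0123456789abcdef0"]  # placeholder IDs of your embedding instances

def start_workers() -> None:
    ec2.start_instances(InstanceIds=WORKER_IDS)

def stop_workers() -> None:
    ec2.stop_instances(InstanceIds=WORKER_IDS)

# e.g. call start_workers() before a nightly batch job, stop_workers() once the queue drains.
```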

If you’re going with GCloud, AWS or Azure: figure out how to deploy the models on the machines in an efficient manner and how much that will cost you. If you’re just embedding a smaller number of documents, the additional engineering effort of making a cloud solution work isn’t worth it — the opportunity cost of the time to get everything working means you’d be better off with something like Hugging Face.

Although we looked at 1M documents in this post, you may be looking at 100M documents or more (as in our case at Barnacle Labs), so all costs and times would need to be multiplied by 100. Safe to say, looking into A100s on spot instances with a large cloud provider might be the way to go.

That’s it for the cost of embedders! In the next post, I’ll be talking about storage/retrievers and their associated costs.
