Optimizing your AI Deployment: An Analysis of Databricks’ Pay-Per-Token and Provisioned Throughput — Part 1

Lara Rachidi
Tech Trends: GenAI & ML
8 min read · Jul 9, 2024

By Puneet Jain & Lara Rachidi

As businesses incorporate large language models (LLMs) into their operations, understanding the cost implications and performance metrics of these technologies is crucial. LLMs such as Meta-Llama or DBRX offer capabilities that range from generating human-like text to solving complex analytical problems. However, as powerful as these models are, their operational costs and scalability need careful consideration.

Databricks Mosaic AI Model Serving now supports Foundation Model APIs, which allow you to access and query Databricks-curated, state-of-the-art open models from a serving endpoint. With Foundation Model APIs, you can quickly and easily build applications that leverage a high-quality generative AI model without maintaining your own model deployment. Foundation Model APIs are published in the Databricks Model Marketplace and as fully managed endpoints in the workspace.

In this blog post, we delve into the two modes of accessing Databricks Foundation Model APIs: Pay-per-token and Provisioned throughput, and explore the various factors that need to be considered. Understanding the nuances of these two serving methods is crucial for optimizing cost, performance, and security:

  • Pay-per-token endpoints are fully managed, shared endpoints that you can leverage for prototyping and running tests. They give you quick access to popular open-source foundational models, which makes them the fastest way to experiment with Foundation Models on Databricks.
  • Provisioned throughput endpoints are dedicated endpoints with more predictable performance. They’re useful for production workloads, where you need to meet performance SLAs with configurable latency and throughput. They also offer additional controls, so you can decide who has access to these endpoints. They’re ideal for deploying state-of-the-art base models or models fine-tuned on your proprietary data, and for business-critical production workloads.

In this blog, we’ll explore the important metrics you need to consider for LLM serving and how they compare between pay-per-token and provisioned throughput. Part two of this series will provide more details on costing and ROI guarantees, latency requirements, security considerations and region availability when serving with pay-per-token and provisioned throughput. Finally, we will compare Databricks LLM serving with Azure OpenAI.

Important Metrics for LLM Serving

Let’s look at all the metrics you’ll need to consider for LLM serving.

Queries or requests per second

Rate limits refer to the maximum number of queries or requests that can be made to the model within a specific time frame. This is a common practice to ensure fair usage and maintain the system’s performance and availability. In this context, Queries Per Second (QPS) is the number of prediction requests a system can process per second. It’s a measure of a system’s ability to handle traffic and provide timely predictions. The importance of QPS lies in its direct impact on the performance and efficiency of a system. A higher QPS indicates that the system can handle more traffic and provide predictions more quickly, which is crucial for applications that require real-time or near-real-time responses.

Tokens per second/minute

However, it’s a bit more nuanced in the context of LLMs. There is a finite capacity for input context and output predictions. If you ask a question to an LLM, the output will vary depending on the type of question asked (e.g. “Can you write a 90-page essay summarizing all of Shakespeare’s works?” vs. “What is the date today?”). This means that tokens are an important factor to consider: instead of only looking at queries, you also need to look at the tokens per second (or per minute) metric. Throughput is the number of tokens consumed and generated by the LLM per unit of time. The higher the throughput, the higher the cost.

What is a token?

You can think of tokens as pieces of a word, or the root of a word. In natural language processing, you need to convert words into a mathematical representation so that the model is able to understand the meaning and the context behind those words. The reason we look at the root of a word is that it allows us to avoid having to vectorize all possible combinations of words in the vocabulary. For example, the root of the word ‘magnificent’ is ‘magn’, which means great/large. This same root is also present in words such as magnanimous, magnitude, and magnifying. Generally, 1,000 tokens corresponds to about 750 words.
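
To make the word-to-token relationship concrete, here is a minimal sketch using the open-source tiktoken tokenizer. This uses an OpenAI BPE vocabulary purely for illustration; Databricks-served models such as DBRX or Llama ship their own tokenizers, so exact counts will differ.

```python
# pip install tiktoken
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # an example BPE vocabulary

text = "Databricks Foundation Model APIs make it easy to query large language models."
tokens = encoding.encode(text)

# On average English text, one token is roughly 3/4 of a word (1,000 tokens ~ 750 words)
print(f"{len(text.split())} words -> {len(tokens)} tokens")
```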

Time To First Token (TTFT) and Time Per Output Token (TPOT)

Time to first token (TTFT) refers to how quickly users begin to see the model’s output after they submit their query. For real-time applications, a low waiting time is crucial, whereas it matters less for offline workloads. TTFT is driven by the time it takes to process the input prompt and generate the first output token.

Time Per Output Token (TPOT) is the time it takes to generate each output token for a user querying the system. It’s a measure of how each user perceives the “speed” of the model: once the first token is produced, how long does it take to generate the rest? It might take 10 seconds to generate the first token, whereas the remaining tokens might be generated very quickly. TPOT is a useful metric; however, in practice it’s more common to look at TTFT. This article provides more information on the different metrics.

Overall Latency

This is the total time it takes for the model to generate a full response for a user. It can be calculated using the formula: latency = TTFT + (TPOT * the number of tokens to be generated).
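
As a quick worked example with hypothetical numbers:

```python
def overall_latency(ttft_s: float, tpot_s: float, n_output_tokens: int) -> float:
    """Total response time: time to first token plus per-token generation time."""
    return ttft_s + tpot_s * n_output_tokens

# Hypothetical values: 0.5 s to the first token, 20 ms per subsequent token,
# and a 500-token answer
print(overall_latency(ttft_s=0.5, tpot_s=0.02, n_output_tokens=500))  # -> 10.5 seconds
```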

Let’s examine how these metrics influence rate limits and quota restrictions for both pay-per-token and provisioned throughput.

Rate limits and Quota Restrictions

Quota restrictions apply to all the metrics defined above. Rate limits refer to the maximum number of queries or requests that can be made to the model within a specific time frame, or to the maximum number of tokens the LLM can consume and generate within that time frame. Let’s explore how these limits apply to pay-per-token and provisioned throughput on Databricks.

Pay-per-token

For Foundation Model APIs in pay-per-token mode, the rate limit is typically one query per second for the DBRX Instruct model and two queries per second for other chat and completion models. Embedding models have a default limit of 300 embedding inputs per second. Please note that you can contact your Databricks account team to request an increase in these limits. The pay-per-token mode is not designed for high-throughput applications or performant production workloads. Note that other vendors sometimes look at tokens per minute, instead of tokens per second.
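
To see what calling a pay-per-token endpoint looks like in practice, here is a hedged sketch using the OpenAI-compatible client exposed by Foundation Model APIs; it also records TTFT and TPOT, tying back to the metrics above. The workspace URL, token, and endpoint name are placeholders, so check your own workspace for the pay-per-token endpoints available to you.

```python
import os
import time

from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],  # a Databricks personal access token
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",
)

start = time.perf_counter()
first_token_at = None
output_chunks = 0

stream = client.chat.completions.create(
    model="databricks-dbrx-instruct",  # example pay-per-token endpoint name
    messages=[{"role": "user", "content": "Summarize what a data lakehouse is."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first visible output -> TTFT
        output_chunks += 1  # roughly one token per streamed chunk

total = time.perf_counter() - start
ttft = first_token_at - start
tpot = (total - ttft) / max(output_chunks - 1, 1)
print(f"TTFT: {ttft:.2f}s | TPOT: {tpot * 1000:.0f}ms | total latency: {total:.2f}s")
```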

In terms of concurrency, because the rate limit is one query per second for the DBRX Instruct model and two queries per second for other chat and completion models, you will start seeing 429 errors as the number of users increases on pay-per-token (a simple retry sketch is shown below the figure). Whether this happens depends on how busy the endpoint is at that moment, since you’re sharing it with other customers.

This article explains the difference between pay-per-token and provisioned throughput well by using the analogy of a highway, as shown in the image below. You can picture Databricks pay-per-token as a freeway, which moves cars (requests) to the models and back. Just like on a freeway, you can’t control who else is using the road at the same time; this is similar to the service utilization we experience during peak hours. There’s a speed limit (rate limit) for everyone, but various factors can prevent you from reaching your expected speed. Additionally, if you’re managing a fleet of vehicles (analogous to service calls) all using different parts of the freeway, you can’t predict which lane will get congested. This is the risk you take when using the freeway: while high-demand times like rush hour are predictable, unexplained slowdowns sometimes occur. Consequently, your estimated travel times (response latency) can vary greatly depending on traffic conditions.

Pay-per-token represented as a freeway
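
Since the endpoint is shared, a minimal retry-with-exponential-backoff loop is usually enough to absorb occasional 429 responses during prototyping. The retry count and sleep schedule below are illustrative, not official guidance, and the workspace URL, token, and endpoint name are placeholders.

```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="<databricks-token>",
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",
)

def chat_with_retry(messages, model="databricks-dbrx-instruct", max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            # HTTP 429: back off exponentially (1 s, 2 s, 4 s, ...) with jitter
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"Still rate-limited after {max_retries} retries")

response = chat_with_retry([{"role": "user", "content": "What is the date today?"}])
print(response.choices[0].message.content)
```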

Provisioned throughput

As a rule of thumb, you should move to provisioned throughput if you have more than 10 users. Provisioned throughput has performance guarantees, which is why this mode is recommended for all production workloads, especially those that require high throughput, performance guarantees, and fine-tuned models.

The default Queries Per Second (QPS) limit for a workspace is 200, but you can increase it to 25,000 or more by contacting your Databricks account team. Provisioned throughput is available in increments of tokens per second, with the specific increments varying by model. To identify the suitable range for your needs, Databricks recommends using the model optimization information API within the platform.
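
As a hedged sketch, that call can look like the following over the REST API. The route and response fields reflect the documentation at the time of writing and may change; the workspace URL, token, model name, and version are placeholders.

```python
import requests

WORKSPACE = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<databricks-token>"
MODEL, VERSION = "system.ai.dbrx_instruct", "1"  # example Unity Catalog model

resp = requests.get(
    f"{WORKSPACE}/api/2.0/serving-endpoints/get-model-optimization-info/{MODEL}/{VERSION}",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

# Expect fields such as `optimizable` and `throughput_chunk_size` (the tokens/second
# increment you can provision in) in the response body.
print(resp.json())
```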

Note that you can configure the maximum tokens per second of throughput for your endpoint. Provisioned throughput endpoints scale automatically, and you can select ‘Modify’ to view the minimum tokens per second your endpoint can scale down to.

For example, if you select DBRX Instruct, the endpoint can scale from 0 tokens/second to 6,000 tokens/second. It will be different if you select mixtral_8x7b_instruct_v0_1: that endpoint can scale from 0 tokens/second to 6,200 tokens/second. Every model has a finite number of tokens that it can consume and generate, as shown on the pricing page, so you’ll need to consider which model you need and the associated throughput. Based on your concurrency requirements, you can decide on the model and the required number of tokens per second.
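
Once you know the valid range, a provisioned throughput endpoint can be created through the serving-endpoints REST API (or the Serving UI). A hedged sketch follows; the entity name, version, and throughput bounds are placeholders, and the valid increments come from the optimization information call above.

```python
import requests

WORKSPACE = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<databricks-token>"

payload = {
    "name": "dbrx-provisioned-throughput",
    "config": {
        "served_entities": [
            {
                "entity_name": "system.ai.dbrx_instruct",  # Unity Catalog model
                "entity_version": "1",
                "min_provisioned_throughput": 0,     # lower bound (scale to zero)
                "max_provisioned_throughput": 6000,  # upper bound in tokens/second
            }
        ]
    },
}

resp = requests.post(
    f"{WORKSPACE}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```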

If we take the highway analogy again, provisioned throughput is more like traveling on a private road, with predictable travel times and no traffic other than your own.

Provisioned throughput is analogous to traveling on a private road

The selection of the foundational LLM is a critical factor when considering serving needs for provisioned throughput. Models differ in aspects such as reasoning capabilities, context window size, and serving cost. Llama 3 70B needs one A100 machine to output 670 tokens per second, whereas Llama 3 8B can output 3,600 tokens per second on the same A100 machine. The size and complexity of the model directly impact processing speed, which in turn influences the cost. The model architecture also affects throughput: for example, DBRX, a mixture-of-experts model, can generate text at a rate of up to 150 tokens per second per user, which is considerably faster than many dense models.

Selecting a maximum throughput when creating a serving endpoint for DBRX instruct
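
Before settling on a model and a maximum throughput, a rough back-of-envelope estimate helps translate expected traffic into a tokens-per-second budget. This is our own heuristic with hypothetical numbers, not an official Databricks sizing formula.

```python
# Rough capacity estimate: peak concurrent requests times tokens per request,
# divided by the acceptable end-to-end latency. All numbers are hypothetical.
concurrent_requests = 20
input_tokens_per_request = 1_500
output_tokens_per_request = 300
target_latency_s = 10.0

required_tokens_per_second = (
    concurrent_requests
    * (input_tokens_per_request + output_tokens_per_request)
    / target_latency_s
)
print(f"Provision roughly {required_tokens_per_second:.0f} tokens/second")  # ~3600
```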

Conclusion

In conclusion, the choice between pay-per-token and provisioned throughput when using Databricks Foundation Model APIs largely depends on your specific needs and requirements.

Pay-per-token is a cost-effective solution for prototyping and running tests, offering quick access to popular open-source foundational models. However, it lacks the performance guarantees and higher throughput capabilities of provisioned throughput, making the latter a more suitable choice for production workloads. Rate limits, costing and ROI guarantees, latency requirements, security considerations, and regional availability are all critical factors to consider when choosing between these two modes. Understanding these nuances is crucial for optimizing cost, performance, and security.

In part two of this blog, we’ll examine costs, latency requirements, security considerations, and supported regions when choosing pay-per-token vs. provisioned throughput. We’ll also compare Databricks LLM serving with Azure OpenAI.
