How To Reduce The Cost Of Running The GPT Models

Paulo Marcos
17 min read · Nov 10, 2023


In this article you will find effective methods, with examples, to cut the infrastructure cost of running your GPT-powered apps so that your business can redirect that spending toward further improving your product or service. Depending on your application, you can reduce costs by up to 90%!

This article is brought to you by 🐺Wolfflow AI


By the end of the article we will know:

  1. How OpenAI charges for their models
  2. What are tokens and how to count them
  3. How to reduce tokens from user generated input
  4. How to reduce tokens from engineer generated input prompt
  5. How to save costs by applying cache
  6. How to further save costs by generating cache ahead of time
  7. The difference in price between GPT-3 and GPT-4 models

⚠️ Please note, this article assumes that either you or one of your team members have a moderate understanding of Python.

Let’s prepare our gear and embark on this marvelous adventure together.

1. OpenAI API Costs

Extracting the price from the OpenAI page (as of November 2023), we have:

GPT-4 Turbo

| Model | Input | Output |
|-------|-------|--------|
| 1106-preview | $0.01 / 1K tokens | $0.03 / 1K tokens |
| 1106-vision-preview | $0.01 / 1K tokens | $0.03 / 1K tokens |

GPT-4

| Model | Input | Output |
|-------|-------|--------|
| 8K context | $0.03 / 1K tokens | $0.06 / 1K tokens |
| 32K context | $0.06 / 1K tokens | $0.12 / 1K tokens |

GPT-3.5 Turbo

| Model | Input | Output |
|-------|-------|--------|
| turbo-1106 | $0.0010 / 1K tokens | $0.002 / 1K tokens |
| turbo-instruct | $0.0015 / 1K tokens | $0.002 / 1K tokens |

What that means is: whenever you call the API with a POST request, OpenAI will charge $0.01 per 1K tokens for the prompt it receives and another $0.03 per 1K tokens for the text it generates when using GPT-4 Turbo. We will cover what exactly tokens are and how to count them in the next section.

Let’s take a look at it in practice and see how we are charged per request. For this example, we ask GPT to explain in exactly 50 words what ChatGPT is. We will use a simple chat completions call with the GPT-4 model.

import openai  # Version 1.1.1 for this article

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {'role': 'user', 'content': 'Explain in exactly 50 words what is ChatGPT.'}
    ],
    temperature=0,
)

What we get is the response which consists of not only the output by the model but also how many tokens were received and how many were generated:

{
    "id": "chatcmpl-...",
    "object": "chat.completion",
    "created": 16899999,
    "model": "gpt-4-0613",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "ChatGPT is an artificial intelligence model developed by OpenAI. It uses machine learning to generate human-like text based on the input it receives. It's designed for various applications, including drafting emails, writing code, creating written content, tutoring, translating languages, and simulating characters for video games."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 20,
        "completion_tokens": 60,
        "total_tokens": 80
    }
}

You can see that in usage we have: "prompt_tokens": 20 and "completion_tokens": 60 giving a total of 80 tokens, which will be used by OpenAI to charge your account.

With that in mind, let’s see how much OpenAI charged for the following number of requests:


| Time | Model | Requests | Prompt Tokens | Completion Tokens | Total Tokens |
|-----------|-------------|----------------|---------------|-------------------|--------------|
| 12:15 AM | gpt-4-0613 | 1 request | 189 | 203 | 392 |
| 2:20 AM | gpt-4-0613 | 1 request | 233 | 298 | 531 |
| 2:30 AM | gpt-4-0613 | 1 request | 153 | 94 | 247 |
| 5:55 AM | gpt-4-0613 | 1 request | 20 | 60 | 80 |
| TOTAL | | 4 requests | 595 | 655 | 1250 |

These 4 requests resulted in an API fee of roughly USD 0.06. Let’s break it down further:

595 tokens were used in the input prompts and 655 tokens were used in the output for the completion.

Total cost for prompt tokens = (Total prompt tokens / 1000) * Cost per 1K prompt tokens
Total cost for prompt tokens = (595 / 1000) * $0.03
Total cost for prompt tokens = $0.01785

Total cost for completion tokens = (Total completion tokens / 1000) * Cost per 1K completion tokens
Total cost for completion tokens = (655 / 1000) * $0.06
Total cost for completion tokens = $0.03930

Total cost = Total cost for prompt tokens + Total cost for completion tokens
Total cost = $0.01785 + $0.03930
Total cost = $0.05715
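If you want to track this programmatically, here is a minimal sketch (my own helper, not part of the OpenAI library) that reads the usage block of a response and converts it into a dollar amount, assuming the GPT-4 8K prices from the table above:

# Prices per 1K tokens for GPT-4 (8K context), taken from the pricing table above
GPT4_INPUT_PRICE = 0.03
GPT4_OUTPUT_PRICE = 0.06

def request_cost(usage, input_price=GPT4_INPUT_PRICE, output_price=GPT4_OUTPUT_PRICE):
    """Convert a response's usage object into a dollar amount."""
    return (usage.prompt_tokens / 1000) * input_price + (usage.completion_tokens / 1000) * output_price

# For the single request from earlier:
print(request_cost(response.usage))  # (20 / 1000) * 0.03 + (60 / 1000) * 0.06 = 0.0042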

Now that we figured out the cost of running the API, let’s understand exactly what tokens are and how we can count them before sending the request to the API.

1.1 Understanding Tokens

A token is a chunk of text that is processed by LLMs. This chunk can be a whole word like "hello" or a piece of a word like "G" and "PT". Programmatically, we can use tiktoken, an open-source tokenizer by OpenAI, which efficiently splits text strings into tokens (e.g., "tiktoken is great!" into ["t", "ik", "token", " is", " great", "!"]). This is valuable when working with GPT models, as it lets us track token counts to assess how much a model has to process and what an API call will cost, since billing is based on tokens used.

In the documentation, OpenAI provides the following piece of code as an example to test the size of your input prompt:

import tiktoken


def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo-0613":  # note: future models may deviate from this
        num_tokens = 0
        for message in messages:
            num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
            for key, value in message.items():
                num_tokens += len(encoding.encode(value))
                if key == "name":  # if there's a name, the role is omitted
                    num_tokens += -1  # role is always required and always 1 token
        num_tokens += 2  # every reply is primed with <im_start>assistant
        return num_tokens
    else:
        raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.
See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")

Which we can use to test out our example from earlier:

model = "gpt-3.5-turbo-0613"
messages = [
    {'role': 'user', 'content': 'Explain in exactly 50 words what is ChatGPT.'}
]
print(f"{num_tokens_from_messages(messages, model)} prompt tokens counted.")

# Which prints: 20 prompt tokens counted.

The result is 20 tokens. Notice that this is the number of tokens of the whole message, which includes the keys and values of role and content. But what about the size of just the prompt we used? We can check that by running the following code:

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
encoding.encode("Explain in exactly 50 words what is ChatGPT.")
>>> [849, 21435, 304, 7041, 220, 1135, 4339, 1148, 374, 13149, 38, 2898, 13]
# Now let's use len() to get how many tokens there are
len(encoding.encode("Explain in exactly 50 words what is ChatGPT."))
>>> 13
# What if we delete the final dot?
len(encoding.encode("Explain in exactly 50 words what is ChatGPT"))
>>> 12

Notice what we did there in the last command. We removed the last dot, and it indeed reduced the size by 1 token. The question that follows is:

Are there other ways to reduce the size of the input prompt so that the cost is also reduced? Let’s dive deep into it in the next section.

2. Effective ways to shorten your prompts

Depending on how you will work with the OpenAI API, you will have to handle the input in different ways.

2.1 Handling User Input

Let’s say you have a chatbot, and you will pass your users input through GPT-4 so the chatbot can answer it properly. Since you cannot control what the user will input, you can parse the string and remove all unnecessary tokens from it. Here are some examples of how differently users can input the same sentence:

# User 1 input:
I think this is too expensive
# User 2 input:
I think this is too expensive!
# User 3 input:
I think this is too expensive!!!
# User 4 input:
I think this is too expensive!!!!!!!!!!!!!
# User 5 input:
I think this is too expensive!!!!!!!!!!!!!!!!! 😡😡😫

Now let’s run these strings through tiktoken and see how it returns the tokens:

encoding.encode("I think this is too expensive")
>>> [40, 1781, 420, 374, 2288, 11646]
encoding.encode("I think this is too expensive!")
>>> [40, 1781, 420, 374, 2288, 11646, 0]
encoding.encode("I think this is too expensive!!!")
>>> [40, 1781, 420, 374, 2288, 11646, 12340]
encoding.encode("I think this is too expensive!!!!!!!!!!!!!!!!!")
>>> [40, 1781, 420, 374, 2288, 11646, 51767, 17523, 70900]
encoding.encode("I think this is too expensive!!!!!!!!!!!!!!!!! 😡😡😫")
>>> [40, 1781, 420, 374, 2288, 11646, 51767, 17523, 70900, 27623, 94, 76460, 94, 76460, 104]

You can see that the tokenizer encodes both ! and !!! as a single token. If you try four exclamation marks (!!!!), you will see the same thing happen. So repeated punctuation marks are not necessarily encoded as one token each.

However, you have probably noticed that, depending on your application, we can reduce the “User 5 input” string to the same as the “User 1 input” string and the output will probably be the same. Make sure you test it first, though. This way we can ensure that user input messages don’t blow up our usage quota, at least for these kinds of cases.

We can easily remove the unnecessary characters from a string in Python using the built-in re library for regular expressions, together with the third-party emoji library:

import re
import emoji  # third-party package: pip install emoji

def remove_unnecessary_characters(text):
    # Collapse repeated exclamation marks, question marks, etc. into a single one
    text = re.sub(r'([!?])\1+', r'\1', text)
    # Collapse any repeated whitespace characters (spaces, tabs, etc.)
    text = re.sub(r'\s+', ' ', text)
    # Remove leading and trailing whitespace
    text = text.strip()
    # Remove emojis from the text
    text = emoji.replace_emoji(text, replace='')
    # Remove a trailing dot, comma or space since it is probably not needed
    while len(text) > 1 and text[-1] in [",", ".", " "]:
        text = text[:-1]
    return text

We can test it with:

# Test cases
text1 = "I think this is too expensive!!!!!!!!!!!!!!!!! 😡😡😫"
text2 = " Many spaces in this string ! LOL"
text3 = "Probably we don't need this last dot too."

result1 = remove_unnecessary_characters(text1)
result2 = remove_unnecessary_characters(text2)
result3 = remove_unnecessary_characters(text3)

print(result1) # Output: "I think this is too expensive!"
print(result2) # Output: "Many spaces in this string ! LOL"
print(result3) # Output: "Probably we don't need this last dot too"

You can further improve this code by removing even more characters. Let’s say you are sure you don’t need to process special characters such as @ and #; you can then use the re library to remove those as well, as in the small sketch below.
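For example, a small extension of the function above (the character list here is just an illustration; adjust it to what your application actually needs):

def remove_special_characters(text):
    # Strip characters we assume the model never needs, e.g. @ and #
    return re.sub(r'[@#]', '', text)

print(remove_special_characters("Please check the #pricing page or ping @support"))
# Output: "Please check the pricing page or ping support"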

Amazing! Now we know how to prevent unnecessary characters from inflating the token count of our OpenAI API requests!

In the next section we will look at another way that can help us further decrease the cost of running the API.

3. Reducing cost by caching the prompts

Imagine again that you have a chatbot that takes the user input, processes it with a GPT model and returns the response to the user. Oftentimes users will ask the same question, which will make the GPT model return the same answer. For example, let’s say your chatbot is fetching data from a PDF that you uploaded. The PDF is a static document that doesn’t change. In this case, if you have already processed a question before, there is a chance you will not need to send the request to the OpenAI API again. We can achieve that by keeping a cache that we look up before asking the API to process the input.

The flow without caching is as follows:

  1. User input message “How much does your product cost?”
  2. You send this message to the GPT model
  3. GPT model processes the data and returns the response with input and output token cost ($)
  4. You return the data to the user

Now, after introducing the cache to our flow:

  1. User input message “How much does your product cost?”
  2. Look in the cache to see if the message already exists.
    2.1 If it exists, fetch the response without any GPT cost. Skip to step 4.
    2.2 If it doesn’t exist, send the message to the GPT model. Then add the response to the cache.
  3. GPT model processes the data and returns the response with input and output token cost ($)
  4. You return the data to the user.

In Python, this is how we can implement caching:

# Use a dictionary as the cache for this simple example
cache = {}

def process_user_input(text):
    # Return the response from the cache if it is already there.
    response = cache.get(text, None)
    if response:
        return response
    # If not present in the cache, get the data from GPT.
    response = fetch_data_with_gpt(text)  # placeholder for your GPT call
    # Add the response to our cache dictionary
    cache[text] = response
    return response

Pretty cool! Now we know how to reduce costs by using a cache and preventing our app from calling the API when it isn’t necessary.

However, there is a flaw in our approach. It only works if the input text matches exactly what is present in the cache. For example, three users are asking the same question, essentially, but the strings are different:

1. How much does your product cost?
2. What is the price of your product?
3. What would be the price of your product?

So how can we preemptively generate data to further decrease our cost of running the OpenAI API? Let’s dive deep into it in the next section.

Before we move forward, I bet you’re just like me, eager for the latest and greatest AI content. That’s why I’ve put together a fantastic weekly newsletter featuring the top AI and automation news to boost your productivity. Get it straight to your inbox here.

4. Generate cache ahead of time

So now that you’ve subscribed to the newsletter we can proceed! 😄 As we build our AI-enhanced app, we will be expecting certain types of input from our users. In the example above, we want our chatbot to be able to answer “How much does your product cost?” without sending it to the OpenAI API. We can achieve that easily as long as the exact input text is present in the cache.

There are many ways of populating your cache ahead of time. Here are 3 interesting ways we can do it:

4.1 Use ChatGPT to generate cache for you

You can ask ChatGPT for help and add the generated data into your cache. An example is as follows:

Simulate 5 different ways that users can ask the same question as the one below:
"How much does your product cost?"

ChatGPT then gives us 5 different ways to ask the same question:

What is the price of your product?
Can you tell me the cost of your product?
How much do I need to pay for your product?
What's the price tag on your product?
How do I find out the cost of your product?

You can ask it to generate 5, 10, 20, 40 different ways. Another tip is to add in the prompt an instruction for ChatGPT to generate the list in a Python dictionary format, which you can easily copy and paste into your program:

Simulate 20 different ways that users can ask the same question
as the one below. Make it in Python dictionary format, where the
keys are the sentences and the values are "Example Response".
"How much does your product cost?"

Which ChatGPT will answer:

questions = {
    "What is the price of your product?": "Example Response",
    "Can you tell me the cost of your product?": "Example Response",
    "How much do I need to pay for your product?": "Example Response",
    "What's the price tag on your product?": "Example Response",
    "How do I find out the cost of your product?": "Example Response"
}

You then only need to replace “Example Response” with the actual response from your app.

Pro-tip: There is a small caveat with this approach too: string comparison in Python is case-sensitive, which means that What is chatgpt? is not the same as What is ChatGPT?. Before you store and compare data, make sure you apply the lower() method to your string:

response = cache.get(text.lower(), None)
...
cache[text.lower()] = response
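Putting this together with section 2.1, a sketch of a cache lookup that normalizes its keys could look like the following (it assumes the remove_unnecessary_characters helper and the fetch_data_with_gpt placeholder from the earlier examples):

def normalize_key(text):
    # Clean the string first, then lowercase it so "What is ChatGPT??" and
    # "what is chatgpt?" map to the same cache entry
    return remove_unnecessary_characters(text).lower()

def cached_response(text):
    key = normalize_key(text)
    response = cache.get(key)
    if response is None:
        response = fetch_data_with_gpt(text)  # placeholder from the earlier example
        cache[key] = response
    return response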

4.2 Use the OpenAI API to generate cache

Oftentimes we don’t know what users will input into our apps, which makes it hard to foresee and prepare the cache ahead of time. In that case, we can receive the user input, process it with the GPT models and then, before adding the result to the cache and returning it to the user, send another request to the API to dynamically generate the cache entries for us.

This approach seems to contradict our cost reduction strategy. However, based on your application’s duration and needs, it might be better to invest your cache first (pun intended) during the app’s development and testing cycles. It can later be turned off when no longer required.

An example implementation of this feature in Python would look like:

cache = {}

def process_user_input(text):
    # Return the response from the cache if it is already there.
    response = cache.get(text, None)
    if response:
        return response
    # If not present in the cache, get the data from GPT.
    response = fetch_data_with_gpt(text)  # placeholder for your GPT call
    # Generate cache data with a cheaper model; it returns a list of possible strings
    cache_data = generate_cache(model="gpt-3.5-turbo", text=text)
    # Add the user's input text to cache_data as well
    cache_data.append(text)
    # Then map every string in cache_data to the response in our cache
    for data in cache_data:
        cache[data] = response
    return response
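For reference, here is one possible shape of the generate_cache helper used above (a sketch only: the prompt wording is my own, and it reuses the client object from section 1 with gpt-3.5-turbo):

def generate_cache(model, text, n=5):
    # Ask a cheaper model for n paraphrases of the user's question, one per line
    prompt = (f"Rewrite the following question in {n} different ways, "
              f"one per line, without numbering:\n{text}")
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    lines = result.choices[0].message.content.split("\n")
    return [line.strip() for line in lines if line.strip()]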

Note that there are various ways to optimize this code, such as adding to and retrieving from the cache. However, we will omit this part as it goes beyond the scope of this article.

4.3 Use a free model to generate cache

We might be able to replicate the strategy above, but instead of using paid GPT models, we can use free models such as Llama 2 and others to perform this task for us.

The downside of this approach is that it requires quite some time to install and test these models.
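As a rough illustration only, generating paraphrases with a locally hosted model via the transformers library could look something like this (it assumes you have accepted the Llama 2 license on Hugging Face and have the hardware to load the 7B chat model; expect the raw output to need more cleanup than GPT's):

from transformers import pipeline

# Assumption: access to the gated meta-llama/Llama-2-7b-chat-hf weights
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

prompt = ('Rewrite the question "How much does your product cost?" '
          "in 5 different ways, one per line.")
output = generator(prompt, max_new_tokens=200)
print(output[0]["generated_text"])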

Absolutely superb! We’ve found yet another great way to reduce costs. Now, not only do we know how to remove unnecessary characters from the user’s input string, but we also know how to cache responses effectively and preemptively.

Are there still other ways that we can reduce the cost of running the OpenAI API for our AI enhanced applications?

Let’s take a look at how we can do that in the next section.

5. GPT-3 VS GPT-4: which one to pick and when?

At the beginning of the article we laid out the costs of running both the GPT-3 and GPT-4 models. There are pros and cons to each one and, most importantly, their costs are quite different. Knowing when to use a particular model can dramatically reduce the cost of running your app.

Let’s say your app is getting popular, and you’ve reached a usage of 2 million tokens in your OpenAI account. The difference in costs can be:

GPT-4: 90 USD for two million tokens (1m input + 1m output)
GPT-3: 3.5 USD for two million tokens (1m input + 1m output)

Multiply that usage by 12, assuming the numbers above are for a single month, and the difference is striking: 1,080 dollars a year for GPT-4 versus 42 dollars a year for GPT-3.5.
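The arithmetic behind those figures, using the per-1K-token prices from the tables in section 1 (the GPT-3 figure uses the turbo-instruct input rate):

def cost(input_tokens, output_tokens, input_price, output_price):
    # Prices are per 1K tokens, as in the OpenAI pricing tables
    return (input_tokens / 1000) * input_price + (output_tokens / 1000) * output_price

gpt4_month = cost(1_000_000, 1_000_000, 0.03, 0.06)      # 90.0 USD
gpt35_month = cost(1_000_000, 1_000_000, 0.0015, 0.002)  # 3.5 USD
print(gpt4_month * 12, gpt35_month * 12)                 # 1080.0 vs 42.0 USD per year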

As a rule of thumb, we should use GPT-3 to run our AI-powered apps whenever we know that the answers GPT-3 provides will be of the same quality as GPT-4’s. But how do we know that? The answer is simple: by testing. Performing several tests upfront can save you hundreds of dollars in the long run. We recommend using the Playground to test your prompts, or writing short test scripts in Python, so you can compare results before releasing to production.
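If you prefer scripted tests over the Playground, a short sketch along these lines (reusing the client from section 1; the helper name is my own) will print both models' answers side by side so you can judge the quality difference yourself:

def compare_models(prompt, models=("gpt-3.5-turbo", "gpt-4")):
    # Run the same prompt through each model so the answers can be compared
    for model in models:
        result = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        print(f"--- {model} ---")
        print(result.choices[0].message.content)

compare_models("Explain in exactly 50 words what is ChatGPT.")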

If you have a ChatGPT Plus subscription, you can also test your prompts in both the 3.5 and 4 models, which will give you a pretty good idea of the difference between the answers these models provide.

5.1 Differences between GPT-3 and GPT-4

OpenAI lays out a pretty informative guide in their documentation, so we recommend reading it. We went ahead and used ChatGPT to condense that text into a concise table:

| Aspect | gpt-4 | gpt-3.5-turbo |
|-----------|-----------------|-----------------|
| Performance | Better | Less capable |
| Instructions | Carefully follows | Follows just a part |
| Hallucination | Less likely | More likely |
| Context Window | 8,192 tokens | 4,096 tokens |
| Latency | Higher | Lower |
| Cost per Token | More expensive | Less expensive |

Basically, one should use GPT-4 for complex tasks, such as prompts that contain many instructions. Another way of finding out whether you need GPT-4 is to first run GPT-3, verify that the result it provides is wrong (or incomplete), then test the same prompt with GPT-4 and confirm that its result is what you expect most of the time.

Now that we know how to save money by using GPT-3 as much as possible, in the next section we will see how to reduce the cost of the input prompt by mastering how we craft it.

6. Mastering Prompts Towards Token Reduction

For this last section, we will focus on how to shorten our input prompts to further increase our savings.

Let’s say our application gives a score to a message based on how good or bad the grammar of that message is. We can craft the prompt as follows:

Try to extract a score from the email below and give it a score from 0 to 10 based on how 
grammatically correct the email is: 0 very bad grammar, 10 no errors at all:

To test it, we will have two inputs, one with bad grammar and another with better grammar:

msg_1 = "Dear Stevie, I woud like to kno how are you? I am contacting you becuase I feel we have'nt entered in contact for a while now."
msg_2 = "I would like to know how are you? I am contacting you because I feel we haven't talked for a while now."

Running both messages in ChatGPT with GPT-3.5, it gives us a score of 4 for msg_1 and 9 for msg_2.

How many tokens would the header prompt be? We can use tiktoken again to provide the value for us:

len(encoding.encode("""Try to extract a score from the email below and give it a score from 0 to 10 based on how grammatically correct the email is: 0 very bad grammar, 10 no errors at all"""))
>>> 42

We can always try to optimize the way we craft our prompt by reducing the number of words or replacing them with shorter, more precise ones. A pro-tip is to ask ChatGPT to shrink the prompt as much as possible. From there, it is our job to verify it still works properly and to add back any words that help the model give better quality results.

You are a prompt master. Your job is to reduce the following prompt to as few tokens as
possible without compromising the quality of the output.
Prompt:
Try to extract a score from the email below and give it a score from 0 to 10 based on how grammatically
correct the email is: 0 very bad grammar, 10 no errors at all

Which ChatGPT answers:

Rate email grammar: 0-10 (0=bad, 10=flawless).

That is a huge reduction in token count! Depending on your application you might need to add something like Return score only to the prompt, so the model refrains from giving a verbose answer. Let's see what the actual reduction in token count is:

len(encoding.encode("Try to give a priority score to this email based on how likely this email will leads to a good business opportunity, from 0 to 10; 10 most important"))
>>> 42
len(encoding.encode("Rate email grammar: 0-10 (0=bad, 10=flawless). Rate score only:"))
>>> 25

That was a 41% reduction in the number of tokens! Superb.

We can do the same for the output as well. Depending on the application we are building, adding instructions like “…in exactly 50 words”, “…give values only” and “do not explain it” can further reduce the cost of running your application by making the model return a specific token count, which also gives us a more deterministic result (engineers rejoice).

The ball is now in your court to craft the best prompts for your applications, keeping quality high and costs low.

Conclusion

Congratulations! We have finally reached the end of this article! Today we went deep into the world of API costs and uncovered:

  1. How OpenAI charges for their models
  2. What are tokens and how to count them
  3. How to reduce tokens from user generated input
  4. How to reduce tokens from engineer generated input prompt
  5. How to save costs by applying cache
  6. How to further save costs by generating cache ahead of time
  7. The difference in price between GPT-3 and GPT-4 models

I truly appreciate your time reading the entirety of this article and I hope you found it useful. Let me know how much your business has saved!

If you think this was a valuable article, please consider the following:

  • Subscribe to the newsletter
  • Implement today’s cost-saving techniques in your own GPT-powered apps, like the chatbot example we discussed in this article
  • Follow me in Medium for more in-depth AI articles

Have a wonderful evening! ⭐️ See you in the next article.



Paulo Marcos

AI applied to business | Software Engineer - AI Specialist | Founder of Wolfflow AI