Enhance Your GPT Experience: Tiktoken Unveiled — Free Token Counting for Prompts (with Python Code!)

Ilyes Rezgui
3 min readJul 7, 2023


If you’re a developer working with the OpenAI API and you need to determine the length of a prompt, you’ve come to the right place. In this blog post, we’ll walk through how to count the tokens in a given text using tiktoken, OpenAI’s open-source tokenizer.

1- What are tokens

Tokens are units that represent frequently occurring character sequences. When working with OpenAI models, tokens are the basic units the model uses to measure the length of a prompt. For instance, the word “tokenization” can be broken down by the tokenizer’s rules into “token” and “ization”, whereas a common short word like “sun” is treated as a single token.

You can try the UI provided by OpenAI to count tokens by clicking this link.

Figure 2: tokenization example on OpenAI’s UI (Source: author)

Once tokens are identified, they are mapped to IDs using the encoding’s vocabulary. You can think of it as a dictionary of key/value pairs where each token (key) is mapped to an integer (value) that represents it.
For the word “tokenization”, we can define the two tokens “token” and “ization”, where “token” is identified by the integer 30001 and “ization” by the integer 1634.

Figure 3: token IDs for the word “tokenization” (Source: author)
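To make the mapping concrete, here is a toy sketch of such a token-to-ID dictionary in Python. The IDs 30001 and 1634 are the illustrative values from the example above, not real tiktoken IDs; actual IDs depend on the encoding used.

```python
# A toy vocabulary mapping tokens to integer IDs
# (illustrative values from the example above; real IDs
# depend on the encoding used).
vocab = {"token": 30001, "ization": 1634}

def to_ids(tokens):
    """Look up each token's integer ID in the vocabulary."""
    return [vocab[t] for t in tokens]

print(to_ids(["token", "ization"]))  # [30001, 1634]
```

Real encodings work the same way conceptually, just with a vocabulary of tens of thousands of entries.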

2- The tiktoken encodings

To calculate the number of tokens in a given text we can use the tiktoken open-source tokenizer developed by OpenAI.

For different models, the company offers different encodings. The “cl100k_base” encoding is used when working with the gpt-4, gpt-3.5-turbo, and text-embedding-ada-002 models. The “p50k_base” encoding, on the other hand, is used with the Codex, text-davinci-002, and text-davinci-003 models. Finally, the “r50k_base” (or “gpt2”) encoding is used for GPT-3 models and is the same one offered by EleutherAI on Hugging Face.

You can check the GPT-Neo tokenizer via this link.

Figure 4: classification of encoding names based on models (Source: OpenAI)
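The model-to-encoding mapping above can be sketched as a simple Python lookup. This is a simplified, illustrative dictionary based on the models listed here; `tiktoken.encoding_for_model()` is the authoritative source for the full mapping.

```python
# Simplified model-to-encoding mapping (illustrative;
# tiktoken.encoding_for_model() is the authoritative source).
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-003": "p50k_base",
    "text-davinci-002": "p50k_base",
    "davinci": "r50k_base",  # GPT-3
}

def encoding_name_for(model):
    """Return the encoding name for a model from the table above."""
    return MODEL_TO_ENCODING[model]

print(encoding_name_for("gpt-4"))            # cl100k_base
print(encoding_name_for("text-davinci-003")) # p50k_base
```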

3- Python implementation

First, let’s install it with pip, so type in your terminal:

pip install tiktoken

Once the tool is installed, we can start writing our Python code:

# First, import the library
import tiktoken

# Use tiktoken.get_encoding() to load an encoding by its name
encoding = tiktoken.get_encoding("p50k_base")

# Use tiktoken.encoding_for_model() to automatically load the correct
# encoding for a given model name
encoding = tiktoken.encoding_for_model("text-davinci-003")

# To convert a text string into a list of token IDs, use the encode() method
print(encoding.encode("Hello community !"))

# To get the number of tokens in that string, take the length of the list
print(len(encoding.encode("Hello community !")))

4- Execution

Figure 5 shows an example of tokenization using the “p50k_base” encoding for the model “text-davinci-003”.

Figure 5: execution example (Source : author)
