Comparing GPT Tokenizers

Sweety Tripathi
7 min read · May 1, 2023


Breaking Down the GPT-2 and GPT-3 Tokenizers

Photo by Raphael Schaller on Unsplash

Transformers are powerful models that have revolutionized natural language processing (NLP) by achieving state-of-the-art performance on a variety of NLP tasks. One of the key components of these models is the tokenizer. In this article, we will discuss the tokenizer technology used in GPT-2 and GPT-3, and compare the pros and cons of each tokenizer.

GPT Tokenizer Technology

GPT-2 and GPT-3 both use byte pair encoding (BPE) as their tokenizer technology. BPE is a data compression technique that is widely used in NLP for tokenization.

Byte-Pair Encoding (BPE) was introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016). BPE works by iteratively merging the most frequent pair of symbols (initially individual characters) in a corpus into a single token until a desired vocabulary size is reached. This approach has been shown to be effective in capturing both morphological and semantic information in text.
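To make the merge loop concrete, here is a minimal, illustrative sketch of BPE training on the toy corpus from that paper (low, lower, newest, widest). It is not the exact GPT implementation, which operates on raw bytes and includes a pre-tokenization step, but the merge logic is the same idea.

from collections import Counter

# Toy corpus: words split into symbols, with word frequencies.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus and return the most frequent one."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Each merge adds one new token to the vocabulary.
for step in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {''.join(pair)}")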

GPT-2 Tokenizer

The GPT-2 tokenizer is available as an open-source library on Hugging Face’s Transformers platform, which provides access to a wide range of pre-trained language models, tokenizers, and other NLP tools.

To illustrate the GPT tokenizer, I am using the GPT2TokenizerFast class from the transformers library in Python. This tokenizer has been trained to treat spaces as part of the tokens (a bit like SentencePiece), so a word will be encoded differently depending on whether it appears at the beginning of a sentence (without a leading space) or not.

Here’s an example of how to use the tokenizer to encode a sentence:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# With no space at the start
input_text = "Hello world"
# Encode the input text into token IDs
input_ids = tokenizer.encode(input_text)
# Convert the token IDs back into tokens
tokens = tokenizer.convert_ids_to_tokens(input_ids)
# Print the tokens
print(tokens)
print("No of tokens:", len(tokenizer(input_text)["input_ids"]))

## Output: ['Hello', 'Ġworld']
## No of tokens: 2

In this example, we first import the GPT2TokenizerFast class from the transformers library and create an instance of the tokenizer by loading the pre-trained gpt2 model. We then pass a sentence “Hello world” to the tokenizer’s encode method, which converts the sentence into a sequence of token IDs. Finally, we convert the token IDs back into their corresponding tokens using the convert_ids_to_tokens method, which returns a list of tokens.
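As a quick sanity check, the token IDs can be decoded back into text with the tokenizer’s decode method; the “Ġ” prefix in the token list simply marks that a token begins with a space.

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
input_ids = tokenizer.encode("Hello world")

# decode() reverses the encoding; the 'Ġ' marker becomes an ordinary space again.
print(tokenizer.decode(input_ids))  # Hello world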

GPT-3 Tokenizer

GPT-3 also uses a byte pair encoding (BPE) tokenizer, a variant of the one used by GPT-2 that is able to handle a much larger vocabulary than the GPT-2 tokenizer.

Tiktoken is a fast BPE tokenizer for use with OpenAI’s models. It exposes APIs for converting text to and from tokens.

ChatGPT itself is not an open-source model, but its tokenizers have been open-sourced: first GPT2TokenizerFast (in Transformers) and now tiktoken (which is maintained by OpenAI themselves).

import tiktoken

def num_tokens_from_string(context, encoding_name):
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(context))
    return num_tokens

# Here I used the model 'text-davinci-003', so the encoding is "p50k_base"
print("No of tokens:", num_tokens_from_string("Hello World", "p50k_base"))

## No of tokens: 2

How do we determine the encoding for a particular model?
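Rather than hard-coding the encoding name, tiktoken can look it up from the model name. The sketch below relies on the model-to-encoding mapping shipped with tiktoken at the time of writing:

import tiktoken

# tiktoken maps model names to their encodings, so you don't have to remember them.
print(tiktoken.encoding_for_model("text-davinci-003").name)  # p50k_base
print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)     # cl100k_base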

Comparing encodings

Different encodings vary in how they split words, group spaces, and handle non-English characters. The tiktoken GitHub repository (see the references) is where you can explore more about the models and their encodings.

Using the methods above, we can compare the gpt2, p50k_base (GPT-3), and cl100k_base (GPT-3.5) encodings on a few example strings.

import tiktoken

def compare_encodings(example_string):
    """Prints a comparison of three string encodings."""
    print(f'Example string: "{example_string}"')
    # for each encoding, print the number of tokens, the token integers, and the token bytes
    for encoding_name in ["gpt2", "p50k_base", "cl100k_base"]:
        encoding = tiktoken.get_encoding(encoding_name)
        token_integers = encoding.encode(example_string)
        num_tokens = len(token_integers)
        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]
        print()
        print(f"{encoding_name}: {num_tokens} tokens")
        print(f"token integers: {token_integers}")
        print(f"token bytes: {token_bytes}")

compare_encodings("Hello World")
# Example string: "Hello World"

# gpt2: 2 tokens
# token integers: [15496, 2159]
# token bytes: [b'Hello', b' World']

# p50k_base: 2 tokens
# token integers: [15496, 2159]
# token bytes: [b'Hello', b' World']

# cl100k_base: 2 tokens
# token integers: [9906, 4435]
# token bytes: [b'Hello', b' World']

compare_encodings(" Hello World")
# Example string: " Hello World"

# gpt2: 2 tokens
# token integers: [18435, 2159]
# token bytes: [b' Hello', b' World']

# p50k_base: 2 tokens
# token integers: [18435, 2159]
# token bytes: [b' Hello', b' World']

# cl100k_base: 2 tokens
# token integers: [22691, 4435]
# token bytes: [b' Hello', b' World']

compare_encodings("  Hello World")
# Example string: "  Hello World"

# gpt2: 3 tokens
# token integers: [220, 18435, 2159]
# token bytes: [b' ', b' Hello', b' World']

# p50k_base: 3 tokens
# token integers: [220, 18435, 2159]
# token bytes: [b' ', b' Hello', b' World']

# cl100k_base: 3 tokens
# token integers: [220, 22691, 4435]
# token bytes: [b' ', b' Hello', b' World']

I don’t have any proof from OpenAI sources, but in the examples above the same example_string produces identical token IDs under gpt2 and p50k_base, which suggests that, under the hood, the BPE merges used by the two models are essentially the same.
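One way to probe this claim (again, without any official confirmation) is to compare the vocabulary sizes and spot-check that ordinary text gets identical token IDs under both encodings; to my understanding, p50k_base mostly adds a handful of extra special tokens on top of the gpt2 vocabulary.

import tiktoken

gpt2_enc = tiktoken.get_encoding("gpt2")
p50k_enc = tiktoken.get_encoding("p50k_base")

# Vocabulary sizes: p50k_base is only slightly larger than gpt2.
print("gpt2:", gpt2_enc.n_vocab, "p50k_base:", p50k_enc.n_vocab)

# Spot-check that ordinary text tokenizes identically under both encodings.
samples = ["Hello World", " Hello World", "Tokenizers are fun!", "12345"]
for text in samples:
    assert gpt2_enc.encode(text) == p50k_enc.encode(text)
print("All sample strings produce identical token IDs under gpt2 and p50k_base.")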

If you’d like to tokenize text programmatically, use tiktoken, a fast BPE tokenizer built specifically for OpenAI models.

Alternatively, you can use OpenAI’s interactive Tokenizer tool, which lets you calculate the number of tokens and see how text is broken into tokens.

GPT-2 vs GPT-3 Tokenizers

  1. Vocabulary Size: GPT-2 has a vocabulary of 50,257 tokens, while GPT-3’s newer encodings use a much larger vocabulary (we don’t have access to the embedding matrices of the newer models, but we can inspect the tokens in the cl100k_base vocabulary and experiment with them). A larger vocabulary lets the model capture more fine-grained information in text than GPT-2.
  2. Pre-training Data: GPT-3 was trained on roughly 570 GB of text, compared with GPT-2’s 40 GB, including web pages, books, articles, and more. This gives GPT-3 a more diverse and representative understanding of language and world knowledge.
  3. Encoding Algorithm: While both GPT-2 and GPT-3 use Byte Pair Encoding (BPE) for subword tokenization, GPT-3’s tokenizer uses a variation of the BPE algorithm that can capture more fine-grained linguistic structure.
  4. Time: performance measured on 1 GB of text using GPT2TokenizerFast and tiktoken.

Both tiktoken and the Hugging Face tokenizers are effective at processing data, but tiktoken appears to be significantly faster in this particular comparison. However, many factors can influence processing time, and the optimal choice will depend on the specific use case and requirements.
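If you want a rough feel for the speed difference yourself, a simple (and admittedly unscientific) timing sketch like the one below will do; the repeated sample string is a placeholder for a real corpus, and the absolute numbers will vary by machine and library version.

import time
import tiktoken
from transformers import GPT2TokenizerFast

# Placeholder corpus: substitute your own large text sample for a meaningful comparison.
text = "Hello world, this is a small benchmark sentence. " * 10_000

hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tt_encoding = tiktoken.get_encoding("gpt2")

start = time.perf_counter()
hf_ids = hf_tokenizer(text)["input_ids"]
hf_time = time.perf_counter() - start

start = time.perf_counter()
tt_ids = tt_encoding.encode(text)
tt_time = time.perf_counter() - start

print(f"GPT2TokenizerFast: {len(hf_ids)} tokens in {hf_time:.3f}s")
print(f"tiktoken:          {len(tt_ids)} tokens in {tt_time:.3f}s")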

  5. Token Pricing:

GPT-2 Tokenizer: Hugging Face provides access to pre-trained GPT-2 models and tokenizers through their Transformers library, which is an open-source software library for NLP. The library can be used for free, and the code is available on GitHub for anyone to download and use.

GPT-3 Tokenizer: The API offers multiple model types at different price points. Each model has a spectrum of capabilities, with davinci being the most capable and ada the fastest. Requests to these different models are priced differently; you can find details on token pricing here. Also, the total cost includes not only the tokens fed into GPT-3 but also the tokens it generates.
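As a back-of-the-envelope illustration, token counts from tiktoken can be combined with the published per-token prices to estimate what a request will cost. The price below is a made-up placeholder, not an actual OpenAI rate; check the pricing page for current numbers.

import tiktoken

def estimate_cost(prompt, completion, encoding_name="p50k_base",
                  price_per_1k_tokens=0.02):  # placeholder price, not an official rate
    """Rough cost estimate: both prompt and completion tokens are billed."""
    encoding = tiktoken.get_encoding(encoding_name)
    total_tokens = len(encoding.encode(prompt)) + len(encoding.encode(completion))
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens

tokens, cost = estimate_cost("Summarize this article about tokenizers.",
                             "The article compares the GPT-2 and GPT-3 tokenizers.")
print(f"{tokens} tokens, estimated cost ${cost:.5f}")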

Pros and Cons of Each Tokenizer

Both GPT-2 and GPT-3 tokenizers have their strengths and weaknesses. Here are some pros and cons of each tokenizer:

GPT-2 Tokenizer

Pros:

  • Smaller vocabulary size makes it more efficient and easier to use
  • Works well for smaller-scale NLP tasks
  • Open source and freely available

Cons:

  • May not capture fine-grained information in text as well as the GPT-3 tokenizer
  • May struggle with more complex NLP tasks that require more context

GPT-3 Tokenizer

Pros:

  • Larger vocabulary size allows it to capture more fine-grained information in text
  • More efficient BPE algorithm allows it to handle larger vocabularies without sacrificing performance
  • Works well for complex NLP tasks that require more context

Cons:

  • Larger vocabulary size can make it more memory-intensive and slower to use
  • More complex and harder to fine-tune than the GPT-2 tokenizer
  • Tied to a paid API, priced per token

Conclusion

In general, the choice between GPT-2 and GPT-3 tokenizers depends on the specific needs of the NLP task at hand. For simpler tasks that don’t require a lot of context, GPT-2 may be sufficient and more efficient to use. However, for more complex tasks that require more context and fine-grained information in text, GPT-3 may be a better choice.

Thank you for reading!🤗I hope that you found this article both informative and enjoyable to read.

For more information like this, follow me on LinkedIn

References:

  1. https://huggingface.co/transformers/v3.2.0/model_doc/gpt2.html
  2. https://en.wikipedia.org/wiki/GPT-2#Generative_Pre-trained_Transformers
  3. https://enjoymachinelearning.com/blog/the-gpt-3-vocabulary-size/
  4. https://en.wikipedia.org/wiki/Byte_pair_encoding
  5. https://github.com/huggingface/tokenizers
  6. https://github.com/openai/tiktoken
