GPT-3 tokens: what they are and how they work

Muhammad Arsalan Akram
7 min read · Mar 7, 2023



Text is generated token by token by large language models. But what exactly are tokens? What is their relationship to entropy and data compression?

Text is generated by generative language models (you may have heard of ChatGPT, GPT-3, Cohere, Google LaMDA, BERT, GPT-2, and so on). Given some text as input, they can make very good guesses about what comes next. At Quickchat, we use them as critical components of our chat engine, which powers Emerson AI and our business products.

For more articles like these visit: Data Science and Technology

Surprisingly, generative language models do not generate text word by word or letter by letter, but rather token by token. In this blog post, I'll discuss tokens, entropy, and compression, and leave you with an unanswered question.

Why are tokens used?

The basic answer is fairly self-evident: there are too many words, there are too few letters, and tokens are a good compromise. When training Machine Learning models, representation is critical: we want to present the model with data in a way that allows it to pick up on the important features.

OpenAI has released a very cool tool that allows you to experiment with the text tokenization that they use for GPT-3. Let us use it to gain some insights by tokenizing the sentence "I think your codebase is exceedingly overpythonized".
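If you prefer to experiment in code rather than in the browser, a rough equivalent is the open-source tiktoken library. This is just a minimal sketch, assuming the r50k_base encoding (the GPT-3-era vocabulary; the web tool may use a slightly different one):

```python
# A rough code equivalent of the web tool, using the open-source tiktoken library
# (pip install tiktoken). r50k_base is assumed here as the GPT-3-era vocabulary.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

sentence = "I think your codebase is exceedingly overpythonized"
token_ids = enc.encode(sentence)

# Decode each token id back into the text fragment it stands for.
pieces = [enc.decode([t]) for t in token_ids]
print(len(token_ids), "tokens:", pieces)
```

Running it, you should see the made-up word broken into a handful of familiar subword pieces rather than treated as a single token.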

Yes, I coined a new term. There is no dictionary in the world that has an entry for overpythonized. Google claims that the word does not exist anywhere on the vast Internet.

Nonetheless, you and I have a rough idea of what overpythonized might mean. That’s because we understand terms like over-parametrized and Python.

That is the central concept of subword tokenization. From its training data, your model can learn nothing about overpythonized, but there is a lot to learn about over, python, and ized.

Now let's loosely translate "I think your codebase is exceedingly overpythonized" into Polish: "Wasz kod jest zdecydowanie przepythonowany". As expected, many more tokens are now required; we'll get to that in a minute.

But another tragic event occurred here. przepythonowany has been tokenized into Pr, z, ep, ython, ow, and any. Believe it or not, the prefix prze is very common in Polish. However, the tokenization we are employing is based on English text. As a result, python is not one of the tokens here, which means that, as far as the model is concerned, our sentence has nothing to do with the programming language.
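You can check the fragmentation yourself with the same sketch as before (again assuming the r50k_base encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # the same English-centric vocabulary as before

polish = "Wasz kod jest zdecydowanie przepythonowany"
pieces = [enc.decode([t]) for t in enc.encode(polish)]

print(len(pieces), "tokens:", pieces)
# Does "python" survive as a standalone token? (The text above says it does not.)
print(any(p.strip() == "python" for p in pieces))
```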

Text that is the most cost-effective to feed into GPT-3

Tokenization is a kind of text encoding. There are numerous ways to encode text, and numerous reasons for doing so. A classic example is encoding text in order to compress it: the basic idea is to assign short codes to frequently used symbols. That is the foundation of Information Theory.

It’s intriguing to consider what tokenization (the encoding described above) was designed to optimize for. Tokenization is intended to improve the performance of language models such as GPT-3. Of course, it isn’t anywhere near globally optimal because it should have been trained alongside the language model itself, whereas it is instead treated as an external component with which the model must interact.

Tokenization has certainly not been optimized to minimize the number of tokens per word, which is something to consider given that some of these models charge per token rather than per character. Let's see what kind of text is the most cost-effective to feed into GPT-3.

TEXT | TOKENS PER WORD
Facebook’s Terms of Service | 1.17
Leonardo Di Caprio’s Oscar acceptance speech | 1.20
🇬🇧 Wikipedia article on Philosophy (Simple English) | 1.20
Harry Potter and the Philosopher’s Stone (first chapter) | 1.28
OpenAI’s GPT-3 paper | 1.29
“I think your codebase is exceedingly overpythonized” | 1.43
Shakespeare’s Sonnet 18 | 1.46
🇬🇧 Wikipedia article on Philosophy | 1.46
Kanye West’s Stronger lyrics | 1.49
Queen’s Bohemian Rhapsody lyrics | 1.58
Taylor Swift’s Shake it off lyrics | 1.60
Luis Fonsi’s Despacito lyrics | 2.26
🇪🇸 Wikipedia article on Philosophy | 2.32
🇮🇹 Wikipedia article on Philosophy | 2.48
🇩🇪 Wikipedia article on Philosophy | 2.88
“Wasz kod jest zdecydowanie przepythonowany” | 3.40
🇵🇱 Wikipedia article on Philosophy | 4.08
🇰🇷 Wikipedia article on Philosophy | 8.67
🇨🇳 Wikipedia article on Philosophy | 13.13

What is the intuition here? The more similar a text is to typical text found on the Internet, the fewer tokens per word it needs. In a typical English text, only about one in every 4–5 words fails to map directly to a single token.
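Numbers like the ones in the table are easy to approximate for any text you like. Here is a minimal sketch, again assuming tiktoken's r50k_base encoding and a naive whitespace definition of a word, so the exact figures may differ slightly from the table:

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

def tokens_per_word(text: str) -> float:
    """Number of BPE tokens divided by a naive whitespace word count."""
    words = text.split()
    return len(enc.encode(text)) / max(len(words), 1)

print(tokens_per_word("I think your codebase is exceedingly overpythonized"))
print(tokens_per_word("Wasz kod jest zdecydowanie przepythonowany"))
```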

Entropy

Consider another kind of encoding, one whose goal is to compress text. The classic example here is Huffman coding, which produces an optimal encoding for a language, say English, given the frequencies of its symbols:

CHARACTER | FREQUENCY | CODE
SPACE | 16.2% | 111
e | 10.3% | 010
o | 8.2% | 1101
t | 6.9% | 1011
a | 6.2% | 1000
s | 5.8% | 0111
n | 5.5% | 0110
r | 5.0% | 0010
i | 4.0% | 11001
u | 3.5% | 10101
d | 3.4% | 10100
c | 3.3% | 10010
l | 2.6% | 00110
h | 2.4% | 00010
p | 2.2% | 00001
m | 1.9% | 110000
y | 1.7% | 100111
g | 1.4% | 001111
, | 1.3% | 001110
f | 1.3% | 000111

The table above shows the best binary code to use for each of the 20 most common symbols. The most common according to what, exactly? Excellent question!

Huffman coding tells us how to optimally encode symbols to compress the English language, but first we must define the English language. The table above is based on a sample from Facebook’s Terms of Service, which was mentioned earlier. So the assumption here is that any English text that comes our way will be similar to the sample we used.
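For the curious, a code like the one in the table can be built in a few lines. Below is a sketch fed with a handful of the frequencies above; the exact bit patterns depend on how ties are broken, so they may differ from the table, but the code lengths will be optimal for the given frequencies:

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman code (symbol -> bit string) from symbol frequencies."""
    tiebreak = count()  # keeps heap comparisons away from the dict payloads
    # Each heap entry: (total frequency, tie-breaker, {symbol: code_so_far})
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends 0 to one side's codes and 1 to the other's.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# A few of the frequencies from the table above (they don't need to sum to 100%).
freqs = {"SPACE": 16.2, "e": 10.3, "o": 8.2, "t": 6.9, "a": 6.2, "s": 5.8}
print(huffman_code(freqs))
```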

While Huffman coding provides optimal encoding with compression in mind, there must also be an optimal tokenization algorithm with the “GPT-3, predict the next token” task in mind. Given some English-language text and a language model architecture (say, a Transformer), it would determine the best tokenization to use. That task would be extremely difficult because everything would have to be trained together and thus structured to be differentiable, among other complications. So let’s leave it at that.

Returning to the topic of compression, since we know how to find the best encoding for any symbol frequency distribution, a single quantity, Entropy, defines the smallest theoretically possible number of bits per symbol for a given data source, say the English language. The general rule is that the more uniformly symbol frequencies are distributed, the more bits per symbol we require.

As an example, consider the results of a biased coin flip as a data source. Regardless of how much you tell me about the coin beforehand, if it is well-balanced (heads and tails 50% of the time each), you must transmit 1 bit per coin flip. If, on the other hand, you tell me ahead of time that it's a magical coin that always comes up heads, I need zero bits of information from then on to accurately predict all future outcomes.
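The coin intuition is easy to check numerically with the entropy formula introduced in the next section. A minimal sketch, with logarithms in base 2 so the answer comes out in bits:

```python
import math

def entropy_bits(probs):
    """Average bits per symbol of an optimally encoded source (logarithm in base 2)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))  # fair coin: 1.0 bit per flip
print(entropy_bits([1.0]))       # coin that always comes up heads: 0.0 bits
print(entropy_bits([0.9, 0.1]))  # a heavily biased coin lands in between (~0.47 bits)
```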

The unanswered question

The entropy formula tells us how to calculate the average number of bits per symbol for a data source (assuming optimal encoding):

H = - Σ p(x) * log(p(x)), where the sum runs over all symbols x and p(x) is the frequency of symbol x.

A symbol with frequency p will be encoded with approximately -log(p) bits (with the logarithm in base 2 if we want the answer in bits). Multiplying that by the symbol's frequency gives its contribution to entropy: -p * log(p). The formula above is simply the sum of those contributions over all symbols. In the examples below the logarithms are in base 10, which only rescales the numbers and doesn't change the comparison.

It’s intriguing to consider what influences a symbol distribution’s average number of bits per symbol. Consider a source that produces one of 100 symbols, each with the same frequency of 1%. Entropy would be increased by -0.01 * log(0.01) = 0.02 for each of these symbols.

What about a source with 4 symbols that have frequencies of 70%, 10%, 10%, and 10%? The 10% would contribute 0.10, while the 70% would only contribute slightly more: close to 0.11. There is a trade-off here: a much more common symbol with a shorter code can contribute nearly the same amount.
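These numbers are easy to verify; they come out as stated when the logarithm is taken in base 10, matching the text above:

```python
import math

def contribution(p):
    """One symbol's contribution to entropy, -p * log10(p), matching the numbers above."""
    return -p * math.log10(p)

print(contribution(0.01))  # each of 100 equally likely symbols: ~0.02
print(contribution(0.10))  # a 10% symbol: ~0.10
print(contribution(0.70))  # a 70% symbol: ~0.11, despite its much shorter code
```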

This leads us to the final question: what is the frequency of a symbol that contributes the most to entropy?

It’s 36.79%!

What is the significance of that particular number? That’s part of the unanswered question I was referring to.

The amazing thing is that if you do the math, that number is exactly 1 / e. Yes, e, as in f(x) = e^x, the function that is always equal to its own derivative. And e, as in: if you invest $1,000 at a 100% annual return compounded infinitely many times, you'll end up with e times $1,000 (approximately $2,718) after a year. And e, as in:

e = lim (1 + 1/n)^n as n → ∞
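The derivation itself is short: with natural logarithms, the derivative of -p * ln(p) is -(ln(p) + 1), which is zero exactly at p = 1/e, and that same point maximizes -p * log(p) in any base. Here is a quick numerical sanity check, a sketch that just sweeps p over a fine grid:

```python
import math

# Sweep p over (0, 1) on a fine grid and find where -p * log(p) peaks.
ps = [i / 1_000_000 for i in range(1, 1_000_000)]
best_p = max(ps, key=lambda p: -p * math.log(p))

print(best_p)      # ~0.367879
print(1 / math.e)  # 0.3678794411714423
```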

You’ve solved it if you can show me an intuitive connection between e and the frequency of a symbol that contributes the most to the entropy of a data source.

By the way, I don't have an answer, and neither do any of the people I've asked over the years; it may well be that an explanation more convincing than the derivation above simply does not exist.

For more articles like these visit: Data Science and Technology



Muhammad Arsalan Akram

I’m a highly driven Data Science and Machine Learning practitioner with experience in data reading, data cleaning, data imputation and manipulation, and data visualization.