AI Technology

‘Breaking Down’ Tokenizers in LLMs

An introduction to tokenizers and their implications in language models.

Semin Cheon
SqueezeBits Team Blog

--


Tokenization is the first and fundamental step in the NLP pipeline. It is the process of translating natural language (text input) into a format (numbers) that a machine learning model can understand. Tokenizers break the text down into smaller pieces, such as words or characters, called ‘tokens.’ Together, the tokens form a vocabulary, and each token is later encoded as numerical data. These numbers are fed into the machine learning model, making it easier for the model to find patterns and relationships.
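
To make this text-to-numbers flow concrete, here is a minimal sketch assuming the Hugging Face transformers library (one of the tokenizer tools mentioned below); the bert-base-uncased checkpoint is an arbitrary choice made purely for illustration.

```python
# Minimal sketch of the text -> tokens -> numbers flow, assuming the
# Hugging Face `transformers` library and the `bert-base-uncased` checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers break text into smaller pieces."
tokens = tokenizer.tokenize(text)               # text -> subword tokens (strings)
ids = tokenizer.convert_tokens_to_ids(tokens)   # tokens -> numerical IDs

print(tokens)   # the pieces the tokenizer produced
print(ids)      # the numbers the model actually consumes
```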

Several types of tokenizers are available for public use as open-source preprocessing tools. Well-known examples include the Natural Language Toolkit (NLTK), spaCy, the BERT tokenizer, and Hugging Face’s AutoTokenizer. Each employs a different tokenization method and will therefore split the same text differently. More specifically, tokenization methods can be categorized by their level of granularity: coarse methods include word-based tokenization and whitespace tokenization, while character-based tokenization is an example of a more granular method.
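
As a rough illustration of these granularity levels, the sketch below assumes the nltk package and contrasts whitespace, word-based, and character-based splits of the same sentence.

```python
# Contrasting coarse and granular tokenization of the same sentence,
# assuming the `nltk` package. wordpunct_tokenize is a simple regex-based
# word tokenizer that requires no extra data downloads.
from nltk.tokenize import wordpunct_tokenize

text = "Tokenization is the first step in the NLP pipeline."

whitespace_tokens = text.split()          # whitespace tokenization (coarse)
word_tokens = wordpunct_tokenize(text)    # word-based tokenization (coarse)
char_tokens = list(text)                  # character-based tokenization (granular)

print(len(whitespace_tokens), whitespace_tokens)
print(len(word_tokens), word_tokens)
print(len(char_tokens))                   # many more pieces than the word-level splits
```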

OpenAI GPT-3.5 & GPT-4 Tokenizer

The granularity of the tokenization method determines how much token data is produced from the text. More granular methods produce more tokens, which means more memory, more processing power, and more time downstream (Kuzminykh, 2024). Coarser methods consume fewer computational resources but risk losing semantic detail and expressiveness. The ideal tokenizer hits the sweet spot between computational efficiency and linguistic depth.
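
A quick way to see this trade-off is to count how long the same sentence becomes at each granularity. The sketch below assumes the tiktoken package and uses its cl100k_base encoding (the one used by GPT-3.5/GPT-4-era models) for the subword count.

```python
# How granularity changes the amount of token data a model must process,
# assuming the `tiktoken` package and its "cl100k_base" encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Granularity determines how much data a model has to process downstream."

print("characters:    ", len(text))              # most granular: longest sequence
print("subword tokens:", len(enc.encode(text)))  # middle ground
print("words:         ", len(text.split()))      # coarsest: shortest sequence
```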

Many contend that ‘subword-based’ tokenization is the right level of granularity, sitting between word-based and character-based tokenization. Today’s popular transformer models, such as OpenAI’s GPT series, employ subword-based tokenization to increase efficiency. Common subword-based methods include Byte Pair Encoding (BPE), WordPiece (used in BERT), and SentencePiece (from Google).
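
To see how these schemes differ in practice, the hedged sketch below tokenizes the same word with each. It assumes the Hugging Face transformers library and uses the gpt2, bert-base-uncased, and t5-small checkpoints as stand-ins for BPE, WordPiece, and SentencePiece respectively.

```python
# Comparing three subword schemes on the same word, assuming the Hugging Face
# `transformers` library and these particular checkpoints:
# GPT-2 (byte-level BPE), BERT (WordPiece), and T5 (SentencePiece).
from transformers import AutoTokenizer

word = "tokenization"
for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:18s} -> {tok.tokenize(word)}")

# Each scheme splits the word into subwords but marks word boundaries
# differently (e.g. '##' continuation pieces in WordPiece, '▁' prefixes
# in SentencePiece).
```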

Byte Pair Encoding (BPE), in particular, has gained popularity because it is at the core of the GPT models. BPE was originally a data compression technique from 1994, later adapted for tokenization by Sennrich et al. in 2016. The BPE algorithm iteratively searches for the most frequent pair of adjacent symbols and merges it into a new symbol. These merged symbols are added to the vocabulary until a sufficient vocabulary size is reached.
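
The core of the algorithm fits in a few lines. The following is a simplified sketch of the merge loop on a toy corpus, written in the spirit of the reference procedure in Sennrich et al.’s paper; it is illustrative only, not the exact code used by GPT tokenizers.

```python
# Simplified BPE training loop: repeatedly merge the most frequent pair.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated character sequence with an
# end-of-word marker, weighted by its frequency.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):                      # 10 merges for illustration
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)               # each merge adds a new vocabulary symbol
```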

The BPE tokenization scheme is commonly favored because it compresses text well, meaning larger amounts of text can be represented by shorter sequences of tokens. It also helps the language model handle rare or unfamiliar words, mitigating the issues associated with ‘Out-of-Vocabulary (OOV)’ words. This is especially useful for texts from specialized domains.
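
As a small illustration of OOV handling, the sketch below (again assuming the tiktoken package and its cl100k_base encoding) encodes a rare domain term and shows that it is split into known subword pieces and decoded back without loss.

```python
# How subword tokenization handles a rare, domain-specific term,
# assuming the `tiktoken` package and its "cl100k_base" encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

rare_word = "electroencephalography"      # unlikely to be a single vocabulary entry
token_ids = enc.encode(rare_word)

# The word is split into several known subword pieces rather than being
# mapped to an unknown token, so no information is lost.
print([enc.decode([t]) for t in token_ids])
print(enc.decode(token_ids) == rare_word)  # round-trip is lossless -> True
```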

An important point in understanding tokenizers is that they have their own training algorithm and training set, separate from the model. Through training, the tokenizer finds the best subwords in the text according to its algorithm. This is a statistical and deterministic process that yields the same result every time, unlike model training, which relies on stochastic gradient descent. For this reason, tokenizer training has been described as ‘a completely separate stage of the LLM pipeline’ (Jindal, 2024).
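
A hedged sketch of this separate training stage is shown below, following the idea in the Hugging Face NLP course chapter ‘Training a new tokenizer from an old one’ (see References); the tiny code-snippet corpus is a stand-in for a real domain-specific dataset.

```python
# (Re)training a tokenizer separately from the model, assuming the Hugging Face
# `transformers` library; the corpus here is a toy stand-in for real data.
from transformers import AutoTokenizer

corpus = [
    "def forward(self, x): return self.linear(x)",
    "class Tokenizer: pass",
    "import torch.nn as nn",
]

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Deterministically learns a new subword vocabulary from the corpus,
# reusing the old tokenizer's algorithm and configuration.
new_tokenizer = old_tokenizer.train_new_from_iterator(iter(corpus), vocab_size=1000)

print(new_tokenizer.tokenize("def tokenize(self, text):"))
```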

However, despite great strides in tokenizer development, issues still arise from the limitations of tokenizers’ multilingual capabilities. Recent studies (Ahia, 2023) on the effects of subword tokenization in LLMs indicate that there are ‘disparities’ in how different languages are tokenized, leading to higher costs for particular languages. Even when the text conveys the same information, Latin-based scripts are less fragmented and are represented with fewer tokens than other scripts. Heavily segmented, over-fragmented languages such as Thai cost more under ChatGPT’s pricing scheme. This makes language models unfairly expensive for speakers from less privileged communities. For instance, a user who communicates in Telugu, the official language of Andhra Pradesh, India, may spend about five times more than an English-speaking user from the US on the same model. The authors highlight the need to rethink pricing models so that minority populations are not excluded from language technologies.
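
The sketch below gives a rough feel for this disparity, assuming the tiktoken package; the Thai and Telugu strings are informal greetings included purely for illustration, not exact translations of the English sentence.

```python
# Rough illustration of the tokenization disparity across scripts,
# assuming the `tiktoken` package and its "cl100k_base" encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you?",
    "Thai": "สวัสดี สบายดีไหม",                 # informal greeting, for illustration
    "Telugu": "నమస్కారం, మీరు ఎలా ఉన్నారు?",   # informal greeting, for illustration
}

for language, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{language:8s}: {len(text):3d} characters -> {n_tokens:3d} tokens")

# Non-Latin scripts typically fragment into many more tokens for a comparable
# message, which translates directly into higher usage-based costs.
```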

average number of tokens by script (source)

As previously mentioned, because tokenization is the first step in an NLP pipeline, it has ripple effects on everything downstream. Andrej Karpathy, a founding member of OpenAI and former researcher there, notes that many abnormal results and problems in LLMs trace back to tokenization. The tokenizer affects downstream model components and the quality of the generated results. Its impact on the computational cost of a language model is also significant, as noted in an AWS publication on generative AI: tokenizer choices affect model size, training time, and inference speed. Optimizing the tokenization process saves computational resources and increases efficiency, particularly in large-scale deployments. Furthermore, training tokenizers on domain-specific data can improve how well a model understands and generates language in specialized domains. Hence, selecting the right tokenizer for an NLP task should not be overlooked. A more prudent and thoughtful choice of tokenizer helps machine learning models better capture the complexities of human language.

AI model optimization is a multifaceted process: multiple aspects, including model compression (pruning and quantization) and the tokenization procedure, need to be considered. SqueezeBits is committed to staying current with AI technologies and keeping pace with industry trends. If you’re interested in our solutions for AI model optimization, visit the links below or contact us at info@squeezebits.com.

References

[1] Kuzminykh, N. (2024). Calculating Token Counts for LLM Context Windows: A Practical Guide. https://winder.ai/calculating-token-counts-llm-context-windows-practical-guide/

[2] HuggingFace. NLP Course documentation. Training a new tokenizer from an old one. https://huggingface.co/learn/nlp-course/chapter6/2

[3] Jindal, S. (2024). Former OpenAI Researcher Andrej Karpathy Unveils Tokenisation Tutorial, Decodes Google’s Gemma. https://analyticsindiamag.com/former-openai-researcher-andrej-karpathy-unveils-tokenisation-tutorial-decodes-googles-gemma/

[4] Ahia, O., Kumar, S., Gonen, H., Kasai, J., Mortensen, D. R., Smith, N. A., & Tsvetkov, Y. (2023). Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. arXiv. https://arxiv.org/abs/2305.13707

[5] Subramanian, S. (2024). GenAI under the hood [Part 1] — Tokenizers and why you should care. https://community.aws/content/2ee0thtnVxZmFvpDUZFSck2ixOM/genai-under-the-hood-part-1---tokenizers-and-why-you-should-care

[6] Awan, A. (2023). What is Tokenization? https://www.datacamp.com/blog/what-is-tokenization

[7] Ali, M. (2024). Estimating The Cost of GPT Using The tiktoken Library in Python. https://www.datacamp.com/tutorial/estimating-cost-of-gpt-using-tiktoken-library-python

[8] Gomede, E. (2024). Byte Pair Encoding (BPE): Bridging Efficiency and Effectiveness in Language Processing. https://ai.plainenglish.io/byte-pair-encoding-bpe-bridging-efficiency-and-effectiveness-in-language-processing-143513108c47
