How Modern Tokenization Works 🌟

Multilingual Techniques — Byte-pair Encoding, SentencePiece, and WordPiece

Lars Wiik
10 min read · Apr 21, 2024

Tokenization might seem simple — it’s just slicing text into smaller pieces, right? — Wrong.

In natural language processing (NLP), tokenization is a fundamental step that sets the stage for computers to grasp human language.

With the rapid advancements in AI technology by leaders like OpenAI, Google, and others in developing Large Language Models, understanding tokenization has never been more critical.

It’s not merely about dividing text; it’s about unlocking the linguistic meanings that fuel some of the most sophisticated AI systems today.

As an experienced Machine Learning Engineer specializing in building products that leverage natural language processing, I see tokenization as a fundamental element of recent AI advances, and I want to shed some light on the topic.

In this article, we delve deeper into modern tokenization techniques and their pivotal role in the functioning of advanced AI systems.

But before we get to the modern approaches, let's refresh our memory of traditional tokenization.

Introduction to Traditional Tokenization 🌟

Traditional tokenization involves breaking down sentences into smaller components called tokens, typically represented as words, numbers, or punctuation marks.

This process has traditionally been tailored to each language based on its unique linguistic rules. For instance, an English tokenizer might handle contractions like “don’t” by recognizing them as single tokens, whereas a German tokenizer might split compound words to manage token size effectively.

Consider the following example of tokenizing an English sentence:

  • Original Sentence: “Technological advancements have revolutionized communication.”
  • Tokenized: [“Technological”, “advancements”, “have”, “revolutionized”, “communication”, “.”]

This example illustrates the tokenizer’s ability to distinguish between words and punctuation.
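
To make the traditional approach concrete, here is a minimal sketch of a word-level tokenizer built from a single regular expression (real tokenizers, such as those in NLTK or spaCy, handle many more edge cases like contractions, abbreviations, and hyphenation):

```python
import re

def simple_tokenize(text: str) -> list[str]:
    # Match runs of word characters (keeping apostrophes inside words)
    # or any single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(simple_tokenize("Technological advancements have revolutionized communication."))
# ['Technological', 'advancements', 'have', 'revolutionized', 'communication', '.']
```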

Traditional use cases

Tokenization acts as the first step in most natural language processing pipelines. Once the fundamental tokenization process has been completed, a series of linguistic text refinements has traditionally been applied. These include techniques such as POS tagging, creating n-grams, stemming, and lemmatization.

  • Part-of-Speech (POS) Tagging: This involves categorizing each token according to its function in the sentence, such as noun, verb, or adjective. POS tagging has played a crucial role in syntactic analysis and sentence understanding.
  • Creating N-Grams: N-grams are sequences of n consecutive words. They provide context that single words in isolation do not; for example, the bigram “not good” captures a negation that the word “good” alone misses. N-grams have been essential for building traditional language models that predict the probability of a word based on the words preceding it.
  • Stemming and Lemmatization: These processes reduce words to their base or root form. While stemming simply trims word endings to achieve this, lemmatization involves a thorough morphological analysis to ensure that the transformed word is a valid dictionary word. Both techniques help normalize textual data.
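
Below is a minimal sketch of these classic steps using NLTK (assuming the nltk package with its tokenizer, tagger, and WordNet data downloaded; exact resource names vary slightly between NLTK versions):

```python
import nltk
from nltk import pos_tag, word_tokenize, ngrams
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the data these helpers rely on.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

tokens = word_tokenize("The movie was not good")

print(pos_tag(tokens))          # part-of-speech tags, e.g. ('movie', 'NN'), ('good', 'JJ')
print(list(ngrams(tokens, 2)))  # bigrams, including the negation ('not', 'good')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("running"))        # 'run'   (crude suffix stripping)
print(lemmatizer.lemmatize("mice"))   # 'mouse' (dictionary-backed base form)
```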

Alongside techniques like POS tagging, creating N-grams, stemming, and lemmatization, another crucial method commonly employed is Term Frequency-Inverse Document Frequency (TF-IDF).

TF-IDF is a statistical measure used to evaluate the importance of a term within a document relative to a collection of documents (corpus). It operates on the principle that words that appear frequently in a document but rarely across other documents in the corpus are likely to be more informative and thus hold more significance.
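
As a small sketch, scikit-learn's TfidfVectorizer computes these scores directly over a toy corpus (the corpus and settings below are illustrative; the weighting and normalization schemes are configurable):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird sat on the fence",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary terms

# Words shared by every document ("the", "sat", "on") receive the lowest IDF,
# while words unique to a single document ("mat", "log", "fence") receive the highest.
for term, index in sorted(vectorizer.vocabulary_.items()):
    print(f"{term:>6}: idf={vectorizer.idf_[index]:.3f}")
```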

Once appropriate preprocessing steps have been applied, the refined linguistic features can serve as the foundation for different natural language processing tasks.

Historically, one of the most widespread uses of traditional NLP has been sentiment analysis.

Modern Tokenization — Use Cases 💡

Modern NLP systems incorporate more sophisticated tokenization techniques to enhance processing and understanding in several areas, including large language models, embedding models, and reranking systems.

Large Language Models

Large Language Models (LLMs) such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) are at the cutting edge of natural language processing technology.

In LLMs, tokenization is more than just a preliminary step; it is a cornerstone of their functionality. This process translates raw text into a structured format that these models can process efficiently.

The speed of tokenization is crucial during the training phase of large language models due to the sheer volume of data required. Efficient tokenization processes ensure that the models can be trained within a reasonable timeframe without compromising quality.

LLMs often employ subword tokenization techniques. These methods break down text into smaller units — subwords — capturing more linguistic nuances than mere word-level tokenization.

Embedding Models

Embedding models transform text into numerical vectors in multidimensional space, representing the textual meaning.

This transformation is crucial because it enables algorithms to process and analyze text based on its underlying meaning rather than just its superficial textual form.

Tokenization plays a critical role in this context as it determines the granularity of the text representation.

Embedding models are essential in areas such as semantic text similarity, information retrieval, and machine translation, to mention a few.
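
As an illustrative sketch, the snippet below embeds a few sentences with the sentence-transformers library (the model name is an assumption for illustration; the multilingual E5 model referenced later in this article is used here, and it expects the "query:"/"passage:" prefixes shown):

```python
from sentence_transformers import SentenceTransformer, util

# Example multilingual embedding model (assumed for illustration).
model = SentenceTransformer("intfloat/multilingual-e5-base")

sentences = [
    "query: how do transformers tokenize text?",
    "passage: Subword tokenization splits rare words into smaller units.",
    "passage: The weather in Bergen is rainy most of the year.",
]

embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # one vector per sentence, e.g. (3, 768)

# Cosine similarity: the query should score higher against the first passage.
print(util.cos_sim(embeddings[0], embeddings[1:]))
```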

Reranking Models

Reranking models refine the outputs of search and retrieval systems by reordering a ranked list of documents or results according to their relevance to a user query. The primary goal of these models is to enhance retrieval accuracy by applying advanced similarity matching.

Tokenization is crucial for reranking systems because it allows these models to accurately parse and analyze text to assess the relevance and contextual alignment with user queries.

By breaking down text into tokens, reranking systems can effectively evaluate and compare the semantic content of documents, ensuring that the most relevant results are prioritized in the final output, hence enhancing the precision of search and retrieval systems.
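
A minimal sketch of a cross-encoder reranker with the sentence-transformers library (the model id is an assumption; bge-reranker-v2-m3, mentioned later in this article, is used here under its Hugging Face name):

```python
from sentence_transformers import CrossEncoder

# Example reranking model; it scores each (query, document) pair jointly.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "How does subword tokenization help multilingual models?"
candidates = [
    "Subword units can be shared across languages, enabling transfer learning.",
    "The stock market closed higher on Friday.",
    "Byte-pair encoding merges frequent character pairs into larger tokens.",
]

scores = reranker.predict([(query, doc) for doc in candidates])

# Reorder the candidates from most to least relevant.
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```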

In Retrieval-Augmented Generation (RAG) systems, all of the above are usually employed together to maximize the efficiency and accuracy of text-generation tasks. RAG systems combine the prowess of large language models with the capabilities of retrieval systems to enhance the quality of generated text.

Basic RAG System

If you are interested in RAG systems, check out my article on it here.

Advanced Multilingual Tokenization Techniques 🔤

Modern AI systems designed for multilingual processing utilize advanced tokenization techniques, capable of efficiently managing a diverse range of languages and dialects.

These systems break down sentences into sub-words or smaller linguistic units. The idea is that these sub-word tokenizers can be reused across multiple languages.

This allows the model to create a custom vocabulary derived from multiple languages and handle new words it hasn’t encountered before.

Sharing subwords across languages allows for transfer learning, namely learning and applying linguistic features from one language to another.

The effectiveness of sub-word tokenization supports using a single model for multiple languages, reducing the need for language-specific models.

To illustrate sub-word tokenization, let’s look at the same example as earlier:

“Technological advancements have revolutionized communication.”

This sentence could be tokenized into the following array of sub-word tokens using a multilingual tokenizer:

[“Tech”, “nolog”, “ical”, “advance”, “ments”, “have”, “revol”, “ution”, “ized”, “commun”, “ica”, “tion”, “.”].

This tokenization technique not only maintains the semantics of the original sentence but also simplifies the process of transferring knowledge between languages.

For instance, sub-word pieces like “tech” and “nolog” may appear in technological contexts across different languages, enabling the model to process related words even in languages it saw little of during training.

The model can handle morphologically rich languages (like Finnish or Turkish) more effectively using sub-word units. Such languages often form words by compounding smaller units, which can be challenging for models that only tokenize at the word level.
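
To see what a real multilingual sub-word tokenizer produces, here is a short sketch using the Hugging Face transformers library (the model name is an example; the actual pieces will differ from the illustrative split above and vary between tokenizers):

```python
from transformers import AutoTokenizer

# xlm-roberta-base ships a SentencePiece-based multilingual tokenizer.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

print(tokenizer.tokenize("Technological advancements have revolutionized communication."))
# Pieces prefixed with "▁" start a new word in the original text.

# The same tokenizer handles other languages and scripts out of the box.
print(tokenizer.tokenize("Tekniska framsteg har revolutionerat kommunikationen."))
```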

Popular sub-word algorithms include Byte-pair Encoding (BPE), WordPiece, and SentencePiece. Let’s have a look at them!

Subword Tokenization Algorithms 🧠

Algorithms like Byte-pair Encoding (BPE), WordPiece, and SentencePiece are commonly used to generate a subword vocabulary. These algorithms are used in the most prominent language models nowadays.

Byte-Pair Encoding (BPE)

Byte-pair Encoding started out as a data compression technique and was later adapted as a subword tokenization method for natural language processing. BPE is also known for being faster than most other advanced tokenization techniques.

The algorithm starts with a vocabulary of individual characters and repeatedly merges the most frequent adjacent pairs of tokens into single tokens.

This approach helps manage a language model's vocabulary size, enabling it to efficiently handle rare words by decomposing them into smaller, more frequent subwords.
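
Here is a minimal sketch of that merge-learning loop, adapted from the toy example in the original BPE-for-NLP paper (production tokenizers add byte-level handling, special tokens, and far more efficient data structures):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the given symbol pair with one merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a sequence of characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):  # the number of merges controls the final vocabulary size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)  # first merges: ('e', 's'), then ('es', 't'), then ('est', '</w>'), ...
```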

Byte-pair Encoding has become particularly popular for training language models like GPT (Generative Pre-trained Transformer), where it helps cover a vast range of words without requiring an extremely large and sparse vocabulary. Additionally, BPE’s speed makes it well suited to tokenizing the massive datasets these language models are trained on.

OpenAI’s GPT models use a Byte-pair Encoding (BPE) variant for tokenization. This variant operates on bytes rather than Unicode characters, so the tokenizer can represent any input text without running into out-of-vocabulary (unknown) tokens.

Example of tokenizing a sentence using OpenAI’s tokenizer

Check out OpenAI’s GitHub repository called tiktoken for more information: https://github.com/openai/tiktoken.
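
A quick sketch using tiktoken (the encoding name below corresponds to GPT-3.5/GPT-4-era models; newer models may use a different encoding):

```python
import tiktoken

# cl100k_base is the byte-level BPE encoding used by GPT-3.5/GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Technological advancements have revolutionized communication."
token_ids = enc.encode(text)

print(token_ids)                             # integer token IDs
print([enc.decode([t]) for t in token_ids])  # the text piece behind each ID
print(f"{len(text)} characters -> {len(token_ids)} tokens")
```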

Another model using Byte-pair Encoding is XLM (Cross-lingual Language Model) created by researchers at Facebook AI Research (FAIR). Note that XLM is the predecessor to XLM-R, which uses SentencePiece instead of BPE.

WordPiece

WordPiece is closely related to BPE: it also starts from a vocabulary of individual characters (word pieces) and iteratively merges them into larger units. The key difference is the merge criterion: rather than merging the most frequent pair, WordPiece picks the pair that most increases the likelihood of the training data (roughly, the pair’s frequency normalized by the frequencies of its parts). At tokenization time, words are split greedily into the longest matching pieces, with continuation pieces marked by a “##” prefix.

WordPiece was developed by researchers at Google. It was originally introduced for Japanese and Korean voice search and was later described in the paper “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”.

WordPiece is used in Google’s BERT and its multilingual variant mBERT, which marked a significant advancement in natural language processing by processing text from 104 languages with a single shared WordPiece vocabulary.
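
A small sketch showing mBERT's WordPiece output via the Hugging Face transformers library (the model name is the standard multilingual BERT checkpoint; the exact pieces depend on the vocabulary):

```python
from transformers import AutoTokenizer

# Multilingual BERT uses a shared WordPiece vocabulary covering 104 languages.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

print(tokenizer.tokenize("Technological advancements have revolutionized communication."))
# Pieces that continue a word are prefixed with "##" in WordPiece vocabularies.
```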

SentencePiece

SentencePiece takes a unique approach to tokenization that is often favored in contexts where handling multiple languages simultaneously is important, especially those without clear word boundaries.

SentencePiece processes text as a continuous stream of Unicode characters, not dividing it into discrete words as traditional tokenizers do. It employs either a Byte-Pair Encoding (BPE) method or a Unigram Language Model to analyze this stream.

The unigram approach begins with a large pool of candidate sub-words and iteratively prunes this set based on how much each sub-word contributes to the likelihood of the text corpus, as computed by a probabilistic model of token occurrence.

Unlike most tokenizers, SentencePiece does not rely on the text being pre-split into words, so the pieces it learns are not bound by whitespace boundaries. This is particularly beneficial for languages like Japanese or Chinese, which do not mark word boundaries with spaces.
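
A minimal sketch of training and using a SentencePiece model with the sentencepiece Python package (the corpus file name and all settings below are illustrative):

```python
import sentencepiece as spm

# Train a small unigram model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",         # assumed corpus file
    model_prefix="spm_demo",    # writes spm_demo.model and spm_demo.vocab
    vocab_size=8000,
    model_type="unigram",       # "bpe" is the other subword-level option
    character_coverage=0.9995,  # useful for scripts with large character sets
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("Technological advancements have revolutionized communication.", out_type=str))
# Pieces starting with "▁" mark where a new word began in the raw character stream.
```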

mT5 (multilingual T5), developed by Google Research, is a multilingual variant of the Text-To-Text Transfer Transformer (T5) model. mT5 leverages SentencePiece for tokenization, enabling it to perform tasks such as machine translation, summarization, and question answering across diverse languages.

XLMRobertaTokenizer is a well-known tokenizer, openly available on Hugging Face, that uses SentencePiece. It is used in some of the most advanced and performant multilingual embedding models, such as E5 and BGE, and in reranking models such as bge-reranker-v2-m3.

Check out my previous article on E5 and BGE here.

Summary and Conclusion

The evolution of tokenization in natural language processing illustrates a shift from traditional language-specific methods to advanced multilingual techniques that enhance the functionality of modern AI systems.

Techniques such as Byte-pair Encoding, WordPiece, and SentencePiece have not only streamlined the processing of diverse languages but have also laid a crucial foundation for globalizing large language models, embedding models, and reranking models.

These advanced methods handle out-of-vocabulary words and improve model robustness by efficiently breaking down text into manageable, meaningful units known as sub-words.

Open-source tokenizers that utilize these techniques, such as the XLMRobertaTokenizer, will continue to drive innovation in the AI community and provide essential support for state-of-the-art embedding, reranking models, and large language models.

As we advance, the importance of tokenization remains undeniable, serving not only as a bridge between linguistic diversity and technological advancement but also as a cornerstone for the future landscape of artificial intelligence.

Thanks for reading!

Feel free to follow me to receive more insights in the future!

And do not hesitate to reach out if you have any questions!

Through my articles, I share cutting-edge insights into LLMs and AI, offer practical tips and tricks, and provide in-depth analyses based on my real-world experience. Additionally, I do custom LLM performance analyses, a topic I find extremely fascinating and important in this day and age.

My content is for anyone interested in AI and LLMs — Whether you’re a professional or an enthusiast!

Follow me if this sounds interesting!
