Two minutes NLP — A Taxonomy of Tokenization Methods

Word-level, Character-level, BPE, WordPiece, and SentencePiece

Fabio Chiusano
NLPlanet
5 min read · Jan 25, 2022


Summary of the most used tokenization methods. Image by the author.

Tokenization consists of splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. In this article, you’ll see what the main tokenization methods are and where they are currently used. For a more in-depth guide, I suggest that you also have a look at this summary of tokenizers made by Hugging Face.

Word-Level Tokenization

Word-level tokenization splits the text into units that are words. To do it properly, there are some precautions to consider.

Space and Punctuation Tokenization

Splitting a text into smaller chunks is harder than it looks, and there are multiple ways of doing so. For example, let’s look at the following sentence:

"Don't you like science? We sure do."

A simple way of tokenizing this text is to split it by spaces, which would give:

["Don't", "you", "like", "science?", "We", "sure", "do."]

If we look at the tokens "science?" and "do.", we notice that the punctuation is attached to the words "science" and "do", which is suboptimal. We should take the punctuation into account so that a model does not have to learn a different representation for every combination of a word and every punctuation symbol that could follow it, which would explode the number of representations the model has to learn.

Taking punctuation into account, tokenizing our text would give:

["Don", "'", "t", "you", "like", "science", "?", "We", "sure", "do", "."]

Rule-based Tokenization

The previous tokenization is somewhat better than pure space-based tokenization. However, we can further improve how the tokenization deals with the word "Don't". "Don't" stands for "do not", so it would be better tokenized as something like ["Do", "n't"]. Several other ad hoc rules can further improve the tokenization, as the example below shows.
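For instance, NLTK’s word_tokenize follows Penn Treebank-style rules and produces exactly this kind of split (a small sketch assuming the nltk package and its tokenizer data are installed):

import nltk
nltk.download("punkt")  # sentence tokenizer data, needed once (recent NLTK versions may ask for "punkt_tab")

from nltk.tokenize import word_tokenize

print(word_tokenize("Don't you like science? We sure do."))
# ['Do', "n't", 'you', 'like', 'science', '?', 'We', 'sure', 'do', '.']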

However, depending on the rules we apply for tokenizing a text, a different tokenized output is generated for the same text. As a consequence, a pre-trained model only performs properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.

The Problem with Word-Level Tokenization

Word-level tokenization can lead to problems for massive text corpora, as it generates a very big vocabulary. For example, the Transformer-XL language model uses space and punctuation tokenization, resulting in a vocabulary size of about 267k tokens.

As a result of such a large vocabulary size, the model has a huge embedding matrix as the input and output layer, which increases both the memory and time complexity. To give a reference value, transformer models rarely have vocabulary sizes greater than 50,000.

Character-Level Tokenization

So if word-level tokenization is not ok, why not simply tokenize on characters?

Even though character tokenization would greatly reduce memory and time complexity, it makes it much more difficult for the model to learn meaningful input representations. For example, learning a meaningful context-independent representation for the letter "t" is much harder than learning a context-independent representation for the word "today".

Therefore, character tokenization often leads to a loss of performance. To get the best of both worlds, transformer models often use a hybrid between word-level and character-level tokenization called subword tokenization.

Subword Tokenization

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" as stand-alone subwords would appear more frequently, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly".

In addition to keeping the model’s vocabulary size reasonable, subword tokenization allows the model to learn meaningful context-independent representations. Moreover, subword tokenization can be used to process words the model has never seen before, by breaking them down into known subwords.
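As a concrete illustration, here is how a pretrained subword tokenizer handles a rare word. This is a sketch assuming the transformers library and the bert-base-uncased checkpoint; the exact split depends on the vocabulary learned during pretraining:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare word is decomposed into more frequent subwords;
# "##" marks a piece that continues the previous token.
print(tokenizer.tokenize("annoyingly"))
# typically something like ['annoying', '##ly']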

Let’s see now several different ways of doing subword tokenization.

Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) relies on a pre-tokenizer that splits the training data into words (such as with space tokenization, used in GPT-2 and RoBERTa).

After pre-tokenization, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words of the corpus, and then learns merge rules that combine the most frequent pair of existing symbols into a new symbol. This process iterates until the vocabulary has reached the desired vocabulary size.
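A minimal sketch of this merge-learning loop, assuming the corpus has already been pre-tokenized into words with their frequencies (toy data, not a production implementation):

from collections import Counter

# Toy pre-tokenized corpus: word -> frequency, with each word split into characters.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {word: list(word) for word in word_freqs}

def count_pairs(splits, word_freqs):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, symbols in splits.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += word_freqs[word]
    return pairs

merges = []
num_merges = 6  # in practice: until the target vocabulary size is reached
for _ in range(num_merges):
    pairs = count_pairs(splits, word_freqs)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)       # most frequent adjacent pair
    merges.append(best)
    for symbols in splits.values():        # apply the merge to every word
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == best:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1

print(merges)  # learned merge rules, e.g. ('u', 'g') first with these toy counts
print(splits)  # how each word is now segmented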

WordPiece

WordPiece, used for BERT, DistilBERT, and Electra, is very similar to BPE. WordPiece first initializes the vocabulary to include every character present in the training data and progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.

Intuitively, WordPiece is slightly different from BPE in that it evaluates what it loses by merging two symbols to ensure it’s worth it.
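One common way to describe this criterion (the formulation used in the Hugging Face course, given here as a sketch rather than the exact training code of any particular model): each candidate pair is scored by dividing its frequency by the product of the frequencies of its two parts, so a merge only wins when its parts occur together much more often than they occur on their own.

def wordpiece_score(pair_freq, first_freq, second_freq):
    # Pairs made of two symbols that are already very frequent on their own get a low score.
    return pair_freq / (first_freq * second_freq)

# With the toy counts from the BPE sketch above: ('u', 'g') occurs 20 times,
# 'u' occurs 36 times and 'g' occurs 20 times.
print(wordpiece_score(20, 36, 20))  # ~0.028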

Unigram

In contrast to BPE or WordPiece, Unigram initializes its base vocabulary with a large number of symbols and progressively trims it down to obtain a smaller vocabulary. The base vocabulary could, for instance, correspond to all pre-tokenized words and the most common substrings. Unigram is often used in conjunction with SentencePiece.
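At segmentation time, a Unigram model treats tokens as independent and picks, for each word, the split with the highest probability under the learned token probabilities. Here is a toy sketch of that scoring step only (the probabilities are made up for illustration; real training additionally prunes the tokens whose removal hurts the corpus likelihood the least):

import math

# Hypothetical unigram probabilities for a tiny vocabulary.
unigram_probs = {"h": 0.05, "u": 0.04, "g": 0.03, "s": 0.06, "hu": 0.07, "ug": 0.10, "hug": 0.15}

def segmentations(word):
    # Enumerate all ways to split the word into in-vocabulary tokens.
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in unigram_probs:
            for rest in segmentations(word[i:]):
                yield [prefix] + rest

def best_segmentation(word):
    # Choose the segmentation whose tokens have the highest joint probability.
    best, best_logp = None, -math.inf
    for seg in segmentations(word):
        logp = sum(math.log(unigram_probs[t]) for t in seg)
        if logp > best_logp:
            best, best_logp = seg, logp
    return best

print(best_segmentation("hugs"))  # ['hug', 's'] with these toy probabilities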

SentencePiece

All tokenization algorithms described so far have the same problem: it is assumed that the input text uses spaces to separate words. However, not all languages use spaces to separate words.

To solve this problem generally, SentencePiece treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or Unigram algorithm to construct the appropriate vocabulary.

Examples of models using SentencePiece are ALBERT, XLNet, Marian, and T5.
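With a SentencePiece-based tokenizer, the space is encoded as part of the tokens themselves (shown as the "▁" character), so the original text can be reconstructed exactly. A small sketch assuming the transformers library and the t5-small checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

print(tokenizer.tokenize("Don't you like science? We sure do."))
# something like ['▁Don', "'", 't', '▁you', '▁like', '▁science', '?', '▁We', '▁sure', '▁do', '.']
# "▁" marks where a space was, which makes detokenization lossless.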
