All about Tokenizers

We study tokenizers because machines do not read language as is: text must be converted to numbers, and that is where tokenizers come to the rescue. Loosely speaking, tokenization is splitting sentences into words. This sounds simple, but there are a few caveats, mainly because we want to map each token to a number, and the size and meaningfulness of the resulting mapping are critical for machine learning tasks.

What is a tokenizer?

A tokenizer splits text into words or subwords, and there are multiple ways this can be achieved.

For example, the text given below can be split into subwords in multiple ways:

[Image: alternative splits of the same sentence. Source: author]
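As a sketch, here are two different splitting rules applied to the same sentence (the sentence is my own illustrative choice); each rule yields a different token sequence:

```python
import re

text = "Don't you love transformers? We sure do."

# Rule 1: split on whitespace only; punctuation stays attached to words.
space_split = text.split()
# ["Don't", 'you', 'love', 'transformers?', 'We', 'sure', 'do.']

# Rule 2: split on whitespace AND separate punctuation into its own tokens.
space_punct_split = re.findall(r"\w+|[^\w\s]", text)
# ['Don', "'", 't', 'you', 'love', 'transformers', '?', 'We', 'sure', 'do', '.']
```

Note how "transformers?" is a single token under the first rule but two tokens under the second.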

Punctuation attached to words makes the tokens suboptimal: the model would have to learn a separate representation for every word followed by every possible punctuation symbol. Splitting punctuation into its own tokens saves the model from learning that vast number of representations.

But it doesn’t stop there. It would have been better if “Let’s” were tokenized as “Let” and “’s”. This is just one example of how tokenization differs based on which rule is applied: different tokenization rules will yield different outputs for the same input text.

When it comes to a pre-trained model, input at inference time must pass through the same tokenization rule that was applied when processing the training data. This mirrors conventional data pre-processing, where the same transformations used on the training data are applied to the test data.

So essentially the text is converted into smaller chunks. For example, if space-and-punctuation tokenization is used to split sentences into words, it can lead to a huge vocabulary and, in turn, a huge embedding matrix. That makes training an expensive affair, in terms of both time complexity and memory.
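To get a feel for the scale, here is some back-of-the-envelope arithmetic with assumed sizes: a naive word-level vocabulary of 500k entries versus a typical ~30k subword vocabulary, both with 768-dimensional float32 embeddings.

```python
# Rough memory cost of the embedding matrix alone (float32 = 4 bytes).
# The vocabulary sizes and embedding dimension are illustrative assumptions.
def embedding_bytes(vocab_size, embed_dim, bytes_per_param=4):
    return vocab_size * embed_dim * bytes_per_param

word_level = embedding_bytes(500_000, 768)  # naive word-level vocabulary
subword = embedding_bytes(30_000, 768)      # typical subword vocabulary

print(f"word-level: {word_level / 1e9:.2f} GB")  # ~1.54 GB
print(f"subword:    {subword / 1e6:.1f} MB")     # ~92 MB
```

The embedding matrix shrinks by more than an order of magnitude just by bounding the vocabulary.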

Let’s see this with an example:

[Image: tokenizer output for an example sentence, shared by author]

Note that all the words are in lowercase because we are using an uncased model. In addition to the punctuation issues illustrated in the example above, notice that the word “tokenize” is not present in the vocabulary and is hence split into [“token”, “##ize”]. The “##” signifies that this token must be attached to the previous one during decoding, so the original text can be reproduced.
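The “##” mechanics can be sketched as a greedy longest-match-first lookup over a vocabulary, which is how BERT-style WordPiece tokenization splits a word. The tiny vocabulary below is my own illustration, not the real BERT vocabulary.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split in the style of WordPiece.
    Pieces that continue a word are prefixed with '##'."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the span until a known piece is found
        if piece is None:
            return ["[UNK]"]  # no known piece covers this span
        tokens.append(piece)
        start = end
    return tokens

# Tiny illustrative vocabulary (assumed, not the real BERT vocab).
vocab = {"token", "##ize", "##izer", "##s", "play", "##ing"}
print(wordpiece_tokenize("tokenize", vocab))  # ['token', '##ize']
print(wordpiece_tokenize("playing", vocab))   # ['play', '##ing']
```

The real tokenizer works the same way over a ~30k-entry vocabulary learned from data.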

Let’s see what our next best bet is: character-level tokenization? Owing to its simplistic tokenization rule, it is neither time-expensive nor heavy on memory. But it takes a hit on model performance, for the simple reason that it fails to learn meaningful input representations. Compare learning a representation for “C” with learning one for “Covid”: which one appears easier to learn in terms of meaningful representation?

So, we have seen both word and character tokenization, and now turn to the hybrid approach called subword tokenization.

Subword tokenizer and its characteristics:

  • Supports reasonable vocabulary size
  • Learns meaningful context-independent representations
  • Processes words not seen before

Let’s see how.

It works by keeping frequently used words intact while decomposing relatively rarer words into frequently occurring subwords. The meaning of such a rare word is derived from the composite meaning of its constituent subwords. Following this principle, subword tokenization has the advantage of processing previously unseen words, as it tries to break them into known subwords.


Let’s learn about the three key subword tokenizers:

  • BPE (byte-pair encoding)
  • WordPiece
  • SentencePiece

Byte-pair encoding (BPE):

I have tried explaining the BPE subword tokenization process using the image below. Hopefully, it will help you understand the various steps: pre-tokenization, base-vocabulary formation, and how merge rules lead to an updated vocabulary.

Image created by author with example sourced from references

If a new word “bug” appears, then based on the rules learned during BPE training it would be tokenized as [“b”, “ug”]. If a new word contains symbols that are not present in the base vocabulary, they are replaced by the “<unk>” symbol: e.g. “mug” contains the symbol “m”, which in this example was not originally in the vocabulary. Note that the “<unk>” symbol is not needed if all base characters are present in the vocabulary, which is what byte-level BPE guarantees.
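The training loop from the image can be sketched in a few lines: start from character-level splits, repeatedly merge the most frequent adjacent pair, and record the merges as rules. The toy corpus of word frequencies below is illustrative.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Toy BPE trainer: each step merges the most frequent adjacent pair."""
    splits = {w: list(w) for w in word_freqs}  # start at character level
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for w, freq in word_freqs.items():
            syms = splits[w]
            for a, b in zip(syms, syms[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        for w in word_freqs:  # apply the new merge rule everywhere
            syms, i, merged = splits[w], 0, []
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    merged.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    merged.append(syms[i])
                    i += 1
            splits[w] = merged
    return merges

def apply_bpe(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    syms = list(word)
    for a, b in merges:
        i, out = 0, []
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        syms = out
    return syms

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = train_bpe(corpus, num_merges=3)
print(merges)                    # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(apply_bpe("bug", merges))  # ['b', 'ug']
```

Replaying the merges on the unseen word “bug” reproduces the [“b”, “ug”] split described above.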

WordPiece: It works similarly to BPE, with the difference that it chooses the symbol pair that would most increase the language-model likelihood of the training data once added to the vocabulary, instead of simply the most frequently occurring pair as in BPE.
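One common way to express this criterion (as in Hugging Face’s description of WordPiece training) is to score a pair by its frequency normalized by the frequencies of its parts; the counts below are assumed for illustration.

```python
def wordpiece_score(pair_freq, first_freq, second_freq):
    """WordPiece-style merge score: pair frequency normalized by the
    frequencies of its parts, so a pair whose parts are rare on their
    own can beat a merely frequent pair."""
    return pair_freq / (first_freq * second_freq)

# Illustrative counts (assumed, not from a real corpus):
# ('u', 'g') occurs 20 times, but 'u' (36) and 'g' (20) are common;
# ('g', 's') occurs only 5 times, but its parts are rarer.
score_ug = wordpiece_score(20, 36, 20)  # ~0.028
score_gs = wordpiece_score(5, 20, 5)    # 0.05
print(score_gs > score_ug)              # True: the rarer pair wins
```

With these counts, BPE would merge (“u”, “g”) first, while WordPiece would prefer (“g”, “s”).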

SentencePiece: An unsupervised tokenizer that includes the space in the set of characters and then applies BPE (or a unigram language model) to generate the vocabulary.

Key characteristics:

  • Supports two segmentation algorithms, namely byte-pair encoding (BPE) and the unigram language model
  • Uses a fixed vocabulary size, since it is practically impossible to include every word in the vocabulary. For example, “glory” and “glorify”, or “dignity” and “dignified”, are semantically close and need not each be assigned a separate vector.
  • Directly generates the vocabulary-to-id mapping
  • Is language-agnostic, i.e. treats sentences as sequences of Unicode characters and is free from language-dependent logic
  • Trains directly from raw sentences and does not always require pre-tokenization
  • Supports regularization methods such as subword regularization and BPE-dropout, which enable data augmentation by performing on-the-fly subword sampling. These techniques produce robust models and improve the accuracy of neural machine translation models.
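Treating the space as an ordinary symbol is what makes SentencePiece lossless: it replaces spaces with the marker “▁” (U+2581) before segmenting, so detokenization is a simple string join. A minimal sketch of that round trip (the real library then runs BPE or the unigram model on top of the marked text):

```python
# '▁' (U+2581) stands in for the space, making it just another symbol.
SPACE_MARKER = "\u2581"

def encode_spaces(text):
    """Pre-process raw text the SentencePiece way: mark word boundaries."""
    return SPACE_MARKER + text.replace(" ", SPACE_MARKER)

def decode_pieces(pieces):
    """Lossless detokenization: concatenate and restore spaces."""
    return "".join(pieces).replace(SPACE_MARKER, " ").strip()

print(encode_spaces("Hello world."))        # '▁Hello▁world.'
pieces = ["\u2581Hello", "\u2581wor", "ld."]  # one possible segmentation
print(decode_pieces(pieces))                # 'Hello world.'
```

Because the space survives inside the pieces, any segmentation of the marked text decodes back to the exact original sentence, with no language-specific detokenization rules.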

We started by understanding what a tokenizer is and why tokenizers are needed in the first place. Then we learnt three types of tokenizers: character, word, and subword tokenizers. Within subword tokenizers, there are three key algorithms: BPE, WordPiece, and SentencePiece.





