Tokenization in NLP: All You Need to Know

Abdallah Ashraf
8 min read · Jan 30, 2024

Natural Language Processing (NLP) has rapidly evolved in recent years, enabling machines to understand and process human language. At the core of any NLP pipeline lies tokenization, a fundamental step that breaks down unstructured text into discrete elements. In this article, we will explore the significance of tokenization and discuss its types, challenges, and tools.

Why do we need Tokenization?

Unstructured text data, such as articles, social media posts, or emails, lacks a predefined structure that machines can readily interpret. Tokenization bridges this gap by breaking down the text into smaller units called tokens. These tokens can be words, characters, or even subwords, depending on the chosen tokenization strategy. By transforming unstructured text into a structured format, tokenization lays the foundation for further analysis and processing.

One of the primary reasons for tokenization is to convert textual data into a numerical representation that machine learning algorithms can process. With this numeric representation, we can train models to perform various tasks, such as classification, sentiment analysis, or language generation.

Tokens not only serve as numeric representations of text but can also be used as features in machine learning pipelines. These features capture important linguistic information and can trigger more complex decisions or behaviors. For example, in text classification, the presence or absence of specific tokens can influence the prediction of a particular class. Tokenization, therefore, plays a pivotal role in extracting meaningful features and enabling effective machine learning models.
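To make this concrete, here is a minimal sketch of tokens used as classification features, built with scikit-learn's CountVectorizer (a library chosen only for this illustration; the two toy documents are invented):

from sklearn.feature_extraction.text import CountVectorizer

# two toy documents for a sentiment-style classification task
docs = ["I love this movie", "I hate this movie"]

vectorizer = CountVectorizer()             # splits on word boundaries and lowercases by default
features = vectorizer.fit_transform(docs)  # sparse matrix of token counts

print(vectorizer.get_feature_names_out())  # the vocabulary learned from the documents
print(features.toarray())                  # one row per document, one column per token

A classifier trained on this matrix would rely on exactly the kind of token-level signal described above: the count of “love” versus “hate” is what separates the two documents.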

Natural language is inherently ambiguous, with words often having multiple meanings depending on the context. Tokenization helps disambiguate the text by splitting it into individual tokens, which can then be analyzed in the context of their surrounding tokens. This context-aware approach provides a more nuanced understanding of the text and improves the accuracy of subsequent NLP tasks.

Different strategies for Tokenization

There are several tokenization strategies one can adopt, and the optimal splitting of words into subunits is usually learned from the corpus.

First let’s consider two extreme cases: character tokenization and word tokenization.

1. Character Tokenization

The simplest tokenization scheme is to feed each character individually to the model. In Python, str objects are really arrays under the hood, which allows us to quickly implement character-level tokenization with just one line of code:

text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
print(tokenized_text)
['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ' ',
'i', 's', ' ', 'a', ' ', 'c', 'o', 'r', 'e', ' ', 't', 'a', 's', 'k', ' ', 'o',
'f', ' ', 'N', 'L', 'P', '.']

This is a good start, but we’re not done yet. Our model expects each character to be converted to an integer, a process sometimes called numericalization. One simple way to do this is by encoding each unique token (which are characters in this case) with a unique integer:

token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2idx)
{' ': 0, '.': 1, 'L': 2, 'N': 3, 'P': 4, 'T': 5, 'a': 6, 'c': 7, 'e': 8, 'f': 9,
'g': 10, 'i': 11, 'k': 12, 'n': 13, 'o': 14, 'r': 15, 's': 16, 't': 17, 'x': 18,
'z': 19}

This gives us a mapping from each character in our vocabulary to a unique integer. We can now use token2idx to transform the tokenized text to a list of integers:

input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)
[5, 14, 12, 8, 13, 11, 19, 11, 13, 10, 0, 17, 8, 18, 17, 0, 11, 16, 0, 6, 0, 7,
14, 15, 8, 0, 17, 6, 16, 12, 0, 14, 9, 0, 3, 2, 4, 1]

Each token has now been mapped to a unique numerical identifier (hence the name input_ids). The last step is to convert input_ids to a 2D tensor of one-hot vectors. One-hot vectors are frequently used in machine learning to encode categorical data. We can create the one-hot encodings in PyTorch by converting input_ids to a tensor and applying the one_hot() function as follows:

import torch
import torch.nn.functional as F
input_ids = torch.tensor(input_ids)
one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
one_hot_encodings.shape
torch.Size([38, 20])

For each of the 38 input tokens we now have a one-hot vector with 20 dimensions, since our vocabulary consists of 20 unique characters.

By examining the first vector, we can verify that a 1 appears in the location indicated by input_ids[0]:

print(f"Token: {tokenized_text[0]}")
print(f"Tensor index: {input_ids[0]}")
print(f"One-hot: {one_hot_encodings[0]}")
Token: T
Tensor index: 5
One-hot: tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Challenges of character tokenization

From our simple example we can see that character-level tokenization ignores any structure in the text and treats the whole string as a stream of characters.

Although this helps deal with misspellings and rare words, the main drawback is that linguistic structures such as words need to be learned from the data. This requires significant compute, memory, and data. For this reason, character tokenization is rarely used in practice.

Instead, some structure of the text is preserved during the tokenization step. Word tokenization is a straightforward approach to achieve this, so let’s take a look at how it works.

2. Word Tokenization

Instead of splitting the text into characters, we can split it into words and map each word to an integer. Using words from the outset enables the model to skip the step of learning words from characters, and thereby reduces the complexity of the training process.

One simple class of word tokenizers uses whitespace to tokenize the text. We can do this by applying Python’s split() function directly on the raw text:

tokenized_text = text.split()
print(tokenized_text)
['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']

From here we can take the same steps we took for the character tokenizer to map each word to an ID.
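For example, here is a minimal sketch that mirrors the character-level code above, reusing the same variable names:

word2idx = {word: idx for idx, word in enumerate(sorted(set(tokenized_text)))}
input_ids = [word2idx[word] for word in tokenized_text]
print(input_ids)
[1, 7, 4, 2, 3, 6, 5, 0]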

Challenges of word tokenization

However, we can already see one potential problem with this tokenization scheme: punctuation is not accounted for, so “NLP.” is treated as a single token. Given that words can include declensions, conjugations, or misspellings, the size of the vocabulary can easily grow into the millions!
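The punctuation issue by itself is easy to patch, for example with a simple regular expression that emits words and punctuation marks as separate tokens (an illustrative pattern only, not what production tokenizers do):

import re
# \w+ matches runs of word characters, [^\w\s] matches single punctuation marks
print(re.findall(r"\w+|[^\w\s]", text))
['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP', '.']

The vocabulary-size problem described next, however, does not go away.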

Having a large vocabulary is a problem because it requires neural networks to have an enormous number of parameters. To illustrate this, suppose we have 1 million unique words and want to compress the 1-million-dimensional input vectors to 1-thousand-dimensional vectors in the first layer of our neural network. This is a standard step in most NLP architectures, and the resulting weight matrix of this first layer would contain 1 million × 1 thousand = 1 billion weights. This is already comparable to the largest GPT-2 model, which has around 1.5 billion parameters in total!

Naturally, we want to avoid being so wasteful with our model parameters since models are expensive to train, and larger models are more difficult to maintain. A common approach is to limit the vocabulary and discard rare words by considering, say, the 100,000 most common words in the corpus. Words that are not part of the vocabulary are classified as “unknown” and mapped to a shared UNK token. This means that we lose some potentially important information in the process of word tokenization, since the model has no information about words associated with UNK.
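Here is a minimal sketch of this vocabulary capping on a tiny made-up corpus (real tokenizers handle this internally; the corpus, the limit of 3, and the UNK id are only for illustration):

from collections import Counter

corpus_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]  # toy corpus
vocab_limit = 3  # keep only the 3 most common words

vocab = [word for word, _ in Counter(corpus_tokens).most_common(vocab_limit)]
word2idx = {word: idx for idx, word in enumerate(vocab)}
unk_id = len(word2idx)  # reserve one extra id for the shared UNK token

input_ids = [word2idx.get(word, unk_id) for word in ["the", "dog", "sat"]]
print(input_ids)  # "dog" is out of vocabulary and collapses to the UNK id
[0, 3, 2]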

Wouldn’t it be nice if there was a compromise between character and word tokenization that preserved all the input information and some of the input structure? There is… subword tokenization.

3. Subword Tokenization

The basic idea behind subword tokenization is to combine the best aspects of character and word tokenization.

On the one hand, we want to split rare words into smaller units to allow the model to deal with complex words and misspellings. On the other hand, we want to keep frequent words as unique entities so that we can keep the length of our inputs to a manageable size.

The main distinguishing feature of subword tokenization (as well as word tokenization) is that it is learned from the pretraining corpus using a mix of statistical rules and algorithms.

There are several subword tokenization algorithms that are commonly used in NLP, but let’s start with WordPiece, which is used by the BERT and DistilBERT tokenizers. The easiest way to understand how WordPiece works is to see it in action.

The Transformers library provides a convenient AutoTokenizer class that allows you to quickly load the tokenizer associated with a pretrained model; we just call its from_pretrained() method, providing the ID of a model on the Hugging Face Hub or a local file path.

Let’s start by loading the tokenizer for DistilBERT:

from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

The AutoTokenizer class belongs to a larger set of “auto” classes whose job is to automatically retrieve the model’s configuration, pretrained weights, or vocabulary from the name of the checkpoint. This allows you to quickly switch between models, but if you wish to load the specific class manually you can do so as well. For example, we could have loaded the DistilBERT tokenizer as follows:

from transformers import DistilBertTokenizer
distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)

Let’s examine how this tokenizer works by feeding it our simple “Tokenizing text is a core task of NLP.” example text:

encoded_text = tokenizer(text)
print(encoded_text)
{'input_ids': [101, 19204, 6026, 3793, 2003, 1037, 4563, 4708, 1997, 17953,
2361, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Just as with character tokenization, we can see that the words have been mapped to unique integers in the input_ids field. The attention_mask field indicates which tokens the model should pay attention to; it becomes relevant once we pad sequences to a common length, as we’ll see shortly. Now that we have the input_ids, we can convert them back into tokens by using the tokenizer’s convert_ids_to_tokens() method:

tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl',
'##p', '.', '[SEP]']

We can observe three things here. First, some special [CLS] and [SEP] tokens have been added to the start and end of the sequence. These tokens differ from model to model, but their main role is to indicate the start and end of a sequence; we’ll look them up directly in a moment.

Second, the tokens have each been lowercased, which is a feature of this particular checkpoint.

Finally, we can see that “tokenizing” and “NLP” have been split into two tokens, which makes sense since they are not common words. The ## prefix in ##izing and ##p means that the preceding string is not whitespace; any token with this prefix should be merged with the previous token when you convert the tokens back to a string.

The AutoTokenizer class has a convert_tokens_to_string() method for doing just that, so let’s apply it to our tokens:

print(tokenizer.convert_tokens_to_string(tokens))
[CLS] tokenizing text is a core task of nlp. [SEP]
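The special tokens we just saw are also exposed as attributes of the tokenizer, so we can look them up directly instead of reading them off the output:

print(tokenizer.cls_token, tokenizer.cls_token_id)
print(tokenizer.sep_token, tokenizer.sep_token_id)
[CLS] 101
[SEP] 102

Note that 101 and 102 are exactly the first and last entries of the input_ids we obtained earlier.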

The AutoTokenizer class also has several attributes that provide information about the tokenizer. For example, we can inspect the vocabulary size:

tokenizer.vocab_size
30522

and the corresponding model’s maximum context size:

tokenizer.model_max_length
512
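To see the attention_mask mentioned earlier in action, we can pass a small batch of sentences of different lengths and let the tokenizer pad them to a common length (the second sentence is a made-up short example):

batch = [text, "A short sentence."]
encoded_batch = tokenizer(batch, padding=True, truncation=True)
print(encoded_batch["attention_mask"])

The 1s mark real tokens, while the 0s mark the padding added to the shorter sentence, which the model should ignore.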

In this article, we have explored the significance of tokenization in NLP and implemented it using Python. While it may seem like a straightforward topic at first, delving into the intricacies of each tokenizer model reveals its complexity.

It is recommended to practice with the provided examples and apply them to various text datasets. The more you practice, the deeper your understanding of tokenization will become.
