Tokenization — A complete guide

Natural Language Processing | Text Preprocessing

Utkarsh Kant
9 min read · Jan 28, 2022


In today’s episode 📺:

  1. So what is Tokenization?
  2. And what is a Token?
  3. Why do we tokenize?
  4. Types of Tokenization
  5. A quick word on linguistics!
  6. How to tokenize?
  7. Text corpus
  8. Tokenization with `split()` function
  9. Tokenization with NLTK
  10. Tokenization with spaCy
  11. How tokenization happens under the hood in spaCy
  12. In conclusion

🤖 TLDR:
If you want to directly jump to the meat, go to the
Tokenization with spaCy section.

In the last article, we briefly discussed what NLP is, how it is being used in the industry, and how it is impacting our day-to-day lives.

We also went through the NLP Pipeline and the different steps involved. We saw that text preprocessing is an essential block in the pipeline. After acquiring data, we apply a bunch of preprocessing steps to make it ready for modeling.

One of the first steps in text preprocessing is Tokenization.

So what is Tokenization?

Tokenization is the process of creating tokens.

Tokenization

And what is a Token?

Tokens can be thought of as the building units of a text sequence (the data).

A string is built by piecing characters together: characters come together to form a word, words come together to form a sentence, sentences form a paragraph, paragraphs form a document, and so on.

These units that build up the text corpus are tokens, and the process of splitting a text sequence into its tokens is tokenization.

As explained above, these tokens can be:

  • characters
  • words (individual words or sets of multiple words together)
  • parts of words (subwords)
  • punctuation marks
  • sentences
  • regular expressions
  • special tokens (we will discuss these in an upcoming discussion)

Why do we tokenize?

This is a legit question. We know what tokenization is, but why should we do it? How does it help with our NLP task?

As we saw, tokens are the building blocks of text in natural language. Therefore, most of the preprocessing and modeling happens at the token level.

For example, removing stopwords, stemming, lemmatization, and many other preprocessing steps happen at the token level (we will learn about them in upcoming discussions).

Even neural network architectures process individual tokens to make sense of the document. The illustration below explains that in action.

RNN processing tokens [Source]

💡 We will deep dive into RNNs and other Deep Learning applications for NLP in upcoming discussions.

Types of Tokenization

As we now know, Tokenization helps split the original text into characters, words, sentences, etc. depending upon the problem at hand.

  • Therefore, if you split the text data (or document) into words, it’s called Word Tokenization.
  • If the document is split into sentences, then it is called Sentence Tokenization.
  • Similarly, splitting the document into individual characters is known as Character Tokenization.

…and so on.

A quick word on linguistics!

As you are already aware, NLP heavily involves the study of the human language or linguistics. Therefore, let us quickly brush up on the concept of prefixes, suffixes, and infixes before proceeding.

  • Prefix: Character(s) at the beginning.
    Example: $, (,
  • Suffix: Character(s) at the end.
    Example: km, ), !, ?
  • Infix: Character(s) in between.
    Example: -, _, /,
  • Exception: Special entries where a certain level of knowledge and intelligence is required to decide whether to split on the punctuation or not.
    Example: US, U.S, U.S., Dr., let’s

In the following sections, we will see many examples that will clarify these concepts even further.

How to tokenize?

Right, so we have understood what tokenization is and why it is useful. Let us now see how to tokenize a given text corpus in Python.

There are multiple ways to tokenize a given text sequence and different libraries offer multiple methods and functions for it. Let’s go over a few.

Let’s code! 🚀

Text corpus

We will take a popular tweet by Naval Ravikant as our text corpus for this exercise.

Let’s also take another text corpus that involves a few more complexities.

While the first corpus has a few regular sentences, the second corpus has complex details like acronyms, a social media handle, an email address, a website URL, an emoji, and more.
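The exact corpora are not reproduced here, so below is a minimal stand-in sketch with the same flavour: `corpus_1` mimics a short tweet-style text with a few plain sentences, and `corpus_2` packs in the trickier details. The specific wording, handle, email, and URL are placeholders, not the originals.

```python
# Hypothetical stand-ins for the two corpora used in this exercise
# (placeholders, not the exact text from the original post).
corpus_1 = ("Seek wealth, not money or status. Wealth is assets that earn "
            "$10 while you sleep. Money will not free us! Status is your "
            "place in the social hierarchy.")

corpus_2 = ("Dr. Smith moved to the U.S. in 2020! You can reach her at "
            "smith@example.com or visit https://www.example.com/ for more "
            "info 🙂 Follow @drsmith for NLP updates.")
```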

Let’s see how each of the tokenization methods performs for each corpus at hand.

Tokenization with `split()` function

One of the simplest ways to split a text into tokens is Python’s built-in `split()` string method.

Word Tokenization

By default, the split() function “splits” the text into chunks on whitespace characters.
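A minimal sketch, using the hypothetical corpora defined above:

```python
# Split the first corpus on whitespace (the default behaviour of str.split()).
word_tokens = corpus_1.split()
print(word_tokens)
print(len(word_tokens))  # number of whitespace-separated chunks
```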

🔍 Observations:

  • We can observe that it does a good job of splitting the text into individual words.
  • However, punctuation has not been separated from the words, for example: “status.” and “us!”.
  • Also, notice how prefixes like “$” have not been separated from the token in “$10”.

Sentence Tokenization

Let’s try splitting the corpus on full stops.
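For example, with the first corpus:

```python
# Split the first corpus on full stops.
sentence_tokens = corpus_1.split(".")
print(sentence_tokens)
```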

🔍 Observations:

  • All sentences have been separated. However, except for the 1st sentence, all the other sentences have extra whitespace at the beginning.
  • Additionally, the list of sentences has an empty string at the end as well.
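Now let’s split the more complex second corpus on full stops as well:

```python
# Split the complex corpus on full stops; abbreviations, the URL, and the
# email address all contain ".", so they get broken apart too.
print(corpus_2.split("."))
```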

🔍 Observations:

  • We can clearly see that the split() function has not done a good job splitting the text into sentences.
  • “U.S.” has been split across different sentences, and the URL and the email address have been broken up as well.

We clearly need a better solution to this! Let’s now look at some other libraries that will do a better job at tokenization.

Tokenization with NLTK

NLTK is a popular NLP library. It offers some great in-built tokenizers; let’s explore them.

Word tokenization

NLTK offers a bunch of different methods for word tokenization. We will explore the following:

  1. word_tokenize()
  2. TreebankWordTokenizer
  3. WordPunctTokenizer
  4. RegEx

Let’s go through them one by one.

1. word_tokenize()
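A minimal sketch, assuming the hypothetical corpora from above and a one-time download of the tokenizer models:

```python
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize

print(word_tokenize(corpus_1))
print(word_tokenize(corpus_2))
```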

🔍 Observations:

  • word_tokenize() does a good job of tokenizing the individual words and separates the punctuation into its own tokens as well.

Let’s look at some alternative word tokenizers that NLTK offers.

2. TreebankWordTokenizer
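A quick sketch with the first corpus:

```python
from nltk.tokenize import TreebankWordTokenizer

treebank = TreebankWordTokenizer()
print(treebank.tokenize(corpus_1))
```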

🔍 Observations:

  • On closer inspection, we can see that while punctuation like “,” has been split into separate tokens, the “.” remains attached to the words, for example “status.” and “sleep.”.

This is not really a good or a bad thing. It all depends on the context of the problem that we are trying to solve.

Let’s follow through with another NLTK tokenizer.

3. WordPunctTokenizer
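A quick sketch, again with the first corpus:

```python
from nltk.tokenize import WordPunctTokenizer

wordpunct = WordPunctTokenizer()
print(wordpunct.tokenize(corpus_1))
```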

🔍 Observations:

  • WordPunctTokenizer also creates correct word tokens, with separate tokens for punctuation.

4. RegEx

NLTK also offers methods to tokenize text sequences with regular expressions.
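A minimal sketch using RegexpTokenizer with a simple word-only pattern (the pattern here is illustrative, not necessarily the one from the original post):

```python
from nltk.tokenize import RegexpTokenizer

# \w+ keeps only runs of word characters, so punctuation is dropped.
regex_tokenizer = RegexpTokenizer(r"\w+")
print(regex_tokenizer.tokenize(corpus_1))
```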

🔍 Observations:

  • The regex tokenizer excludes the punctuation here, since the pattern only matches word characters.

Sentence tokenization

Sentence tokenization is the process of splitting the text corpus into different sentences.

NLTK offers a few different methods for sentence tokenization as well. We will explore the following:

  1. sent_tokenize()
  2. PunktSentenceTokenizer
  3. Sentence Tokenization on Spanish text

Let’s code! 🚀
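Starting with sent_tokenize(), a minimal sketch on both corpora:

```python
from nltk.tokenize import sent_tokenize

print(sent_tokenize(corpus_1))
print(sent_tokenize(corpus_2))
```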

🔍 Observations:

  • Both text sequences have been successfully tokenized into different sentences.
  • Unlike the split(“.”) method, here the extra whitespace and empty strings have been automatically trimmed, and we are presented with clean sentences.

An alternative NLTK method for sentence tokenization is as follows.

2. PunktSentenceTokenizer
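A minimal sketch with the first corpus:

```python
from nltk.tokenize import PunktSentenceTokenizer

punkt = PunktSentenceTokenizer()
print(punkt.tokenize(corpus_1))
```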

The corpus here has been split into 4 different sentences. However, let’s see if it performs just as well on the complex text sequence.
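A quick check on the complex corpus (reusing the tokenizer created above):

```python
print(punkt.tokenize(corpus_2))
```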

🔍 Observations:

  • The PunktSentenceTokenizer has split the complex text corpus into 5 different sentences because of the multiple occurrences of punctuation.
    Therefore, you should be mindful of this behavior.

3. Sentence Tokenization on Spanish text [Bonus section🤑]

Most of the NLP work today is done for the English language. So here’s a little something for those interested in NLP for different languages.

NLTK offers tokenization methods for different languages as well. One such example is below.
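A minimal sketch (the Spanish sentences below are made up for illustration):

```python
from nltk.tokenize import sent_tokenize

# sent_tokenize accepts a language argument backed by the pre-trained
# Punkt models shipped with NLTK.
spanish_text = "Hola, ¿cómo estás? Hoy hablamos de tokenización. Nos vemos mañana."
print(sent_tokenize(spanish_text, language="spanish"))
```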

Summary

While NLTK offers multiple tokenizers for the same task, its most popular and effective ones are word_tokenize() and sent_tokenize().
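For instance, running word_tokenize() on the complex corpus (again, a sketch using the hypothetical corpus defined earlier) illustrates the shortcomings noted below:

```python
from nltk.tokenize import word_tokenize

# Note how the handle and the URL get broken into several tokens.
print(word_tokenize(corpus_2))
```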

🔍 Observations:

While NLTK does a good job at tokenization, we can observe some shortcomings as well, where it was unable to handle some of the exceptions.

  • It separated “@” from the Twitter handle name.
  • It separated the URL into 3 different tokens: “https”, “:”, and “//www.example.com/”.

Tokenization with spaCy

In this section, we will discuss:

  1. Word Tokenization
  2. Tokens are immutable
  3. Sentence Tokenization
  4. Visualizing Tokens & Entities

We observed that while NLTK offers some good tokenizers, it has some shortcomings that the spaCy library overcomes.

Not only that, but it is also much easier to tokenize with spaCy. Let’s see how!

Upgrade the spaCy library and download the English language model.
Import spaCy and load the language model.
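A minimal sketch of the setup, assuming the small English model en_core_web_sm:

```python
# One-time setup (run in a shell):
#   pip install -U spacy
#   python -m spacy download en_core_web_sm

import spacy

# Load the small English language model.
nlp = spacy.load("en_core_web_sm")
```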

Word Tokenization

Thanks to the internal workings of the spaCy pipeline, the tokens are automatically created by the doc object.
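A minimal sketch with the first corpus:

```python
# Passing the text through the pipeline creates the doc object,
# which is itself a sequence of Token objects.
doc = nlp(corpus_1)

print([token.text for token in doc])
print(len(doc))  # number of tokens in the corpus
```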

🔍 Observations:

  • Calling the tokens is just a single line of code with spaCy.
  • Since the doc object is a collection of all the tokens, its length directly gives us the number of tokens in the corpus.

Let’s tokenize the other corpus we have.
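A quick sketch with the complex corpus:

```python
doc_2 = nlp(corpus_2)
print([token.text for token in doc_2])
```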

🔍 Observations:

  • spaCy very precisely splits the text into tokens including the prefixes, suffixes, infixes, punctuations, and exceptions as well.
  • It knows what a Twitter handle is and therefore does not separate the “@” from the handle name.
  • Similarly, it recognizes the full URL of the website and retains it as a single token.

spaCy handles all of this (and more) under the hood and gives us the tokens in just 1 line of code.

Tokens are immutable

While the doc object behaves like a list of tokens, and list items can normally be modified in Python, spaCy tokens cannot be modified.
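A minimal sketch of what happens if we try (the replacement value is arbitrary):

```python
try:
    doc[0] = "Wealth"  # attempt to overwrite the first token
except TypeError as err:
    # 'spacy.tokens.doc.Doc' object does not support item assignment
    print(err)
```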

🔍 Observations:

  • The error (TypeError: ‘spacy.tokens.doc.Doc’ object does not support item assignment) tells us that the tokens are immutable. This makes sense, as modifying tokens would amount to modifying the original data.

Sentence Tokenization

Sentence Tokenization is also as easy as Word Tokenization with spaCy.

The doc.sents object gives us the sentence tokens. Let’s see how!
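A minimal sketch, reusing the doc objects created earlier:

```python
# Iterate over the sentence spans detected by the pipeline.
for sent in doc.sents:
    print(sent.text)

for sent in doc_2.sents:
    print(sent.text)
```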

🔍 Observations:

  • We can see that spaCy handles both simpler and complex text sequences with such ease.

Summary

In just one line of code, spaCy enables us to create accurate tokens, which is why it is my preferred NLP library for text preprocessing. 🏆

Visualizing Tokens & Entities

spaCy also allows us to visualize the tokens and entities in the corpus for better investigation of the data. It offers us a built-in visualizer called displacy.

It helps us visualize the tokens in the doc object and their relationship with each other.
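A minimal sketch of how to render both views (assuming a Jupyter environment; outside a notebook, displacy.serve can be used instead):

```python
from spacy import displacy

# Dependency view: tokens and their relationships with each other.
displacy.render(doc, style="dep", jupyter=True)

# Entity view: highlights the entities detected in the text.
displacy.render(doc_2, style="ent", jupyter=True)
```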

Token visualization in spaCy with displacy

But what are entities? 🤔
Entities are the most important chunks of a particular sentence such as noun phrases, verb phrases, or both.
We will go over entities in depth in a future discussion.

How tokenization happens under the hood in spaCy

Here’s how the intelligent system of spaCy creates accurate tokens.

Refer to the diagram below.

  1. First, it splits the text on whitespace.
  2. Then it separates the prefixes from these words.
  3. Then it repeats the same with exceptions and suffixes.
  4. In the end, what you get are correct tokens.
Tokenization in spaCy

Therefore, spaCy handles all tokens including prefixes, suffixes, infixes, and exceptions.
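As a quick illustration of this pipeline in action (the sentence below is made up for illustration):

```python
# "Let's" is split by an exception rule, "U.S." is kept whole, and the
# prefix "$" and the suffix "!" are split off as separate tokens.
for token in nlp("Let's fly to the U.S. for $10!"):
    print(token.text)
```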

In conclusion

Today we discussed what tokenization is, why it is important, and how to do it.

We also saw how to tokenize a given corpus with different methods and observed how spaCy is way superior to the rest of the available methods.

Hope you enjoyed this! Feel free to leave your feedback and queries in the comments below. You can also reach out to me on:
