What is Tokenization?

Ezgi Gökdemir
Published in SabancıDx
3 min read · Dec 24, 2023

What are tokens and tokenization, and how can these concepts be applied in practice? We will touch on these topics in this article.

Before we start: I touched on some basic concepts in an earlier article, which you can read first :)

In every NLP application, we need to apply some processing to the text. In this way, we transform our data into a more predictable and analysable form that the computer can understand. Tokenization is one of these processes, and it is the one we will focus on in this article.

Tokenization is the process of dividing a piece of text into smaller units called tokens. These tokens can be words, subwords or characters.

Tokens serve as the fundamental units in Natural Language Processing (NLP). The preprocessing and analysis of textual data often occur at the token level. This token-level processing is a critical step in various NLP tasks such as sentiment analysis.

Now let’s work through some examples. In the examples here, I used the NLTK library.

  • Sentence tokenization

Here we simply divided the text into sentences.

Sentence Tokenization
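
A minimal sketch of how this looks with NLTK’s sent_tokenize (the sample text here is just my own illustration):

```python
import nltk
from nltk.tokenize import sent_tokenize

# The Punkt sentence model needs to be downloaded once.
nltk.download("punkt")

text = "Tokenization is a core NLP step. It splits text into smaller units. Let's see how it works!"

# Split the text into a list of sentences.
print(sent_tokenize(text))
# ['Tokenization is a core NLP step.', 'It splits text into smaller units.', "Let's see how it works!"]
```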
  • Word tokenization

Word tokenization is one of the most widely used tokenization algorithms. It breaks a piece of text into individual words.

Word Tokenization
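
A minimal sketch with NLTK’s word_tokenize (again, the sample sentence is only an illustration). Note that the punctuation marks become separate tokens:

```python
from nltk.tokenize import word_tokenize

text = "Have a nice day. I hope tomorrow will be better."

# Split the text into word-level tokens; punctuation is separated out.
print(word_tokenize(text))
# ['Have', 'a', 'nice', 'day', '.', 'I', 'hope', 'tomorrow', 'will', 'be', 'better', '.']
```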

We can also use the split function to tokenize a sentence into words, but it’s important to note that the split function may not handle all cases perfectly, especially when dealing with punctuation marks.

As you can see from the result below, the dot sign (.) stayed attached to the word “day”, and the same happened with the word “better”.

Word Tokenization With Split Function
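
A minimal sketch using Python’s built-in split function on the same illustrative sentence. Notice how the periods stay attached to their neighbouring words:

```python
text = "Have a nice day. I hope tomorrow will be better."

# split() only breaks on whitespace, so punctuation sticks to the words.
print(text.split())
# ['Have', 'a', 'nice', 'day.', 'I', 'hope', 'tomorrow', 'will', 'be', 'better.']
```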

Natural languages often contain ambiguous words or expressions. For example, “I can’t” can be tokenized as “I” and “can’t” or as “I” and “ca” and “n’t.” Resolving such ambiguities might require additional context or language-specific rules.

In the example below, we are able to make this distinction thanks to NLTK.

Word Tokenization
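
A minimal sketch of this behaviour (the sentence is my own illustration). NLTK’s word_tokenize splits the contraction into “ca” and “n’t”:

```python
from nltk.tokenize import word_tokenize

text = "I can't go there."

# The contraction "can't" is split into "ca" and "n't".
print(word_tokenize(text))
# ['I', 'ca', "n't", 'go', 'there', '.']
```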
  • Character tokenization
Character Tokenization
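
A minimal sketch, assuming we simply want a list of characters; plain Python is enough here (the sample text is my own):

```python
text = "Have a nice day."

# Every character, including spaces and punctuation, becomes a token.
print(list(text))
# ['H', 'a', 'v', 'e', ' ', 'a', ' ', 'n', 'i', 'c', 'e', ' ', 'd', 'a', 'y', '.']
```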

This approach essentially treats each character as a separate token. Character tokens may lack semantic meaning, making it more challenging to understand the overall meaning of a text. While beneficial for certain tasks, character tokenization may not be ideal for tasks relying on word-level semantics.

In this article, I tried to cover the concept of tokenization at an introductory level. I hope you find it useful. Happy reading :)
