Understanding Tokenization Methods: A Simple and Intuitive Guide

Hugman Sangkeun Jung
15 min read · Jun 8, 2024

(You can find the Korean version of the post at this link.)

In this post, we will explore the most fundamental concept in natural language processing (NLP) — tokenization. We will delve into the core ideas of various tokenization methods.

In natural language processing, converting text data into a form that a computer can process is called ‘tokenization.’ Tokenization is the process of splitting a sentence or document into smaller units, known as tokens, which models can then process. This step is essential for presenting sequences of symbols in a form machines can understand, whether the model is probabilistic or a neural network.
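To make this concrete, here is a minimal Python sketch, using an invented sentence and the simplest possible whitespace tokenizer rather than any particular library, that splits text into tokens and maps each token to an integer ID, the numeric form a model actually consumes:

```python
# Minimal sketch: split a sentence into tokens, then map tokens to IDs.
sentence = "Tokenization turns text into symbols a model can read"

# The simplest possible tokenizer: split on whitespace.
tokens = sentence.split()

# Build a toy vocabulary that assigns each unique token an integer ID.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Models consume the ID sequence, not the raw string.
ids = [vocab[token] for token in tokens]

print(tokens)
print(ids)
```

Real tokenizers add handling for punctuation, casing, and unknown tokens, but the shape of the pipeline is the same: text in, token IDs out.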

How best to segment sentences or documents for modeling has been a long-standing question in NLP. As a result, several tokenization methods have been developed, each of which may be better suited to particular situations or languages. Below is a brief summary of the main tokenization methods used over time:

Common Tokenization Methods

  1. N-gram: Treats a sequence of n items as a single token. This method is useful for capturing the continuity of items within the text.
  2. Character: Treats individual characters as independent tokens. This method pays less attention to the structural properties of the language but allows for very fine-grained analysis.
  3. Word: Uses ‘words’ separated by spaces or punctuation as tokens. (All three methods are illustrated in the short sketch after this list.)
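The sketch below contrasts the three methods on a toy sentence in plain Python. It is a simplified illustration that ignores punctuation, casing, and Unicode, all of which real tokenizers must handle:

```python
# Toy comparison of n-gram, character, and word tokenization.
sentence = "the cat sat on the mat"
words = sentence.split()

# 1. N-gram: every run of n consecutive items (words here) is one token.
n = 2
ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# 2. Character: every individual character, including spaces, is a token.
chars = list(sentence)

# 3. Word: whitespace-separated words are the tokens.
print(ngrams)    # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print(chars[:7]) # ['t', 'h', 'e', ' ', 'c', 'a', 't']
print(words)     # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Note how the granularity trades off: word tokens are few but brittle to unseen words, character tokens never fall out of vocabulary but produce long sequences, and n-grams sit in between by capturing local continuity.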


Hugman Sangkeun Jung is a professor at Chungnam National University, with expertise in AI, machine learning, NLP, and medical decision support.