[Hands-On] Build a Tokenizer Using Unigram

Hugman Sangkeun Jung
11 min read · Aug 9, 2024

(You can find the English version of the post at this link.)

In our previous article, we explored the concept of tokenization in natural language processing and specifically looked at BPE, Byte-level BPE, WordPiece, and Unigram tokenization methods. This article is the third in a series on implementing major tokenization methods.

Unigram Tokenization (Image by the author using ChatGPT)

(Some core functions in this tutorial are adapted from the Hugging Face article that explains BPE tokenization well.)

In this article, we will focus on the Unigram tokenization method, exploring its basic concepts and implementation.

Objectives

  • Understand the basic concepts of Unigram tokenization
  • Implement the full workflow, from tokenizing text data to building a small vocabulary
  • Introduce the training algorithm and the word-segmentation method used by Unigram tokenization
  • Perform Unigram tokenization on sample text data

What is Unigram Tokenization?

Unigram tokenization, introduced by Kudo (2018) and used in SentencePiece, treats tokenization as a probabilistic model: every token in the vocabulary is assigned a probability, and the probability of a segmentation x = (x₁, …, xₘ) is the product P(x) = p(x₁) × p(x₂) × … × p(xₘ). Unlike BPE and WordPiece, which build a vocabulary bottom-up by merging frequent pairs, Unigram works top-down: it starts from a large candidate vocabulary and repeatedly removes the tokens whose removal hurts the corpus likelihood the least, until the vocabulary shrinks to the target size. At tokenization time, a word is split into the token sequence with the highest probability, which can be found efficiently with the Viterbi algorithm.
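To make the segmentation step concrete, here is a minimal Python sketch of Viterbi decoding over a toy vocabulary. The tokens and probabilities are made up for illustration; they are not values a trained model would produce, and the full implementation later in this series is more complete:

```python
import math

# Toy unigram vocabulary with made-up probabilities (illustration only;
# a trained model learns these values from a corpus).
token_probs = {
    "h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
    "hu": 0.07, "ug": 0.10, "gs": 0.06, "hug": 0.15,
}

def viterbi_segment(word, token_probs):
    """Return (log-probability, tokens) of the most probable segmentation of `word`."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], tokens achieving it)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(token_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n]

log_prob, tokens = viterbi_segment("hugs", token_probs)
print(tokens, round(log_prob, 3))  # ['hug', 's'] -4.893
```

With these toy probabilities, the split ['hug', 's'] wins over both the character-by-character split and 'hu' + 'gs', because its log-probability is the highest.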
