[Hands-On] Build a Tokenizer using Unigram
In our previous article, we explored the concept of tokenization in natural language processing and specifically looked at BPE, Byte-level BPE, WordPiece, and Unigram tokenization methods. This article is the third in a series on implementing major tokenization methods.
(Some of the core functions in this tutorial were adapted from the Hugging Face article that explains BPE tokenization well.)
In this article, we will focus on the Unigram tokenization method, exploring its basic concepts and implementation.
Objectives
- Understand the basic concepts of Unigram tokenization
- Implement the entire workflow from text data tokenization to building a small vocabulary
- Introduce the training algorithm and word segmentation method of Unigram tokenization (a short sketch of the segmentation step follows this list)
- Perform Unigram tokenization on example text data
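Before diving into the implementation, here is a minimal preview of the core idea: given a vocabulary in which every token has a probability, Unigram tokenization picks the segmentation of a word whose tokens have the highest combined probability. The sketch below is illustrative only; the vocabulary `toy_vocab` is hand-picked with made-up probabilities (a real Unigram tokenizer learns these from a corpus, as we will do later in this article), and `viterbi_segment` is a hypothetical helper that finds the best segmentation with a simple Viterbi-style dynamic program.

```python
import math

# Toy unigram vocabulary: token -> probability.
# These values are made up for illustration; a trained Unigram model
# estimates them from a corpus (e.g. with an EM procedure).
toy_vocab = {
    "h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
    "hu": 0.10, "ug": 0.15, "gs": 0.05, "hug": 0.20,
}

def viterbi_segment(word, vocab):
    """Return the most probable segmentation of `word` under a unigram model.

    best[i] holds (score, start) for the best segmentation of word[:i],
    where score is the sum of token log-probabilities.
    """
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the word cannot be segmented with this vocabulary
    # Backtrack to recover the winning tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return list(reversed(tokens))

print(viterbi_segment("hugs", toy_vocab))  # ['hug', 's'] under the toy probabilities above
```

The segmentation "hug" + "s" wins here because its combined log-probability is higher than alternatives such as "hu" + "gs"; the rest of the article builds the training side that produces such a vocabulary in the first place.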