[Hands-On] Build a Tokenizer Using Unigram

Hugman Sangkeun Jung
11 min read · Aug 9, 2024

(You can find the English version of the post at this link.)

In our previous article, we explored the concept of tokenization in natural language processing and specifically looked at BPE, Byte-level BPE, WordPiece, and Unigram tokenization methods. This article is the third in a series on implementing major tokenization methods.

Unigram Tokenization (Image by the author using ChatGPT)

(Some core functions in this tutorial are adapted from the Hugging Face article that explains BPE tokenization well.)

In this article, we will focus on the Unigram tokenization method, exploring its basic concepts and implementation.

Objectives

  • Understand the basic concepts of Unigram tokenization
  • Implement the full workflow, from tokenizing text data to building a small vocabulary
  • Introduce the training algorithm and the word-segmentation method used by Unigram tokenization
  • Perform Unigram tokenization on sample text data

What is Unigram Tokenization?

Unigram tokenization, introduced by Kudo (2018) and used in SentencePiece, treats tokenization as a probabilistic model: every token in the vocabulary is assigned a probability, and the probability of a segmentation x = (x₁, …, xₘ) is the product P(x) = p(x₁) × p(x₂) × … × p(xₘ). Unlike BPE and WordPiece, which build a vocabulary bottom-up by merging frequent pairs, Unigram works top-down: it starts from a large candidate vocabulary and repeatedly removes the tokens whose removal hurts the corpus likelihood the least, until the vocabulary shrinks to the target size. At tokenization time, a word is split into the token sequence with the highest probability, which can be found efficiently with the Viterbi algorithm.
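To make the segmentation step concrete, here is a minimal Python sketch of Viterbi decoding over a toy vocabulary. The tokens and probabilities are made up for illustration; they are not values a trained model would produce, and the full implementation later in this series is more complete:

```python
import math

# Toy unigram vocabulary with made-up probabilities (illustration only;
# a trained model learns these values from a corpus).
token_probs = {
    "h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
    "hu": 0.07, "ug": 0.10, "gs": 0.06, "hug": 0.15,
}

def viterbi_segment(word, token_probs):
    """Return (log-probability, tokens) of the most probable segmentation of `word`."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], tokens achieving it)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(token_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n]

log_prob, tokens = viterbi_segment("hugs", token_probs)
print(tokens, round(log_prob, 3))  # ['hug', 's'] -4.893
```

With these toy probabilities, the split ['hug', 's'] wins over both the character-by-character split and 'hu' + 'gs', because its log-probability is the highest.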
