[Hands-On] Build Tokenizer using BPE (Byte Pair Encoding)

Hugman Sangkeun Jung
8 min read · Jul 20, 2024

(You can find the Korean version of the post at this link.)

BPE Tokenization (Image by the author using ChatGPT)

In the previous post, we explored tokenization in natural language processing, specifically the concepts behind BPE, Byte-level BPE, WordPiece, and Unigram tokenization. This post is the first in a hands-on series on implementing the major tokenization methods.

(Some of the key functions in this tutorial are adapted from the Hugging Face article, which explains BPE tokenization well.)

In this post, we’ll cover the basic concepts of BPE (Byte Pair Encoding) tokenization and how to implement it. BPE splits text into smaller subword units, which lets a tokenizer handle rare words effectively: a word never seen during training can still be broken into known pieces.
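
To make this concrete before we build the full tokenizer, here is a minimal sketch of the core BPE training loop: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. The toy corpus, the helper names (`get_pair_counts`, `merge_pair`), and the number of merges are illustrative assumptions rather than the exact code developed later in the series; for simplicity, the sketch also ignores word frequencies and end-of-word markers.

```python
from collections import Counter

# Toy corpus (assumption): each word counts once; real BPE weights
# pairs by how often each word appears in the training data.
corpus = ["low", "lower", "newest", "widest"]

# Start by representing each word as a sequence of single characters.
splits = {word: list(word) for word in corpus}

def get_pair_counts(splits):
    """Count how often each adjacent symbol pair occurs across all words."""
    pairs = Counter()
    for symbols in splits.values():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(pair, splits):
    """Merge every occurrence of `pair` into a single new symbol."""
    a, b = pair
    for word, symbols in splits.items():
        i, merged = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # fuse the pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[word] = merged
    return splits

num_merges = 5  # assumption: a handful of merges is enough to illustrate
merges = []
for _ in range(num_merges):
    pairs = get_pair_counts(splits)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    splits = merge_pair(best, splits)
    merges.append(best)

print(merges)            # learned merge rules, in order
print(splits["newest"])  # e.g. ['n', 'e', 'w', 'est'], depending on tie-breaking
```

The learned merge rules are the whole "model": to tokenize a new word, you split it into characters and replay the merges in the same order, which is exactly the word-splitting procedure we implement later in this series.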

Objectives

  1. Understand the basic concepts of BPE tokenization
  2. Tokenize text data and build a small vocabulary
  3. Walk through the BPE training algorithm and its word-splitting procedure
  4. Perform BPE tokenization on example…

