Understanding Byte-Pair Encoding

Henry Wu
6 min read · Jan 26, 2024


This post explores the process of Byte-Pair Encoding, from handling raw training text and pre-tokenization to constructing vocabularies and tokenizing new text.


Background

Tokenization is a fundamental process in natural language processing (NLP) that involves breaking down a text sequence into smaller, more manageable units known as tokens. Various tokenization methods exist, each with its approach and applications:

  1. Word Tokenization: This method breaks text into individual words based on a delimiter. It’s straightforward, but it produces very large vocabularies and cannot handle words that never appear in the training data.
  2. Character Tokenization: This method breaks text down into individual characters. While it drastically reduces the vocabulary size, it often fails to capture the semantic meaning of longer word sequences.
  3. Subword Tokenization: This method strikes a balance between word and character tokenization. It breaks text down into subwords, which are larger than individual characters but smaller than whole words. This approach, in principle, keeps frequently used words in their whole form but breaks rare words down into more meaningful subwords.

Byte-pair encoding (BPE) is a popular example of subword tokenization. In the following sections, we will delve into the specifics of BPE.

Byte-Pair Encoding (BPE)

Byte-pair encoding was first introduced in 1994 as a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. It was later adapted [1] for natural language processing, particularly for tokenization. Here, instead of bytes, BPE merges frequent characters or character sequences.

Pre-Tokenization

BPE relies on a pre-tokenization step that first splits the training text into words. For example, consider this sample text [1]:

“low low low low low lower lower newest newest newest newest newest newest widest widest widest”

The pre-tokenization can be as simple as splitting on spaces or a more advanced rule-based method. The key steps are to split the text into words and to append a special end-of-word symbol ‘_’ to each word. This symbol is important because it marks word boundaries, which prevents the algorithm from confusing the end of one word with the start of another.

Following this, we count the frequency of each unique word in the text. Using our example, we would get the following set of unique words along with their respective frequencies:

(low_: 5, lower_: 2, newest_: 6, widest_: 3)
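As a minimal sketch in Python (assuming simple whitespace splitting, and with pretokenize as a purely illustrative helper name), this step could look like:

from collections import Counter

def pretokenize(text):
    # Split on whitespace and append the end-of-word marker '_' to each word
    return [word + "_" for word in text.split()]

corpus = ("low low low low low lower lower "
          "newest newest newest newest newest newest "
          "widest widest widest")

word_freqs = Counter(pretokenize(corpus))
# Counter({'newest_': 6, 'low_': 5, 'widest_': 3, 'lower_': 2})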

BPE uses this set of words and frequencies to construct its vocabulary, which we will discuss next.

Byte-Pair Encoding Algorithm Overview

Byte-pair encoding consists of two main phases:

  1. Vocabulary Construction: This phase takes the set of words along with their frequencies and iteratively constructs a vocabulary of tokens. It does so by repeatedly merging the most frequent pair of adjacent characters or tokens. The target vocabulary size is a parameter that can be adjusted.
  2. Tokenization: This phase uses the constructed vocabulary and the learned merge rules to tokenize new text. It does so by breaking down the new text into the tokens that were identified in the vocabulary construction phase.

In the following sections, we will explore each of these phases in more detail.

Vocabulary Construction

The vocabulary construction involves the following steps, using the set of unique words and their frequencies obtained from pre-tokenization:

(low_: 5, lower_: 2, newest_: 6, widest_: 3)

1. Base Vocabulary Creation

Create the base vocabulary from all unique symbols (characters) present in the word set:

vocabs = (l, o, w, e, r, n, s, t, i, d, _)
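As a small sketch, reusing word_freqs from the pre-tokenization snippet (build_base_vocab is an illustrative name, not a library function):

def build_base_vocab(word_freqs):
    # Collect every unique character that appears in the training words
    return {ch for word in word_freqs for ch in word}

vocab = build_base_vocab(word_freqs)
# {'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', '_'}  (set order may vary)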

2. Represent the Words with Base Vocabs

Express each word as symbols from the base vocabulary:

((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, e, s, t, _): 6, (w, i, d, e, s, t, _): 3)
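In code, this can be a dictionary that maps each word, split into a tuple of symbols, to its frequency. A sketch building on the snippets above:

def to_symbols(word_freqs):
    # Represent each word as a tuple of single-character symbols, keeping its frequency
    return {tuple(word): freq for word, freq in word_freqs.items()}

splits = to_symbols(word_freqs)
# {('l','o','w','_'): 5, ('l','o','w','e','r','_'): 2,
#  ('n','e','w','e','s','t','_'): 6, ('w','i','d','e','s','t','_'): 3}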

3. Vocabulary Merging

Iteratively merge the most frequent pairs of symbols:

  • Merge 1: Merge the most frequent pair (e, s), which occurs 6 + 3 = 9 times, to form the newly merged symbol ‘es’. Update the vocabulary and replace every occurrence of (e, s) with ‘es’:

vocabs = (l, o, w, e, r, n, s, t, i, d, _, es)

((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, es, t, _): 6, (w, i, d, es, t, _): 3)

  • Merge 2: Merge the most frequent pair (es, t), which occurs 6 + 3 = 9 times, to form the newly merged symbol ‘est’. Update the vocabulary and replace every occurrence of (es, t) with ‘est’:

vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est)

((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, est, _): 6, (w, i, d, est, _): 3)

  • Merge 3: Merge the most frequent pair (est, _), which occurs 6 + 3 = 9 times, to form the newly merged symbol ‘est_’. Update the vocabulary and replace every occurrence of (est, _) with ‘est_’:

vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_)

((l, o, w, _): 5, (l, o, w, e, r, _): 2, (n, e, w, est_): 6, (w, i, d, est_): 3)

  • Merge 4: Merge the most frequent pair (l, o), which occurs 5 + 2 = 7 times, to form the newly merged symbol ‘lo’. Update the vocabulary and replace every occurrence of (l, o) with ‘lo’:

vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo)

((lo, w, _): 5, (lo, w, e, r, _): 2, (n, e, w, est_): 6, (w, i, d, est_): 3)

  • Merge 5: Merge the most frequent pair (lo, w), which occurs 5 + 2 = 7 times, to form the newly merged symbol ‘low’. Update the vocabulary and replace every occurrence of (lo, w) with ‘low’:

vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo, low)

((low, _): 5, (low, e, r, _): 2, (n, e, w, est_): 6, (w, i, d, est_): 3)
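The merge loop rests on two operations: counting adjacent symbol pairs, weighted by word frequency, and replacing a chosen pair everywhere it occurs. A rough sketch of both (count_pairs and merge_pair are illustrative names):

from collections import Counter

def count_pairs(splits):
    # Count adjacent symbol pairs, weighted by how often each word occurs
    pairs = Counter()
    for symbols, freq in splits.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, splits):
    # Replace every adjacent occurrence of `pair` with the concatenated symbol
    merged = pair[0] + pair[1]
    new_splits = {}
    for symbols, freq in splits.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_splits[tuple(out)] = freq
    return new_splits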

4. Final Vocabulary & Merge Rules

Continue merging until reaching the desired vocabulary size. The final vocabulary and merge rules after our five merges would be:

vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo, low)

(e, s) → es, (es, t) → est, (est, _) → est_, (l, o) → lo, (lo, w) → low
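Putting the pieces together, a toy training loop over the helpers sketched above might look like this (ties between equally frequent pairs are broken arbitrarily here, so the exact merge order is not guaranteed):

def train_bpe(word_freqs, num_merges):
    splits = to_symbols(word_freqs)          # from the earlier sketches
    vocab = build_base_vocab(word_freqs)
    merge_rules = []
    for _ in range(num_merges):
        pairs = count_pairs(splits)
        if not pairs:
            break                            # nothing left to merge
        best = pairs.most_common(1)[0][0]    # most frequent adjacent pair
        splits = merge_pair(best, splits)
        merge_rules.append(best)
        vocab.add(best[0] + best[1])
    return vocab, merge_rules

vocab, merge_rules = train_bpe(word_freqs, num_merges=5)
# merge_rules: [('e', 's'), ('es', 't'), ('est', '_'), ('l', 'o'), ('lo', 'w')]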

Tokenize New Text

The tokenization procedure in Byte-Pair Encoding uses the constructed vocabulary and the learned merge rules to tokenize new text. Let’s tokenize a sample new text:

“newest binded lowers”

1. Pre-Tokenization for New Text

Similar to the initial step, we pre-tokenize the new text by splitting it into words and appending the end-of-word symbol ‘_’. The pre-tokenized text is:

(newest_, binded_, lowers_)
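With the pretokenize sketch from earlier, this is simply:

pretokenize("newest binded lowers")
# ['newest_', 'binded_', 'lowers_']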

2. Apply Merge Rules

We first break down the pre-tokenized text into characters:

((n, e, w, e, s, t, _), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _))

Then, we apply the merge rules in the order they were learned. In our case, the order is:

(e, s) → es, (es, t) → est, (est, _) → est_, (l, o) → lo, (lo, w) → low

  • Apply Merge Rule (e, s) → es:

((n, e, w, es, t, _), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _))

  • Apply Merge Rule (es, t) → est:

((n, e, w, est, _), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _))

  • Apply Merge Rule (est, _) → est_:

((n, e, w, est_), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _))

  • Apply Merge Rule (l, o) → lo:

((n, e, w, est_), (b, i, n, d, e, d, _), (lo, w, e, r, s, _))

  • Apply Merge Rule (lo, w) → low:

((n, e, w, est_), (b, i, n, d, e, d, _), (low, e, r, s, _))

Any remaining symbol that is not in the vocabulary is replaced by the unknown token “[UNK]”. Here, ‘b’ does not appear in the vocabulary:

vocabs = (l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo, low)

((n, e, w, est_), ([UNK], i, n, d, e, d, _), (low, e, r, s, _))

3. Result of Tokenization

The new text is tokenized into the following sequence:

“newest binded lowers” =

[n, e, w, est_, [UNK], i, n, d, e, d, _, low, e, r, s, _]

Through this process, BPE tokenizes new text using the vocabulary and the learned merge rules, falling back to the unknown token for any symbol it has never seen.
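A minimal sketch of this procedure, reusing pretokenize, vocab, and merge_rules from the snippets above (apply_merge is a per-word variant of the merge helper shown earlier):

def apply_merge(pair, symbols):
    # Replace adjacent occurrences of `pair` within a single word's symbol list
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def tokenize(text, vocab, merge_rules, unk="[UNK]"):
    tokens = []
    for word in pretokenize(text):
        symbols = list(word)                 # start from single characters
        for pair in merge_rules:             # apply merges in the order they were learned
            symbols = apply_merge(pair, symbols)
        tokens.extend(s if s in vocab else unk for s in symbols)
    return tokens

tokenize("newest binded lowers", vocab, merge_rules)
# ['n', 'e', 'w', 'est_', '[UNK]', 'i', 'n', 'd', 'e', 'd', '_', 'low', 'e', 'r', 's', '_']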

Implementation Note: The description of the Byte-Pair Encoding (BPE) algorithm here is for clarity and understanding. In practice, BPE can be implemented in a much more optimized way compared to the step-by-step method described.

References

[1] Rico Sennrich, Barry Haddow, and Alexandra Birch. “Neural Machine Translation of Rare Words with Subword Units.” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016.