Byte-level BPE, a universal tokenizer but…

Pierre Guillou
9 min read · Jul 3, 2020


In this study, we will see that, while a BBPE (Byte-level Byte-Pair-Encoding) tokenizer trained on a huge monolingual corpus can indeed tokenize any word of any language (there is no unknown token), it requires on average almost 70% more tokens when applied to a text in a language different from the one used for its training. This information is key when it comes to choosing a tokenizer to train a natural language model like a Transformer model.

In this post, you will find the main code organized by paragraph. To get the full code, just download the study notebook Byte-level-BPE_universal_tokenizer_but.ipynb from GitHub (nbviewer version).

Other posts in the GPT-2 series: (NLP & fastai) GPT-2 | Faster than training from scratch — Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 (practical case with Portuguese)

What is a tokenizer?

Just read the great tutorial “Tokenizer summary” from Sylvain Gugger (Hugging Face)!

About the Byte-level BPE (BBPE) tokenizer

From the tutorial “Tokenizer summary”, read the paragraphs Byte-Pair Encoding and Byte-level BPE to get the best overview of a Byte-level BPE (Byte-level Byte-Pair-Encoding) and read the Abstract and Conclusion paragraphs of the original paper: Neural Machine Translation with Byte-Level Subwords (Facebook AI, 12/05/2019).

[Abstract] Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters from noisy text or character-rich languages such as Japanese and Chinese however can unnecessarily take up vocabulary slots and limit its compactness. Representing text at the level of bytes and using the 256 byte set as vocabulary is a potential solution to this issue. High computational cost has however prevented it from being widely deployed or used in practice. In this paper, we investigate byte-level subwords, specifically byte-level BPE (BBPE), which is compacter than character vocabulary and has no out-of-vocabulary tokens, but is more efficient than using pure bytes only is. We claim that contextualizing BBPE embeddings is necessary, which can be implemented by a convolutional or recurrent layer. Our experiments show that BBPE has comparable performance to BPE while its size is only 1/8 of that for BPE. In the multilingual setting, BBPE maximizes vocabulary sharing across many languages and achieves better translation quality. Moreover, we show that BBPE enables transferring models between languages with non-overlapping character sets.

[Conclusion] We proposed BBPE which builds a byte-level subword vocabulary for machine translation. It results in a much more compact vocabulary than character-based ones do without the loss of performance. In multilingual settings, the former often outperforms the latter. BBPE does not have any out-of-vocabulary tokens, allowing us to transfer a model using BBPE between languages with non-overlapping vocabularies. This transfer learning paradigm is actually very generic and can be applied to any languages and datasets for performance gain or training acceleration. With the same vocabulary size, BBPE segments sentences into shorter sequences than character-based methods do, leading to faster training and inference. Our future work includes: eliminating source-target sentence length imbalance; evaluating BBPE in one-to-many and many-to-many translation settings; exploring better segmentation algorithms for byte-level subwords.
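
To make the "no out-of-vocabulary token" property concrete, here is a minimal sketch in plain Python (not from the paper): any character of any language decomposes into UTF-8 bytes, and since all 256 possible byte values belong to the base vocabulary of a byte-level tokenizer, there is always a valid segmentation.

# Minimal sketch: every character falls back to UTF-8 bytes,
# and each byte value (0..255) is always in the BBPE base vocabulary
for ch in ['a', 'ç', '中', '😀']:
    utf8_bytes = ch.encode('utf-8')
    print(f'{ch!r} -> {len(utf8_bytes)} byte(s): {list(utf8_bytes)}')

# 'a' -> 1 byte(s): [97]
# 'ç' -> 2 byte(s): [195, 167]
# '中' -> 3 byte(s): [228, 184, 173]
# '😀' -> 4 byte(s): [240, 159, 152, 128]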

About the tokenizers and NLP libraries used in this study

The tokenizers and NLP libraries used to perform this study were:

  • English pre-trained GPT2 tokenizer (GPT2TokenizerFast) from the Transformers library (Hugging Face, version 3.0.0): it is a Fast GPT-2 BBPE tokenizer (backed by Hugging Face’s tokenizers library)
  • Portuguese trained ByteLevelBPETokenizer tokenizer from the Tokenizers library (Hugging Face, version 0.8.0)
  • Deep Learning library fastai v2 (fastai2, version 0.0.17) and the Wikipedia downloading functions of the file nlputils_fastai2.py

Note about the Deep Learning libraries: the Tokenizers and Transformers libraries from Hugging Face are today the most up-to-date NLP (Natural Language Processing) libraries used all over the world, while fastai v2 is a great tool for training Deep Learning models, especially with powerful fastai tools like the Learning rate finder, Mixed precision training, Distributed training, Gradual unfreezing, Differential learning rates and the 1cycle policy.

Note about the choice of the GPT2 tokenizer: we could have chosen another pre-trained BBPE tokenizer for this study. The key point is to use a BBPE tokenizer trained on a huge corpus, because it can then tokenize any word of any language without resorting to the unknown token.
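
As a quick illustration of this point (a sketch assuming the Transformers library is installed), the English pre-trained GPT-2 tokenizer can encode words from any script and decode them back losslessly: nothing is ever mapped to an unknown token, the words are simply split into more byte-level pieces.

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained('gpt2')
for word in ['coração', 'Neoclassicismo', '建築家']:
    ids = tok.encode(word)
    # lossless round-trip: no unknown token, just more (byte-level) pieces
    print(word, '->', len(ids), 'tokens ->', tok.decode(ids))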

Initialization

from fastai2.text.all import *
from nlputils_fastai2 import *

# Get config of fastai2 paths
config = Config()

# setup new path_data and create the corresponding folder
lang = 'pt'
name = f'{lang}_wiki'
data_path = config['data_path']
path_data = data_path/name
path_data.mkdir(exist_ok=True, parents=True)

Download Wikipedia in Portuguese with fastai

In Wikimedia Downloads, you will find the dump of the Portuguese Wikipedia, which has 1,037,991 articles at the date of the study (July 3, 2020).

By selecting those with a minimum text length of 1,800, we downloaded 20% of these articles (204,315 files), which represent about 200 million words for a total size of 1.6 GB.

Note: all the following methods come from the fastai file nlputils_fastai2.py. We also tried to use the nlp library from Hugging Face to download the Portuguese Wikipedia, but we faced an unsolved issue (see the notebook).

# Download Wikipedia in Portuguese (zip of 1.62 GB)
# duration: 40m 30s
get_wiki(path_data,lang)

# Split global download file to one article by text file
dest = split_wiki(path_data,lang)

# Size of downloaded data
num_files, num_tokens = get_num_tokens(dest)
print(f'{num_files} files - {num_tokens} tokens')
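
The length filter mentioned above boils down to something like the following sketch (this is not the nlputils_fastai2 code, just an illustration assuming the length is measured in characters and that there is one article per .txt file under dest):

# Sketch of the length filter: keep only articles with at least 1,800 characters
min_len = 1800
kept = [f for f in dest.glob('*.txt')
        if len(f.read_text(encoding='utf-8')) >= min_len]
print(f'{len(kept)} articles kept (>= {min_len} characters)')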

Get text and csv files of Portuguese Wikipedia articles

The text file (all the articles in one file) will be used to train the Portuguese tokenizer, and the csv file will facilitate the tests of the study.

# Create text and csv files of Wikipedia in Portuguese
dest = path_data/'docs'

# Text file
get_one_clean_file(dest,lang)

# csv file
get_one_clean_csv_file(dest,lang)
First articles from downloaded Portuguese Wikipedia

Byte-level BPE (BBPE) tokenizers from Transformers and Tokenizers (Hugging Face libraries)

We are following 3 steps in order to get 2 identical GPT2 tokenizers, one trained on an English corpus and the other on Wikipedia in Portuguese:
1. Get the pre-trained GPT2 Tokenizer (pre-trained on an English corpus) from the Transformers library (Hugging Face): it will give us the tokenizer structure we need and the pre-trained English tokenizer.
2. Train a Byte-level BPE (BBPE) Tokenizer on the Portuguese Wikipedia corpus by using the Tokenizers library (Hugging Face): this will give us the vocabulary files of our GPT2 tokenizer in Portuguese.
3. Import the tokenizer config files in Portuguese into the pre-trained GPT2 Tokenizer: it will give us a tokenizer structure with the vocab in Portuguese.

One relevant point is that we trained our Portuguese Byte-level BPE tokenizer on Portuguese Wikipedia (here, 1.6 GB) in only 2min 7s. Thanks Hugging Face!

# Byte Level BPE (BBPE) tokenizers from Transformers and Tokenizers (Hugging Face libraries)

# 1. Get the pre-trained GPT2 Tokenizer (pre-training with an English corpus)
from transformers import GPT2TokenizerFast

pretrained_weights = 'gpt2'
tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tokenizer_en.pad_token = tokenizer_en.eos_token

# 2. Train a Byte Level BPE (BBPE) tokenizer on the Portuguese Wikipedia

# Get GPT2 tokenizer_en vocab size
ByteLevelBPE_tokenizer_pt_vocab_size = tokenizer_en.vocab_size
ByteLevelBPE_tokenizer_pt_vocab_size

# ByteLevelBPETokenizer Represents a Byte-level BPE as introduced by OpenAI with their GPT-2 model
from tokenizers import ByteLevelBPETokenizer

ByteLevelBPE_tokenizer_pt = ByteLevelBPETokenizer()

# Get list of paths to corpus files
paths = [str(path_data/'all_texts_ptwiki.txt')]

# Customize training with <|endoftext|> special GPT2 token
ByteLevelBPE_tokenizer_pt.train(files=paths,
                                vocab_size=ByteLevelBPE_tokenizer_pt_vocab_size,
                                min_frequency=2,
                                special_tokens=["<|endoftext|>"])

# Get sequence length max of 1024
ByteLevelBPE_tokenizer_pt.enable_truncation(max_length=1024)

# save tokenizer
ByteLevelBPE_tokenizer_pt_rep = 'ByteLevelBPE_tokenizer_pt'
path_to_ByteLevelBPE_tokenizer_pt_rep = path_data/ByteLevelBPE_tokenizer_pt_rep
if not path_to_ByteLevelBPE_tokenizer_pt_rep.exists():
    path_to_ByteLevelBPE_tokenizer_pt_rep.mkdir(exist_ok=True, parents=True)
ByteLevelBPE_tokenizer_pt.save_model(str(path_to_ByteLevelBPE_tokenizer_pt_rep))

# 3. Import the tokenizer config files in Portuguese into the pre-trained GPT2 Tokenizer

# Get the path to ByteLevelBPE_tokenizer_pt config files
ByteLevelBPE_tokenizer_pt_rep = 'ByteLevelBPE_tokenizer_pt'
path_to_ByteLevelBPE_tokenizer_pt_rep = path_data/ByteLevelBPE_tokenizer_pt_rep

# import the pre-trained GPT2TokenizerFast tokenizer with the tokenizer_pt config files
tokenizer_pt = GPT2TokenizerFast.from_pretrained(
    str(path_to_ByteLevelBPE_tokenizer_pt_rep),
    pad_token='<|endoftext|>')

# Get sequence length max of 1024
tokenizer_pt.model_max_length = 1024
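
As a quick sanity check (a sketch, not in the original notebook), both tokenizers now share the same structure and vocabulary size, but they segment a Portuguese word very differently:

# Both vocabulary sizes are expected to be identical, since the Portuguese
# tokenizer was trained with the English tokenizer's vocab size
print(tokenizer_en.vocab_size, tokenizer_pt.vocab_size)

# The Portuguese tokenizer should split a Portuguese word into fewer subwords
word = 'arquitectónico'
print('(en)', tokenizer_en.tokenize(word))
print('(pt)', tokenizer_pt.tokenize(word))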

English pre-trained tokenizer on a text in 3 languages (en, pt, fr)

Let’s tokenize a text in 3 different languages (English, Portuguese, French) with the English pre-trained tokenizer in order to observe the number of tokens required in each case.

# English pre-trained tokenizer on a text in 3 languages (en, pt, fr)

# text in 3 languages to be tokenized
text_en = 'Jacques-Germain Soufflot (Irancy, July 22, 1713 - Paris, August 29, 1780) was a French architect, initiator of the architectural style of Neoclassicism.'
text_pt = 'Jacques-Germain Soufflot (Irancy, 22 de julho de 1713 — Paris, 29 de agosto de 1780) foi um arquitecto francês, iniciador do estilo arquitectónico do Neoclassicismo.'
text_fr = 'Jacques-Germain Soufflot (Irancy, 22 juillet 1713 - Paris, 29 août 1780) était un architecte français, initiateur du style architectural du néoclassicisme.'

langs = ['en', 'pt', 'fr']
texts = [text_en,text_pt,text_fr]

for lang, text in zip(langs, texts):
    print(f'({lang}) {TitledStr(text)}\n')

# number and list of classical tokens (i.e., tokens separated by a blank)
for lang, text in zip(langs, texts):
    print(f'({lang} - {len(text.split())} tokens) {TitledStr(text.split(" "))}\n')

# number and list of tokens
# after the text tokenization by imported BPE GPT2TokenizerFast (trained with an English corpus...)
for lang, text in zip(langs, texts):
    toks = tokenizer_en.tokenize(text)
    print(f'({lang} - {len(toks)} tokens) {TitledStr(toks)}\n')

# number and list of tokens ids
# after the text tokenization + numerization by imported BPE GPT2TokenizerFast (trained with an English corpus...)
for lang, text in zip(langs, texts):
    toks_ids = tokenizer_en.encode(text)
    print(f'({lang} - {len(toks_ids)} tokens) {TitledStr(toks_ids)}\n')

# decode (back to the text)
for lang, text in zip(langs, texts):
    toks_ids = tokenizer_en.encode(text)
    text_decoded = tokenizer_en.decode(toks_ids)
    print(f'({lang}) {TitledStr(text_decoded)}\n')

# graph
# source: https://matplotlib.org/3.2.1/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py
text_split = list()
toks_split = list()

for text in texts:
    text_split.append(len(text.split()))
    toks_ids = tokenizer_en.encode(text)
    toks_split.append(len(toks_ids))

labels = langs
xy = list(np.array([1.,2.,3.]) - 0.2)
xz = list(np.array([1.,2.,3.]) + 0.2)
y = text_split
z = toks_split

ax = plt.subplot(111)
ax.bar(xy, y, width=0.4, color='b', align='center')
ax.bar(xz, z, width=0.4, color='g', align='center')

ax.set_xlabel('languages')
ax.set_xticks(range(1,len(labels)+1))
ax.set_xticklabels(labels)
ax.set_ylabel('number of tokens')
ax.legend(['split(" ")', 'GPTTokenizerFast (en)'])

ax.set_title('Number of tokens by tokenization method and lang')

plt.show()
Number of tokens by tokenization method and lang

As we can see, even though a GPT2TokenizerFast trained on an English corpus can tokenize any text in any language, it was optimized for English: the number of generated tokens is lower for an English text than for the same text in another language with an equivalent number of words.

In our example,

  • English: 44 tokens (22 words)
  • French: 53 tokens (21 words), i.e. about 20% more
  • Portuguese: 62 tokens (25 words), i.e. about 40% more
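
These percentages follow directly from the token counts above (a quick check):

# number of tokens relative to English
n_en, n_fr, n_pt = 44, 53, 62
print(f'fr: +{(n_fr/n_en - 1)*100:.0f}%')  # about +20%
print(f'pt: +{(n_pt/n_en - 1)*100:.0f}%')  # about +41% (the ~40% quoted above)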

English vs Portuguese tokenizer on Portuguese Wikipedia

In this paragraph, we will compare the number of tokens per Portuguese text generated respectively by the English and Portuguese tokenizers.

# English vs Portuguese tokenizer on Portuguese Wikipedia

lang = 'pt'
fname = f'all_texts_{lang}wiki.csv'
df = pd.read_csv(path_data/fname)

df2 = df.copy()

tokens_en_list = list()
num_token_by_word_en_list = list()
tokens_pt_list = list()
num_token_by_word_pt_list = list()

for index, row in df2.iterrows():
    text = row['text']

    tokens_en = tokenizer_en.encode(text)
    tokens_pt = tokenizer_pt.encode(text)

    tokens_en_list.append(tokens_en)
    tokens_pt_list.append(tokens_pt)

    length_text = len(text.split())
    tokens_by_word_en = len(tokens_en)/length_text
    tokens_by_word_pt = len(tokens_pt)/length_text

    num_token_by_word_en_list.append(tokens_by_word_en)
    num_token_by_word_pt_list.append(tokens_by_word_pt)

df2['tokens_en'] = tokens_en_list
df2['num_token_by_word_en'] = num_token_by_word_en_list
df2['tokens_pt'] = tokens_pt_list
df2['num_token_by_word_pt'] = num_token_by_word_pt_list

# check min
num_token_by_word_en_min = df2.num_token_by_word_en.min()
num_token_by_word_pt_min = df2.num_token_by_word_pt.min()
print('(en)',round(num_token_by_word_en_min,2))
print('(pt)',round(num_token_by_word_pt_min,2))

# check max
num_token_by_word_en_max = df2.num_token_by_word_en.max()
num_token_by_word_pt_max = df2.num_token_by_word_pt.max()
print('(en)',round(num_token_by_word_en_max,2))
print('(pt)',round(num_token_by_word_pt_max,2))

# check mean
num_token_by_word_en_mean = df2.num_token_by_word_en.mean()
num_token_by_word_pt_mean = df2.num_token_by_word_pt.mean()
print('(en)',round(num_token_by_word_en_mean,2))
print('(pt)',round(num_token_by_word_pt_mean,2))

# check increase rate and Multiplier coefficient
increase = 0.
multiplier = 0.

for tok_en, tok_pt in zip(tokens_en_list, tokens_pt_list):
    increase += (len(tok_en)-len(tok_pt))/len(tok_pt)
    multiplier += len(tok_en)/len(tok_pt)

# Rate of increase in % from pt to en
increase_pct = increase / len(tokens_en_list)
print('Rate of increase:',round(increase_pct*100,2),'%')

# Multiplier coefficient = (Rate of increase in %, converted to number) + 1
multiplier_coef = round(increase_pct+1,2)
print('Multiplier coefficient:',multiplier_coef)

# Multiplier coefficient in % = Multiplier coefficient, converted to %
multiplier_pct = round((multiplier/len(tokens_en_list))*100,2)
print('Multiplier coefficient in %:',multiplier_pct,'%')

# graph
len_tokens_text_list = list()
for index, row in df2.iterrows():
    text = row['text']
    length_text = len(text.split())
    len_tokens_text_list.append(length_text)

tokens_en_list = df2.tokens_en.tolist()
len_tokens_en_list = [len(t) for t in tokens_en_list]

tokens_pt_list = df2.tokens_pt.tolist()
len_tokens_pt_list = [len(t) for t in tokens_pt_list]

sorted_len_tokens_text_list = sorted(len_tokens_text_list)
y_len_tokens_en_list = (12*np.array(sorted_len_tokens_text_list)).tolist()
y_len_tokens_pt_list = (7*np.array(sorted_len_tokens_text_list)).tolist()

ax = plt.subplot(111)
ax.scatter(len_tokens_text_list, len_tokens_en_list)
ax.plot(sorted_len_tokens_text_list, y_len_tokens_en_list)
ax.scatter(len_tokens_text_list, len_tokens_pt_list)
ax.plot(sorted_len_tokens_text_list, y_len_tokens_pt_list)

ax.set_xlabel('length of texts')
ax.set_ylabel('length of en and pt tokens')
ax.legend(['en', 'pt'])

ax.set_title('Number of tokens by tokenization method')

plt.show()
List and number of tokens by Portuguese Wikipedia article respectively with the English and Portuguese tokenizers

On average, when a Portuguese word is tokenized with 2.25 tokens by the English tokenizer, it is tokenized with only 1.36 tokens by the Portuguese one: an increase rate of 66%!
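
The 66% figure is the average of the per-article increase rates computed above; taking the ratio of the two mean values gives roughly the same number:

# ratio of the mean tokens-per-word values (en vs pt)
tokens_per_word_en, tokens_per_word_pt = 2.25, 1.36
print(f'{(tokens_per_word_en/tokens_per_word_pt - 1)*100:.1f}%')  # about 65%, close to the 66% computed article by article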

Comparison between the number of tokens by tokenization method (en vs pt)

As we can see from the graph, using a BBPE tokenizer trained on an English corpus to tokenize a corpus in another language (here, Portuguese) requires on average around 70% more tokens (66% exactly) than a BBPE tokenizer trained on the same language as the corpus it is applied to.

About the author: Pierre Guillou is an AI consultant in Brazil and France, Deep Learning and NLP researcher in the AI Lab (UnB), and professor of Artificial Intelligence (UnB). Please contact him via his LinkedIn profile.
