Sentencepiece: A simple and language-independent subword tokenizer and detokenizer for neural text processing

Tokenization method for LLaMA, T5, XLNet

Sieun Park
CodeX
May 19, 2023


Overview

SentencePiece is both an improved tokenization algorithm and an implementation by researchers at Google. The terms below might seem overwhelming, but they are quite simple and important. We will review them step by step, so don't be afraid!

  • SentencePiece is a simple, efficient, and language-independent subword tokenizer and detokenizer designed for Neural Network-based text processing systems, offering lossless tokenization, customizable character normalization, self-contained models, and on-the-fly processing capabilities.
  • Its efficient implementation allows pre-tokenization-free (thus language-independent), on-the-fly tokenization. This 1) allows dynamic sampling and noise injection during training and 2) is a step toward developing more end-to-end systems without language-specific heuristics.

SentencePiece: “a simple, efficient, and language-independent subword tokenizer”

The benefits of SentencePiece include:

1. Lossless Tokenization

Detokenization denotes the process of converting token IDs back into text. SentencePiece implements lossless tokenization, preserving all the information required to reproduce the normalized text in the encoder’s output, i.e. detokenize(tokenize('text')) == 'text'.

Wait... were previous tokenization methods irreversible?

Raw text and tokenized sequences in previous methods such as WordPiece are often not reversibly convertible (i.e. detokenize(tokenize('text')) != 'text'). This is, in particular, because of ambiguity in the whitespace information. Subword-based tokenizers first split the text into word segments, and the whitespace information is discarded during this process. For example, the sequence of tokens [“New”, “York”, “.”] might be produced from “New York.”, “NewYork.”, or even “New York .”.

As there isn’t a one-to-one mapping between input text and token sequence, it is impossible to implement a detokenization algorithm that is loss-free. This might be especially problematic for text generation systems such as NMT and Language Modeling. What did the model mean when it said [“I”, “love”, “New”, “York”, “.”]?

*Actually, the example above of deciding between “New York” and “NewYork” is already resolved, since a special prefix “##” is added to tokens that continue a word, so “NewYork” would be split as [“New”, “##York”]. But there are still many corner cases and issues, which are mentioned in the paper or will be discussed later.

Detokenization also often requires language-specific heuristics because different languages have unique rules for word boundaries and punctuation. Joining the word segments with whitespace might seem like a reasonable rule, but there are no spaces between words in Chinese or Japanese. In such cases, a complicated word segmentation algorithm and joining heuristic have to be implemented for each language.
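To make the problem concrete, here is a tiny illustration (my own, not from the paper): once the tokens are produced, no fixed joining rule can tell where the original whitespace was.

tokens = ["I", "love", "New", "York", "."]

# A naive detokenizer has to guess where the original whitespace went.
print(" ".join(tokens))  # "I love New York ."  (extra space before the period)
print("".join(tokens))   # "IloveNewYork."      (no spaces at all)
# Neither rule recovers "I love New York." without language-specific heuristics.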

The authors propose a simple, language-agnostic lossless tokenization: treat the input text as a sequence of Unicode characters, including whitespace, and use a consistent encoding and decoding scheme that preserves all the information needed to reproduce the original text. In particular, SentencePiece first replaces the whitespace with a meta symbol “▁” (U+2581) and then applies a subword algorithm such as BPE or the unigram language model.

The whitespace can later be restored by simply doing something like: detok = ''.join(tokens).replace('▁', ' ').
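Here is a minimal sketch of the idea (my own illustration, not the actual SentencePiece implementation): mark whitespace with the meta symbol, split into pieces with any subword algorithm, and reverse the mapping to recover the exact input.

text = "Hello World."
marked = text.replace(" ", "\u2581")      # "Hello▁World."
pieces = ["Hello", "\u2581World", "."]    # e.g. what BPE/unigram might return for `marked`
detok = "".join(pieces).replace("\u2581", " ")
assert detok == text                      # lossless round trip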

2. Blazingly fast Subword Training and Segmentation

It employs speed-up techniques for both training and segmentation, allowing it to work with large amounts of raw data without pre-tokenization.

  • For BPE segmentation, it adopts an O(N log N) algorithm in which the merge candidates are managed by a binary heap (priority queue); a minimal sketch of this idea follows this list.
  • For the unigram language model, training and segmentation complexity is linear in the size of the input data.
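The following is a rough sketch of the heap-managed merge idea (my own simplification in Python, not the actual C++ implementation): adjacent symbol pairs are kept in a priority queue ordered by merge rank, stale entries are skipped lazily, and a linked list tracks the live symbols, so each merge costs O(log N).

import heapq

def bpe_segment(symbols, merge_rank):
    """Greedy BPE segmentation with merges managed by a binary heap.
    merge_rank maps a symbol pair (a, b) to its priority (lower = merge earlier)."""
    syms = list(symbols)                    # None marks a merged-away slot
    prev = list(range(-1, len(syms) - 1))   # doubly linked list over slots
    nxt = list(range(1, len(syms) + 1))
    heap = []

    def push(i):
        j = nxt[i]
        if j < len(syms) and syms[i] is not None and syms[j] is not None:
            pair = (syms[i], syms[j])
            if pair in merge_rank:
                heapq.heappush(heap, (merge_rank[pair], i, pair))

    for i in range(len(syms) - 1):
        push(i)

    while heap:
        rank, i, pair = heapq.heappop(heap)
        j = nxt[i]
        # Lazily skip entries invalidated by an earlier merge.
        if syms[i] is None or j >= len(syms) or syms[j] is None or (syms[i], syms[j]) != pair:
            continue
        syms[i], syms[j] = syms[i] + syms[j], None   # merge slot j into slot i
        nxt[i] = nxt[j]
        if nxt[j] < len(syms):
            prev[nxt[j]] = i
        if prev[i] >= 0:
            push(prev[i])   # new candidate pair with the left neighbour
        push(i)             # new candidate pair with the right neighbour

    return [s for s in syms if s is not None]

print(bpe_segment("lower", {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}))
# ['low', 'er']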

Speed results: Compared to the slow subword-nmt library, the segmentation speed of SentencePiece is around 21k sentences/sec for English and 74k sentences/sec for Japanese, making it fast enough for on-the-fly execution.

  • SentencePiece shows larger performance improvements when applied to raw Japanese data (without pre-tokenization), with segmentation speed about 380 times faster than subword-nmt.
  • Pre-tokenization was previously necessary for Japanese text to keep tokenization fast, most likely because naive BPE segmentation is O(N²) in the length of the input, and pre-splitting into words bounds that length. SentencePiece can efficiently handle raw data and help build a purely data-driven, language-independent system.

3. On-the-fly Processing

Previously, tokenization results were precomputed in an offline manner. On-the-fly processing means that text can be tokenized and detokenized dynamically during the training or inference of a Neural Machine Translation (NMT) model. SentencePiece provides C++, Python, and TensorFlow library APIs for on-the-fly processing, which has the following benefits.

  • Flexibility: It allows for dynamic sampling and noise injection during NMT training (e.g. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates), which can improve the accuracy and robustness of the models; a sampling example follows below.
  • Integration: On-the-fly processing makes it easy to integrate SentencePiece into existing NMT frameworks, as it provides C++, Python, and TensorFlow library APIs that can be directly used within the training and inference pipelines.
  • Reproducibility: Since the preprocessing is performed in real-time and is tightly integrated with the NMT model, it ensures better reproducibility of the experimental results, as the same preprocessing rules and parameters are consistently applied.
  • Efficiency: On-the-fly processing eliminates the need for separate preprocessing steps and intermediate storage of preprocessed data, reducing the overall complexity and storage requirements of the NMT pipeline.

Note that such on-the-fly processing is practical purely because of the extremely fast inference speed of SentencePiece compared to previous tokenizers.
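For example, the dynamic sampling mentioned above is exposed directly in the Python API. The snippet below assumes a model file like the en-sp.model trained in the Usage section later in this post.

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='./en-sp.model')

# Deterministic (best) segmentation.
print(sp.encode('New York is big.', out_type=str))

# Subword regularization / BPE-dropout: sample a different segmentation on each call.
# nbest_size=-1 samples from all candidates (unigram models); alpha is the sampling
# smoothing parameter for unigram models, or the dropout probability for BPE models.
for _ in range(3):
    print(sp.encode('New York is big.', out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))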

4. Self-contained Models

The SentencePiece model is designed to be purely self-contained, including not only the vocabulary and segmentation parameters but also the pre-compiled finite state transducer for character normalization.

  • Motivation: To ensure better reproducibility of experimental results and allow distribution of SentencePiece models as part of NMT models, all the rules and parameters must be self-contained in the model file.
  • Method: The SentencePiece model is stored as a binary wire format protocol buffer, a platform-neutral and extensible mechanism, (hopefully) ensuring safe serialization, backward compatibility, and extensibility.
  • The self-contained design of SentencePiece models should guarantee perfect reproducibility and allow developers to refine default normalization rules without worrying about breaking existing preprocessing behaviors.
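As a small illustration of this self-containedness (assuming a recent sentencepiece Python package where SentencePieceProcessor accepts a model_proto argument, and the en-sp.model file trained in the Usage section below), the entire model can be passed around as raw bytes, e.g. bundled inside an NMT checkpoint, and loaded without any external vocabulary or normalization files.

import sentencepiece as spm

# Read the self-contained model file as raw bytes.
with open('./en-sp.model', 'rb') as f:
    model_bytes = f.read()

# Load the processor directly from the serialized protocol buffer.
sp = spm.SentencePieceProcessor(model_proto=model_bytes)
print(sp.encode('Hello world.', out_type=str))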

5. Improved Performance

Experiments conducted in the paper show that SentencePiece consistently improves BLEU scores compared to the word model, especially when applied to non-segmented languages like Japanese.

In Japanese-to-English translation, the improvement is marginal, with no significant difference. In English-to-Japanese, SentencePiece significantly improves the BLEU score even without pre-tokenization; in fact, performance degrades when pre-tokenization is applied.

Other features:

  1. Vocabulary ID Management: SentencePiece manages the vocabulary-to-ID mapping, enabling direct conversion of text into an ID sequence and vice versa.
  2. Customizable Character Normalization: It supports custom normalization rules defined as a TSV file, allowing users to extend the default Unicode NFKC normalization rules for specific tasks.
  • Customizable character normalization is beneficial when the standard Unicode normalization forms don’t perfectly fit the specific needs of a task or application. For example, in some languages or scripts, certain characters or sequences of characters might be equivalent in a particular context, even though they’re not generally equivalent in Unicode. A small training example with custom rules follows this list.
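As a hedged sketch (the rules file and its single rule here are hypothetical; the TSV format follows the SentencePiece normalization documentation: space-separated source code points in hex, a tab, then the target code points, and in practice you would usually extend a copy of the default NFKC table rather than supply one rule), custom rules can be passed at training time:

import sentencepiece as spm

# Hypothetical single rule mapping 'A' (U+0041) to 'a' (U+0061),
# something plain NFKC would not do; real use would extend the full NFKC table.
with open('custom_rules.tsv', 'w', encoding='utf-8') as f:
    f.write('41\t61\n')

spm.SentencePieceTrainer.train(
    input='opus100-en.txt',              # same corpus file as in the Usage section
    model_prefix='en-sp-custom-norm',    # hypothetical output prefix
    model_type='bpe',
    vocab_size=10000,
    normalization_rule_tsv='custom_rules.tsv',
)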

To summarize the takeaways,

  1. SentencePiece is a tokenizer algorithm and implementation that is fast, lossless, self-contained, performant, language-independent, and allows on-the-fly processing.
  2. Language-specific pre-tokenization is not necessary when using SentencePiece: skipping it does not hurt accuracy, and segmentation remains fast enough without it.

Usage

Code: [Colab]

Direct usage of the Python binding

Python guide: https://github.com/google/sentencepiece/blob/master/python/README.md

To train the tokenizer, specify the raw corpus file containing one sentence per line, the model_type, and other model arguments. The tokenizer will be saved under the model_prefix prefix (here, en-sp.model and en-sp.vocab). The code for preparing the Japanese and English corpus data is provided in the Colab link, but it can essentially be any text file.

!pip install sentencepiece

import time
import sentencepiece as spm

t1 = time.time()
spm.SentencePieceTrainer.train(
    input='opus100-en.txt',   # raw corpus, one sentence per line
    model_prefix='en-sp',     # outputs en-sp.model and en-sp.vocab
    model_type="bpe",
    vocab_size=10000,
)
print("en-sentpiece google time:", time.time() - t1)

The trained SentencePiece tokenizer saved under model_prefix can be used similarly to Hugging Face tokenizers. Note that the detokenization process is lossless, unlike the later example based on WordPiece.

sp = spm.SentencePieceProcessor(model_file='./en-sp.model')

encoded = sp.encode(text_en)
print("len:", len(encoded), encoded)
print("------------------")
print(text_en)
print(sp.decode(encoded))

>>> len: 45 [285, 2419, 3167, 7344, 9532, 575, 9964, 6664, 9940, 6, 3684, 9948, 199, 6669, 110, 1269, 3506, 26, 1678, 8024, 67, 1304, 9922, 1678, 8024, 4885, 89, 1566, 6142, 9948, 9941, 2950, 2373, 2787, 25, 9940, 5054, 1566, 6142, 8391, 293, 4956, 9929, 233, 9932]
------------------
This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation.
This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation.

Custom Huggingface tokenizer using Metaspace pre-tokenizer

Metaspace splits on whitespaces and replaces them with a special char “▁” (U+2581).

from tokenizers.pre_tokenizers import Whitespace, Metaspace

print(Whitespace().pre_tokenize_str("Hello world !"))
>>> [('Hello', (0, 5)), ('world', (6, 11)), ('!', (12, 13))]
print(Metaspace().pre_tokenize_str("Hello world !"))
>>> [('▁Hello', (0, 5)), ('▁world', (5, 11)), ('▁!', (11, 13))]

Hugging Face lets you build and train tokenizers from scratch. Following the example below, we can build a SentencePiece-style tokenizer using the Metaspace pre-tokenizer and train it from scratch. You can find all the details in the links provided below.

import os

from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
    Regex,
)

tokenizer = Tokenizer(model=models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),  # collapse 2+ spaces into one
    ]
)
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.Metaspace(), pre_tokenizers.Digits(individual_digits=True)]
)
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [EOS]",
    special_tokens=[("[CLS]", 1), ("[EOS]", 2)],
)

special_tokens = ["[UNK]", "[PAD]", "[CLS]"]
trainer = trainers.WordPieceTrainer(vocab_size=10000, special_tokens=special_tokens)
tokenizer.train(files=["opus100-en.txt"], trainer=trainer)

os.makedirs("./tokenizers", exist_ok=True)
tokenizer.save("./tokenizers/en-sentpiece.json")

encoded = tokenizer.encode(text_en)
print("len:", len(encoded.tokens), encoded.tokens)
print(tokenizer.decode(encoded.ids))
print("------------------")
print(tokenizer.normalizer.normalize_str(text_en))
print(tokenizer.decode(encoded.ids).replace("##", "").replace(" ", "")[1:].replace('▁', ' '))
>>> len: 49 ['[CLS]', '▁This', '▁paper', '▁describe', '##s', '▁Sen', '##ten', '##ce', '##P', '##iec', '##e,', '▁a', '▁language', '##-in', '##de', '##pen', '##den', '##t', '▁sub', '##word', '▁to', '##ken', '##ize', '##r', '▁and', '▁det', '##oke', '##n', '##ize', '##r', '▁designed', '▁for', '▁Ne', '##ural', '##-b', '##ased', '▁text', '▁process', '##ing,', '▁including', '▁Ne', '##ural', '▁Mac', '##h', '##ine', '▁Trans', '##l', '##ation.', '[EOS]']
▁This ▁paper ▁describe ##s ▁Sen ##ten ##ce ##P ##iec ##e, ▁a ▁language ##-in ##de ##pen ##den ##t ▁sub ##word ▁to ##ken ##ize ##r ▁and ▁det ##oke ##n ##ize ##r ▁designed ▁for ▁Ne ##ural ##-b ##ased ▁text ▁process ##ing, ▁including ▁Ne ##ural ▁Mac ##h ##ine ▁Trans ##l ##ation.
------------------
This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation.
This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation.

Overview of the tokenization pipeline: https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt#building-a-wordpiece-tokenizer-from-scratch

List of components for building custom tokenization pipeline: https://huggingface.co/docs/tokenizers/components

Using existing sentencepiece-based tokenizers

Many popular models like XLNet, T5, and LLaMA use SentencePiece. To get the same results as these models, Hugging Face provides internal tokenizer implementations that work with SentencePiece, and these can simply be reused directly. Remember, you must have sentencepiece installed to use these tokenizers.

from transformers import XLNetTokenizer

xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

encoded = xlnet_tokenizer.encode(text_en)
print("len:", len(encoded))

print(text_en)
print(xlnet_tokenizer.decode(encoded))
