BERT Tokenizer: Tokenizer for English
This article presents an overview of the BERT Tokenizer, the tokenizer for English used in the BERT language processing model that we introduced in a previous blog post.
About BERT Tokenizer
The BERT Tokenizer is a tokenizer for English used in the BERT language processing model. It takes English text as input and converts it into a sequence of tokens that can be processed by the AI model.
The BERT Tokenizer uses WordPiece for subword segmentation. The vocabulary used by WordPiece is defined in a vocabulary file (vocab.txt).
Algorithm
BERT has two variants: UNCASED and CASED. With the UNCASED version, uppercase and lowercase letters are treated the same, with all text being converted to lowercase. Conversely, with the CASED version, letters of different casing are treated as distinct.
In the BERT Tokenizer, the input string is first split into words using Python's standard split() method, with spaces, line breaks, and tabs as delimiters. Words are additionally split on punctuation marks with the Unicode P property (such as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~).
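A minimal Python sketch of this pre-tokenization step could look like the following (the function name is hypothetical, and the actual implementation also handles control and CJK characters):

import unicodedata

def split_on_whitespace_and_punct(text):
    # split() uses spaces, line breaks, and tabs as delimiters
    words = text.split()
    output = []
    for word in words:
        current = ""
        for ch in word:
            # Characters with the Unicode P (punctuation) property
            # become tokens of their own
            if unicodedata.category(ch).startswith("P"):
                if current:
                    output.append(current)
                    current = ""
                output.append(ch)
            else:
                current += ch
        if current:
            output.append(current)
    return output

print(split_on_whitespace_and_punct("To be, or not to be"))
# ['To', 'be', ',', 'or', 'not', 'to', 'be']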
For the UNCASED variant, accents are also removed. After the text is converted to lowercase using lower(), the _run_strip_accents method performs Unicode NFD normalization to decompose characters with diacritics, then removes characters with the Unicode category "Mn" to produce text without accents. When Japanese is input, voicing marks (a.k.a. dakuten) are also removed, causing characters like "で" to be converted to "て".
Next, words are further split into subwords using WordPiece. Subword segmentation is performed using a greedy longest-match-first strategy based on the word list defined in the vocabulary.
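A minimal sketch of this greedy longest-match-first segmentation, assuming a small toy vocabulary for illustration (the real vocabulary file is far larger), could look like this:

def wordpiece(word, vocab, unk="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        # Find the longest substring starting at `start`
        # that is present in the vocabulary
        end = len(word)
        match = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                # Non-initial subwords carry the "##" continuation prefix
                sub = "##" + sub
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            # No match at all: the whole word becomes the unknown token
            return [unk]
        tokens.append(match)
        start = end
    return tokens

print(wordpiece("questioning", {"quest", "##ion", "##ing"}))
# ['quest', '##ion', '##ing']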
Examples
Let’s take the text “To be or not to be, that is the question” as input. It is first split by spaces and punctuation marks into: ['to', 'be', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question']. Next, it is further divided using WordPiece, resulting in token IDs [2000, 2022, 2030, 2025, 2000, 2022, 1010, 2008, 2003, 1996, 3160]. When decoded, this converts back to the original “to be or not to be, that is the question”.
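For reference, the same result can be reproduced with the HuggingFace transformers implementation of the BERT Tokenizer, assuming the bert-base-uncased vocabulary (which matches the token IDs above):

from transformers import BertTokenizer

# Load the uncased BERT vocabulary (assumed here)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "To be or not to be, that is the question"
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)                    # [2000, 2022, 2030, 2025, ...]
print(tokenizer.decode(ids))  # to be or not to be, that is the question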
Using the BERT Tokenizer from ailia Tokenizer
Our company offers ailia Tokenizer, which is available in C++, Flutter, Unity (C#), and Python, so it can be used on iOS and Android as well.
The BERT Tokenizer is available starting from ailia Tokenizer version 1.3.
Here is an example of implementing the BERT Tokenizer using ailia Tokenizer in C++.
#include <vector>
#include "ailia_tokenizer.h"

// Create a tokenizer instance of type BERT
AILIATokenizer *net;
ailiaTokenizerCreate(&net, AILIA_TOKENIZER_TYPE_BERT, AILIA_TOKENIZER_FLAG_NONE);
// Load the WordPiece vocabulary and the tokenizer configuration
ailiaTokenizerOpenVocabFile(net, "./test/gen/bert/tokenizer/vocab.txt");
ailiaTokenizerOpenTokenizerConfigFile(net, "./test/gen/bert/tokenizer/tokenizer_config.json");
// Encode the input text
ailiaTokenizerEncode(net, u8"To be or not to be, that is the question");
// Retrieve the resulting token IDs
unsigned int count;
ailiaTokenizerGetTokenCount(net, &count);
std::vector<int> tokens(count);
ailiaTokenizerGetTokens(net, &tokens[0], count);
// Release the tokenizer
ailiaTokenizerDestroy(net);
ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.
ax Inc. provides a wide range of services, from consulting and model creation to the development of AI-based applications and SDKs. Feel free to contact us with any inquiries.