We want to introduce our new text tokenization tool: YouTokenToMe. It works 7 to 10 times faster than other popular tools for alphabetic languages and 40 to 50 times faster for logographic languages. Here we’ll tell you about YouTokenToMe and share this open-source tool with you on GitHub. You can find a link to it at the end of the article.
Today, the majority of tasks handled by neural networks have to do with text processing. However, neural networks work with numbers, so the text needs to be pre-processed before it can be fed to the model.
The following methods are popular ways of doing this:
- Splitting by space
- Rule-based tokenization, as in spaCy and NLTK
- Lemmatization, stemming
Each of these methods has drawbacks:
- The vocabulary size, which directly determines the size of the model’s embedding layer, can’t be controlled.
- Relationships between words that differ only in a prefix or suffix (example: polite vs. impolite) are ignored.
- They are language-dependent.
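The first two drawbacks can be made concrete with a small sketch. The subword segmentation below is hand-picked for illustration, not produced by any real tokenizer: word-level splitting gives every surface form its own unrelated vocabulary entry, while subwords let related forms share pieces.

```python
corpus = "polite replies impolite replies politely impolitely".split()

# Word-level vocabulary: every surface form is a distinct, unrelated ID.
word_vocab = sorted(set(corpus))
print(word_vocab)  # 5 entries; 'polite' and 'impolite' share nothing

# A subword segmentation (hand-picked here for illustration) reuses pieces:
subword = {
    "polite": ["polite"],
    "impolite": ["im", "polite"],
    "politely": ["polite", "ly"],
    "impolitely": ["im", "polite", "ly"],
    "replies": ["replies"],
}
subword_vocab = sorted({piece for parts in subword.values() for piece in parts})
print(subword_vocab)  # 4 entries: ['im', 'ly', 'polite', 'replies']
```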
The Byte Pair Encoding (BPE) algorithm has become popular in recent years. It was originally invented for text compression, but several years ago it was adopted for text tokenization in machine translation. It is now used for many purposes, including in models such as BERT and GPT-2.
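The core idea of BPE training can be sketched in a few lines. This is a toy version for illustration only, not an optimized implementation: start from individual characters and repeatedly merge the most frequent adjacent pair of symbols.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer. `words` maps each word to its corpus frequency.
    Every word starts as a tuple of characters; on each step the most
    frequent adjacent symbol pair is merged into a single new symbol."""
    vocab = {tuple(w): freq for w, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

print(bpe_merges({"polite": 4, "impolite": 3}, 3))
# → [('p', 'o'), ('po', 'l'), ('pol', 'i')]
```

Note how the shared root of "polite" and "impolite" is built up from common merges, which is exactly the word-relationship information that whitespace splitting discards.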
Until now, the most efficient implementations of the BPE algorithm have been SentencePiece, developed by engineers at Google, and fastBPE, created by a researcher at Facebook AI Research. However, we managed to show that tokenization can be done much faster. We optimized the BPE algorithm, published the code on GitHub, and uploaded the package to PyPI.
A comparison of the speed of our algorithm with other implementations is displayed below. As an example, we took the first 100 MB of Wikipedia dumps in Russian, English, Japanese, and Chinese.
The graphs show that the speed depends heavily on the language. This is explained by the larger character inventories of Asian languages and by the fact that their words aren’t separated by spaces. YouTokenToMe trains 7 to 10 times faster for alphabetic languages and 40 to 50 times faster for logographic languages. Tokenization was sped up by at least 2 times, and in some tests by more than 10 times.
The new algorithm achieves these results thanks to two key features:
- Running time that is linear in the size of the training corpus. SentencePiece and fastBPE have worse asymptotic complexity.
- Efficient use of multiple threads for both training and tokenization. This increases the speed by several times.
YouTokenToMe can be used through a command-line interface and directly from Python.
You can find more information in the repository: github.com/vkcom/YouTokenToMe