Easy SentencePiece for Subword Tokenization in Python and TensorFlow
How we can easily train a SentencePiece subword tokenizer from scratch with Python and use it in TensorFlow 2.
Lately, I have been working on some interesting NLP projects with TensorFlow (stay tuned, I’ll be posting them soon! 😉) and wanted to take the opportunity to try out subword tokenization. I decided to go with SentencePiece [1] (specifically, its unigram algorithm) because of the many useful features it offers and the advantages it has over other currently available tokenization strategies. For any project, from a simple classifier to a neural machine translator, it should be your go-to choice nowadays (you’ll see why below)!
However, unlike simpler tokenizers that ship off-the-shelf with machine learning libraries, SentencePiece must be trained from scratch, and it’s not always obvious what the fastest and most efficient way to do so is. So, in this quick tutorial, I want to share how I did it: we will see how to train a tokenizer from scratch on a custom dataset with SentencePiece, and how to plug it seamlessly into any TensorFlow 2 project using tensorflow-text.
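As a preview, here is a minimal sketch of the whole workflow, which we will unpack step by step in the rest of the tutorial. The corpus file name (corpus.txt), the vocabulary size, and the model prefix are just illustrative placeholders for your own data and hyperparameters:

```python
import sentencepiece as spm
import tensorflow_text as tf_text

# Train a unigram SentencePiece model on a plain-text corpus
# (one sentence per line). "corpus.txt", the vocab size, and the
# model prefix below are placeholder values, not fixed choices.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    vocab_size=8000,
    model_type="unigram",
)

# Load the serialized model and wrap it in a TensorFlow-native tokenizer.
with open("tokenizer.model", "rb") as f:
    tokenizer = tf_text.SentencepieceTokenizer(model=f.read())

# Tokenize inside the TF graph: returns a RaggedTensor of token ids.
ids = tokenizer.tokenize(["SentencePiece makes subword tokenization easy."])
print(ids)
```

Because tf_text.SentencepieceTokenizer runs as a regular TensorFlow op, the tokenization step can live inside a tf.data pipeline or even inside the model itself, which is exactly what makes this combination so convenient.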