How to train SentencePiece tokenizers with Common Crawl (multilingual)
Introducing a set of Common Crawl pre-trained SentencePiece tokenizers for Japanese and English, and a codebase to train more for almost any language.
I wanted to train a Japanese-English translation model (in both directions) using SentencePiece tokenizers, but it was hard to find tokenizers pre-trained on large data (10M+ sentences). I discovered that Hugging Face datasets provides an easy way to use the Common Crawl dataset, so I decided to use those datasets and train a few models to share.
The code repository contains everything you need for setting up the environment and for training any of the 100 languages in the CC-100 corpus. Note that training with large data requires large amounts of RAM (512 GB+), so it's not something you should expect to do on your laptop. The code repository contains pre-trained models for both Japanese and English, with vocabulary sizes 8000, 16000, 32000 and 48000. Each model was trained on roughly 75M sentences.
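To give a sense of how the corpus can be prepared without exhausting memory, here is a minimal sketch. The helper function `dump_sentences` is hypothetical (not from the repository), and the commented-out `load_dataset("cc100", ...)` call assumes the CC-100 dataset name and its `"text"` field as exposed on the Hugging Face hub; streaming mode is what keeps the full split out of RAM.

```python
def dump_sentences(examples, path, limit):
    """Write up to `limit` non-empty text lines to `path`, one per line,
    in the plain-text format SentencePiece expects for training."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            line = ex["text"].strip()
            if line:
                f.write(line + "\n")
                written += 1
            if written >= limit:
                break
    return written

# Hypothetical usage with Hugging Face datasets in streaming mode,
# so the ~75M sentences are never held in memory at once:
# from datasets import load_dataset
# ds = load_dataset("cc100", lang="en", split="train", streaming=True)
# dump_sentences(ds, "cc100_en.txt", 75_000_000)
```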
The training code
A Jupyter notebook in the code repository contains the same pieces of code introduced here, but I will add a few more comments in this article. The functions I refer to in this article can be found here.