How to train sentencepiece tokenizers with common crawl (multilanguage)

Aki Kutvonen
5 min read · Oct 21, 2021

Introducing a set of Common Crawl pre-trained sentencepiece tokenizers for Japanese and English, and a codebase to train more for almost any language.

Photo by Glen Carrie on Unsplash

I wanted to train a Japanese-English-Japanese translation model using sentencepiece tokenizers, but it was hard to find tokenizers pre-trained on large data (10M+ sentences). I discovered that Huggingface datasets provides an easy way to use the Common Crawl dataset, so I decided to use those datasets and train a few models to share.
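As a rough illustration (not the repository's exact code), pulling the Japanese portion of the CC-100 corpus through Huggingface datasets and dumping it to a plain text file could look like this; the output file name and line cap are just placeholders:

```python
# A minimal sketch, assuming the "cc100" dataset on the Huggingface Hub:
# stream the Japanese split and write raw text, one line per example,
# into a file that sentencepiece can later read line by line.
from datasets import load_dataset

# streaming=True avoids materializing the whole corpus on disk at once
dataset = load_dataset("cc100", lang="ja", split="train", streaming=True)

with open("cc100_ja.txt", "w", encoding="utf-8") as f:
    for i, example in enumerate(dataset):
        f.write(example["text"].strip() + "\n")
        if i >= 1_000_000:  # illustrative cap; the real models used ~75M sentences
            break
```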

The code repository contains everything you need to set up the environment and to train a tokenizer for any of the 100 languages in the Common Crawl CC-100 corpus. Note that training with large data requires a large amount of RAM (512 GB+), so it’s not something you should expect to do on your laptop. The repository also contains pre-trained models for both Japanese and English, with vocabulary sizes of 8000, 16000, 32000 and 48000. The number of sentences used for each model is around 75M.
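For reference, the core training call in the sentencepiece Python API looks roughly like this. This is a sketch, not the repository's exact configuration: the input file and model prefix are illustrative, and only vocab_size corresponds to the sizes mentioned above:

```python
import sentencepiece as spm

# Sketch of training a sentencepiece model on the dumped raw text file.
spm.SentencePieceTrainer.train(
    input="cc100_ja.txt",               # one sentence per line
    model_prefix="ja_cc100_32000",      # illustrative output name
    vocab_size=32000,                   # one of 8000 / 16000 / 32000 / 48000
    character_coverage=0.9995,          # commonly used for Japanese/CJK text
    model_type="unigram",
    train_extremely_large_corpus=True,  # flag for very large inputs
)
```

The large RAM requirement comes from this step: the trainer holds the sentences and seed vocabulary in memory, which is why 10M+ sentence corpora are impractical on a laptop.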

The training code

A Jupyter notebook in the code repository contains the same pieces of code introduced here; in this article I add a few more comments. The functions I refer to in this article can be found here.
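To give a feel for the end result, loading a trained model and tokenizing a sentence with the sentencepiece Python API looks like this (the model file name is illustrative):

```python
import sentencepiece as spm

# Load a trained model (file name is illustrative) and tokenize a sentence.
sp = spm.SentencePieceProcessor(model_file="ja_cc100_32000.model")

pieces = sp.encode("吾輩は猫である。", out_type=str)
ids = sp.encode("吾輩は猫である。", out_type=int)
print(pieces)  # list of subword strings
print(ids)     # the corresponding vocabulary ids
```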
