[NLP] Using SentencePiece without Pretokenization

You can actually get away with not segmenting words!?

Ceshine Lee
Veritable
3 min read · Nov 18, 2018


I introduced four ways to tokenize Chinese documents in a previous post:

The third and fourth ways utilize SentencePiece from Google. I mentioned that SentencePiece is efficient enough that we don’t really have to do word segmentation (pre-tokenization) beforehand.

I haven’t tried using SentencePiece without pre-tokenization yet. Will try to do that in the next few development cycles.

I finally found time to do some experiments, and this short post presents the results.

Can We Skip Word Segmentation?

Table 1: Results from a Regression Problem

From my experiments, it does seem so (i.e., we can skip word segmentation). In fact, the model that did not segment words performed better than the one that did.

(The dataset is a movie review dataset I collected, as described in a previous post, “[Preview] Developing Modern Chinese NLP Models”. This time I expanded it to about three times its original size and transformed the target variable a bit. The task is to predict the user rating from a comment using a regression model.)

An Example

This is a sentence taken from a real review:

If we do word segmentation before applying SentencePiece, the result would be something like:

Then the SentencePiece model would tokenize the above result as:

The word segmentation tool did most of the work. The only thing SentencePiece did differently was to split 看点 into ‘▁看’ and ‘点’.
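To make this pipeline concrete, here is a rough sketch of the “segment first, then SentencePiece” setup. The choice of segmenter (jieba), the model file name, and the example sentence are all stand-ins for illustration, not the exact ones used in my experiments.

```python
import jieba                    # stand-in word segmenter; any segmenter works
import sentencepiece as spm

# "review_segmented.model" is a placeholder for a SentencePiece model that
# was trained on pre-segmented review text.
sp = spm.SentencePieceProcessor()
sp.Load("review_segmented.model")

sentence = "这部电影的看点很多"                 # stand-in review sentence
segmented = " ".join(jieba.cut(sentence))      # e.g. "这部 电影 的 看点 很多"
                                               # (actual output depends on jieba)

# SentencePiece mostly keeps the segmenter's word boundaries, apart from
# splitting rarer words such as 看点 into smaller pieces.
print(sp.EncodeAsPieces(segmented))
```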

Now, if we apply another SentencePiece model, one trained without any prior word segmentation, to the original sentence, the result becomes:

We see that this SentencePiece model learned to segment the words by itself with surprisingly good accuracy. This explains why the final MSE scores in Table 1 are very close.
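For comparison, the no-pre-tokenization pipeline simply trains on and encodes the raw text. A minimal sketch, with placeholder file names and an example vocabulary size:

```python
import sentencepiece as spm

# Train directly on raw, unsegmented review text ("reviews_raw.txt" is a
# placeholder file name; the vocabulary size is also just an example).
spm.SentencePieceTrainer.Train(
    "--input=reviews_raw.txt --model_prefix=review_raw "
    "--vocab_size=32000 --character_coverage=0.9995"
)

sp = spm.SentencePieceProcessor()
sp.Load("review_raw.model")

# Feed the original, unsegmented sentence straight to the model; it places
# most of the word boundaries on its own.
print(sp.EncodeAsPieces("这部电影的看点很多"))   # same stand-in sentence as above
```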

Caveat: Vocabulary Size

Table 2: Results from Language Model Pre-training

The model with word segmentation actually got lower (better) perplexities. However, the perplexities of these two models are not directly comparable because of the differences in their vocabularies.

Imagine we have a language with only four characters, and we have this sentence:

Suppose that we can segment the sentence into:

If we feed the above segmented sentence to train a SentencePiece model, the model could create two unique tokens, ▁d and ▁c, that would never appear if we fed the raw sentence to it instead.

By this logic, we might need to increase the vocabulary size when we apply word segmentation before SentencePiece.
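Since the toy sentence itself isn’t reproduced above, here is a hypothetical stand-in that illustrates the same point: SentencePiece encodes a preceding space as the ‘▁’ marker attached to the following piece, so pre-segmented text produces marker-prefixed candidates (▁c, ▁d) that the raw text can never yield.

```python
# Toy illustration (a stand-in, not the post's exact example): SentencePiece
# represents a preceding space as the "▁" marker attached to the next piece,
# so segmenting the text first creates candidate pieces that the raw text
# can never produce.
raw = "abcdcd"                 # hypothetical sentence in a 4-character language
segmented = "ab c d c d"       # hypothetical word segmentation of the same text


def candidate_pieces(text, max_len=2):
    """Collect substrings the way SentencePiece sees them: prepend "▁" and
    replace each space with a "▁" attached to the following character."""
    marked = "▁" + text.replace(" ", "▁")
    pieces = set()
    for i in range(len(marked)):
        for n in range(1, max_len + 1):
            piece = marked[i:i + n]
            # A valid piece never ends with a bare space marker.
            if piece and not piece.endswith("▁"):
                pieces.add(piece)
    return pieces


extra = candidate_pieces(segmented) - candidate_pieces(raw)
print(sorted(extra))   # ['▁c', '▁d'] -- these only exist after pre-segmentation
```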

Bonus

The SentencePiece project has also published some results from their experiments regarding pre-tokenization here:

As you can see, pre-tokenization can still be helpful in some cases, so we shouldn’t dismiss word segmentation entirely when tackling new problems.

Source Code

The code used to do the experiments:

Check the scripts under scripts/sentiment_analysis and scripts/language_model for pointers on where to look.

Sorry for the lack of documentation in that repo. The codebase is going through a major overhaul right now. Will try to turn it into a better-documented and more modular one.

Fin

Thanks for reading! Please consider giving this post some claps if you find it helpful.
