[NLP] Using SentencePiece without Pretokenization

You can actually get away with not segmenting words!?

Ceshine Lee
Veritable
3 min read · Nov 18, 2018


I introduced four ways to tokenize Chinese documents in a previous post:

The third and fourth ways utilize SentencePiece from Google. I mentioned that SentencePiece is efficient enough that we don’t really have to do word segmentation (pre-tokenization) beforehand.

I haven’t tried using SentencePiece without pre-tokenization yet. Will try to do that in the next few development cycles.

I finally found time to do some experiments, and this short post presents the results.

Can We Skip Word Segmentation?

Table 1: Results from a Regression Problem

From my experiments, it does seem so (i.e., we can skip word segmentation). In fact, the model that did not segment words performed better than the one that did.

(The dataset is a movie review dataset I collected, as described in a previous post, “[Preview] Developing Modern Chinese NLP Models”. This time I expanded it to about three times its original size and transformed the target variable a bit. The task is to predict the user rating from a comment using a regression model.)

An Example

This is a sentence taken from a real review:

If we do word segmentation before applying SentencePiece, the result would be something like:

Then the SentencePiece model would tokenize the above result as:

The word segmentation tool did most of the work. The only thing SentencePiece did differently was to split 看点 into ‘▁看’ and ‘点’.
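To make this pipeline concrete, here is a rough sketch of the “segment first, then SentencePiece” setup. The choice of segmenter (jieba), the model file name, and the example sentence are all stand-ins for illustration, not the exact ones used in my experiments.

```python
import jieba                    # stand-in word segmenter; any segmenter works
import sentencepiece as spm

# "review_segmented.model" is a placeholder for a SentencePiece model that
# was trained on pre-segmented review text.
sp = spm.SentencePieceProcessor()
sp.Load("review_segmented.model")

sentence = "这部电影的看点很多"                 # stand-in review sentence
segmented = " ".join(jieba.cut(sentence))      # e.g. "这部 电影 的 看点 很多"
                                               # (actual output depends on jieba)

# SentencePiece mostly keeps the segmenter's word boundaries, apart from
# splitting rarer words such as 看点 into smaller pieces.
print(sp.EncodeAsPieces(segmented))
```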

Now, if we apply another SentencePiece model, one trained without any prior word segmentation, to the original sentence, the result becomes:

We see that this SentencePiece model learned to segment the words by itself with surprisingly good accuracy. This explains why the final MSE scores in Table 1 are very close.
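For comparison, the no-pre-tokenization pipeline simply trains on and encodes the raw text. A minimal sketch, with placeholder file names and an example vocabulary size:

```python
import sentencepiece as spm

# Train directly on raw, unsegmented review text ("reviews_raw.txt" is a
# placeholder file name; the vocabulary size is also just an example).
spm.SentencePieceTrainer.Train(
    "--input=reviews_raw.txt --model_prefix=review_raw "
    "--vocab_size=32000 --character_coverage=0.9995"
)

sp = spm.SentencePieceProcessor()
sp.Load("review_raw.model")

# Feed the original, unsegmented sentence straight to the model; it places
# most of the word boundaries on its own.
print(sp.EncodeAsPieces("这部电影的看点很多"))   # same stand-in sentence as above
```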

Caveat: Vocabulary Size

Table 2: Results from Language Model Pre-training

The model with word segmentation actually got lower (better) perplexities. However, the perplexities of these two models are not directly comparable because of the differences in their vocabularies.

Imagine we have a language with only four characters, and we have this sentence:

Suppose that we can segment the sentence into:

If we feed the above segmented sentence to train a SentencePiece model, the model could create two unique tokens, ▁d and ▁c, that would never appear if we fed the raw sentence to it instead.

By this logic, we might need to increase the vocabulary size when we apply word segmentation before SentencePiece.
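Since the toy sentence itself isn’t reproduced above, here is a hypothetical stand-in that illustrates the same point: SentencePiece encodes a preceding space as the ‘▁’ marker attached to the following piece, so pre-segmented text produces marker-prefixed candidates (▁c, ▁d) that the raw text can never yield.

```python
# Toy illustration (a stand-in, not the post's exact example): SentencePiece
# represents a preceding space as the "▁" marker attached to the next piece,
# so segmenting the text first creates candidate pieces that the raw text
# can never produce.
raw = "abcdcd"                 # hypothetical sentence in a 4-character language
segmented = "ab c d c d"       # hypothetical word segmentation of the same text


def candidate_pieces(text, max_len=2):
    """Collect substrings the way SentencePiece sees them: prepend "▁" and
    replace each space with a "▁" attached to the following character."""
    marked = "▁" + text.replace(" ", "▁")
    pieces = set()
    for i in range(len(marked)):
        for n in range(1, max_len + 1):
            piece = marked[i:i + n]
            # A valid piece never ends with a bare space marker.
            if piece and not piece.endswith("▁"):
                pieces.add(piece)
    return pieces


extra = candidate_pieces(segmented) - candidate_pieces(raw)
print(sorted(extra))   # ['▁c', '▁d'] -- these only exist after pre-segmentation
```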

Bonus

The SentencePiece project has also published some results from their experiments regarding pre-tokenization here:

As you can see, pre-tokenization can still be helpful in some cases, so we shouldn’t dismiss word segmentation entirely when tackling new problems.

Source Code

The code used to do the experiments:

Check the scripts under scripts/sentiment_analysis and scripts/language_model for pointers on where to look.

Sorry for the lack of documentation in that repo. The codebase is going through a major overhaul right now. Will try to turn it into a better-documented and more modular one.

Fin

Thanks for reading! Please consider giving this post some claps if you find it helpful.
