[NLP] Four Ways to Tokenize Chinese Documents

With some empirical evidence in support of sub-word segmentation techniques

Ceshine Lee
Sep 27, 2018 · 7 min read

tl;dr — If you’re starting a new Chinese NLP project, use SentencePiece to do the tokenization (you can choose to do word segmentation before that). Use the unigram language model algorithm unless you’re really short on computing resources, especially at prediction time (in that case, use the byte pair encoding algorithm).

This is a follow-up of a previous post about the ongoing Chinese NLP project. It’s sort of a monthly update of the project progress.

Again, the tokenization techniques introduced here can also be applied to other languages that do not use spaces to separate words. However, the experimental results on this Chinese dataset do not necessarily extend to other languages, so you might want to run your own experiments. We’ll probably provide an easier interface for training and comparing the different techniques in this project later.

Let’s cut to the chase and show you the experiment results first. Don’t worry if you feel confused by this. You can skip this section and come back later.

(20181118 Correction: The metric used in the final two columns was actually MSE instead of RMSE.)

The results came from sentiment analysis models trained with a quite noisy dataset (introduced in the previous post). This is only one dataset, so please take it with a grain of salt.

Character-level and word-level models clearly underperform compared to the sub-word models (BPE and unigram). The unigram language model technique is only slightly better than the byte pair encoding (BPE) technique (the difference between the two can probably be explained by noise).

Sub-word segmentation was done deterministically, so we did not utilize one of the unigram technique’s strengths: sub-word regularization.

The small dataset is sampled from the full dataset, with 15,000 training examples for each class. A validation subset with 1/3 size of the training subset is used. The small and full dataset shared the same test subset.

(Comparing logloss/perplexities from language models won’t make sense because different tokenization techniques yield different vocabularies, and hence different target sizes.)

The notebooks are inside the notebooks folder of this project (tag 0.0.3):

The current state of the project contains a lot of redundancy, and a large portion of the code is buried inside notebooks. It’s really embarrassing. We’ll be doing some refactoring in the next development cycle.

Now let’s begin to discuss these four ways of tokenization:

Treat each (in our case, Unicode) character as one individual token. This is the technique used in the previous post.
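In Python this is trivial, since a Unicode string is already a sequence of characters:

```python
# Character-level tokenization: every Unicode character is its own token.
sentence = "我爱自然语言处理"
tokens = list(sentence)
print(tokens)  # ['我', '爱', '自', '然', '语', '言', '处', '理']
```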

Pros: This one requires the least preprocessing. And it reflects the fact that each Chinese character has its own meaning.

Cons: The model needs to take the relative positions of the characters into account. Bag-of-words/characters features won’t have much predictive power. The model needs to be able to learn longer-term dependencies than all other methods, and be more complex in order to learn the words.

Do word segmentation beforehand, and treat each word as a token. Because it works naturally with bag-of-words models, AFAIK it is the most used method in Chinese NLP projects.

We used the THULAC analyzer to segment the entire corpus before tokenization. A slower but more popular tool is jieba.
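Both tools use statistical models under the hood. As a rough sketch of what word segmentation does (not how THULAC or jieba actually work), here is a toy forward-maximum-matching segmenter over a made-up mini dictionary:

```python
# Toy forward maximum matching: repeatedly take the longest dictionary
# word that prefixes the remaining text; fall back to single characters.
# The dictionary is a made-up example, not from THULAC or jieba.
DICT = {"自然", "语言", "自然语言", "处理", "我们"}
MAX_LEN = max(len(w) for w in DICT)

def segment(text):
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in DICT:
                words.append(text[i:i + n])
                i += n
                break
    return words

print(segment("我们做自然语言处理"))  # ['我们', '做', '自然语言', '处理']
```

Real analyzers replace the greedy longest-match rule with HMM/CRF-style models trained on annotated corpora, which is exactly where the accuracy caveats below come from.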

Pros: It’s how humans read Chinese sentences. And after segmentation, it is basically the same as English. Most models that work for English corpora can now be applied to this Chinese corpus.

Cons: The correctness of segmentation cannot be guaranteed. Even though many segmentation tools now claim 90%+ accuracy, the corpora they are evaluated on tend to be formal writing. Not many training and evaluation datasets of the much noisier casual conversation and writing are publicly available. Also, this method cannot handle unseen words and performs poorly on rare words.

Bojanowski et al. (2016) introduced subword units into word embeddings [1]. It represents a word with several character n-grams, which can overlap in the original word. For example, the word <where> can be represented by these n-grams: <wh, whe, her, ere, re>.
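The n-grams in that example, including the < and > boundary markers from [1], can be extracted with a simple sliding window; a minimal sketch:

```python
def char_ngrams(word, n=3):
    # Wrap the word in boundary markers, then slide a window of size n.
    w = "<" + word + ">"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

(fastText itself uses all n-gram lengths from 3 to 6 plus the whole word; a single n is shown here for clarity.)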

What Sennrich et al. (2016) proposed is something simpler that requires no modification of the model structure [2]. It adapts Byte Pair Encoding (BPE; Gage, 1994), a data compression technique [3], for word segmentation. It first splits the whole sentence into individual characters, then iteratively merges the most frequent pair of adjacent characters or character sequences into larger character sequences.

It’s a blend of character and word level encoding. Most common words will be represented as a single token, and others will be represented as a combination of several non-overlapping tokens. This will help the model handle rare and unseen words.

In the example above, the word <lowest> will be segmented as <low e s t ·>, where “·” is the special end-of-word symbol.
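A minimal sketch of that learning loop, following the pseudo-code in [2] (a production implementation would match whole symbols rather than using naive string replacement, and would handle far larger vocabularies):

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Words pre-split into characters plus the end-of-word symbol "·";
# the frequencies are made up for illustration.
vocab = {"l o w ·": 5, "l o w e r ·": 2, "l o w e s t ·": 3}
for _ in range(3):  # learn 3 merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)

# After the merges l+o, lo+w, and low+·, the word "lowest" is
# segmented as <low e s t ·>, matching the example above.
print(vocab)
```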

To put it into practice, SentencePiece from Google (not an official Google product) provides high-performance BPE segmentation and has a nice Python module:

You can check the README of SentencePiece as well as this script from our project to learn how to fit the BPE model and perform segmentation:

Because the BPE algorithm is very simple, some projects chose to implement it themselves, e.g. the OpenAI generative pre-training model [4], which is covered in this post:

BPE is based on a greedy and deterministic symbol replacement, which can not provide multiple segmentations with probabilities. [5]

Kudo (2018) proposed a new subword segmentation algorithm based on a unigram language model [5]. It can be seen as a probabilistic mixture of characters, subwords, and word segmentation.

One important assumption is that each subword occurs independently, which is highly unlikely because the occurrences of some subwords should be highly correlated. But this assumption allows us to formulate the probability of a subword sequence simply as the product of the subword occurrence probabilities:
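In the notation of [5], where a subword sequence is x = (x1, …, xM):

```latex
P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i)
```

Here p(xi) is the occurrence probability of subword xi, normalized so that the probabilities over the whole subword vocabulary sum to one.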

And because the true segmentation of each sentence is unobserved, those subword occurrence probabilities are estimated with the EM algorithm, which maximizes the following marginal likelihood:
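In the notation of [5], with D the training corpus, X(s) its s-th sentence, and S(X) the set of segmentation candidates for sentence X:

```latex
\mathcal{L} = \sum_{s=1}^{|D|} \log\Bigl( \sum_{\mathbf{x} \in S(X^{(s)})} P(\mathbf{x}) \Bigr)
```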

If we read it from right to left, we’ll see that it first calculates, for each sentence, the sum of the probabilities of all its segmentation candidates, takes the log, and then sums the results over all sentences.

(Honestly I haven’t fully understood this model yet. It seems that it’s not learning the actual subword occurrence probabilities, since it can be estimated just by counting. I couldn’t find an intuitive answer to what is actually learned by maximizing the marginal likelihood. Maybe it is related to my suspicion that segmentation candidates are being truncated or sampled since they grow exponentially with respect to the sentence length.)

Now we can do subword regularization by picking the l-best segmentations according to the conditional probability P(x|X) and sampling from them at training time (and optionally, at testing time).

Even when used deterministically, the unigram algorithm produces segmentations that are empirically better than those from BPE (thanks to Jeremy Howard for the tip):

The “it” in the above tweet was referring to SentencePiece, which uses the unigram algorithm by default.

In the experiments, word segmentation was performed before running the BPE or unigram algorithm. In doing so, the encoding becomes the same as in English.

But it doesn’t have to be. SentencePiece is efficient enough, so we can feed it the raw sentence and let it figure out how to segment without the intervention of a third-party word analyzer:

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but makes the preprocessing complicated as we have to run language dependent tokenizers in advance. The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese, Japanese and Korean where no explicit spaces exist between words.

This can be helpful both in terms of pipeline complexity and internal noise. As we mentioned, word analyzers can be problematic, especially with corpora that differ from the training dataset of the analyzer. Additionally, subword regularization could be more effective if we allow it to include the uncertainty in word segmentation.

I haven’t tried using SentencePiece without pre-tokenization yet. I will try to do that in the next few development cycles.

Thanks for reading. The project did not progress fast enough, partly due to me being on the road for half of September. When organizing the results of the experiments, I thought a blog post introducing these approaches could be helpful to people. Please feel free to leave constructive criticism or give this post some claps if you feel it helped you. Thanks again!

  1. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching Word Vectors with Subword Information. [link]
  2. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. [link]
  3. Gage, P. (1994). A New Algorithm for Data Compression. C Users Journal, 12(2), 23–38.
  4. Radford, A., & Salimans, T. (2018). Improving Language Understanding by Generative Pre-Training. [link]
  5. Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. [link]


Towards human-centered AI. https://veritable.pw
